Tag: AI

  • Self-Instruct Framework, Explained


    Tsiu-zhen-tsin Dmitrii

    Or how to “eliminate” human annotators

    Image generated by DALL·E

    Motivation

    High-level overview of InstructGPT with human annotated outputs and ranking for supervised learning and reward model training | Source: Training language models to follow instructions with human feedback.

As Large Language Models (LLMs) revolutionize our lives, the growth of instruction-tuned LLMs faces significant challenges: the critical need for vast, varied, and high-quality datasets. Traditional methods, such as employing human annotators to generate datasets — a strategy used in InstructGPT (image above) — face high costs, limited diversity and creativity, and alignment challenges. To address these limitations, the Self-Instruct framework² was introduced. Its core idea is simple and powerful: let language models (LMs) generate training data, leading to more cost-effective, diverse, and creative datasets.

    Therefore, in this article, I would like to lead you through the framework step-by-step, demonstrating all the details so that after reading it, you will be able to reproduce the results yourself 🙂

❗ This article provides all steps from a code perspective, so please feel free to visit the original GitHub repository. ❗

    Self-Instruct Framework

    A high-level overview of the Self-Instruct framework

    The recipe is relatively straightforward:

    • Step 0 — Define Instruction Data:
   — Add a seed set of high-quality and diverse human-written tasks in different domains as tuples (instruction, instances) to the task pool;
    • Step 1 — Instruction Generation:
       — Sample 8 (6 human-written and 2 model-generated) instructions from the task pool;
       — Insert bootstrapped instructions into the prompt in a few-shot way and ask an LM to come up with more instructions;
   — Filter out generated instructions based on the ROUGE metric (a method to evaluate the similarity between text outputs and reference texts) and some heuristics (I will cover this later);
   — Repeat Step 1 until reaching a target number of instructions;
    • Step 2 — Classification Task Identification:
   — For every generated instruction in the task pool, we need to identify its type (classification or non-classification) in a few-shot manner;
    • Step 3 — Instance Generation:
       — Given the instructions and task types, generate instances (inputs and outputs) and filter them out based on heuristics;
    • Step 4 — Finetuning the LM to Follow Instructions:
      — Utilize generated tasks to finetune a pre-trained model.

Voila, that’s how Self-Instruct works, but the devil is in the details, so let’s dive into every step!

    Step 0 — Define Instruction Data

    Step 0

Let’s begin by understanding what is inside the initial “Seed of tasks”: it consists of 175 seed tasks (25 classification and 150 non-classification) across different domains, with one instruction and one instance per task. Each task has an id, name, instruction, instances (input and output), and an is_classification binary flag identifying whether the task has a limited output label space.
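For illustration, a single seed task might look roughly like this (the fields follow the description above; the concrete values are made up):

```python
# One illustrative seed task in the format described above (values are invented).
seed_task = {
    "id": "seed_task_0",
    "name": "sentiment_classification",
    "instruction": "Classify the sentiment of the given movie review as positive or negative.",
    "instances": [
        {
            "input": "The plot was predictable, but the acting kept me hooked until the end.",
            "output": "positive",
        }
    ],
    "is_classification": True,
}
```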

Here are some examples of classification and non-classification tasks with empty and non-empty input fields:

    Example of classification task with non-empty input
    Example of non-classification task with empty input

In the first example, we can see how the input field clarifies and provides context for the more general instruction, while in the second example, we don’t need an input field because the instruction is already self-contained. Also, the first example is a classification task: we can answer it by assigning labels from a limited label space, while we can’t do the same with the second example.

This step is crucial because it encourages task diversity through varied data formats and demonstrates correct ways of solving various tasks.

Once the instruction format is defined, we add these tasks to the task pool, which will store our final dataset.

    Step 1 — Instruction Generation

    Step 1

    Sampling and prompting

With the human-written seed set of tasks added to the task pool, we can start instruction generation. To do so, we need to sample 8 instructions from the task pool (6 human-written and 2 machine-generated) and encode them into the following prompt:

    Prompt to generate new instructions

However, at the beginning we do not have any machine-generated instructions, so we simply replace them with empty strings in the prompt.

    After generation, we extract instructions from the LM’s response (via regular expressions), filter them out, and add filtered instructions to the task pool:

    Pseudo-code of instruction generation step

We repeat the instruction generation step until we reach the target number of machine-generated instructions (specified at the start of the step).
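Since the pseudo-code above is shown as an image, here is a minimal Python sketch of the same loop under stated assumptions: query_lm, extract_instructions, and passes_filters are hypothetical helpers standing in for the LM call, the regex-based parsing, and the filtering rules covered next, and the prompt text is paraphrased.

```python
import random

def generate_instructions(task_pool, target_count):
    """Bootstrap new instructions until the pool contains `target_count` machine-generated ones."""
    machine_generated = [t for t in task_pool if not t["is_seed"]]
    while len(machine_generated) < target_count:
        # Sample 6 human-written and up to 2 machine-generated instructions for the few-shot prompt
        # (early on there are no machine-generated ones yet; the paper pads the prompt instead).
        human = random.sample([t for t in task_pool if t["is_seed"]], 6)
        machine = random.sample(machine_generated, min(2, len(machine_generated)))
        demos = human + machine
        prompt = "Come up with a series of tasks:\n" + "\n".join(   # paraphrased few-shot prompt
            f"Task {i + 1}: {t['instruction']}" for i, t in enumerate(demos)
        ) + f"\nTask {len(demos) + 1}:"
        response = query_lm(prompt)                                  # hypothetical LM call
        for instruction in extract_instructions(response):           # hypothetical regex-based parsing
            if passes_filters(instruction, [t["instruction"] for t in task_pool]):  # see next section
                new_task = {"instruction": instruction, "is_seed": False}
                task_pool.append(new_task)
                machine_generated.append(new_task)
    return task_pool
```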

    Filtering

To obtain a diverse dataset, we need some way to decide which instructions make it into the task pool. The easiest approach is a heuristically chosen set of rules, for instance (a minimal sketch of such a filter follows the list):

    • Filter out instructions that are too short or too long;
    • Filter based on keywords unsuitable for language models (image, graph, file, plot, …);
    • Filter those starting with punctuation;
    • Filter those starting with non-English characters;
    • Filter those whose ROUGE-L similarity with any existing instruction is higher than 0.7.
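A minimal sketch of such a filter is below; it assumes the rouge_score package, and the keyword blocklist and length bounds are illustrative choices rather than the exact values used in the paper.

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

# Illustrative keyword blocklist; the exact list is a design choice.
BLOCKLIST = ["image", "images", "graph", "graphs", "file", "files", "plot", "picture", "map", "draw"]

def passes_filters(instruction: str, existing_instructions: list[str]) -> bool:
    """Return True if a candidate instruction survives the heuristic filters described above."""
    if not instruction:
        return False
    words = instruction.split()
    if len(words) < 3 or len(words) > 150:            # too short / too long (bounds are illustrative)
        return False
    if any(kw in instruction.lower().split() for kw in BLOCKLIST):  # unsuitable for a text-only LM
        return False
    first = instruction[0]
    if not first.isascii() or not first.isalpha():    # starts with punctuation or a non-English character
        return False
    # Reject near-duplicates: ROUGE-L similarity above 0.7 with any existing instruction.
    for other in existing_instructions:
        if scorer.score(other, instruction)["rougeL"].fmeasure > 0.7:
            return False
    return True
```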

Step 2 — Classification Task Identification

    Step 2

The authors of Self-Instruct noticed that, depending on the instruction, language models can be biased towards one label, especially for classification tasks. Therefore, to mitigate this bias, we need to classify every instruction as classification or non-classification via few-shot prompting:

    Prompt used to classify whether a task instruction is a classification or non-classification task (12 classification and 19 non-classification instructions are used in this template)
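In code, this step can be as simple as prepending the few-shot demonstrations to each new instruction and reading a yes/no style answer back. The prompt below is a paraphrase of the idea, not the exact template, and query_lm is the same hypothetical LM call as before.

```python
# A made-up subset of few-shot demonstrations; the paper's template uses
# 12 classification and 19 non-classification instructions.
FEW_SHOT = [
    ("Given a tweet, decide whether it expresses a positive or negative sentiment.", "Yes"),
    ("Write a short story about a robot learning to paint.", "No"),
]

def is_classification_task(instruction: str) -> bool:
    """Ask the LM whether an instruction has a limited output label space."""
    header = "Can the following task be regarded as a classification task with finite output labels?\n\n"
    demos = "".join(f"Task: {ins}\nIs it classification? {label}\n\n" for ins, label in FEW_SHOT)
    prompt = header + demos + f"Task: {instruction}\nIs it classification?"
    return query_lm(prompt).strip().lower().startswith("yes")  # hypothetical LM call
```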

    Step 3 — Instance Generation

    Step 3

    After identifying the instruction type, we can finally generate input and output, considering that we have two types of instructions (classification or non-classification). How? Few-shot prompting!

For non-classification instructions, we ask the model to generate the input first and only then the output (Input-First Approach), but for classification tasks, we ask the model to generate the output (class label) first and then condition input generation on that output (Output-First Approach), which counteracts the label bias noted in Step 2. Unlike Step 0, we don’t restrict the number of generated instances per instruction.

    Prompt used for the Input-First Approach of instance generation
    Prompt used for the Output-First Approach of instance generation
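Putting the two strategies together, the routing logic can be sketched as follows; the template strings are paraphrased, and query_lm / parse_instances are hypothetical helpers.

```python
# Paraphrased templates; the real prompts contain several in-context examples.
INPUT_FIRST_TEMPLATE = (
    "Come up with examples for the following task. "
    "Generate the input first, then the corresponding output.\n\nTask: {instruction}\n"
)
OUTPUT_FIRST_TEMPLATE = (
    "Given the classification task below, first generate a possible class label, "
    "then generate an input that corresponds to that label.\n\nTask: {instruction}\n"
)

def generate_instances(instruction: str, is_classification: bool) -> list[dict]:
    """Route each instruction to the Input-First or Output-First prompt and parse the result."""
    template = OUTPUT_FIRST_TEMPLATE if is_classification else INPUT_FIRST_TEMPLATE
    response = query_lm(template.format(instruction=instruction))        # hypothetical LM call
    return parse_instances(response, output_first=is_classification)    # hypothetical parser
```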

After generation, we extract and format instances (again with regular expressions) and then filter them using rules such as:

    • The input and output are the same;
    • The instance is already in the task pool;
    • The output is empty;
    • The input or output ends with a colon (usually a sign of an incomplete generation);

And some other heuristics. In the end, we get the following example of a generated task with 1 instruction and 1 instance (a small sketch of these instance-level checks follows the example):

    Instance generation example
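Beyond the example above, a compact sketch of the instance-level checks might look like this, with the task pool represented as a simple set of (input, output) pairs:

```python
def keep_instance(inp: str, out: str, existing: set[tuple[str, str]]) -> bool:
    """Apply the instance-level heuristics listed above; return True if the instance is kept."""
    if inp == out:                                    # input identical to output
        return False
    if (inp, out) in existing:                        # already in the task pool
        return False
    if not out.strip():                               # empty output
        return False
    if inp.rstrip().endswith(":") or out.rstrip().endswith(":"):  # likely an incomplete generation
        return False
    return True
```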

That’s the main idea behind Self-Instruct!

Step 4 — Finetuning the LM to Follow Instructions

    After completing all previous steps, we can take a pre-trained LM and instruction-tune it on the generated dataset to achieve better metrics.
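As a final illustration, the generated tasks are typically flattened into plain prompt/completion pairs before supervised fine-tuning; the template below is one simple possibility, not the exact formatting used in the paper.

```python
import json

def to_training_example(task: dict) -> dict:
    """Flatten one generated task (instruction + a single instance) into a prompt/completion pair."""
    instance = task["instances"][0]
    if instance["input"]:
        prompt = f"Task: {task['instruction']}\nInput: {instance['input']}\nOutput:"
    else:
        prompt = f"Task: {task['instruction']}\nOutput:"
    return {"prompt": prompt, "completion": " " + instance["output"]}

def export_dataset(task_pool: list[dict], path: str = "self_instruct_train.jsonl") -> None:
    """Write the task pool to a JSONL file that common fine-tuning pipelines can consume."""
    with open(path, "w", encoding="utf-8") as f:
        for task in task_pool:
            f.write(json.dumps(to_training_example(task)) + "\n")
```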

    Overcoming challenges

At the beginning of the article, I covered some challenges that instruction-tuned LLMs face; let’s see how Self-Instruct helps overcome them.

    Quantity

    With the help of only 175 initial human-written tasks, 52K instructions and 82K instances were generated:

    Source: Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Diversity

To investigate how diverse the generated dataset is, the authors of Self-Instruct used the Berkeley Neural Parser to parse instructions and then extract the verb closest to the root and its first direct noun object. 26K out of 52K instructions have a verb-noun format, but the other 26K instructions have a more complex structure (e.g., “Classify whether this tweet contains political content or not.”) or are framed as questions (e.g., “Which of these statements are true?”).

    The top 20 most common root verbs (inner circle) and their top 4 direct noun objects (outer circle) in the generated instructions | Source: Self-Instruct: Aligning Language Models with Self-Generated Instructions

    Quality

To prove that Self-Instruct can generate high-quality tasks, the authors randomly selected 200 generated instructions, sampled 1 instance per instruction, and then assessed them, obtaining the following results:

    Source: Self-Instruct: Aligning Language Models with Self-Generated Instructions

As we can see, 92% of all tasks describe a valid task, and 54% have all valid fields (given that we generated 52K tasks, at least 26K will represent high-quality data, which is fantastic!)

    Costs

The Self-Instruct framework also introduces significant cost advantages. The initial phases of task generation (Steps 1-3) amount to a mere $600, while the final step of fine-tuning with the GPT-3 model costs $338. It’s vital to keep this in mind when we look at the results!

    Results

How much can Self-Instruct improve the ROUGE-L metric on the SuperNI (Super-Natural Instructions) dataset? To find out, we can compare the results of 1) off-the-shelf pre-trained LMs without any instruction fine-tuning (Vanilla LMs), 2) instruction-tuned models not trained on SuperNI (Instruction-tuned w/o SuperNI), and 3) instruction-tuned models trained on SuperNI (Instruction-tuned w/ SuperNI):

    Evaluation results on unseen tasks from SuperNI | Source: Self-Instruct: Aligning Language Models with Self-Generated Instructions

As we can see, using Self-Instruct demonstrates a 33% absolute improvement over the original model (1); at the same time, the framework also slightly improves metrics even after fine-tuning on the SuperNI dataset (3).

Moreover, if we create a new (i.e., unseen) dataset of 252 instructions with 1 instance per instruction and evaluate a selection of instruction-tuned variants, we can see the following results:

    Performance of GPT3 model and its instruction-tuned variants, evaluated by human experts on our 252 user-oriented instructions | Source: Self-Instruct: Aligning Language Models with Self-Generated Instructions

GPT3 + Self-Instruct shows impressive results compared to other instruction-tuned variants, but there is still room for improvement compared to the InstructGPT variants (LLMs previously made available by OpenAI).

    Enhancements

The idea behind Self-Instruct is straightforward yet compelling, so let’s look at how it can be used in different settings.

    Stanford Alpaca³

In 2023, the Alpaca LLM from Stanford gained colossal interest due to its affordability and accessibility: it was developed for less than $600 by combining LLaMA with the Self-Instruct approach.

    High-level overview of Alpaca | Source: Alpaca: A Strong, Replicable Instruction-Following Model

Alpaca’s version of Self-Instruct was slightly modified:

    • Step 1 (instruction generation): more aggressive batch decoding was applied, i.e., generating 20 instructions at once
    • Step 2 (classification task): this step was wholly excluded
    • Step 3 (instance generation): only one instance is generated per instruction

In the end, the researchers from Stanford achieved significant simplifications compared to the initial Self-Instruct setup and ran a blind pairwise comparison between text-davinci-003 (InstructGPT-003) and Alpaca 7B: Alpaca won 90 comparisons versus 89 for text-davinci-003.

    Self-Rewarding Language Models⁴

    Source: Self-Rewarding Language Models

In 2024, Self-Instruct is a practical framework used in more complex setups, such as Self-Rewarding Language Models by Meta. As in Self-Instruct, we start with a seed set of human-written tasks; we then generate new instructions {xᵢ} and prompt the model Mₜ to generate candidate outputs {yᵢ¹, …, yᵢᵏ} and, later, rewards {rᵢ¹, …, rᵢᵏ} for them. That is how the human annotators of InstructGPT can be “eliminated” through self-instruction. The last block of Self-Rewarding models is instruction-following training: at this step, we compose preference pairs and train the next-iteration model Mₜ₊₁ via DPO (Direct Preference Optimization). This procedure can be repeated iteratively to enrich the dataset and improve the initial pre-trained model.
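To make the data flow concrete, here is a hedged sketch of how DPO preference pairs could be assembled from the sampled outputs and self-assigned rewards; generate_outputs and score_with_llm_judge are hypothetical stand-ins for Mₜ acting as generator and as judge.

```python
def build_preference_pairs(instructions: list[str], k: int = 4) -> list[dict]:
    """For each instruction, sample k candidate outputs, score them with the model itself,
    and keep the best/worst pair as (chosen, rejected) for DPO training."""
    pairs = []
    for x in instructions:
        candidates = generate_outputs(x, num_samples=k)              # hypothetical: y_1..y_k from M_t
        rewards = [score_with_llm_judge(x, y) for y in candidates]   # hypothetical: r_1..r_k from M_t as judge
        best = candidates[rewards.index(max(rewards))]
        worst = candidates[rewards.index(min(rewards))]
        if best != worst:
            pairs.append({"prompt": x, "chosen": best, "rejected": worst})
    return pairs
```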

    Exploring Limitations

    Although Self-Instruct offers an innovative approach to autonomous dataset generation, its reliance on large pre-trained models introduces potential limitations.

    Data quality

Despite the impressive capability to generate synthetic data, the quality (54% of sampled tasks had all fields valid, as noted in the Overcoming Challenges section) remains a concern. This underscores a critical issue: the biases inherent in pre-trained models could be replicated, or even amplified, within the generated datasets.

    Tail phenomena

Instructions vary in frequency: some kinds of instructions are requested often, while others are rare, and a model-generated dataset will tend to reflect this imbalance. Nonetheless, it is crucial to handle these infrequent requests effectively, as they expose the brittleness of LLMs on uncommon and creative tasks.

    Conclusion

In conclusion, the Self-Instruct framework represents an advancement in developing instruction-tuned LMs, offering an innovative solution to the challenges of dataset generation. By enabling LLMs to autonomously produce diverse and high-quality data, it significantly reduces dependency on human annotators and drives down costs.

    Unless otherwise noted, all images are by the author, inspired by Self-Instruct 🙂

    References:

    [1] Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” Advances in Neural Information Processing Systems 35 (2022): 27730–27744

[2] Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N.A., Khashabi, D. and Hajishirzi, H., 2022. Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv preprint arXiv:2212.10560.

    [3] Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P. and Hashimoto, T.B., 2023. Stanford alpaca: An instruction-following llama model.

    [4] Yuan, W., Pang, R.Y., Cho, K., Sukhbaatar, S., Xu, J. and Weston, J., 2024. Self-rewarding language models. arXiv preprint arXiv:2401.10020.


    Self-Instruct Framework, Explained was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Self-Instruct Framework, Explained


  • Job Search 2.0-Turbo

    Hussein Jundi

A step-by-step guide on building a team of AI agents that automate and refine the search and selection process to match a job seeker’s skills

    Originally appeared here:
    Job Search 2.0-Turbo


  • Environmental Implications of the AI Boom

    Stephanie Kirmer

    The digital world can’t exist without the natural resources to run it. What are the costs of the tech we’re using to build and run AI?

    Photo by ANGELA BENITO on Unsplash

    There’s a core concept in machine learning that I often tell laypeople about to help clarify the philosophy behind what I do. That concept is the idea that the world changes around every machine learning model, often because of the model, so the world the model is trying to emulate and predict is always in the past, never the present or the future. The model is, in some ways, predicting the future — that’s how we often think of it — but in many other ways, the model is actually attempting to bring us back to the past.

    I like to talk about this because the philosophy around machine learning helps give us real perspective as machine learning practitioners as well as the users and subjects of machine learning. Regular readers will know I often say that “machine learning is us” — meaning, we produce the data, do the training, and consume and apply the output of models. Models are trying to follow our instructions, using raw materials we have provided to them, and we have immense, nearly complete control over how that happens and what the consequences will be.

Another aspect of this concept that I find useful is the reminder that models are not isolated in the digital world, but in fact are heavily intertwined with the analog, physical world. After all, if your model isn’t affecting the world around us, that raises the question of why your model exists in the first place. If we really get down to it, the digital world is only separate from the physical world in a limited, artificial sense, that of how we as users/developers interact with it.

    This last point is what I want to talk about today — how does the physical world shape and inform machine learning, and how does ML/AI in turn affect the physical world? In my last article, I promised that I would talk about how the limitations of resources in the physical world intersect with machine learning and AI, and that’s where we’re going.

    AI Needs the Physical World

    This is probably obvious if you think about it for a moment. There’s a joke that goes around about how we can defeat the sentient robot overlords by just turning them off, or unplugging the computers. But jokes aside, this has a real kernel of truth. Those of us who work in machine learning and AI, and computing generally, have complete dependence for our industry’s existence on natural resources, such as mined metals, electricity, and others. This has some commonalities with a piece I wrote last year about how human labor is required for machine learning to exist, but today we’re going to go a different direction and talk about two key areas that we ought to appreciate more as vital to our work — mining/manufacturing and energy, mainly in the form of electricity.

    If you go out looking for it, there is an abundance of research and journalism about both of these areas, not only in direct relation to AI, but relating to earlier technological booms such as cryptocurrency, which shares a great deal with AI in terms of its resource usage. I’m going to give a general discussion of each area, with citations for further reading so that you can explore the details and get to the source of the scholarship. It is hard, however, to find research that takes into account the last 18 months’ boom in AI, so I expect that some of this research is underestimating the impact of the new technologies in the generative AI space.

    Mining and Manufacturing

What goes into making a GPU chip? We know these chips are instrumental in the development of modern machine learning models, and Nvidia, the largest producer of these chips today, has ridden the crypto boom and AI craze to a place among the most valuable companies in existence. Their stock price went from $130 a share at the start of 2021 to $877.35 a share in April 2024 as I write this sentence, giving them a reported market capitalization of over $2 trillion. In Q3 of 2023, they sold over 500,000 chips, for over $10 billion. Estimates put their total 2023 sales of H100s at 1.5 million, and 2024 is easily expected to beat that figure.

    GPU chips involve a number of different specialty raw materials that are somewhat rare and hard to acquire, including tungsten, palladium, cobalt, and tantalum. Other elements might be easier to acquire but have significant health and safety risks, such as mercury and lead. Mining these elements and compounds has significant environmental impacts, including emissions and environmental damage to the areas where mining takes place. Even the best mining operations change the ecosystem in severe ways. This is in addition to the risk of what are called “Conflict Minerals”, or minerals that are mined in situations of human exploitation, child labor, or slavery. (Credit where it is due: Nvidia has been very vocal about avoiding use of such minerals, calling out the Democratic Republic of Congo in particular.)

    In addition, after the raw materials are mined, all of these materials have to be processed extremely carefully to produce the tiny, highly powerful chips that run complex computations. Workers have to take on significant health risks when working with heavy metals like lead and mercury, as we know from industrial history over the last 150+ years. Nvidia’s chips are made largely in factories in Taiwan run by a company called Taiwan Semiconductor Manufacturing Company, or TSMC. Because Nvidia doesn’t actually own or run factories, Nvidia is able to bypass criticism about manufacturing conditions or emissions, and data is difficult to come by. The power required to do this manufacturing is also not on Nvidia’s books. As an aside: TSMC has reached the maximum of their capacity and is working on increasing it. In parallel, NVIDIA is planning to begin working with Intel on manufacturing capacity in the coming year.

After a chip is produced, it can have a significant useful lifespan (3–5 years if maintained well); however, Nvidia is constantly producing new, more powerful, more efficient chips (2 million a year is a lot!), so a chip’s lifespan may be limited by obsolescence as well as wear and tear. When a chip is no longer useful, it goes into the pipeline of what is called “e-waste”. Theoretically, many of the rare metals in a chip ought to have some recycling value, but as you might expect, chip recycling is a very specialized and challenging technological task, and only about 20% of all e-waste gets recycled, including much less complex things like phones and other hardware. The recycling process also requires workers to disassemble equipment, again coming into contact with the heavy metals and other elements that are involved in manufacturing to begin with.

    If a chip is not recycled, on the other hand, it is likely dumped in a landfill or incinerated, leaching those heavy metals into the environment via water, air, or both. This happens in developing countries, and often directly affects areas where people reside.

    Most research on the carbon footprint of machine learning, and its general environmental impact, has been in relation to power consumption, however. So let’s take a look in that direction.

    Electricity

    Once we have the hardware necessary to do the work, the elephant in the room with AI is definitely electricity consumption. Training large language models consumes extraordinary amounts of electricity, but serving and deploying LLMs and other advanced machine learning models is also an electricity sinkhole.

In the case of training, one research paper suggests that training GPT-3, with 175 billion parameters, runs around 1,300 megawatt hours (MWh), or 1,300,000 kWh, of electricity. Contrast this with GPT-4, which reportedly uses 1.76 trillion parameters, and where the estimated power consumption of training was between 51,772,500 and 62,318,750 kWh of electricity. For context, an average American home uses just over 10,000 kWh per year. On the conservative end, then, training GPT-4 once could power over 5,000 American homes for a year. (This is not considering all the power consumed by preliminary analyses or tests that almost certainly were required to prepare the data and get ready to train.)

Given that the power usage between GPT-3 and GPT-4 training went up approximately 40x, we have to be concerned about the future electrical consumption involved in the next versions of these models, as well as the consumption for training models that generate video, image, or audio content.

Past the training process, which only needs to happen once in the life of a model, there’s the rapidly growing electricity consumption of inference tasks, namely the cost of every time you ask Chat-GPT a question or try to generate a funny image with an AI tool. This power is absorbed by data centers where the models are running so that they can serve results around the globe. The International Energy Agency predicted that data centers alone would consume 1,000 terawatt-hours (TWh) in 2026, roughly the power usage of Japan.

Major players in the AI industry are clearly aware of the fact that this kind of growth in electricity consumption is unsustainable. Estimates are that data centers consume between 0.5% and 2% of all global electricity usage, and potentially could be 25% of US electricity usage by 2030.

    Electrical infrastructure in the United States is not in good condition — we are trying to add more renewable power to our grid, of course, but we’re deservedly not known as a country that manages our public infrastructure well. Texas residents in particular know the fragility of our electrical systems, but across the US climate change in the form of increased extreme weather conditions causes power outages at a growing rate.

    Whether investments in electricity infrastructure have a chance of meeting the skyrocketing demand wrought by AI tools is still to be seen, and since government action is necessary to get there, it’s reasonable to be pessimistic.

In the meantime, even if we do manage to produce electricity at the necessary rates, until renewable and emission-free sources of electricity are scalable, we’re adding meaningfully to the carbon emissions output of the globe by using these AI tools. At a rough estimate of 0.86 pounds of carbon emissions per kWh of power, training GPT-4 put over 20,000 metric tons of carbon into the atmosphere. (In contrast, the average American emits 13 metric tons per year.)
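For transparency, the back-of-the-envelope arithmetic behind those two figures (homes powered and tons of carbon) can be checked in a few lines, using only the per-kWh emission factor and household consumption numbers quoted above.

```python
# Rough sanity check of the figures quoted above.
gpt4_training_kwh_low = 51_772_500     # lower estimate of GPT-4 training consumption (kWh)
us_home_kwh_per_year = 10_000          # approximate annual consumption of a US home (kWh)
lbs_co2_per_kwh = 0.86                 # rough emissions factor quoted above (lb CO2 / kWh)
lbs_per_metric_ton = 2_204.62

homes_powered = gpt4_training_kwh_low / us_home_kwh_per_year
metric_tons_co2 = gpt4_training_kwh_low * lbs_co2_per_kwh / lbs_per_metric_ton

print(f"Homes powered for a year: {homes_powered:,.0f}")  # ~5,200
print(f"Metric tons of CO2: {metric_tons_co2:,.0f}")      # ~20,200
```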

    Ok, So What?

    As you might expect, I’m not out here arguing that we should quit doing machine learning because the work consumes natural resources. I think that workers who make our lives possible deserve significant workplace safety precautions and compensation commensurate with the risk, and I think renewable sources of electricity should be a huge priority as we face down preventable, human caused climate change.

    But I talk about all this because knowing how much our work depends upon the physical world, natural resources, and the earth should make us humbler and make us appreciate what we have. When you conduct training or inference, or use Chat-GPT or Dall-E, you are not the endpoint of the process. Your actions have downstream consequences, and it’s important to recognize that and make informed decisions accordingly. You might be renting seconds or hours of use of someone else’s GPU, but that still uses power, and causes wear on that GPU that will eventually need to be disposed of. Part of being ethical world citizens is thinking about your choices and considering your effect on other people.

    In addition, if you are interested in finding out more about the carbon footprint of your own modeling efforts, there’s a tool for that: https://www.green-algorithms.org/

    Read more of my work at www.stephaniekirmer.com.


    Environmental Implications of the AI Boom was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Environmental Implications of the AI Boom


  • Starting ML Product Initiatives on the Right Foot

    Anna Via

    Top 3 lessons learned: the problem, the size, and the data

    Picture by Snapwire, on Pexels

This blog post is an updated version of part of a conference talk I gave at GOTO Amsterdam last year. The talk is also available to watch online.

    As a Machine Learning Product Manager, I am fascinated by the intersection of Machine Learning and Product Management, particularly when it comes to creating solutions that provide value and positive impact on the product, company, and users. However, managing to provide this value and positive impact is not an easy job. One of the main reasons for this complexity is the fact that, in Machine Learning initiatives developed for digital products, two sources of uncertainty intersect.

    From a Product Management perspective, the field is uncertain by definition. It is hard to know the impact a solution will have on the product, how users will react to it, and if it will improve product and business metrics or not… Having to work with this uncertainty is what makes Product Managers potentially different from other roles like Project Managers or Product Owners. Product strategy, product discovery, sizing of opportunities, prioritization, agile, and fast experimentation, are some strategies to overcome this uncertainty.

    The field of Machine Learning also has a strong link to uncertainty. I always like to say “With predictive models, the goal is to predict things you don’t know are predictable”. This translates into projects that are hard to scope and manage, not being able to commit beforehand to a quality deliverable (good model performance), and many initiatives staying forever as offline POCs. Defining well the problem to solve, initial data analysis and exploration, starting small, and being close to the product and business, are actions that can help tackle the ML uncertainty in projects.

    Mitigating this uncertainty risk from the beginning is key to developing initiatives that end up providing value to the product, company, and users. In this blog post, I’ll deep-dive into my top 3 lessons learned when starting ML Product initiatives to manage this uncertainty from the beginning. These learnings are mainly based on my experience, first as a Data Scientist and now as an ML Product Manager, and are helpful to improve the likelihood that an ML solution will reach production and achieve a positive impact. Get ready to explore:

    • Start with the problem, and define how predictions will be used from the beginning.
    • Start small, and maintain small if you can.
    • Data, data, and data: quality, volume, and historic.

    Start with the problem (and define how predictions will be used)

Start from the right problem. Picture by Steve Johnson, on Pexels

    I have to admit, I have learned this the hard way. I’ve been involved in projects where, once the model was developed and prediction performance was determined to be “good enough”, the model’s predictions weren’t really usable for any specific use case, or were not useful to help solve any problem.

    There are many reasons this can happen, but the ones I’ve found more frequently are:

    • Solution-driven initiatives: even before GenAI, Machine Learning and predictive models were already “cool” solutions, and because of that, some initiatives started from the ML solution: “let’s try to predict churn” (users or clients who abandon a company), “let’s try to predict user segments”… The current GenAI hype has worsened this trend, putting pressure on companies to integrate GenAI solutions “anywhere” they fit.
    • Lack of end-to-end design of the solution: in very few cases, the predictive model is a standalone solution. Usually, though, models and their predictions are integrated into a bigger system to solve a specific use case or enable a new functionality. If this end-to-end solution is not defined from the beginning, it can happen that the model, once already implemented, is found to be useless.

To start an ML initiative on the right foot, it is key to start with the right problem to solve. This is foundational in Product Management, and recurrently reinforced by product leaders like Marty Cagan and Melissa Perri. It includes product discovery (through user interviews, market research, data analysis…), and sizing and prioritization of opportunities (by taking into account quantitative and qualitative data).

    Once opportunities are identified, the second step is to explore potential solutions for the problem, which should include Machine Learning and GenAI techniques, if they can help solve the problem.

If it is decided to try out a solution that includes the use of predictive models, the third step would be an end-to-end definition and design of the solution or system. This way, we can ensure that the requirements on how the system will use the predictions influence the design and implementation of the predictive piece (what to predict, data to be used, real-time vs. batch, technical feasibility checks…).

    However, I’d like to add there might be a notable exception in this topic. Starting from GenAI solutions, instead of from the problem, can make sense if this technology ends up truly revolutionizing your sector or the world as we know it. There are a lot of discussions about this, but I’d say it is not clear yet whether that will happen or not. Up until now, we have seen this revolution in very specific sectors (customer support, marketing, design…) and related to people’s efficiency when performing certain tasks (coding, writing, creating…). For most companies though, unless it’s considered R&D work, delivering short/mid-term value still should mean focusing on problems, and considering GenAI just as any other potential solution to them.

    Start small (and maintain small if you can)

Tough experiences led to this learning as well. Those experiences had in common a big ML project defined in a waterfall manner: the kind of project that is set to take 6 months and follow the ML lifecycle phase by phase.

    Waterfall project planning following the ML Lifecycle phases, image by author

    What could go wrong, right? Let me remind you of my previous quote “With predictive models, the goal is to predict things you don’t know are predictable”! In a situation like this, it can happen that you arrive at month 5 of the project, and during the model evaluation realize there is no way the model is able to predict whatever it needs to predict with good enough quality. Or worse, you arrive at month 6, with a super model deployed in production, and realize it is not bringing any value.

    This risk combines with the uncertainties related to Product, and makes it mandatory to avoid big, waterfall initiatives if possible. This is not something new or related only to ML initiatives, so there is a lot we can learn from traditional software development, Agile, Lean, and other methodologies and mindsets. By starting small, validating assumptions soon and continuously, and iteratively experimenting and scaling, we can effectively mitigate this risk, adapt to insights and be more cost-efficient.

    While these principles are well-established in traditional software and product development, their application to ML initiatives is a bit more complex, as it is not easy to define “small” for an ML model and deployment. There are some approaches, though, that can help start small in ML initiatives.

    Rule-based approaches, simplifying a predictive model through a decision tree. This way, “predictions” can be easily implemented as “if-else statements” in production as part of the functionality or system, without the need to deploy a model.
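As an illustration, a churn-risk “model” reduced to a couple of rules could start out as simply as the snippet below; the feature names and thresholds are invented for the example, not tuned on any data.

```python
def churn_risk(days_since_last_login: int, orders_last_90_days: int) -> str:
    """Toy rule-based stand-in for a churn model; features and thresholds are illustrative."""
    if days_since_last_login > 30 and orders_last_90_days == 0:
        return "high"
    if days_since_last_login > 14:
        return "medium"
    return "low"

print(churn_risk(days_since_last_login=45, orders_last_90_days=0))  # -> "high"
```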

Proofs of Concept (POCs), as a way to validate offline the predictive feasibility of the ML solution, and to hint at the potential (or not) of the predictive step once in production.

Minimum Viable Products (MVPs), to first focus on essential features, functionalities, or user segments, and expand the solution only once the value has been proven. For an ML model, this can mean, for example, using only the most straightforward, priority input features, or predicting only for a segment of data points.

Buy instead of build, to leverage existing ML solutions or platforms and help reduce development time and initial costs. Only once the value is proven and costs grow too high might it be the right time to develop the ML solution in-house.

Using GenAI as an MVP: for some use cases (especially those involving text or images), GenAI APIs can be used as a first approach to solve the prediction step of the system. This works for tasks like text classification, sentiment analysis, or image detection, where GenAI models deliver impressive results. When the value is validated and if costs increase too much, the team can decide to build a specific “traditional” ML model in-house.
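For example, a text-classification MVP could lean on a hosted LLM before any in-house model exists. The sketch below assumes the openai Python client (v1+) with an API key in the environment, and the model name is just a placeholder choice.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def classify_sentiment(text: str) -> str:
    """MVP sentiment classifier backed by a hosted LLM instead of an in-house model."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name; any capable chat model works
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the sentiment of the user's message as positive, negative, "
                        "or neutral. Answer with a single word."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content.strip().lower()
```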

Note that using GenAI models for image or text classification, while possible and fast, means using a far too big and complex model (expensive, hard to control, prone to hallucinations…) for something that could be predicted with a much simpler and more controllable one. A fun analogy is delivering a pizza with a truck: it is feasible, but why not just use a bike?

    Data, data, and data (quality, volume, historic)

    Picture by Tima Miroshnichenko, on Pexels

Data is THE recurring problem Data Scientists and ML teams encounter when starting ML initiatives. How many times have you been surprised by data with duplicates, errors, missing batches, weird values… And how different that is from the toy datasets you find in online courses!

It can also happen that the data you need is simply not there: the tracking of the specific event was never implemented, or collection and proper ETLs were implemented only recently… I have experienced how this translates into having to wait several months before a project can start with enough data volume and history.

All this relates to the adage “Garbage in, garbage out”: ML models are only as good as the data they’re trained on. Many times, solutions have a bigger potential to be improved by improving the data than by improving the models (Data-Centric AI). Data needs to be sufficient in volume, history (data generated over years can bring more value than the same volume generated in just a week), and quality. To achieve that, mature data governance, collection, cleaning, and preprocessing are critical.

From the ethical AI point of view, data is also a primary source of bias and discrimination, so acknowledging that and taking action to mitigate these risks is paramount. Considering data governance principles, privacy, and regulatory compliance (e.g., the EU’s GDPR) is also key to ensuring the responsible use of data (especially when dealing with personal data).

With GenAI models, this is shifting: huge volumes of data have already been used to train them. When using these types of models, we might not need high-volume, high-quality data for training, but we might need it for fine-tuning (see Good Data = Good GenAI), or to construct the prompts (enriching the context, few-shot learning, Retrieval Augmented Generation… I explained all these concepts in a previous post!).

It is important to note that by using these models we lose control over the data used to train them, and we can suffer from the lack of quality or the type of data used there: there are many known examples of bias and discrimination in GenAI outputs that can negatively impact our solution. A good example is Bloomberg’s article “How ChatGPT is a recruiter’s dream tool — tests show there’s racial bias”. LLM leaderboards that test for biases, or LLMs specifically trained to avoid them, can be useful in this sense.

    Gender bias example with ChatGPT (prompting on May 1st 2024)

    Wrapping it up

We started this blog post by discussing what makes ML Product initiatives especially tricky: the combination of the uncertainty of developing solutions for digital products with the uncertainty of trying to predict things through ML models.

It is comforting to know there are actionable steps and strategies available to mitigate these risks. Yet perhaps the best ones are related to starting the initiatives off on the right foot! To do so, it really helps to start with the right problem and an end-to-end design of the solution, reduce the initial scope, and prioritize data quality, volume, and history.

    I hope this post was useful and that it will help you challenge how you start working in future new initiatives related to ML Products!


    Starting ML Product Initiatives on the Right Foot was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Starting ML Product Initiatives on the Right Foot

    Go Here to Read this Fast! Starting ML Product Initiatives on the Right Foot

  • AWS Inferentia and AWS Trainium deliver lowest cost to deploy Llama 3 models in Amazon SageMaker JumpStart


    Xin Huang

    Today, we’re excited to announce the availability of Meta Llama 3 inference on AWS Trainium and AWS Inferentia based instances in Amazon SageMaker JumpStart. The Meta Llama 3 models are a collection of pre-trained and fine-tuned generative text models. Amazon Elastic Compute Cloud (Amazon EC2) Trn1 and Inf2 instances, powered by AWS Trainium and AWS […]

    Originally appeared here:
    AWS Inferentia and AWS Trainium deliver lowest cost to deploy Llama 3 models in Amazon SageMaker JumpStart


  • From Social Science to Data Science


    Matt Chapman

8 years ago I started my bachelor’s degree in Geography. Now I’m a Data Scientist; this is the story of how (and why) I got here

    Originally appeared here:
    From Social Science to Data Science


  • Revolutionize Customer Satisfaction with tailored reward models for your business on Amazon SageMaker


    Dinesh Subramani

    As more powerful large language models (LLMs) are used to perform a variety of tasks with greater accuracy, the number of applications and services that are being built with generative artificial intelligence (AI) is also growing. With great power comes responsibility, and organizations want to make sure that these LLMs produce responses that align with […]

    Originally appeared here:
    Revolutionize Customer Satisfaction with tailored reward models for your business on Amazon SageMaker
