Category: Artificial Intelligence

• How to Create an LLM-Powered app to Convert Text to Presentation Slides: GenSlide — A Step-by-step Guide

How to Create an LLM-Powered app to Convert Text to Presentation Slides: GenSlide — A Step-by-step Guide

    Mehdi Mohammadi

    Photo by Mitchell Luo on Unsplash

In this post, I am going to share how you can create a simple yet powerful application that uses LLMs to convert your written content into concise PowerPoint slides. The best part: you run your own LLM service, so

• you keep your data private, and
• there is no cost for calling LLM APIs.

    Getting Started

Getting GenSlide up and running is straightforward. Follow these steps to set up and run the tool on your machine.

    Step 1: Create project folder

Begin by creating the project folder on your local machine:

    mkdir GenSlide

    After completing all the steps, the final file structure would look like this:

    GenSlide
    ├── frontend
    │ ├── llm_call.py
    │ ├── slide_deck.py
    │ ├── slide_gen.py
    │ └── ui.py
    ├── llm-service
    │ ├── consts.py
    │ └── gpt.py
    └── requirements.txt

The first file we create contains the list of package dependencies. Create a file named requirements.txt and add the following packages to it.

    pillow==10.3.0
    lxml==5.2.2
    XlsxWriter==3.2.0
    python-pptx==0.6.23
    gpt4all==2.7.0
    Flask==2.2.5
    Flask-Cors==4.0.0
    streamlit==1.34.0

    Specifically, we’re leveraging the gpt4all package to run a large language model (LLM) server on a local machine. To dive deeper into gpt4all, refer to their official documentation.
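If you want to confirm that gpt4all works on your machine before building the service, a minimal standalone check looks roughly like this (this snippet is not part of the project files, and it downloads the model on first run):

# Minimal gpt4all check (not part of the project files).
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")  # downloaded on first use
with model.chat_session():
    print(model.generate("Say hello in five words.", max_tokens=20))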

We also use the streamlit package to create the user interface.

    Step 2: Setting Up the Environment

    Next, create a virtual environment and install the necessary packages:

    python -m venv ./venv
    source ./venv/bin/activate
    pip install -r requirements.txt

    Note: Ensure that you are using a Python version other than 3.9.7, as streamlit is incompatible with that version. For this tutorial I used Python version 3.12.

    Step 3: Implement LLM Service

Our LLM service should receive text as input and generate a summary of its key points as output, organized as a list of JSON objects. We’ll specify these details in the prompt definition. Let’s first create a folder for the LLM service.

    mkdir llm-service

    We arrange the implementation into two .py files in this folder.

    1. consts.py

Here we define the name of the LLM model we want to use. You can see the list of available models here: https://docs.gpt4all.io/gpt4all_python/home.html#load-llm. Meta’s Llama model performs well for this task.

    LLM_MODEL_NAME = "Meta-Llama-3-8B-Instruct.Q4_0.gguf"

We also define the prompt message here, which includes the instructions to the LLM as well as some examples of the desired output. We ask for the output in JSON format so it is easier to process when creating the presentation.

    PROMPT = """
Summarize the input text and arrange it in an array of JSON objects to be suitable for a powerpoint presentation.
    Determine the needed number of json objects (slides) based on the length of the text.
    Each key point in a slide should be limited to up to 10 words.
    Consider maximum of 5 bullet points per slide.
    Return the response as an array of json objects.
    The first item in the list must be a json object for the title slide.
    This is a sample of such json object:
    {
    "id": 1,
    "title_text": "My Presentation Title",
    "subtitle_text": "My presentation subtitle",
    "is_title_slide": "yes"
    }
And here is a sample of the json data for content slides:
{"id": 2, "title_text": "Slide 1 Title", "text": ["Bullet 1", "Bullet 2"]},
{"id": 3, "title_text": "Slide 2 Title", "text": ["Bullet 1", "Bullet 2", "Bullet 3"]}

    Please make sure the json object is correct and valid.
    Don't output explanation. I just need the JSON array as your output.
    """

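To make the target format concrete, here (written as a Python literal) is the kind of array the prompt is meant to elicit for a short text about the history of AI. This is purely illustrative; the actual output will vary with the model and the input.

# Illustrative only: the structure the prompt asks the LLM to return.
expected_output_example = [
    {"id": 1, "title_text": "A Brief History of AI",
     "subtitle_text": "From Turing to Dartmouth", "is_title_slide": "yes"},
    {"id": 2, "title_text": "Early Foundations",
     "text": ["Science fiction popularized the idea",
              "Turing's 1950 paper posed key questions"]},
]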
    2. gpt.py

Here we want to create a Flask application that receives HTTP POST requests from clients and calls the LLM model to extract the summary as JSON.

First things first: import the dependencies.

    from flask import Flask, request
    from flask_cors import CORS
    import traceback
    import logging
    import os
    from consts import LLM_MODEL_NAME, PROMPT

    from gpt4all import GPT4All

Define the host IP, port, and Flask app, and allow Cross-Origin Resource Sharing (CORS).

    logger = logging.getLogger()

    HOST = '0.0.0.0'
    PORT = 8081

    app = Flask(__name__)
    CORS(app)

Define a base folder to store the LLM model. By default, models will be stored in the project folder under “gpt_models/gpt4all/”, overriding the default location set by gpt4all; you can point somewhere else by setting the “MODEL_PATH” environment variable. When the GPT4All class is instantiated for the first time, it looks for model_name in model_path (its argument); if the model is not found there, it starts downloading it.

try:
    base_folder = os.path.dirname(__file__)
    base_folder = os.path.dirname(base_folder)
    gpt_models_folder = os.path.join(base_folder, "gpt_models/gpt4all/")
    if not os.path.exists(gpt_models_folder):
        os.makedirs(gpt_models_folder, exist_ok=True)
    model_folder = os.environ.get("MODEL_PATH", gpt_models_folder)
    llm_model = GPT4All(model_name=LLM_MODEL_NAME, model_path=model_folder)
except Exception:
    raise ValueError("Error loading LLM model.")

    Define a function to call the generate() function of the LLM model and return the response. We may set optional parameters such as temperature and max_tokens.

def generate_text(content):
    prompt = PROMPT + f"\n{content}"

    with llm_model.chat_session():
        output = llm_model.generate(prompt, temp=0.7, max_tokens=1024)
        output = output.strip()

    return output

Define a POST API to receive clients’ requests. Requests come in the form of JSON objects {“content”:”…”}. We use this “content” value and call the generate_text() method defined above. If everything goes well, we send the output along with a 200 (OK) HTTP status code. Otherwise, an “Error” message and status code 500 are returned.

@app.route('/api/completion', methods=['POST'])
def completion():
    try:
        req = request.get_json()
        words = req.get('content')
        if not words:
            raise ValueError("No input word.")
        output = generate_text(words)
        return output, 200
    except Exception:
        logger.error(traceback.format_exc())
        return "Error", 500

    Run the Flask app.

if __name__ == '__main__':
    # run web server
    app.run(host=HOST,
            debug=True,  # automatic reloading enabled
            port=PORT)

    Step 4: Implement front end

The frontend is where we take the user’s input, interact with the LLM service, and finally create the PowerPoint slides.

    Inside the project folder, create a folder named frontend.

    mkdir frontend

    The implementation falls into 4 Python files.

    1. llm_call.py

This is where we send POST requests to the LLM server, which runs on localhost port 8081. We wrap the input text in a JSON object under the key “content”. The output of the call should be a JSON string.

    import requests

    URL = "http://127.0.0.1:8081"

    CHAT_API_ENDPOINT = f"{URL}/api/completion"

def chat_completion_request(content):
    headers = {'Content-type': 'application/json'}
    data = {'content': content}

    req = requests.post(url=CHAT_API_ENDPOINT, headers=headers, json=data)
    json_extracted = req.text
    return json_extracted

    2. slide_deck.py

Here we use the python-pptx package to create the PowerPoint slides. The list of JSON objects contains the information needed to add slides to the presentation. For detailed information about python-pptx, refer to its documentation.

    import os

    from pptx import Presentation
    from pptx.util import Inches

class SlideDeck:

    def __init__(self, output_folder="generated"):
        self.prs = Presentation()
        self.output_folder = output_folder

    def add_slide(self, slide_data):
        prs = self.prs
        bullet_slide_layout = prs.slide_layouts[1]
        slide = prs.slides.add_slide(bullet_slide_layout)
        shapes = slide.shapes

        # Title
        title_shape = shapes.title
        title_shape.text = slide_data.get("title_text", "")

        # Body
        if "text" in slide_data:
            body_shape = shapes.placeholders[1]
            tf = body_shape.text_frame
            for bullet in slide_data.get("text", []):
                p = tf.add_paragraph()
                p.text = bullet
                p.level = 0

            if "p1" in slide_data:
                p = tf.add_paragraph()
                p.text = slide_data.get("p1")
                p.level = 1

        if "img_path" in slide_data:
            cur_left = 6
            for img_path in slide_data.get("img_path", []):
                top = Inches(2)
                left = Inches(cur_left)
                height = Inches(4)
                pic = slide.shapes.add_picture(img_path, left, top, height=height)
                cur_left += 1

    def add_title_slide(self, title_page_data):
        # title slide
        prs = self.prs
        title_slide_layout = prs.slide_layouts[0]
        slide = prs.slides.add_slide(title_slide_layout)
        title = slide.shapes.title
        subtitle = slide.placeholders[1]
        if "title_text" in title_page_data:
            title.text = title_page_data.get("title_text")
        if "subtitle_text" in title_page_data:
            subtitle.text = title_page_data.get("subtitle_text")

    def create_presentation(self, title_slide_info, slide_pages_data=[]):
        try:
            file_name = title_slide_info.get("title_text").lower().replace(",", "").replace(" ", "-")
            file_name += ".pptx"
            file_name = os.path.join(self.output_folder, file_name)
            self.add_title_slide(title_slide_info)
            for slide_data in slide_pages_data:
                self.add_slide(slide_data)

            self.prs.save(file_name)
            return file_name
        except Exception as e:
            raise e
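As a quick sanity check of the class above, you can run it in isolation with hard-coded slide data (a throwaway sketch, not one of the project files; the data is made up):

# Throwaway test: build a two-slide deck from hard-coded (made-up) data.
import os
from slide_deck import SlideDeck

os.makedirs("generated", exist_ok=True)  # SlideDeck writes into this folder
deck = SlideDeck()
title_slide = {
    "id": 1,
    "title_text": "Demo Deck",
    "subtitle_text": "Created with python-pptx",
    "is_title_slide": "yes",
}
slides = [{"id": 2, "title_text": "Key Points", "text": ["First point", "Second point"]}]
print(deck.create_presentation(title_slide, slides))  # e.g. generated/demo-deck.pptx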

    3. slide_gen.py

    Let’s break it into smaller snippets.

Here, after importing the necessary packages, we create a folder to store the generated .pptx files.

    import json
    import os

    from slide_deck import SlideDeck
    from llm_call import chat_completion_request

    FOLDER = "generated"

if not os.path.exists(FOLDER):
    os.makedirs(FOLDER)

    Then define these two methods:

• A method that invokes chat_completion_request to send the request to the LLM and parses the returned JSON string, and
• A method that takes the output of the previous method and instantiates a SlideDeck to fill in the PowerPoint slides.

def generate_json_list_of_slides(content):
    try:
        resp = chat_completion_request(content)
        obj = json.loads(resp)
        return obj
    except Exception as e:
        raise e

def generate_presentation(content):
    deck = SlideDeck()
    slides_data = generate_json_list_of_slides(content)
    title_slide_data = slides_data[0]
    slides_data = slides_data[1:]
    return deck.create_presentation(title_slide_data, slides_data)

    4. ui.py

We create a simple UI with an input text box. The user can type or copy/paste their text there and hit enter to start slide generation. When generation completes, a message is printed below the input text box. streamlit is very handy here.

    import traceback
    import streamlit as st

    from slide_gen import generate_presentation

def create_ui():
    st.write("""
    # Gen Slides
    ### Generating powerpoint slides for your text
    """)

    content = st.text_area(label="Enter your text:", height=400)
    try:
        if content:
            filename = generate_presentation(content)
            st.write(f"file {filename} is generated.")
    except Exception:
        st.write("Error in generating slides.")
        st.write(traceback.format_exc())

if __name__ == "__main__":
    create_ui()

    Step 5: Running the LLM Service

    Navigate to the llm-service folder and run the gpt.py file:

    cd llm-service
    python gpt.py

    Note: The first time you run this, the LLM model will be downloaded, which may take several minutes to complete.
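With the service up, you can optionally smoke-test the endpoint from another terminal with a few lines of Python (a rough sketch of what the frontend’s llm_call.py will do):

# Optional smoke test for the running LLM service.
import requests

resp = requests.post(
    "http://127.0.0.1:8081/api/completion",
    json={"content": "Alan Turing's 1950 paper posed fundamental questions about machine intelligence."},
)
print(resp.status_code)  # expect 200
print(resp.text)         # expect a JSON array of slide objects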

    Step 6: Launching the User Interface (UI)

    Now, it’s time to bring up the UI. Navigate to the frontend folder and run the ui.py file using Streamlit:

    cd ..
    cd frontend
    streamlit run ui.py

    This command will launch the UI in your default web browser.

    Creating Your PowerPoint Presentation

    With the UI up and running, follow these simple steps to generate your presentation:

    1. Input Text: In the provided text box, input the content you’d like to transform into a presentation. Here is the sample you may use:

    Artificial Intelligence is an idea that has been captivating society since the mid-20th century.
    It began with science fiction familiarizing the world with the concept but the idea wasn't fully seen in the scientific manner until Alan Turing, a polymath, was curious about the feasibility of the concept.
    Turing's groundbreaking 1950 paper, "Computing Machinery and Intelligence," posed fundamental questions about machine reasoning similar to human intelligence, significantly contributing to the conceptual groundwork of AI.
    The development of AI was not very rapid at first because of the high costs and the fact that computers were not able to store commands.
    This changed during the 1956 Dartmouth Summer Research Project on AI where there was an inspiring call for AI research, setting the precedent for two decades of rapid advancements in the field.

2. Generate Slides: Once you enter the text (press ⌘ Command + Enter on a Mac), GenSlide will process it and create the presentation .pptx file.

3. Access Your Slides: The newly created PowerPoint file will be saved in the frontend/generated folder.

(Screenshots: the user interface for entering the text, and the generated PowerPoint slides.)

    Congratulations! The ability to automate slide generation is not just a technical achievement; it’s a time-saving marvel for professionals and students alike. For the next steps, the application can be extended to read the text from other formats like PDF files, MS Word documents, web pages and more. I would be happy to hear how you use or extend this project.

    For further enhancement and contributions, feel free to explore the repository on GitHub.


    How to Create an LLM-Powered app to Convert Text to Presentation Slides: GenSlide — A Step-by-step… was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • Embracing Simplicity and Composability in Data Engineering

    Embracing Simplicity and Composability in Data Engineering

    Bernd Wessely

    Lessons from 30+ years in data engineering: The overlooked value of keeping it simple

    Image by author

    We have a straightforward and fundamental principle in computer programming: the separation of concerns between logic and data. Yet, when I look at the current data engineering landscape, it’s clear that we’ve strayed from this principle, complicating our efforts significantly — I’ve previously written about this issue.

There are other elegantly simple principles that we frequently overlook and fail to follow. The developers of the Unix operating system, for instance, introduced well thought-out and simple abstractions for building software products. These principles have stood the test of time, evident in millions of applications built upon them. However, for some reason we often take convoluted detours via complex and often closed ecosystems, losing sight of the KISS principle and the Unix philosophy of simplicity and composability.

    Why does this happen?

    Let’s explore some examples and delve into a bit of history to better understand this phenomenon. This exploration might help to grasp why we repeatedly fail to keep things simple.

    Databases

    Unix-like systems offer a fundamental abstraction of data as files. In these systems nearly everything related to data is a file, including:

    • Regular Files: Typically text, pictures, programs, etc.
    • Directories: A special type of file containing lists of other files, organizing them hierarchically.
    • Devices: Files representing hardware devices, including block-oriented (disks) and character-oriented devices (terminals).
    • Pipes: Files enabling communication between processes.
    • Sockets: Files facilitating network communication between computer nodes.

Each application can use common operations that all work similarly on these different file types, like open(), read(), write(), close(), and lseek() (change the position inside a file). The content of a file is just a stream of bytes, and the system makes no assumptions about the structure of a file’s content. For every file the system maintains basic metadata about the owner, access rights, timestamps, size, and location of the data blocks on disk.
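As a small illustration (in Python, which exposes these calls more or less directly), the same handful of operations works on a regular file, a device, or a pipe, and the content is treated as an untyped byte stream; the path below is just an example:

# Illustration: the Unix file abstraction as a plain byte stream.
import os

fd = os.open("/etc/hostname", os.O_RDONLY)  # open (any file-like object would do)
os.lseek(fd, 0, os.SEEK_SET)                # change position inside the file
data = os.read(fd, 64)                      # read up to 64 bytes, no structure assumed
os.close(fd)                                # close
print(data)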

    This compact and at the same time versatile abstraction supports the construction of very flexible data systems. It has, for instance, also been used to create the well-known relational database systems, which introduced the new abstraction called relation (or table) for us.

Unfortunately these systems evolved in ways that moved away from treating relations as files. Accessing the data in these relations now requires calling the database application, using the Structured Query Language (SQL), which was defined as the new interface to the data. This allowed databases to better control access and offer higher-level abstractions than the file system.

Was this an improvement in general? For a few decades, we obviously believed it was, and relational database systems became all the rage. Interfaces such as ODBC and JDBC standardized access to various database systems, making relational databases the default for many developers. Vendors promoted their systems as comprehensive solutions, incorporating not just data management but also business logic, encouraging developers to work entirely within the database environment.

A brave man named Carlos Strozzi tried to counteract this development and adhere to the Unix philosophy. He aimed to keep things simple and treat the database as just a thin extension to the Unix file abstraction. Because he didn’t want to force applications to only use SQL for accessing the data, he called it NoSQL RDBMS. The term NoSQL was later taken over by the movement towards alternative data storage models driven by the need to handle increasing data volumes at internet scale. Relational databases were dismissed by the NoSQL community as outdated and incapable of addressing the needs of modern data systems. A confusing multitude of new APIs emerged.

    Ironically, the NoSQL community eventually recognized the value of a standard interface, leading to the reinterpretation of NoSQL as “Not Only SQL” and the reintroduction of SQL interfaces to NoSQL databases. Concurrently, the open-source movement and new open data formats like Parquet and Avro emerged, saving data in plain files compatible with the good old Unix file abstractions. Systems like Apache Spark and DuckDB now use these formats, enabling direct data access via libraries relying solely on file abstractions, with SQL as one of many access methods.

Ultimately, databases didn’t offer a better abstraction for implementing all the multifaceted requirements of the enterprise. SQL is a valuable tool, but not the only or the best option. We had to take the detour via RDBMS and NoSQL databases to end up back at files. Maybe we now recognize that simple Unix-like abstractions actually provide a robust foundation for the versatile requirements of data management.

Don’t get me wrong, databases remain crucial, offering features like ACID, granular access control, indexing, and many others. However, I think that one single monolithic system with a constrained and opinionated way of representing data is not the right way to deal with all the varied requirements at the enterprise level. Databases add value but should be open and usable as components within larger systems and architectures.

    New ecosystems everywhere

Databases are just one example of the trend to create new ecosystems that aim to be the better abstraction for applications to handle data and even logic. A similar phenomenon occurred with the big data movement. In an effort to process the enormous amounts of data that traditional databases could apparently no longer handle, a whole new ecosystem emerged around the distributed data system Hadoop.

    Hadoop implemented the distributed file system HDFS, tightly coupled with the processing framework MapReduce. Both components are completely Java-based and run in the JVM. Consequently, the abstractions offered by Hadoop were not seamless extensions to the operating system. Instead, applications had to adopt a completely new abstraction layer and API to leverage the advancements in the big data movement.

    This ecosystem spawned a multitude of tools and libraries, ultimately giving rise to the new role of the data engineer. A new role that seemed inevitable because the ecosystem had grown so complex that regular software engineers could no longer keep up. Clearly, we failed to keep things simple.

    Distributed operating system equivalents

    With the insight that big data can’t be handled by single systems, we witnessed the emergence of new distributed operating system equivalents. This somewhat unwieldy term refers to systems that allocate resources to software components running across a cluster of compute nodes.

For Hadoop, this role was filled by YARN (Yet Another Resource Negotiator), which managed resource allocation among the running MapReduce jobs in Hadoop clusters, much like an operating system allocates resources among processes running in a single system.

Consequently, an alternative approach would have been to scale the Unix-like operating systems across multiple nodes while retaining familiar single-system abstractions. Indeed, such systems, known as Single System Image (SSI), were developed independently of the big data movement. This approach abstracted away the fact that the Unix-like system ran on many distributed nodes, promising horizontal scaling while evolving proven abstractions. However, the development of these systems apparently proved too complex, and it stagnated around 2015.

A key factor in this stagnation was likely the parallel development by influential cloud providers, who advanced YARN-like functionality into a distributed orchestration layer for standard Linux systems. Google, for example, pioneered this with its internal system Borg, which apparently required less effort than rewriting the operating system itself. But once again, we sacrificed simplicity.

Today, we lack a system that transparently scales single-system processes across a cluster of nodes. Instead, we were blessed (or cursed?) with Kubernetes, which evolved from Google’s Borg to become the de-facto standard for a distributed resource and orchestration layer running containers in clusters of Linux nodes. Known for its complexity, Kubernetes requires learning about Persistent Volumes, Persistent Volume Claims, Storage Classes, Pods, Deployments, Stateful Sets, Replica Sets and more. A totally new abstraction layer that bears little resemblance to the simple, familiar abstractions of Unix-like systems.

    Agility

    It is not only computer systems that suffer from supposed advances that disregard the KISS principle. The same applies to systems that organize the development process.

Since 2001, we have had a lean and well-thought-out manifesto of principles for agile software development. Following these straightforward principles helps teams to collaborate, innovate, and ultimately produce better software systems.

    However, in our effort to ensure successful application, we tried to prescribe these general principles more precisely, detailing them so much that teams now require agile training courses to fully grasp the complex processes. We finally got overly complex frameworks like SAFe that most agile practitioners wouldn’t even consider agile anymore.

You do not have to believe in agile principles — some argue that agile working has failed — to see the point I’m making. We tend to complicate things excessively when commercial interests gain the upper hand or when we rigidly prescribe rules that we believe must be followed. There’s a great talk on this by Dave Thomas (one of the authors of the manifesto) where he explains what happens when we forget about simplicity.

    Trust in principles and architecture, not products and rituals

    The KISS principle and the Unix philosophy are easy to understand, but in the daily madness of data architecture in IT projects, they can be hard to follow. We have too many tools, too many vendors selling too many products that all promise to solve our challenges.

    The only way out is to truly understand and adhere to sound principles. I think we should always think twice before replacing tried and tested simple abstractions with something new and trendy.

    I’ve written about my personal strategy for staying on top of things and understanding the big picture to deal with the extreme complexity we face.

    Commercialism must not determine decisions

    It is hard to follow the simple principles given by the Unix philosophy when your organization is clamoring for a new giant AI platform (or any other platform for that matter).

    Enterprise Resource Planning (ERP) providers, for instance, made us believe at the time that they could deliver systems covering all relevant business requirements in a company. How dare you contradict these specialists?

    Unified Real-Time (Data) Platform (URP) providers now claim their systems will solve all our data concerns. How dare you not use such a comprehensive system?

    But products are always just a small brick in the overall system architecture, no matter how extensive the range of functionality is advertised.

    Data engineering should be grounded in the same software architecture principles used in software engineering. And software architecture is about balancing trade-offs and maintaining flexibility, focusing on long-term business value. Simplicity and composability can help you maintain this focus.

    Pressure from closed thinking models

It is not only commercialism that keeps us from adhering to simplicity. Even open source communities can be dogmatic. While we seek golden rules for perfect systems development, they don’t exist in reality.

The Python community may say that non-pythonic code is bad. The functional programming community might claim that applying OOP principles will send you to hell. And the protagonists of agile programming may want to convince you that any development following the waterfall approach will doom your project to failure. Of course, they are all wrong in their absolutism, but we often dismiss ideas outside our thinking space as inappropriate.

We like clear rules that we just have to follow to be successful. At one of my clients, for instance, the software development team had intensely studied software design patterns. Such patterns can be very helpful in finding a tried and tested solution for common problems. But what I actually observed in the team was that they viewed these patterns as rules they had to adhere to rigidly. Not following these rules felt like being a bad software engineer. But this often led to overly complex designs for very simple problems. Critical thinking based on sound principles cannot be replaced by rigid adherence to rules.

    In the end, it takes courage and thorough understanding of principles to embrace simplicity and composability. This approach is essential to design reliable data systems that scale, can be maintained, and evolve with the enterprise.

If you find this information useful, please consider clapping. I would be more than happy to receive your feedback with your opinions and questions.


    Embracing Simplicity and Composability in Data Engineering was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • Essential Considerations for Implementing Machine Learning

    Essential Considerations for Implementing Machine Learning

    Conal Henderson

    Is your use case a viable ML product from a traditional ML and production perspective?

Image by Tara Winstead on Pexels

    Have you ever thought about building a data application, but don’t know the requirements for building an ML system? Or, maybe you’re a senior manager at your company with ambitions to use ML, but you’re not quite sure if your use case is ML-friendly.

    Lots of businesses are struggling to keep up with the exponential growth of AI/ML technology, with many aware that the implications of not factoring AI/ML into their roadmap may be existential.

Companies see the power of Large Language Models (LLMs) and think that AI/ML is a ‘silver bullet’ for their problems. Most businesses are spending money on new data teams, computing power, and the latest database technology, but do they know if their problem can be solved using ML?

I have distilled a checklist to validate whether your ML idea is viable from a traditional ML perspective, including:

    1. Do you have the appropriate features to make a prediction?
    2. Are there patterns to learn from in your data?
    3. Do you have enough data for ML to be effective, or can you collect data from sources?
    4. Can your use case be framed as a prediction problem?
    5. Does the data you wish to predict have associated patterns with the training data?

    And, from the viewpoint of productionising ML solutions:

    1. Is your use case repetitive?
    2. Will wrong predictions have drastic consequences for end users?
    3. Is your use case scalable?
    4. Is your use case a problem where patterns continually evolve?

    Traditional Considerations

Arthur Samuel first popularised the phrase ‘Machine Learning’ in 1959, stating that it is “the field of study that gives computers the ability to learn without being explicitly programmed”.

A more systematic definition of ML is given by Chip Huyen — an AI/ML leader and entrepreneur — in her book ‘Designing Machine Learning Systems’ — a must-read for anyone interested in production ML:

    “Machine learning is an approach to (1) learn (2) complex patterns from (3) existing data and use these patterns to make (4) predictions on (5) unseen data.”

    Chip breaks down the components of ML into five chunks, and expands on them by including four modern reasons for ML adoption which we’re going to dissect further below.

    Opportunity to Learn

    Do you have the appropriate features to make a prediction?

Data is fundamental to ML. It provides both the inputs and the outputs, allowing a model to produce predictions that reflect patterns in the data.

For example, you might be an avid football fan, and you want to predict Premier League player market values based on past performance.

    The input data would involve player statistics like goals and assists, and the associated player value. An ML model can learn the patterns from this input data to predict unseen player data.

    Complex Patterns

    Are there patterns to learn from in your data?

    ML is at its best when data is complicated, and a human cannot easily identify the patterns needed to predict an output.

    In the football player market value example, it can be difficult to precisely say the value of a footballer given there are many variables that value depends on. ML models can take value (output) and performance statistics (input), and figure out the valuation automatically.

    Data Availability

    Do you have enough data for ML to be effective, or can you collect data from sources?

There is an ongoing debate as to whether data or better algorithms lead to greater predictive power. This debate has quietened lately, though, given the enormous performance leaps taken by LLMs as models grow to hundreds of billions, and even trillions, of parameters.

    Data Source: Wikipedia

    Data needs to be readily available for your ML application to learn from. If data is scarce, then ML is likely not the best approach.

    In football, data is constantly being generated on player performance by data vendors such as Opta, Fbref, and Transfermarkt as teams look to apply data-driven decisions to all club aspects from player performance to recruitment.

    However, obtaining data from third parties like Opta is expensive due to the intense data collection process and the high demand for detailed stats to give teams an advantage.

    Problem Solved by Prediction

    Can your use case be framed as a prediction problem?

    We can frame the football player market value example as a prediction problem in several ways.

Two common strands of ML prediction are regression and classification. Regression returns a continuous prediction (i.e. a number) on the same scale as the target variable (i.e. value). Classification, on the other hand, can return a binary (1 or 0), multi-class (1, 2, 3…n), or multi-label (1, 0, 1, 0, 1) prediction.

The player value prediction problem can be framed as either a regression or a multi-class classification problem. Framed as regression, the model simply returns a number, such as predicting £100 million for Jude Bellingham’s value based on his season performance.

Conversely, if we address this as a classification problem, we can bin valuations into buckets and predict which bucket a player resides in. For instance, prediction buckets could be £1m-£10m, £10m-£30m, and £30m+.
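As a rough sketch (hypothetical feature names and toy numbers, using scikit-learn), the same player-value problem can be framed either way:

# Hypothetical sketch: the same target framed as regression or classification.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

df = pd.DataFrame({
    "goals":   [20, 5, 12, 2],
    "assists": [10, 3, 7, 1],
    "value_m": [100.0, 8.0, 45.0, 2.5],  # market value in £ millions (made up)
})
X = df[["goals", "assists"]]

# Regression: predict the value directly as a number.
reg = RandomForestRegressor(random_state=0).fit(X, df["value_m"])

# Classification: bin the values into buckets and predict the bucket.
buckets = pd.cut(df["value_m"], bins=[0, 10, 30, np.inf],
                 labels=["£1m-£10m", "£10m-£30m", "£30m+"])
clf = RandomForestClassifier(random_state=0).fit(X, buckets)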

    Similar Unseen Data

    Does the data you wish to predict have associated patterns with the training data?

    The unseen data that you want to predict must share similar patterns with the data used to train the ML model.

For example, suppose I use player data from 2004 to train an ML model to predict player valuations. If the unseen data is from 2020, the predictions will not reflect the changes in market valuations across the 16 years between training and prediction.

    Production Considerations

    ML model development is only a small component of a much larger system needed to bring ML to life.

    If you build a model in isolation without an understanding of how it will perform at scale, when it comes to production you may find that your model is not viable.

It’s important that your ML use case satisfies production-level criteria.

    Repetitive Task

    Is your use case repetitive?

ML learns from repeated patterns. Models need to be fed a large number of samples to learn adequately, which means that if your prediction target occurs frequently, you will likely have enough data for ML to learn the patterns.

    For example, if your use case involves trying to predict something that occurs rarely, like an uncommon medical condition, then there’s likely not enough signal in your data for an ML model to pick up on, leading to a poor prediction.

    This problem is referred to as a class imbalance, and strategies such as over-sampling and under-sampling have been developed to overcome this problem.

    Travis Tang’s article does a good job of explaining class imbalance and remedies for it in more detail here.
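As a minimal sketch (synthetic data, scikit-learn only), one simple remedy is to weight classes inversely to their frequency during training; libraries such as imbalanced-learn add dedicated over- and under-sampling utilities on top of this:

# Minimal sketch: a rare positive class (~2%) handled with class weighting.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))              # synthetic features
y = (rng.random(1000) < 0.02).astype(int)   # rare event, roughly 2% positives

# class_weight="balanced" up-weights the rare class so it is not ignored.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)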

    Small Consequence for Wrong Prediction

    Will wrong predictions have drastic consequences for end users?

ML models will never predict with 100% accuracy, so ask yourself: when your model makes a wrong prediction, does it have a negative impact?

    This is a common problem experienced in the medical sector where false-positive and false-negative rates are a concern.

    A false-positive prediction indicates the presence of a condition when it does not exist. This can lead to inefficient allocation of resources and undue stress on patients.

    Perhaps even worse, a false-negative does not indicate the presence of a condition when it does exist. This can lead to patient misdiagnosis and delay of treatment which may lead to medical complications, and increased long-run costs to treat more severe conditions.

    Scale

    Is your use case scalable?

Production can be incredibly expensive. I found this myself when I hosted an XGBRegressor model on Google’s Vertex AI, which cost me £11 for 2 days! Admittedly, I should not have left it running, but imagine the costs for large-scale applications.

    A well-known example of a scalable ML solution is Amazon’s product recommendation system which generates 35% of the company’s revenue.

    Although it’s an extreme example, this system leverages and justifies the cost of computing power, data, infrastructure, and talented workers, illustrating the fundamentals of building a scalable ML solution that generates value.

    Evolving Patterns

    Is your use case a problem where patterns continually evolve?

    ML is flexible enough to fit new patterns easily and prevents the need to endlessly hard code new solutions every time the data changes.

Football player values are constantly changing as tactics evolve, changing what teams want from players; as a result, the weight each feature carries in predicting value shifts over time.

To monitor these changes, tools like MLflow and Weights & Biases help track and log the performance of your models and update them to match the evolving data patterns.

    Conclusion

Deciding to use ML for your use case involves much more than taking some historical data you’ve got, slapping a fancy algorithm on it, and hoping for the best.

It requires thinking about whether there are complex patterns to learn and whether you have data available now and in the future, as well as production concerns: Is the cost of a wrong prediction low? Is my use case scalable? Are the patterns constantly evolving?

    There are reasons you should NOT use ML, including ethics, cost-effectiveness, and whether a simpler solution will suffice, but we can leave that for another time.

    That’s all for now!

    Thanks for reading! Let me know if I’ve missed anything, and I would love to hear from people about their ML use cases!

    Connect with me on LinkedIn

    References

    Huyen, C. (2022). Designing Machine Learning Systems. Sebastopol, CA: O’Reilly

    Geron, A. (2019). Hands-on machine learning with Scikit-Learn, Keras and TensorFlow: concepts, tools, and techniques to build intelligent systems (2nd ed.). O’Reilly.


    Essential Considerations for Implementing Machine Learning was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • Building a marketing data science team from scratch

    Building a marketing data science team from scratch

    Jose Parreño

    From scratch to a 6-member team: How I built Skyscanner’s marketing data science team, proving value by being focused and strong…


  • Cepsa Química improves the efficiency and accuracy of product stewardship using Amazon Bedrock

    Cepsa Química improves the efficiency and accuracy of product stewardship using Amazon Bedrock

    Vicente Cruz Mínguez

    In this post, we explain how Cepsa Química and partner Keepler have implemented a generative AI assistant to increase the efficiency of the product stewardship team when answering compliance queries related to the chemical products they market. To accelerate development, they used Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy and safety.


  • GraphStorm 0.3: Scalable, multi-task learning on graphs with user-friendly APIs

    GraphStorm 0.3: Scalable, multi-task learning on graphs with user-friendly APIs

    Xiang Song

    GraphStorm is a low-code enterprise graph machine learning (GML) framework to build, train, and deploy graph ML solutions on complex enterprise-scale graphs in days instead of months. With GraphStorm, you can build solutions that directly take into account the structure of relationships or interactions between billions of entities, which are inherently embedded in most real-world […]
