Overview
Large Language Models (LLMs) are often described as Generative Artificial Intelligence (GenAI) because they can indeed generate text. The first popular application of LLMs was chatbots, with ChatGPT leading the way. We then extended their horizon to other tasks such as semantic search and retrieval-augmented generation (RAG). Today, I want to talk about a rising application of LLMs: structuring unstructured data, which I will illustrate with an example of turning raw text into JSON documents.
Using LLMs to structure and extract data is a very promising application. Here’s why:
- Improved accuracy: LLMs understand the nuances of human language. This allows them to identify key information within messy, unstructured text with greater accuracy than traditional rule-based systems.
- Automation potential : Extracting information from unstructured data can be a time-consuming and laborious task. LLMs can automate this process, freeing up human resources for other tasks and allowing for faster analysis of larger datasets.
- Adaptability and learning capabilities: Unlike rule-based systems, LLMs can be continuously fine-tuned and adapted to handle new data sources and information types. As they are exposed to more unstructured data, they can learn and improve their ability to identify patterns and extract relevant information.
- Business outcome: A vast amount of valuable information resides within unstructured textual data sources like emails, customer reviews, social media conversations, and internal documents. However, this data is often difficult to analyze. LLMs can unlock this hidden potential by transforming unstructured data into a structured format. This allows businesses to leverage powerful analytics tools to identify trends and gain insights. Essentially, by structuring unstructured data with LLMs, businesses can transform a liability (unusable data) into an asset (valuable insights) that drives better decision-making and improves overall business outcomes.
An example
Recently, I was searching for an open-source recipe dataset for a personal project, but I could not find any, except for this GitHub repository containing the recipes displayed on publicdomainrecipes.com.
Unfortunately, I needed a dataset that was more exploitable, i.e. something closer to tabular data or to a NoSQL document. That is how I came to look for a way to transform the raw data into something better suited to my needs, without spending hours, days, or weeks doing it manually.
Let me show you how I used the power of Large Language Models to automate the process of converting the raw text into structured documents.
Dataset
The original dataset is a collection of markdown files, each file representing a recipe.
As you can see, the files are not completely unstructured: there is nice tabular metadata at the top of each file, followed by four distinct sections:
- An introduction
- The list of ingredients
- The directions
- Some tips
Based on this observation, Sebastian Bahr developed a parser to transform the markdown files into JSON, available here.
The output of the parser is already more exploitable, and Sebastian used it to build a recipe recommender chatbot. However, there are still some drawbacks: the ingredients and directions keys contain raw text that could be better structured.
As-is, some useful information remains hidden, for example the quantities of the ingredients and the preparation or cooking time of each step, as illustrated below.
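To make this concrete, an entry of the parsed dataset looks roughly like the following (abbreviated, illustrative values; the field names match those used in the notebook later in this article):

{
  "name": "Crêpes",
  "ingredients": "300g of white flour, 3 eggs, 60cl of milk, ...",
  "direction": "Mix flour, eggs and melted butter in a bowl. Slowly add milk and beer ..."
}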
Code
In the remainder of this article, I’ll show the steps that I undertook to get to JSON documents that look like the one below.
{
  "name": "Crêpes",
  "serving_size": 4,
  "ingredients": [
    {
      "id": 1,
      "name": "white flour",
      "quantity": 300.0,
      "unit": "g"
    },
    {
      "id": 2,
      "name": "eggs",
      "quantity": 3.0,
      "unit": "unit"
    },
    {
      "id": 3,
      "name": "milk",
      "quantity": 60.0,
      "unit": "cl"
    },
    {
      "id": 4,
      "name": "beer",
      "quantity": 20.0,
      "unit": "cl"
    },
    {
      "id": 5,
      "name": "butter",
      "quantity": 30.0,
      "unit": "g"
    }
  ],
  "steps": [
    {
      "number": 1,
      "description": "Mix flour, eggs, and melted butter in a bowl.",
      "preparation_time": null,
      "cooking_time": null,
      "used_ingredients": [1, 2, 5]
    },
    {
      "number": 2,
      "description": "Slowly add milk and beer until the dough becomes fluid enough.",
      "preparation_time": 5,
      "cooking_time": null,
      "used_ingredients": [3, 4]
    },
    {
      "number": 3,
      "description": "Let the dough rest for one hour.",
      "preparation_time": 60,
      "cooking_time": null,
      "used_ingredients": []
    },
    {
      "number": 4,
      "description": "Cook the crêpe in a flat pan, one ladle at a time.",
      "preparation_time": 10,
      "cooking_time": null,
      "used_ingredients": []
    }
  ]
}
The code to reproduce the tutorial is on GitHub here.
I relied on two powerful libraries: langchain for communicating with LLM providers, and pydantic to format the output of the LLMs.
First, I defined the two main components of a recipe with the Ingredient and Step classes.
In each class, I defined the relevant attributes and provided a description and examples for each field. Those are then fed to the LLM by langchain, leading to better results.
"""`schemas.py`"""
from pydantic import BaseModel, Field, field_validator


class Ingredient(BaseModel):
    """Ingredient schema"""

    id: int = Field(
        description="Randomly generated unique identifier of the ingredient",
        examples=[1, 2, 3, 4, 5, 6],
    )
    name: str = Field(
        description="The name of the ingredient",
        examples=["flour", "sugar", "salt"],
    )
    quantity: float | None = Field(
        None,
        description="The quantity of the ingredient",
        examples=[200, 4, 0.5, 1, 1, 1],
    )
    unit: str | None = Field(
        None,
        description="The unit in which the quantity is specified",
        examples=["ml", "unit", "l", "unit", "teaspoon", "tablespoon"],
    )

    @field_validator("quantity", mode="before")
    def parse_quantity(cls, value: float | int | str | None):
        """Converts the quantity to a float if it is not already one"""
        if isinstance(value, str):
            try:
                value = float(value)
            except ValueError:
                try:
                    value = eval(value)
                except Exception as e:
                    print(e)
                    pass
        return value


class Step(BaseModel):
    number: int | None = Field(
        None,
        description="The position of the step in the recipe",
        examples=[1, 2, 3, 4, 5, 6],
    )
    description: str = Field(
        description="The action that needs to be performed during that step",
        examples=[
            "Preheat the oven to 180°C",
            "Mix the flour and sugar in a bowl",
            "Add the eggs and mix well",
            "Pour the batter into a greased cake tin",
            "Bake for 30 minutes",
            "Let the cake cool down before serving",
        ],
    )
    preparation_time: int | None = Field(
        None,
        description="The preparation time mentioned in the step description if any.",
        examples=[5, 10, 15, 20, 25, 30],
    )
    cooking_time: int | None = Field(
        None,
        description="The cooking time mentioned in the step description if any.",
        examples=[5, 10, 15, 20, 25, 30],
    )
    used_ingredients: list[int] = Field(
        [],
        description="The list of ingredient ids used in the step",
        examples=[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]],
    )


class Recipe(BaseModel):
    """Recipe schema"""

    name: str = Field(
        description="The name of the recipe",
        examples=[
            "Chocolate Cake",
            "Apple Pie",
            "Pasta Carbonara",
            "Pumpkin Soup",
            "Chili con Carne",
        ],
    )
    serving_size: int | None = Field(
        None,
        description="The number of servings the recipe makes",
        examples=[1, 2, 4, 6, 8, 10],
    )
    ingredients: list[Ingredient] = []
    steps: list[Step] = []
    total_preparation_time: int | None = Field(
        None,
        description="The total preparation time for the recipe",
        examples=[5, 10, 15, 20, 25, 30],
    )
    total_cooking_time: int | None = Field(
        None,
        description="The total cooking time for the recipe",
        examples=[5, 10, 15, 20, 25, 30],
    )
    comments: list[str] = []
Technical Details
- It is important not to make the model too strict here, otherwise the pydantic validation of the JSON output by the LLM will fail. A good way to give some flexibility is to provide default values such as None or empty lists [], depending on the targeted output type.
- Note the field_validator on the quantity attribute of Ingredient: it is there to help the engine parse quantities. It was not there initially, but after some trials I found out that the LLM often provided quantities as strings such as 1/3 or 1/2 (see the short check after this list).
- The used_ingredients field formally links the ingredients to the relevant steps of the recipe.
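A quick way to check the validator's behaviour, assuming this is run next to schemas.py (the values are made up):

from schemas import Ingredient

# The LLM sometimes returns the quantity as a string fraction
ing = Ingredient(id=1, name="butter", quantity="1/2", unit="cup")
print(ing.quantity)  # 0.5, since "1/2" is evaluated and coerced to a float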
With the output model defined, the rest of the process is pretty smooth.
In a prompt.py file, I defined a create_prompt function to easily generate prompts. A "new" prompt is generated for every recipe: all prompts share the same base, and the recipe itself is passed as a variable to the base prompt to create a new one. A minimal sketch of what such a function could look like is shown right below, followed by the base prompt.
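The actual implementation is in the repository; the following is only an illustrative sketch, assuming langchain's PromptTemplate and the format instructions produced by the PydanticOutputParser:

# Illustrative sketch of create_prompt (the real implementation may differ)
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser


def create_prompt(
    base_prompt: str,
    parser: PydanticOutputParser,
    ingredients: str,
    steps: str,
) -> str:
    """Fill the base prompt with one recipe's raw text and the parser's format instructions."""
    prompt_template = PromptTemplate(
        template=base_prompt,
        input_variables=["ingredients", "steps"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )
    return prompt_template.format(ingredients=ingredients, steps=steps)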
""" `prompt.py`
The import statements and the create_prompt function have not been included
in this snippet.
"""
# Note: Extra spaces have been included here for readability.
DEFAULT_BASE_PROMPT = """
What are the ingredients and their associated quantities
as well as the steps to make the recipe described
by the following {ingredients} and {steps} provided as raw text ?
In particular, please provide the following information:
- The name of the recipe
- The serving size
- The ingredients and their associated quantities
- The steps to make the recipe and in particular, the duration of each step
- The total duration of the recipe broken
down into preparation, cooking and waiting time.
The totals must be consistent with the sum of the durations of the steps.
- Any additional comments
{format_instructions}
Make sure to provide a valid and well-formatted JSON.
"""
The logic for communicating with the LLM was defined in the run function of the core.py file, which I won't show here for brevity.
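For reference, a minimal version of such a function could look like the sketch below, assuming the prompt is the ready-to-send string built by create_prompt (the actual core.py may differ):

# Illustrative sketch of core.run (the real implementation may differ)
async def run(llm, prompt, parser):
    """Send the prompt to the chat model and parse the answer into a Recipe object."""
    response = await llm.ainvoke(prompt)   # the chat model returns an AIMessage
    return parser.parse(response.content)  # validate the JSON against the Recipe schema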
Finally, I combined all those components in my demo.ipynb notebook, whose content is shown below.
# demo.ipynb
import os
from pathlib import Path
import pandas as pd
from langchain.output_parsers import PydanticOutputParser
from langchain_mistralai.chat_models import ChatMistralAI
from dotenv import load_dotenv
from core import run
from prompt import DEFAULT_BASE_PROMPT, create_prompt
from schemas import Recipe
# End of first cell
# Setup environment
load_dotenv()
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY") #1
# End of second cell
# Load the data
path_to_data = Path(os.getcwd()) / "data" / "input" #2
df = pd.read_json("data/input/recipes_v1.json")
df.head()
# End of third cell
# Preparing the components of the system
llm = ChatMistralAI(api_key=MISTRAL_API_KEY, model_name="open-mixtral-8x7b")
parser = PydanticOutputParser(pydantic_object=Recipe)
prompt = create_prompt(
DEFAULT_BASE_PROMPT,
parser,
df["ingredients"][0],
df["direction"][0]
)
#prompt
# End of fourth cell
# Combining the components
example = await run(llm, prompt, parser)
#example
# End of fifth cell
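The run call returns a Recipe instance; to inspect it, you can dump it back to JSON, for example (assuming pydantic v2, which the schemas above rely on):

print(example.model_dump_json(indent=2))  # pretty-print the structured recipe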
I used MistralAI as an LLM provider, with their open-mixtral-8x7b model, which is a very good open-source alternative to OpenAI. langchain allows you to easily switch providers, provided you have created an account on the provider's platform.
If you are trying to reproduce the results:
- (#1): Make sure you have a MISTRAL_API_KEY in a .env file or in your OS environment.
- (#2): Be careful with the path to the data. If you clone my repo, this won't be an issue.
Running the code on the entire dataset cost less than 2€.
The structured dataset resulting from this code can be found here in my repository.
I am happy with the results, but I could still iterate on the prompt, my field descriptions, or the model used to improve them. I might try MistralAI's newer model, open-mixtral-8x22b, or try another LLM provider by simply changing two or three lines of code thanks to langchain.
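For illustration, switching models or providers essentially comes down to redefining llm; the OpenAI model name below is only a placeholder:

# Same provider, newer model
llm = ChatMistralAI(api_key=MISTRAL_API_KEY, model_name="open-mixtral-8x22b")

# Or another provider entirely (requires the langchain-openai package and its own API key)
# from langchain_openai import ChatOpenAI
# llm = ChatOpenAI(api_key=OPENAI_API_KEY, model="gpt-4o")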
When I am ready, I can get back to my original project. Stay tuned if you want to know what it was. In the meantime, let me know in the comments what you would do with the final dataset.
Conclusion
Large Language Models (LLMs) offer a powerful tool for structuring unstructured data. Their ability to understand and interpret the nuances of human language, automate laborious tasks, and adapt to evolving data makes them an invaluable resource in data analysis. By unlocking the hidden potential within unstructured textual data, businesses can transform this data into valuable insights, driving better decision-making and business outcomes. The example provided, transforming raw recipe data into a structured format, is just one of the countless possibilities that LLMs offer.
As we continue to explore and develop these models, we can expect to see many more innovative applications in the future. The journey of harnessing the full potential of LLMs is just beginning, and the road ahead promises to be an exciting one.