Tag: AI

  • How to Expose Delta Tables via REST APIs


    René Bremer

    Three architectures discussed and tested to serve delta tables

    Expose data inside out — image by Joshua Sortino on Unsplash

    1. Introduction

    Delta tables in a medallion architecture are generally used to create data products. These data products are used for data science, data analytics, and reporting. However, a common question is whether data products can also be exposed via REST APIs. The idea is to embed these APIs in web apps with stricter performance requirements. Important questions are as follows:

    • Is reading data from delta tables fast enough to serve web apps?
    • Is a compute layer needed to make the solution more scalable?
    • Is a storage layer needed to achieve strict performance requirements?

    To deep-dive into these questions, three architectures are evaluated: architecture A — libraries in API, architecture B — compute layer, and architecture C — storage layer. See also the image below.

    Three architectures to expose delta tables — image by author

    In the remainder of the blog post, the three architectures are described, deployed, and tested, followed by a conclusion.

    2. Architecture description

    2.1 Architecture A: Libraries in API using DuckDB and PyArrow

    In this architecture, APIs connect directly to the delta tables and there is no compute layer in between. This implies that data is analyzed using the memory and compute of the API itself. To improve performance, the Python libraries of the embedded database DuckDB and of PyArrow are used. These libraries make sure that only relevant data is loaded (e.g. only the columns that are needed by the API).

    Architecture A: Libraries in API — image by author

    The pro of this architecture is that data does not have to be duplicated and no layer is needed between the API and the delta tables. This means fewer moving parts.

    The con of this architecture is that it is harder to scale and all of the work needs to be done in the compute and memory of the API itself. This is especially challenging if a lot of data needs to be analyzed, which can come from many records, large columns, and/or many concurrent requests.
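
    To make this concrete, below is a minimal sketch (not code from the accompanying repository) of how an API in architecture A could query the delta table directly with the deltalake and DuckDB Python libraries. The table URI is a placeholder and authentication options are omitted; column names follow the WideWorldImportersDW test table used later in this post.

    # Architecture A sketch: read the delta table inside the API process itself.
    import duckdb
    from deltalake import DeltaTable

    # Placeholder URI; in practice, storage_options with credentials are also needed.
    TABLE_URI = "abfss://<container>@<storageaccount>.dfs.core.windows.net/silver_fact_sale"

    def lookup_sales(city_key: int, salesperson_key: int, customer_key: int):
        # to_pyarrow_dataset() exposes the table lazily, so DuckDB can prune
        # columns and filter rows instead of loading all 50M records into memory.
        dataset = DeltaTable(TABLE_URI).to_pyarrow_dataset()
        con = duckdb.connect()
        con.register("silver_fact_sale", dataset)
        return con.execute(
            """
            SELECT SaleKey, TaxAmount, CityKey, CustomerKey
            FROM silver_fact_sale
            WHERE CityKey = ? AND SalespersonKey = ? AND CustomerKey = ? AND TaxAmount > 20
            """,
            [city_key, salesperson_key, customer_key],
        ).fetch_arrow_table()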

    2.2 Architecture B: Compute layer using Synapse, Databricks, or Fabric

    In this architecture, APIs connect to a compute layer and not directly to the delta tables. This compute layer fetches data from the delta tables and then analyzes it. The compute layer can be Azure Synapse, Azure Databricks, or Microsoft Fabric and typically scales well. The data is not duplicated to the compute layer, though caching can be applied in the compute layer. In the remainder of this blog, Synapse Serverless is used for testing.

    Architecture B: Compute layer — image by author

    The pro of this architecture is that the data does not have to be duplicated and the architecture scales well. Furthermore, it can be used to crunch large datasets.

    The con of this architecture is that an additional layer is needed between the API and the delta tables. This means that more moving parts have to be maintained and secured.
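
    As an illustration (an assumption about the wiring, not the repository's exact code), an API in architecture B could forward its query to a Synapse Serverless SQL endpoint over ODBC and let Synapse read the delta table via OPENROWSET; endpoint, credentials, and storage path are placeholders.

    # Architecture B sketch: push the query down to Synapse Serverless.
    import pyodbc

    CONN_STR = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<workspace>-ondemand.sql.azuresynapse.net;"
        "DATABASE=master;UID=<user>;PWD=<password>"
    )

    QUERY = """
    SELECT SaleKey, TaxAmount, CityKey, CustomerKey
    FROM OPENROWSET(
        BULK 'https://<storageaccount>.dfs.core.windows.net/<container>/silver_fact_sale',
        FORMAT = 'DELTA'
    ) AS sales
    WHERE CityKey = ? AND SalespersonKey = ? AND CustomerKey = ? AND TaxAmount > 20
    """

    def lookup_sales(city_key: int, salesperson_key: int, customer_key: int):
        with pyodbc.connect(CONN_STR) as conn:
            # Synapse Serverless scans the delta files; the API only receives the result set.
            cursor = conn.cursor()
            cursor.execute(QUERY, city_key, salesperson_key, customer_key)
            return cursor.fetchall()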

    2.3 Architecture C: Optimized storage layer using Azure SQL or Cosmos DB

    In this architecture, APIs do not connect to the delta tables, but to a separate storage layer into which the delta tables are duplicated. This storage layer can be Azure SQL or Cosmos DB and can be optimized for fast retrieval of data. In the remainder of this blog, Azure SQL is used for testing.

    Architecture C: Optimized storage layer — image by author

    The pro of this architecture is that the storage layer can be optimized to read data fast using indexes, partitioning, and materialized views. This is typically a requirement in request-response web app scenarios.

    The con of this architecture is that data needs to be duplicated and an additional layer is needed between the API and the delta tables. This means that more moving parts need to be maintained and secured.
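
    For comparison, a minimal sketch of architecture C (again an assumption, with placeholder names): the API queries an Azure SQL copy of the table, which can be indexed for fast look-ups.

    # Architecture C sketch: query the duplicated table in Azure SQL.
    import pyodbc

    CONN_STR = (
        "DRIVER={ODBC Driver 18 for SQL Server};"
        "SERVER=<server>.database.windows.net;"
        "DATABASE=<database>;UID=<user>;PWD=<password>"
    )

    def lookup_sales(city_key: int, salesperson_key: int, customer_key: int):
        with pyodbc.connect(CONN_STR) as conn:
            cursor = conn.cursor()
            cursor.execute(
                """
                SELECT SaleKey, TaxAmount, CityKey, CustomerKey
                FROM dbo.silver_fact_sale
                WHERE CityKey = ? AND SalespersonKey = ? AND CustomerKey = ? AND TaxAmount > 20
                """,
                city_key, salesperson_key, customer_key,
            )
            return cursor.fetchall()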

    In the remainder of the blog the architectures are deployed and tested.

    3. Deployment and testing of architectures

    3.1 Deploying architectures

    To deploy the architectures, a GitHub project is created that deploys the three solutions discussed in the previous chapter. The project can be found at the link below:

    https://github.com/rebremer/expose-deltatable-via-restapi

    The following will be deployed when the GitHub project is executed:

    • A delta table originating from the standard test dataset WideWorldImportersDW Full. The test dataset consists of 50M records and 22 columns, including one large description column.
    • All architectures: an Azure Function acting as the API (a minimal trigger sketch follows this list).
    • Architecture B: Synapse Serverless acting as compute layer.
    • Architecture C: Azure SQL acting as optimized storage layer.
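
    As a rough illustration, the HTTP-triggered Azure Function could look like the sketch below (Python v1 programming model). The query_sales helper is hypothetical and stands in for whichever of the three back ends (DuckDB, Synapse Serverless, or Azure SQL) is used; it is not the repository's code.

    # __init__.py of an HTTP-triggered Azure Function (sketch).
    import json

    import azure.functions as func

    from backend import query_sales  # hypothetical helper wrapping one of the back ends

    def main(req: func.HttpRequest) -> func.HttpResponse:
        city_key = int(req.params.get("city_key", "41749"))
        rows = query_sales(city_key)  # assumed to return JSON-serializable rows
        return func.HttpResponse(
            json.dumps(rows), mimetype="application/json", status_code=200
        )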

    Once deployed, tests can be executed. The tests are described in the next paragraph.

    3.2 Testing architectures

    To test the architectures, different types of queries and different scaling will be applied. The different types of queries can be described as follows:

    • Look up of 20 records with 11 small columns (char, integer, datetime).
    • Look up of 20 records with 2 columns including a large description column that contains more than 500 characters per field.
    • Aggregation of data using group by, having, max, average.

    The queries are depicted below.

    -- Query 1: Point look up 11 columns without large texts
    SELECT SaleKey, TaxAmount, CityKey, CustomerKey, BillToCustomerKey, SalespersonKey, DeliveryDateKey, Package
    FROM silver_fact_sale
    WHERE CityKey=41749 and SalespersonKey=40 and CustomerKey=397 and TaxAmount > 20
    -- Query 2: Description column with more than 500 characters
    SELECT SaleKey, Description
    FROM silver_fact_sale
    WHERE CityKey=41749 and SalespersonKey=40 and CustomerKey=397 and TaxAmount > 20
    -- Query 3: Aggregation
    SELECT MAX(DeliveryDateKey), CityKey, AVG(TaxAmount)
    FROM silver_fact_sale
    GROUP BY CityKey
    HAVING COUNT(CityKey) > 10

    The scaling can be described as follows:

    • For architecture A, the data processing is done in the API itself. This means that the compute and memory of the API are used via its app service plan. The API will be tested with both SKU Basic B1 (1 core and 1.75 GB memory) and SKU P1V3 (2 cores, 8 GB memory). For architectures B and C, this is not relevant, since the processing is done elsewhere.
    • For architecture B, Synapse Serverless is used. Scaling will be done automatically.
    • For architecture C, an Azure SQL database of the standard tier with 125 DTUs is used. It will be tested both without an index and with an index on CityKey (a sketch of the index creation follows this list).
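
    The index for the "with index" test runs could be created once against the Azure SQL copy of the table, for example as sketched below (object names and connection string are assumptions):

    # One-off index creation for architecture C (sketch).
    import pyodbc

    CONN_STR = "DRIVER={ODBC Driver 18 for SQL Server};SERVER=<server>.database.windows.net;DATABASE=<database>;UID=<user>;PWD=<password>"

    with pyodbc.connect(CONN_STR) as conn:
        conn.execute(
            "CREATE NONCLUSTERED INDEX IX_silver_fact_sale_CityKey "
            "ON dbo.silver_fact_sale (CityKey) "
            "INCLUDE (SalespersonKey, CustomerKey, TaxAmount, SaleKey)"
        )
        conn.commit()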

    In the next paragraph the results are described.

    3.3 Results

    After deploying and testing the architectures, the results can be obtained. This is a summary of the results:

    Test results summary

    Architecture A cannot be deployed with SKU B1. When SKU P1V3 is used, results can be calculated within 15 seconds, provided the column size is not too big. Notice that all data is analyzed in the app service plan of the API. If too much data is loaded (either via many rows, large columns, and/or many concurrent requests), this architecture is hard to scale.

    Architecture B using Synapse Serverless performs within 10–15 seconds. The compute is done on Synapse Serverless which is scaled automatically to fetch and analyze the data. Performance is consistent for all three types of queries.

    Architecture C using Azure SQL performs best when indexes are created. For look-up queries 1 and 2, the API responds in around 1 second. Query 3 requires a full table scan, and there the performance is roughly equal to that of the other solutions.

    4. Conclusion

    Delta tables in a medallion architecture are generally used to create data products. These data products are used for data science, data analytics, and reporting. However, a common question is whether delta tables can also be exposed via REST APIs. In this blog post, three architectures were described with their pros and cons.

    Architecture A: Libraries in API using DuckDB and PyArrow.
    In this architecture, APIs connect directly to the delta tables and there is no layer in between. This implies that all data is analyzed in the memory and compute of the Azure Function.

    • The pro of this architecture is that no additional resources are needed. This means fewer moving parts that need to be maintained and secured.
    • The con of this architecture is that it does not scale well, since all data needs to be analyzed in the API itself. Therefore, it should only be used for small amounts of data.

    Architecture B: Compute layer using Synapse, Databricks or Fabric.
    In this architecture, APIs connect to a compute layer. This compute layer fetches and analyzes data from the delta tables.

    • The pro of this architecture is that it scales well and data is not duplicated. It works well for queries that do aggregations and crunch large datasets.
    • The con of this architecture is that it is not possible to consistently get responses within 5 seconds for look-up queries. Also, additional resources need to be secured and maintained.

    Architecture C: Optimized storage layer using Azure SQL or Cosmos DB.
    In this architecture, APIs connect to an optimized storage layer. Delta tables are duplicated to this storage layer in advance, and the storage layer is used to fetch and analyze the data.

    • The pro of this architecture is that it can be optimized for fast look-up queries using indexes, partitioning, and materialized views. This is often a requirement for request-response web apps.
    • The con of this architecture is that data is duplicated to a different storage layer, which needs to be kept in sync. Also, additional resources need to be secured and maintained.

    Unfortunately, there is no silver bullet solution. This article aimed to give guidance in choosing the best architecture to expose delta tables via REST APIs.


    How to Expose Delta Tables via REST APIs was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • Don’t Let the Python dir() Function Trick You!

    Marcin Kozak

    The dir() function returns an object’s attributes. Unfortunately, it doesn’t reveal everything — discover how to find all attributes…


  • Evaluation of generative AI techniques for clinical report summarization


    Ekta Walia Bhullar

    In this post, we provide a comparison of results obtained by two prompt engineering techniques: zero-shot and few-shot prompting. We also explore the utility of the RAG prompt engineering technique as it applies to the task of summarization. Evaluating LLMs is an undervalued part of the machine learning (ML) pipeline.
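
    As a rough illustration of the difference between the two prompting styles (not taken from the original post), a zero-shot prompt asks for the summary directly, while a few-shot prompt prepends a couple of worked examples:

    # Hypothetical prompt templates for clinical report summarization.
    ZERO_SHOT = "Summarize the key findings of the following radiology report:\n{report}"

    FEW_SHOT = (
        "Report: <example report 1>\nSummary: <example summary 1>\n\n"
        "Report: <example report 2>\nSummary: <example summary 2>\n\n"
        "Report: {report}\nSummary:"
    )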


  • The (lesser known) rising application of LLMs

    Vianney Mixtur

    Overview

    Large Language Models (LLMs) are often described as Generative Artificial Intelligence (GenAI), as they indeed have the ability to generate text. The first popular application of LLMs was chatbots, with ChatGPT leading the way. Then we extended their horizon to other tasks such as semantic search and retrieval-augmented generation (RAG). Today, I want to talk about a rising application of LLMs: structuring unstructured data. I am going to show you an example of this by structuring raw text into JSON data.

    Using LLMs for data structuring and extraction is a very promising application with a lot of potential. Here’s why:

    • Improved accuracy: LLMs understand the nuances of human language. This allows them to identify key information within messy, unstructured text with greater accuracy than traditional rule-based systems.
    • Automation potential: Extracting information from unstructured data can be a time-consuming and laborious task. LLMs can automate this process, freeing up human resources for other tasks and allowing for faster analysis of larger datasets.
    • Adaptability and learning capabilities: Unlike rule-based systems, LLMs can be continuously fine-tuned and adapted to handle new data sources and information types. As they are exposed to more unstructured data, they can learn and improve their ability to identify patterns and extract relevant information.
    • Business outcome: A vast amount of valuable information resides within unstructured textual data sources like emails, customer reviews, social media conversations, and internal documents. However, this data is often difficult to analyze. LLMs can unlock this hidden potential by transforming unstructured data into a structured format. This allows businesses to leverage powerful analytics tools to identify trends and gain insights. Essentially, by structuring unstructured data with LLMs, businesses can transform a liability (unusable data) into an asset (valuable insights) that drives better decision-making and improves overall business outcomes.

    An example

    Recently, I was searching for an open-source recipes dataset for a personal project, but I could not find any, except for this GitHub repository containing the recipes displayed on publicdomainrecipes.com.

    Photo by Jeff Sheldon on Unsplash

    Unfortunately, I needed a dataset that was more exploitable, i.e. something closer to tabular data or to a NoSQL document. That’s how I came to look for a way to transform the raw data into something more suitable to my needs, without spending hours, days, or weeks doing it manually.

    Let me show you how I used the power of Large Language Models to automate the process of converting the raw text into structured documents.

    Dataset

    The original dataset is a collection of markdown files. Each file represents a recipe.

    As you can see, this is not completely unstructured: there is nice tabular metadata at the top of the file, followed by four distinct sections:

    • An introduction
    • The list of ingredients
    • Directions
    • Some tips

    Based on this observation, Sebastian Bahr developed a parser to transform the markdown files into JSON, available here.

    The output of the parser is already more exploitable; Sebastian even used it to build a recipe recommender chatbot. However, there are still some drawbacks. The ingredients and directions keys contain raw text that could be better structured. As-is, some useful information is hidden, for example the quantities of the ingredients and the preparation or cooking time of each step.
    Code

    In the remainder of this article, I’ll show the steps that I undertook to get to JSON documents that look like the one below.

    {
      "name": "Crêpes",
      "serving_size": 4,
      "ingredients": [
        {"id": 1, "name": "white flour", "quantity": 300.0, "unit": "g"},
        {"id": 2, "name": "eggs", "quantity": 3.0, "unit": "unit"},
        {"id": 3, "name": "milk", "quantity": 60.0, "unit": "cl"},
        {"id": 4, "name": "beer", "quantity": 20.0, "unit": "cl"},
        {"id": 5, "name": "butter", "quantity": 30.0, "unit": "g"}
      ],
      "steps": [
        {
          "number": 1,
          "description": "Mix flour, eggs, and melted butter in a bowl.",
          "preparation_time": null,
          "cooking_time": null,
          "used_ingredients": [1, 2, 5]
        },
        {
          "number": 2,
          "description": "Slowly add milk and beer until the dough becomes fluid enough.",
          "preparation_time": 5,
          "cooking_time": null,
          "used_ingredients": [3, 4]
        },
        {
          "number": 3,
          "description": "Let the dough rest for one hour.",
          "preparation_time": 60,
          "cooking_time": null,
          "used_ingredients": []
        },
        {
          "number": 4,
          "description": "Cook the crêpe in a flat pan, one ladle at a time.",
          "preparation_time": 10,
          "cooking_time": null,
          "used_ingredients": []
        }
      ]
    }

    The code to reproduce the tutorial is on GitHub here.

    I relied on two powerful libraries: langchain for communicating with LLM providers and pydantic to format the output of the LLMs.

    First, I defined the two main components of a recipe with the Ingredient and Step classes.

    In each class, I defined the relevant attributes and provided a description of each field along with examples. These are then fed to the LLM by langchain, leading to better results.

    """`schemas.py`"""

    from pydantic import BaseModel, Field, field_validator


    class Ingredient(BaseModel):
        """Ingredient schema"""

        id: int = Field(
            description="Randomly generated unique identifier of the ingredient",
            examples=[1, 2, 3, 4, 5, 6],
        )
        name: str = Field(
            description="The name of the ingredient",
            examples=["flour", "sugar", "salt"],
        )
        quantity: float | None = Field(
            None,
            description="The quantity of the ingredient",
            examples=[200, 4, 0.5, 1, 1, 1],
        )
        unit: str | None = Field(
            None,
            description="The unit in which the quantity is specified",
            examples=["ml", "unit", "l", "unit", "teaspoon", "tablespoon"],
        )

        @field_validator("quantity", mode="before")
        def parse_quantity(cls, value: float | int | str | None):
            """Converts the quantity to a float if it is not already one"""
            if isinstance(value, str):
                try:
                    value = float(value)
                except ValueError:
                    try:
                        value = eval(value)
                    except Exception as e:
                        print(e)
            return value


    class Step(BaseModel):
        number: int | None = Field(
            None,
            description="The position of the step in the recipe",
            examples=[1, 2, 3, 4, 5, 6],
        )
        description: str = Field(
            description="The action that needs to be performed during that step",
            examples=[
                "Preheat the oven to 180°C",
                "Mix the flour and sugar in a bowl",
                "Add the eggs and mix well",
                "Pour the batter into a greased cake tin",
                "Bake for 30 minutes",
                "Let the cake cool down before serving",
            ],
        )
        preparation_time: int | None = Field(
            None,
            description="The preparation time mentioned in the step description if any.",
            examples=[5, 10, 15, 20, 25, 30],
        )
        cooking_time: int | None = Field(
            None,
            description="The cooking time mentioned in the step description if any.",
            examples=[5, 10, 15, 20, 25, 30],
        )
        used_ingredients: list[int] = Field(
            [],
            description="The list of ingredient ids used in the step",
            examples=[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12]],
        )


    class Recipe(BaseModel):
        """Recipe schema"""

        name: str = Field(
            description="The name of the recipe",
            examples=[
                "Chocolate Cake",
                "Apple Pie",
                "Pasta Carbonara",
                "Pumpkin Soup",
                "Chili con Carne",
            ],
        )
        serving_size: int | None = Field(
            None,
            description="The number of servings the recipe makes",
            examples=[1, 2, 4, 6, 8, 10],
        )
        ingredients: list[Ingredient] = []
        steps: list[Step] = []
        total_preparation_time: int | None = Field(
            None,
            description="The total preparation time for the recipe",
            examples=[5, 10, 15, 20, 25, 30],
        )
        total_cooking_time: int | None = Field(
            None,
            description="The total cooking time for the recipe",
            examples=[5, 10, 15, 20, 25, 30],
        )
        comments: list[str] = []

    Technical Details

    • It is important not to have a model that is too strict here; otherwise, the pydantic validation of the JSON outputted by the LLM will fail. A good way to give some flexibility is to provide default values like None or empty lists [], depending on the targeted output type.
    • Note the field_validator on the quantity attribute of Ingredient, which is there to help the engine parse quantities (a quick check is shown after this list). It was not there initially, but through some trials I found out that the LLM was often providing quantities as strings such as 1/3 or 1/2.
    • The used_ingredients field formally links the ingredients to the relevant steps of the recipe.
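
    As a quick check of the quantity coercion described above (hypothetical values, not taken from the dataset):

    # The "before" validator turns fraction strings into floats.
    from schemas import Ingredient

    ing = Ingredient(id=1, name="butter", quantity="1/2", unit="cup")
    print(ing.quantity)  # 0.5 -- float("1/2") fails, so the eval fallback is used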

    With the output model defined, the rest of the process is pretty smooth.

    In a prompt.py file, I defined a create_prompt function to easily generate prompts. A “new” prompt is generated for every recipe. All prompts share the same base, but the recipe itself is passed as a variable to the base prompt to create a new one.

    """ `prompt.py`

    The import statements and the create_prompt function have not been included
    in this snippet.
    """
    # Note : Extra spaces have been included here for readability.

    DEFAULT_BASE_PROMPT = """
    What are the ingredients and their associated quantities
    as well as the steps to make the recipe described
    by the following {ingredients} and {steps} provided as raw text ?

    In particular, please provide the following information:
    - The name of the recipe
    - The serving size
    - The ingredients and their associated quantities
    - The steps to make the recipe and in particular, the duration of each step
    - The total duration of the recipe broken
    down into preparation, cooking and waiting time.
    The totals must be consistent with the sum of the durations of the steps.
    - Any additional comments

    {format_instructions}
    Make sure to provide a valid and well-formatted JSON.

    """

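    The create_prompt function itself lives in the repository; a plausible minimal version (an assumption, shown here only to make the flow concrete) would wrap langchain's PromptTemplate and inject the parser's format instructions:

    # Hypothetical sketch of create_prompt; see the repository for the real implementation.
    from langchain_core.prompts import PromptTemplate

    def create_prompt(base_prompt: str, parser, ingredients: str, steps: str) -> str:
        template = PromptTemplate(
            template=base_prompt,
            input_variables=["ingredients", "steps"],
            partial_variables={"format_instructions": parser.get_format_instructions()},
        )
        return template.format(ingredients=ingredients, steps=steps)
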
    The communication logic with the LLM was defined in the run function of the core.py file, which I won’t show here for brevity.
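
    A minimal version of run (again an assumption about the repository's code, not a copy of it) could simply send the formatted prompt to the chat model asynchronously and pass the reply through the parser:

    # core.py -- hypothetical minimal run()
    async def run(llm, prompt, parser):
        response = await llm.ainvoke(prompt)   # call the chat model with the formatted prompt
        return parser.parse(response.content)  # validate the JSON against the Recipe schema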

    Finally, I combined all those components in my demo.ipynb notebook, whose content is shown below.

    # demo.ipynb
    import os
    from pathlib import Path

    import pandas as pd
    from langchain.output_parsers import PydanticOutputParser
    from langchain_mistralai.chat_models import ChatMistralAI
    from dotenv import load_dotenv

    from core import run
    from prompt import DEFAULT_BASE_PROMPT, create_prompt
    from schemas import Recipe
    # End of first cell

    # Setup environment
    load_dotenv()
    MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY") #1
    # End of second cell

    # Load the data
    path_to_data = Path(os.getcwd()) / "data" / "input" #2
    df = pd.read_json("data/input/recipes_v1.json")
    df.head()
    # End of third cell

    # Preparing the components of the system
    llm = ChatMistralAI(api_key=MISTRAL_API_KEY, model_name="open-mixtral-8x7b")
    parser = PydanticOutputParser(pydantic_object=Recipe)
    prompt = create_prompt(
        DEFAULT_BASE_PROMPT,
        parser,
        df["ingredients"][0],
        df["direction"][0],
    )
    #prompt
    # End of fourth cell

    # Combining the components
    example = await run(llm, prompt, parser)
    #example
    # End of fifth cell

    I used MistralAI as an LLM provider, with their open-mixtral-8x7b model, which is a very good open-source alternative to OpenAI. langchain allows you to easily switch providers, provided you have created an account on the provider’s platform.

    If you are trying to reproduce the results:

    • (#1) — Make sure you have a MISTRAL_API_KEY in a .env file or in your OS environment.
    • (#2) — Be careful with the path to the data. If you clone my repo, this won’t be an issue.

    Running the code on the entire dataset cost less than 2€.
    The structured dataset resulting from this code can be found here in my repository.
    I am happy with the results, but I could still iterate on the prompt, my field descriptions, or the model used to improve them. I might try MistralAI’s newer model, open-mixtral-8x22b, or try another LLM provider by simply changing two or three lines of code thanks to langchain, as sketched below.
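
    For example, a hypothetical swap to OpenAI (not part of the original project) would only change the chat-model import, the API key, and the model name:

    # Hypothetical provider swap; the prompt, parser, and run() stay the same.
    import os

    from langchain_openai import ChatOpenAI

    llm = ChatOpenAI(api_key=os.getenv("OPENAI_API_KEY"), model="gpt-4o-mini")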

    When I am ready, I can get back to my original project. Stay tuned if you want to know what it was. In the meantime, let me know in the comments what you would do with the final dataset.

    Conclusion

    Large Language Models (LLMs) offer a powerful tool for structuring unstructured data. Their ability to understand and interpret human language nuances, automate laborious tasks, and adapt to evolving data make them an invaluable resource in data analysis. By unlocking the hidden potential within unstructured textual data, businesses can transform this data into valuable insights, driving better decision-making and business outcomes. The example provided, of transforming raw recipes data into a structured format, is just one of the countless possibilities that LLMs offer.

    As we continue to explore and develop these models, we can expect to see many more innovative applications in the future. The journey of harnessing the full potential of LLMs is just beginning, and the road ahead promises to be an exciting one.


    The (lesser known) rising application of LLMs was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
