OpenAI’s Sora video generation AI model arrives globally later today

Igor Bonifacic

Following an early preview at the start of the year, Sora, OpenAI’s long-awaited video generation model, is ready for public use. If you’re a ChatGPT Plus or Pro subscriber in the US or “most other countries” where the chatbot is available, you can begin experimenting with the tool starting later today, OpenAI announced on Monday. The product runs on a more powerful model than the one OpenAI showed off in February. That model, Sora Turbo, is significantly faster, according to the company, though OpenAI cautions that it still has limitations. “It often generates unrealistic physics and struggles with complex actions over long durations,” says the company.

When users first visit the dedicated landing page OpenAI has set up for Sora, they’ll be greeted with a feed of videos the model has created for other people. By clicking on a video, you’ll be able to see the exact prompt someone gave Sora to generate the footage you see. From here, you can also decide to re-cut a video, blend it into a clip you’re working on, or remix it. In this initial release, OpenAI is limiting Sora to generating videos that are up to 1080p and 20 seconds long.

ChatGPT Plus subscribers can use Sora to create up to 50 videos at 480p per month. Alternatively, Plus users can generate fewer (and shorter) videos at 720p. OpenAI says the Pro plan affords 10 times as much usage, at higher resolutions and longer durations. “We’re working on tailored pricing for different types of users, which we plan to make available early next year,” the company adds.

For safety purposes, each video features a visible watermark by default and contains C2PA metadata to assist with identification. OpenAI says it will block users from using Sora to create child sexual abuse materials (CSAM) and sexual deepfakes. More broadly, the company plans to limit uploads of people until it has time to refine its safeguards against deepfakes.

Even if you don’t have a ChatGPT subscription, you can still visit the Sora website to see what other people are using the tool to create. During today’s livestream, OpenAI CEO Sam Altman said it may take some time before Sora arrives in Europe and the UK.

This article originally appeared on Engadget at https://www.engadget.com/ai/openais-sora-video-generation-ai-model-arrives-globally-later-today-182613208.html?src=rss


December 9, 2024
Everything you need to know about Micron’s “game-changer” 6550 ION SSD

The powerful SSD is designed specifically to contend with hefty AI workloads


December 9, 2024
Ten months after first tease, OpenAI launches Sora video generation publicly

Benj Edwards and Kyle Orland

It’s a big launch, but AI video-synthesis competition has heated up over the past 10 months.


December 9, 2024
Why Data Scientists Need These Software Engineering Skills

Egor Howell

Learn these things to become a more well-rounded data scientist


December 9, 2024
A Beginner’s Journey into Key Mathematical Concepts: Applied Data Analysis Simplified

Sarah Lea

Understanding key concepts such as Monte Carlo Methods, Bayes’ Theorem or Gradient Descent can be overwhelming for beginners…


December 9, 2024
Accelerating ML experimentation with enhanced security: AWS PrivateLink support for Amazon SageMaker with MLflow

Xiaoyu Xing

With access to a wide range of generative AI foundation models (FM) and the ability to build and train their own machine learning (ML) models in Amazon SageMaker, users want a seamless and secure way to experiment with and select the models that deliver the most value for their business. In the initial stages of an ML […]


December 9, 2024
LLMs for Coding in 2024: Price, Performance, and the Battle for the Best
Ruben Broekx
Evaluating the current LLM landscape based on both benchmarks and real-world insights to help you make informed choices.

Image generated by Flux.1 – Schnell

The landscape of Large Language Models (LLMs) for coding has never been more competitive. With major players like Alibaba, Anthropic, Google, Meta, Mistral, OpenAI, and xAI all offering their own models, developers have more options than ever before.

But how can you choose the best LLM for your coding use case?

In this post, I provide an in-depth analysis of the top LLMs available through public APIs. I focus on their performance in coding tasks as measured by benchmarks like HumanEval, and their observed real-world performance as reflected by their respective Elo scores.

Whether you’re working on a personal project or integrating AI into your development workflow, understanding the strengths and weaknesses of these models will help you make a more informed decision.

Disclaimer: challenges when comparing LLMs

Comparing LLMs is hard. Models frequently receive updates that significantly influence their performance, for example OpenAI’s progression from GPT-4 to GPT-4-turbo to GPT-4o to the o1 models. Even minor updates have an effect: GPT-4o, for example, has already received three updates since its release on May 13th!

Additionally, the stochastic nature of these models means their performance can vary across different runs, leading to inconsistent results in studies. Finally, some companies may tailor benchmarks and configurations (such as specific Chain-of-Thought techniques) to showcase their models in the best light, which skews comparisons and leads to misleading conclusions.

Conclusion: comparing LLM performance is hard.

This post represents a best-effort comparison of various models for coding tasks based on the information available. I welcome any feedback to improve the accuracy of this analysis!

Evaluating LLMs: HumanEval and Elo scores

As hinted at in the disclaimer above, to properly understand how LLMs perform in coding tasks, it’s advisable to evaluate them from multiple perspectives.

Benchmarking through HumanEval

Initially, I tried to aggregate results from several benchmarks to see which model comes out on top. However, this approach had a core problem: different models use different benchmarks and configurations. Only one benchmark seemed to be the default for evaluating coding performance: HumanEval. This is a benchmark dataset consisting of human-written coding problems, evaluating a model’s ability to generate correct and functional code based on specified requirements. By assessing code completion and problem-solving skills, HumanEval serves as a standard measure for coding proficiency in LLMs.
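HumanEval results are commonly reported as pass@k: the probability that at least one of k sampled completions passes a problem's unit tests. The sketch below implements the usual unbiased estimator for a single problem. The sample counts in the last line are made-up illustrative values, not results from this post.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled, c: completions that passed the tests,
    k: attempt budget being scored (1 <= k <= n).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset holds a passing sample.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 10 samples, 3 of which pass, scored at k=1.
print(pass_at_k(10, 3, 1))
```

A benchmark score is then the mean of this value over all problems in the dataset.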

The voice of the people through Elo scores

While benchmarks give a good view of a model’s performance, they should also be taken with a grain of salt. Given the vast amounts of data LLMs are trained on, some of a benchmark’s content (or highly similar content) might be part of that training. That’s why it’s beneficial to also evaluate models based on how well they perform as judged by humans. Elo ratings, such as those from Chatbot Arena (coding only), do just that. These are scores derived from head-to-head comparisons of LLMs in coding tasks, evaluated by human judges. Models are pitted against each other, and their Elo scores are adjusted based on wins and losses in these pairwise matches. An Elo score shows a model’s relative performance compared to others in the pool, with higher scores indicating better performance. For example, a difference of 100 Elo points suggests that the higher-rated model is expected to win about 64% of the time against the lower-rated model.
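The win rates quoted in this post follow from the standard Elo expected-score formula, which can be sketched in a few lines. The update rule and its K-factor of 32 are the classic chess-style defaults and are an assumption here; the leaderboard's actual fitting procedure may differ.

```python
def elo_win_probability(rating_diff: float) -> float:
    """Expected win rate of a model that leads its opponent by rating_diff points."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Classic Elo update after one match.

    score_a is 1.0 if A won, 0.5 for a tie, 0.0 if A lost.
    k is an assumed K-factor, not a value taken from the post.
    """
    expected_a = elo_win_probability(r_a - r_b)
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta


print(f"{elo_win_probability(100):.1%}")  # 64.0%
print(f"{elo_win_probability(46):.1%}")   # 56.6%
```

The first line reproduces the 100-point example above; the second matches the 46-point gap between the two leading models discussed in the next section.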

Current state of model performance

Now, let’s examine how these models perform when we compare their HumanEval scores with their Elo ratings. The following image illustrates the current coding landscape for LLMs, where the models are clustered by the companies that created them. Each company’s best performing model is annotated.

Figure 1: Elo score by HumanEval — colored by company. X- and y-axis ticks show all models released by each company, with the best performing model shown in bold.

OpenAI’s models are at the top of both metrics, demonstrating their superior capability in solving coding tasks. The top OpenAI model outperforms the best non-OpenAI model, Anthropic’s Claude Sonnet 3.5, by 46 Elo points (an expected win rate of 56.6% in head-to-head coding tasks) and by 3.9% in HumanEval. While this difference isn’t overwhelming, it shows that OpenAI still has the edge. Interestingly, the best model is o1-mini, which scores higher than the larger o1 by 10 Elo points and 2.5% in HumanEval.

Conclusion: OpenAI continues to dominate, positioning themselves at the top in benchmark performance and real-world usage. Remarkably, o1-mini is the best performing model, outperforming its larger counterpart o1.

Other companies follow closely behind and seem to exist within the same “performance ballpark”. To provide a clearer sense of the difference in model performance, the following figure shows the win probabilities of each company’s best model — as indicated by their Elo rating.

Figure 2: Win probability of each company’s best (coding) model — as illustrated by the Elo ratings’ head-to-head battle win probabilities.

Mismatch between benchmark results and real-world performance

From Figure 1, one thing that stands out is the misalignment between HumanEval (benchmark) and the Elo scores (real-world performance). Some models — like Mistral’s Mistral Large — have significantly better HumanEval scores relative to their Elo rating. Other models — like Google’s Gemini 1.5 Pro — have significantly better Elo ratings relative to the HumanEval score they obtain.

It’s hard to know when to trust benchmarks, as the benchmark data might as well be included in the model’s training dataset. This can lead to (overfitted) models that memorize and repeat the answer to a coding question, rather than understand and actually solve the problem.

Similarly, it’s also problematic to take the Elo ratings as ground truth, given that these scores come from a crowdsourcing effort. Doing so adds a human bias to the scoring, favoring models that output in a specific style or take a specific approach over others, which does not always align with a factually better model.

Conclusion: better benchmark results don’t always reflect better real-world performance. It’s advised to look at both independently.

The following image shows the disagreement between HumanEval and Elo scores. All models are sorted based on their respective scores, ignoring “how much better” one model is compared to another for simplicity. It shows visually which models perform better on benchmarks than in real life and vice-versa.

Figure 3: Misalignment in HumanEval and Elo scores — colored by company. Scores are transformed to ranks for simplicity, going from worst (left) to best (right) on each metric respectively.

Figure 4 further highlights the difference between benchmarking and real-world performance by simplifying the comparison even further. Here, the figure shows the relative difference in rank, indicating when a model is likely overfitting the benchmark or performs better than reported. Some interesting conclusions can be drawn here:
- Overfitting on benchmark: Alibaba and Mistral both stick out for systematically creating models that perform better on benchmarks than in real life. Their most recent models, Alibaba’s Qwen 2.5 Coder (-20.0%) and Mistral’s Mistral Large (-11.5%) follow this pattern, too.
- Better than reported: Google stands out for producing models that perform significantly better than reported, with its newest Gemini 1.5 Pro model on top with a difference of +31.5%. Their focus on “honest training and evaluation” is evident in their model reporting and the decision to develop their own Natural2Code benchmark instead of using HumanEval. “Natural2Code is a code generation benchmark across Python, Java, C++, JS, Go. Held out dataset HumanEval-like, not leaked on the web” ~ Google in the Gemini 1.5 release.
- Well balanced: It’s striking how consistently Meta nails the balance between benchmark and real-world performance. Of course, given that the figure displays rank over score, this stability also depends on the performance of other models.

Figure 4: Performance difference going from HumanEval to Elo scores — colored by company. Negative scores indicate better HumanEval than Elo (overfitting on benchmark), whereas positive scores indicate better Elo than HumanEval (better performing than reported).
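One way to compute such a rank-based difference is sketched below. The post does not spell out its exact normalization, so dividing the rank shift by the largest possible shift is an assumption, and the model names and scores are placeholders rather than data from the figures.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Rank models from worst (0) to best (n - 1) by score."""
    ordered = sorted(scores, key=scores.get)
    return {name: position for position, name in enumerate(ordered)}


def relative_rank_shift(humaneval: dict[str, float], elo: dict[str, float]) -> dict[str, float]:
    """Percent rank change going from HumanEval to Elo.

    Negative: ranks better on the benchmark than in human voting.
    Positive: ranks better in human voting than on the benchmark.
    The normalization by (n - 1) is an assumption of this sketch.
    """
    common = sorted(set(humaneval) & set(elo))
    he = ranks({m: humaneval[m] for m in common})
    el = ranks({m: elo[m] for m in common})
    span = max(len(common) - 1, 1)
    return {m: 100.0 * (el[m] - he[m]) / span for m in common}


# Placeholder scores for three hypothetical models.
humaneval_scores = {"model_a": 90.0, "model_b": 85.0, "model_c": 70.0}
elo_scores = {"model_a": 1200.0, "model_b": 1250.0, "model_c": 1100.0}
print(relative_rank_shift(humaneval_scores, elo_scores))
```

Because only ranks are compared, a model's value changes whenever other models are added to or removed from the pool.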

Conclusion: Alibaba and Mistral tend to create models that overfit on the benchmark data.

Conclusion: Google’s models are underrated in benchmark results, due to their focus on fair training and evaluation.

Balancing performance and price: the models that provide the best bang for buck

When choosing an LLM as your coding companion, performance isn’t the only factor to consider. Another important dimension to consider is price. This section re-evaluates the different LLMs and compares how well they fare when evaluated on performance — as indicated by their Elo rating — and price.

Before starting the comparison, it’s worth noting the odd one out: Meta. Meta’s Llama models are open-source and not hosted by Meta themselves. However, given their popularity, I include them. The price attached to these models is the best pay-as-you-go price offered by the big three cloud vendors (Google, Microsoft, Amazon), which usually comes down to AWS’s price.

Figure 5 compares the different models and shows the Pareto front. Elo ratings are used to represent model performance; this seemed the best choice given that they are assigned by humans and don’t include an overfitting bias. The price is the pay-as-you-go API price, displayed as the average of input- and output-token cost for a total of one million tokens.

Figure 5: Model coding performance (Elo rating) by API price — colored by company. The models that make up the Pareto front are annotated.
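The Pareto front in a price-versus-performance plot can be computed as sketched below: a model is on the front if no other model is at least as cheap and at least as good while being strictly better on one of the two. The `Model` type, the model names, and all prices and ratings are hypothetical placeholders, not values from the figure.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    input_price: float   # USD per million input tokens
    output_price: float  # USD per million output tokens
    elo: float


def blended_price(m: Model) -> float:
    """Average of input- and output-token price, as described for the figure."""
    return (m.input_price + m.output_price) / 2.0


def pareto_front(models: list[Model]) -> list[Model]:
    """Models not dominated on (lower price, higher Elo), cheapest first."""
    front = []
    for m in models:
        dominated = any(
            blended_price(o) <= blended_price(m)
            and o.elo >= m.elo
            and (blended_price(o) < blended_price(m) or o.elo > m.elo)
            for o in models
        )
        if not dominated:
            front.append(m)
    return sorted(front, key=blended_price)


# Placeholder catalog; every number here is invented for illustration.
catalog = [
    Model("cheap", 0.1, 0.3, 1100.0),
    Model("mid", 1.0, 3.0, 1150.0),
    Model("pricey_weak", 5.0, 15.0, 1120.0),
    Model("top", 10.0, 30.0, 1300.0),
]
print([m.name for m in pareto_front(catalog)])  # ['cheap', 'mid', 'top']
```

The quadratic scan is fine for a few dozen models; sorting by price and sweeping once would scale better.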

The Pareto front is made up of models coming from only two companies: OpenAI and Google. As mentioned in the previous sections, OpenAI’s models dominate in performance, and they appear to be fairly priced too. Meanwhile, Google seems to focus on lighter weight — thus cheaper — models that still perform well. This makes sense, given their focus on on-device LLM use-cases which hold great strategic value for their mobile operating system (Android).

Conclusion: the Pareto front is made up of models coming from either OpenAI (high performance) or Google (good value for money).

The next figure shows a similar trend when using HumanEval instead of Elo scores to represent coding performance. Some observations:
- Anthropic’s Claude 3.5 Haiku is the only notable addition, as this model does not yet have an Elo rating. Could it be a potential contender for middle-priced, high-performance models?
- The differences for Google’s Gemini 1.5 Pro and Mistral’s Mistral Large are explained in the previous section that compared HumanEval scores with Elo ratings.
- Given that Google’s Gemini 1.5 Flash 8B does not have a HumanEval score, it is excluded from this figure.

Figure 6: Model coding performance (HumanEval score) by API price — colored by company. The models that make up the Pareto front are annotated.

Sifting through the data: additional insights and trends

To conclude, I will discuss some extra insights worth noting in the current LLM (coding) landscape. This section explores three key observations: the steady improvement of models over time, the continued dominance of proprietary models, and the significant impact even minor model updates can have. All the observations stem from the Elo rating by price comparison shown in Figure 5.

Models are getting better and cheaper

The following figure illustrates how new models continue to achieve higher accuracy while simultaneously driving down costs. It’s remarkable to see how three time segments — 2023 and before, H1 of 2024, and H2 of 2024 — each define their own Pareto front and occupy almost completely distinct segments. Curious to see how this will continue to progress in 2025!

Figure 7: Evolution over time, as indicated by three different time segments — 2023 and before, H1 of 2024, and H2 of 2024.

Conclusion: models get systematically better and cheaper, a trend observed with almost every new model release.

Proprietary models remain in power

The following image shows which of the analyzed models are proprietary and which are open-source. We see that proprietary models continue to dominate the LLM coding landscape. The Pareto front is still driven by these “closed-source” models, both on the high-performing and low-cost ends.

However, open-source models are closing the gap. It’s interesting to see, though, that for each open-source model, there is a proprietary model with the same predictive performance that is significantly cheaper. This suggests that the proprietary models are either more lightweight or better optimized, thus requiring less computational power, though this is just a personal hunch.

Figure 8: Proprietary versus open-source models.

Conclusion: proprietary models continue to hold the performance-cost Pareto front.

Even minor model updates have an effect

The following and final image illustrates how even minor updates to the same models can have an impact. Most often, these updates bring a performance boost, improving the models gradually over time without a major release. Occasionally though, a model’s performance might drop for coding tasks following a minor update, but this is almost always accompanied by a reduction in price. This is likely because the models were optimized in some way, such as through quantization or pruning parts of their network.

Figure 9: Evolution of model performance and price for minor model updates.

Conclusion: minor model updates almost always improve performance or push down cost.

Conclusion: key takeaways of LLMs for coding

The LLM landscape for coding is rapidly evolving, with newer models regularly pushing the Pareto front toward better-performing and/or cheaper options. Developers must stay informed about the latest models to identify those that offer the best capabilities within their budget. Recognizing the misalignment between real-world results and benchmarks is essential to making informed decisions. By carefully weighing performance against cost, developers can choose the tools that best meet their needs and stay ahead in this dynamic field.

Here’s a quick overview of all the conclusions made in this post:
- Comparing LLM performance is hard.
- OpenAI continues to dominate, positioning themselves at the top in benchmark performance and real-world usage. Remarkably, o1-mini is the best performing model, outperforming its larger counterpart o1.
- Better benchmark results don’t always reflect better real-world performance. It’s advised to look at both independently.
- Alibaba and Mistral tend to create models that overfit on the benchmark data.
- Google’s models are underrated in benchmark results, due to their focus on fair training and evaluation.
- The Pareto front is made up of models coming from either OpenAI (high performance) or Google (good value for money).
- Models get systematically better and cheaper, a trend observed with almost every new model release.
- Proprietary models continue to hold the performance-cost Pareto front.
- Minor model updates almost always improve performance or push down cost.

Found this useful? Feel free to follow me on LinkedIn to see my next explorations!

The images shown in this article were created by myself, the author, unless specified otherwise.

LLMs for Coding in 2024: Price, Performance, and the Battle for the Best was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
December 9, 2024
Can LLMs talk SQL, SPARQL, Cypher, and MongoDB Query Language (MQL) equally well?
Jonathan Fürst
Are LLMs Better at Generating SQL, SPARQL, Cypher, or MongoDB Queries?

Our NeurIPS’24 paper sheds light on this underinvestigated topic with a new and unique public dataset and benchmark.

(Image by author)

Many recent works have been focusing on how to generate SQL from a natural language question using an LLM. However, there is little understanding of how well LLMs can generate other database query languages in a direct comparison. To answer the question, we created a completely new dataset and benchmark of 10K question-query pairs covering four databases and query languages. We evaluated several relevant closed and open-source LLMs from OpenAI, Google, and Meta together with common in-context-learning (ICL) strategies. The corresponding paper “SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark” [1] is published at NeurIPS 2024 in the Dataset and Benchmark track (https://arxiv.org/abs/2411.05521).
All code and data are available at https://github.com/jf87/SM3-Text-to-Query to enable you to test your own Text-to-Query method across four query languages. But before we look at Text-to-Query, let’s first take a step back and examine the more common paradigm of Text-to-SQL.

What is Text-to-SQL?

Text-to-SQL (also called NL-to-SQL) systems translate the provided natural language question into a corresponding SQL query. SQL has served as a primary query language for structured data sources (relational model), offering a declarative interface for application developers to access information. Text-to-SQL systems thus aim to enable non-SQL expert users to access and fulfill their information needs by simply making their requests in natural language.

Figure 1. Overview Text-to-SQL. Users ask questions in natural language, which are then translated to the corresponding SQL query. The query is executed against a relational database such as PostgreSQL, and the response is returned to the users. (Image by author)
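The flow in Figure 1 can be sketched end to end as follows. This is a minimal illustration under stated assumptions: SQLite stands in for PostgreSQL so the example runs without a server, `translate_to_sql` is a hypothetical stub in place of a real LLM call, and the `patients` table is invented for the demo.

```python
import sqlite3


def translate_to_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM: returns a canned query per question."""
    canned = {
        "How many patients are there?": "SELECT COUNT(*) FROM patients",
    }
    return canned[question]


def answer(question: str, conn: sqlite3.Connection) -> list[tuple]:
    """Translate the question to SQL, run it, and return the result rows."""
    sql = translate_to_sql(question)
    return conn.execute(sql).fetchall()


# Toy in-memory database with an invented schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, gender TEXT)")
conn.executemany("INSERT INTO patients (gender) VALUES (?)", [("F",), ("M",), ("F",)])

print(answer("How many patients are there?", conn))  # [(3,)]
```

In a real system the stub would be replaced by a model call, and generated queries would typically run with read-only permissions.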

Text-to-SQL methods have recently increased in popularity and made substantial progress in terms of their generation capabilities. This can easily be seen from Text-to-SQL accuracies reaching 90% on the popular benchmark Spider (https://yale-lily.github.io/spider) and up to 74% on the more recent and more complex BIRD benchmark (https://bird-bench.github.io/). At the core of this success lie the advancements in transformer-based language models, from BERT [2] (340M parameters) and BART [3] (148M parameters) to T5 [4] (3B parameters) to the advent of Large Language Models (LLMs), such as OpenAI’s GPT models, Anthropic’s Claude models, or Meta’s LLaMA models (up to 100s of billions of parameters).

Beyond Relational Databases: Document & Graph Model

While many structured data sources inside companies and organizations are indeed stored in a relational database and accessible through the SQL query language, there are other core database models (also often referred to as NoSQL) that come with their own benefits and drawbacks in terms of ease of data modeling, query performance, and query simplicity:
- Relational Database Model. Here, data is stored in tables (relations) with a fixed, hard-to-evolve schema that defines tables, columns, data types, and relationships. Each table consists of rows (records) and columns (attributes), where each row represents a unique instance of the entity described by the table (for example, a patient in a hospital), and each column represents a specific attribute of that entity. The relational model enforces data integrity through constraints such as primary keys (which uniquely identify each record) and foreign keys (which establish relationships between tables). Data is accessed through SQL. Popular relational databases include PostgreSQL, MySQL, and Oracle Database.
- Document Database Model. Here, data is stored in a document structure (hierarchical data model) with a flexible schema that is easy to evolve. Each document is typically represented in formats such as JSON or BSON, allowing for a rich representation of data with nested structures. Unlike relational databases, where data must conform to a predefined schema, document databases allow different documents within the same collection to have varying fields and structures, facilitating rapid development and iteration. This flexibility means that attributes can be added or removed without affecting other documents, making it suitable for applications where requirements change frequently. Popular document databases include MongoDB, CouchDB, and Amazon DocumentDB.
- Graph Database Model. Here, data is represented as nodes (entities) and edges (relationships) in a graph structure, allowing for the modeling of complex relationships and interconnected data. This model provides a flexible schema that can easily accommodate changes, as new nodes and relationships can be added without altering existing structures. Graph databases excel at handling queries involving relationships and traversals, making them ideal for applications such as social networks, recommendation systems, and fraud detection. Popular graph databases include Neo4j, Amazon Neptune, and ArangoDB.

From Text-to-SQL to Text-to-Query

The choice of database and the underlying core data model (relational, document, graph) has a large impact on read/write performance and query complexity. For example, the graph model naturally represents many-to-many relationships, such as connections between patients, doctors, treatments, and medical conditions. In contrast, relational databases require potentially expensive join operations and complex queries. Document databases have only rudimentary support for many-to-many relationships and aim at scenarios where data is not highly interconnected and stored in collections of documents with a flexible schema.

Figure 2. Differences across query languages and database systems for the same user request. (Image by author)
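To make the contrast concrete, here is how one simple request might look in each of the four query languages. The schema (a `patients` table or collection, a `Patient` node label, an `ex:` RDF namespace, a `gender` attribute) is hypothetical and chosen only for illustration; it is not the benchmark's actual schema.

```python
# The same request expressed four ways, over an invented schema.
REQUEST = "How many female patients are there?"

QUERIES = {
    # Relational model (SQL)
    "SQL": "SELECT COUNT(*) FROM patients WHERE gender = 'F';",
    # Document model (MongoDB Query Language, mongosh syntax)
    "MQL": 'db.patients.countDocuments({ gender: "F" })',
    # Property-graph model (Cypher)
    "Cypher": "MATCH (p:Patient {gender: 'F'}) RETURN count(p)",
    # RDF graph model (SPARQL)
    "SPARQL": (
        "PREFIX ex: <http://example.org/> "
        'SELECT (COUNT(?p) AS ?n) WHERE { ?p a ex:Patient ; ex:gender "F" . }'
    ),
}

print(REQUEST)
for language, query in QUERIES.items():
    print(f"{language:>6}: {query}")
```

Even for this trivial count the surface syntax differs substantially, and the gap widens once joins, nested documents, or graph traversals are involved.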

While these differences have been a known fact in database research and industry, their implications for the growing number of Text-to-Query systems have surprisingly not been investigated so far.

SM3-Text-to-Query Benchmark

SM3-Text-to-Query is a new dataset and benchmark that enables the evaluation across four query languages (SQL, MongoDB Query Language, Cypher, and SPARQL) and three data models (relational, graph, document).

Figure 3. SM3-Text-to-Query Benchmark Construction. Combining synthetic patient data generation with ETL processes for four databases makes it possible to create arbitrarily large synthetic datasets. (Image by author)

SM3-Text-to-Query is constructed from synthetic patient data created with Synthea. Synthea is an open-source synthetic patient generator that produces realistic electronic health record (EHR) data. It simulates patients’ medical histories over time, including various demographics, diseases, medications, and treatments. This created data is then transformed and loaded into four different database systems: PostgreSQL, MongoDB, Neo4J, and GraphDB (RDF).

Based on a set of more than 400 manually created template questions and the generated data, 10K question-query pairs are generated for each of the four query languages (SQL, MQL, Cypher, and SPARQL). Because the data generation process is synthetic, it is also easy to add further template questions or to generate your own patient data (for example, adapted to a specific region or in another language). It would even be possible to construct a (private) dataset with actual patient data.
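Template-based generation of this kind can be sketched as follows. The template text, the `conditions` table, and the value list are hypothetical placeholders; the benchmark's real templates and generation code live in its repository and may work differently.

```python
import random

# Hypothetical (question template, SQL template) pairs sharing one slot.
TEMPLATES = [
    (
        "How many patients have the condition {value}?",
        "SELECT COUNT(DISTINCT patient) FROM conditions WHERE description = '{value}'",
    ),
    (
        "List the patients diagnosed with {value}.",
        "SELECT DISTINCT patient FROM conditions WHERE description = '{value}'",
    ),
]

# Placeholder slot values; in practice these would be drawn from the generated data.
CONDITIONS = ["Hypertension", "Asthma"]


def generate_pairs(templates, values, n, seed=0):
    """Fill n randomly chosen templates with randomly chosen slot values."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    pairs = []
    for _ in range(n):
        question_tmpl, query_tmpl = rng.choice(templates)
        value = rng.choice(values)
        # Plain string formatting is acceptable for trusted, offline dataset
        # generation, but must not be used with untrusted input.
        pairs.append((question_tmpl.format(value=value), query_tmpl.format(value=value)))
    return pairs


for question, query in generate_pairs(TEMPLATES, CONDITIONS, n=3):
    print(question, "->", query)
```

Supporting the other three query languages would amount to adding one query template per language to each entry.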

Text-to-Query Results

So, how do current LLMs perform in the generation across the four query languages? There are three main lessons that we can learn from the reported results.

Lesson 01: Schema information helps for all query languages but not equally well.

Schema information helps for all query languages, but its effectiveness varies significantly. Models that are given schema information outperform those that are not, and the gap is even larger in one-shot scenarios, where accuracy plummets without the schema. For SQL, Cypher, and MQL, schema information can more than double the performance. However, SPARQL shows only a small improvement. This suggests that LLMs may already be familiar with the underlying schema (SNOMED CT, https://www.snomed.org), which is a common medical ontology.

Figure 4. Impact of Schema Information on Execution Accuracy. (Image by author)
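
For readers unfamiliar with the metric, here is a minimal sketch of one common way to compute execution accuracy: run the predicted query and the reference query against the same database and compare the results. It uses an in-memory SQLite database with a made-up table; the exact comparison rules used in the paper may differ.

```python
import sqlite3


def run(conn, query):
    """Execute a query; return its rows in a canonical order, or None on error."""
    try:
        return sorted(conn.execute(query).fetchall(), key=repr)
    except sqlite3.Error:
        return None


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) query pairs whose results match."""
    hits = 0
    for predicted, gold in pairs:
        expected = run(conn, gold)
        if expected is not None and run(conn, predicted) == expected:
            hits += 1
    return hits / len(pairs)


# Tiny hypothetical database for demonstration purposes.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE conditions (patient_id TEXT, description TEXT);
    INSERT INTO conditions VALUES
        ('p1', 'Asthma'), ('p2', 'Asthma'), ('p2', 'Hypertension');
""")

GOLD = "SELECT COUNT(DISTINCT patient_id) FROM conditions WHERE description = 'Asthma'"
pairs = [
    # Different SQL text, same result: counts as correct.
    ("SELECT COUNT(*) FROM (SELECT DISTINCT patient_id FROM conditions "
     "WHERE description = 'Asthma')", GOLD),
    # Runs, but returns a different number: counts as incorrect.
    ("SELECT COUNT(*) FROM conditions", GOLD),
    # Fails to execute (unknown table): counts as incorrect.
    ("SELECT COUNT(*) FROM diagnoses", GOLD),
]

print(execution_accuracy(conn, pairs))
```

Comparing results rather than query strings means that a differently phrased but equivalent query is still counted as correct.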

Lesson 02: Adding examples improves accuracy through in-context learning (ICL) for all LLMs and query languages; however, the rate of improvement varies greatly across query languages.

Examples enhance accuracy through in-context learning (ICL) across all LLMs and query languages. However, the degree of improvement varies greatly. For SQL, the most popular query language, larger LLMs (GPT-3.5, Llama3-70b, Gemini 1.0) already show a solid baseline accuracy of around 40% with zero-shot schema input and gain only about 10 percentage points with five-shot examples. The models struggle significantly, however, with less common query languages such as SPARQL and MQL when no examples are given. For instance, SPARQL’s zero-shot accuracy is below 4%, yet with five-shot examples it rises to 30%, which shows that ICL helps models generate more accurate queries when they are provided with relevant examples.
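
The difference between zero-shot and few-shot prompting, with and without schema, comes down to what goes into the prompt. The helper below is a hypothetical sketch of such a prompt builder; the wording and layout are assumptions and not the prompts used in the paper.

```python
def build_prompt(question, language, schema=None, examples=()):
    """Assemble a Text-to-Query prompt.

    schema:   optional schema description (omit for the no-schema setting)
    examples: (question, query) pairs; empty for zero-shot, five for five-shot
    """
    parts = [f"Translate the question into a {language} query."]
    if schema:
        parts.append(f"Database schema:\n{schema}")
    for ex_question, ex_query in examples:
        parts.append(f"Question: {ex_question}\nQuery: {ex_query}")
    # The final block is left open for the model to complete.
    parts.append(f"Question: {question}\nQuery:")
    return "\n\n".join(parts)


# Hypothetical schema and example, for illustration only.
schema = "conditions(patient_id, description)"
shots = [
    ("How many conditions are recorded?", "SELECT COUNT(*) FROM conditions;"),
]

print(build_prompt("How many patients have asthma?", "SQL"))
print("-----")
print(build_prompt("How many patients have asthma?", "SQL", schema, shots))
```

Swapping the `language` argument and the examples is all it takes to target another query language, which makes the comparison across languages straightforward.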

Figure 5. Impact of In-Context-Learning (ICL) through Few-shot Examples. (Image by author)

Lesson 03: LLMs have varying levels of training knowledge across different query languages.

LLMs exhibit differing levels of proficiency across query languages. This is likely rooted in their training data sources. An analysis of Stack Overflow posts supports this assumption. There is a big contrast in the post frequency for the different query languages:
- [SQL]: 673K posts
- [MongoDB, MQL]: 176K posts
- [Cypher, Neo4j]: 33K posts
- [SPARQL]: 6K posts
These counts broadly track the zero-shot accuracy results: SQL leads with a best model accuracy of 47.05%, followed by Cypher at 34.45% and MQL at 21.55%, while SPARQL achieves just 3.3%. These findings align with existing research [5], which indicates that the frequency and recency of questions on platforms like Stack Overflow significantly affect LLM performance. MQL is an intriguing exception: despite having far more posts than Cypher, it underperforms Cypher, likely due to the complexity and length of MQL queries.
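
As a quick sanity check on how closely the two rankings agree, the sketch below computes a Spearman rank correlation from the post counts and accuracies quoted above. With only four data points this is an illustration, not a statistical result.

```python
# Figures quoted above: Stack Overflow posts (in thousands)
# and best zero-shot accuracy (in percent).
data = {
    "SQL": (673, 47.05),
    "MQL": (176, 21.55),
    "Cypher": (33, 34.45),
    "SPARQL": (6, 3.3),
}


def ranks(values):
    """Rank values from largest (1) to smallest; assumes no ties."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman(xs, ys):
    """Spearman rank correlation for tie-free data: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


posts, accuracy = zip(*data.values())
rho = spearman(posts, accuracy)
print(round(rho, 2))  # 0.8
```

The value falls short of a perfect 1.0 only because MQL and Cypher swap places in the two rankings.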

Conclusion

SM3-Text-to-Query is the first dataset that targets the evaluation of Text-to-Query systems across query languages and database models. The number of such systems is increasing, fueled by rapid progress in LLMs, but existing work has mainly focused on SQL, and other important query languages remain underinvestigated. This new dataset and benchmark allow a direct comparison of four relevant query languages for the first time, making it a valuable resource for both researchers and practitioners who want to design and implement Text-to-Query systems.

The initial results already provide many interesting insights, and I encourage you to check out the full paper [1].

Try it yourself

All code and data are open-sourced at https://github.com/jf87/SM3-Text-to-Query. Contributions are welcome. In a follow-up post, we will provide hands-on instructions on how to deploy the different databases and try out your own Text-to-Query method.

[1] Sivasubramaniam, Sithursan, Cedric Osei-Akoto, Yi Zhang, Kurt Stockinger, and Jonathan Fuerst. “SM3-Text-to-Query: Synthetic Multi-Model Medical Text-to-Query Benchmark.” In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
[2] Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. “Bert: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805 (2018).
[3] Lewis, Mike, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. “BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension.” In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880. 2020.
[4] Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. “Exploring the limits of transfer learning with a unified text-to-text transformer.” Journal of machine learning research 21, no. 140 (2020): 1–67.
[5] Kabir, Samia, David N. Udo-Imeh, Bonan Kou, and Tianyi Zhang. “Is stack overflow obsolete? an empirical study of the characteristics of chatgpt answers to stack overflow questions.” In Proceedings of the CHI Conference on Human Factors in Computing Systems, pp. 1–17. 2024.

Can LLMs talk SQL, SPARQL, Cypher, and MongoDB Query Language (MQL) equally well? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
December 9, 2024
Here’s where to get Apple’s M3 MacBook Air 16GB for $899 with delivery by Christmas

Delivery dates are already slipping on Apple’s popular M3 MacBook Air with 16GB RAM. Here’s where to pick it up for as low as $899, with delivery by Christmas.

Get an M3 MacBook Air for just $899 with delivery by Christmas – Image credit: Apple

Amazon is reporting delivery after Christmas for its M3 MacBook Air 13-inch 16GB/256GB listing, but Best Buy currently has units in stock at a $200 discount off MSRP with free shipping or store pickup.

December 9, 2024

Evaluating the current LLM landscape based on both benchmarks and real-world insights to help you make informed choices.

Disclaimer: challenges when comparing LLMs

Evaluating LLMs: HumanEval and Elo scores

Benchmarking through HumanEval

The voice of the people through Elo scores

Current state of model performance

Mismatch between benchmark results and real-world performance

Balancing performance and price: the models that provide the best bang for buck

Sifting through the data: additional insights and trends

Models are getting better and cheaper

Proprietary models remain in power

Even minor model updates have an effect

Conclusion: key takeaways of LLMs for coding

Are LLMs Better at Generating SQL, SPARQL, Cypher, or MongoDB Queries?

Our NeurIPS’24 paper sheds light on this underinvestigated topic with a new and unique public dataset and benchmark.

What is Text-to-SQL?

Beyond Relational Databases: Document & Graph Model

From Text-to-SQL to Text-to-Query

SM3-Text-to-Query Benchmark

Text-to-Query Results

Conclusion

Try it yourself