Step-by-step tutorial for assessing the accuracy of generated Cypher Statements
Shortly after large language models (LLMs) became popular, we realized they were decent at translating natural language to database queries such as SQL and Cypher. To enable the LLM to create tailored queries for your particular database, you must provide its schema and, optionally, a few example queries. With this information, the LLM can generate database queries based on natural language input.
While LLMs show great potential at translating natural language to database queries, they are far from perfect. Therefore, it is essential to understand how well they perform using an evaluation process. Fortunately, text-to-SQL generation has been studied extensively in academic benchmarks such as Spider. We will use the following metrics to evaluate the Cypher generation ability of LLMs.
- Jaro-Winkler: A string similarity metric. We compare the generated Cypher query to the correct one character by character and score how closely the two strings match, where 1.0 means they are identical.
- Pass@1: The score is 1.0 if the generated query returns the same results from the database as the correct Cypher query, and 0.0 otherwise.
- Pass@3: Similar to Pass@1, but instead of generating one query, we generate three. If any of them returns the same results as the correct query, the score is 1.0; otherwise, it is 0.0.
- Jaccard similarity: Measures the Jaccard similarity between the results returned by the generated Cypher query and the results returned by the correct one. This metric is designed to capture cases where the model returns almost correct results.
The code is available on GitHub and was developed in collaboration with Adam Schill Collberg.
As you can observe, the focus is on evaluating the responses from the database rather than the Cypher statements themselves. One reason is that the same information can often be retrieved with several differently written Cypher statements. We don't care which syntax the LLM prefers; we only care that it produces correct responses. Similarly, we don't have a strong preference for how the LLM names the returned columns, so we don't want to evaluate its column-naming choices.
Test dataset
The test dataset consists of question and relevant Cypher statement pairs.
You can use an LLM to generate suggestions for the test dataset. However, you need to validate the examples manually, as LLMs make mistakes and aren't 100% reliable. If they were, we wouldn't need to test them in the first place. Since we evaluate based on database results rather than the Cypher statements themselves, we need a running database with relevant information. In this blog post, we will use the recommendations project in Neo4j Sandbox. The recommendations project uses the MovieLens dataset, which contains movies, actors, ratings, and more. The same dataset is also available with read-only access on the Neo4j demo server, which means you don't have to create a new database instance if you don't want to.
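If you'd prefer to follow along against the public demo server instead of creating your own sandbox instance, the connection (using the same LangChain Neo4jGraph helper introduced later in the post) might look roughly like the snippet below. The URI, database name, and credentials are assumptions based on Neo4j's publicly documented demo setup, so verify them before relying on them.

from langchain_community.graphs import Neo4jGraph

# Assumed connection details for the public, read-only Neo4j demo server
graph = Neo4jGraph(
    url="neo4j+s://demo.neo4jlabs.com",
    username="recommendations",
    password="recommendations",
    database="recommendations",
)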
In this example, I used GPT-4 to come up with suggestions for the test dataset and then went through them manually, correcting them where needed. We will use only 27 test pairs; in practice, you probably want at least a couple hundred examples.
data = [
    {
        "question": "How many movies were released in 1995?",
        "cypher": "MATCH (m:Movie) WHERE m.year = 1995 RETURN count(*) AS result",
    },
    {
        "question": "Who directed the movie Inception?",
        "cypher": "MATCH (m:Movie {title: 'Inception'})<-[:DIRECTED]-(d) RETURN d.name",
    },
    {
        "question": "Which actors played in the movie Casino?",
        "cypher": "MATCH (m:Movie {title: 'Casino'})<-[:ACTED_IN]-(a) RETURN a.name",
    },
    {
        "question": "How many movies has Tom Hanks acted in?",
        "cypher": "MATCH (a:Actor {name: 'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN count(m)",
    },
    {
        "question": "List all the genres of the movie Schindler's List",
        "cypher": "MATCH (m:Movie {title: \"Schindler's List\"})-[:IN_GENRE]->(g:Genre) RETURN g.name",
    },
    ...
]
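The evaluation loop later in the post iterates over a pandas DataFrame, so we load the question-and-Cypher pairs into one. This is a minimal sketch of that step; the accompanying notebook may construct the DataFrame differently.

import pandas as pd

# Load the test pairs into a DataFrame with "question" and "cypher" columns
df = pd.DataFrame(data)
print(df.shape)  # (27, 2) for the full test set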
Generating Cypher statements
We will be using LangChain to generate Cypher statements. The Neo4jGraph object in LangChain establishes the connection to Neo4j and retrieves its schema information.
from langchain_community.graphs import Neo4jGraph

# With no arguments, Neo4jGraph reads the connection details from the
# NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD environment variables
graph = Neo4jGraph()
print(graph.schema)
# Node properties are the following:
# Movie {posterEmbedding: LIST, url: STRING, runtime: INTEGER, revenue: INTEGER, budget: INTEGER, plotEmbedding: LIST, imdbRating: FLOAT, released: STRING, countries: LIST, languages: LIST, plot: STRING, imdbVotes: INTEGER, imdbId: STRING, year: INTEGER, poster: STRING, movieId: STRING, tmdbId: STRING, title: STRING}
# Genre {name: STRING}
# User {userId: STRING, name: STRING}
# Actor {url: STRING, bornIn: STRING, bio: STRING, died: DATE, born: DATE, imdbId: STRING, name: STRING, poster: STRING, tmdbId: STRING}
# Director {url: STRING, bornIn: STRING, born: DATE, died: DATE, tmdbId: STRING, imdbId: STRING, name: STRING, poster: STRING, bio: STRING}
# Person {url: STRING, bornIn: STRING, bio: STRING, died: DATE, born: DATE, imdbId: STRING, name: STRING, poster: STRING, tmdbId: STRING}
# Relationship properties are the following:
# RATED {rating: FLOAT, timestamp: INTEGER}
# ACTED_IN {role: STRING}
# DIRECTED {role: STRING}
# The relationships are the following:
# (:Movie)-[:IN_GENRE]->(:Genre)
# (:User)-[:RATED]->(:Movie)
# (:Actor)-[:ACTED_IN]->(:Movie)
# (:Actor)-[:DIRECTED]->(:Movie)
# (:Director)-[:DIRECTED]->(:Movie)
# (:Director)-[:ACTED_IN]->(:Movie)
# (:Person)-[:ACTED_IN]->(:Movie)
# (:Person)-[:DIRECTED]->(:Movie)
The schema contains node labels, their properties, and the corresponding relationships. Next, we will use the LangChain expression language to define a prompt sent to the LLM with instructions to translate the natural language to a Cypher statement that retrieves relevant information to answer the question. Visit the official documentation for more details on the LangChain expression language.
cypher_template = """Based on the Neo4j graph schema below,
write a Cypher query that would answer the user's question.
Return only Cypher statement, no backticks, nothing else.
{schema}
Question: {question}
Cypher query:""" # noqa: E501
cypher_prompt = ChatPromptTemplate.from_messages(
[
(
"system",
"Given an input question, convert it to a Cypher query. No pre-amble.",
),
("human", cypher_template),
]
)
cypher_chain = (
RunnablePassthrough.assign(
schema=lambda _: graph.get_schema,
)
| cypher_prompt
| llm.bind(stop=["nCypherResult:"])
| StrOutputParser()
)
If you are familiar with conversational LLMs, you will recognize the system and human message definitions. As you can see, we put both the graph schema and the user question into the human message. The exact prompt-engineering approach for generating Cypher statements is not a solved problem, which means there is likely room for improvement here. With an evaluation process in place, you can test what works best for a particular LLM. In this example, we are using gpt-4-turbo.
We can test the Cypher generation with the following example:
response = cypher_chain.invoke(
    {
        "question": "How many movies have the keyword 'love' in the title and a runtime under 2 hours?"
    }
)
print(response)
# MATCH (m:Movie)
# WHERE m.title CONTAINS 'love' AND m.runtime < 120
# RETURN count(m) as NumberOfMovies
We can observe that gpt-4-turbo is somewhat decent at translating natural language to Cypher statements. Let’s now define the evaluation process.
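The loop below relies on two helper functions, get_jw_distance and df_sim_pair, which are omitted from the post and included in the notebook. Their exact implementations may differ from what is shown here, but a minimal sketch could look like this, assuming the jellyfish library for the Jaro-Winkler score and treating each query response as a set of returned values:

import jellyfish


def get_jw_distance(cypher_a: str, cypher_b: str) -> float:
    # String similarity between two Cypher statements (1.0 means identical)
    return jellyfish.jaro_winkler_similarity(cypher_a, cypher_b)


def df_sim_pair(true_pair, generated_pair) -> float:
    # Each pair is (cypher, list_of_result_dicts); only the returned values
    # are compared, so column naming differences are ignored
    _, true_data = true_pair
    _, eval_data = generated_pair
    true_set = {str(value) for record in true_data for value in record.values()}
    eval_set = {str(value) for record in eval_data for value in record.values()}
    if not true_set and not eval_set:
        return 1.0  # both queries returned empty results
    # Jaccard similarity: size of the intersection over size of the union
    return len(true_set & eval_set) / len(true_set | eval_set)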
from tqdm import tqdm

# Lists that collect the generated Cypher statements, the query results,
# and the metric values for every test example
generated_cyphers, true_datas, eval_datas = [], [], []
jaro_winklers, pass_1s, pass_3s, jaccards = [], [], [], []

# Iterate over each row with tqdm to show a progress bar
for index, row in tqdm(df.iterrows(), total=df.shape[0]):
    # Fetch data based on the test Cypher statement
    true_data = graph.query(row["cypher"])
    # Generate 3 Cypher statements and fetch data for each of them
    example_generated_cyphers = []
    example_eval_datas = []
    for _ in range(3):
        cypher = cypher_chain.invoke({"question": row["question"]})
        example_generated_cyphers.append(cypher)
        # Fetch data based on the generated Cypher statement
        try:
            example_eval_datas.append(graph.query(cypher))
        except ValueError:  # Handle syntax errors
            example_eval_datas.append([{"id": "Cypher syntax error"}])
    # These metrics require only the first cypher/response
    # get_jw_distance and df_sim_pair are helper functions from the notebook
    jaro_winkler = get_jw_distance(row["cypher"], example_generated_cyphers[0])
    pass_1 = (
        1
        if df_sim_pair(
            (row["cypher"], true_data),
            (example_generated_cyphers[0], example_eval_datas[0]),
        )
        == 1
        else 0
    )
    jaccard = df_sim_pair(
        (row["cypher"], true_data),
        (example_generated_cyphers[0], example_eval_datas[0]),
    )
    # Pass@3 checks all three responses
    pass_3 = (
        1
        if any(
            df_sim_pair((row["cypher"], true_data), (gen_cypher, eval_data)) == 1
            for gen_cypher, eval_data in zip(
                example_generated_cyphers, example_eval_datas
            )
        )
        else 0
    )
    # Append the results to their respective lists
    generated_cyphers.append(example_generated_cyphers)
    true_datas.append(true_data)
    eval_datas.append(example_eval_datas)
    jaro_winklers.append(jaro_winkler)
    pass_1s.append(pass_1)
    pass_3s.append(pass_3)
    jaccards.append(jaccard)
Running this code took about 5 minutes, as we need to generate 81 responses (27 questions × 3 generations each) to calculate the pass@3 metric.
The code is slightly lengthy, but the gist is simple. We iterate over all the rows in the data frame that stores the test examples, generate three Cypher statements for each example, and retrieve the corresponding data from the database. We then calculate the relevant metrics and store them in lists so that we can evaluate and visualize them. The helper functions are not included in the blog post because reviewing each metric's implementation isn't essential here; all of them are included in the notebook.
Let’s now evaluate the results.
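To get from the per-example lists to the averages discussed below, you can simply take the mean of each metric. A minimal sketch of that step (the notebook also visualizes the results as bar charts):

import pandas as pd

# Collect the per-example metrics and compute the average of each one
results = pd.DataFrame(
    {
        "jaro_winkler": jaro_winklers,
        "pass_1": pass_1s,
        "pass_3": pass_3s,
        "jaccard": jaccards,
    }
)
print(results.mean())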
The evaluation is based on four different metrics:
- Jaro-Winkler: This metric shows a high average of 0.89, indicating that the LLMs generated Cypher queries that are very similar to the correct Cypher queries on a string level.
- Pass@1: The average score here is 0.48, suggesting that nearly half of the generated Cypher queries returned the exact same results as the correct query when each query is evaluated independently.
- Pass@3: With an average of 0.63, this metric indicates an improvement over Pass@1. This suggests that while the LLM may not always get the query right on the first attempt, it often comes up with a correct version within three tries.
- Jaccard Similarity: The average score of 0.53 is the lowest among the metrics, but it still indicates that, on average, the result sets produced by the LLM-generated Cypher queries share roughly half of their elements with the result sets from the correct queries.
Overall, these metrics suggest that LLMs are decent at generating Cypher queries that are similar to the correct ones and often produce functionally equivalent results, especially when given multiple attempts. However, there is still room for improvement, particularly in generating the correct query on the first attempt. There is also room for improvement in the evaluation process itself. Let's take a look at one example:
# (In the notebook, the result lists are added back to df as columns.)
row = df.iloc[24]
# Print the desired information
print("Question:", row["question"], "\n")
print("True Cypher:", row["cypher"], "\n")
print("Generated Cypher:", row["generated_cypher"][0], "\n")
# Question: Which directors have never had a movie with a rating below 6.0?
# True Cypher:
# MATCH (d:Director)-[:DIRECTED]->(m:Movie)
# WITH d, MIN(m.imdbRating) AS lowestRating WHERE lowestRating >= 6.0
# RETURN d.name, lowestRating
# Generated Cypher
# MATCH (d:Director)-[:DIRECTED]->(m:Movie)
# WHERE NOT EXISTS {
# MATCH (d)-[:DIRECTED]->(m2:Movie)
# WHERE m2.imdbRating < 6.0
# }
# RETURN DISTINCT d.name
For the question of which directors have never had a movie rated below 6.0, the LLM did a decent job of getting the results. It used a different approach than the one in the test dataset, but that's not a problem, as it should return the same directors. However, the test Cypher statement returns both the director's name and the lowest rating of their movies, while the generated query returns only the names. We can't blame the LLM, as it simply followed the question. Still, the pass@1 score is 0 for this example, and the Jaccard similarity is only 0.5. Therefore, you have to be very careful about how you construct the test dataset, both in how you phrase the questions and in how you write the corresponding Cypher statements.
Another characteristic of LLMs is that they are non-deterministic, meaning you can get different results on every run. To see how much this matters, let's run the full evaluation three times in sequence, which takes around 15 minutes.
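There is no additional code in the post for this step; conceptually, it is just the same evaluation repeated. A rough sketch, assuming the loop above is wrapped in a hypothetical evaluate_dataset(df) helper that returns the average of each metric as a dict:

import pandas as pd

# evaluate_dataset is a hypothetical wrapper around the evaluation loop above
runs = [evaluate_dataset(df) for _ in range(3)]
print(pd.DataFrame(runs))  # one row per run: jaro_winkler, pass_1, pass_3, jaccard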
The bar chart highlights the non-deterministic nature of LLMs. The Jaro-Winkler scores are consistently high across all runs, showing minor fluctuations between 0.88 and 0.89, which indicates stable string similarity of the generated queries. However, for Pass@1, there’s a notable variation, with the first run scoring 0.52, and subsequent runs showing scores of 0.59 and 0.48. Pass@3 scores exhibit less variance, hovering around 0.56 to 0.63, suggesting that multiple attempts yield more consistent correct results.
Conclusion
Through this blog post, we’ve learned that LLMs like GPT-4 have a promising capacity for generating Cypher queries, yet the technology isn’t foolproof. The evaluation framework presented offers a detailed, quantitative evaluation of LLM performance, allowing you to continuously experiment and update prompt engineering and other steps needed to generate valid and accurate Cypher statements. Additionally, it shows how the non-deterministic nature of LLMs affects the performance from one evaluation to another. Therefore, you can expect similar non-deterministic behavior in production.
The code is available on GitHub.
Dataset
F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4: 19:1–19:19. https://doi.org/10.1145/2827872