Don’t complicate things with graph DBs, QLs, or graph analytics.
When RAG developers decide to try graph RAG, that is, to build a knowledge graph and integrate it into their RAG (retrieval-augmented generation) system, the internet presents them with a lot of options and choices. There are many articles, guides, and how-tos presenting different tools for working with graph RAG and graphs in general. As a result, some developers dive right in, thinking they need to integrate and configure a laundry list of graph tools and techniques in order to do graph RAG properly. A search for how to get started typically turns up articles suggesting that you need some or all of the following:
- knowledge graphs — to connect key terms and concepts that semantic search doesn’t capture
- keyword and entity extraction tools — for building the knowledge graph
- graph traversal algorithms — for exploring connections in the graph
- property graph implementations — for enriching graph structure and traversal methods
- graph databases (DBs) — for storing and interacting with graphs, and advanced graph analytics
- graph query languages (QLs) — for sophisticated querying of graph nodes and edges
- graph node embedding algorithms — for embedding graph objects into searchable vector spaces
- vector stores — for storing and searching documents embedded in semantic vector space
Certainly, a case can be made that each of these tools and implementations can be very helpful for specific graph use cases. But for any developer starting a typical graph RAG use case, the simple fact remains: most “graph” tools were designed and built long before the generative AI revolution. GenAI use cases are fundamentally different from traditional graph use cases and require a different approach, even if some tools can be shared between the two.
The above list of suggested tools for graph RAG includes some that are generally unnecessary for typical GenAI use cases. And, beyond being unnecessary, adding some of these tools can over-complicate things — leading to increased development time, higher costs, and additional maintenance overhead that could have been avoided. Keeping the tech stack simple by focusing on the essentials enhances efficiency and lets you leverage the power of graph RAG without the bloat.
One popular misconception is that you need a graph DB to do graph RAG. Graph DBs and graph query languages (graph QLs) are powerful tools for graph analytics and deep graph algorithms, but graph RAG and GenAI applications don’t typically benefit from these types of traditional graph analytics. Graph DBs can support graph RAG, but they also add unnecessary complexity to the stack. We dive into this topic more below.
In this article, we discuss the software needs of various use cases involving graphs, focusing on GenAI use cases and applications, and minimizing additional effort and complexity when moving from plain RAG to graph RAG. In most cases, we don’t need an extensive list of tools; adopting a few key technologies aligned with our goals not only simplifies our work but often achieves better results.
GenAI use cases for graphs
Semantic vector search is powerful for finding documents that are contextually similar to a query. However, there are situations where this method falls short, especially when the required information is non-semantic or when deeper insights into the data are necessary. Graph RAG technologies can complement the capabilities of vector search by leveraging non-semantic information — such as in the following common use cases:
Leveraging non-semantic information in documents
While semantic search excels at identifying documents based on contextual similarity, it often misses non-semantic cues crucial for comprehensive data analysis. Graphs can incorporate and use non-semantic information such as metadata, which can include links, specialized terms and definitions, cross-references, glossaries, and document structure such as titles, headings, and sub-section content. Additionally, graphs can connect entities, keywords, and concepts that have been extracted or inferred from texts.
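As a minimal sketch of this idea, the snippet below derives graph edges from document metadata using only the Python standard library. The document ids and the `keywords` and `links` field names are hypothetical, chosen for illustration rather than taken from any particular library.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical documents: the ids and the "keywords"/"links" fields are made up.
docs = {
    "intro":    {"keywords": {"rag", "vector search"}, "links": ["glossary"]},
    "glossary": {"keywords": {"rag", "knowledge graph"}, "links": []},
    "faq":      {"keywords": {"billing"}, "links": ["intro"]},
}

def build_edges(docs):
    """Return an undirected adjacency map built from explicit links and shared keywords."""
    adjacency = defaultdict(set)

    # Explicit links (hyperlinks, cross-references) become edges directly.
    for doc_id, doc in docs.items():
        for target in doc["links"]:
            if target in docs:
                adjacency[doc_id].add(target)
                adjacency[target].add(doc_id)

    # Documents that share at least one keyword are also connected.
    for a, b in combinations(docs, 2):
        if docs[a]["keywords"] & docs[b]["keywords"]:
            adjacency[a].add(b)
            adjacency[b].add(a)

    return {doc_id: sorted(neighbors) for doc_id, neighbors in adjacency.items()}

graph = build_edges(docs)
```

In this toy data, `faq` and `intro` share no keywords, yet the cross-reference still connects them. That is the kind of relationship a purely similarity-based search might not surface.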
Community summarization
When the goal is to summarize the content from a community or a specific group of interconnected entities, graph-based approaches can be indispensable. Graphs can identify clusters or communities within the data, summarizing prevalent themes or discussions across multiple documents or contributors.
Neighborhood exploration
Exploring the “neighborhood” or immediate connections of a particular node or query in a graph can reveal relationships and insights that are not evident through semantic search alone. Contextual exploration allows for traversing from a starting node to explore adjacent nodes (documents, terms, or concepts) to discover related information that adds depth to the initial query.
Why GenAI is different from traditional graph use cases
Before there was generative AI, there were knowledge graphs and graph DBs. These graph tools pre-date GenAI by many years, and some associated technologies were designed for very different use cases. These technologies were primarily aimed at structured data exploration, not the unstructured text processing and semantic understanding that GenAI excels at.
The shift from traditional graph use cases to generative AI is a significant change in data handling techniques. Traditional graphs are excellent for clear, defined relationships, but they often lack the flexibility needed for the nuanced demands of generative AI.
Traditional graph tools were built for huge, complex graphs
Knowledge graphs are often the aggregation of large amounts of data from various sources, linking complex and interdependent relationships across a wide spectrum of data points. A huge number of nodes and edges, coupled with the complexity of their connections, can make data processing and analysis tasks computationally intensive and time-consuming.
This is why graph databases (graph DBs) were originally created. They provide optimized storage solutions and processing capabilities designed to manage extensive networks of nodes and edges efficiently. Alongside graph DBs, graph query languages (graph QLs) have been designed to facilitate sophisticated query operations on these large graphs and their subgraphs. These tools excel at executing operations that involve deep traversals, pattern matching, and dynamic data aggregation, which are typical in graph analytics. Common use cases for graph DBs and graph analytics include social network analysis, recommendation systems, fraud detection, and complex network management. In these scenarios, the ability to quickly and efficiently analyze complex relationships within large sets of data is crucial.
Some canonical use cases for graph DBs and QLs:
- Centrality analysis — Identify the most influential people within a social network. Involves centrality measures such as Degree Centrality, Betweenness Centrality, and Eigenvector Centrality
- Community detection — Segment the network into communities or clusters where members are more densely connected internally than with the rest of the network. Involves graph clustering algorithms and edge-betweenness community detection.
- Pathfinding — Find the shortest path between two nodes to understand the degrees of separation between individuals. Involves algorithms like Dijkstra’s or A* (A-star) for shortest path calculations.
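To make two of these analytics concrete, here is a small sketch on a hypothetical five-person network. It computes degree centrality and finds a shortest path. Because the toy graph is unweighted, the sketch uses breadth-first search as a simpler stand-in for Dijkstra's algorithm; all names in the network are invented.

```python
from collections import deque

# Hypothetical social network as an undirected adjacency map.
network = {
    "ana":  {"ben", "cara", "dev"},
    "ben":  {"ana", "cara"},
    "cara": {"ana", "ben", "eli"},
    "dev":  {"ana"},
    "eli":  {"cara"},
}

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path in an unweighted graph, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph[node]):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None

centrality = degree_centrality(network)
path = shortest_path(network, "dev", "eli")
```

Note that the centrality calculation visits every node in the network, and the path search may need to as well. Whole-graph workloads like these are what graph DBs are optimized for.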
Of course, there are many other use cases of sophisticated graph querying and graph analytics that traditional graph tools were designed for and excel at. But the examples given here, as well as many others, are very different from the graph use cases we see today in GenAI applications.
Knowing all of this, why would we start building a graph RAG system using a graph DB that added vector storage and search as a secondary feature… when modern vector stores are perfectly capable of supporting all of the graph operations that we need for graph RAG? We shouldn’t, and we dig more into how vector stores work with graph operations in the next section.
Both graph RAG and vector search operate locally
Previously, I listed “neighborhood exploration” as one application for graphs in GenAI use cases, but conceptually speaking, it can be considered a broad umbrella term under which you can find virtually all graph use cases within GenAI. In other words, when we use graphs with GenAI, we are almost certainly exploring only neighborhoods — and very rarely a whole graph or large parts of graphs. At most, we explore subgraphs that are quite small relative to the whole graph.
In graph theory, a “neighborhood” refers to the set of nodes adjacent to a given node within a graph, as defined by direct links or edges. So, retrieving neighbors of a node in a knowledge graph should result in a set of items or concepts that are directly related to the starting node. Similarly, in vector search, standard implementations return “approximate nearest neighbors” (ANN) in semantic vector space, meaning that the documents in the results set are those most closely related to the query, in a semantic sense. (ANN is “approximate” because making it exact is much slower and more expensive.)
So, vector search and graph traversal a few steps from a starting node are both looking for “nearest neighbors”, where “nearest” has a different meaning in each of the two cases. Vector search finds the nearest semantic neighbors and graph traversal finds graph neighbors. If integrated well, the two can pull together documents that are related in both semantic ways and a wide variety of non-semantic ways that are limited only by how you construct your knowledge graph.
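The two meanings of “nearest” can be illustrated with a short sketch. The 2-D embeddings and the links below are hypothetical, and the similarity search is an exact brute-force scan standing in for the ANN index a real vector store would use.

```python
import math

# Hypothetical 2-D embeddings and links; real embeddings have far more dimensions.
embeddings = {
    "doc_a": [0.9, 0.1],
    "doc_b": [0.8, 0.2],
    "doc_c": [0.1, 0.9],
}
links = {"doc_a": {"doc_c"}, "doc_b": set(), "doc_c": {"doc_a"}}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def semantic_neighbors(doc_id, k=1):
    """Exact nearest neighbors by cosine similarity (a brute-force stand-in for ANN)."""
    others = [d for d in embeddings if d != doc_id]
    others.sort(key=lambda d: cosine(embeddings[doc_id], embeddings[d]), reverse=True)
    return others[:k]

def graph_neighbors(doc_id):
    """Nodes joined to doc_id by a direct edge."""
    return sorted(links[doc_id])

semantic = semantic_neighbors("doc_a")  # closest in vector space: doc_b
linked = graph_neighbors("doc_a")       # directly linked: doc_c
```

For `doc_a`, the semantic neighbor and the graph neighbor are different documents, so combining the two retrieval methods returns context that neither would return alone.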
The important point here is that graph RAG is entirely concerned with exploring local neighborhoods, whether graph or vector, just as RAG always has been on the purely vector side. The implication is that our graph RAG software stack should be built on a foundation that excels at local neighborhood search and retrieval, because all of our queries in GenAI apps are focused on specific areas of knowledge that do not require comprehensive explorations or analytics of the entire knowledge graph.
A-la-carte graph tools: adopt only what you need
Returning to the “laundry list” of graph tools from the beginning of this article, let’s have a closer look at when you might want to adopt them as part of your graph RAG stack, or not.
Knowledge graphs
- When to adopt — Always, in some form. A knowledge graph is a core part of graph RAG.
- When to avoid — Never, unless getting rid of graph RAG in favor of plain RAG.
Entity and keyword extraction tools
- When to adopt — When building a knowledge graph directly from textual content where automated extraction can efficiently populate your graph with relevant entities and keywords.
- When to avoid — If your data doesn’t lend itself well to automated extraction or when alternative methods like document linking, manual curation, or specialized parsers better suit your data and use case.
Graph traversal algorithms
- When to adopt — Always. A simple graph traversal algorithm is necessary for graph RAG, e.g. typically a simple walk of depth 1–3 from the starting node.
- When to avoid — While basic traversal is necessary, avoid overly complex algorithms unless your use case specifically demands advanced graph navigational capabilities.
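A simple walk of this kind fits in a few lines. The sketch below is one possible depth-limited breadth-first traversal over a hypothetical adjacency map; the node names and the `walk` helper are illustrative assumptions, not part of any specific library.

```python
from collections import deque

# Hypothetical knowledge graph as an adjacency map (node -> directly linked nodes).
graph = {
    "query_doc": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": ["e"],
    "d": [],
    "e": ["f"],
    "f": [],
}

def walk(graph, start, max_depth=2):
    """Breadth-first walk returning each reachable node with its hop distance, up to max_depth."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depths[node] == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in graph.get(node, []):
            if neighbor not in depths:
                depths[neighbor] = depths[node] + 1
                queue.append(neighbor)
    return depths

neighborhood = walk(graph, "query_doc", max_depth=2)
```

With `max_depth=2`, nodes `e` and `f` are never visited, so the cost of the walk depends on the size of the local neighborhood rather than on the size of the whole graph.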
Property graph implementations
- When to adopt — When your project requires sophisticated modeling of complex relationships and properties within edges that go well beyond basic linkage.
- When to avoid — For most standard graph RAG implementations where such complexity in relationship modeling isn’t required. Simpler graph models typically suffice.
Graph databases
- When to adopt — When dealing with extensive, complex queries and needing to perform advanced graph analytics and traversals that surpass the capabilities of standard systems.
- When to avoid — If your graph RAG system does not engage in complex, extensive, graph-specific operations. Adopting a graph database in such scenarios can lead to unnecessary system complexity and resource allocation.
Graph query languages (Graph QLs)
- When to adopt — If adopting graph DBs. When complex querying of graph data is critical for your application, allowing sophisticated manipulation and retrieval of interconnected data.
- When to avoid — For simpler graph RAG setups where basic retrieval methods suffice, incorporating a graph QL might over-complicate the architecture.
Graph node embedding algorithms
- When to adopt — When you have a graph, and want to convert graph nodes into vectors. This is a specialized use case with advantages and disadvantages. See the popular algorithm node2vec.
- When to avoid — If your system does not require searching graph nodes as vectors.
Vector stores
- When to adopt: Always. Necessary, as they serve as the foundation for storing and searching high-dimensional vector representations crucial for RAG systems.
- When to avoid — Never.
Each component’s inclusion should align with the specific needs and complexities of your graph RAG system, ensuring that every adopted technology adds value and enhances system performance without unnecessary complexity.
Requirements of a minimal graph RAG system
Considering the above notes on graph tools and techniques, these are the core components required for any graph RAG system:
- Vector store — Essential for any RAG framework, the vector store is even more crucial in graph RAG for maintaining the scalability and efficiency of document retrieval. Vector stores provide the infrastructure for storing and searching through documents embedded in a semantic vector space, which is fundamental to the retrieval process in RAG systems.
- Knowledge graph — The defining concept of graph RAG vs plain RAG, the knowledge graph links key terms and concepts that semantic vector search might miss. This graph is vital for expanding the context and enhancing the relational data available to the RAG system, thus justifying its central role in graph RAG.
- Graph traversal — A simple graph traversal algorithm is necessary to navigate the knowledge graph. This component doesn’t need to be overly complex, as graph RAG primarily requires exploring local neighborhoods or small subgraphs directly related to the query, rather than deep or wide-ranging graph navigations.
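To show how these three components can fit together, here is a deliberately small sketch that keeps everything in memory. The `store` dictionary, its `vec` and `links` fields, and the function names are all hypothetical. A production system would replace the brute-force scan with a vector store's ANN search and store the links as document metadata.

```python
import math

# Hypothetical in-memory "vector store": each record holds text, an embedding, and links.
store = {
    "pricing": {"text": "Plans and pricing", "vec": [1.0, 0.0, 0.0], "links": ["refunds"]},
    "refunds": {"text": "Refund policy", "vec": [0.0, 1.0, 0.0], "links": ["contact"]},
    "contact": {"text": "Contact support", "vec": [0.0, 0.0, 1.0], "links": []},
    "careers": {"text": "Open positions", "vec": [0.1, 0.1, 0.9], "links": []},
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def vector_search(query_vec, k=1):
    """Brute-force similarity search; a real vector store would use an ANN index."""
    ranked = sorted(store, key=lambda d: cosine(query_vec, store[d]["vec"]), reverse=True)
    return ranked[:k]

def graph_rag_retrieve(query_vec, k=1, depth=1):
    """Seed with vector search, then follow links outward for `depth` hops."""
    results = vector_search(query_vec, k)
    frontier = list(results)
    for _ in range(depth):
        next_frontier = []
        for doc_id in frontier:
            for linked in store[doc_id]["links"]:
                if linked not in results:
                    results.append(linked)
                    next_frontier.append(linked)
        frontier = next_frontier
    return results

query = [0.9, 0.1, 0.0]  # hypothetical embedding of a question about pricing
plain = vector_search(query)
expanded = graph_rag_retrieve(query, k=1, depth=2)
```

Plain vector search returns only the `pricing` document, while the graph-expanded retrieval also pulls in `refunds` and `contact` by following links, even though their embeddings are far from the query.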
For specialized use cases, or if the minimal implementation isn’t performing well enough, more graph tools and capabilities can be added — some important considerations are outlined in the next section.
Start with vector, add “graph” as needed — not the other way around
When working with GenAI use cases, the foundations of knowledge are in vector space. We use vector-optimized tools like vector stores because they operate directly with the language of LLMs and other GenAI models — vectors. Our implementations of GenAI applications should be vector-first, because the most important vector operations (e.g. approximate nearest neighbor search) are expensive in both time and money, so we should optimize these for performance and efficiency. Adding graph to a GenAI application should be just that: adding graph capabilities to your existing vector-optimized infrastructure. Moving from vector-optimized to graph-native infrastructure may be needed in some specific use cases, but in the vast majority of cases it complicates the tech stack and makes deployment more challenging.
When starting with a typical graph RAG implementation and considering the addition of more complex graph tools and capabilities, it is important to carefully evaluate the particular challenges and requirements of the use case, rather than relying on the common notion that more sophisticated or complex graph tools are inherently better for any graph use case.
Here are some key considerations:
- Locality of graph operations — In graph RAG, graph operations are predominantly local, involving only simple traversals within immediate neighborhoods and small subgraphs. This approach typically does not benefit from complex graph algorithms that might overcomplicate the retrieval process.
- Capability of vector stores for graph operations — Modern vector stores are quite capable of performing necessary graph operations, especially when the operations are not overly complex. This allows for a seamless integration where vector and graph technologies complement each other without the need for a separate graph database.
- Scalability and efficiency of modern vector stores — Vector stores are designed to handle large-scale document data sets with high efficiency, making them ideal for the backbone of a RAG system where quick retrieval is paramount. Using graph capabilities directly within the vector store can also accommodate necessary graph operations without sacrificing performance.
- Complexity of graph DBs, QLs, and analytics — Introducing a graph database into the stack can complicate the software architecture unnecessarily. Given that the graph requirements in graph RAG typically do not require sophisticated large-graph operations, leveraging the existing capabilities of vector stores to handle these needs can be more efficient and keeps the system architecture simpler.
Each addition should be considered carefully to ensure it directly addresses a specific need without introducing undue complexity or overhead. This strategic approach ensures that enhancements are justified by tangible improvements in functionality or performance.
Simple ways to start doing graph RAG
For a straightforward and illustrative example of how to do graph RAG without any specialized graph tools beyond an open-source graph vector store implementation in LangChain, see my previous article in Towards Data Science. Or, for a broader view of how to get started, see this guide to graph RAG.
by Brian Godsey, Ph.D. (LinkedIn) — mathematician, data scientist and engineer // AI and ML products at DataStax // Wrote the book Think Like a Data Scientist
A Graph Too Far: Graph RAG Doesn’t Require Every Graph Tool was originally published in Towards Data Science on Medium.