Common challenges and architectural components to enable scaling
1. Introduction
1.1. Overview of RAG
Those of you who have been immersed in generative AI and its large-scale applications beyond personal productivity apps have likely come across the notion of Retrieval-Augmented Generation (RAG). The RAG architecture consists of two key components: a retrieval component, which uses a vector database to run an index-based search over a large corpus of documents, and a generation component, where the retrieved content is added to the prompt and a large language model (LLM) produces a response grounded in that richer context.
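To make the flow concrete, here is a minimal sketch of that two-step loop in Python. The `embed`, `search`, and `generate` callables are placeholders for whatever embedding model, vector store, and LLM client you actually use; this is a mental model, not a production implementation.

```python
from typing import Callable, List, Sequence

def answer_query(
    query: str,
    embed: Callable[[str], Sequence[float]],              # text -> embedding vector
    search: Callable[[Sequence[float], int], List[str]],  # vector, k -> top-k chunk texts
    generate: Callable[[str], str],                       # prompt -> LLM completion
    top_k: int = 5,
) -> str:
    """Minimal RAG flow: retrieve relevant chunks, then generate a grounded answer."""
    query_vector = embed(query)               # 1. embed the user query
    chunks = search(query_vector, top_k)      # 2. index-based retrieval over the corpus
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)                   # 3. grounded generation
```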
Whether you are building customer-facing chatbots to answer repetitive questions and reduce the workload on customer service agents, or building a co-pilot that helps engineers navigate complex user manuals step by step, RAG has become a key archetype for applying LLMs. It enables LLMs to provide contextually relevant responses grounded in the truth of hundreds or even millions of documents, reducing hallucinations and improving the reliability of LLM-based applications.
1.2. Why scale from Proof of Concept (POC) to production
If you are asking this question, I might challenge you to answer why you are even building a POC if there is no intent of getting it to production. Pilot purgatory is a common risk for organisations that start to experiment but then get stuck in experimentation mode. Remember that POCs are expensive, and true value realisation only happens once you go into production and do things at scale: either freeing up resources, making them more efficient, or creating additional revenue streams.
2. Key challenges in scaling RAG
2.1. Performance
Performance challenges in RAG systems come in various flavours. The first is retrieval speed, which is generally not the primary challenge unless your knowledge corpus spans millions of documents, and even then it can be solved by setting up the right infrastructure (although we remain limited by inference times). The second is getting the "right" chunks fed to the LLM for generation, with a high level of precision and recall: the poorer the retrieval, the less contextually relevant the LLM's response will be.
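One way to keep the retrieval side honest is to track precision and recall against a small, hand-labelled set of query-to-relevant-chunk pairs. Below is a minimal sketch of recall@k; the labelled pairs and the `retrieve` function are assumptions you would supply from your own pipeline, and precision@k can be computed analogously.

```python
from typing import Callable, Dict, List, Set

def recall_at_k(
    retrieve: Callable[[str, int], List[str]],   # query, k -> ids of retrieved chunks
    labelled: Dict[str, Set[str]],               # query -> ids of chunks known to be relevant
    k: int = 5,
) -> float:
    """Average fraction of relevant chunks that appear in the top-k retrieved results.

    Assumes every labelled query has at least one relevant chunk.
    """
    scores = []
    for query, relevant in labelled.items():
        retrieved = set(retrieve(query, k))
        scores.append(len(retrieved & relevant) / len(relevant))
    return sum(scores) / len(scores)
```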
2.2. Data Management
We have all heard the age-old saying "garbage in, garbage out" (GIGO). RAG is nothing but a set of tools at our disposal; the real value comes from the data itself. Because RAG systems work with unstructured data, they bring their own set of challenges, including document version control and format conversion (e.g. PDF to text), among others.
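As an illustration of the format-conversion step, here is a minimal sketch that extracts text from a PDF and splits it into overlapping character chunks. It assumes the open-source `pypdf` package is installed; a real ingestion pipeline would also attach metadata such as document version and source alongside each chunk.

```python
# Minimal ingestion sketch: PDF -> text -> overlapping chunks.
from pypdf import PdfReader

def pdf_to_chunks(path: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Extract text from a PDF and split it into overlapping character chunks."""
    reader = PdfReader(path)
    text = "\n".join(page.extract_text() or "" for page in reader.pages)
    step = chunk_size - overlap            # requires chunk_size > overlap
    return [text[i : i + chunk_size] for i in range(0, max(len(text), 1), step)]
```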
2.3. Risk
One of the biggest reasons corporations hesitate to move from testing the waters to jumping in is the set of risks that come with AI-based systems. Hallucinations are certainly reduced with the use of RAG, but they remain non-zero. There are other associated risks too, including bias, toxicity, and regulatory risk, which could have long-term implications.
2.4. Integration into existing workflows
Building an offline solution is easier, but bringing in the end users' perspective is crucial to make sure the solution does not feel like a burden. No user wants to go to another screen to use the "new AI feature": users want AI features built into their existing workflows so the technology is assistive, not disruptive to their day-to-day.
2.5. Cost
Well, this one seems obvious, doesn't it? Organisations implement GenAI use cases to create business impact. If the benefits are lower than planned, or there are cost overruns, the impact is severely diminished or even completely negated.
3. Architectural components needed for Scaling
It would be unfair to only talk about challenges without also addressing the "so what do we do". There are a few essential components you can add to your architecture stack to overcome, or at least diminish, some of the problems outlined above.
3.1. Scalable vector databases
A lot of teams, rightfully, start with open-source vector databases like ChromaDB, which are great for POCs as they are easy to use and customise. However, they may face challenges with large-scale deployments. This is where scalable vector databases come in (such as Pinecone, Weaviate, and Milvus). Optimised for high-dimensional vector search and built on Approximate Nearest Neighbour (ANN) techniques, they enable fast (sub-millisecond), accurate retrieval even as the dataset grows into the millions or billions of vectors. They also offer APIs, plugins, and SDKs that allow for easier workflow integration, and they are horizontally scalable. Depending on the platform you are working on, it might also make sense to explore the vector databases offered by Databricks or AWS.
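The upsert-and-query interface looks broadly similar across these databases. As a rough illustration, here is a sketch using ChromaDB (the kind of setup many POCs start with); the managed, scalable options expose comparable APIs on top of the operational features discussed above.

```python
# Minimal vector-store sketch with ChromaDB; scalable databases offer a similar
# add/upsert-then-query interface, plus horizontal scaling and ANN tuning.
import chromadb

client = chromadb.Client()                      # in-memory client, fine for experimentation
collection = client.create_collection("docs")

collection.add(
    ids=["doc-1", "doc-2"],
    documents=["How to reset your password", "How to update billing details"],
)

results = collection.query(query_texts=["I forgot my password"], n_results=1)
print(results["documents"])  # top matching chunk(s) for the query
```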
3.2. Caching Mechanisms
The concept of caching has been around almost as long as the internet, dating back to the 1960s. The same concept applies to generative AI: if there are a large number of queries, perhaps in the millions (very common in the customer service function), it is likely that many of them are identical or extremely similar. Caching allows you to avoid sending a request to the LLM when a recently cached response can be returned instead. This serves two purposes: reduced costs, as well as better response times for common queries.
This can be implemented as an in-memory cache (e.g. Redis or Memcached), a disk cache for less frequent queries, or a distributed cache (e.g. Redis Cluster). Some model providers, such as Anthropic, also offer prompt caching as part of their APIs.
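Below is a minimal sketch of an exact-match cache using the `redis` Python client; `call_llm` is a placeholder for your full RAG pipeline. A semantic cache, which also catches near-duplicate queries, would key on embeddings rather than a hash of the text.

```python
# Exact-match response cache in front of the RAG pipeline.
import hashlib
from typing import Callable

import redis

cache = redis.Redis(host="localhost", port=6379)

def cached_answer(query: str, call_llm: Callable[[str], str], ttl_seconds: int = 3600) -> str:
    key = "rag:" + hashlib.sha256(query.strip().lower().encode()).hexdigest()
    hit = cache.get(key)
    if hit is not None:
        return hit.decode()            # cache hit: skip retrieval and generation entirely
    answer = call_llm(query)           # cache miss: run the full RAG pipeline
    cache.setex(key, ttl_seconds, answer)
    return answer
```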
3.3. Advanced Search Techniques
While not as crisply defined an architectural component, multiple techniques can help elevate search, enhancing both efficiency and accuracy. Some of these include:
- Hybrid Search: Instead of relying only on semantic search (using vector databases) or keyword search, use a combination of the two to boost your search (a simple fusion sketch follows this list).
- Re-ranking: Use an LLM or SLM to calculate a relevancy score between the query and each search result, then re-rank the results to keep only the most relevant ones. This is particularly useful for complex domains, or for queries where many documents are returned. One example of this is Cohere's Rerank.
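As referenced above, one simple and widely used way to merge keyword and semantic result lists is Reciprocal Rank Fusion (RRF). The sketch below is illustrative: the document IDs are made up and k = 60 is just a common default.

```python
# Reciprocal Rank Fusion (RRF): merge a keyword-search ranking and a
# vector-search ranking into a single hybrid ranking.
from collections import defaultdict
from typing import List

def rrf_merge(keyword_ids: List[str], vector_ids: List[str], k: int = 60) -> List[str]:
    """Score each document by the sum of 1 / (k + rank) across both result lists."""
    scores = defaultdict(float)
    for ranking in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Example: documents ranked differently by keyword search and by embedding similarity
print(rrf_merge(["d3", "d1", "d7"], ["d1", "d9", "d3"]))  # ['d1', 'd3', 'd9', 'd7']
```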
3.4. Responsible AI layer
Your Responsible AI modules have to be designed to mitigate bias, ensure transparency, align with your organisation's ethical values, continuously monitor user feedback, and track compliance with regulation, among other things relevant to your industry or function. There are many ways to go about it, but fundamentally this has to be enabled programmatically, with human oversight. A few ways it can be done:
- Pre-processing: Filter user queries before they are ever sent to the foundational model. This may include checks for bias, toxicity, unintended use, etc.
- Post-processing: Apply another set of checks after the results come back from the FMs, before exposing them to the end users.
These checks can be enabled as small reusable modules that you buy from an external provider, or build and customise for your own needs. One common way organisations have approached this is to use carefully engineered prompts and foundational models to orchestrate a workflow and prevent a result from reaching the end user until it passes all checks.
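A bare-bones version of this pre-/post-processing wrapper might look like the sketch below. The `passes_input_checks`, `passes_output_checks`, and `call_model` hooks are hypothetical; in practice they would be backed by your own classifiers, rules, or an external guardrails product.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that request."

def guarded_answer(
    query: str,
    passes_input_checks: Callable[[str], bool],    # e.g. toxicity / unintended-use filters
    call_model: Callable[[str], str],              # the RAG pipeline / foundational model call
    passes_output_checks: Callable[[str], bool],   # e.g. bias / PII / groundedness checks
) -> str:
    if not passes_input_checks(query):             # pre-processing gate
        return REFUSAL
    response = call_model(query)
    if not passes_output_checks(response):         # post-processing gate
        return REFUSAL
    return response
```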
3.5. API Gateway
An API Gateway can serve multiple purposes, helping manage costs as well as various aspects of Responsible AI (a minimal sketch follows this list):
- Provide a unified interface to interact and experiment with foundational models
- Help develop a fine-grained view of costs and usage by team, use case, or cost centre, including rate limiting, speed throttling, and quota management
- Serve as a Responsible AI layer, filtering out unintended requests/data before they ever hit the models
- Enable audit trails and access control
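As a toy illustration of the usage-tracking and quota ideas above, here is a sketch using FastAPI. Real deployments would typically use a dedicated gateway or LLM proxy product, with usage stored in a shared datastore; `call_model` is a placeholder.

```python
# Toy gateway: per-team usage tracking and a simple daily quota in front of the model.
from collections import defaultdict
from fastapi import FastAPI, Header, HTTPException

app = FastAPI()
DAILY_QUOTA = 1000
usage = defaultdict(int)   # team id -> requests today (use a shared store in production)

def call_model(prompt: str) -> str:   # placeholder for the actual foundational model call
    return f"echo: {prompt}"

@app.post("/v1/generate")
def generate(prompt: str, x_team_id: str = Header(...)):
    if usage[x_team_id] >= DAILY_QUOTA:
        raise HTTPException(status_code=429, detail="Team quota exceeded")
    usage[x_team_id] += 1                      # per-team audit/usage trail
    return {"team": x_team_id, "response": call_model(prompt)}
```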
4. Is this enough, or do we need more?
Of course not. There are a few other things that also need to be kept in mind, including but not limited to:
- Does the use case occupy a strategic place in your roadmap of use cases? This gives you leadership backing and the right investments to support development and maintenance.
- Clear evaluation criteria to measure the performance of the application along the dimensions of accuracy, cost, latency, and responsible AI
- Improve business processes to keep knowledge up to date, maintain version control etc.
- Architect the RAG system so that it only accesses documents based on the end user's permission levels, to prevent unauthorised access (see the sketch after this list).
- Use design thinking to integrate the application into the workflow of the end user e.g. if you are building a bot to answer technical questions over Confluence as the knowledge base, should you build a separate UI, or integrate this with Teams/Slack/other applications users already use?
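For the permissions point above, one common pattern is to tag each chunk with the groups allowed to see it and filter on the caller's groups at query time. Below is a sketch using ChromaDB metadata filters; the group names and field are illustrative, and managed vector databases offer equivalent filtering.

```python
# Permission-aware retrieval: filter the vector search by the caller's access groups.
import chromadb

client = chromadb.Client()
collection = client.create_collection("kb")

collection.add(
    ids=["hr-1", "eng-1"],
    documents=["HR leave policy details", "Internal engineering runbook"],
    metadatas=[{"access_group": "hr"}, {"access_group": "engineering"}],
)

def retrieve_for_user(query: str, user_groups: list[str], k: int = 3):
    return collection.query(
        query_texts=[query],
        n_results=k,
        where={"access_group": {"$in": user_groups}},   # only documents the caller may see
    )

print(retrieve_for_user("How do I request leave?", ["hr"])["documents"])
```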
5. Conclusion
RAG is a prominent use case archetype, and one of the first that organisations try to implement. Scaling RAG from POC to production comes with challenges, but with careful planning and execution many of these can be overcome. Some can be solved by tactical investment in architecture and technology; others require better strategic direction and tactful planning. As LLM inference costs continue to drop, whether through more efficient models or heavier adoption of open-source alternatives, cost barriers should become less of a concern for many new use cases.