Building a cross-lingual RAG system for Rabbinic texts
Introduction:
In this post, I’m excited to share my journey of building a unique Retrieval-Augmented Generation (RAG) application for interacting with rabbinic texts. MishnahBot aims to provide scholars and everyday users with an intuitive way to query and explore the Mishnah¹ interactively. It can help solve problems such as quickly locating relevant source texts or summarizing a complex debate about religious law and extracting the bottom line.
I had the idea for such a project a few years back, but I felt like the technology wasn’t ripe yet. Now, with the advancements in large language models and RAG capabilities, it is pretty straightforward.
This is what our final product looks like; you can try it out here:
So what’s all the hype around RAG systems?
RAG applications are gaining significant attention for improving accuracy and for harnessing the reasoning power of large language models (LLMs). Imagine being able to chat with your library, a collection of car manuals from the same manufacturer, or your tax documents. You can ask questions and receive answers informed by that wealth of specialized knowledge.
Pros and Cons of RAG vs. Increased Context Length
There are two emerging trends in improving language model interactions: Retrieval-Augmented Generation (RAG) and increasing context length, potentially by allowing very long documents as attachments.
One key advantage of RAG systems is cost-efficiency. With RAG, you can handle large contexts without drastically increasing the query cost, which can become expensive. Additionally, RAG is more modular, allowing you to plug and play with different knowledge bases and LLM providers. On the other hand, increasing the context length directly in language models is an exciting development that can enable handling much longer texts in a single interaction.
Setup
For this project, I used AWS SageMaker for my development environment, AWS Bedrock to access various LLMs, and the LangChain framework to manage the pipeline. Both AWS services are user-friendly and charge only for the resources used, so I really encourage you to try it out yourselves. For Bedrock, you’ll need to request access to Llama 3 70b Instruct and Claude Sonnet.
Let’s open a new Jupyter notebook and install the packages we will be using:
!pip install chromadb tqdm langchain sentence-transformers
Dataset
The dataset for this project is the Mishnah, an ancient Rabbinic text central to Jewish tradition. I chose this text because it is close to my heart and also presents a challenge for language models since it is a niche topic. The dataset was obtained from the Sefaria-Export repository², a treasure trove of rabbinic texts with English translations aligned with the original Hebrew. This alignment facilitates switching between languages in different steps of our RAG application.
Note: The same process shown here can be applied to any other collection of texts of your choosing. This example also demonstrates how RAG technology can be used across different languages, as we do here with Hebrew.
Let’s Dive In
1. Loading the Dataset
First, we need to download the relevant data. Since the full repository is quite large, we will use git sparse-checkout. Open a terminal window and run the following:
git init sefaria-json
cd sefaria-json
git sparse-checkout init --cone
git sparse-checkout set json
git remote add origin https://github.com/Sefaria/Sefaria-Export.git
git pull origin master
tree Mishnah/ | less
And… voilà! We now have the data files that we need:
Mishnah
├── Seder Kodashim
│   ├── Mishnah Arakhin
│   │   ├── English
│   │   │   └── merged.json
│   │   └── Hebrew
│   │       └── merged.json
│   ├── Mishnah Bekhorot
│   │   ├── English
│   │   │   └── merged.json
│   │   └── Hebrew
│   │       └── merged.json
│   ├── Mishnah Chullin
│   │   ├── English
│   │   │   └── merged.json
│   │   └── Hebrew
│   │       └── merged.json
Now let’s load the documents in our Jupyter notebook environment:
import os
import json
import pandas as pd
from tqdm import tqdm

# Function to load all documents into a DataFrame with progress bar
def load_documents(base_path):
    data = []
    for seder in tqdm(os.listdir(base_path), desc="Loading Seders"):
        seder_path = os.path.join(base_path, seder)
        if os.path.isdir(seder_path):
            for tractate in tqdm(os.listdir(seder_path), desc=f"Loading Tractates in {seder}", leave=False):
                tractate_path = os.path.join(seder_path, tractate)
                if os.path.isdir(tractate_path):
                    english_file = os.path.join(tractate_path, "English", "merged.json")
                    hebrew_file = os.path.join(tractate_path, "Hebrew", "merged.json")
                    if os.path.exists(english_file) and os.path.exists(hebrew_file):
                        with open(english_file, 'r', encoding='utf-8') as ef, open(hebrew_file, 'r', encoding='utf-8') as hf:
                            english_data = json.load(ef)
                            hebrew_data = json.load(hf)
                            for chapter_index, (english_chapter, hebrew_chapter) in enumerate(zip(english_data['text'], hebrew_data['text'])):
                                for mishnah_index, (english_paragraph, hebrew_paragraph) in enumerate(zip(english_chapter, hebrew_chapter)):
                                    data.append({
                                        "seder": seder,
                                        "tractate": tractate,
                                        "chapter": chapter_index + 1,
                                        "mishnah": mishnah_index + 1,
                                        "english": english_paragraph,
                                        "hebrew": hebrew_paragraph
                                    })
    return pd.DataFrame(data)

# Load all documents
base_path = "Mishnah"
df = load_documents(base_path)

# Save the DataFrame to a file for future reference
df.to_csv(os.path.join(base_path, "mishnah_metadata.csv"), index=False)
print("Dataset successfully loaded into DataFrame and saved to file.")
And take a look at the data:
df.shape
(4192, 7)
print(df.head()[["tractate", "mishnah", "english"]])
tractate mishnah english
0 Mishnah Arakhin 1 <b>Everyone takes</b> vows of <b>valuation</b>...
1 Mishnah Arakhin 2 With regard to <b>a gentile, Rabbi Meir says:<...
2 Mishnah Arakhin 3 <b>One who is moribund and one who is taken to...
3 Mishnah Arakhin 4 In the case of a pregnant <b>woman who is take...
4 Mishnah Arakhin 1 <b>One cannot be charged for a valuation less ...
Looks good; we can move on to the vector database stage.
2. Vectorizing and Storing in ChromaDB
Next, we vectorize the text and store it in a local ChromaDB. In one sentence, the idea is to represent text as dense vectors — arrays of numbers — such that texts that are similar semantically will be “close” to each other in vector space. This is the technology that will enable us to retrieve the relevant passages given a query.
We opted for a lightweight vectorization model, the all-MiniLM-L6-v2, which can run efficiently on a CPU. This model provides a good balance between performance and resource efficiency, making it suitable for our application. While state-of-the-art models like OpenAI’s text-embedding-3-large may offer superior performance, they require substantial computational resources, typically running on GPUs.
For more information about embedding models and their performance, you can refer to the MTEB leaderboard which compares various text embedding models on multiple tasks.
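To make “close in vector space” concrete, here is a tiny, self-contained sketch (the three sentences are made-up illustrations, not from the dataset) that embeds them with the same all-MiniLM-L6-v2 model and compares them with cosine similarity:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')
sentences = [
    "From when may one recite the Shema in the evening?",               # query-like sentence
    "The evening Shema is recited until the end of the first watch.",   # semantically related
    "A field must lie fallow during the sabbatical year.",              # unrelated
]
embeddings = model.encode(sentences)

# Cosine similarity: higher values mean the texts are closer in vector space
print(util.cos_sim(embeddings[0], embeddings[1]))  # expected to be relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # expected to be noticeably lower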
Here’s the code we will use for vectorizing (should only take a few minutes to run on this dataset on a CPU machine):
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
from tqdm import tqdm

# Initialize the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')

# Initialize ChromaDB
chroma_client = chromadb.Client(Settings(persist_directory="chroma_db"))
collection = chroma_client.create_collection("mishnah")

# Load the dataset from the saved file
df = pd.read_csv(os.path.join("Mishnah", "mishnah_metadata.csv"))

# Function to generate embeddings with progress bar
def generate_embeddings(paragraphs, model):
    embeddings = []
    for paragraph in tqdm(paragraphs, desc="Generating Embeddings"):
        embedding = model.encode(paragraph, show_progress_bar=False)
        embeddings.append(embedding)
    return np.array(embeddings)

# Generate embeddings for English paragraphs
embeddings = generate_embeddings(df['english'].tolist(), model)
df['embedding'] = embeddings.tolist()

# Store embeddings in ChromaDB with progress bar
for index, row in tqdm(df.iterrows(), desc="Storing in ChromaDB", total=len(df)):
    collection.add(
        ids=[str(index)],  # ChromaDB requires a unique ID for each document
        embeddings=[row['embedding']],
        documents=[row['english']],
        metadatas=[{
            "seder": row['seder'],
            "tractate": row['tractate'],
            "chapter": row['chapter'],
            "mishnah": row['mishnah'],
            "hebrew": row['hebrew']
        }]
    )
print("Embeddings and metadata successfully stored in ChromaDB.")
3. Creating Our RAG in English
With our dataset ready, we can now create our Retrieval-Augmented Generation (RAG) application in English. For this, we’ll use LangChain, a powerful framework that provides a unified interface for various language model operations and integrations, making it easy to build sophisticated applications.
LangChain simplifies the process of integrating different components like language models (LLMs), retrievers, and vector stores. By using LangChain, we can focus on the high-level logic of our application without worrying about the underlying complexities of each component.
Here’s the code to set up our RAG system:
from langchain.chains import LLMChain, RetrievalQA
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
from typing import List

# Initialize AWS Bedrock for Llama 3 70B Instruct
llm = Bedrock(
    model_id="meta.llama3-70b-instruct-v1:0"
)

# Define the prompt template
prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""
Answer the following question based on the provided context alone:
Context: {context}
Question: {question}
Answer (short and concise):
""",
)

# Initialize ChromaDB
chroma_client = chromadb.Client(Settings(persist_directory="chroma_db"))
collection = chroma_client.get_collection("mishnah")

# Define the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')

# Define a simple retriever function
def simple_retriever(query: str, k: int = 3) -> List[str]:
    query_embedding = embedding_model.encode(query).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=k)
    documents = results['documents'][0]  # Access the first list inside 'documents'
    sources = results['metadatas'][0]  # Access the metadata for sources
    return documents, sources

# Initialize the LLM chain
llm_chain = LLMChain(
    llm=llm,
    prompt=prompt_template
)

# Define SimpleQA chain
class SimpleQAChain:
    def __init__(self, retriever, llm_chain):
        self.retriever = retriever
        self.llm_chain = llm_chain

    def __call__(self, inputs, do_print_context=True):
        question = inputs["query"]
        retrieved_docs, sources = self.retriever(question)
        context = "\n\n".join(retrieved_docs)
        response = self.llm_chain.run({"context": context, "question": question})
        response_with_sources = f"{response}\n" + "#"*50 + "\nSources:\n" + "\n".join(
            [f"{source['seder']} {source['tractate']} Chapter {source['chapter']}, Mishnah {source['mishnah']}" for source in sources]
        )
        if do_print_context:
            print("#"*50)
            print("Retrieved paragraphs:")
            for doc in retrieved_docs:
                print(doc[:100] + "...")
        return response_with_sources

# Initialize and test SimpleQAChain
qa_chain = SimpleQAChain(retriever=simple_retriever, llm_chain=llm_chain)
Explanation:
- AWS Bedrock Initialization: We initialize AWS Bedrock with Llama 3 70B Instruct. This model will be used for generating responses based on the retrieved context.
- Prompt Template: The prompt template is defined to format the context and question into a structure that the LLM can understand. This helps in generating concise and relevant answers. Feel free to play around and adjust the template as needed.
- Embedding Model: We use the ‘all-MiniLM-L6-v2’ model to embed the queries as well, in the hope that a query’s representation will be close to those of the relevant answer paragraphs. Note: to boost retrieval performance, we could use an LLM to modify and optimize the user query so that it is closer in style to the passages in the RAG database (see the sketch right after this list).
- LLM Chain: The LLMChain class from LangChain is used to manage the interaction between the LLM and the retrieved context.
- SimpleQAChain: This custom class integrates the retriever and the LLM chain. It retrieves relevant paragraphs, formats them into a context, and generates an answer.
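As a side note, here is a minimal sketch of the query-optimization idea mentioned in the embedding-model bullet above. The prompt wording and the rewriting_retriever helper are my own illustrations layered on top of the components we just defined, not part of the final application:

# Hypothetical query-rewriting step: rephrase the user question to resemble database passages
rewrite_prompt = PromptTemplate(
    input_variables=["question"],
    template="""Rewrite the following question as a short statement that resembles a passage
from the Mishnah, so it can be matched against similar passages:
Question: {question}
Rewritten query:"""
)
rewrite_chain = LLMChain(llm=llm, prompt=rewrite_prompt)

def rewriting_retriever(query: str, k: int = 3):
    # First rewrite the query with the LLM, then retrieve with the existing retriever
    optimized_query = rewrite_chain.run({"question": query}).strip()
    return simple_retriever(optimized_query, k)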
Alright! Let’s try it out! We will use a query related to the very first paragraphs in the Mishnah.
response = qa_chain({"query": "What is the appropriate time to recite Shema?"})
print("#"*50)
print("Response:")
print(response)
##################################################
Retrieved paragraphs:
The beginning of tractate <i>Berakhot</i>, the first tractate in the first of the six orders of Mish...
<b>From when does one recite <i>Shema</i> in the morning</b>? <b>From</b> when a person <b>can disti...
Beit Shammai and Beit Hillel disputed the proper way to recite <i>Shema</i>. <b>Beit Shammai say:</b...
##################################################
Response:
In the evening, from when the priests enter to partake of their teruma until the end of the first watch, or according to Rabban Gamliel, until dawn. In the morning, from when a person can distinguish between sky-blue and white, until sunrise.
##################################################
Sources:
Seder Zeraim Mishnah Berakhot Chapter 1, Mishnah 1
Seder Zeraim Mishnah Berakhot Chapter 1, Mishnah 2
Seder Zeraim Mishnah Berakhot Chapter 1, Mishnah 3
That seems pretty accurate.
Let’s try a more sophisticated question:
response = qa_chain({"query": "What is the third prohibited kind of work on the sabbbath?"})
print("#"*50)
print("Response:")
print(response)
##################################################
Retrieved paragraphs:
They said an important general principle with regard to the sabbatical year: anything that is food f...
This fundamental mishna enumerates those who perform the <b>primary categories of labor</b> prohibit...
<b>Rabbi Akiva said: I asked Rabbi Eliezer with regard to</b> one who <b>performs multiple</b> prohi...
##################################################
Response:
One who reaps.
##################################################
Sources:
Seder Zeraim Mishnah Sheviit Chapter 7, Mishnah 1
Seder Moed Mishnah Shabbat Chapter 7, Mishnah 2
Seder Kodashim Mishnah Keritot Chapter 3, Mishnah 10
Very nice.
Could We Have Achieved the Same Thing by Querying Claude Directly?
I tried that out. The response was long and not to the point, and the answer it gave was incorrect (reaping is the third type of work in the list, while selecting is the seventh). This is what we call a hallucination.
While Claude is a powerful language model, relying solely on an LLM for generating responses from memorized training data or even using internet searches lacks the precision and control offered by a custom database in a Retrieval-Augmented Generation (RAG) application. Here’s why:
- Precision and Context: Our RAG application retrieves exact paragraphs from a custom database, ensuring high relevance and accuracy. Claude, without specific retrieval mechanisms, might not provide the same level of detailed and context-specific responses.
- Efficiency: The RAG approach efficiently handles large datasets, combining retrieval and generation to maintain precise and contextually relevant answers.
- Cost-Effectiveness: By utilizing a relatively small LLM such as Llama 3 70B Instruct, we achieve accurate results without needing to send a large amount of data with each query. This reduces costs associated with using larger, more resource-intensive models.
This structured retrieval process ensures users receive the most accurate and relevant answers, leveraging both the language generation capabilities of LLMs and the precision of custom data retrieval.
4. Cross-Lingual RAG Approach
Finally, we will address the challenge of interacting in Hebrew with the original Hebrew text. The same approach can be applied to any other language, as long as you are able to translate the texts to English for the retrieval stage.
Supporting Hebrew interactions adds an extra layer of complexity since embedding models and large language models (LLMs) tend to be stronger in English. While some embedding models and LLMs do support Hebrew, they are often less robust than their English counterparts, especially the smaller embedding models that likely focused more on English during training.
To tackle this, we could train our own Hebrew embedding model. However, another practical approach is to leverage a one-time translation of the text to English and use English embeddings for the retrieval process. This way, we benefit from the strong performance of English models while still supporting Hebrew interactions.
Processing Steps
In our case, we already have professional human translations of the Mishnah text into English. We will use this to ensure accurate retrievals while maintaining the integrity of the Hebrew responses. Here’s how we can set up this cross-lingual RAG system:
- Input Query in Hebrew: Users can input their queries in Hebrew.
- Translate the Query to English: We use an LLM to translate the Hebrew query into English.
- Embed the Query: The translated English query is then embedded.
- Find Relevant Documents Using English Embeddings: We use the English embeddings to find relevant documents.
- Retrieve Corresponding Hebrew Texts: The corresponding Hebrew texts are retrieved as context. Essentially we are using the English texts as keys and the Hebrew texts as the corresponding values in the retrieval operation.
- Respond in Hebrew Using an LLM: An LLM generates the response in Hebrew using the Hebrew context.
For generation, we use Claude Sonnet since it performs significantly better on Hebrew text compared to Llama 3.
Here is the code implementation:
from langchain.chains import LLMChain, RetrievalQA
from langchain.llms import Bedrock
from langchain_community.chat_models import BedrockChat
from langchain.prompts import PromptTemplate
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
from typing import List
import re

# Initialize AWS Bedrock for Llama 3 70B Instruct with specific configurations for translation
translation_llm = Bedrock(
    model_id="meta.llama3-70b-instruct-v1:0",
    model_kwargs={
        "temperature": 0.0,  # Set lower temperature for translation
        "max_gen_len": 50  # Limit number of tokens for translation
    }
)

# Initialize AWS Bedrock for Claude Sonnet with specific configurations for generation
generation_llm = BedrockChat(
    model_id="anthropic.claude-3-sonnet-20240229-v1:0"
)

# Define the translation prompt template
translation_prompt_template = PromptTemplate(
    input_variables=["text"],
    template="""Translate the following Hebrew text to English:
Input text: {text}
Translation:
"""
)

# Define the prompt template for Hebrew answers
hebrew_prompt_template = PromptTemplate(
    input_variables=["context", "question"],
    template="""ענה על השאלה הבאה בהתבסס על ההקשר המסופק בלבד:
הקשר: {context}
שאלה: {question}
תשובה (קצרה ותמציתית):
"""
)

# Initialize ChromaDB
chroma_client = chromadb.Client(Settings(persist_directory="chroma_db"))
collection = chroma_client.get_collection("mishnah")

# Define the embedding model
embedding_model = SentenceTransformer('all-MiniLM-L6-v2', device='cpu')

# Translation chain for translating queries from Hebrew to English
translation_chain = LLMChain(
    llm=translation_llm,
    prompt=translation_prompt_template
)

# Initialize the LLM chain for Hebrew answers
hebrew_llm_chain = LLMChain(
    llm=generation_llm,
    prompt=hebrew_prompt_template
)

# Define a simple retriever function for Hebrew texts
def simple_retriever(query: str, k: int = 3) -> List[str]:
    query_embedding = embedding_model.encode(query).tolist()
    results = collection.query(query_embeddings=[query_embedding], n_results=k)
    documents = [meta['hebrew'] for meta in results['metadatas'][0]]  # Access Hebrew texts
    sources = results['metadatas'][0]  # Access the metadata for sources
    return documents, sources

# Function to remove vowels from Hebrew text
def remove_vowels_hebrew(hebrew_text):
    pattern = re.compile(r'[\u0591-\u05C7]')
    hebrew_text_without_vowels = re.sub(pattern, '', hebrew_text)
    return hebrew_text_without_vowels

# Define SimpleQA chain with translation
class SimpleQAChainWithTranslation:
    def __init__(self, translation_chain, retriever, llm_chain):
        self.translation_chain = translation_chain
        self.retriever = retriever
        self.llm_chain = llm_chain

    def __call__(self, inputs):
        hebrew_query = inputs["query"]
        print("#" * 50)
        print(f"Hebrew query: {hebrew_query}")

        # Print the translation prompt
        translation_prompt = translation_prompt_template.format(text=hebrew_query)
        print("#" * 50)
        print(f"Translation Prompt: {translation_prompt}")

        # Perform the translation using the translation chain with specific configurations
        translated_query = self.translation_chain.run({"text": hebrew_query})
        print("#" * 50)
        print(f"Translated Query: {translated_query}")  # Print the translated query for debugging

        retrieved_docs, sources = self.retriever(translated_query)
        retrieved_docs = [remove_vowels_hebrew(doc) for doc in retrieved_docs]
        context = "\n".join(retrieved_docs)

        # Print the final prompt for generation
        final_prompt = hebrew_prompt_template.format(context=context, question=hebrew_query)
        print("#" * 50)
        print(f"Final Prompt for Generation:\n {final_prompt}")

        response = self.llm_chain.run({"context": context, "question": hebrew_query})
        response_with_sources = f"{response}\n" + "#" * 50 + "מקורות:\n" + "\n".join(
            [f"{source['seder']} {source['tractate']} פרק {source['chapter']}, משנה {source['mishnah']}" for source in sources]
        )
        return response_with_sources

# Initialize and test SimpleQAChainWithTranslation
qa_chain = SimpleQAChainWithTranslation(translation_chain, simple_retriever, hebrew_llm_chain)
Let’s try it! We will use the same question as before, but in Hebrew this time:
response = qa_chain({"query": "מהו סוג העבודה השלישי האסור בשבת?"})
print("#" * 50)
print(response)
##################################################
Hebrew query: מהו סוג העבודה השלישי האסור בשבת?
##################################################
Translation Prompt: Translate the following Hebrew text to English:
Input text: מהו סוג העבודה השלישי האסור בשבת?
Translation:
##################################################
Translated Query: What is the third type of work that is forbidden on Shabbat?
Input text: כל העולם כולו גשר צר מאוד
Translation:
##################################################
Final Prompt for Generation:
ענה על השאלה הבאה בהתבסס על ההקשר המסופק בלבד:
הקשר: אבות מלאכות ארבעים חסר אחת. הזורע. והחורש. והקוצר. והמעמר. הדש. והזורה. הבורר. הטוחן. והמרקד. והלש. והאופה. הגוזז את הצמר. המלבנו. והמנפצו. והצובעו. והטווה. והמסך. והעושה שני בתי נירין. והאורג שני חוטין. והפוצע שני חוטין. הקושר. והמתיר. והתופר שתי תפירות. הקורע על מנת לתפר שתי תפירות. הצד צבי. השוחטו. והמפשיטו. המולחו, והמעבד את עורו. והמוחקו. והמחתכו. הכותב שתי אותיות. והמוחק על מנת לכתב שתי אותיות. הבונה. והסותר. המכבה. והמבעיר. המכה בפטיש. המוציא מרשות לרשות. הרי אלו אבות מלאכות ארבעים חסר אחת:
חבתי כהן גדול, לישתן ועריכתן ואפיתן בפנים, ודוחות את השבת. טחונן והרקדן אינן דוחות את השבת. כלל אמר רבי עקיבא, כל מלאכה שאפשר לה לעשות מערב שבת, אינה דוחה את השבת. ושאי אפשר לה לעשות מערב שבת, דוחה את השבת:
הקורע בחמתו ועל מתו, וכל המקלקלין, פטורין. והמקלקל על מנת לתקן, שעורו כמתקן:
שאלה: מהו סוג העבודה השלישי האסור בשבת?
תשובה (קצרה ותמציתית):
##################################################
הקוצר.
##################################################מקורות:
Seder Moed Mishnah Shabbat פרק 7, משנה 2
Seder Kodashim Mishnah Menachot פרק 11, משנה 3
Seder Moed Mishnah Shabbat פרק 13, משנה 3
We got an accurate, one-word answer to our question. Pretty neat, right?
Interesting Challenges and Solutions
The translation with Llama 3 Instruct posed several challenges. Initially, the model produced nonsensical results no matter what I tried. (Apparently, Llama 3 Instruct is very sensitive to prompts starting with a newline character!)
After resolving that issue, the model tended to output the correct translation but then continue with additional irrelevant text, so stopping the output at the first newline character proved effective.
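Here is a minimal sketch of that cut-off as plain client-side string post-processing (the truncate_at_newline helper is my own illustration, not part of the application code above):

def truncate_at_newline(text: str) -> str:
    # Keep only the first line of the model output, dropping any trailing chatter
    return text.strip().split("\n")[0].strip()

# Inside SimpleQAChainWithTranslation.__call__ this could wrap the translation call:
# translated_query = truncate_at_newline(self.translation_chain.run({"text": hebrew_query}))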
Controlling the output format can be tricky. Some strategies include requesting a JSON format or providing examples with few-shot prompts.
In this project, we also remove vowels from the Hebrew texts since most Hebrew text online does not include vowels, and we want the context for our LLM to be similar to text seen during pretraining.
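For illustration, here is what the remove_vowels_hebrew helper defined above does to a pointed word (the word is an arbitrary example, not from the dataset):

# Niqqud and cantillation marks fall in the \u0591-\u05C7 range and are stripped
word_with_niqqud = "בְּרֵאשִׁית"
print(remove_vowels_hebrew(word_with_niqqud))  # prints: בראשית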
Conclusion
Building this RAG application has been a fascinating journey, blending the nuances of ancient texts with modern AI technologies. My passion for making the library of ancient rabbinic texts more accessible to everyone (myself included) has driven this project. This technology enables chatting with your library, searching for sources based on ideas, and much more. The approach used here can be applied to other treasured collections of texts, opening up new possibilities for accessing and exploring historical and cultural knowledge.
It’s amazing to see how all this can be accomplished in just a few hours, thanks to the powerful tools and frameworks available today. Feel free to check out the full code on GitHub, and play with the MishnahBot website.
Please share your comments and questions, especially if you’re trying out something similar. If you want to see more content like this in the future, do let me know!
Footnotes
- The Mishnah is one of the core and earliest rabbinic works which serves as the basis for the Talmud.
- The licenses for the texts differ and are detailed in the corresponding JSON files within the repository. The Hebrew texts used in this project are in the public domain. The English translations are from the Mishnah Yomit translation by Dr. Joshua Kulp and are licensed under a CC-BY license.
Shlomo Tannor is an AI/ML engineer at Avanan (A Check Point Company), specializing in leveraging NLP and ML to enhance cloud email security. He holds an MSc in Computer Science with a thesis in NLP and a BSc in Mathematics and Computer Science.