Building a RAG-Based Question-Answering System
Introduction: In this article, we build a RAG-based question-answering system using the LangChain library and the HuggingFace Transformers library.
On their own, large language models can only answer from what they saw during training and struggle with questions about private or newly added documents. The Retrieval-Augmented Generation (RAG) pipeline, introduced by researchers at Facebook AI (Lewis et al., 2020), provides an elegant solution to this problem by combining the strengths of two powerful components: a retriever and a generator.
The RAG pipeline consists of two main components:
- Retriever: The retriever is a neural information retrieval model that is trained to retrieve relevant passages from a large corpus given a question. Popular retrievers include Dense Passage Retriever (DPR) and Contriever.
- Generator: The generator is a large language model, such as BART or T5, that is fine-tuned to generate answers to questions based on the retrieved passages.
The RAG pipeline works as follows:
- The user provides a question to the system.
- The retriever model identifies the top-k most relevant passages from the corpus based on the question.
- The retrieved passages are fed as additional context to the generator model, along with the original question.
- The generator model generates an answer by attending to both the question and the retrieved passages.
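Conceptually, the steps above boil down to a short loop. Here is a minimal sketch in Python, where the retriever and generator objects are placeholders standing in for whichever models you plug in, not a specific library's API:
# Conceptual sketch of the RAG flow; `retriever` and `generator` are placeholders.
def rag_answer(question, retriever, generator, k=5):
    # 1. Retrieve the top-k most relevant passages for the question
    passages = retriever.retrieve(question, top_k=k)
    # 2. Build a prompt containing both the question and the retrieved context
    context = "\n\n".join(p.text for p in passages)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    # 3. Let the generator answer while attending to the retrieved context
    return generator.generate(prompt)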
In this tutorial, we’ll walk through how to build a RAG-based question-answering system using the LangChain library and the HuggingFace Transformers library. This system will allow us to answer questions over a corpus of documents, leveraging the power of large language models like “google/gemma-1.1-7b-it”.
Setup
First, let’s install the required packages:
# Install PyTorch
conda install pytorch torchvision torchaudio
# Install transformers library
pip install transformers
# Install other required packages (FAISS for the vector store; bitsandbytes and accelerate for 4-bit loading)
pip install langchain sentence_transformers huggingface-hub faiss-cpu bitsandbytes accelerate
Next, we’ll check if a CUDA-enabled GPU is available and set the device accordingly:
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using device:", device)
Loading and Splitting Documents
Add a directory called documents containing the text files we want to use for question-answering. We'll load these documents using LangChain's DirectoryLoader and TextLoader:
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Windows-style path to the folder containing the .txt documents
folder_path = r'vector_stores\documents'
text_loader_kwargs = {'autodetect_encoding': True}

mixed_loader = DirectoryLoader(
    path=folder_path,
    glob='**/*.txt',  # load every .txt file in the folder (recursively)
    loader_cls=TextLoader,
    loader_kwargs=text_loader_kwargs
)
doc = mixed_loader.load()
Since large language models work better with shorter text sequences, we’ll split the loaded documents into smaller chunks using the RecursiveCharacterTextSplitter from LangChain:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(doc)
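Before moving on, it's worth sanity-checking the split, for example by printing how many chunks were produced and previewing the first one via the page_content attribute of LangChain's Document objects:
# Quick sanity check on the split
print(f"Loaded {len(doc)} documents, split into {len(docs)} chunks")
print(docs[0].page_content[:200])  # preview the first 200 characters of the first chunk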
Creating Vector Embeddings and Vector Store
To enable efficient similarity search over our document chunks, we’ll create vector embeddings of the chunks using HuggingFaceEmbeddings from LangChain and store them in a FAISS vector store. We set k=10 so the retriever returns the top 10 most similar chunks for each query:
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Embed each chunk and index the vectors in FAISS
encoder = HuggingFaceEmbeddings()  # uses a default sentence-transformers model
db = FAISS.from_documents(documents=docs, embedding=encoder)

# Retriever that returns the 10 most similar chunks for each query
retriever = db.as_retriever(search_kwargs={"k": 10})
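You can test the retriever on its own before wiring it into the full chain. A small sketch using this (legacy) LangChain retriever interface; the sample query is just an illustration:
# Sanity check: fetch the 10 most similar chunks for a sample query
sample_query = "What is the capital of Italy?"
relevant_chunks = retriever.get_relevant_documents(sample_query)
for i, chunk in enumerate(relevant_chunks[:3]):
    print(f"--- Chunk {i + 1} ---")
    print(chunk.page_content[:200])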
Loading the Language Model
Next, we’ll load the “google/gemma-1.1-7b-it” language model from HuggingFace. This is a 7B-parameter, instruction-tuned model (the “it” suffix stands for instruction-tuned) that we’ll use to generate answers based on the relevant document chunks:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantize the model weights to 4-bit to reduce GPU memory usage
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model = AutoModelForCausalLM.from_pretrained("google/gemma-1.1-7b-it", quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-1.1-7b-it")  # the tokenizer does not take a quantization config
To reduce the memory footprint of the model, we’re using BitsAndBytesConfig to quantize the weights to 4-bit precision. You can skip the quantization if your GPU has enough memory.
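For reference, if you do have enough GPU memory (roughly 16 GB or more for a 7B model in half precision), a non-quantized load would look something like this instead:
# Alternative: load the model without quantization (needs more GPU memory)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-1.1-7b-it",
    torch_dtype=torch.bfloat16,  # half-precision weights
    device_map="auto",           # place the weights on the available GPU(s); requires accelerate
)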
Setting up the Question-Answering Pipeline
Finally, we’ll wrap the language model and tokenizer in a transformers text-generation pipeline, hand that to LangChain’s HuggingFacePipeline for convenient use, and combine it with the retriever (backed by the FAISS vector store) into a RetrievalQA chain from LangChain:
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA

# Wrap the model and tokenizer in a text-generation pipeline, then hand it to LangChain
text_generation = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256, do_sample=True, temperature=0.7)
llm = HuggingFacePipeline(pipeline=text_generation)
qa = RetrievalQA.from_chain_type(llm=llm, retriever=retriever, chain_type="stuff")
The resulting qa object can now be used to answer questions based on the documents loaded from the vector_stores\documents directory. When a question is asked, the system uses the FAISS vector store to retrieve the most relevant document chunks, and then uses the language model to generate an answer based on those chunks.
query = "What is the capital of Italy?"
answer = qa.run(query)
print(answer)
This is just a simple example of how to build a RAG question-answering system. You can further customize and extend it to suit your specific needs, for example by using different language models, fine-tuning the models on your domain-specific data, swapping in a different vector store, or loading other types of documents.
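As one small example of such an extension, RetrievalQA can also return the chunks an answer was grounded in via the return_source_documents flag; a sketch building on the objects created above:
# Return the retrieved chunks along with the generated answer
qa_with_sources = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    chain_type="stuff",
    return_source_documents=True,
)
result = qa_with_sources({"query": "What is the capital of Italy?"})
print(result["result"])
for source_doc in result["source_documents"]:
    print(source_doc.metadata.get("source"))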
Check my GitHub for more details: github.com/mfz16