LLM-based AI Chatbot
Introduction
Conversational AI chatbots have become very popular in many domains, ranging from customer service and virtual assistants to entertainment and education. However, achieving human-like conversational capabilities remains a significant challenge. Large language models make this easier: we can fine-tune one on a domain-specific dataset, and here our chatbot will mimic the personality of the users whose chat history it is trained on.
I have used a smaller model, GPT-2, so that the code works on older GPUs, but you can easily modify it to use larger models such as Llama 2, Falcon 7B, or Mistral 7B if you have the compute power. There are also techniques like PEFT, LoRA, and QLoRA that enable these large models to be trained on consumer-level GPUs, or even on free Google Colab with a basic T4; I will create a separate tutorial on that.
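For example, swapping in a larger model mostly comes down to loading a different checkpoint through the generic Auto classes and reusing the same training pipeline. A rough sketch follows; the checkpoint names are illustrative, and gated models such as Llama 2 require access approval on the Hugging Face Hub.
# Rough sketch (not part of the original code): loading a larger causal LM.
# The checkpoint names are illustrative; pick one your hardware can handle.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"   # e.g. "tiiuae/falcon-7b", "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# The data preparation and Trainer setup below stay largely the same.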
Preparing the Data
The first step in building our custom chatbot involves collecting and preprocessing a dataset of conversational text. This dataset serves as the foundation for training the GPT-2 model to understand and generate human-like responses. Here I have used a WhatsApp group chat export as the dataset.
import re
import pandas as pd
In this section, we import the libraries required for data preprocessing. The re module provides support for regular expressions, which we'll use to parse timestamps in the conversation data. The pandas library is a powerful tool for data manipulation and analysis, which we'll use to structure the conversation data.
# Define the path to the conversation data
conversation = "chat\_chat.txt"
date_list = []
time_list = []
sender_list = []
message_list = []
Here, we define the path to the conversation data file, which contains raw textual data of chat conversations. We also initialize empty lists to store parsed information such as dates, times, senders, and messages.
# Parse the conversation data, extracting relevant information
with open(conversation, encoding="utf-8") as fp:
    while True:
        line = fp.readline()
        # Define the pattern for detecting timestamps
        pattern = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{1,2}, [0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2})'
        result = re.findall(pattern, line)
        if result:
            # Split the line to extract date, time, sender, and message
            splitline = line.split('] ')
            date, time = splitline[0].split(',')
            date = date[1:]
            date_list.append(date)
            time = time[1:]
            time_list.append(time)
            # Identify sender and message components
            if re.findall(r"changed the subject|was added|added|changed this group's icon|left|changed their phone number|deleted this group's icon", splitline[1]):
                sender = None
                sender_list.append(sender)
                message = None
                message_list.append(message)
            else:
                sender, message = splitline[1].split(': ', 1)
                sender_list.append(sender)
                message_list.append(message)
        if not line:
            break
In this block of code, we open the conversation data file and iterate through each line. We use a regular expression pattern to detect timestamps in each line. If a timestamp is found, we split the line to extract date, time, sender, and message components. We also filter out non-message lines such as group events or administrative actions. Parsed information is stored in respective lists.
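For illustration, here is how the pattern and the splits behave on a single made-up line in the iOS-style [date, time] Sender: message export format (your export may differ slightly by platform and locale):
# Illustrative only: a made-up exported line, not from the real chat file.
sample = "[12/11/23, 21:05:33] Alice: See you tomorrow!"
pattern = '([0-9]{1,2}/[0-9]{1,2}/[0-9]{1,2}, [0-9]{1,2}:[0-9]{1,2}:[0-9]{1,2})'
print(re.findall(pattern, sample))   # ['12/11/23, 21:05:33']
head, body = sample.split('] ')      # "[12/11/23, 21:05:33" and "Alice: See you tomorrow!"
print(head[1:].split(','))           # ['12/11/23', ' 21:05:33']
print(body.split(': ', 1))           # ['Alice', 'See you tomorrow!']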
# Organize the parsed data into a structured DataFrame
df_short = pd.DataFrame(list(zip(sender_list, message_list)), columns=['Sender', 'Message'])
df_short = df_short.dropna(axis=0, how='all')
# Format the conversation data for further processing
chat_data = [f"[NAME] {name} [MESSAGE] {message}" for name, message in
zip(df_short["Sender"].str.strip(), df_short["Message"].str.strip())]
mydf = pd.DataFrame(chat_data)
mydf.to_csv('chat\chat_data_tok.txt', sep='\t', index=False)
After parsing the data, we organize it into a structured DataFrame using pandas. We then format the conversation data for further processing, combining sender and message into the standardized format [NAME] sender [MESSAGE] message. Finally, we save the formatted data to a text file for later use.
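To make the target format concrete, a parsed row with sender Alice and message "See you tomorrow!" becomes the training line shown in the comment below (values are made up; inspect your own file to confirm):
# Example of a single formatted training line (made-up values):
# [NAME] Alice [MESSAGE] See you tomorrow!
print(chat_data[0])   # print the first formatted line from your own chat export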
Fine-tuning the GPT-2 Model
Once we have our preprocessed dataset, the next step is to fine-tune the GPT-2 model on this data. Fine-tuning involves training the model on our custom dataset to adapt it to our specific domain or task. Here’s how we fine-tune the GPT-2 model using the Hugging Face Transformers library:
import torch
from torch.utils.data import random_split
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments, TextDataset, \
DataCollatorForLanguageModeling
In this section, we import necessary libraries and modules required for fine-tuning the GPT-2 model. These include the PyTorch library for tensor computations, the Transformers library from Hugging Face for easy integration of pre-trained models, and specific modules for handling text datasets and language modeling tasks.
# Initialize the pre-trained GPT-2 model
model = GPT2LMHeadModel.from_pretrained('gpt2')
Here, we initialize the GPT-2 model using the GPT2LMHeadModel.from_pretrained() method, loading the pre-trained weights provided by the Hugging Face model hub. This model has been pre-trained on a vast corpus of text data and will serve as the foundation for our chatbot.
# Tokenize the training data
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '[PAD]'})
We initialize the GPT-2 tokenizer using GPT2Tokenizer.from_pretrained(), which ensures that the tokenizer aligns with the model architecture. Additionally, we add a special padding token [PAD] to the tokenizer. This token is used to pad sequences to a consistent length during training, ensuring uniform input dimensions.
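One caveat worth noting: add_special_tokens() grows the tokenizer's vocabulary but not the model's embedding matrix. If the [PAD] token ever reaches the model (with the fixed-size blocks produced by TextDataset below it may not), the embeddings should be resized to match:
# Resize the embedding matrix so the newly added [PAD] token has a valid row.
model.resize_token_embeddings(len(tokenizer))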
# Prepare the training dataset
dataset = TextDataset(
    tokenizer=tokenizer,
    file_path="chat\clean_chat_short.csv",
    block_size=128
)
Here, we prepare the training dataset for fine-tuning the GPT-2 model. We use the TextDataset class provided by the Transformers library, passing in the tokenizer and the file path to our preprocessed conversation data. The block_size parameter determines the maximum sequence length for each input sample.
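As a quick sanity check, you can decode one block back into text to see exactly what the model will be trained on (the output naturally depends on your own chat file):
# Each item of TextDataset is a tensor of 128 token ids; decode one for inspection.
print(len(dataset), "blocks of", len(dataset[0]), "tokens")
print(tokenizer.decode(dataset[0]))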
# Split the dataset into training and evaluation sets
train_size = int(0.9 * len(dataset))
eval_size = len(dataset) - train_size
train_dataset, eval_dataset = random_split(dataset, [train_size, eval_size])
We split the dataset into training and evaluation sets using the random_split() function from PyTorch. In this case, 90% of the dataset is allocated for training (train_size), and the remaining 10% is allocated for evaluation (eval_size).
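If you want the split to be reproducible across runs (useful when resuming training later), you can pass a seeded generator to random_split(); the seed value here is arbitrary:
# Optional: seed the split so train/eval membership is stable across runs.
generator = torch.Generator().manual_seed(42)
train_dataset, eval_dataset = random_split(dataset, [train_size, eval_size], generator=generator)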
# Define the data collator for language modeling
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=False
)
The DataCollatorForLanguageModeling class is used to collate batches of input data for language modeling tasks. Here, we specify mlm=False since we are not performing masked language modeling, but rather causal language modeling, where the model predicts the next token in a sequence.
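To see what the collator actually feeds the model, you can collate a small batch by hand; with mlm=False the labels are a copy of the input ids (any padded positions would be masked with -100 so the loss ignores them):
# Collate a two-block batch manually and inspect the tensors.
batch = data_collator([dataset[0], dataset[1]])
print(batch["input_ids"].shape)                           # torch.Size([2, 128])
print(torch.equal(batch["input_ids"], batch["labels"]))   # True: labels mirror the inputs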
# Configure the training arguments
training_args = TrainingArguments(
    output_dir=r'.\llmchat\results',
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir=r'.\llmchat\logs',
    logging_steps=1000,
    save_steps=5000,
    evaluation_strategy='steps',
    eval_steps=5000,
    load_best_model_at_end=True,
    use_mps_device=torch.backends.mps.is_available(),
)
Here, we configure the training arguments for the fine-tuning process. These include the number of training epochs, batch size, logging settings, evaluation strategy, and device selection: use_mps_device checks whether Apple's Metal Performance Shaders (MPS) backend is available, which lets training run on the GPU of Apple Silicon Macs. Note that the Windows paths are written as raw strings so backslashes are not interpreted as escape sequences.
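If you are unsure what hardware training will end up using, a quick check beforehand can save a surprise (MPS requires an Apple Silicon Mac and a recent PyTorch build):
# Optional: report which accelerators PyTorch can see on this machine.
print("CUDA available:", torch.cuda.is_available())
print("Apple MPS available:", torch.backends.mps.is_available())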
# Initialize the Trainer with training parameters
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
We initialize the Trainer object, which orchestrates the training process. The Trainer is provided with the initialized GPT-2 model, training arguments, training dataset, evaluation dataset, and data collator for language modeling.
# Commence the training process
trainer.train(resume_from_checkpoint=True)
Finally, we commence the training process by invoking the train() method of the Trainer object. With resume_from_checkpoint set to True, training resumes from the latest checkpoint in the output directory; on a fresh run with no checkpoints yet this raises an error, so omit the argument (or pass False) the first time.
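If you would rather have a single call that works both for the first run and for later resumes, one option (a convenience sketch, not part of the original code) is to check for existing checkpoints first:
import os

# Convenience sketch: resume only if a checkpoint folder already exists in output_dir.
out_dir = training_args.output_dir
has_checkpoint = os.path.isdir(out_dir) and any(
    name.startswith("checkpoint-") for name in os.listdir(out_dir)
)
trainer.train(resume_from_checkpoint=has_checkpoint)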
# Save the trained model and tokenizer for future use
model.save_pretrained("gpt2_fine_tune")
tokenizer.save_pretrained("gpt2_fine_tune")
After training, we save the fine-tuned GPT-2 model and tokenizer to disk for future use. This allows us to deploy the trained model for generating contextually aware responses in real-world chatbot applications.
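As a quick smoke test of the saved model, you can load it back and generate a reply in the same [NAME] ... [MESSAGE] format used during training (a minimal sketch; the sender name and sampling settings are illustrative):
# Minimal generation sketch: load the fine-tuned model and sample a continuation.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2_fine_tune")
model = GPT2LMHeadModel.from_pretrained("gpt2_fine_tune")
model.eval()

prompt = "[NAME] Alice [MESSAGE]"   # illustrative sender name from the training data
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.pad_token_id,
)
print(tokenizer.decode(output[0], skip_special_tokens=True))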
Conclusion
This is the complete walkthrough of the main code. For the full code, including the GUI implementation, see my GitHub page: https://github.com/mfz16/Gen-AI-LLM-