Natural Language Processing (NLP) has an abundance of intuitively explained tutorials with code, such as Andrej Karpathy's Neural Networks: Zero to Hero, the viral The Illustrated Transformer and The Annotated Transformer, and Umar Jamil's YouTube series dissecting SOTA models along with its companion repo, among others.
When it comes to chatbots and question-answering systems, use cases are business-dependent, and productionizing agentic/LLM/ML workflows can be quite complex. Recently, I used Databricks and Mosaic AI foundation models to build a RAG (Retrieval-Augmented Generation) application. In this blog, I will explain the system design from a high-level perspective: how to use LLMs with RAG on a custom dataset, leveraging Azure Databricks. I learned to build this RAG application from the Databricks demos. I am still learning and would appreciate any feedback on the post!
Creating a chatbot that can answer questions about papers, summarize them, extract conclusions, or generate short abstracts will make researchers' lives easier.
Databricks is a Platform-as-a-Service (PaaS) integrated with cloud providers like AWS, Azure, and GCP. It offers a comprehensive platform for running machine learning and data engineering workflows, which makes development straightforward. I am using the Azure Databricks service because it provides ready-to-use foundation models, a simple way to create vector indices, Unity Catalog for storing trained/fine-tuned models and vector databases, and tools for model inference. It's an all-in-one platform that is both versatile and easy to use.
The first step in building the chatbot is to collect a dataset of CVPR papers. You can store PDF files in Azure Databricks' Unity Catalog, which provides secure and scalable data storage. CVPR papers can be sourced from the official site or GitHub repo. The Unity Catalog allows for the organization and management of the dataset, ensuring accessibility during the model's training and inference stages.
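As a rough sketch of this step, the papers could be downloaded and written into a Unity Catalog volume; the volume path `/Volumes/main/cvpr_rag/raw_pdfs` and the paper URLs below are placeholders, not the exact ones used in my setup.

```python
import requests
from pathlib import Path

# Hypothetical Unity Catalog volume path; replace catalog/schema/volume with your own.
VOLUME_PATH = Path("/Volumes/main/cvpr_rag/raw_pdfs")

# Illustrative paper URLs (e.g., from the CVF open-access site); not a real list.
paper_urls = [
    "https://openaccess.thecvf.com/content/CVPR2024/papers/example_paper_01.pdf",
    "https://openaccess.thecvf.com/content/CVPR2024/papers/example_paper_02.pdf",
]

for url in paper_urls:
    file_name = url.rsplit("/", 1)[-1]
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    # Volumes are mounted as regular paths on Databricks, so plain file I/O works.
    (VOLUME_PATH / file_name).write_bytes(response.content)
    print(f"Saved {file_name} ({len(response.content)} bytes)")
```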
Once the dataset is prepared, the PDF files are chunked into smaller, manageable text blocks. This is necessary because most large language models (LLMs) perform better with smaller, context-specific inputs than with entire documents. Each document is split into sections such as the abstract, the conclusion, or individual paragraphs.
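A minimal chunking sketch, assuming `pypdf` is available on the cluster and using a simple fixed-size character splitter (real pipelines often split on sections or sentences instead); the volume path is the same placeholder as above.

```python
from pathlib import Path
from pypdf import PdfReader  # assumes pypdf is installed on the cluster

def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 100):
    """Split text into overlapping character windows (a simple stand-in for
    fancier section- or sentence-aware splitters)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

records = []
for pdf_path in Path("/Volumes/main/cvpr_rag/raw_pdfs").glob("*.pdf"):
    # Extract text page by page; extract_text() can return None for empty pages.
    full_text = " ".join(page.extract_text() or "" for page in PdfReader(pdf_path).pages)
    for idx, chunk in enumerate(chunk_text(full_text), start=1):
        records.append({"file_name": pdf_path.name, "text": chunk, "chunk_idx": idx})
```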
To manage this effectively, we will create two tables:
Table 1: Chunks Table. This table stores the individual chunks of text, each associated with a unique chunk_id and a reference to the source document.
| chunk_id | file_name | text | chunk_idx |
|---|---|---|---|
| 001 | paper_01.pdf | “this paper explores...” | 1 |
| 002 | paper_01.pdf | “this architecture will...” | 2 |
| 003 | paper_02.pdf | “in this study we...” | 1 |
Each text field contains a portion of the document text. This table allows the RAG model to query specific chunks during the question-answering process.
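Here is one way this table could be created as a Delta table in Unity Catalog with PySpark, reusing the `records` list from the chunking sketch; the table name `main.cvpr_rag.paper_chunks` is a placeholder, and `spark` is the session that Databricks notebooks provide.

```python
from pyspark.sql import functions as F

# Build a DataFrame from the chunk records produced earlier and assign chunk IDs.
chunks_df = (
    spark.createDataFrame(records)  # columns: file_name, text, chunk_idx
    .withColumn("chunk_id", F.monotonically_increasing_id())
    .select("chunk_id", "file_name", "text", "chunk_idx")
)

# Persist as a Delta table in Unity Catalog (placeholder catalog/schema names).
chunks_df.write.mode("append").saveAsTable("main.cvpr_rag.paper_chunks")

# The RAG pipeline can later look up specific chunks, for example:
spark.sql(
    "SELECT text FROM main.cvpr_rag.paper_chunks WHERE file_name = 'paper_01.pdf'"
).show()
```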
Table 2: Processed Files Table. This table keeps track of files that have already been processed and chunked, ensuring that when new files are added to the dataset, only those new files are processed, avoiding duplication of effort.
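One possible way to implement that bookkeeping, sketched under the same placeholder names: keep a `processed_files` table and anti-join the files in the volume against it, so each run only chunks PDFs it has not seen before.

```python
from pathlib import Path
from pyspark.sql import functions as F

# All PDFs currently in the volume (placeholder path).
all_files_df = spark.createDataFrame(
    [(p.name,) for p in Path("/Volumes/main/cvpr_rag/raw_pdfs").glob("*.pdf")],
    ["file_name"],
)

# Files already chunked, according to the bookkeeping table (assumed to exist).
processed_df = spark.table("main.cvpr_rag.processed_files")

# Keep only files that have not been processed yet.
new_files_df = all_files_df.join(processed_df, on="file_name", how="left_anti")

# ... chunk only the files in new_files_df here, then record them as processed:
(new_files_df
    .withColumn("processed_at", F.current_timestamp())
    .write.mode("append")
    .saveAsTable("main.cvpr_rag.processed_files"))
```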