Hillary Kipkemoi
This project focused on implementing a RAG system for Contract Q&A that enables users to chat with a contract and ask questions about it. It involved building, evaluating, and improving the RAG Q&A system through small, iterative refinements. The system was intended to become part of a powerful contract assistant by LizzyAI, whose long-term goal is a fully autonomous contract bot.
Approach:
Data understanding: I analyzed the contract documents to identify patterns in the contract data, building a clear understanding of how the information is structured and surfacing key insights.
Text chunking strategies: Implemented LangChain's RecursiveCharacterTextSplitter and semantic chunking with OpenAI embeddings to break large legal documents into manageable, meaningful chunks, optimizing the retrieval process for accuracy and efficiency.
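A minimal sketch of the two chunking strategies, assuming LangChain's splitters (module paths vary slightly across LangChain versions), a contract_text string loaded elsewhere, and illustrative chunk sizes:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# contract_text is assumed to hold the raw text of one contract.

# Baseline: recursively split on paragraph, sentence, then word boundaries.
recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
recursive_chunks = recursive_splitter.split_text(contract_text)

# Semantic chunking: break wherever adjacent sentences drift apart in embedding space.
semantic_splitter = SemanticChunker(OpenAIEmbeddings())
semantic_chunks = semantic_splitter.split_text(contract_text)
```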
Embedding model experiments: I compared a variety of embedding models from Hugging Face, Cohere, and OpenAI to identify the best model for semantic chunking and retrieval of contract documents.
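One way to run that comparison, sketched under the assumption of a small hand-labelled set of questions and their relevant clauses (the model names and the questions/clauses/relevant_idx variables are examples, not project artifacts):

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings
from langchain_community.embeddings import HuggingFaceEmbeddings

def recall_at_k(embedder, questions, clauses, relevant_idx, k=3):
    """Fraction of questions whose relevant clause lands in the top-k retrieved chunks."""
    clause_vecs = np.array(embedder.embed_documents(clauses))
    clause_vecs /= np.linalg.norm(clause_vecs, axis=1, keepdims=True)
    hits = 0
    for question, rel in zip(questions, relevant_idx):
        q_vec = np.array(embedder.embed_query(question))
        scores = clause_vecs @ (q_vec / np.linalg.norm(q_vec))
        hits += int(rel in np.argsort(scores)[::-1][:k])
    return hits / len(questions)

models = {
    "openai": OpenAIEmbeddings(model="text-embedding-3-small"),
    "minilm": HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2"),
}
for name, embedder in models.items():
    print(name, recall_at_k(embedder, questions, clauses, relevant_idx))
```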
Query Processing and Expansion: Enhanced the retrieval system by using advanced query processing techniques, including query expansion and reciprocal rank fusion, to improve the relevance and accuracy of the retrieved passages.
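Reciprocal rank fusion is compact enough to sketch directly; the constant k=60 below is the commonly cited default rather than a project-specific value, and the clause ids are illustrative:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list.

    A chunk scores 1 / (k + rank) in every list it appears in, so chunks
    ranked highly by several query variants float to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Results retrieved for the original query plus two expanded variants.
fused = reciprocal_rank_fusion([
    ["clause_12", "clause_04", "clause_09"],
    ["clause_04", "clause_12", "clause_07"],
    ["clause_09", "clause_04", "clause_01"],
])
```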
Backend Integration with AutoGen Agents: Utilized AutoGen to orchestrate dynamic interactions between user and assistant agents in the RAG pipeline, ensuring seamless execution of the retrieval and generation tasks.
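A minimal sketch of the agent pairing, assuming the pyautogen package; retrieve_chunks is a hypothetical helper standing in for the retrieval step, and the model name is illustrative:

```python
import autogen

llm_config = {"config_list": [{"model": "gpt-4o-mini"}]}  # API key read from the environment

assistant = autogen.AssistantAgent(
    name="contract_assistant",
    system_message="Answer questions about the contract using only the provided excerpts.",
    llm_config=llm_config,
)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",
    code_execution_config=False,
)

question = "What is the termination notice period?"
context = "\n\n".join(retrieve_chunks(question))  # hypothetical retrieval helper
user_proxy.initiate_chat(
    assistant,
    message=f"Contract excerpts:\n{context}\n\nQuestion: {question}",
)
```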
Frontend Development: Built a user-friendly interface with React, connected to the backend via FastAPI and websockets, allowing for real-time interactions and prompt handling of user queries.
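On the FastAPI side, the WebSocket handler can be as small as the sketch below; answer_question is a hypothetical wrapper around the RAG pipeline and the route path is illustrative:

```python
from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws/chat")
async def chat(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            question = await websocket.receive_text()   # message sent by the React client
            answer = await answer_question(question)    # hypothetical RAG pipeline wrapper
            await websocket.send_text(answer)
    except WebSocketDisconnect:
        pass
```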
The primary evaluation metrics came from the RAGAS (Retrieval-Augmented Generation Assessment Suite) framework, providing a comprehensive assessment of the system's performance through faithfulness, answer correctness, context precision, and context recall. These automated metrics were complemented by human-in-the-loop evaluations to keep the overall assessment robust.
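A minimal sketch of the RAGAS evaluation step, assuming the classic RAGAS column schema (field names differ in newer releases); the sample row is illustrative:

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_correctness, context_precision, context_recall

# Each row pairs a question with the generated answer, the retrieved contexts,
# and a human-written reference answer.
eval_data = Dataset.from_dict({
    "question": ["What is the termination notice period?"],
    "answer": ["Either party may terminate with 30 days' written notice."],
    "contexts": [["Section 8.2: Either party may terminate this agreement "
                  "with thirty (30) days' written notice."]],
    "ground_truth": ["30 days' written notice from either party."],
})

results = evaluate(
    eval_data,
    metrics=[faithfulness, answer_correctness, context_precision, context_recall],
)
print(results)
```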
This project developed an automated prompt tuning system for enterprise RAG systems. Leveraging LangChain and OpenAI's language models, it streamlined the generation, evaluation, and ranking of prompts to enhance the accuracy and relevance of RAG-based applications. The system generates diverse test cases, creates candidate prompts, evaluates their performance, and provides a user-friendly interface for selecting the most effective prompts.
Methods and Tools:
Prompt Generation: The system utilized LangChain and OpenAI's language models to generate a diverse set of candidate prompts based on task descriptions and relevant context.
Test Case Generation: Test cases consisting of task descriptions and expected outputs were created to simulate real-world scenarios and provide a benchmark for prompt performance evaluation.
Evaluation: Prompt effectiveness was evaluated through a combination of methods:
RAGAS (Retrieval Augmented Generation Assessment Suite): The RAGAS library was employed to provide automated metrics for evaluating the quality and relevance of answers generated by the RAG system using different prompts. This included assessing answer relevance, coherence, and factuality.
Human-in-the-Loop Evaluation: Human evaluators assessed the generated answers to provide subjective feedback on their usefulness and accuracy, offering valuable insights beyond automated metrics.
Elo Rating System: The Elo rating system was adapted to rank prompts based on their performance in both automated and human evaluations. Prompts that consistently generated higher-quality answers received higher Elo ratings.
Monte Carlo Simulations: Monte Carlo simulations were used to estimate the performance of different prompts by simulating multiple head-to-head battles. This helped identify the most promising prompts for further refinement.
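The Elo update and the simulated battles are straightforward to sketch; the judge function below is a hypothetical stand-in for the RAGAS-plus-human comparison, and the K-factor of 32 is the conventional default rather than a project-specified value:

```python
import random

def update_elo(winner, loser, ratings, k=32):
    """Standard Elo update: shift both ratings toward the observed outcome."""
    expected_win = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
    ratings[winner] += k * (1 - expected_win)
    ratings[loser] -= k * (1 - expected_win)

def rank_prompts(prompts, test_cases, judge, n_battles=200):
    """Monte Carlo ranking: repeatedly pit two random prompts against each other
    on a random test case and let the judge pick the better answer."""
    ratings = {prompt: 1200.0 for prompt in prompts}
    for _ in range(n_battles):
        a, b = random.sample(prompts, 2)
        winner = judge(a, b, random.choice(test_cases))  # returns the winning prompt
        loser = b if winner == a else a
        update_elo(winner, loser, ratings)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)
```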
The primary evaluation metric was the Elo rating, which integrated both RAGAS metrics and human judgment scores. This comprehensive approach ensured a balanced assessment of prompt performance, considering both objective metrics and subjective feedback.
A Redash chatbot streamlines data exploration for non-technical team members, transforming complex SQL queries into natural-language conversations. This empowers stakeholders to directly access and analyze key business metrics within Redash and make informed decisions without needing SQL expertise.
Approach:
Data Analysis: We analyzed the data structure and defined categories for efficient storage and retrieval, then performed Exploratory Data Analysis (EDA) to uncover trends.
Data preprocessing: Reshaped long data formats for better analysis, handled duplicates, and saved the preprocessed data in CSV format for database loading.
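A rough sketch of that preprocessing with pandas; the file names, column names, and the long-to-wide pivot direction are assumptions rather than details from the project:

```python
import pandas as pd

raw = pd.read_csv("raw_metrics_long.csv")   # hypothetical input file
raw = raw.drop_duplicates()

# Pivot the long (date, metric, value) layout into one column per metric.
wide = raw.pivot_table(index="date", columns="metric", values="value").reset_index()

wide.to_csv("preprocessed_metrics.csv", index=False)  # ready for database loading
```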
Connecting the data source to Redash: Established a connection between the project and PostgreSQL using SQLAlchemy. Designed and implemented a database schema to store the preprocessed data, then loaded the prepared data for querying and analysis.
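A minimal sketch of the loading step, assuming SQLAlchemy plus pandas; the connection string and table name are illustrative:

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string pointing at the PostgreSQL database behind Redash.
engine = create_engine("postgresql://user:password@localhost:5432/analytics")

df = pd.read_csv("preprocessed_metrics.csv")
df.to_sql("business_metrics", engine, if_exists="replace", index=False)
```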
Chatbot development: Developed a chat interface (frontend) with React for user interaction, and built a backend API using Python's Quart framework to connect the chatbot to the LLM. Applied prompt engineering to tailor the LLM's responses to the task.
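A minimal sketch of the Quart endpoint that receives a question from the React frontend; the route name is illustrative and generate_sql_answer is a hypothetical wrapper around the LLM call shown in the next snippet:

```python
from quart import Quart, jsonify, request

app = Quart(__name__)

@app.route("/api/ask", methods=["POST"])
async def ask():
    payload = await request.get_json()
    question = payload["question"]
    answer = await generate_sql_answer(question)  # hypothetical LLM + SQL helper
    return jsonify(answer)

if __name__ == "__main__":
    app.run(port=5000)
```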
We leveraged OpenAI's GPT to generate SQL queries based on user questions. Additionally, we utilized LangChain and LlamaIndex to enhance the chatbot's ability to process complex queries and retrieve relevant data.
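One way to wire that up with LangChain's SQL query chain, sketched under the assumption of the same illustrative connection string as above (the sample question is also made up):

```python
from langchain.chains import create_sql_query_chain
from langchain_community.utilities import SQLDatabase
from langchain_openai import ChatOpenAI

# The chain reads the table schema from the database and prompts the model
# to write a SQL query that answers the user's question.
db = SQLDatabase.from_uri("postgresql://user:password@localhost:5432/analytics")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

sql_chain = create_sql_query_chain(llm, db)
sql = sql_chain.invoke({"question": "Which metric grew the most over the last month?"})
print(sql)
```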
Automation of visualizations: We implemented functionality to automatically generate visualizations based on user queries and the retrieved data, and connected the chatbot to Redash's visualization capabilities to present insights in a clear and actionable format.
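A rough sketch of pushing the generated SQL into Redash so its visualization layer can pick it up; this assumes Redash's REST API, and the endpoint path, payload fields, and data source id are illustrative and should be checked against the Redash version in use:

```python
import requests

REDASH_URL = "http://localhost:5000"                 # illustrative Redash instance
HEADERS = {"Authorization": "Key <redash_api_key>"}  # placeholder API key

def publish_query(name, sql, data_source_id=1):
    """Create a Redash query from the LLM-generated SQL so it can be visualized."""
    response = requests.post(
        f"{REDASH_URL}/api/queries",
        headers=HEADERS,
        json={"name": name, "query": sql, "data_source_id": data_source_id, "options": {}},
    )
    response.raise_for_status()
    return response.json()["id"]
```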