Retrieval-Augmented Generation (RAG)
RAG is conceptually simple: it uses an existing LLM to answer questions, after first retrieving relevant documents.
Fine-tuning or training an LLM for a specific domain is expensive and difficult. Instead, given a domain-related question, you first search your domain database to retrieve relevant documents, and then pass both the question and the retrieved documents to an LLM to answer or summarize. The retrieved documents (or paragraphs) provide the context the LLM needs to answer the domain-related question.
Essentially it is similar to Google + OpenAI: you google the relevant information first, then ask OpenAI to summarize or answer based on the Google results.
The quality of the search/retrieval is critical. If you can retrieve closely relevant information from your domain database or document base, the answer will be good; if you can't, the result will be poor.
RAG's solution is an embedding model plus a vector database. Documents (paragraphs / sections / sentences) are first converted into embeddings using a pre-trained embedding model, and the embeddings are stored in a vector database for fast retrieval. A question is converted with the same embedding model and matched against the document embeddings using, e.g., cosine similarity.
The top k most relevant documents (or document parts) are selected and fed, together with the original question, into an LLM. With a proper prompt, the LLM answers the question within the provided context.
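To make the pipeline concrete, here is a minimal end-to-end sketch of the retrieve-then-answer flow. It assumes the sentence-transformers package; the model name, the example chunks, and the call_llm placeholder are illustrative, not part of any specific RAG library, and a real system would keep the embeddings in a vector database rather than an in-memory array.

```python
# Minimal sketch: embed chunks, match the question by cosine similarity,
# then assemble the top-k chunks and the question into a prompt for an LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # pre-trained embedding model

# In a real system these chunks would live in a vector database.
chunks = [
    "The warranty period is 24 months from the date of delivery.",
    "Refund requests are processed within 14 business days.",
    "The office is closed on public holidays.",
]
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)  # shape (n_chunks, dim)

question = "How long is the warranty?"
q_vec = embedder.encode([question], normalize_embeddings=True)[0]

# With normalized vectors, cosine similarity reduces to a dot product.
scores = chunk_vecs @ q_vec
top_k = 2
top_idx = np.argsort(scores)[::-1][:top_k]
context = "\n---\n".join(chunks[i] for i in top_idx)

prompt = (
    "Answer the question using only the context below. "
    "If the context is insufficient, say so.\n\n"
    f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def call_llm(prompt: str) -> str:
    """Placeholder: plug in your preferred chat/completion client here."""
    raise NotImplementedError

# answer = call_llm(prompt)
print(prompt)
```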
Although you can convert a whole document into a single embedding, that may not provide the right granularity of relevant information, and it feeds too much text into the LLM, increasing cost and the chance of misinterpretation. A better strategy is to chunk the document into smaller units, e.g. paragraphs, sections, or clauses, whichever makes sense for the specific use case. For legal documents, it may make sense to go down to the clause or even sub-clause level. Below are 5 chunking methods, depending on whether the document structure is known.
Fixed-size chunking (with overlap) is straightforward: split the text into fixed-length windows, stepping by less than the window size so adjacent chunks share some text.
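As an illustration, a fixed-size chunker can be written in a few lines; the chunk size and overlap values below are arbitrary.

```python
# Fixed-size chunking with overlap: slide a window of chunk_size characters,
# stepping by (chunk_size - overlap) so adjacent chunks share some text.
def fixed_size_chunks(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# Example: 1,200 characters -> chunks of 500, 500, and 300 characters.
print([len(c) for c in fixed_size_chunks("x" * 1200)])
```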
Semantic chunking first segments the document into, e.g., sentences and computes embeddings for all of them. Starting from the first sentence, it checks the cosine similarity between the first and second sentences; if they are similar, the second sentence belongs in the same chunk as the first. This continues sentence by sentence until the cosine similarity drops sharply, at which point a new chunk starts.
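A rough sketch of this idea, again using sentence-transformers for the sentence embeddings; the similarity threshold and the naive period-based sentence splitter are assumptions.

```python
# Semantic chunking sketch: start a new chunk whenever the cosine similarity
# between consecutive sentences drops below a threshold.
from sentence_transformers import SentenceTransformer

def semantic_chunks(text: str, threshold: float = 0.5) -> list[list[str]]:
    # Naive sentence splitter; a real system would use a proper sentence tokenizer.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if not sentences:
        return []
    model = SentenceTransformer("all-MiniLM-L6-v2")
    vecs = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(vecs[i - 1] @ vecs[i])  # cosine, since vectors are normalized
        if similarity >= threshold:
            current.append(sentences[i])   # still on the same topic
        else:
            chunks.append(current)         # similarity dropped: start a new chunk
            current = [sentences[i]]
    chunks.append(current)
    return chunks
```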
Document-structure chunking relies on an existing document structure (e.g. headings, sections, clauses); a parser can then be used to split the document along those boundaries.
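For example, a Markdown document could be split on its headings; the heading pattern below is just one possible parser, and the sample document is made up.

```python
# Structure-based chunking sketch: split a Markdown document at its headings,
# keeping each heading together with the text under it.
import re

def markdown_chunks(text: str) -> list[str]:
    # Split just before lines that start with one or more '#' characters.
    parts = re.split(r"\n(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]

doc = "# Intro\nSome overview.\n\n## Details\nClause 1.1 applies here.\n\n## Appendix\nTables."
for chunk in markdown_chunks(doc):
    print(repr(chunk))
```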
LLM-based chunking relies on an LLM to interpret the document and divide it into chunks. It is computationally expensive.
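One way to frame this is to ask the model to insert boundary markers itself; the prompt wording and the call_llm placeholder below are assumptions, not a fixed recipe.

```python
# LLM-based chunking sketch: ask the model to insert chunk boundaries.
# call_llm is a placeholder for whatever chat/completion client you use.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def llm_chunks(text: str, delimiter: str = "<<<CHUNK>>>") -> list[str]:
    prompt = (
        "Split the following document into self-contained, topically coherent chunks. "
        f"Output the original text unchanged, inserting the marker {delimiter} "
        "between chunks and nothing else.\n\n" + text
    )
    marked = call_llm(prompt)
    return [c.strip() for c in marked.split(delimiter) if c.strip()]
```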