GenAI: Generative AI is a branch of artificial intelligence centered on computer models capable of generating original content. By leveraging the power of large language models (LLMs), neural networks, and machine learning (ML), generative AI (GenAI) is able to produce novel content that mimics human creativity.
https://www.elastic.co/what-is/generative-ai
What can I do with generative AI? Elastic lets you search your proprietary data and integrate with large language models using retrieval augmented generation (RAG). Your proprietary, real-time data provides additional context and accuracy to generative AI experiences. Providing highly relevant results to generative AI via the context window is a cost-effective, secure alternative to building or training your own large language model.
RAG: Retrieval augmented generation is a technique that supplements text generation with information from private or proprietary data sources. It combines a retrieval model, which is designed to search large datasets or knowledge bases, with a generation model such as a large language model (LLM), which takes that information and generates a readable text response.
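As a rough illustration of the RAG pattern, here is a minimal sketch (not Elastic's reference implementation): retrieve relevant passages from Elasticsearch, then hand them to an LLM as extra context. The index name, field names, and the call_llm() helper are hypothetical placeholders.

```python
# A minimal RAG sketch (an illustration, not Elastic's reference implementation):
# retrieve passages from Elasticsearch, then pass them to an LLM as extra context.
# The index name, field names, and call_llm() helper are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def call_llm(prompt: str) -> str:
    # Placeholder: swap in your LLM provider's chat/completions API call.
    raise NotImplementedError

def retrieve(question: str, size: int = 3) -> list[str]:
    # Retrieval step: a simple BM25 match query here; in practice this could be
    # semantic, vector, or hybrid search.
    resp = es.search(
        index="my-knowledge-base",                  # hypothetical index
        query={"match": {"body": question}},
        size=size,
    )
    return [hit["_source"]["body"] for hit in resp["hits"]["hits"]]

def answer(question: str) -> str:
    # Generation step: ground the LLM in the retrieved context via the prompt.
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```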
With Elasticsearch Relevance Engine (ESRE), you can build RAG-enabled search for your generative AI app, website, customer, or employee experiences. Use ESRE to apply semantic search with superior relevance out of the box (without domain adaptation), integrate with external large language models (LLMs), implement hybrid search, and use third-party or your own transformer models.
https://www.elastic.co/what-is/retrieval-augmented-generation
https://www.elastic.co/elasticsearch/elasticsearch-relevance-engine
LLM: A large language model is a deep learning algorithm that can perform a variety of natural language processing (NLP) tasks. Large language models use transformer models and are trained using massive datasets — hence, large. This enables them to recognize, translate, predict, or generate text or other content.
https://www.elastic.co/what-is/large-language-models
https://www.elastic.co/search-labs/blog/domain-specific-generative-ai-pre-training-fine-tuning-rag
Transformers: Transformers were introduced by Google in 2017 in the paper “Attention Is All You Need.” Their introduction was the catalyst of today’s AI boom, and they are the architecture behind ChatGPT (GPT stands for “Generative Pre-trained Transformer”).
A transformer model is the most common architecture of a large language model. It consists of an encoder and a decoder.
A transformer model processes data by tokenizing the input, then applying mathematical operations across all tokens simultaneously to discover the relationships between them. This enables the computer to see the patterns a human would see if given the same query.
Transformer models work with self-attention mechanisms, which enable the model to learn more quickly than traditional architectures such as long short-term memory (LSTM) models. Self-attention is what enables the transformer model to consider different parts of the sequence, or the entire context of a sentence, to generate predictions.
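For reference, the standard scaled dot-product self-attention from the “Attention Is All You Need” paper can be written as follows, where Q, K, and V are the query, key, and value matrices projected from the token embeddings and d_k is the key dimension:

```latex
% Scaled dot-product self-attention. Q, K, V are the query, key, and value
% matrices projected from the token embeddings; d_k is the key dimension.
\mathrm{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```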
https://www.elastic.co/search-labs/blog/generative-ai-transformers-explained
Generative Pre-trained Transformer (GPT): Generative pre-trained transformers are perhaps the best-known large language models. Developed by OpenAI, GPT is a popular foundational model whose numbered iterations are improvements on their predecessors (GPT-3, GPT-4, etc.). It can be fine-tuned to perform specific tasks downstream. Examples of this are EinsteinGPT, developed by Salesforce for CRM, and Bloomberg's BloombergGPT for finance.
https://www.elastic.co/what-is/large-language-models
https://www.elastic.co/search-labs/blog/chatgpt-elasticsearch-creating-custom-gpts-with-elastic-data
Self-attention: Self-attention enhances embeddings by “blending in” contextual information from within the input text. For each word in a given piece of text, it identifies the words that are most relevant to it in the sequence and updates its embedding to include that contextual information. The term self-attention refers to the ability to attend to different words within the input sequence itself, as opposed to attention mechanisms that attend to context external to the input sequence.
Context Window: The context window is the amount of text, measured in tokens, that a model can take into account in a single prompt. Search results added to the context window can provide up-to-date information from a private source or specialized domain, so the model can return more factual answers when prompted instead of relying solely on its so-called "parametric" knowledge.
Encoders and Decoders: Let's use the example of understanding and speaking a foreign language. An "encoder" is like your ability to understand the foreign language. It takes sentences from that language and turns them into something (an internal representation) you can understand. The "decoder" is like your ability to speak in the foreign language. It takes your thoughts (internal representations) and turns them into sentences in the foreign language.
BM25: BM25 is a widely used text retrieval algorithm based on probabilistic information retrieval theory. It ranks documents based on the frequency of query terms in the document, taking into account factors such as term frequency, inverse document frequency, and document length normalization. While BM25 has been effective in traditional search applications, it has some limitations. For example, BM25 relies heavily on exact term matches, which can lead to less relevant results when dealing with synonyms, misspellings, or subtle semantic variations. Additionally, BM25 does not capture the contextual relationships between words, making it less effective at understanding the meaning of phrases or sentences.
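For reference, the classic BM25 score of a document D for a query Q with terms q_1 … q_n is the following, where f(q_i, D) is the term frequency, |D| the document length, avgdl the average document length, and k_1 and b free parameters:

```latex
% f(q_i, D): frequency of term q_i in document D; |D|: document length;
% avgdl: average document length in the collection; k_1 and b are free
% parameters, commonly k_1 between 1.2 and 2.0 and b = 0.75.
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\cdot
\frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}
```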
Semantic search is a search technique centered around understanding the meaning of a search query and the content being searched. It aims to provide more contextually relevant search results.
Vector search is a technique that represents data points as vectors in a high-dimensional space. It enables efficient similarity search and recommendation systems by calculating distances between vectors. Vector search leverages machine learning (ML) to capture the meaning and context of unstructured data, including text and images, transforming it into a numeric representation. Frequently used for semantic search, vector search finds similar data using approximate nearest neighbor (ANN) algorithms. Compared to traditional keyword search, vector search yields more relevant results and executes faster.
You can enhance the search experience by combining vector search with filtering and aggregations, and you can further optimize relevance by implementing hybrid search that combines vector search with traditional scoring.
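As an illustration, a kNN search against a dense_vector field can be combined with a structured filter in a single request. In this hedged sketch, the index name "products", the field names, and the embed() helper are assumptions; the query vector must come from the same embedding model that produced the indexed vectors.

```python
# A hedged sketch of a kNN (vector) search combined with a structured filter.
# The index name "products", the field names, and the embed() helper are
# assumptions; the query vector must come from the same embedding model that
# produced the indexed vectors.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def embed(text: str) -> list[float]:
    # Placeholder: return the vector from whatever embedding model you use.
    raise NotImplementedError

resp = es.search(
    index="products",
    knn={
        "field": "text_embedding",
        "query_vector": embed("waterproof hiking boots"),
        "k": 10,                 # nearest neighbors to return
        "num_candidates": 100,   # candidates considered per shard (recall vs. speed)
        "filter": {"term": {"in_stock": True}},
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("name"))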
Hybrid Search: Hybrid search combines vector, keyword, and semantic techniques, typically merging their result sets with Reciprocal Rank Fusion (RRF), for better results.
KNN Search: kNN, or the k-nearest neighbor algorithm, is a machine learning algorithm that uses proximity to compare one data point with a set of data it was trained on and has memorized to make predictions. Use kNN to make predictions based on similarity.
kNN is a supervised learning algorithm - it is fed training datasets that it memorizes - in which 'k' represents the number of nearest neighbors considered in the classification or regression problem, and 'NN' stands for those nearest neighbors. While kNN's computation occurs at query time rather than during a training phase, it has significant data storage requirements and is therefore heavily reliant on memory.
Common kNN distance metrics (sketched in code after the list below): Euclidean distance, Manhattan distance, Minkowski distance, Hamming distance
Advantages: simple, adaptable, easily programmable
Challenges: difficult to scale, curse of dimensionality, overfitting
Top use cases: relevance ranking, similarity search for images or videos, pattern recognition, finance (stock market), health (genetics), product recommendations, data preprocessing.
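Below is a small Python sketch of the four distance metrics listed above, purely for illustration (Elasticsearch computes vector similarity internally; these functions are not its implementation).

```python
# Illustrative implementations of the distance metrics listed above, for two
# equal-length vectors. Minkowski generalizes Euclidean (p=2) and Manhattan
# (p=1); Hamming counts positions where the values differ.
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p=3):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

print(euclidean([1, 2], [4, 6]))        # 5.0
print(manhattan([1, 2], [4, 6]))        # 7
print(hamming("karolin", "kathrin"))    # 3
```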
To run a kNN search, you must be able to convert your data into meaningful vector values. You can create these vectors using a natural language processing (NLP) model in Elasticsearch, or generate them outside Elasticsearch. Vectors can be added to documents as dense_vector field values. Queries are represented as vectors with the same dimension.
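A hedged sketch of what this looks like in practice: create an index with a dense_vector field sized to your embedding model, then index documents together with their vectors. The index name, field names, and the 384-dimension assumption below are illustrative only.

```python
# A hedged sketch of creating an index with a dense_vector field and indexing
# a document with its embedding. The index name, field names, and dimension
# (384) are assumptions; dims must match your embedding model's output.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="my-knowledge-base",
    mappings={
        "properties": {
            "body": {"type": "text"},
            "body_embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,            # enable ANN (HNSW) search on this field
                "similarity": "cosine",
            },
        }
    },
)

es.index(
    index="my-knowledge-base",
    document={
        "body": "Elasticsearch 8.0 introduced approximate nearest neighbor search.",
        "body_embedding": [0.01] * 384,   # stand-in for a real model-generated vector
    },
)
```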
ANN Search: Traditional nearest neighbor algorithms, like the k-nearest neighbor algorithm (kNN), lead to excessive execution times and consume significant computational resources. ANN sacrifices perfect accuracy in exchange for executing efficiently in high-dimensional embedding spaces, at scale. Elasticsearch 8.0 uses an ANN algorithm called Hierarchical Navigable Small World graphs (HNSW), which organizes vectors into a graph based on their similarity to each other. HNSW is widely used in industry, having been implemented in several different systems.
https://www.elastic.co/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0
Embeddings: Vector embeddings are a way to convert words and sentences and other data into numbers that capture their meaning and relationships. They represent different data types as points in a multidimensional space, where similar data points are clustered closer together. These numerical representations help machines understand and process this data more effectively.
Word and sentence embeddings are two of the most common subtypes of vector embeddings, but there are others. Some vector embeddings can represent entire documents, as well as image vectors designed to match visual content, user profile vectors to capture a user’s preferences, product vectors that help identify similar products, and many others.
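As a minimal sketch of generating embeddings outside Elasticsearch, here is an example using the open-source sentence-transformers library; "all-MiniLM-L6-v2" is just one possible model choice, and any other embedding model or hosted API could be substituted.

```python
# A minimal sketch of generating vector embeddings outside Elasticsearch using
# the open-source sentence-transformers library; "all-MiniLM-L6-v2" (384
# dimensions) is just one possible model choice.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A kitten is resting on a rug.",
    "Quarterly revenue grew by 12%.",
]
embeddings = model.encode(sentences)   # one 384-dimensional vector per sentence
print(embeddings.shape)                # (3, 384)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantically similar sentences land close together in the vector space:
print(cosine(embeddings[0], embeddings[1]))  # relatively high
print(cosine(embeddings[0], embeddings[2]))  # relatively low
```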
https://www.elastic.co/what-is/vector-embedding
Sparse Vector / Dense Vector: There are two big families of retrieval approaches, often referred to as “dense” and “sparse.” Both use a vector representation of text, which encodes meaning and associations, and both perform a search for close matches as a second step. A vector is considered “dense” because most of its values are non-zero. In contrast, “sparse” representations contain very few non-zero values.
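A toy, library-agnostic illustration of the difference (the numbers below are made up):

```python
# A dense embedding is a fixed-length vector of mostly non-zero floats, while a
# sparse representation keeps only the few terms (or dimensions) with non-zero
# weights, much like weighted keywords. The values below are made up.
dense_embedding = [0.12, -0.03, 0.88, 0.41, -0.27, 0.05]  # every position carries a value

sparse_representation = {  # only non-zero weights are stored
    "espresso": 2.1,
    "coffee": 1.4,
    "caffeine": 0.9,
}
```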
ESRE: the Elasticsearch Relevance Engine (ESRE) is a relevance engine built for artificial intelligence-powered search applications. With ESRE, developers are empowered to build their own semantic search application, utilize their own transformer models, and combine NLP and generative AI to enhance their customers' search experience.
Apply advanced relevance ranking features including BM25f, a critical component of hybrid search
Create, store, and search dense embeddings using Elastic’s vector database
Process text using a wide range of natural language processing (NLP) tasks and models
Manage and use your own transformer models in Elastic for business-specific context
Integrate with third-party transformer models such as OpenAI’s GPT-3 and 4 via API
Enable ML-powered search without training or maintaining a model using Elastic’s out-of-the-box Learned Sparse Encoder (ELSER) model
Combine sparse and dense retrieval using Reciprocal Rank Fusion (RRF)
Integrate with third-party tooling such as LangChain to help build sophisticated data pipelines and generative AI applications
https://www.elastic.co/elasticsearch/elasticsearch-relevance-engine
https://www.elastic.co/guide/en/esre/current/faq.html
ELSER (Elastic Learned Sparse EncodeR) is a retrieval model trained by Elastic that enables you to perform semantic search to retrieve more relevant search results. This search type provides results based on contextual meaning and user intent, rather than exact keyword matches. ELSER is an out-of-domain model, which means it does not require fine-tuning on your own data, making it adaptable for various use cases out of the box.
trained and architected in such a way that you do not need to fine tune it on your data. As an out-of-domain model, it outperforms dense vector models when no domain-specific retraining is applied.
outperforms SPLADE (Sparse Lexical and Expansion Model), the previous out-of-domain, sparse-vector, text-expansion champion, as measured by the same benchmarks.
you don’t have to worry about licensing, support, continuity of competitiveness, and extensibility beyond your Elastic license tier.
As a sparse-vector representation, it uses Elasticsearch's Lucene-based inverted index. This means decades of optimizations are leveraged to provide optimal performance.
Fewer dimensions are activated than in dense representations, and they often directly map to words, in contrast with the opaqueness of dense representations.
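A hedged sketch of querying an ELSER-indexed field with a text_expansion query follows, assuming documents were ingested through an inference pipeline that writes ELSER's token/weight output into a field here called "ml.tokens" and that the ".elser_model_2" model is deployed; the field name, model ID, and exact query type (newer versions may use a sparse_vector query instead) depend on your Elasticsearch version and setup.

```python
# A hedged sketch of an ELSER semantic search. The index name, the "ml.tokens"
# field, the ".elser_model_2" model ID, and the "title" field are assumptions
# that depend on your ingest pipeline and Elasticsearch version.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="my-knowledge-base",
    query={
        "text_expansion": {
            "ml.tokens": {
                "model_id": ".elser_model_2",
                "model_text": "how do I reset my password?",
            }
        }
    },
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))
```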
RRF: Reciprocal rank fusion (RRF) is a method for combining multiple result sets with different relevance indicators into a single result set. RRF requires no tuning, and the different relevance indicators do not have to be related to each other to achieve high-quality results.
Elasticsearch integrates the RRF algorithm into the search query. A search request can contain a query section and a knn section to request full-text and vector searches respectively, and an rrf section that combines them into a single result list. RRF allows us to combine the two result sets generated by completely independent scoring algorithms with equal weighting. Not only does this remove the need to figure out an appropriate weighting for a linear combination, but RRF has also been shown to give improved relevance over either query individually.
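Conceptually, each ranked result list contributes a score of 1 / (rank_constant + rank) per document, and documents are re-ordered by the summed score. A small, illustrative Python implementation (not Elasticsearch's internal code) follows.

```python
# An illustrative implementation of reciprocal rank fusion (not Elasticsearch's
# internal code): each ranked list contributes 1 / (k + rank) per document, and
# documents are re-ordered by the summed score. k (the "rank constant",
# commonly 60) damps the influence of lower-ranked hits.
from collections import defaultdict

def rrf(result_lists, k=60):
    scores = defaultdict(float)
    for results in result_lists:                 # each list is ordered best-first
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]     # e.g. full-text (query section) results
vector_hits = ["doc1", "doc9", "doc3"]   # e.g. kNN (knn section) results
print(rrf([bm25_hits, vector_hits]))     # ['doc1', 'doc3', 'doc9', 'doc7']
```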
Chunking: Chunking is the logical grouping of information. Your chunking decision should be based on the maximum token limit of your embedding model. Why “chunk” at all? Sentences and paragraphs have more cohesive and targetable meaning than single words or whole pages of text, and chunking avoids truncation, since embedding models have token limits. Tips: consider overlapping chunks, and consider vector embeddings for summaries and abstracts.
Chunking for large documents
The combination of Elasticsearch features such as ingest pipelines, the flexibility of a script processor, and the support for nested documents with dense_vectors allows for a straightforward way to chunk large documents at ingest time into passages small enough to be processed by text embedding models, generating all the vectors needed to represent the full meaning of the large documents.
Ingest your document data as you would normally, and add to your ingest pipeline a script processor to break the large text data into an array of sentences or other types of chunks, followed by a for_each processor that runs an inference processor on each chunk. The index mappings are defined so that the array of chunks is a nested object with a dense_vector mapping as a subobject, which properly indexes each of the vectors and makes them searchable.
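A hedged sketch of such a pipeline, registered with the Python client, follows; the pipeline ID, field names, model ID, exact inference-processor options, and the simplistic sentence-splitting script are all assumptions that depend on your deployment and version.

```python
# A hedged sketch of the ingest pipeline described above: a script processor
# splits the document body into passage objects, then a foreach processor runs
# the inference processor on each one. Pipeline ID, field names, model ID, and
# the exact inference-processor options are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

split_script = """
List chunks = new ArrayList();
for (String s : ctx.body.splitOnToken('.')) {   // naive sentence splitter
  if (s.trim().length() > 0) {
    Map m = new HashMap();
    m.put('text', s.trim());
    chunks.add(m);
  }
}
ctx.passages = chunks;
"""

es.ingest.put_pipeline(
    id="chunk-and-embed",
    description="Split large text into passages and embed each one",
    processors=[
        {"script": {"source": split_script}},
        {
            "foreach": {
                "field": "passages",
                "processor": {
                    "inference": {
                        "model_id": "sentence-transformers__all-minilm-l6-v2",  # assumed model
                        "target_field": "_ingest._value.vector",
                        "field_map": {"_ingest._value.text": "text_field"},
                    }
                },
            }
        },
    ],
)
# The index mapping should define "passages" as a nested object whose vector
# subfield is a dense_vector, so each passage embedding is indexed and searchable.
```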
Inference: An inference processor is a pipeline task that uses a deployed, trained model to transform incoming data during indexing or re-indexing.
https://www.elastic.co/guide/en/elasticsearch/reference/current/inference-processor.html
Tokenization: In NLP, tokenization refers to the process of converting sentences into tokens, or smaller units of information, which enables faster computer processing. A transformer model processes data by tokenizing the input, then applying mathematical operations across all tokens simultaneously to discover the relationships between them. This enables the computer to see the patterns a human would see if given the same query. In the context of text, a token can be a word, part of a word (subword), or even a character, depending on the tokenization process.
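As a small illustration, the open-source tiktoken library exposes the byte-pair-encoding tokenizer used by several OpenAI models; other models ship their own tokenizers, so token counts differ between models.

```python
# A small illustration of subword tokenization with the open-source tiktoken
# library. Other models use different tokenizers, so the exact splits and
# counts shown here are specific to this encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization splits text into smaller units."
token_ids = enc.encode(text)
print(token_ids)                              # integer token IDs
print(len(token_ids))                         # how many tokens the text costs
print([enc.decode([t]) for t in token_ids])   # the subword pieces, e.g. "Token", "ization", ...
```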
Token limits refer to the maximum number of tokens that a large language model (LLM) can process in a single interaction. Token limits are crucial because they influence the efficiency and output of LLMs. If the token limit is set too low, the LLM might not be able to produce the desired output.