Information Retrieval (IR) is a field of study that focuses on the efficient and effective retrieval of relevant information from large collections of unstructured or semi-structured data, typically text. The goal is to match user queries with relevant documents, which makes IR central to search engines, document retrieval systems, and recommendation systems.
Vector Space Model (VSM) - Represents documents and queries as vectors in a high-dimensional space, where the cosine similarity between vectors is used to rank documents based on relevance.
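As a rough illustration of the ranking step described above, the following minimal sketch scores documents against a query by cosine similarity. It assumes raw term counts as vector components and whitespace tokenization; the function names `cosine_similarity` and `rank_by_cosine` are hypothetical, and real systems usually apply term weighting and better tokenization.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

def rank_by_cosine(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    # Assumption: lowercase + whitespace split stands in for a real tokenizer.
    q_vec = Counter(query.lower().split())
    scored = [
        (cosine_similarity(q_vec, Counter(doc.lower().split())), i)
        for i, doc in enumerate(docs)
    ]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))
```

Because cosine similarity depends only on the angle between vectors, a long document is not favored merely for containing more words.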
Term Frequency-Inverse Document Frequency (TF-IDF) - Weights a term by how often it occurs in a document, scaled down by how many documents in the corpus contain it, so that terms distinctive to a particular document receive higher weight.
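One way to make this weighting concrete is the sketch below. It assumes the simplest common variant, raw-count term frequency multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term; many other TF-IDF variants exist, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumption: weight = raw term count * log(N / df). Smoothed or
    normalized variants are also common in practice.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights
```

Under this variant, a term that appears in every document gets a weight of zero, since it does nothing to distinguish one document from another.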
Boolean Model - Represents documents and queries as sets of terms and employs Boolean operators (AND, OR, NOT) to retrieve documents that match a given query.
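The set-based matching described above is commonly implemented with an inverted index. The sketch below is one minimal way to do that, assuming whitespace tokenization; the helper names (`build_inverted_index`, `boolean_and`, `boolean_or`, `boolean_not`) are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, a, b):
    # Documents containing both terms: set intersection.
    return index.get(a, set()) & index.get(b, set())

def boolean_or(index, a, b):
    # Documents containing either term: set union.
    return index.get(a, set()) | index.get(b, set())

def boolean_not(index, term, all_ids):
    # Documents that do not contain the term: set difference.
    return all_ids - index.get(term, set())
```

Note that the result is an unranked set: a document either matches the query or it does not.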
Probabilistic Information Retrieval Models (e.g., BM25) - Rank documents by an estimate of their probability of relevance to a query, using factors such as term frequency and document length. BM25 is a widely used ranking function from this family.
Latent Semantic Indexing (LSI) or Latent Semantic Analysis (LSA) - Applies singular value decomposition (SVD) to identify latent semantic structures in the document-term matrix, enabling the capture of hidden relationships between terms.
Okapi BM25 - The full name of BM25, after the Okapi retrieval system in which it was first implemented. It extends earlier probabilistic models with term-frequency saturation and document-length normalization to improve document ranking.
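To show how term saturation and length normalization enter the score, here is a minimal sketch of a BM25 scorer. It assumes a non-empty corpus, whitespace tokenization, the commonly used defaults k1 = 1.5 and b = 0.75, and a non-negative IDF variant, log(1 + (N - df + 0.5) / (df + 0.5)); the function name `bm25_scores` is hypothetical, and other formulations differ in detail.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)  # assumption: at least one document
    avgdl = sum(len(tokens) for tokens in tokenized) / n_docs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Length normalization: documents longer than average are penalized.
        length_norm = 1 - b + b * len(tokens) / avgdl
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturation: the gain from each extra occurrence shrinks with tf.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

In this formulation k1 controls how quickly repeated occurrences stop adding to the score, and b controls how strongly document length is normalized (b = 0 disables it).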
Learning-to-Rank Models - Use machine learning to learn a ranking function from features such as query-document relevance signals, click-through rates, and other user behavior.
Query Expansion - Augments the user's original query with additional terms to improve retrieval performance, often using synonyms or related terms.
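A minimal sketch of synonym-based expansion follows. The synonym table is a hand-written, hypothetical stand-in for a real thesaurus or a learned source of related terms, and the function name `expand_query` is likewise an assumption.

```python
def expand_query(query, synonyms):
    """Return the query terms plus any listed synonyms, without duplicates.

    `synonyms` is a hypothetical {term: [related terms]} mapping.
    """
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + synonyms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)  # keep original order, skip repeats
    return expanded

# Example with a hand-written synonym table:
synonym_table = {"car": ["automobile", "vehicle"]}
print(expand_query("car repair", synonym_table))
```

Expansion tends to improve recall at the risk of lowering precision, so added terms are often given lower weight than the original query terms.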
Web Search Algorithms (e.g., PageRank, HITS) - Rank web pages by analyzing the link structure of the web. PageRank assigns each page a single query-independent importance score, while HITS computes separate hub and authority scores for each page.
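As an illustration of link-based ranking, the sketch below computes PageRank by power iteration on a small graph. It assumes a damping factor of 0.85, a fixed number of iterations rather than a convergence test, and that rank held by pages without outgoing links is spread evenly over all pages; the function name `pagerank` and the input format are hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by pages with no outlinks is shared among all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page splits its rank evenly across its outgoing links.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank
```

The scores form a probability distribution over pages, so they sum to 1; a page ranks highly when it is linked to by other highly ranked pages.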
Semantic Search - Integrates natural language understanding and semantic analysis to interpret the meaning of queries and documents rather than only their surface terms, enabling more context-aware retrieval.