We have compiled a list of resources that may prove useful for the final project. This list is continually expanding, so feel free to suggest additional resources.
In general, you might find CMU's ML⇄DB Seminar Series useful for anything related to vector databases and AI for databases.
FAISS Manual: Pinecone provides a free manual on the essentials of vector search and using FAISS, Meta's open-source vector similarity search library.
FAISS Wiki: FAISS provides a wiki covering basic usage, the underlying mathematics, and common optimizations, as well as case studies of usage at scale; a minimal indexing sketch follows this list.
Long Term Memory for AI: Princeton offered a course on the mathematical theory and implementation of vector databases.
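To give a concrete sense of FAISS's Python API (referenced in the FAISS Manual and Wiki entries above), here is a minimal sketch that builds an exact, brute-force L2 index over random vectors and queries it. The dimensionality, data, and number of neighbors are placeholders.

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 128                                                # vector dimensionality (placeholder)
xb = np.random.random((10_000, d)).astype("float32")   # toy database vectors
xq = np.random.random((5, d)).astype("float32")        # toy query vectors

index = faiss.IndexFlatL2(d)          # exact L2 (brute-force) index
index.add(xb)                         # add database vectors
distances, ids = index.search(xq, 5)  # 5 nearest neighbors per query
print(ids.shape)                      # (5, 5): one row of neighbor ids per query
```

Approximate indexes (e.g. IVF or HNSW variants) follow a similar add/search pattern, although some require a training step first; the wiki above discusses the trade-offs.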
FAISS (C++): Meta's open-source vector similarity search library (associated paper).
Weaviate (Go): Cloud-native vector database (associated talk).
We will also have a guest lecture from the co-founder/CTO of Weaviate, Etienne Dilocker.
Milvus (Go): Cloud-native vector database (associated talk).
We will also have a guest lecture from a Milvus developer, Jianguo Wang.
ANNOY (C++/Python): Spotify's approximate nearest neighbor search library (documentation); a short usage sketch follows this list.
pgvector (C): Vector similarity search for Postgres (associated talk).
Qdrant (Rust): Vector similarity search engine and database (associated talk).
Chroma (Python): Embedding database for LLM applications (associated talk).
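As referenced in the ANNOY entry above, here is a minimal sketch of Annoy's Python API: build a forest of trees over toy vectors, then query for approximate nearest neighbors. The dimensionality, metric, and tree count are placeholder choices.

```python
import random
from annoy import AnnoyIndex  # pip install annoy

dim = 64
index = AnnoyIndex(dim, "angular")  # angular distance (related to cosine similarity)
for i in range(1_000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(dim)])

index.build(10)                             # 10 trees; more trees -> better recall, larger index
neighbors = index.get_nns_by_item(0, 10)    # 10 approximate nearest neighbors of item 0
print(neighbors)
```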
Note that most larger vector datasets also provide small subsets (100K to 10M vectors) of the original dataset; you might start with these before scaling up.
SIFT1M: ~1M 128-dimensional SIFT image descriptors, a standard ANNS benchmark distributed in .fvecs format (see the reader sketch after this list). Uses Euclidean distance.
Deep1B: ~1B vectors. Image embeddings from GoogLeNet, pre-trained on ImageNet. Uses cosine distance.
Text-to-Image-1B: ~1B image, ~50M textual query vectors. Image embeddings are produced by the SE-ResNeXt-101 model, and textual embeddings are produced by the DSSM model.
SpaceV1B: ~1B document, ~29K query vectors. Encoded by Microsoft's SpaceV Superior model.
YFCC100M: ~100M multimedia assets (~99M images, ~1M videos). The actual media files can be found here.
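For the SIFT1M entry above: the base and query vectors are distributed as .fvecs files, where each vector is stored as a little-endian int32 dimension followed by that many float32 values, so a small NumPy reader along these lines is enough to get started. The file path is a placeholder.

```python
import numpy as np

def read_fvecs(path):
    """Load a .fvecs file into a (num_vectors, dim) float32 array."""
    data = np.fromfile(path, dtype=np.float32)
    dim = data[:1].view(np.int32)[0]          # each record starts with its dimension as int32
    return data.reshape(-1, dim + 1)[:, 1:]   # drop the per-record dimension field

base = read_fvecs("sift/sift_base.fvecs")     # placeholder local path
print(base.shape)                             # e.g. (1000000, 128) for SIFT1M
```

The ground-truth files use the analogous .ivecs layout with int32 entries.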
Introduction to LLMs: An introduction to LLMs by Andrej Karpathy.
HuggingFace: A community of open-source models and datasets. In particular, Hugging Face Transformers provides pre-trained SotA models; a minimal embedding sketch follows this list.
Text-To-Text Transfer Transformer (T5/T5x): One of Google's open-source text transformers for transfer learning (associated paper).
LLaMA: Meta's open-source LLM (associated blog post, paper).
Open Pretrained Transformers: A collection of open-source LLMs (associated blog post, paper).
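For the Hugging Face Transformers entry above, here is a minimal sketch of producing text embeddings with the feature-extraction pipeline. The model name is just one small sentence-embedding model from the Hub, and mean pooling is one simple way to collapse token embeddings into a single vector.

```python
import numpy as np
from transformers import pipeline  # pip install transformers

# "feature-extraction" returns per-token hidden states for each input text.
extractor = pipeline(
    "feature-extraction",
    model="sentence-transformers/all-MiniLM-L6-v2",  # example model; any encoder works
)

features = np.squeeze(np.array(extractor("Vector databases enable fast similarity search.")))
embedding = features.mean(axis=0)  # mean-pool token embeddings into one sentence vector
print(embedding.shape)             # (hidden_size,), e.g. (384,) for this model
```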
HuggingFace: A community of open-source models and datasets; the Hugging Face Hub hosts ready-to-use versions of many of the corpora below (see the streaming sketch after this list).
Common Crawl: Terabytes of web data from billions of scraped web pages.
RefinedWeb is a subset of Common Crawl with de-duplicated, filtered tokens.
The Pile: 800GB dataset from 22 professional/academic datasets (associated paper).
OpenWebText: ~42GB dataset of web text extracted from URLs shared on Reddit.
StarCoder: 783GB dataset of code from GitHub/Jupyter notebooks (associated paper).
Snorkel: Not a dataset, but rather a training data generator with weak supervision (associated paper).
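Since most of the corpora above are mirrored on the Hugging Face Hub, the datasets library can stream them without downloading everything up front. The sketch below assumes OpenWebText's Hub id and its "text" field, which may differ across mirrors.

```python
from datasets import load_dataset  # pip install datasets

# Streaming iterates over the corpus lazily instead of downloading it in full.
ds = load_dataset("Skylion007/openwebtext", split="train", streaming=True)

for i, example in enumerate(ds):
    print(example["text"][:80])    # first 80 characters of each document
    if i == 2:
        break
```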