Last updated: March 09, 2025
This version is under development and may be difficult to follow.
Large Language Models (LLMs) generally consist of three major components: Tokenization, Embedding, and the Transformer architecture. Here's a breakdown of these components:
Tokenization is the process of breaking down raw text into smaller units called tokens. These tokens can represent words, subwords, or even characters, depending on the model's design.
It converts text into a structured form that the model can process. Each token is assigned a unique ID, which acts as its representation in the vocabulary of the model.
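As a minimal sketch of tokenization in practice (assuming the Hugging Face transformers package is installed; the gpt2 tokenizer is not mentioned above and is used purely for illustration):

```python
# Minimal tokenization sketch (assumes the `transformers` package is installed;
# the "gpt2" tokenizer is just one example of a subword tokenizer).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The car color is white"
token_ids = tokenizer.encode(text)                    # text -> list of integer IDs
tokens = tokenizer.convert_ids_to_tokens(token_ids)   # IDs -> subword strings

print(tokens)      # e.g. ['The', 'Ġcar', 'Ġcolor', 'Ġis', 'Ġwhite'] (Ġ marks a leading space)
print(token_ids)   # the corresponding vocabulary IDs
```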
Once tokenized, each token is mapped to a dense vector representation in a high-dimensional space. This process is called embedding.
Embeddings capture semantic and syntactic meanings of tokens, ensuring that similar tokens are closer in this vector space. Positional embeddings are often added to encode the order of tokens in a sequence.
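A minimal embedding sketch in PyTorch; the vocabulary size, embedding width, and token IDs below are illustrative placeholders, not values taken from any specific model:

```python
# Embedding sketch: token IDs -> dense vectors, plus learned positional embeddings.
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 50257, 768, 1024       # illustrative sizes

token_embedding = nn.Embedding(vocab_size, d_model)   # token ID -> dense vector
position_embedding = nn.Embedding(max_len, d_model)   # position index -> dense vector

token_ids = torch.tensor([[464, 1097, 3124, 318, 2330]])   # hypothetical IDs, shape (1, 5)
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # [[0, 1, 2, 3, 4]]

x = token_embedding(token_ids) + position_embedding(positions)   # shape (1, 5, 768)
print(x.shape)
```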
The core computational framework of LLMs is the Transformer architecture, which uses mechanisms like self-attention and multi-head attention to process sequences efficiently.
Transformers analyze relationships between tokens in a sequence, focusing on relevant parts of the input to generate context-aware outputs.
These components work together to enable LLMs to understand and generate human-like text effectively. Let's look at each of them in more detail.
Self-attention is a core mechanism in transformer-based models that enables dynamic contextual understanding by analyzing relationships between elements in a sequence. Here's a detailed breakdown:
Core Components
Query (Q), Key (K), Value (V) Vectors
Each input token embedding is transformed into three vectors via learned linear layers:
Query: Represents the current focus (e.g., "What context is needed for this word?")
Key: Acts as an identifier for comparison with queries
Value: Contains the actual content to be weighted and aggregated
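Putting the three vectors together, here is a minimal single-head self-attention sketch in PyTorch; the dimensions and random inputs are illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 8                       # embedding size (illustrative)
seq_len = 5                       # number of tokens in the sequence

x = torch.randn(seq_len, d_model)                # token embeddings (positions already added)

W_q = nn.Linear(d_model, d_model, bias=False)    # learned projection for queries
W_k = nn.Linear(d_model, d_model, bias=False)    # learned projection for keys
W_v = nn.Linear(d_model, d_model, bias=False)    # learned projection for values

Q, K, V = W_q(x), W_k(x), W_v(x)

scores = Q @ K.T / d_model ** 0.5        # how well each query matches each key
weights = F.softmax(scores, dim=-1)      # attention weights, each row sums to 1
output = weights @ V                     # weighted sum of values per token
print(weights.shape, output.shape)       # (5, 5) and (5, 8)
```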
Positional Encoding
Since transformers lack inherent sequence awareness, positional embeddings are added to token embeddings to encode order (e.g., distinguishing "dog bites man" vs. "man bites dog").
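A sketch of the sinusoidal positional encoding scheme from the original Transformer paper; sizes here are illustrative, and many modern models use learned positional embeddings instead:

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Classic sin/cos positional encodings from "Attention Is All You Need"."""
    positions = torch.arange(max_len).unsqueeze(1).float()            # (max_len, 1)
    div_terms = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div_terms)   # even dimensions
    pe[:, 1::2] = torch.cos(positions * div_terms)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=1024, d_model=768)
# token_embeddings + pe[:seq_len] gives the model a sense of token order
print(pe.shape)   # torch.Size([1024, 768])
```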
Multi-Head Attention
Models use parallel self-attention "heads" to capture diverse relationships (e.g., syntax, semantics, and entity interactions). Each head has independent Q/K/V transformations, and outputs are concatenated for final processing.
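PyTorch ships a multi-head attention module, so a rough sketch of the idea can lean on it directly; the sizes below are illustrative:

```python
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 768, 12, 5      # illustrative sizes

mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)          # (batch, sequence, embedding)

# Self-attention: the same tensor serves as query, key, and value.
output, attn_weights = mha(x, x, x)
print(output.shape)        # torch.Size([1, 5, 768])
print(attn_weights.shape)  # torch.Size([1, 5, 5]) -- averaged over heads by default
```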
Key Advantages
Long-Range Context: Directly connects distant tokens (e.g., linking pronouns to their antecedents).
Dynamic Weighting: Adjusts focus based on input (e.g., "bank" in "river bank" vs. "bank deposit").
Parallelization: Efficient computation across sequence positions.
Example Workflow
For the token "white" in "The car color is white":
Query: Focuses on "white"
Key Comparisons: Strong match with "car" and "color"
Aggregation: Combines values of "car" and "color" to enrich "white"'s representation.
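To make the aggregation step concrete, here is a toy numeric sketch; the similarity scores are invented for illustration and do not come from a real model:

```python
import torch
import torch.nn.functional as F

tokens = ["The", "car", "color", "is", "white"]

# Hypothetical query("white") . key(token) similarity scores, invented for illustration.
scores = torch.tensor([0.1, 2.0, 1.8, 0.2, 0.9])

weights = F.softmax(scores, dim=-1)
for token, w in zip(tokens, weights):
    print(f"{token:>6}: {w.item():.2f}")
# "car" and "color" receive the largest weights, so their value vectors
# contribute most to the updated representation of "white".
```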
This mechanism underpins modern LLMs like GPT-4 and BERT, enabling nuanced language understanding.
AI applications increasingly rely on vector databases, which store embeddings and support low-latency similarity queries, making them well suited to AI-driven applications.
See What is a Vector Database (by IBM Technology), https://youtu.be/t9IDoenf-lo?si=4gGkIx2upFiNyV_G
Astra DB by DataStax (built on Apache Cassandra) offers vector search. https://www.datastax.com/guides/what-is-a-vector-database
Atlas by MongoDB includes native vector search, https://www.mongodb.com/lp/cloud/atlas/vector/database
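As a rough sketch of what a vector database does under the hood, here is a brute-force cosine-similarity search in NumPy; real vector databases add approximate-nearest-neighbor indexes (e.g., HNSW) to keep queries low-latency at scale, and the random data below is purely illustrative:

```python
import numpy as np

# A tiny in-memory "index": each row is the embedding of one stored document.
doc_embeddings = np.random.rand(10_000, 384).astype(np.float32)
query = np.random.rand(384).astype(np.float32)

# Cosine similarity = dot product of L2-normalized vectors.
doc_norm = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query_norm = query / np.linalg.norm(query)

similarities = doc_norm @ query_norm
top_k = np.argsort(-similarities)[:5]     # indices of the 5 closest documents
print(top_k, similarities[top_k])
```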
The Illustrated Word2vec, https://jalammar.github.io/illustrated-word2vec
A Complete Overview of Word Embeddings (by AssemblyAI), https://youtu.be/5MaWmXwxFNQ?si=nL8FiwAzC3oWvK2K
Multi-vector embeddings enhance traditional single-vector approaches by representing data objects through multiple vectors rather than a single fixed-dimensional vector. This technique captures richer contextual or perspectival information for complex tasks, particularly in multimodal AI systems.
See https://x.com/femke_plantinga/status/1895466605317877881
And https://weaviate.io/developers/weaviate/tutorials/multi-vector-embeddings
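One common multi-vector scoring scheme is ColBERT-style late interaction ("MaxSim"). A rough NumPy sketch with invented shapes and random data:

```python
import numpy as np

def maxsim_score(query_vectors: np.ndarray, doc_vectors: np.ndarray) -> float:
    """For each query vector, take its best match among the document vectors,
    then sum those maxima (ColBERT-style late interaction)."""
    q = query_vectors / np.linalg.norm(query_vectors, axis=1, keepdims=True)
    d = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    similarity_matrix = q @ d.T                  # (num_query_vecs, num_doc_vecs)
    return float(similarity_matrix.max(axis=1).sum())

# Illustrative data: 4 query token vectors, 20 document token vectors, dimension 128.
query_vecs = np.random.rand(4, 128)
doc_vecs = np.random.rand(20, 128)
print(maxsim_score(query_vecs, doc_vecs))
```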
An autoregressive model predicts the next step in a sequence from the steps that came before it: each prediction is fed back into the history used for the following one. LLMs generate text this way, one token at a time.
What is the Vector Autoregressive (VAR) Model, https://www.youtube.com/watch?v=0-FKPJ5KxSo
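A toy autoregressive sketch in plain Python; the coefficients are made up and only illustrate the feed-the-prediction-back-in loop:

```python
# Toy autoregressive model of order 2: the next value is a weighted sum of the
# two previous values plus a constant (coefficients are invented for illustration).
phi1, phi2, c = 0.6, 0.3, 1.0

history = [10.0, 11.0]                 # observed past values
for _ in range(5):                     # predict 5 steps ahead, one step at a time
    next_value = c + phi1 * history[-1] + phi2 * history[-2]
    history.append(next_value)         # each prediction feeds the next one

print(history)
# LLMs generate text the same way: each new token is predicted from all the
# tokens generated so far, then appended to the context for the next step.
```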
Transformers are a specific type of neural network architecture. They excel at understanding relationships between things in sequences, like words in a sentence or bases in a DNA strand. This makes them particularly powerful for tasks in natural language processing (NLP). They work by using a technique called "attention" to focus on important parts of the sequence. This allows them to learn complex relationships that traditional models might miss. Models like BERT and GPT-3 are built on transformers.
[Important] Self-Attention in Transformers, https://www.youtube.com/watch?v=-tCKPl_8Xb8
[Important] What are Transformers (Machine Learning Model)? (IBM Technology), https://www.youtube.com/watch?v=ZXiruGOCn9s
[Important] Transformer Explainer - Really cool interactive tool to learn about the inner workings of a Transformer model, poloclub.github.io/transformer-explainer. Apparently, it runs a GPT-2 instance locally in the user's browser and allows you to experiment with your own inputs. Here is a short video going over the tool: youtu.be/V5kAmFRwuxc. Paper: arxiv.org/abs/2408.04619
Transformers explained: Understand the model behind GPT, BERT, and T5 (Google Cloud Tech), https://youtu.be/SZorAJ4I-sA?si=eaYV0dWHdZU34eiO. See the corresponding blog at https://daleonai.com/transformers-explained and Google Colab notebook, https://colab.research.google.com/github/tensorflow/text/blob/master/docs/tutorials/classify_text_with_bert.ipynb
The Transformer Attention Mechanism (with some maths), https://machinelearningmastery.com/the-transformer-attention-mechanism/
Transformers and LLMs, https://deeplearning.cs.cmu.edu/F23/document/slides/lec19.transformersLLMs.pdf
Transformers for beginners | What are they and how do they work (Assembly AI), https://youtu.be/_UVfwBqcnbM?si=JPosaEN3rFQx1fH5
The original paper - https://arxiv.org/pdf/1706.03762.pdf
Lab (a bit advanced): if you're interested in following the original paper with the code - https://nlp.seas.harvard.edu/2018/04/03/attention.html
The Illustrated Transformer (a very good and graphical explanation), https://jalammar.github.io/illustrated-transformer/
Blog about positional encodings - https://kazemnejad.com/blog/transformer_architecture_positional_encoding/
About attention - Visualizing A Neural Machine Translation Model - https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
Layer normalization - https://arxiv.org/abs/1607.06450
Stable Diffusion is a program for generating images from text descriptions. It stands out because it can be customized to your needs, freely downloaded and run on your own computer, and is constantly being improved. There are other image generation options, such as OpenAI's DALL-E 3, but Stable Diffusion has shown impressive results in creating images from both textual descriptions and existing images.
A Technical Introduction to Stable Diffusion by Vidhi Chugh, https://machinelearningmastery.com/a-technical-introduction-to-stable-diffusion/
Prompting Techniques for Stable Diffusion by Vidhi Chugh, https://machinelearningmastery.com/prompting-techniques-stable-diffusion/
More about Stable Diffusion, https://machinelearningmastery.com/category/stable-diffusion/
How to Use Stable Diffusion Effectively, https://machinelearningmastery.com/how-to-use-stable-diffusion-effectively
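A minimal sketch of running Stable Diffusion locally via the Hugging Face diffusers library; it assumes diffusers and torch are installed and a CUDA GPU is available, and the checkpoint name is just one commonly used example:

```python
# Stable Diffusion sketch (assumes `diffusers` and `torch` are installed and a
# CUDA GPU is available; the checkpoint name is one common example).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a photo of a white car parked by a river bank, golden hour"
image = pipe(prompt).images[0]        # text -> PIL image
image.save("white_car.png")
```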
LLaVA is an end-to-end trained large multimodal model that combines a vision encoder with Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities in the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA. https://llava-vl.github.io/
Here is a video presentation: https://www.youtube.com/live/cambXXq9mrs?feature=shared
And its corresponding Jupyter notebook: https://colab.research.google.com/drive/1-AR1OC6Csm4rPoWTM8vM8sFI55nye4l_?usp=sharing
Build a Large Language Model AI Chatbot using Retrieval Augmented Generation (IBM Technology), https://www.youtube.com/watch?v=XctooiH0moI
Build a simple RAG chatbot with LangChain, https://medium.com/credera-engineering/build-a-simple-rag-chatbot-with-langchain-b96b233e1b2a
Build an LLM RAG Chatbot With LangChain, https://realpython.com/build-llm-rag-chatbot-with-langchain/
How to Build a Retrieval-Augmented Generation Chatbot, https://www.anaconda.com/blog/how-to-build-a-retrieval-augmented-generation-chatbot
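As a rough sketch of the retrieve-then-generate pattern these tutorials implement: embed the documents, retrieve the most relevant one for a question, and build a prompt around it. The embedding function below is a stand-in placeholder, and the final LLM call is left as a comment since any provider could be used:

```python
import numpy as np

# Toy corpus and a stand-in embedding function (a real system would use a
# sentence-embedding model and a vector database instead).
documents = [
    "Our store is open 9am-5pm on weekdays.",
    "Refunds are accepted within 30 days of purchase.",
    "We ship internationally to over 40 countries.",
]

def embed(text: str) -> np.ndarray:
    # Hypothetical placeholder: hash characters into a fixed-size vector.
    vec = np.zeros(64)
    for i, ch in enumerate(text.lower()):
        vec[(i + ord(ch)) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

doc_vectors = np.stack([embed(d) for d in documents])

question = "What is the refund policy?"
scores = doc_vectors @ embed(question)
best_doc = documents[int(np.argmax(scores))]          # retrieval step

prompt = f"Answer using only this context:\n{best_doc}\n\nQuestion: {question}"
# Generation step: send `prompt` to any LLM (hosted API or local model).
print(prompt)
```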