Introduction to Large Language Models, Google
https://developers.google.com/machine-learning/resources/intro-llms
What is LLM (Large Language Model)? Amazon
https://aws.amazon.com/what-is/large-language-model/
Large Language Models powered by world-class Google AI, Google
https://cloud.google.com/ai/llms
What is a Large Language Model (LLM), GeeksforGeeks
https://www.geeksforgeeks.org/large-language-model-llm/
Meta
Discover the possibilities with Meta Llama
Intro to Large Language Models, Andrej Karpathy
https://www.youtube.com/watch?v=zjkBMFhNj_g
What Is ChatGPT Doing and Why Does It Work? by Stephen Wolfram
https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/
What Is ChatGPT Doing? Wolfram 1 / 6
https://www.youtube.com/watch?v=HKfn5q-Gbg8&list=PLxn-kpJHbPx2upO5Rm_4qe7_h9IjaQ1Tz
Generative AI course, freeCodeCamp
https://www.freecodecamp.org/news/learn-generative-ai-in/
What Is a Transformer Model? Nvidia
https://blogs.nvidia.com/blog/2022/03/25/what-is-a-transformer-model/
Getting Started with Transformers, GeeksforGeeks
https://www.geeksforgeeks.org/getting-started-with-transformers/
ChatGPT Prompt Engineering for Developers
https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/
Natural Language Processing, DeepLearning.AI
https://www.deeplearning.ai/resources/natural-language-processing/
DeepLearning AI, ChatGPT Prompt Engineering for Developers
https://learn.deeplearning.ai/chatgpt-prompt-eng/lesson/1/introduction
OpenAI
OpenAI Quickstart
https://platform.openai.com/docs/quickstart
How can I use the ChatGPT API?
https://help.openai.com/en/articles/7232945-how-can-i-use-the-chatgpt-api
OpenAI Introduction
https://platform.openai.com/docs/introduction/overview
Best practices for prompt engineering with OpenAI API
https://help.openai.com/en/articles/6654000-best-practices-for-prompt-engineering-with-openai-api
Example Applications with OpenAI
https://platform.openai.com/examples
What are Large Language Models, Machine Learning Mastery
https://machinelearningmastery.com/what-are-large-language-models/
The Surprising Power of Next Word Prediction: Large Language Models Explained, Part 1, Center for Security and Emerging Technology
How to Build an LLM from Scratch: A Step-by-Step Guide
https://blog.spheron.network/how-to-build-an-llm-from-scratch-a-step-by-step-guide
Large Language Models Explained, NVIDIA
https://www.nvidia.com/en-us/glossary/large-language-models/
AI Demystified: Introduction to large language models
https://uit.stanford.edu/service/techtraining/ai-demystified/llm
LLM Course, Hugging Face
https://huggingface.co/learn/llm-course/en/chapter1/1
MIT 6.S191: Recurrent Neural Networks, Transformers, and Attention, Alexander Amini, 2/71
https://www.youtube.com/watch?v=dqoEU9Ac3ek&list=PLtBw6njQRU-rwp5__7C0oIVt26ZgjG9NI&index=2
The Annotated Transformer
https://nlp.seas.harvard.edu/annotated-transformer/
Attention Is All You Need
https://arxiv.org/abs/1706.03762
Added July 14, 2025
What are the basic methods to train an LLM from scratch?
(Gemini)
Training a Large Language Model (LLM) from scratch is a massive undertaking, typically requiring significant computational resources, vast amounts of data, and deep expertise.
However, the fundamental methods involved can be broken down into these core stages:
1. Data Collection and Preparation
This is arguably the most crucial step, as the quality and diversity of your data directly impact the LLM's performance and capabilities.
Gathering Diverse Data: LLMs are trained on enormous text corpora. This includes a wide variety of sources like:
Books (e.g., Project Gutenberg)
Web pages (e.g., Common Crawl)
Articles and academic papers (e.g., arXiv, Wikipedia)
Code repositories (e.g., GitHub)
Conversational data
Other publicly available text data.
Data Cleaning and Preprocessing: Raw data is messy and needs extensive cleaning. This involves:
Deduplication: Removing identical or near-identical text so that repeated content is not over-weighted and training compute is not wasted on it.
Filtering: Eliminating low-quality content, spam, offensive language, and irrelevant information.
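Normalization: Converting text to a consistent format (e.g., fixing encoding issues, normalizing Unicode and whitespace, handling special characters). Lowercasing is sometimes applied, although many modern LLM pipelines preserve case.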
Privacy Redaction: Removing personally identifiable information (PII) or sensitive data.
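The cleaning steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production pipeline: the function names, the `<EMAIL>` placeholder, the email pattern, and the `min_words` threshold are all assumptions made for the example. It only catches exact duplicates (after lowercasing); near-duplicate detection usually needs a fuzzy method such as MinHash, which is not shown.

```python
import hashlib
import re
import unicodedata

# Hypothetical, deliberately simple pattern; real PII redaction is much broader.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def normalize(text: str) -> str:
    """Unicode-normalize, redact email addresses, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    return re.sub(r"\s+", " ", text).strip()

def clean_corpus(docs, min_words=3):
    """Normalize each document, drop very short ones, and remove exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        doc = normalize(doc)
        if len(doc.split()) < min_words:   # crude quality filter
            continue
        digest = hashlib.sha256(doc.lower().encode("utf-8")).hexdigest()
        if digest in seen:                  # duplicate, ignoring case
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

raw = [
    "Hello   world, this is a test.",
    "hello world, this is a TEST.",
    "spam",
    "Contact me at jane.doe@example.com for details.",
]
cleaned = clean_corpus(raw)
```

Here the second document is dropped as a duplicate of the first, "spam" is filtered out as too short, and the email address in the last document is replaced by the placeholder.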
Tokenization: Converting raw text into numerical "tokens" that the model can process. This involves:
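Choosing a Tokenizer: Common subword algorithms include Byte Pair Encoding (BPE), WordPiece, and the Unigram language model, all of which split words into subword units. SentencePiece is a widely used library that implements BPE and Unigram directly on raw text.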
Building a Vocabulary: Creating a dictionary of unique tokens.
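To make the tokenization step concrete, here is a toy sketch of how BPE training could build a vocabulary: start from single characters and repeatedly merge the most frequent adjacent pair. The function names and the tie-breaking rule are assumptions made for this example; real tokenizers also handle bytes, word boundaries, and special tokens, which are omitted here.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Return the learned merge rules and the resulting symbol vocabulary."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so runs are deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        words = merge_pair(words, best)
        merges.append(best)
    vocab = sorted({s for symbols in words for s in symbols})
    return merges, vocab

merges, vocab = train_bpe("low low low lower lowest", num_merges=2)
# merges == [("o", "w"), ("l", "ow")], so "low" becomes a single token
```

After two merges the frequent word "low" is one token, while rarer endings such as "er" and "est" remain split into characters. This is the basic trade-off subword tokenizers make.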
2. Model Architecture Design
Most modern LLMs are built upon the Transformer architecture. Key components include:
Encoder and Decoder Stacks (or Decoder-only for generative LLMs): The Transformer consists of stacked layers of encoders and/or decoders. For generative LLMs (like GPT), a decoder-only architecture is common.
Embedding Layer: Converts input tokens into dense vector representations.
Positional Encoding: Adds information about the order of tokens in a sequence. This is needed because self-attention processes all positions in parallel and is, by itself, insensitive to token order.
Self-Attention Mechanism (Multi-Head Attention): This is the core of the Transformer. It allows the model to weigh the importance of different words in the input sequence when processing each word. "Multi-head" means it does this in parallel with multiple "heads," each focusing on different aspects of the input.
Feed-Forward Networks (MLPs): Simple neural networks applied to each position independently after the attention mechanism.
Normalization Layers (e.g., Layer Normalization, RMSNorm): Help stabilize and accelerate training by normalizing the activations.
Residual Connections: Allow information and gradients to flow through the network more easily, which mitigates vanishing gradients in deep stacks.
Output Layer: Projects the model's final representations to a vector with one score (logit) per vocabulary entry; a softmax turns these scores into a probability distribution over the next token.
Design Choices: Deciding on the number of layers, hidden dimension size (width), number of attention heads, and activation functions.
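As an illustration of the self-attention component in a decoder-only model, the following is a minimal single-head sketch in plain Python. It makes simplifying assumptions: the queries, keys, and values are passed in directly (a real Transformer computes them with learned linear projections of the embeddings), there is one head only, and the function names are invented for this example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors. Returns (outputs, weights), where
    weights[t][s] is how much position t attends to position s.
    """
    d = len(q[0])
    outputs, all_weights = [], []
    for t in range(len(q)):
        # Causal mask: position t may only look at positions 0..t.
        scores = [
            sum(qi * ki for qi, ki in zip(q[t], k[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible value vectors.
        out = [
            sum(w * v[s][j] for s, w in enumerate(weights))
            for j in range(len(v[0]))
        ]
        outputs.append(out)
        all_weights.append(weights + [0.0] * (len(q) - t - 1))
    return outputs, all_weights

# Toy input: three token vectors used as queries, keys, and values alike.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_self_attention(x, x, x)
```

The first position can only attend to itself, so its output equals its own value vector. Later positions mix information from everything before them, which is what lets the model condition each prediction on the preceding context.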
3. Pre-training (Self-Supervised Learning)
This is the initial, computationally intensive phase where the LLM learns the general patterns, grammar, syntax, and semantics of language.
Objective: The most common objective is next-token prediction (also known as causal language modeling), where the model learns to predict the next word or token in a sequence given the preceding ones. This is a self-supervised task because the "labels" (the next tokens) are derived directly from the unannotated input text.
Loss Function: Typically, cross-entropy loss is used to measure the difference between the model's predicted probability distribution for the next token and the actual next token.
Optimization: Optimizers like Adam or AdamW are used to adjust the model's parameters (weights and biases) to minimize the loss function.
Batch Training: Data is fed to the model in small groups (batches) to manage memory and computational resources.
Learning Rate Schedules: The learning rate (how much the model adjusts its weights with each update) is usually varied during training to improve convergence. A common pattern is a short warmup in which the rate increases, followed by a gradual decay.
Computational Resources: This phase requires massive GPU clusters and can take weeks or months for large models.
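The pre-training ingredients above (shifted targets, cross-entropy loss, and a learning rate schedule) can be sketched without any deep learning framework. The function names and all default values (`max_lr`, `warmup_steps`, and so on) are illustrative assumptions, and warmup plus cosine decay is just one common schedule, not a prescribed one.

```python
import math

def next_token_pairs(token_ids):
    """Self-supervised labels: the targets are the inputs shifted by one position."""
    return token_ids[:-1], token_ids[1:]

def cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def lr_at(step, max_lr=3e-4, warmup_steps=100, total_steps=1000, min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

inputs, targets = next_token_pairs([5, 7, 9, 2])
# inputs == [5, 7, 9], targets == [7, 9, 2]

# An untrained model with uniform logits over 4 tokens has loss log(4).
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
```

A useful sanity check when starting a run: the initial loss should be close to the natural log of the vocabulary size, since an untrained model predicts roughly uniformly.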
4. Fine-tuning (Optional, but common for practical LLMs)
After pre-training, the LLM has a general understanding of language. Fine-tuning adapts this general knowledge to specific tasks or desired behaviors. While not strictly "from scratch" if you consider the pre-trained model as a starting point, it's a crucial step in creating a usable LLM.
Supervised Fine-tuning (SFT) / Instruction Tuning: The pre-trained LLM is further trained on a smaller, high-quality dataset of prompt-response pairs. This teaches the model to follow instructions and generate responses aligned with specific tasks (e.g., summarization, question answering, translation).
Reinforcement Learning from Human Feedback (RLHF): This is a more advanced technique to align the LLM's outputs with human preferences and values (e.g., helpfulness, harmlessness, honesty).
Training a Reward Model: Human annotators rank multiple model outputs for a given prompt. This data is used to train a separate "reward model" that learns to assess the quality of responses.
Optimizing the LLM with the Reward Model: The LLM is then further fine-tuned using reinforcement learning algorithms (like PPO) to generate responses that maximize the reward signal from the trained reward model.
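The reward-model step can be illustrated with a small sketch. It assumes a pairwise (Bradley-Terry style) loss, which is a common choice for reward models but is not stated in the text above; the function names are invented for this example, and the reinforcement learning step itself (for example PPO) is not shown.

```python
import math

def rankings_to_pairs(ranked_responses):
    """Turn one human ranking (best first) into (preferred, rejected) pairs."""
    return [
        (ranked_responses[i], ranked_responses[j])
        for i in range(len(ranked_responses))
        for j in range(i + 1, len(ranked_responses))
    ]

def pairwise_reward_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(r_chosen - r_rejected)), computed in a numerically stable way."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

pairs = rankings_to_pairs(["A", "B", "C"])
# pairs == [("A", "B"), ("A", "C"), ("B", "C")]

# The loss is small when the reward model scores the preferred response higher.
good = pairwise_reward_loss(2.0, 0.0)
bad = pairwise_reward_loss(0.0, 2.0)
```

Minimizing this loss pushes the reward model to assign higher scores to responses that annotators preferred, which gives the later reinforcement learning stage a learned signal to optimize.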
5. Evaluation and Optimization
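Metrics: During pre-training, the main quantities monitored are the training and validation loss and the perplexity (the exponential of the average per-token cross-entropy, a measure of how well the model predicts held-out text). Task metrics such as accuracy and F1 score are typically used later, when the model is evaluated on downstream benchmarks.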
Validation Set: A separate dataset is used to evaluate the model's performance during training and detect overfitting.
Regularization: Techniques like dropout and weight decay (an L2-style penalty, built into optimizers such as AdamW) are used to prevent overfitting.
Early Stopping: Training can be halted when performance on the validation set stops improving.
Hyperparameter Tuning: Experimenting with different batch sizes, learning rates, and other training or architectural hyperparameters to find good settings.
Model Optimization (for deployment): Techniques like quantization (reducing precision of weights), pruning (removing unnecessary connections), and distillation (training a smaller model to mimic a larger one) can be used to make the model more efficient for inference.
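Three of the ideas in this section are simple enough to sketch directly: perplexity, early stopping, and weight quantization. The function names, the `patience` rule, and the symmetric per-tensor int8 scheme are assumptions chosen for illustration; practical quantization schemes are usually per-channel or per-group and more involved.

```python
import math

def perplexity(token_losses):
    """exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(sum(token_losses) / len(token_losses))

def should_stop(val_losses, patience=2):
    """True if none of the last `patience` validation losses beat the earlier best."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(loss >= best_before for loss in val_losses[-patience:])

def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [qi * scale for qi in q]

ppl = perplexity([math.log(4)] * 3)            # uniform guessing over 4 tokens
stop = should_stop([1.0, 0.8, 0.9, 0.85])      # no improvement in last 2 checks

weights = [0.5, -1.0, 0.25, 0.01]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

A perplexity of 4 means the model is, on average, as uncertain as if it were choosing uniformly among 4 tokens. For the quantization round trip, each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the precision given up in exchange for storing one byte per weight.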
Training an LLM from scratch is a complex and resource-intensive endeavor. For many applications, leveraging existing pre-trained open-source LLMs and then fine-tuning them is a more practical and cost-effective approach.
Glossary
Added August 5, 2025
Large Language Models for Mathematicians
https://arxiv.org/html/2312.04556v1/#S2
Mathematics of LLMs in Everyday Language
https://www.youtube.com/
Create a Large Language Model from Scratch with Python – Tutorial
Fine Tuning LLM Models – Generative AI Course
https://www.youtube.com/watch?v=iOdFUJiB0Zc
AI Engineer Roadmap – How to Learn AI in 2025, freeCodeCamp
https://www.youtube.com/watch?v=nYXVvK-Wmn0
Algorithmic Trading – Machine Learning & Quant Strategies Course with Python, freeCodeCamp
https://www.youtube.com/watch?v=9Y3yaoi9rUQ
All Machine Learning algorithms explained in 17 min
https://www.youtube.com/watch?v=E0Hmnixke2g
Tutorial point 1/17
https://www.youtube.com/watch?v=E0Hmnixke2g
Towards Data Science
Understanding LLMs from Scratch Using Middle School Math