Artificial intelligence
An artificial intelligence (AI) is a computer system built to perform tasks that normally require human intelligence, such as understanding and generating language.
[C1] A corpus is a collection of texts used to train a language model. It determines the language model's vocabulary and what words it can generate.
If we give the model a word it never saw during training, it will likely output an error: the word doesn't exist in the corpus, so the model stores no information about it and can't suggest what word comes next. A model, big or small, only knows about the words in its corpus.
Large language models (LLMs)
An LLM's corpus is often a combination of texts from various sources like chat rooms, Wikipedia, and novels.
[C1] Small language models (SLMs) may train on just one type of text, like emails or Pablo Neruda poems. Tools like a phone's predictive text and virtual assistants like Siri or Alexa rely on a language model.
Modern email programs often try to predict the user's next word in a sentence via probability: they store a probability for each word that may come next, calculated from the word sequences in the corpus.
E.g., if we start an email with "Thank you for the update...", the next word is likely "on" or "about", not something like "truck".
N-gram models are simple and offer a good way to understand how language models' predictions work. The "N" in an N-gram model's name (a number or a prefix) tells how many words are in the phrases the model works with; all but the last word of such a phrase serve as context to predict that last word. E.g., a "bi"gram model has N = 2, so it uses 2 - 1 = 1 word to predict a phrase's 2nd (and last) word. Another example: an 8-gram model uses 7 words to predict the 8th word.
Overall, an N-gram model uses N - 1 words to predict the Nth word.
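As a minimal sketch (assuming naive whitespace tokenization; the function name is mine, not from the source), the context an N-gram model sees is just the last N - 1 words of the input:

def ngram_context(text, n):
    # An n-gram model only sees the last n - 1 words as context.
    words = text.split()  # naive whitespace tokenization
    return words[-(n - 1):]

print(ngram_context("to be or not to be that is the question", 3))
# ['the', 'question'] -- a trigram model predicts from these 2 words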
A context window is an AI model's "working memory": the number of words the model can process and recall at once.
But it's an issue if an n-gram model's n gets too large. E.g., a 6-gram model can only make a prediction when its 5 context words appeared in exactly that order in the corpus, so its predictions are limited. Thus, a text autocorrector uses a smaller n than a plagiarism detector, as fixing a word's spelling needs less context than finding a plagiarized string of words.
[C1] A bigram model uses 1 word to predict the second word. A bigram is a 2-word sequence, so N - 1 = 1 word is used to predict the last word.
It answers questions like "given this first word, what's a likely next word?". The last word the user gives provides the context for the next word. E.g., if we gave a bigram model a long prompt like "To be or not to be, that's the question," it would use only "question" as context, since it's the last word, and ignore the rest of the prompt. This reflects the Markov assumption, which says that a future word's probability depends only on the current word.
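A minimal sketch of that lookup (the bigram table below is hypothetical, invented just to illustrate):

# Hypothetical bigram table: last word -> {candidate next word: probability}.
bigram_probs = {
    "question": {"is": 0.5, "of": 0.3, "that": 0.2},
    "the": {"sauce": 0.6, "heat": 0.4},
}

def predict_next(prompt):
    # Markov assumption: only the last word of the prompt matters.
    last_word = prompt.lower().split()[-1].strip(".,!?\"")
    candidates = bigram_probs.get(last_word, {})
    return max(candidates, key=candidates.get) if candidates else None

print(predict_next("To be or not to be, that's the question"))  # 'is'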
A trigram model uses 2 words of context to predict a word, since a trigram is a 3-word sequence (N = 3, so 3 - 1 = 2).
Bigram models
[C1] N-gram models base their predictions on the most likely next word. To figure out what that word is, we translate words into numbers: probabilities.
A unigram model ("uni" = 1) produces a prediction without any context. The word a unigram model predicts most often is "the", as it's the most commonly used English word.
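A minimal unigram sketch (the corpus string is hypothetical): with no context, the model simply returns the corpus's most frequent word.

from collections import Counter

corpus = "the cat sat on the mat and the dog sat by the door"  # hypothetical corpus
word_counts = Counter(corpus.split())

def unigram_predict():
    # No context: always return the single most frequent word.
    return word_counts.most_common(1)[0][0]

print(unigram_predict())  # 'the'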
E.g., for a book, the n-gram algorithm first goes through the whole book and counts how many times each word sequence is used.
Consider this list of all bigrams in the phrase "Turn off the heat and stir in the butter and the herbs." from the book (a counting sketch follows the list):
turn + off: 1
off + the: 1
the + heat: 1
heat + and: 1
and + stir: 1
stir + in: 1
in + the: 1
the + butter: 1
butter + and: 1
and + the: 1
the + herbs: 1
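A minimal sketch of that counting step (naive tokenization, with punctuation stripped):

from collections import Counter

phrase = "Turn off the heat and stir in the butter and the herbs."
words = phrase.lower().strip(".").split()

# Count each adjacent word pair (bigram), reproducing the list above.
bigram_counts = Counter(zip(words, words[1:]))
for (first, second), count in bigram_counts.items():
    print(f"{first} + {second}: {count}")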
[C1] Among the 100 most frequent bigrams in that book are:
in + the: 1675
of + the: 1338
to + the: 521
for + to: 485
the + sauce: 469
Based on these counts, the most likely word to come after "the" is "sauce" (the highest-count bigram starting with "the"), and the most likely word to come before "the" is "in" (the highest-count bigram ending with "the").
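A minimal sketch of reading those answers off the counts (using only the five bigrams listed above):

# Bigram counts copied from the list above.
counts = {
    ("in", "the"): 1675,
    ("of", "the"): 1338,
    ("to", "the"): 521,
    ("for", "to"): 485,
    ("the", "sauce"): 469,
}

# Most likely word AFTER "the": best bigram whose first word is "the".
after = max((b for b in counts if b[0] == "the"), key=counts.get)
# Most likely word BEFORE "the": best bigram whose second word is "the".
before = max((b for b in counts if b[1] == "the"), key=counts.get)
print(after[1], before[0])  # sauce in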
[C1] For a prompt ending in "egg", the model ranks the likely next words by how often each one follows "egg" in the corpus: "whites" follows "egg" 34% of the time, and "yolk" 8.2% of the time.
[C1] N-gram models reach their limit in more complex tasks. E.g., an n-gram model can't predict when the cake will be made or who makes it; that takes more complex models like neural networks.
[C1] Pythia 12B (EleutherAI/pythia-12b) is an LLM that uses math to predict the next word in a sequence. E.g., given the phrase "The sky is...", an LLM may have "pickles" in its list of possibilities alongside "blue" or "big". A setting called "temperature" adjusts the predicted words' probabilities based on its value (i.e., it controls how random an LLM's output is).
At max temperature, most models' output is incomprehensible. At min temperature, all randomness is eliminated from the prediction: the model chooses the most likely output each time. This makes the model deterministic, producing the same output for a given input. E.g., we'd set a model's temperature to moderate to write a poem (to balance randomness/creativity with predictability) and to min to write a cover letter (as cover letters are more predictable).
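A minimal sketch of temperature sampling (the scores are hypothetical next-word scores, not Pythia's actual values):

import math, random

# Hypothetical raw scores for the next word after "The sky is".
scores = {"blue": 5.0, "big": 3.5, "pickles": 0.5}

def sample(scores, temperature):
    # Divide by the temperature before normalizing: a low value sharpens
    # the distribution toward the top word, a high value flattens it.
    scaled = {w: s / temperature for w, s in scores.items()}
    total = sum(math.exp(s) for s in scaled.values())
    probs = {w: math.exp(s) / total for w, s in scaled.items()}
    words, weights = zip(*probs.items())
    return random.choices(words, weights=weights)[0]

print(sample(scores, 0.1))  # almost always 'blue' (near-deterministic)
print(sample(scores, 5.0))  # 'pickles' becomes a real possibility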
[C1] A model must go through (be "exposed" to) a set of data many times to become competent. Each full pass through the training data is called an epoch. E.g., the course's example LLM has completed 1 epoch over 800 GB of English text, yet even with that much training data it hasn't learnt how to form words, which words are most common, or how to string words together in a phrase: it chooses the next characters randomly.
LLMs are neural networks. Before training, the predictions stored in them are random, and so is the output. After, say, 1,000 epochs, such a model forms sentences better, though they don't always make sense.
[C1] An LLM's neural network improves each epoch by comparing its predictions to the original data, gradually learning the patterns of English until it can create fully coherent sequences. Each prediction is likely off by some amount; the difference between the predicted and actual values is called the "loss", which shrinks with each epoch.
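A minimal sketch of measuring that loss on a single next-word prediction (cross-entropy, a common choice; the probabilities are hypothetical):

import math

# The model's predicted probabilities for the next word (hypothetical).
predicted = {"blue": 0.6, "big": 0.3, "pickles": 0.1}
actual_next_word = "blue"  # what the training text actually says

# Cross-entropy loss: -log(probability given to the correct word).
# It reaches 0 only if the model predicted the correct word with probability 1.
loss = -math.log(predicted[actual_next_word])
print(round(loss, 3))  # 0.511; training nudges this toward 0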
But even after many epochs, a model's loss won't reach 0, as its output isn't the same text it was trained on. Training a model until its loss reaches 0 is actually undesirable: its predictions would then fit the training data exactly, and it couldn't generate "novel" text. This is called "overfitting": the model replicates the training data's patterns so well that it can't account for new data or create new patterns. Given a new prompt, an overfitted model won't generate novel text; it reproduces text from the training data or emits repetitive strings. So a model must be monitored closely once it starts performing well.
[1] Wikipedia
[1.1] Large language model
[1.2] Small language model
[C1] How AI Works - Brilliant