
Let's build GPT: from scratch, in code, spelled out.


Chapters:

00:00:00 intro: ChatGPT, Transformers, nanoGPT, Shakespeare

baseline language modeling, code setup

00:07:52 reading and exploring the data

00:09:28 tokenization, train/val split

00:14:27 data loader: batches of chunks of data

00:22:11 simplest baseline: bigram language model, loss, generation

00:34:53 training the bigram model

00:38:00 port our code to a script

Building the "self-attention"

00:42:13 version 1: averaging past context with for loops, the weakest form of aggregation

00:47:11 the trick in self-attention: matrix multiply as weighted aggregation

00:51:54 version 2: using matrix multiply

00:54:42 version 3: adding softmax

00:58:26 minor code cleanup

01:00:18 positional encoding

01:02:00 THE CRUX OF THE VIDEO: version 4: self-attention

01:11:38 note 1: attention as communication 

01:12:46 note 2: attention has no notion of space, operates over sets

01:13:40 note 3: there is no communication across batch dimension

01:14:14 note 4: encoder blocks vs. decoder blocks

01:15:39 note 5: attention vs. self-attention vs. cross-attention

01:16:56 note 6: "scaled" self-attention. why divide by sqrt(head_size)

Building the Transformer

01:19:11 inserting a single self-attention block to our network

01:21:59 multi-headed self-attention

01:24:25 feedforward layers of transformer block

01:26:48 residual connections

01:32:51 layernorm (and its relationship to our previous batchnorm)

01:37:49 scaling up the model! creating a few variables. adding dropout

Notes on Transformer

01:42:39 encoder vs. decoder vs. both (?) Transformers

01:46:22 super quick walkthrough of nanoGPT, batched multi-headed self-attention

01:48:53 back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF

01:54:32 conclusions

 

Corrections: 

00:57:00 Oops "tokens from the future cannot communicate", not "past". Sorry! :)

01:20:05 Oops I should be using the head_size for the normalization, not C 

17 Jan 2023 · Neural Networks: Zero to Hero

We build a Generatively Pretrained Transformer (GPT), following the paper "Attention is All You Need" and OpenAI's GPT-2 / GPT-3. We talk about connections to ChatGPT, which has taken the world by storm. We watch GitHub Copilot, itself a GPT, help us write a GPT (meta :D!). I recommend people watch the earlier makemore videos to get comfortable with the autoregressive language modeling framework and basics of tensors and PyTorch nn, which we take for granted in this video.


Links:

- Google colab for the video: https://colab.research.google.com/dri...

- GitHub repo for the video: https://github.com/karpathy/ng-video-...

- Playlist of the whole Zero to Hero series so far:    • The spelled-out i...  

- nanoGPT repo: https://github.com/karpathy/nanoGPT

- my website: https://karpathy.ai

- my twitter: https://twitter.com/karpathy

- our Discord channel: https://discord.gg/3zy8kqD9Cp


Supplementary links:

- Attention is All You Need paper: https://arxiv.org/abs/1706.03762

- OpenAI GPT-3 paper: https://arxiv.org/abs/2005.14165 

- OpenAI ChatGPT blog post: https://openai.com/blog/chatgpt/

- The GPU I'm training the model on is from Lambda GPU Cloud, I think the best and easiest way to spin up an on-demand GPU instance in the cloud that you can ssh to: https://lambdalabs.com . If you prefer to work in notebooks, I think the easiest path today is Google Colab.


Suggested exercises:

- EX1: The n-dimensional tensor mastery challenge: Combine the `Head` and `MultiHeadAttention` into one class that processes all the heads in parallel, treating the heads as another batch dimension (answer is in nanoGPT).

- EX2: Train the GPT on your own dataset of choice! What other data could be fun to blabber on about? (A fun advanced suggestion if you like: train a GPT to do addition of two numbers, i.e. a+b=c. You may find it helpful to predict the digits of c in reverse order, as the typical addition algorithm (that you're hoping it learns) would proceed right to left too. You may want to modify the data loader to simply serve random problems and skip the generation of train.bin, val.bin. You may want to mask out the loss at the input positions of a+b that just specify the problem using y=-1 in the targets (see CrossEntropyLoss ignore_index). Does your Transformer learn to add? Once you have this, swole doge project: build a calculator clone in GPT, for all of +-*/. Not an easy problem. You may need Chain of Thought traces.)

- EX3: Find a dataset that is very large, so large that you can't see a gap between train and val loss. Pretrain the transformer on this data, then initialize with that model and finetune it on tiny shakespeare with a smaller number of steps and lower learning rate. Can you obtain a lower validation loss by the use of pretraining?

- EX4: Read some transformer papers and implement one additional feature or change that people seem to use. Does it improve the performance of your GPT?


 TRANSCRIPT 


PART 1

0:00

hi everyone so by now you have probably heard of ChatGPT it has taken the world and the

0:05

AI Community by storm and it is a system that allows you to interact with an AI

0:10

and give it text-based tasks so for example we can ask chatgpt to write us a small haiku about how important it is

0:17

that people understand Ai and then they can use it to improve the world and make it more prosperous so when we run this

0:23

AI knowledge brings prosperity for all to see Embrace its power okay not bad

0:29

and so you could see that ChatGPT went from left to right and generated all these words sort of sequentially

0:35

now I asked it already the exact same prompt a little bit earlier and it generated a slightly different outcome

0:41

AI is power to grow ignorance holds us back learn prosperity waits

0:47

so uh pretty good in both cases and slightly different so you can see that chatgpt is a probabilistic system and

0:53

for any one prompt it can give us multiple answers sort of replying to it

0:58

now this is just one example of a prompt people have come up with many many examples and there are entire websites

1:03

that index interactions with ChatGPT and so many of them are quite humorous

1:09

explain HTML to me like I'm a dog write release notes for chess 2. write a note

1:15

about Elon Musk buying Twitter and so on so as an example please write a breaking

1:21

news article about a leaf falling from a tree uh in a shocking turn of events a leaf

1:26

has fallen from a tree in the local park Witnesses report that the leaf which was previously attached to a branch of a tree detached itself and

1:33

fell to the ground very dramatic so you can see that this is a pretty remarkable system and it is what we call a language

1:40

model because it it models the sequence of words or characters or tokens more

1:47

generally and it knows how sort of words follow each other in English language and so from its perspective what it is

1:54

doing is it is completing the sequence so I give it the start of a sequence and

1:59

it completes the sequence with the outcome and so it's a language model in that sense now I would like to focus on the under

2:06

the hood of um under the hood components of what makes chat GPT work so what is the

2:12

neural network under the hood that models the sequence of these words and that comes from this paper called

2:18

attention is all you need in 2017 a landmark paper in AI

2:24

that produced and proposed the Transformer architecture so GPT is short for

2:31

Generatively Pretrained Transformer so the Transformer is the neural net that actually does all the heavy lifting

2:37

under the hood it comes from this paper in 2017. now if you read this paper this

2:42

reads like a pretty random machine translation paper and that's because I think the authors didn't fully anticipate the impact that the

2:49

Transformer would have on the field and this architecture that they produced in the context of machine translation in

2:55

their case actually ended up taking over the rest of AI in the next five years after and so this architecture with

3:02

minor changes was copy pasted into a huge amount of applications in AI in

3:07

more recent years and that includes at the core of chat GPT now we are not going to what I'd like to

3:15

do now is I'd like to build out something like chatgpt but we're not going to be able to of course reproduce

3:20

chatgpt this is a very serious production grade system it is trained on a good chunk of internet and then

3:28

there's a lot of pre-training and fine-tuning stages to it and so it's very complicated what I'd like to focus

3:34

on is just to train a Transformer based language model and in our case it's

3:39

going to be a character-level language model I still think that is very educational with respect to how

3:45

these systems work so I don't want to train on the chunk of Internet we need a smaller data set in this case I propose

3:51

that we work with my favorite toy data set it's called tiny Shakespeare and what it is is basically it's a

3:58

concatenation of all of the works of Shakespeare in my understanding and so this is all of Shakespeare in a single

4:04

file this file is about one megabyte and it's just all of Shakespeare

4:09

and what we are going to do now is we're going to basically model how these characters follow each other so for

4:15

example given a chunk of these characters like this are given some context of characters in

4:21

the past the Transformer neural network will look at the characters that I've highlighted and is going to predict that

4:27

g is likely to come next in the sequence and it's going to do that because we're going to train that Transformer on

4:33

Shakespeare and it's just going to try to produce uh character sequences that look like this

4:39

and in that process is going to model all the patterns inside this data so once we've trained the system I'd just

4:45

like to give you a preview we can generate infinite Shakespeare and of course it's a fake thing that looks kind

4:52

of like Shakespeare um apologies for there's some junk that I'm

4:59

not able to resolve in in here but um you can see how this is going character

5:05

by character and it's kind of like predicting Shakespeare like language so verily my Lord the sights have left the

5:13

again the king coming with my curses with precious pale and then tronio says

5:19

something else Etc and this is just coming out of the Transformer in a very similar manner as it would come out in

5:25

ChatGPT in our case character by character in ChatGPT it's coming out

5:31

on the token by token level and tokens are these a sort of like little sub word pieces so they're not Word level they're

5:37

kind of like word chunk level um and now the I've already written this

5:43

entire code to train these Transformers um and it is in a GitHub repository that

5:49

you can find and it's called nanoGPT so nanoGPT is a repository that you can

5:54

find on my GitHub and it's a repository for training Transformers um On Any Given text

6:01

and what I think is interesting about it because there's many ways to train Transformers but this is a very simple implementation so it's just two files of

6:08

300 lines of code each one file defines the GPT model the Transformer and one

6:13

file trains it on some given text data set and here I'm showing that if you train it on the OpenWebText data set

6:19

which is a fairly large data set of web pages then I reproduce the performance of GPT-2

6:26

so GPT-2 is an early version of OpenAI's GPT from 2017 if I recall correctly and

6:33

I've only so far reproduced the the smallest 124 million parameter model but basically this is just proving that the

6:39

code base is correctly arranged and I'm able to load the neural network weights

6:44

that open AI has released later so you can take a look at the finished code here in Nano GPT but what I would

6:51

like to do in this lecture is I would like to basically write this repository from scratch so we're going to begin

6:57

with an empty file and we're going to define a Transformer piece by piece

7:03

we're going to train it on the tiny Shakespeare data set and we'll see how we can then generate infinite

7:09

Shakespeare and of course this can copy paste to any arbitrary Text data set that you like but my goal really here is

7:16

to just make you understand and appreciate how under the hood chat GPT works and really all that's required is

7:23

a Proficiency in Python and some basic understanding of calculus and statistics

7:29

and it would help if you also see my previous videos on the same YouTube channel in particular my make more

7:36

series where I Define smaller and simpler neural

7:41

network language models so multilayer perceptrons and so on it really introduces the language modeling

7:47

framework and then here in this video we're going to focus on the Transformer neural network itself

reading and exploring the data

7:52

okay so I created a new Google Colab Jupyter notebook here and this will

7:57

allow me to later easily share this code that we're going to develop together with you so you can follow along so this

8:03

will be in the video description later now here I've just done some preliminaries I downloaded the data set

8:09

the tiny Shakespeare data set at this URL and you can see that it's about a one megabyte file then here I open the input.txt file and

8:17

just read in all the text as a string and we see that we are working with 1 million characters roughly

8:23

and the first 1000 characters if we just print them out are basically what you would expect this is the first 1000

8:28

characters of the tiny Shakespeare data set roughly up to here so so far so good next we're going to

8:36

take this text and the text is a sequence of characters in Python so when I call the set Constructor on it I'm

8:43

just going to get the set of all the characters that occur in this text

8:48

and then I call list on that to create a list of those characters instead of just a set so that I have an ordering an

8:54

arbitrary ordering and then I sort that so basically we get just all the

8:59

characters that occur in the entire data set and they're sorted now the number of them is going to be our vocabulary size

9:05

these are the possible elements of our sequences and we see that when I print here the characters

9:12

there's 65 of them in total there's a space character and then all kinds of special characters

9:18

and then capitals and lowercase letters so that's our vocabulary and that's the sort of like possible characters that

9:25

the model can see or emit okay so next we would like to develop some strategy to tokenize the input text

tokenization, train/val split

9:33

now when people say tokenize they mean convert the raw text as a string to some

9:39

sequence of integers according to some codebook according to some vocabulary of possible elements

9:45

so as an example here we are going to be building a character level language model so we're simply going to be

9:50

translating individual characters into integers so let me show you a chunk of code that

9:55

sort of does that for us so we're building both the encoder and the decoder and let me just talk through

10:01

What's Happening Here when we encode an arbitrary text like hi there we're going to receive a list of

10:08

integers that represents that string so for example 46 47 Etc

10:14

and then we also have the reverse mapping so we can take this list and decode it to get back the exact same

10:21

string so it's really just like a translation to integers and back for an arbitrary string and for us it is done on a

10:28

character level now the way this was achieved is we just iterate over all the characters here and

10:34

create a lookup table from the character to the integer and vice versa and then to encode some string we simply

10:40

translate all the characters individually and to decode it back we use the reverse mapping and concatenate it all back together.
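A rough sketch of this character-level encode/decode in code (assuming text holds the tiny Shakespeare string loaded above):

    # build the vocabulary of all characters that occur in the text
    chars = sorted(list(set(text)))
    stoi = { ch:i for i,ch in enumerate(chars) }   # character -> integer
    itos = { i:ch for i,ch in enumerate(chars) }   # integer -> character
    encode = lambda s: [stoi[c] for c in s]            # string -> list of integers
    decode = lambda l: ''.join([itos[i] for i in l])   # list of integers -> string

    print(encode("hii there"))
    print(decode(encode("hii there")))   # round-trips back to "hii there"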

10:46

now this is only one of many possible encodings or many possible sort of

10:51

tokenizers and it's a very simple one but there's many other schemas that people have come up with in practice so

10:57

for example Google uses SentencePiece so SentencePiece will also encode text into integers but in a different

11:05

schema and using a different vocabulary and sentence piece is a sub word sort of

11:12

tokenizer and what that means is that you're not encoding entire words but you're not also encoding individual

11:18

characters it's it's a sub word unit level and that's usually what's adopted

11:23

in practice for example also OpenAI has this library called tiktoken that uses a byte pair encoding tokenizer

11:31

um and that's what GPT uses and you can also just encode words into like hello world into a list of integers

11:38

so as an example I'm using the tiktoken library here I'm getting the encoding for GPT-2 or

11:44

that was used for GPT-2 instead of just having 65 possible characters or tokens they have about 50,000

11:51

tokens and so when they encode the exact same string hi there we only get a list of

11:57

three integers but those integers are not between 0 and 64, they are between 0 and 50,256.
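A small sketch of the tiktoken usage being described here (assuming the tiktoken package is installed):

    import tiktoken
    enc = tiktoken.get_encoding("gpt2")          # the BPE tokenizer used for GPT-2
    print(enc.n_vocab)                           # 50257 tokens
    print(enc.encode("hii there"))               # a short list of integers
    print(enc.decode(enc.encode("hii there")))   # back to "hii there"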

12:06

so basically you can trade off the code book size and the sequence lengths so

12:12

you can have a very long sequences of integers with very small vocabularies or you can have a short

12:18

um sequences of integers with very large vocabularies and so typically people use

12:25

in practice the sub word encodings but I'd like to keep our tokenizer very simple so we're using character level

12:31

tokenizer and that means that we have very small code books we have very simple encode

12:36

and decode functions but we do get very long sequences as a result but that's

12:42

the level at which we're going to stick with this lecture because it's the simplest thing okay so now that we have an encoder and a decoder effectively a

12:49

tokenizer we can tokenize the entire training set of Shakespeare so here's a chunk of code that does that

12:55

and I'm going to start to use the pytorch library and specifically the torch.tensor from the pytorch library

13:01

so we're going to take all of the text in tiny Shakespeare encode it and then wrap it into a torch.tensor to get the

13:08

data tensor so here's what the data tensor looks like when I look at just the first 1000 characters or the 1000

13:14

elements of it so we see that we have a massive sequence of integers and this sequence of integers here is basically an

13:21

identical translation of the first 1000 characters here so I believe for example that zero is a

13:27

new line character and maybe one is a space not 100 sure but from now on the

13:33

entire data set of text is re-represented as just it just stretched out as a single very large uh sequence

13:38

of integers let me do one more thing before we move on here I'd like to separate out our

13:43

data set into a train and a validation split so in particular we're going to take the first 90% of the data set and

13:51

consider that to be the training data for the Transformer and we're going to withhold the last 10 percent at the end

13:56

of it to be the validation data and this will help us understand to what extent our model is overfitting so we're going

14:03

to basically hide and keep the validation data on the side because we don't want just a perfect memorization

14:08

of this exact Shakespeare we want a neural network that sort of creates Shakespeare like text and so it should

14:15

be fairly likely for it to produce the actual like stowed away uh true

14:22

Shakespeare text and so we're going to use this to get a sense of the overfitting.
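Roughly, the data tensor and the 90/10 split being described look like this (reusing encode and text from above):

    import torch
    data = torch.tensor(encode(text), dtype=torch.long)  # the whole dataset as one long sequence of integers

    n = int(0.9*len(data))   # first 90% will be train, the rest val
    train_data = data[:n]
    val_data = data[n:]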

data loader: batches of chunks of data

14:28

we would like to start plugging these text sequences or integer sequences into the Transformer so that it can train and

14:34

learn those patterns now the important thing to realize is we're never going to actually feed the

14:39

entire text into Transformer all at once that would be computationally very expensive and prohibitive so when we

14:45

actually train a Transformer on a lot of these data sets we only work with chunks of the data set and when we train the

14:51

Transformer we basically sample random little chunks out of the training set and train them just chunks at a time and

14:57

these chunks have basically some kind of a length and as a maximum length now the maximum

15:04

length typically at least in the code I usually write is called block size you can find it under different

15:10

names like context length or something like that let's start with the block size of just eight and let me look at

15:16

the first train data characters the first block size plus one characters I'll explain why plus one in a second

15:23

so this is the first nine characters in the sequence in the training set

15:29

now what I'd like to point out is that when you sample a chunk of data like this so say that these nine characters

15:34

out of the training set this actually has multiple examples packed into it

15:39

and that's because all of these characters follow each other and so what this thing is going to say

15:46

when we plug it into a Transformer is we're going to actually simultaneously train it to make prediction at every one

15:52

of these positions now in the in a chunk of nine characters there's actually eight individual

15:59

examples packed in there so there's the example that when 18 is in the context 47 likely comes

16:07

next in the context of 18 and 47 56 comes next in the context of 18 47 and 56 57

16:14

can come next and so on so that's the eight individual examples let me

16:20

actually spell it out with code so here's a chunk of code to illustrate X are the inputs to the Transformer it

16:27

will just be the first block size characters y will be the next block size characters

16:33

so it's offset by one and that's because y are the targets for each position in the input

16:41

and then here I'm iterating over all the block size of 8 and the context is always all the characters in x up to t

16:49

and including t and the target is always the t-th character but in the targets array y.
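Spelled out in code, the chunk of block_size plus one characters and the eight context/target pairs hiding inside it look roughly like this:

    block_size = 8
    x = train_data[:block_size]       # inputs: the first block_size characters
    y = train_data[1:block_size+1]    # targets: the same sequence offset by one
    for t in range(block_size):
        context = x[:t+1]             # everything up to and including the t-th character
        target = y[t]                 # the character that follows
        print(f"when input is {context.tolist()} the target is {target}")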

16:56

so let me just run this and basically it spells out what I've said in words these are the eight

17:02

examples hidden in a chunk of nine characters that we uh sampled from the

17:08

training set I want to mention one more thing we train on all the eight examples here

17:14

with context between one all the way up to context of block size and we train on that not just for

17:20

computational reasons because we happen to have the sequence already or something like that it's not just done for efficiency it's also done to make

17:28

the Transformer Network be used to seeing contexts all the way from as little as one all the way to block size

17:35

and we'd like the transform to be used to seeing everything in between and that's going to be useful later during

17:41

inference because while we're sampling we can start the sampling generation with as little as one character of

17:46

context and the Transformer knows how to predict the next character with all the way up to just one context of one and so

17:53

then it can predict everything up to block size and after block size we have to start truncating because the

17:58

Transformer will never receive more than block size inputs when it's predicting the next character

18:04

Okay so we've looked at the time dimension of the tensors that are going to be feeding into the Transformer there's one more Dimension to care about

18:10

and that is the batch dimension and so as we're sampling these chunks of text we're going to be actually every time

18:17

we're going to feed them into a Transformer we're going to have many batches of multiple chunks of text that are all like stacked up in a single

18:23

tensor and that's just done for efficiency just so that we can keep the gpus busy because they are very good at

18:29

parallel processing of um of data and so we just want to process multiple chunks all at the same

18:36

time but those chunks are processed completely independently they don't talk to each other and so on so let me

18:42

basically just generalize this and introduce a batch Dimension here's a chunk of code let me just run it and then I'm going to

18:48

explain what it does so here because we're going to start sampling random locations in the data

18:55

set to pull chunks from I am setting the seed so that um in the random number generator so

19:01

that the numbers I see here are going to be the same numbers you see later if you try to reproduce this now the back size here is how many

19:07

independent sequences we are processing every forward backward pass of the Transformer

19:13

the block size as I explained is the maximum context length to make those predictions so let's say batch size 4 block size 8 and

19:21

then here's how we get batch for any arbitrary split if the split is a training split then we're going to

19:26

look at train data otherwise at val data that gets us the data array and then

19:33

when I Generate random positions to grab a chunk out of I actually grab I actually generate

19:39

batch size number of random offsets so because this is four ix is

19:46

going to be four numbers that are randomly generated between 0 and len of data minus block size so it's just

19:52

random offsets into the training set and then X's as I explained are the

19:58

first block size characters starting at I the Y's are the offset by one of that so

20:06

just add plus one and then we're going to get those chunks for every one of integers I in IX and

20:13

use a torch.stack to take all those one-dimensional tensors as we saw here

20:20

and we're going to stack them up as rows and so they all become a row in a four by eight tensor.
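Putting this together, the batching code being described is roughly:

    torch.manual_seed(1337)
    batch_size = 4   # how many independent sequences we process in parallel
    block_size = 8   # maximum context length for the predictions

    def get_batch(split):
        # sample batch_size random chunks of length block_size (+1 for the shifted targets)
        data = train_data if split == 'train' else val_data
        ix = torch.randint(len(data) - block_size, (batch_size,))
        x = torch.stack([data[i:i+block_size] for i in ix])
        y = torch.stack([data[i+1:i+block_size+1] for i in ix])
        return x, y

    xb, yb = get_batch('train')   # xb and yb are both (4, 8) tensors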

20:27

so here's what I'm printing when I sample a batch xb and yb

20:34

the inputs to the Transformer now the input x is the four by eight tensor

20:41

four uh rows of eight columns and each one of these is a chunk of the

20:47

training set and then the targets here are in the associated array Y and they will come in

20:54

through the Transformer all the way at the end to create the loss function so

20:59

they will give us the correct answer for every single position inside X and then these are the four independent

21:07

rows so spelled out as we did before this four by eight array contains a

21:15

total of 32 examples and they're completely independent as far as the Transformer is concerned

21:21

uh so when the input is 24 the target is 43 or rather

21:27

43 here in the y array when the input is 24 43 the target is 58.

21:32

when the input is 24 43 58 the target is 5 etc or like when it is 52 58 1 the

21:39

target is 58. right so you can sort of see this spelled out these are the 32 independent

21:46

examples packed in to a single batch of the input X and then the desired targets

21:51

are in y and so now this integer tensor of X is

21:59

going to feed into the Transformer and that Transformer is going to simultaneously process all these

22:04

examples and then look up the correct um integers to predict in every one of these positions in the tensor y okay so

simplest baseline: bigram language model, loss, generation

22:12

now that we have our batch of input that we'd like to feed into a Transformer let's start basically feeding this into

22:17

neural networks now we're going to start off with the simplest possible neural network which in the case of language

22:22

modeling in my opinion is the bigram language model and we've covered the bigram language model in my makemore series in a lot of depth and so

22:30

here I'm going to sort of go faster and let's just implement the pytorch module directly that implements the bigram

22:36

language model so I'm importing the pytorch and then module

22:41

uh for reproducibility and then here I'm constructing a bigram language model which is a subclass of nn

22:47

module and then I'm calling it and I'm passing in the inputs and the targets

22:53

and I'm just printing now when the inputs and targets come here you see that I'm just taking the index the

22:59

inputs X here which I rename to idx and I'm just passing them into this token embedding table

23:06

so what's going on here is that here in the Constructor we are creating a token embedding table

23:11

and it is of size vocab size by vocab size and we're using nn.embedding which is a

23:18

very thin wrapper around basically a tensor of shape vocab size by vocab size

23:24

and what's happening here is that when we pass idx here every single integer in our input is going to refer to this

23:30

embedding table and is going to pluck out a row of that embedding table corresponding to its index so 24 here

23:38

we'll go to the embedding table and we'll pluck out the 24th row and then 43 will go here and pluck out the 43rd row

23:45

Etc and then Pi torch is going to arrange all of this into a batch by Time by Channel tensor in this case batch is

23:53

4 time is 8 and C which is the channels is vocab size or 65. and so we're just

24:01

going to pluck out all those rows arrange them in a b by T by C and now we're going to interpret this as the

24:07

logits which are basically the scores for the next character in the sequence

24:12

and so what's happening here is we are predicting what comes next based on just the individual identity of a single

24:19

token and you can do that because um I mean currently the tokens are not talking to each other and they're not

24:25

seeing any context except for they're just seeing themselves so I'm a I'm a token number five and then I can

24:32

actually make pretty decent predictions about what comes next just by knowing that I'm token five because some characters certainly follow other

24:40

characters in in typical scenarios so we saw a lot of this in a lot more depth in

24:45

the make more series and here if I just run this then we currently get the predictions the scores the logits for

24:53

every one of the four by eight positions now that we've made predictions about what comes next we'd like to evaluate

24:58

the loss function and so in make more series we saw that a good way to measure a loss or like a quality of the

25:04

predictions is to use the negative log likelihood loss which is also implemented in pytorch under the name

25:10

cross entropy so what we'd like to do here is loss is the cross entropy on the

25:17

predictions and the targets and so this measures the quality of the logits with respect to the Targets in other words we

25:24

have the identity of the next character so how well are we predicting the next character based on the logits and

25:30

intuitively the correct um the correct dimension of logits uh

25:36

depending on whatever the target is should have a very high number and all the other dimensions should be very low number right

25:42

now the issue is that this won't actually this is what we want we want to basically output the logits and the loss

25:51

this is what we want but unfortunately uh this won't actually run we get an error message but intuitively

25:58

we want to measure this now when we go to the pi torch cross entropy

26:04

a documentation here um we're trying to call the cross entropy in its functional form so that means we

26:11

don't have to create like a module for it but here when we go to the documentation

26:16

you have to look into the details of how PyTorch expects these inputs and basically the issue here is PyTorch

26:22

expects if you have multi-dimensional input which we do because we have a b by T by C tensor then it actually really

26:29

wants the channels to be the second dimension here

26:34

so if you um so basically it wants a b by C by T instead of a b by T by C

26:42

and so it's just the details of how pytorch treats um these kinds of inputs and so we don't

26:49

actually want to deal with that so what we're going to do instead is we need to basically reshape our logits so here's

26:54

what I like to do I like to take basically give names to the dimensions so logits.shape is B by T by C and

27:01

unpack those numbers and then let's say that logits equals logits.view

27:07

and we want it to be B times T by C so just a two-dimensional array

27:13

right so we're going to take all the we're going to take all of these um

27:18

positions here and we're going to uh stretch them out in a one-dimensional sequence and preserve the channel Dimension as

27:25

the second dimension so we're just kind of like stretching out the array so it's two-dimensional and in that case it's going to better

27:32

conform to what pi torch sort of expects in its dimensions now we have to do the same to targets

27:38

because currently targets are of shape B by T and we want it to be

27:45

just B times T so one dimensional now alternatively you could always still just do -1 because PyTorch will guess

27:53

what this should be if you want to lay it out but let me just be explicit and say B times T once we've reshaped this it will match

28:00

the cross entropy case and then we should be able to evaluate our loss
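Inside the forward pass, the reshaping being described looks roughly like this (logits is B by T by C and targets is B by T):

    import torch.nn.functional as F

    B, T, C = logits.shape
    logits = logits.view(B*T, C)    # stretch batch and time out into one dimension
    targets = targets.view(B*T)     # same for the targets
    loss = F.cross_entropy(logits, targets)   # negative log likelihood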

28:06

okay so with that we can now compute the loss and we currently see that the

28:12

loss is 4.87 now because our we have 65 possible

28:17

vocabulary elements we can actually guess at what the loss should be and in particular

28:22

we covered negative log likelihood in a lot of detail we are expecting the log

28:28

of 1 over 65 and negative of that

28:33

so we're expecting the loss to be about 4.17 but we're getting 4.87 and so

28:39

that's telling us that the initial predictions are not super diffuse they've got a little bit of entropy and

28:44

so we're guessing wrong but yes

28:50

we are able to evaluate the loss okay so now that we can evaluate the quality of the model on some data we'd like to also

28:57

be able to generate from the model so let's do the generation now I'm going to go again a little bit faster here

29:03

because I covered all this already in previous videos so

29:08

here's a generate function for the model so we take some uh we take the the same

29:15

kind of input idx here and basically this is the current context of some

29:22

characters in a batch in some batch so it's also B by T and the job of

29:28

generate is to basically take this B by T and extend it to be B by T plus one plus two plus three and so it's just

29:34

basically it contains the generation in all the batch dimensions in the time dimension So that's its job and we'll do that for

29:41

Max new tokens so you can see here on the bottom there's going to be some stuff here but on the bottom whatever is predicted is

29:48

concatenated on top of the previous idx along the First Dimension which is the time Dimension to create a b by T plus

29:55

one so that becomes the new idx so the job of generators to take a b by T and make

30:00

it a b by T plus one plus two plus three as many as we want maximum tokens so

30:05

this is the generation from the model now inside the generation what we're what are we doing we're taking the

30:11

current indices we're getting the predictions so we get those are in the

30:16

logits and then the loss here is going to be ignored because um we're not we're not using that and we

30:22

have no targets that are sort of ground truth targets that we're going to be comparing with

30:28

then once we get the logits we are only focusing on the last step so instead of

30:33

a b by T by C we're going to pluck out the negative one the last element in the

30:38

time dimension because those are the predictions for what comes next so that this is the logits which we then

30:44

convert to probabilities via softmax and then we use torch.multinomial to sample from those probabilities and we

30:51

ask PyTorch to give us one sample and so idx next will become a B by one

30:56

because in each one of the batch Dimensions we're going to have a single prediction for what comes next so this

31:03

num samples equals one will make this be a one and then we're going to take those

31:08

integers that come from the sampling process according to the probability distribution given here and those integers got just concatenated

31:15

on top of the current sort of like running stream of integers and this gives us a B by T plus one.
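Putting the steps just described together, the generate function looks roughly like this:

    def generate(self, idx, max_new_tokens):
        # idx is a (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            logits, loss = self(idx)            # get the predictions
            logits = logits[:, -1, :]           # focus only on the last time step -> (B, C)
            probs = F.softmax(logits, dim=-1)   # convert to probabilities
            idx_next = torch.multinomial(probs, num_samples=1)   # sample one token per row -> (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)              # append to the running sequence -> (B, T+1)
        return idx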

31:21

and then we can return that now one thing here is you see how I'm calling self of idx which will end up going to

31:29

the forward function I'm not providing any Targets So currently this would give an error because targets is uh is uh

31:36

sort of like not given so target has to be optional so targets is none by default and then if targets is none then

31:44

there's no loss to create so it's just loss is none but else all of this

31:50

happens and we can create a loss so this will make it so um if we have the targets we provide them

31:57

and get a loss if we have no targets we'll just get the logits so this here will generate from the

32:03

model um and let's take that for a ride now

32:09

oops so I have another code chunk here which will generate for the model from the model and okay this is kind of crazy so

32:16

maybe let me let me break this down so these are the idx right

32:24

I'm creating a batch will be just one time will be just one so I'm creating a little one by one

32:31

tensor and it's holding a zero and the D type the data type is integer

32:37

so 0 is going to be how we kick off the generation and remember that zero is uh

32:42

is the element standing for a new line character so it's kind of like a reasonable thing to to feed in as the

32:48

very first character in a sequence to be the new line um so it's going to be idx which we're

32:55

going to feed in here then we're going to ask for 100 tokens and then enter generate will continue

33:00

that now because uh generate works on the level of batches we then have to index

33:07

into the zeroth row to basically pluck out the single batch dimension that exists

33:14

and then that gives us a um time steps it's just a one-dimensional

33:20

array of all the indices which we will convert to simple python list from pytorch tensor so that that can

33:28

feed into our decode function and convert those integers into text

33:33

so let me bring this back and we're generating 100 tokens let's run and uh here's the generation that we

33:41

achieved so obviously it's garbage and the reason it's garbage is because this is a totally random model so next up

33:47

we're going to want to train this model now one more thing I wanted to point out here is this function is written to be General

33:53

but it's kind of like ridiculous right now because we're feeding in all this we're building

33:59

out this context and we're concatenating it all and we're always feeding it all

34:04

into the model but that's kind of ridiculous because this is just a simple bigram model

34:09

so to make for example this prediction about K we only needed this W but actually what we fed into the model is

34:16

we fed the entire sequence and then we only looked at the very last piece and predicted k

34:22

so the only reason I'm writing it in this way is because right now this is a bigram model but I'd like to keep this

34:28

function fixed and I'd like it to work later when our character is actually

34:34

basically look further in the history and so right now the history is not used so this looks silly but eventually the

34:41

history will be used and so that's why we want to do it this way so just a quick comment on that so now we see that

34:48

this is um random so let's train the model so it becomes a bit less random okay let's Now train the model so first

training the bigram model

34:55

what I'm going to do is I'm going to create a pytorch optimization object so here we are using the optimizer

35:02

AdamW now in the makemore series we've only ever used stochastic gradient descent

35:07

the simplest possible Optimizer which you can get using the SGD instead but I want to use Adam which is a much more

35:13

advanced and popular Optimizer and it works extremely well for a typical good

35:18

setting for the learning rate is roughly 3e-4 but for very very small networks like is the case here you can

35:25

get away with much much higher learning rates like 1e-3 or even higher probably but let me create the optimizer object

35:32

which will basically take the gradients and update the parameters using the gradients

35:37

and then here our batch size up above was only four so let me actually use something bigger

35:42

let's say 32 and then for some number of steps um we are sampling a new batch of data

35:48

we're evaluating the loss we're zeroing out all the gradients from the previous step getting the gradients for all the

35:55

parameters and then using those gradients to update our parameters so a typical training loop as we saw in the makemore series.
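A sketch of that training loop (assuming m is the bigram model instance and get_batch is defined as above):

    optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

    batch_size = 32
    for steps in range(10000):
        xb, yb = get_batch('train')               # sample a batch of data
        logits, loss = m(xb, yb)                  # evaluate the loss
        optimizer.zero_grad(set_to_none=True)     # clear gradients from the previous step
        loss.backward()                           # backpropagate to get new gradients
        optimizer.step()                          # use the gradients to update the parameters
    print(loss.item())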

36:01

so let me now run this for say 100 iterations and let's see

36:07

what kind of losses we're gonna get so we started around 4.7

36:13

and now we're going to down to like 4.6 4.5 Etc so the optimization is definitely

36:18

happening but um let's uh sort of try to increase the number of iterations and only print at

36:25

the end because we probably will not train for longer okay so we're down to 3.6 roughly

36:35

roughly down to three

36:41

this is the most janky optimization

36:47

okay it's working let's just do ten thousand and then from here we want to copy this

36:55

and hopefully we're going to get something reasonable and of course it's not going to be Shakespeare from a bigram model but at least we see

37:01

that the loss is improving and hopefully we're expecting something a bit more reasonable

37:07

okay so we're down there about 2.5 ish let's see what we get okay dramatic improvements certainly on what

37:14

we had here so let me just increase the number of tokens okay so we see that we're starting to

37:20

get something at least like reasonable ish um

37:26

certainly not Shakespeare but the model is making progress so that is the simplest possible model

37:34

so now what I'd like to do is obviously that this is a very simple

37:39

model because the tokens are not talking to each other so given the previous context of whatever was generated we're

37:45

only looking at the very last character to make the predictions about what comes next so now these uh now these tokens

37:50

have to start talking to each other and figuring out what is in the context so that they can make better predictions

37:56

for what comes next and this is how we're going to kick off the Transformer okay so next I took the code that we

port our code to a script

38:02

developed in this Jupyter notebook and I converted it to be a script and I'm doing this because I just want to

38:08

simplify our intermediate work into just the final product that we have at this point so in the top here I put all the hyper

38:15

parameters that we've defined I introduced a few and I'm going to speak to that in a little bit otherwise a lot

38:20

of this should be recognizable reproducibility read data get the encoder in the decoder

38:26

create the training test splits I use the uh kind of like data loader that

38:32

gets a batch of the inputs and targets this is new and I'll talk about it in a

38:37

second now this is the bigram language model that we developed and it can forward and give us logits and a loss

38:44

and it can generate and then here we are creating the optimizer and this is the training Loop

38:51

so everything here should look pretty familiar now some of the small things that I added number one I added the

38:58

ability to run on a GPU if you have it so if you have a GPU then you can this will use Cuda instead of just CPU and

39:05

everything will be a lot faster now when device becomes cuda then we need to make sure that when we load the

39:11

data we move it to device when we create the model we want to move

39:16

the model parameters to device so as an example here we have the NN embedding table and it's got a double

39:23

weight inside it which stores the sort of lookup table so that would be moved to the GPU so that all the calculations

39:30

here happen on the GPU and they can be a lot faster and then finally here when I'm creating

39:35

the context that feeds into generate I have to make sure that I create on the device number two what I introduced is

39:43

the fact that here in the training loop I was just printing the loss dot

39:50

item inside the training Loop but this is a very noisy measurement of the current loss because every batch will be more or

39:57

less lucky and so what I want to do usually is I have an estimate loss function and the

40:04

estimated loss basically then goes up here and it averages up the loss over

40:11

multiple batches so in particular we're going to iterate eval_iters times and we're going to

40:18

basically get our loss and then we're going to get the average loss for both splits and so this will be a lot less

40:23

noisy so here what we call the estimate loss we're going to report the pretty

40:28

accurate train and validation loss now when we come back up you'll notice a

40:34

few things here I'm setting the model to evaluation phase and down here I'm resetting it back to training phase

40:40

now right now for our model as is this this doesn't actually do anything because the only thing inside this model

40:46

is this nn.embedding and um this this network would behave both

40:52

would behave the same in both evaluation mode and training mode we have no dropout layers we have no

40:58

batchnorm layers etc but it is a good practice to think through what mode your neural network is in because some layers

41:05

will have different Behavior at inference time or training time and

41:10

there's also this context manager torch.no_grad and this is just telling PyTorch that everything that happens

41:16

inside this function we will not call backward on and so PyTorch can be a

41:21

lot more efficient with its memory use because it doesn't have to store all the intermediate variables because we're

41:27

never going to call backward and so it can be a lot more memory efficient in that way so it's also good

41:32

practice to tell PyTorch when we don't intend to do backpropagation.
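A sketch of the estimate_loss helper being described (assuming model, eval_iters and get_batch are defined in the script):

    @torch.no_grad()   # we never call backward() in here, so PyTorch can skip storing intermediates
    def estimate_loss():
        out = {}
        model.eval()   # switch to evaluation mode (a no-op for this model, but good practice)
        for split in ['train', 'val']:
            losses = torch.zeros(eval_iters)
            for k in range(eval_iters):
                X, Y = get_batch(split)
                logits, loss = model(X, Y)
                losses[k] = loss.item()
            out[split] = losses.mean()   # average over many batches for a less noisy estimate
        model.train()  # back to training mode
        return out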

41:41

so right now the script is about 120 lines of code and that's kind of our starter code I'm calling it bigram.py and I'm

41:47

going to release it later now running this script gives us output in the terminal and it looks something like

41:53

this it basically as I ran this code it was giving me the train loss and Val loss

41:59

and we see that we converge to somewhere around 2.5 with the bigram model and then here's

42:05

the sample that we produced at the end and so we have everything packaged up in

42:10

the script and we're in a good position now to iterate on this okay so we are almost ready to start writing our very

version 1: averaging past context with for loops, the weakest form of aggregation

42:16

first self-attention block for processing these tokens now before we actually get there I want

42:24

to get you used to a mathematical trick that is used in the self attention inside a Transformer and is really just

42:29

like at the heart of an efficient implementation of self-attention and so

42:34

I want to work with this toy example you just get used to this operation and then it's going to make it much more clear

42:40

once we actually get to um to it in the script again

42:45

so let's create a B by T by C tensor where B T and C are just 4 8 and 2 in this toy example and these are basically channels

42:52

and we have batches and we have the time component and we have some information at each point in the sequence so C

43:01

now what we would like to do is we would like these um tokens so we have up to eight tokens here in a batch and these

43:09

eight tokens are currently not talking to each other and we would like them to talk to each other we'd like to couple them

43:14

and in particular we don't we we want to couple them in a very specific way so

43:20

the token for example at the fifth location it should not communicate with tokens in the sixth seventh and eighth

43:26

location because those are future tokens in the sequence the token on the fifth location should

43:32

only talk to the one in the fourth third second and first so it's only so information only flows

43:38

from previous context to the current timestamp and we cannot get any information from the future because we

43:43

are about to try to predict the future so what is the easiest way for tokens to

43:49

communicate okay the easiest way I would say is okay if we are up to if we're a

43:55

fifth token and I'd like to communicate with my past the simplest way we can do that is to just do a weight is to just

44:01

do an average of all the um of all the preceding elements so for example if I'm

44:07

the fifth token I would like to take the channels that make up that are

44:12

information at my step but then also the channels from the fourth step third step second step and first step I'd like

44:18

to average those up and then that would become sort of like a feature Vector that summarizes me in the context of my

44:24

history now of course just doing a sum or like an average is an extremely weak form of

44:30

interaction like this communication is extremely lossy we've lost a ton of information about the spatial Arrangements of all those tokens but

44:37

that's okay for now we'll see how we can bring that information back later for now what we would like to do is

44:43

for every single batch element independently for every t-th token in that sequence

44:48

we'd like to now calculate the average of all the vectors in all the previous

44:54

tokens and also at this token so let's write that out

44:59

um I have a small snippet here and instead of just fumbling around let me just copy paste it and talk to it

45:06

so in other words we're going to create X and bow is short for backup words

45:12

because backup words is um is kind of like um a term that people use when you are just

45:18

averaging up things so it's just a bag of words basically there's a word stored on every one of these eight locations

45:24

and we're doing a bag of words such as averaging so in the beginning we're going to say that it's just initialized at Zero and

45:30

then I'm doing a for Loop here so we're not being efficient yet that's coming but for now we're just iterating over

45:36

all the batch Dimensions independently iterating over time and then the previous tokens are at this

45:44

batch dimension and then everything up to and including the t-th token okay

45:50

so when we slice out x in this way xprev becomes of shape

45:56

um how many T elements there were in the past and then of course C so all the two

46:02

dimensional information from these tokens so that's the previous sort of chunk of

46:08

um tokens from my current sequence and then I'm just doing the average or

46:13

the mean over the zeroth dimension so I'm averaging out the time here and I'm just going to get a little C

46:20

one-dimensional vector which I'm going to store in xbow.
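The version 1 loop being described, roughly:

    torch.manual_seed(1337)
    B, T, C = 4, 8, 2           # batch, time, channels
    x = torch.randn(B, T, C)

    # version 1: for every batch element and position t, average x over all positions <= t
    xbow = torch.zeros((B, T, C))
    for b in range(B):
        for t in range(T):
            xprev = x[b, :t+1]                 # (t+1, C): everything up to and including the t-th token
            xbow[b, t] = torch.mean(xprev, 0)  # average over time -> (C,)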

46:28

so I can run this and this is not going to be very informative but let's see so this is x at 0 so this is

46:33

the zeroth batch element and then xbow at 0 now

46:38

you see how the at the first location here you see that the two are equal and

46:44

that's because it's we're just doing an average of this one token but here this one is now an average of

46:50

these two and now this one is an average of these three

46:55

and so on so uh and this last one is the average of all of these elements so vertical

47:03

average just averaging up all the tokens now gives this outcome here

47:08

so this is all well and good but this is very inefficient now the trick is that we can be very very efficient about

the trick in self-attention: matrix multiply as weighted aggregation

47:14

doing this using matrix multiplication so that's the mathematical trick and let me show you what I mean let's work with

47:21

the toy example here let me run it and I'll explain I have a simple Matrix here that is a

47:28

three by three of all ones a matrix B of just random numbers and it's a three by two

47:33

and a matrix C which will be three by three multiply three by two which will give out a three by two

47:39

so here we're just using um matrix multiplication so a multiply B gives us C

47:46

okay so how are these numbers in C achieved right so this number in the top

47:54

left is the first row of a DOT product with the First Column of B

48:00

and since all the the row of a right now is all just once then the dot product here with with this

48:06

column of B is just going to do a sum of these of this column so 2 plus 6 plus 6

48:11

is 14. the element here and the output of C is also the first column here the first row

48:17

of a multiplied now with the second column of B so 7 plus 4 plus plus 5 is

48:23

16. now you see that there's repeating elements here so this 14 again is because this row is again all once and

48:29

it's multiplying the First Column of B so we get 14. and this one is and so on

48:35

so this last number here is the last row dot product last column

48:40

now the trick here is uh the following this is just a boring number of

48:46

um it's just a boring array of all ones but torch has this function called tril

48:51

which is short for a triangular uh something like that and you can wrap

48:57

it around torch.ones and it will just return the lower triangular portion of this

49:02

okay so now it will basically zero out uh these guys here so we just get the lower

49:08

triangular part well what happens if we do that

49:15

so now we'll have a like this and B like this and now what are we getting here in C

49:20

well what is this number well this is the first row times the First Column and

49:25

because this is zeros uh these elements here are now ignored so we just get a two

49:32

and then this number here is the first row times the second column and because these are zeros they get ignored and

49:38

it's just seven the seven multiplies this one but look what happened here because this

49:44

is one and then zeros we what ended up happening is we're just plucking out the row of this row of B and that's what we

49:50

got now here we have 1 1 0. so here one one

49:57

zero dot product with these two columns will now give us two plus six which is eight and seven plus four which is 11.

50:03

and because this is one one one we ended up with the addition of all of them

50:08

and so basically depending on how many ones and zeros we have here we are basically doing a sum currently of a

50:16

variable number of these rows and that gets deposited into C So currently we're doing sums because

50:23

these are ones but we can also do average right and you can start to see how we could do average of the rows of B

50:30

uh sort of in an incremental fashion because we don't have to we can basically normalize these rows so that

50:37

they sum to one and then we're going to get an average so if we took a and then we did a equals

50:43

a divided by torch.sum

50:48

of a in dimension 1 and then keepdim as

50:55

true so therefore the broadcasting will work out so if I rerun this you see now that

51:01

these rows now sum to one so this row is one this row is 0.5 0.5 0 and here we get

51:07

one thirds and now when we do a multiply B what are we getting here we are just getting the first row

51:14

first row here now we are getting the average of the first two rows

51:21

okay so 2 and 6 average is four and four and seven average is 5.5 and on the bottom here we are now

51:28

getting the average of these three rows so the average of all of elements of B

51:33

are now deposited here and so you can see that by manipulating these uh elements of this multiplying

51:41

Matrix and then multiplying it with any given Matrix we can do these averages in

51:47

this incremental fashion because we just get um and we can manipulate that based on the

51:53

elements of a. Okay, so that's very convenient.
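For reference, here is a minimal, runnable sketch of the toy example described above (seed and values are illustrative):

import torch

torch.manual_seed(42)
a = torch.tril(torch.ones(3, 3))          # lower triangular matrix of ones
a = a / torch.sum(a, 1, keepdim=True)     # normalize each row so it sums to 1
b = torch.randint(0, 10, (3, 2)).float()  # arbitrary data
c = a @ b                                 # row i of c is the average of the first i+1 rows of b
print(a); print(b); print(c)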

version 2: using matrix multiply

51:59

So let's swing back up here and see how we can vectorize this and make it much more efficient using what we've learned. In particular, we are going to produce an array a, but

52:07

here I'm going to call it wei, short for weights, but this is our a,

52:12

and this is how much of every row we want to average up and it's going to be an average because you can see it in

52:18

these rows sum to 1. so this is our a and then our B in this

52:23

example of course is X so it's going to happen here now is that

52:29

we are going to have an xbow2, and this xbow2 is going to be wei

52:35

multiplying our x. So let's think this through: wei is T by

52:42

T and this is Matrix multiplying in pi torch a b by T by C

52:48

and it's giving us uh the what shape so pytorch will come here and then we'll see that these

52:53

shapes are not the same so it will create a batch Dimension here and this is a batched matrix multiply

53:00

and so it will apply this matrix multiplication in all the batch elements in parallel

53:06

and individually and then for each batch element there will be a t by T multiplying T by C exactly as we had

53:13

below so this will now create B by T by C

53:21

and xbow2 will now become identical to xbow,

53:26

so we can see that torch.allclose

53:31

of xbow and xbow2 should be true now.

53:37

so this kind of shows us that these are in fact the same.

53:42

So xbow and xbow2, if I just print them, uh okay we're not going to be able to

53:49

okay we're not going to be able to just stare it down but um

53:55

well let me try xbow just at the zeroth element and xbow2 at the zeroth element, so just the first batch element, and we should see that this and that

54:02

should be identical which they are right so what happened here the trick is

54:07

we were able to use batched Matrix multiply to do this uh aggregation really and

54:14

it's a weighted aggregation, and the weights are specified in this T by T array,

54:21

and we're basically doing weighted sums and uh these weighted sums are according

54:27

to the weights inside here they take on sort of this triangular form

54:32

and so that means that a token at the t-th position will only get information from the tokens

54:39

preceding it so that's exactly what we want and finally I would like to rewrite it in one more way
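Before moving on, here is a minimal sketch of version 2 as just described, checked against the loop-based version 1 (shapes and seed are illustrative):

import torch

torch.manual_seed(1337)
B, T, C = 4, 8, 2
x = torch.randn(B, T, C)

# version 1: averaging past context with for loops
xbow = torch.zeros((B, T, C))
for b in range(B):
    for t in range(T):
        xbow[b, t] = x[b, :t+1].mean(0)

# version 2: the same aggregation as a (batched) matrix multiply
wei = torch.tril(torch.ones(T, T))
wei = wei / wei.sum(1, keepdim=True)   # rows sum to one, so each row averages a prefix
xbow2 = wei @ x                        # (T, T) @ (B, T, C) -> (B, T, C) via broadcasting
print(torch.allclose(xbow, xbow2, atol=1e-6))   # expected: True (up to floating point tolerance)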

version 3: adding softmax

54:45

and we're going to see why that's useful so this is the third version and it's also identical to the first and second

54:51

but let me talk through it. It uses softmax. So tril here is this matrix of lower

55:00

triangular ones, and wei begins as all zeros,

55:06

okay, so if I just print wei in the beginning it's all zeros. Then I use

55:12

masked_fill, so what this is doing is wei.masked_fill: it's all zeros and

55:18

I'm saying, for all the elements where tril is equal to zero, make them be

55:24

negative infinity. So all the elements where tril is zero will become negative infinity now,

55:30

so this is what we get and then the final one here is softmax

55:37

so if I take a soft Max along every single so dim is negative one so along every single row

55:42

if I do a soft Max what is that going to do well softmax is um

55:50

it's also like a normalization operation right and so spoiler alert you get the exact

55:56

same Matrix let me bring back the softmax and recall that in softmax we're going

56:03

to exponentiate every single one of these and then we're going to divide by the sum and so for if we exponentiate every

56:10

single element here we're going to get a one and here we're going to get uh basically zero zero zero zero zero

56:15

everywhere else and then when we normalize we just get one here we're going to get 1 1 and then

56:22

zeros, and then softmax will again divide and this will give us 0.5 0.5 and so on,

56:28

and so this is also the uh the same way to produce this mask

56:33

now the reason that this is a bit more interesting, and the reason we're going to end up using it in self-attention,

56:38

is that these weights here begin uh with zero

56:43

and you can think of this as like an interaction strength or like an affinity so basically it's telling us how much of

56:51

each token from the past do we want to Aggregate and average up

56:57

and then this line is saying tokens from the future cannot communicate: by setting

57:02

them to negative Infinity we're saying that we will not aggregate anything from those tokens

57:08

and so basically this then goes through softmax and through the weighted and this is the aggregation through matrix multiplication

57:14

and so what this is now is you can think of these as um these zeros are currently just set by

57:21

us to be zero but a quick preview is that these affinities between the tokens

57:26

are not going to be just constant at zero they're going to be data dependent these tokens are going to start looking

57:32

at each other and some tokens will find other tokens more or less interesting and depending on what their values are

57:39

they're going to find each other interesting to different amounts and I'm going to call those affinities I think

57:45

and then here we are saying the future cannot communicate with the past we're going to clamp them

57:51

and then when we normalize and sum we're going to aggregate sort of their values depending on how interesting they find

57:57

each other and so that's the preview for self-attention and basically long story

58:03

short from this entire section is that you can do weighted aggregations of your past elements

58:09

by using matrix multiplication with a lower triangular matrix,


PART 2

58:15

and then the elements here in the lower triangular part are telling you how much of each element fuses into this position
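A minimal sketch of version 3, the softmax formulation described above:

import torch
import torch.nn.functional as F

T = 8
tril = torch.tril(torch.ones(T, T))
wei = torch.zeros((T, T))                        # affinities start at zero (later: data dependent)
wei = wei.masked_fill(tril == 0, float('-inf'))  # the future cannot be aggregated from
wei = F.softmax(wei, dim=-1)                     # exponentiate and normalize every row
print(wei)                                       # the same lower triangular averaging weights as before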

58:23

so we're going to use this trick now to develop the self-attention block so first let's get some quick preliminaries

minor code cleanup

58:28

out of the way first the thing I'm kind of bothered by is that you see how we're passing in vocab size into the Constructor there's

58:35

no need to do that because vocab size has already defined up top as a global variable so there's no need to pass this

58:40

stuff around next one I want to do is I don't want to actually create I want to create like a

58:46

level of indirection here, where we don't directly go to the embedding for the logits but instead we go through this

58:53

intermediate phase, because we're going to start making that bigger. So let me introduce a new variable n_embd,

59:01

short for number of embedding dimensions. So n_embd here

59:06

will be, say, 32. That was a suggestion from GitHub Copilot, by the way; it also suggested

59:12

32, which is a good number. So this is an embedding table with only 32-dimensional embeddings,

59:19

so then here this is not going to give us logits directly instead this is going to give us token embeddings

59:25

that's what I'm going to call it and then to go from the token embeddings to the logits we're going to need a linear layer so self.lm head let's call it

59:34

short for language modeling head, is nn.Linear from n_embd up to vocab_size,

59:39

and then when we swing over here we're actually going to get the logits by exactly what the copilot says

59:45

now we have to be careful here because this C and this C are not equal: this one is n_embd and this one is vocab

59:53

size. So let's just say that n_embd is equal to C

59:58

and then this just creates one spurious layer of indirection through a linear layer, but this should basically run.

1:00:12

so we see that this runs and uh this currently looks kind of spurious but we're going to build on top of this now

positional encoding

1:00:19

next up: so far we've taken these indices and we've encoded them based on the identity of the tokens inside idx.

1:00:28

the next thing that people very often do is that we're not just encoding the identity of these tokens but also their

1:00:33

position. So we're going to have a second position embedding table here, so self.position

1:00:39

embedding table is an nn.Embedding of block_size by n_embd, and so each position from 0 to

1:00:45

block_size minus 1 will also get its own embedding vector. And then here, first let me decode B by

1:00:52

T from idx.shape, and then here we're also going to have a pos_emb, which is the positional

1:00:58

embedding, and this is torch.arange, so this will be basically just integers from 0 to T minus 1,

1:01:05

and all of those integers from 0 to T minus 1 get embedded through the table to create a t by C

1:01:12

and then here this gets renamed to just say x and x will be

1:01:17

the addition of the token embeddings with the positional embeddings and here the broadcasting note will work

1:01:23

out so B by T by C plus T by C this gets right aligned a new dimension of one

1:01:28

gets added and it gets broadcasted across batch so at this point x holds not just the

1:01:35

token identities but the positions at which these tokens occur and this is currently not that useful

1:01:41

because of course we just have a simple bigram model, so it doesn't matter if you're in the fifth position, the second position, or wherever; it's all

1:01:47

translation invariant at this stage, so this information currently wouldn't help. But as we work on the self-attention

1:01:53

block we'll see that this starts to matter
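Putting the last two changes together, here is a rough sketch of the model at this stage (vocab_size, block_size and n_embd values are assumed, matching the video's setup; loss and generation are elided):

import torch
import torch.nn as nn

vocab_size, block_size, n_embd = 65, 8, 32   # assumed values

class BigramLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)   # language modeling head

    def forward(self, idx):
        B, T = idx.shape
        tok_emb = self.token_embedding_table(idx)                 # (B, T, n_embd): token identities
        pos_emb = self.position_embedding_table(torch.arange(T))  # (T, n_embd): token positions
        x = tok_emb + pos_emb         # broadcasting: (B, T, n_embd) + (T, n_embd)
        logits = self.lm_head(x)      # (B, T, vocab_size)
        return logits

model = BigramLanguageModel()
print(model(torch.randint(0, vocab_size, (4, block_size))).shape)   # torch.Size([4, 8, 65])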

1:01:59

okay so now we get the Crux of self-attention so this is probably the most important part of this video to

THE CRUX OF THE VIDEO: version 4: self-attention

1:02:04

understand we're going to implement a small self-attention for a single individual head as they're called

1:02:11

so we start off with where we were so all of this code is familiar so right now I'm working with an example

1:02:17

where I change the number of channels from 2 to 32 so we have a 4x8 arrangement of tokens and each and the

1:02:25

information at each token is currently 32 dimensional but we just are working with random numbers

1:02:31

now we saw here that the code as we had it before does a

1:02:36

simple weight a simple average of all the past tokens and the current token so

1:02:42

it's just the previous information and current information is just being mixed together in an average and that's what this code currently

1:02:48

achieves and it does so by creating this lower triangular structure which allows us to mask out this weight Matrix that

1:02:56

we create so we mask it out and then we normalize it and currently when we initialize the

1:03:04

affinities between all the different sort of tokens or nodes I'm going to use those terms interchangeably

1:03:10

so when we initialize the affinities between all the different tokens to be zero then we see that way gives us this

1:03:17

structure where every single row has these um uniform numbers and so that's what

1:03:23

that's what then uh in this Matrix multiply makes it so that we're doing a simple average

1:03:29

now we don't actually want this to be All Uniform because different uh tokens

1:03:36

will find different other tokens more or less interesting and we want that to be data dependent so for example if I'm a

1:03:42

vowel then maybe I'm looking for consonants in my past and maybe I want to know what those consonants are and I

1:03:48

want that information to Flow To Me and so I want to now gather information from the past but I want to do it in a

1:03:55

data dependent way and this is the problem that self-attention solves now the way self-attention solves this

1:04:00

is the following every single node or every single token at each position will

1:04:06

emit two vectors it will emit a query and it will emit a

1:04:11

key now the query Vector roughly speaking is what am I looking for

1:04:18

and the key Vector roughly speaking is what do I contain and then the way we get affinities

1:04:24

between these tokens now in a sequence is we basically just do a DOT product

1:04:30

between the keys and the queries so my query dot products with all the

1:04:35

keys of all the other tokens and that dot product now becomes way

1:04:42

and so um if the key and the query are sort of aligned they will interact to a

1:04:48

very high amount and then I will get to learn more about that specific token as

1:04:53

opposed to any other token in the sequence. So let's implement this now.

1:05:01

we're going to implement a single what's called head of self-attention

1:05:07

so this is just one head there's a hyper parameter involved with these heads which is the head size

1:05:13

and then here I'm initializing the linear modules and I'm using bias equals false so these are just going to apply a

1:05:19

matrix multiply with some fixed weights and now let me produce a

1:05:24

key and a query, k and q, by forwarding these modules on x.

1:05:30

So the size of this will now become B by T by 16, because that is the head

1:05:37

size, and the same here: B by T by 16,

1:05:45

so this being that size so you see here that when I forward this linear on top of my X all the tokens in

1:05:52

all the positions in the B by T Arrangement all of them in parallel and independently produce a key and a query

1:05:59

so no communication has happened yet but the communication comes now all the

1:06:04

queries will dot product with all the keys so basically what we want is we want way

1:06:10

now or the affinities between these to be query multiplying key

1:06:16

but we have to be careful with uh we can't Matrix multiply this we actually need to transpose uh K but we have to be

1:06:22

also careful because these are when you have the batch Dimension so in particular we want to transpose uh the

1:06:29

last two Dimensions Dimension negative one and dimension negative two so negative 2 negative 1.

1:06:37

and so this Matrix multiplied now will basically do the following B by T by 16

1:06:45

Matrix multiplies B by 16 by T to give us

1:06:50

B by T by T right

1:06:55

so for every row of B we're now going to have a T by T matrix giving us the

1:07:01

affinities and these are now the way so they're not zeros they are now coming

1:07:06

from this dot product between the keys and the queries so this can now run I can I can run this

1:07:13

and the weighted aggregation now is a function in a data dependent manner between the keys and queries of these

1:07:19

nodes. So just inspecting what happened here, wei takes on this form,

1:07:26

and you see that before, wei was just a constant, so it was applied in the same way to all the batch elements, but now

1:07:33

every single batch elements will have different sort of way because uh every single batch element contains different

1:07:39

tokens at different positions, and so this is now data dependent. So when we look at just the zeroth row,

1:07:47

for example in the input these are the weights that came out and so you can see now that they're not just exactly

1:07:52

uniform and in particular as an example here for the last row this was the eighth token

1:07:59

and the eighth token knows what content it has and it knows at what position it's in and now the eighth token based on that

1:08:06

creates a query hey I'm looking for this kind of stuff I'm a vowel I'm on the

1:08:12

eighth position I'm looking for any consonants at positions up to four and then all the nodes get to emit keys

1:08:19

and maybe one of the channels could be I am a I am a consonant and I am in a position up to four

1:08:25

and that key would have a high number in that specific Channel and that's how the query and the key when they dot product

1:08:31

they can find each other and create a high affinity and when they have a high Affinity like say this token was pretty interesting to

1:08:39

uh to this eighth token when they have a high Affinity then through the soft Max I will end up

1:08:45

aggregating a lot of its information into my position and so I'll get to learn a lot about it

1:08:52

now just this we're looking at way after this has already happened

1:08:58

um let me erase this operation as well so let me erase the masking and the softmax just to show you the under the hood

1:09:04

internals and how that works so without the masking in the softmax way comes out like this right this is

1:09:11

the outputs of the dot products and these are the raw outputs and they take on values from negative you know

1:09:16

two to positive two Etc so that's the raw interactions and raw

1:09:22

affinities between all the nodes but now if I'm a if I'm a fifth node I will not want to aggregate anything from

1:09:28

the six node seventh node and the eighth node so actually we use the upper triangular masking so those are not

1:09:35

allowed to communicate and now we actually want to have a nice uh distribution so we don't want to

1:09:43

aggregate negative 0.11 of this node that's crazy so instead we exponentiate and normalize and now we get a nice

1:09:49

distribution that sums to one, and this is telling us now, in a data dependent manner, how much information to aggregate from any of these tokens in

1:09:57

the past so that's way and it's not zeros anymore

1:10:02

but but it's calculated in this way now there's one more uh part to a single

1:10:08

self-attention head and that is that when you do the aggregation we don't actually aggregate the tokens exactly we

1:10:15

aggregate we produce one more value here and we call that the value

1:10:21

so in the same way that we produced the key and query, we're also going to create a value, and then

1:10:27

here we don't aggregate X we calculate a v which is just

1:10:34

achieved by propagating this linear on top of X again and then we

1:10:40

output way multiplied by V so V is the elements that we aggregate or the the

1:10:46

vector that we aggregate instead of the raw X and now of course this will make it so

1:10:51

that the output here of the single head will be 16 dimensional because that is the head size

1:10:58

so you can think of X as kind of like a private information to this token if you if you think about it that way so X is

1:11:04

kind of private to this token so I'm a fifth token at some and I have some identity and my information is kept in

1:11:11

Vector X and now for the purposes of the single head here's what I'm interested in

1:11:17

here's what I have and if you find me interesting here's what I will communicate to you and

1:11:23

that's stored in v and so V is the thing that gets aggregated for the purposes of this single head between the different nodes

1:11:31

and that's uh basically the self attention mechanism this is this is what it does
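Here is a minimal sketch of the single head just described (the sqrt(head_size) scaling is discussed in note 6 below and is omitted here, as at this point in the walkthrough):

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(1337)
B, T, C = 4, 8, 32
x = torch.randn(B, T, C)

head_size = 16
key   = nn.Linear(C, head_size, bias=False)
query = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

k = key(x)                                        # (B, T, head_size): what each token contains
q = query(x)                                      # (B, T, head_size): what each token is looking for
wei = q @ k.transpose(-2, -1)                     # (B, T, T): data dependent affinities
tril = torch.tril(torch.ones(T, T))
wei = wei.masked_fill(tril == 0, float('-inf'))   # future tokens cannot be attended to
wei = F.softmax(wei, dim=-1)
v = value(x)                                      # (B, T, head_size): what each token communicates
out = wei @ v                                     # (B, T, head_size)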

note 1: attention as communication

1:11:38

there are a few notes that I would make like to make about attention number one attention is a communication mechanism

1:11:44

you can really think about it as a communication mechanism where you have a number of nodes in a directed graph

1:11:50

where basically you have edges pointing between those like this and what happens is every node has some

1:11:57

Vector of information and it gets to aggregate information via a weighted sum from all the nodes that point to it

1:12:04

and this is done in a data dependent manner so depending on whatever data is actually stored at each node at any point in time

1:12:11

now our graph doesn't look like this our graph has a different structure we have

1:12:16

eight nodes because the block size is eight and there's always eight tokens and the first node is only pointed to by

1:12:24

itself the second node is pointed to by the first node and itself all the way up to the eighth node which is pointed to

1:12:30

by all the previous nodes and itself and so that's the structure that our directed graph has or happens happens to

1:12:37

have in an autoregressive sort of scenario like language modeling, but in principle attention can be applied to

1:12:43

any arbitrary directed graph and it's just a communication mechanism between the nodes the second note is that notice that

note 2: attention has no notion of space, operates over sets

1:12:49

there is no notion of space so attention simply acts over like a set of vectors

1:12:54

in this graph and so by default these nodes have no idea where they are positioned in a space and that's why we

1:12:59

need to encode them positionally and sort of give them some information that is anchored to a specific position so

1:13:05

that they sort of know where they are and this is different than for example from convolution because if you run for

1:13:12

example a convolution operation over some input there's a very specific sort of layout of the information in space in

1:13:18

the convolutional filters sort of act in space and so it's it's not like an

1:13:24

attention in attention is just a set of vectors out there in space they communicate and if you want them to have

1:13:29

a notion of space, you need to specifically add it, which is what we've done when we calculated the

1:13:36

positional encodings and added that information to the vectors. The next thing that I hope is very clear

note 3: there is no communication across batch dimension

1:13:41

is that the elements across the batch dimension, which are independent examples, never talk to each other; they're always

1:13:47

processed independently, and this is a batched matrix multiply that applies basically a matrix multiplication kind

1:13:53

of in parallel across the batch Dimension so maybe it would be more accurate to say that in this analogy of

1:13:58

a directed graph we really have because the batch size is four we really have four separate pools of eight nodes and

1:14:05

those eight nodes only talk to each other but in total there's like 32 nodes that are being processed but there's um

1:14:11

sort of four separate pools of eight you can look at it that way the next note is that here in the case

note 4: encoder blocks vs. decoder blocks

1:14:18

of language modeling uh we have this specific structure of directed graph where the future tokens will not

1:14:24

communicate to the Past tokens but this doesn't necessarily have to be the constraint in the general case and in

1:14:30

fact in many cases you may want to have all of the nodes talk to each other fully so as an example if you're doing

1:14:37

sentiment analysis or something like that with a Transformer you might have a number of tokens and you may want to

1:14:42

have them all talk to each other fully because later you are predicting for example the sentiment of the sentence

1:14:48

and so it's okay for these nodes to talk to each other and so in those cases you will use an

1:14:54

encoder block of self-attention and all it means that it's an encoder block is

1:14:59

that you will delete this line of code allowing all the nodes to completely talk to each other what we're

1:15:05

implementing here is sometimes called a decoder block and it's called a decoder because it is sort of like a decoding

1:15:13

language, and it's got this autoregressive format where you have to mask with the triangular matrix so that nodes

1:15:21

from the future never talk to the Past because they would give away the answer and so basically in encoder blocks you

1:15:27

would delete this allow all the nodes to talk in decoder blocks this will always be present so that you have this

1:15:33

triangular structure but both are allowed and attention doesn't care attention supports arbitrary connectivity between nodes

note 5: attention vs. self-attention vs. cross-attention

1:15:39

the next thing I wanted to comment on is you keep me you keep hearing me say attention self-attention Etc there's

1:15:45

actually also something called cross attention what is the difference so basically the reason this attention is

1:15:53

self-attention is because the keys queries and the values are all coming

1:15:58

from the same source, from x. So the same source x produces keys, queries, and

1:16:03

values so these nodes are self-attending but in principle attention is much more

1:16:09

General than that so for example an encoder decoder Transformers uh you can have a case where the queries are

1:16:15

produced from X but the keys and the values come from a whole separate external source and sometimes from

1:16:21

encoder blocks that encode some context that we'd like to condition on and so the keys and the values will actually

1:16:27

come from a whole separate Source those are nodes on the side and here we're just producing queries and we're reading

1:16:33

off information from the side so cross attention is used when there's a separate source of nodes we'd like to

1:16:41

pull information from into our nodes and it's self-attention if we just have nodes that would like to look at each

1:16:46

other and talk to each other so this attention here happens to be self-attention

1:16:52

but in principle attention is a lot more general.
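As a hedged illustration only (this is not implemented in the video), cross-attention differs from the self-attention above just in where the keys and values come from:

import torch
import torch.nn as nn
import torch.nn.functional as F

B, T_dec, T_enc, C, head_size = 4, 8, 12, 32, 16
x = torch.randn(B, T_dec, C)          # decoder-side tokens: these produce the queries
context = torch.randn(B, T_enc, C)    # a separate source (e.g. encoder output): keys and values

query = nn.Linear(C, head_size, bias=False)
key   = nn.Linear(C, head_size, bias=False)
value = nn.Linear(C, head_size, bias=False)

q = query(x)                                        # (B, T_dec, head_size)
k = key(context)                                    # (B, T_enc, head_size)
v = value(context)                                  # (B, T_enc, head_size)
wei = F.softmax(q @ k.transpose(-2, -1), dim=-1)    # (B, T_dec, T_enc); typically no causal mask here
out = wei @ v                                       # (B, T_dec, head_size)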

note 6: "scaled" self-attention. why divide by sqrt(head_size)

1:16:59

Okay, the last note at this stage: if we come to the Attention Is All You Need paper, here we've already implemented attention, so given query, key and value

1:17:04

we've multiplied the query on the key we've softmaxed it and then we are

1:17:09

aggregating the values there's one more thing that we're missing here which is the dividing by one over square root of the head size

1:17:16

the dk here is the head size. Why are they doing this? Why is it important? So

1:17:21

they call it a scaled attention and it's kind of like an important normalization to basically have

1:17:27

the problem is, if you have unit gaussian inputs, so zero mean, unit variance, then k and q are unit gaussian, and if you just compute

1:17:34

wei naively, then you see that the variance of wei will be on the order of head size, which in our

1:17:40

case is 16. but if you multiply by one over head size square root so this is square root

1:17:46

and this is 1 over, then the variance of wei will be one, so it will be preserved.

1:17:52

Now why is this important? You'll notice that wei here will feed into softmax,

1:17:59

and so it's really important, especially at initialization, that wei be fairly diffuse.

1:18:04

So in our case here we sort of lucked out and wei had fairly diffuse

1:18:10

numbers here, like this. Now the problem is that, because of softmax, if wei takes on

1:18:17

very positive and very negative numbers inside it softmax will actually converge towards one hot vectors and so I can

1:18:25

illustrate that here um say we are applying softmax to a tensor

1:18:30

of values that are very close to zero then we're going to get a diffuse thing out of softmax but the moment I take the exact same

1:18:36

thing and I start sharpening it making it bigger by multiplying these numbers by eight for example you'll see that the

1:18:42

soft Max will start to sharpen and in fact it will sharpen towards the max so it will sharpen towards whatever number

1:18:48

here is the highest and so um basically we don't want these values to be too extreme especially the

1:18:54

initialization otherwise softmax will be way too peaky and you're basically aggregating

1:19:00

um information from like a single node every node just Aggregates information from a single other node that's not what

1:19:05

we want especially its initialization and so the scaling is used just to control the variance at initialization
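A small sketch of both points above: the variance of the scores with scaling, and how softmax sharpens as its inputs grow (the example values are illustrative):

import torch
import torch.nn.functional as F

B, T, head_size = 4, 8, 16
k = torch.randn(B, T, head_size)
q = torch.randn(B, T, head_size)
wei = q @ k.transpose(-2, -1) * head_size**-0.5   # without the scaling, var(wei) would be ~head_size
print(k.var().item(), q.var().item(), wei.var().item())   # all roughly 1

logits = torch.tensor([0.1, -0.2, 0.3, -0.2, 0.5])
print(F.softmax(logits, dim=-1))        # fairly diffuse
print(F.softmax(logits * 8, dim=-1))    # sharpens toward the max, approaching one-hot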

inserting a single self-attention block to our network

1:19:11

okay, so having said all that, let's now take our self-attention knowledge and let's take it for a spin.

1:19:18

so here in the code I created this head module and implements a single head of self-attention

1:19:24

so you give it a head size and then here it creates the key query and the value linear layers typically people don't use

1:19:30

biases in these so those are the linear projections that we're going to apply to all of our nodes

1:19:36

now here I'm creating this tril variable. tril is not a parameter of the module, so in PyTorch naming

1:19:42

conventions this is called a buffer. It's not a parameter, and you have to assign it to the module

1:19:48

using register_buffer, so that creates the tril, the lower triangular matrix,

1:19:54

and when we're given the input x, this should look very familiar: we calculate the keys and the queries, we

1:19:59

calculate the attention scores inside wei, we normalize it, so we're using scaled attention here,

1:20:06

then we make sure that the future doesn't communicate with the past, so this makes it a decoder block,

1:20:11

and then softmax, and then aggregate the values and output.
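A sketch of the Head module consistent with that description (n_embd and block_size are assumed globals, as before; note that the scaling uses head_size rather than C):

import torch
import torch.nn as nn
import torch.nn.functional as F

n_embd, block_size = 32, 8   # assumed globals

class Head(nn.Module):
    # one head of self-attention
    def __init__(self, head_size):
        super().__init__()
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # tril is not a parameter, so it is registered as a buffer
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        k = self.key(x)                                     # (B, T, head_size)
        q = self.query(x)                                   # (B, T, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1]**-0.5   # scaled attention scores (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # decoder block: mask the future
        wei = F.softmax(wei, dim=-1)
        v = self.value(x)                                   # (B, T, head_size)
        return wei @ v                                      # (B, T, head_size)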

1:20:17

Then here in the language model I'm creating a head in the constructor, and I'm calling it the self-attention head, and the head size I'm going to keep the

1:20:24

same as n_embd just for now. And then here, once we've encoded the

1:20:31

information with the token embeddings and the position embeddings we're simply going to feed it into the self-attentioned head and then the

1:20:37

output of that is going to go into uh the decoder language modeling head and

1:20:43

create the logits so this is the sort of the simplest way to plug in a self-attention component into our

1:20:49

Network right now I had to make one more change which is that here

1:20:55

in the generate we have to make sure that our idx that we feed into the model

1:21:00

because now we're using positional embeddings we can never have more than block size coming in because if idx is

1:21:07

more than block size then our position embedding table is going to run out of scope because it only has embeddings for

1:21:12

up to block size and so therefore I added some code here to crop the context that we're going to

1:21:18

feed into the model, so that we never pass in more than block_size elements.
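A tiny illustration of that cropping (the block_size value is illustrative):

import torch

block_size = 8
idx = torch.zeros((1, 20), dtype=torch.long)   # pretend 20 tokens have been generated so far
idx_cond = idx[:, -block_size:]                # keep only the last block_size tokens as context
print(idx_cond.shape)                          # torch.Size([1, 8])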

1:21:23

So those are the changes, and let's now train the network. Okay, so I also came up

1:21:29

to the script here and I decreased the learning rate because the self-attention can't tolerate very very high learning

1:21:34

rates and then I also increase the number of iterations because the learning rate is lower and then I

1:21:39

trained it and previously we were only able to get to up to 2.5 and now we are down to 2.4 so we definitely see a

1:21:46

little bit of an improvement from 2.5 to 2.4 roughly but the text is still not amazing so clearly the self-attention

1:21:53

head is doing some useful communication but um we still have a long way to go okay

multi-headed self-attention

1:21:59

so now we've implemented scaled dot-product attention. Now next up in the Attention Is All You Need paper there's something called multi-head

1:22:06

attention and what is multi-head attention it's just applying multiple attentions in parallel and concatenating

1:22:12

the results so they have a little bit of diagram here I don't know if this is super clear it's really just multiple attentions in

1:22:20

parallel so let's Implement that fairly straightforward if we want a multi-head attention then

1:22:27

we want multiple heads of self-attention running in parallel so in pytorch we can do this by simply

1:22:33

creating multiple heads so however heads how many however many heads you want and then what is the head

1:22:39

size of each and then we run all of them in parallel into a list and simply concatenate all

1:22:47

of the outputs and we're concatenating over the channel dimension so the way this looks now is we don't

1:22:53

have just a single attention head that has a head size of 32, because

1:22:58

remember n_embd is 32. Instead of having one communication channel, we now have four communication

1:23:06

channels in parallel and each one of these communication channels typically will be smaller correspondingly so

1:23:14

because we have four communication channels we want eight dimensional self-attention and so from each

1:23:20

Communication channel we're going to gather eight dimensional vectors and then we have four of them and that

1:23:25

concatenates to give us 32 which is the original and embed and so this is kind of similar to um if

1:23:31

you're familiar with convolutions this is kind of like a group convolution because basically instead of having one

1:23:36

large convolution we do convolutional groups and uh that's multi-headed self-attention
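A sketch of multi-headed self-attention, building on the Head module sketched earlier (a projection layer is added to it later, in the residual-connections section):

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    # multiple heads of self-attention running in parallel
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])

    def forward(self, x):
        # run all heads in parallel and concatenate over the channel dimension
        return torch.cat([h(x) for h in self.heads], dim=-1)

# e.g. with n_embd = 32: four communication channels of 8 dimensions each, concatenated back to 32
# sa_heads = MultiHeadAttention(4, n_embd // 4)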

1:23:42

and so then here we just use the self-attention heads (sa_heads) instead.

1:23:47

now I actually ran it and uh scrolling down I ran the same thing and then we now get

1:23:53

this down to 2.28 roughly and the output is still the generation is still not

1:23:59

amazing but clearly the validation loss is improving because we were at 2.4 just now

1:24:04

and so it helps to have multiple communication channels because obviously these tokens have a lot to talk about

1:24:10

and they want to find the consonants the vowels they want to find the vowels just from certain positions they want to find

1:24:16

any kinds of different things and so it helps to create multiple independent channels of communication gather lots of

1:24:22

different types of data and then decode the output now going back to the paper for a second of course I didn't explain

feedforward layers of transformer block

1:24:28

this figure in full detail but we are starting to see some components of what we've already implemented we have the

1:24:33

positional encodings the token encodings that add we have the masked multi-headed attention implemented now here's another

1:24:41

multi-headed tension which is a cross attention to an encoder which we haven't we're not going to implement in this

1:24:46

case I'm going to come back to that later but I want you to notice that there's a feed forward part here and then this is

1:24:52

grouped into a block that gets repeated again and again now the feed forward part here is just a simple multi-layer perceptron

1:25:00

um so the multi-headed so here position wise feed forward networks is just a

1:25:06

simple little MLP so I want to start basically in a similar fashion also adding computation

1:25:11

into the network and this computation is on the per node level so

1:25:17

I've already implemented it and you can see the diff highlighted on the left here when I've added or changed things

1:25:22

now before we had the multi-headed self-attention that did the communication but we went way too fast

1:25:28

to calculate the logits so the tokens looked at each other but didn't really have a lot of time to think on what they

1:25:35

found from the other tokens and so what I've implemented here is a little feed forward single layer and

1:25:42

this little layer is just a linear followed by a ReLU nonlinearity, and that's it,

1:25:47

so it's just a little layer and then I call it feed forward

1:25:52

and embed and then this feed forward is just called sequentially right after the self-attention so we self-attend then we

1:26:00

feed forward and you'll notice that the feet forward here when it's applying linear this is on a per token level all

1:26:06

the tokens do this independently so the self-attention is the communication and then once they've gathered all the data

1:26:13

now they need to think on that data individually, and so that's what the feed forward is doing, and that's why I've added it here.
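A sketch of that feed-forward layer at this stage (it is grown and given a projection later):

import torch.nn as nn

class FeedForward(nn.Module):
    # a simple per-token computation: linear layer followed by a ReLU
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, n_embd),
            nn.ReLU(),
        )

    def forward(self, x):
        return self.net(x)   # applied to every token independently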

1:26:20

Now when I train this, the validation loss actually continues to go down, now to 2.24, which is down from 2.28. The outputs

1:26:28

still look kind of terrible but at least we've improved the situation and so as a preview

1:26:34

we're going to now start to intersperse the communication with the computation

1:26:39

and that's also what the Transformer does when it has blocks that communicate and then compute and it groups them and

1:26:46

replicates them okay so let me show you what we like to do we'd like to do something like this

residual connections

1:26:52

we have a block and this block is basically this part here except for the cross attention

1:26:58

now the block basically intersperses communication and then computation the computation the communication is done

1:27:04

using multi-headed self-attention and then the computation is done using the feed forward Network on all the tokens

1:27:10

independently now what I've added here also is you'll notice

1:27:17

this takes the number of embeddings in the embedding Dimension and number of heads that we would like which is kind of like group size in group convolution

1:27:23

and I'm saying that the number of heads we'd like is four, and so because this is 32 we

1:27:29

calculate that: because this is 32 and the number of heads is four, the head size should be

1:27:35

eight so that everything sort of works out Channel wise um so this is how the Transformer structures uh sort of the uh the sizes

1:27:42

typically so the head size will become eight and then this is how we want to intersperse them and then here I'm trying to create

1:27:49

blocks which is just a sequential application of block block so that we're interspersing communication feed forward

1:27:55

many many times and then finally we decode now actually try to run this and the

1:28:02

problem is this doesn't actually give a very good uh answer a very good result and the reason for that is we're

1:28:08

starting to actually get like a pretty deep neural net and deep neural Nets uh suffer from optimization issues and I

1:28:14

think that's where we're kind of like slightly starting to run into so we need one more idea that we can borrow from

1:28:19

the um Transformer paper to resolve those difficulties now there are two optimizations that dramatically help

1:28:25

with the depth of these networks and make sure that the networks remain optimizable let's talk about the first

1:28:31

one the first one in this diagram is you see this Arrow here and then this arrow and this Arrow those

1:28:37

are skip connections or sometimes called residual connections they come from this paper uh the

1:28:43

paper Deep Residual Learning for Image Recognition, from about 2015, that introduced the

1:28:48

concept now these are basically what it means is you transform the data but then you have

1:28:55

a skip connection with addition from the previous features now the way I

1:29:00

like to visualize it that I prefer is the following here the computation happens from the top to bottom and

1:29:08

basically you have this uh residual pathway and you are free to Fork off from the residual pathway perform some

1:29:14

computation and then project back to the residual pathway via addition and so you go from the the inputs to the

1:29:22

targets only the plus and plus and plus and the reason this is useful is because during that propagation remember from

1:29:28

our micrograd video earlier, addition distributes gradients equally to both of its branches that fed as the input,

1:29:36

and so the supervision or the gradients from the loss basically hop

1:29:42

through every addition node all the way to the input and then also Fork off into

1:29:48

the residual blocks but basically you have this gradient Super Highway that goes directly from

1:29:54

the supervision all the way to the input, unimpeded. And then these residual blocks are usually initialized in the beginning

1:30:00

so they contribute very very little if anything to the residual pathway they they are initialized that way so in the

1:30:07

beginning they are sort of almost kind of like not there but then during the optimization they come online over time

1:30:13

and they start to contribute but at least at the initialization you can go

1:30:18

directly from supervision to the input: the gradient is unimpeded and just flows, and then the blocks over time kick in, and so

1:30:26

that dramatically helps with the optimization so let's implement this so coming back to our block here basically

1:30:31

what we want to do is: x equals x plus self-attention, and x equals x plus

1:30:38

feed forward. So this is x, and then we fork off and do

1:30:44

some communication and come back and we Fork off and we do some computation and come back so those are residual connections and

1:30:51

then swinging back up here we also have to introduce this projection so nn.linear

1:30:58

and this is going to be applied after we concatenate; this is the

1:30:58

size n_embd, so this is the output of the self-attention itself, but then we actually want to

1:31:10

apply the projection and that's the result so the projection is just a linear

1:31:15

transformation of the outcome of this layer so that's the projection back into the residual pathway

1:31:21

and then here in the feed forward it's going to be the same thing; I could have a self.projection here as well, but let

1:31:28

me just simplify it and let me couple it inside the same sequential

1:31:33

container and so this is the projection layer going back into the residual pathway

1:31:39

and so that's uh well that's it so now we can train this so I implemented one more

1:31:44

small change when you look into the paper again you see that the dimensionality of input and output is

1:31:51

512 for them and they're saying that the inner layer here in the feed forward has dimensionality of 2048. so there's a

1:31:57

multiplier of four and so the inner layer of the feed forward Network should be multiplied by four in terms of

1:32:04

Channel sizes so I came here and I multiplied to four times embed here for the feed forward and then from four

1:32:10

times n embed coming back down to an embed when we go back to the project to the projection so adding a bit of

1:32:16

computation here and growing that layer that is in the residual block on the side of the residual pathway
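Putting the changes of this section together, a sketch of the updated modules: residual connections in the block, a projection back into the residual pathway after the multi-head concatenation, and the 4x inner expansion in the feed-forward (assuming num_heads * head_size equals n_embd):

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, num_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        self.proj = nn.Linear(n_embd, n_embd)   # projection back into the residual pathway

    def forward(self, x):
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        return self.proj(out)

class FeedForward(nn.Module):
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # inner layer grown by a factor of 4, as in the paper
            nn.ReLU(),
            nn.Linear(4 * n_embd, n_embd),   # projection back into the residual pathway
        )

    def forward(self, x):
        return self.net(x)

class Block(nn.Module):
    # communication followed by computation, each on a residual pathway
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)

    def forward(self, x):
        x = x + self.sa(x)     # fork off, communicate, project back, add to the residual pathway
        x = x + self.ffwd(x)   # fork off, compute per token, add back
        return x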

1:32:22

and then I trained this and we actually get down all the way to uh 2.08 validation loss and we also see that the

1:32:29

network is starting to get big enough that our train loss is getting ahead of validation loss so we're starting to see like a little bit of overfitting

1:32:36

and um our our um Generations here are still not amazing

1:32:41

but at least you see that we can see like is here this now grieve sank like this starts to almost look like

1:32:48

English so um yeah we're starting to really get there okay and the second Innovation that is very helpful for optimizing very

layernorm (and its relationship to our previous batchnorm)

1:32:54

deep neural networks is right here so we have this addition now that's the residual part but this Norm is referring

1:33:00

to something called layer Norm so layer Norm is implemented in pi torch it's a paper that came out a while back

1:33:06

here um and layer Norm is very very similar to

1:33:11

batchnorm. So remember back to our makemore series part three, we implemented

1:33:16

batch normalization, and batch normalization basically just made sure that across the batch

1:33:23

Dimension any individual neuron had unit gaussian

1:33:28

distribution so it was zero mean and unit standard deviation one standard deviation output

1:33:35

so what I did here is I'm copy pasting the BatchNorm1d that we developed in our makemore series,

1:33:40

and see here we can initialize for example this module and we can have a batch of 32 100 dimensional vectors

1:33:48

feeding through the batchnorm layer, so what this does is it guarantees

1:33:53

that when we look at just the zeroth column it's a zero mean one standard deviation

1:33:59

so it's normalizing every single column of this input now the rows are not going to be

1:34:06

normalized by default because we're just normalizing columns so let's now implement the layer Norm uh it's very

1:34:12

complicated look we come here we change this from 0 to 1. so we don't normalize

1:34:18

The Columns we normalize the rows and now we've implemented layer Norm

1:34:24

so now the columns are not going to be normalized but the rows are going to be normalized

1:34:32

for every individual example it's 100 dimensional Vector is normalized in this way and because our computation Now does

1:34:39

not span across examples we can delete all of this buffers stuff because we can

1:34:45

always apply this operation and don't need to maintain any running buffers so

1:34:51

we don't need the buffers we don't There's no distinction between

1:34:56

training and test time and we don't need these running buffers we do keep gamma and beta we don't need

1:35:04

the momentum we don't care if it's training or not and this is now a layer Norm

1:35:10

and it normalizes the rows instead of the columns and this here is identical to basically this here
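In code, the change amounts to something like this: a sketch of the makemore-style BatchNorm1d turned into a layer norm by normalizing dim 1 and dropping the running buffers:

import torch

class LayerNorm1d:
    def __init__(self, dim, eps=1e-5):
        self.eps = eps
        self.gamma = torch.ones(dim)   # trainable scale
        self.beta = torch.zeros(dim)   # trainable shift

    def __call__(self, x):
        xmean = x.mean(1, keepdim=True)   # mean over the features of each example (rows, not columns)
        xvar = x.var(1, keepdim=True)     # variance over the features of each example
        xhat = (x - xmean) / torch.sqrt(xvar + self.eps)
        return self.gamma * xhat + self.beta

    def parameters(self):
        return [self.gamma, self.beta]

module = LayerNorm1d(100)
x = module(torch.randn(32, 100))                     # a batch of 32 one-hundred-dimensional vectors
print(x[0, :].mean().item(), x[0, :].std().item())   # each row is now ~zero mean, ~unit std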

1:35:19

so let's now Implement layer Norm in our Transformer before I incorporate the layer Norm I just wanted to note that as

1:35:25

I said very few details about the Transformer have changed in the last five years but this is actually something that slightly departs from the

1:35:31

original paper you see that the ADD and Norm is applied after the transformation

1:35:37

but um in now it is a bit more basically common to apply the layer Norm before

1:35:43

the transformation so there's a reshuffling of the layer Norms uh so this is called the pre-norm formulation

1:35:48

and that's the one that we're going to implement as well so slight deviation from the original paper basically we need two layer Norms layer

1:35:55

norm one is an nn.LayerNorm, and we tell it

1:36:00

what the embedding dimension is, and we need the second layer norm. And then here the layer norms are

1:36:07

applied immediately on x, so self.ln1 is applied on x

1:36:12

and self.ln2 is applied on x, before it goes into self-attention and feed forward.
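A sketch of the Block with the pre-norm formulation described above:

import torch.nn as nn

class Block(nn.Module):
    def __init__(self, n_embd, n_head):
        super().__init__()
        head_size = n_embd // n_head
        self.sa = MultiHeadAttention(n_head, head_size)
        self.ffwd = FeedForward(n_embd)
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))     # pre-norm: layer norm before self-attention
        x = x + self.ffwd(self.ln2(x))   # pre-norm: layer norm before feed-forward
        return x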

1:36:18

and the size of the layer Norm here is an embeds of 32. so when the layer Norm

1:36:24

is normalizing our features it is the normalization here

1:36:30

happens the mean and the variance are taking over 32 numbers so the batch and the time act as batch Dimensions both of

1:36:37

them so this is kind of like a per token transformation that just normalizes the

1:36:43

features and makes them unit gaussian, zero mean and unit variance, at initialization.

1:36:49

but of course because these layer Norms inside it have these gamma and beta trainable parameters

1:36:54

the layer normal eventually create outputs that might not be unit gaussian

1:37:00

but the optimization will determine that so for now this is the uh this is

1:37:05

incorporating the layer norms and let's train them up okay so I let it run and we see that we get down to 2.06 which is

1:37:12

better than the previous 2.08 so a slight Improvement by adding the layer norms and I'd expect that they help even

1:37:18

more if we had bigger and deeper Network one more thing I forgot to add is that there should be a layer Norm here also

1:37:25

typically as at the end of the Transformer and right before the final linear layer that decodes into

1:37:31

vocabulary so I added that as well so at this stage we actually have a pretty complete Transformer according to the

1:37:38

original paper and it's a decoder only Transformer I'll I'll talk about that in a second but at this stage the major

1:37:45

pieces are in place so we can try to scale this up and see how well we can push this number now in order to scale out the model I

scaling up the model! creating a few variables. adding dropout

1:37:51

had to perform some cosmetic changes here to make it nicer so I introduced this variable called end layer which

1:37:57

just specifies how many layers of the blocks we're going to have I create a bunch of blocks and we have a new

1:38:03

variable number of heads as well I pulled out the layer Norm here and so this is identical now one thing that I

1:38:10

did briefly change is I added a dropout so Dropout is something that you can add

1:38:15

right before the residual connection back or right before the connection back into the original pathway

1:38:21

so we can drop out that as the last layer here we can drop out uh here at the end of

1:38:27

the multi-headed extension as well and we can also drop out here when we

1:38:32

calculate the um basically affinities and after the soft Max we can drop out

1:38:37

some of those so we can randomly prevent some of the nodes from communicating and so Dropout comes from this paper

1:38:45

from 2014 or so and basically it takes your neural net

1:38:51

and it randomly every forward backward pass shuts off some subset of neurons

1:38:57

so randomly drops them to zero and trains without them and what this does

1:39:03

effectively is because the mask of what's being dropped out has changed every single forward backward pass it

1:39:08

ends up kind of training an ensemble of sub Networks and then at this time

1:39:14

everything is fully enabled and kind of all of those sub networks are merged into a single Ensemble if you can if you

1:39:19

want to think about it that way so I would read the paper to get the full detail for now we're just going to stay

1:39:24

on the level of this is a regularization technique and I added it because I'm about to scale up the model quite a bit

1:39:30

and I was concerned about overfitting so now when we scroll up to the top uh

1:39:36

we'll see that I changed a number of hyper parameters here about our neural net so I made the batch size B much

1:39:41

larger now with 64. I changed the block size to be 256 so previously it was just eight eight

1:39:47

characters of context now it is 256 characters of context to predict the 257th

1:39:53

uh I brought down the learning rate a little bit because the neural net is now much bigger so I brought down the

1:39:58

learning rate the embedding Dimension is now 384 and there are six heads so 384 divide 6

1:40:06

means that every head is 64 dimensional as it as a standard and then there are going to be six

1:40:12

layers of that and the Dropout will be of 0.2 so every forward backward passed 20 percent of

1:40:17

all of these um intermediate calculations are disabled and dropped to zero
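For reference, the scaled-up hyperparameters described above, as a sketch (the new learning rate value is not quoted here, so it is left as a comment):

batch_size = 64     # sequences processed in parallel
block_size = 256    # context of 256 characters to predict the 257th
n_embd = 384
n_head = 6          # 384 / 6 = 64 dimensions per head
n_layer = 6         # six Blocks stacked
dropout = 0.2       # 20% of intermediate activations dropped each forward/backward pass
# the learning rate is also brought down somewhat, since the network is now much bigger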

1:40:23

and then I already trained this and I ran it so uh drum roll how well does it perform

1:40:29

so let me just scroll up here we get a validation loss of 1.48 which

1:40:36

is actually quite a bit of an improvement on what we had before which I think was 2.07 so we went from 2.07

1:40:41

all the way down to 1.48 just by scaling up this neural net with the code that we have, and this of course ran for a lot

1:40:47

longer this may be trained for I want to say about 15 minutes on my a100 GPU so

1:40:53

that's a pretty good GPU and if you don't have a GPU you're not going to be able to reproduce this on a CPU this

1:40:58

would be um I would not run this on the CPU or a Macbook or something like that you'll have to break down the number of layers

1:41:05

and the embedding Dimension and so on but in about 15 minutes we can get this kind of a result and

1:41:12

um I'm printing some of the Shakespeare here but what I did also is I printed 10 000 characters

1:41:17

so a lot more and I wrote them to a file and so here we see some of the outputs

1:41:24

so it's a lot more recognizable as the input text file so the input text file

1:41:29

just for reference looked like this so there's always like someone speaking in this matter and uh

1:41:36

our predictions now take on that form except of course they're they're nonsensical when you actually read them

1:41:42

so it is every crimpy bee house oh those

1:41:47

preparation we give heed um you know

1:41:56

Oho sent me you mighty Lord anyway so you can read through this

1:42:02

um it's nonsensical of course but this is just a Transformer trained on the Character level for 1 million characters

1:42:08

that come from Shakespeare, so it sort of blabbers on in a Shakespeare-like manner, but it doesn't of course

1:42:14

make sense at this scale uh but I think I think still a pretty good demonstration of what's possible

1:42:21

so now I think uh that kind of like concludes the programming section of this video we

1:42:28

basically kind of did a pretty good job in um of implementing this Transformer but the picture doesn't exactly match up

1:42:36

to what we've done so what's going on with all these additional Parts here so let me finish explaining this

encoder vs. decoder vs. both (?) Transformers

1:42:41

architecture and why it looks so funky basically what's happening here is what we implemented here is a decoder only

1:42:48

Transformer so there's no component here this part is called the encoder and there's no cross attention block here

1:42:55

our block only has a self-attention and the feed forward so it is missing this

1:43:00

third in between piece here this piece does cross attention so we don't have it

1:43:05

and we don't have the encoder we just have the decoder and the reason we have a decoder only

1:43:10

is because we are just generating text and it's unconditioned on anything we're just we're just blabbering on according

1:43:16

to a given data set what makes it a decoder is that we are using the Triangular mask in our

1:43:22

Transformer so it has this Auto regressive property where we can just go and sample from it

1:43:28

so the fact that it's using the Triangular triangular mask to mask out the attention makes it a decoder and it

1:43:34

can be used for language modeling now the reason that the original paper had an encoder decoder architecture is

1:43:41

because it is a machine translation paper so it is concerned with a different setting in particular

1:43:46

it expects some tokens that encode say for example French

1:43:51

and then it is expected to decode the translation in English so so you typically these here are

1:43:58

special tokens so you are expected to read in this and condition on it and

1:44:03

then you start off the generation with a special token called start so this is a special new token that you introduce and

1:44:10

always place in the beginning and then the network is expected to put neural networks are awesome and then a

1:44:17

special end token to finish a generation so this part here will be decoded

1:44:23

exactly as we we've done it neural networks are awesome will be identical to what we did

1:44:28

but unlike what we did they want to condition the generation on some

1:44:34

additional information and in that case this additional information is the French sentence that they should be translating

1:44:40

so what they do now is they bring in the encoder now the encoder reads this part here so we're

1:44:48

only going to take the part of French and we're going to create tokens from it exactly as we've seen in our video and

1:44:54

we're going to put a Transformer on it but there's going to be no triangular mask and so all the tokens are allowed

1:45:00

to talk to each other as much as they want and they're just encoding whatever's the content of this French

1:45:06

sentence once they've encoded it they've they basically come out in the

1:45:12

top here and then what happens here is in our decoder which does the language modeling

1:45:18

there's an additional connection here to the outputs of the encoder

1:45:23

and that is brought in through a cross attention so the queries are still generated from

1:45:28

X but now the keys and the values are coming from the side the keys and the values are coming from the top

1:45:35

generated by the nodes that came outside of the encoder and those tops the keys and the values

1:45:41

there the top of it feeding on the side into every single block of the decoder and so that's why

1:45:48

there's an additional cross-attention, and really what it's doing is conditioning the decoding not just on

1:45:54

the past of this current decoding, but also on having seen the fully

1:46:01

encoded French prompt. And so it's an encoder-decoder model,

1:46:06

which is why we have those two Transformers, an additional block, and so on. So we did not do this, because we have

1:46:12

nothing to encode and there's no conditioning; we just have a text file and we just want to imitate it, and that's why we are using a decoder-only

1:46:19

Transformer, exactly as done in GPT.
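The video's code has no encoder and no cross-attention, but as a rough illustration of the difference, here is a minimal sketch of a single cross-attention head, adapted from the self-attention head we built. The class and the name enc_out are illustrative, not taken from nanoGPT:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionHead(nn.Module):
    """One head of cross-attention: queries from the decoder, keys/values from the encoder."""
    def __init__(self, n_embd, head_size):
        super().__init__()
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.key   = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)

    def forward(self, x, enc_out):
        # x: (B, T_dec, n_embd) decoder activations; enc_out: (B, T_enc, n_embd) encoder outputs
        q = self.query(x)         # (B, T_dec, head_size)
        k = self.key(enc_out)     # (B, T_enc, head_size)
        v = self.value(enc_out)   # (B, T_enc, head_size)
        wei = q @ k.transpose(-2, -1) * k.shape[-1] ** -0.5   # (B, T_dec, T_enc)
        wei = F.softmax(wei, dim=-1)   # no triangular mask: every encoder token is visible
        return wei @ v                 # (B, T_dec, head_size)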

super quick walkthrough of nanoGPT, batched multi-headed self-attention

1:46:24

Okay, so now I wanted to do a very brief walkthrough of nanoGPT, which you can find on my GitHub. nanoGPT is basically two files of interest:

1:46:31

train.py and model.py. train.py is all the boilerplate code for training the network; it is basically all the

1:46:38

stuff that we had here. It's the training loop, it's just that it's a lot more complicated, because we're saving and

1:46:44

loading checkpoints and pre-trained weights and we are decaying the learning rate and compiling the model and using

1:46:50

distributed training across multiple nodes or GPUs. So train.py gets a little bit more hairy and complicated;

1:46:56

there are more options, etc. But model.py should look very

1:47:02

similar to what we've done here; in fact, the model is almost identical. So first, here we have the causal

1:47:09

self-attention block, and all of this should look very recognizable to you: we're producing queries, keys, values,

1:47:15

we're doing dot products, we're masking, applying softmax, optionally dropping out,

1:47:20

and here we are pooling the values what is different here is that in our

1:47:26

code I have separated out the multi-headed attention into just a single individual

1:47:32

head and then here I have multiple heads and I explicitly concatenate them

1:47:37

whereas here all of it is implemented in a batched manner inside a single causal self-attention and so we don't just have

1:47:44

a B and a T and a C dimension; we also end up with a fourth dimension, which is the heads, and so it just gets a lot more

1:47:52

hairy, because we have four-dimensional tensors now, but it is equivalent

1:47:57

mathematically. So the exact same thing is happening as what we have; it's just a bit more efficient, because all the heads are now treated as a batch dimension as well.
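As a rough illustration of that batched treatment (simplified; the real CausalSelfAttention in model.py also has an output projection, dropout, and other details), the reshaping looks something like this:

import torch
import torch.nn as nn
import torch.nn.functional as F

B, T, C, n_head = 4, 8, 32, 4            # illustrative sizes; head_size = C // n_head
x = torch.randn(B, T, C)
c_attn = nn.Linear(C, 3 * C)             # one projection produces q, k, v for all heads at once

q, k, v = c_attn(x).split(C, dim=2)      # each is (B, T, C)
# fold the heads into a fourth dimension and move it next to the batch dimension
q = q.view(B, T, n_head, C // n_head).transpose(1, 2)   # (B, n_head, T, head_size)
k = k.view(B, T, n_head, C // n_head).transpose(1, 2)
v = v.view(B, T, n_head, C // n_head).transpose(1, 2)

att = (q @ k.transpose(-2, -1)) * (1.0 / (C // n_head) ** 0.5)   # (B, n_head, T, T)
att = att.masked_fill(torch.tril(torch.ones(T, T)) == 0, float('-inf'))
att = F.softmax(att, dim=-1)
y = att @ v                                              # (B, n_head, T, head_size)
y = y.transpose(1, 2).contiguous().view(B, T, C)         # re-assemble (concatenate) the heads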

1:48:02

Then we have the multilayer perceptron. It's

1:48:08

using the GELU nonlinearity, which is defined here, instead of ReLU, and

1:48:13

this is done just because OpenAI used it and I want to be able to load their checkpoints.
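A minimal sketch of that feed-forward block with GELU (simplified; nanoGPT's MLP also has dropout and different layer names):

import torch.nn as nn

class MLP(nn.Module):
    """Feed-forward block of the Transformer: expand 4x, GELU, project back."""
    def __init__(self, n_embd):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),                      # GELU instead of ReLU, to match OpenAI's GPT-2 checkpoints
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        return self.net(x)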

1:48:19

The blocks of the Transformer are identical: the communicate and the compute phase, as we saw. And then the GPT will be identical: we

1:48:25

have the position encodings, token encodings, the blocks, the layer norm at the end, the final linear layer,

1:48:32

and this should look all very recognizable and there's a bit more here because I'm loading checkpoints and stuff like that

1:48:38

I'm separating out the parameters into those that should be weight-decayed and those that shouldn't.
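One simple way to do that kind of split, sketched here with a stand-in model (this is not nanoGPT's exact configure_optimizers code, and the hyperparameter values are illustrative):

import torch
import torch.nn as nn

# stand-in model just to make the sketch runnable; in nanoGPT this would be the GPT module
model = nn.Sequential(nn.Linear(64, 64), nn.LayerNorm(64), nn.Linear(64, 64))

# 2D+ tensors (weight matrices, embeddings) get weight decay; 1D tensors (biases, norm gains) do not
decay_params   = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
nodecay_params = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]

optimizer = torch.optim.AdamW(
    [{"params": decay_params, "weight_decay": 0.1},
     {"params": nodecay_params, "weight_decay": 0.0}],
    lr=3e-4, betas=(0.9, 0.95),
)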

1:48:44

But the generate function should also be very similar.
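For reference, the generation loop is essentially the autoregressive sampling we wrote earlier; a minimal sketch, assuming the model returns just the logits (the video's model returns both logits and loss):

import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, block_size):
    """Autoregressive sampling: repeatedly predict the next token and append it.
    idx is a (B, T) tensor of token indices; model(idx) is assumed to return (B, T, vocab_size) logits."""
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]           # crop to the last block_size tokens
        logits = model(idx_cond)                  # (B, T, vocab_size)
        logits = logits[:, -1, :]                 # keep only the last time step
        probs = F.softmax(logits, dim=-1)         # convert to probabilities
        idx_next = torch.multinomial(probs, num_samples=1)   # sample one token
        idx = torch.cat((idx, idx_next), dim=1)   # append to the running sequence
    return idx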

1:48:49

So a few details are different, but you should definitely be able to look at this file and understand a lot of the pieces. Now let's bring things back to

back to ChatGPT, GPT-3, pretraining vs. finetuning, RLHF

1:48:55

ChatGPT: what would it look like if we wanted to train ChatGPT ourselves, and how does it relate to what we learned today?

1:49:02

Well, to train a ChatGPT there are roughly two stages: first is the pre-training stage, and then the fine-

1:49:07

tuning stage in the pre-training stage we are training on a large chunk of

1:49:12

internet and just trying to get a first decoder-only Transformer to babble text,

1:49:18

so it's very very similar to what we've done ourselves except we've done like a tiny little

1:49:24

baby pre-training step and so in our case uh this is how you

1:49:30

print the number of parameters. I printed it, and it's about 10 million. So this Transformer that I created here to

1:49:36

create little Shakespeare was about 10 million parameters.
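Counting parameters is just a sum over p.numel(); a minimal helper (the model argument stands in for the trained language model):

def count_parameters(model) -> int:
    """Return the total number of trainable parameters (about 10 million for the video's model)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)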

1:49:43

Our dataset is roughly 1 million characters, so roughly 1 million tokens, but you have to remember that OpenAI uses a different vocabulary:

1:49:49

they're not on the character level, they use these subword chunks of words, and

1:49:54

so they have a vocabulary of roughly 50,000 elements, and so their sequences are a bit more condensed.

1:50:01

So our dataset, the Shakespeare dataset, would probably be around 300,000 tokens

1:50:06

in the OpenAI vocabulary, roughly.
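One way to sanity-check that estimate, using OpenAI's tiktoken library (this is not part of the video's code; input.txt is the tiny Shakespeare file):

import tiktoken

with open("input.txt", "r", encoding="utf-8") as f:   # tiny Shakespeare, ~1M characters
    text = f.read()

enc = tiktoken.get_encoding("gpt2")      # GPT-2/GPT-3-style BPE with a ~50k-token vocabulary
tokens = enc.encode(text)
print(len(text), len(tokens))            # roughly 1M characters -> roughly 300k BPE tokens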

1:50:11

So we trained a roughly 10-million-parameter model on roughly 300,000 tokens. Now, when you go to the GPT-3 paper

1:50:17

and you look at the Transformers that they trained they trained a number of Transformers of

1:50:23

different sizes but the biggest Transformer here has 175 billion parameters so ours is again 10 million

1:50:30

they used this number of layers in the Transformer, this is the n_embd, this is the number of heads, and this is

1:50:37

the head size and then this is the batch size so ours was 65.

1:50:44

and the learning rate is similar now when they train this Transformer they trained on 300 billion tokens

1:50:50

so again remember ours is about 300 000 so this is uh about a million fold

1:50:56

increase. And this number would not even be that large by today's standards; you'd be going up to one trillion and

1:51:01

above so they are training a significantly larger model

1:51:07

on a good chunk of the internet and that is the pre-training stage but otherwise

1:51:12

these hyperparameters should be fairly recognizable to you, and the architecture is actually nearly identical to

1:51:17

what we implemented ourselves but of course it's a massive infrastructure challenge to train this you're talking

1:51:23

about typically thousands of GPUs having to talk to each other to train models of this size. So that's just the

1:51:30

pre-training stage now after you complete the pre-training stage you don't get something that responds to

1:51:36

your questions with answers and is helpful, etc. You get a document completer, right? So it babbles, but it

1:51:44

doesn't babble Shakespeare, it babbles internet: it will create arbitrary news articles and documents, and it will try

1:51:50

to complete documents because that's what it's trained for it's trying to complete the sequence so when you give it a question it would just uh

1:51:56

potentially just give you more questions, it would follow with more questions. It will do whatever it looks like some

1:52:02

close document would do in the training data on the internet. And so, who knows, you're getting kind of undefined

1:52:08

behavior. It might basically answer your questions with other questions, it might ignore your question, it might just

1:52:15

try to complete some news article; it's totally undefined, as we say. So the second, fine-tuning stage is to

1:52:21

actually align it to be an assistant, and this is the second stage. And so this ChatGPT blog post from

1:52:29

OpenAI talks a little bit about how this stage is achieved. Basically,

1:52:35

there are roughly three steps to this stage. So what they do here is they start to collect training data that

1:52:41

looks specifically like what an assistant would do so if you have documents that have the format where the question is on top and then an answer is

1:52:48

below and they have a large number of these but probably not on the order of the internet this is probably on the

1:52:53

order of maybe thousands of examples, and so they then fine-tune the

1:52:59

model to basically only focus on documents that look like that and so

1:53:04

you're starting to slowly align it so it's going to expect a question at the top and it's going to expect to complete the answer

1:53:10

and uh these very very large models are very sample efficient during their fine tuning so this actually somehow works

1:53:17

but that's just step one that's just fine-tuning so then they actually have more steps where okay the second step is

1:53:23

you let the model respond, and then different raters look at the different responses and rank them for their

1:53:29

preference as to which one is better than the other they use that to train a reward model so they can predict

1:53:34

basically, using a different network, how desirable any candidate response would be.
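The exact training details are internal to OpenAI, but reward models of this kind are commonly trained with a pairwise ranking loss (as described, for example, in OpenAI's InstructGPT work); a minimal sketch, where the rewards are assumed to come from a hypothetical reward network:

import torch
import torch.nn.functional as F

def reward_ranking_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: push the reward of the human-preferred response above the other.
    r_chosen / r_rejected are scalar rewards produced by a (hypothetical) reward model."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()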

1:53:41

Then, once they have a reward model, they run PPO, which is a form of

1:53:47

policy gradient reinforcement learning optimizer, to fine-tune this sampling policy so

1:53:54

that the answers the GPT now generates are expected to score a high

1:54:00

reward according to the reward model. And so basically there's a whole aligning stage here, or fine-tuning stage;

1:54:07

it's got multiple steps in between there as well and it takes the model from being a document completer to a question

1:54:14

answerer. And that's like a whole separate stage; a lot of this data is not available publicly, it is internal to

1:54:21

OpenAI, and it's much harder to replicate this stage. And so that's roughly what would give

1:54:28

you a ChatGPT, and nanoGPT focuses on the pre-training stage. Okay, and that's everything that I wanted to cover today.

conclusions

1:54:34

So, to summarize, we trained a decoder-only Transformer following this famous

1:54:41

paper, "Attention Is All You Need" from 2017. And so that's basically a GPT. We trained

1:54:47

it on tiny Shakespeare and got sensible results. All of the training code is roughly

1:54:55

200 lines of code. I will be releasing this code base; it also comes with

1:55:02

all the git log commits along the way as we built it up in addition to this code I'm going to

1:55:08

release the notebook, of course, the Google Colab. And I hope that gave you a sense for how

1:55:14

you can train these models, like say GPT-3; they will be architecturally basically identical to

1:55:21

what we have but they are somewhere between ten thousand and one million times bigger depending on how you count

1:55:26

and so that's all I have for now we did not talk about any of the fine tuning

1:55:32

stages that would typically go on top of this so if you're interested in something that's not just language modeling but you actually want to you

1:55:38

know say perform tasks or you want them to be aligned in a specific way or you

1:55:43

want to detect sentiment or anything like that basically anytime you don't want something that's just a document

1:55:48

completer, you have to complete further stages of fine-tuning, which we did not cover. And that could be simple supervised

1:55:55

fine-tuning, or it can be something more fancy, like we see in ChatGPT: we actually train a reward model and then

1:56:01

do rounds of PPO to align it with respect to the reward model so there's a lot more that can be done

1:56:06

on top of it. I think for now we're starting to get to about the two-hour mark, so I'm going to

1:56:11

kind of finish here. I hope you enjoyed the lecture, and, yeah, go forth and transform. See you!




Our new website is about 5-star AI and IoT tools on the net.

We provide you with the best Artificial Intelligence tools and services that can be used to create and improve business websites and channels.

This site includes tools for creating interactive visuals, animations, and videos,

as well as tools for SEO, marketing, and web development.

It also includes tools for creating and editing text, images, and audio. The website is intended to provide users with a comprehensive list of AI-based tools to help them create and improve their business.

https://studio.d-id.com/share?id=078f9242d5185a9494e00852e89e17f7&utm_source=copy





Hello and welcome to our new site, which shares with you the most powerful web platforms and tools available on the web today.

All platforms, websites and tools feature artificial intelligence (AI) and have a 5-star rating.

All platforms, websites and tools are available in free and paid Pro tiers.

The platforms, websites and tools here are the best for growing your business in 2022/3.


A Guide for AI-Enhancing Your Existing Business Application



What is Artificial Intelligence and how does it work? What are the 3 types of AI?

The 3 types of AI are:

General AI: AI that can perform all of the intellectual tasks a human can. Currently, no form of AI can think abstractly or develop creative ideas in the same ways as humans.

Narrow AI: Narrow AI commonly includes visual recognition and natural language processing (NLP) technologies. It is a powerful tool for completing routine jobs based on common knowledge, such as playing music on demand via a voice-enabled device.

Broad AI: Broad AI typically relies on exclusive data sets associated with the business in question. It is generally considered the most useful AI category for a business. Business leaders will integrate a broad AI solution with a specific business process where enterprise-specific knowledge is required.

How can artificial intelligence be used in business?

AI is providing new ways for humans to engage with machines, transitioning personnel from pure digital experiences to human-like natural interactions. This is called cognitive engagement. AI is augmenting and improving how humans absorb and process information, often in real time. This is called cognitive insights and knowledge management. Beyond process automation, AI is facilitating knowledge-intensive business decisions, mimicking complex human intelligence. This is called cognitive automation.

What are the different artificial intelligence technologies in business?

Machine learning, deep learning, robotics, computer vision, cognitive computing, artificial general intelligence, natural language processing, and knowledge reasoning are some of the most common business applications of AI.

What is the difference between artificial intelligence, machine learning and deep learning?

Artificial intelligence (AI) applies advanced analysis and logic-based techniques, including machine learning, to interpret events, support and automate decisions, and take actions. Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. Deep learning is a subset of machine learning in artificial intelligence (AI) that has networks capable of learning unsupervised from data that is unstructured or unlabeled.

What are the current and future capabilities of artificial intelligence?

Current capabilities of AI include examples such as personal assistants (Siri, Alexa, Google Home), smart cars (Tesla), behavioral adaptation to improve the emotional intelligence of customer support representatives, using machine learning and predictive algorithms to improve the customer's experience, transactional AI like that of Amazon, personalized content recommendations (Netflix), voice control, and learning thermostats. Future capabilities of AI will probably include fully autonomous cars, precision farming, future air traffic controllers, future classrooms with ambient informatics, urban systems, smart cities and so on.

To know more about the scope of artificial intelligence in your business, please connect with our expert.


Glossary of Terms


Application Programming Interface(API):

An API, or application programming interface, is a set of rules and protocols that allows different software programs to communicate and exchange information with each other. It acts as a kind of intermediary, enabling different programs to interact and work together, even if they are not built using the same programming languages or technologies. API's provide a way for different software programs to talk to each other and share data, helping to create a more interconnected and seamless user experience.

Artificial Intelligence(AI):

the intelligence displayed by machines in performing tasks that typically require human intelligence, such as learning, problem-solving, decision-making, and language understanding. AI is achieved by developing algorithms and systems that can process, analyze, and understand large amounts of data and make decisions based on that data.

Compute Unified Device Architecture(CUDA):

CUDA is a way that computers can work on really hard and big problems by breaking them down into smaller pieces and solving them all at the same time. It helps the computer work faster and better by using special parts inside it called GPUs. It's like when you have lots of friends help you do a puzzle - it goes much faster than if you try to do it all by yourself.

The term "CUDA" is a trademark of NVIDIA Corporation, which developed and popularized the technology.

Data Processing:

The process of preparing raw data for use in a machine learning model, including tasks such as cleaning, transforming, and normalizing the data.

Deep Learning(DL):

A subfield of machine learning that uses deep neural networks with many layers to learn complex patterns from data.

Feature Engineering:

The process of selecting and creating new features from the raw data that can be used to improve the performance of a machine learning model.

Freemium:

You might see the term "Freemium" used often on this site. It simply means that the specific tool that you're looking at has both free and paid options. Typically there is very minimal, but unlimited, usage of the tool at a free tier with more access and features introduced in paid tiers.

Generative Art:

Generative art is a form of art that is created using a computer program or algorithm to generate visual or audio output. It often involves the use of randomness or mathematical rules to create unique, unpredictable, and sometimes chaotic results.

Generative Pre-trained Transformer(GPT):

GPT stands for Generative Pretrained Transformer. It is a type of large language model developed by OpenAI.

GitHub:

GitHub is a platform for hosting and collaborating on software projects.

Google Colab:

Google Colab is an online platform that allows users to share and run Python scripts in the cloud.

Graphics Processing Unit(GPU):

A GPU, or graphics processing unit, is a special type of computer chip that is designed to handle the complex calculations needed to display images and video on a computer or other device. It's like the brain of your computer's graphics system, and it's really good at doing lots of math really fast. GPUs are used in many different types of devices, including computers, phones, and gaming consoles. They are especially useful for tasks that require a lot of processing power, like playing video games, rendering 3D graphics, or running machine learning algorithms.

Large Language Model(LLM):

A type of machine learning model that is trained on a very large amount of text data and is able to generate natural-sounding text.

Machine Learning(ML):

A method of teaching computers to learn from data, without being explicitly programmed.

Natural Language Processing(NLP):

A subfield of AI that focuses on teaching machines to understand, process, and generate human language.

Neural Networks:

A type of machine learning algorithm modeled on the structure and function of the brain.

Neural Radiance Fields(NeRF):

Neural Radiance Fields are a type of deep learning model that can be used for a variety of tasks, including image generation, object detection, and segmentation. NeRFs are inspired by the idea of using a neural network to model the radiance of an image, which is a measure of the amount of light that is emitted or reflected by an object.

OpenAI:

OpenAI is a research institute focused on developing and promoting artificial intelligence technologies that are safe, transparent, and beneficial to society

Overfitting:

A common problem in machine learning, in which the model performs well on the training data but poorly on new, unseen data. It occurs when the model is too complex and has learned too many details from the training data, so it doesn't generalize well.

Prompt:

A prompt is a piece of text that is used to prime a large language model and guide its generation.

Python:

Python is a popular, high-level programming language known for its simplicity, readability, and flexibility (many AI tools use it).

Reinforcement Learning:

A type of machine learning in which the model learns by trial and error, receiving rewards or punishments for its actions and adjusting its behavior accordingly.

Spatial Computing:

Spatial computing is the use of technology to add digital information and experiences to the physical world. This can include things like augmented reality, where digital information is added to what you see in the real world, or virtual reality, where you can fully immerse yourself in a digital environment. It has many different uses, such as in education, entertainment, and design, and can change how we interact with the world and with each other.

Stable Diffusion:

Stable Diffusion generates complex artistic images based on text prompts. It’s an open source image synthesis AI model available to everyone. Stable Diffusion can be installed locally using code found on GitHub or there are several online user interfaces that also leverage Stable Diffusion models.

Supervised Learning:

A type of machine learning in which the training data is labeled and the model is trained to make predictions based on the relationships between the input data and the corresponding labels.

Unsupervised Learning:

A type of machine learning in which the training data is not labeled, and the model is trained to find patterns and relationships in the data on its own.

Webhook:

A webhook is a way for one computer program to send a message or data to another program over the internet in real-time. It works by sending the message or data to a specific URL, which belongs to the other program. Webhooks are often used to automate processes and make it easier for different programs to communicate and work together. They are a useful tool for developers who want to build custom applications or create integrations between different software systems.


