One of the things that has kept me curious is how a Large Language Model (LLM) works. I have studied and implemented tokenizers and multi-head attention, but I still want to dive deeper into the full architecture of an LLM, down to the level of building the loss function. I also think this is a better starting point than jumping straight into implementing newer attention mechanisms, such as FlashAttention.
In this project, I will build GPT-2 from scratch.
I have worked on prompting and fine-tuning with LLMs; now it's time to build one myself, based on Sebastian Raschka's book "Build a Large Language Model (From Scratch)".