3.4 Assignment: Portfolio - Generative AI LLM Infrastructure: Training Process and Costs

 

MODEL DEVELOPMENT

 

 

Indiana Wesleyan University

 

NOVEMBER 2025

LLM Infrastructure Training Flow Explanatory Document

For this artifact, I wanted to visually explain how large language models actually get trained, from the initial gathering of raw data all the way to deployment, when millions of users are sending requests to the model. Rather than presenting technical detail alone, I aimed to spotlight the relationship between engineering effort, computation costs, and human input: three forces that shape the feasibility and ethics of AI development.

The first section of the diagram covers Data Sourcing, which sits at the core of any LLM. I designed this section to confront the viewer with the fact that these models are, at bottom, statistical projections of the material they were trained on. Assembling huge amounts of high-quality text requires strict filtering, respect for copyrighted material, and mindful selection to exclude harmful or biased content. This work carries an unexpectedly high cost in both money and staff time, which is why I gave it so much prominence in the artifact.
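To make the filtering idea concrete, here is a minimal Python sketch of the kind of quality gate such a pipeline might apply; the thresholds, blocklist, and heuristics are hypothetical placeholders, not any lab's actual pipeline.

    def keep_document(text: str, blocklist: set[str]) -> bool:
        """Return True if a raw document passes some simple quality gates."""
        # Drop documents that are too short to carry useful signal.
        if len(text.split()) < 50:
            return False
        # Drop documents containing terms flagged as harmful or off-limits.
        lowered = text.lower()
        if any(term in lowered for term in blocklist):
            return False
        # Drop documents that are mostly non-alphabetic noise (markup, spam).
        alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
        return alpha_ratio > 0.6

    corpus = ["a long, well-formed article ...", "buy now!!! $$$ click here"]
    filtered = [doc for doc in corpus if keep_document(doc, blocklist={"spamterm"})]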

The graphic then moves into Tokenization, where raw text is broken down into standardized tokens that the model can interpret. While this may look like a technical detail, tokenization has a significant impact on model behavior and efficiency, so it felt important to give readers typical vocabulary sizes and context lengths as a frame of reference.
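As a toy illustration of that mapping, the sketch below turns words into integer ids and truncates to a context window. Real systems learn subword vocabularies (for example, byte-pair encoding) with tens of thousands to a couple hundred thousand tokens and context windows of thousands of tokens; the tiny vocabulary and window here are made up for demonstration.

    vocab = {"<unk>": 0, "the": 1, "model": 2, "reads": 3, "tokens": 4}
    CONTEXT_LENGTH = 8  # real models allow thousands of tokens of context

    def encode(text: str) -> list[int]:
        """Map words to integer ids, truncating to the model's context window."""
        ids = [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]
        return ids[:CONTEXT_LENGTH]

    print(encode("The model reads tokens"))  # [1, 2, 3, 4]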

With Model Blueprint, I focused on the design choices that architects have to make before any training starts: How deep should the network be? How wide? How many attention heads? How are parameters sharded across hardware? I cited examples like LLaMA-3, Claude 3, and GPT-4 to anchor these decisions in tangible scales the audience can identify with.
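The sketch below expresses those blueprint decisions as a configuration object with a very rough parameter count. The numbers are illustrative, loosely in the range reported for open models of roughly the 7-8 billion parameter class, not an exact published specification.

    from dataclasses import dataclass

    @dataclass
    class ModelBlueprint:
        num_layers: int        # how deep the network is
        hidden_size: int       # how wide each layer is
        num_heads: int         # attention heads per layer
        vocab_size: int        # tied to the tokenizer choice
        context_length: int    # maximum tokens per input

        def approx_params(self) -> int:
            """Very rough parameter count: embeddings plus transformer blocks."""
            embed = self.vocab_size * self.hidden_size
            per_layer = 12 * self.hidden_size ** 2  # attention + MLP, ignoring small terms
            return embed + self.num_layers * per_layer

    blueprint = ModelBlueprint(num_layers=32, hidden_size=4096,
                               num_heads=32, vocab_size=128_000, context_length=8192)
    print(f"{blueprint.approx_params() / 1e9:.1f}B parameters (rough estimate)")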

Distributed Training is the most resource-intensive stage and sits at the center of the artifact. I illustrated it as the pivotal point in the workflow because this is where the reality of resource consumption becomes apparent: the accelerators, high-speed networking, and multi-week training cycles consume a small fortune in electricity and demand continuous engineering attention. The graphic includes cost estimates to reveal how much these projects actually cost.
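The back-of-the-envelope arithmetic behind such estimates is simple, even though every input below (GPU count, hourly rate, run length) is a hypothetical assumption rather than a vendor quote or a published training budget.

    num_gpus = 2048        # accelerators reserved for the run (assumed)
    hourly_rate = 2.50     # assumed cost per GPU-hour in USD; cloud list prices vary widely
    run_days = 30          # a multi-week pretraining cycle (assumed)

    gpu_hours = num_gpus * 24 * run_days
    compute_cost = gpu_hours * hourly_rate
    print(f"{gpu_hours:,} GPU-hours -> ~${compute_cost:,.0f} in compute alone")
    # 1,474,560 GPU-hours -> ~$3,686,400, before staffing, storage, networking, and failed runs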

In the Optimization & Regularization section, I emphasized that model training is not simply a matter of pressing start. Every training run involves managing optimization algorithms, memory-efficient techniques, and multiple styles of parallelism to fit enormous models across clusters of hardware. These are the hidden intricacies that determine whether a training run succeeds or crashes.
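For instance, here is a minimal PyTorch sketch of two of those techniques: the AdamW optimizer with weight decay for regularization, and gradient accumulation, a memory-saving trick that simulates a larger batch. The tiny linear model and random data are stand-ins; real runs layer mixed precision, sharding, and checkpointing on top.

    import torch

    model = torch.nn.Linear(128, 128)
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    accum_steps = 4  # accumulate gradients over 4 micro-batches before each update

    for step in range(8):
        x = torch.randn(16, 128)
        loss = model(x).pow(2).mean()
        (loss / accum_steps).backward()  # scale so the accumulated gradient averages out
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # keep updates stable
            optimizer.step()
            optimizer.zero_grad()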

The Alignment & RLHF stage recognizes the human element in model behavior. I included this part to demonstrate that generative AI is not purely automated: people review outputs, label data, and help refine the model's reward system. These interventions directly shape the model's safety, tone, and helpfulness.
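A toy version of that feedback signal appears below: labelers compare two candidate responses, and a reward model is trained so the preferred one scores higher. The scores and the Bradley-Terry-style loss are illustrative only; production RLHF pipelines are far more involved.

    import math

    def preference_loss(score_preferred: float, score_rejected: float) -> float:
        """Bradley-Terry style loss: small when the preferred response scores higher."""
        return -math.log(1 / (1 + math.exp(-(score_preferred - score_rejected))))

    # A human labeler preferred response A over response B for the same prompt.
    print(preference_loss(score_preferred=2.1, score_rejected=0.4))   # low loss: model agrees
    print(preference_loss(score_preferred=-1.0, score_rejected=1.5))  # high loss: model disagrees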

The Deployment & Operations section notes that LLMs continue to incur costs well beyond the end of training. Running a model in production requires load balancing and monitoring, automatic or manual scaling of resources when needed, reliability engineering, and safety controls. The operational cost of hosting an LLM, in many cases, outweighs the cost of training it.
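As one small example of the operational side, the sketch below implements a simple autoscaling heuristic that watches request load and chooses a replica count. The thresholds and limits are made-up values; real deployments rely on orchestration systems with much richer signals such as latency, queue depth, and GPU utilization.

    def desired_replicas(requests_per_sec: float, capacity_per_replica: float = 5.0,
                         min_replicas: int = 2, max_replicas: int = 64) -> int:
        """Scale the fleet so each replica stays near its assumed capacity."""
        needed = -(-requests_per_sec // capacity_per_replica)  # ceiling division
        return int(max(min_replicas, min(max_replicas, needed)))

    print(desired_replicas(3.0))    # quiet traffic -> stays at the minimum of 2 replicas
    print(desired_replicas(180.0))  # traffic spike -> scales up to 36 replicas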

On the whole, the artifact is designed to provide a transparent and accessible look at the LLM lifecycle. The layer-cake format, clean color scheme, and organized sequence were chosen to make an otherwise complex process digestible for a non-technical viewer.