The overarching goal of this guide is to increase Artificial Intelligence (AI) literacy while protecting your privacy. It accomplishes this by explaining how to run AI models locally through LM Studio, which offers a window into some of the inner workings of these models while providing a usable experience that does not hand your data to online services.
This guide was originally made for the University of Maryland (UMD) community, especially students, but may be used by anyone.
Note that this guide is up to date as of May 2025; AI is a fast-moving field, so information may become out of date if viewed in the future.
AI tools, and in particular Large Language Models (LLMs), the most commonly used variety, are rapidly becoming integrated into academic life. According to a survey I conducted of 24 UMD students, 87% of respondents make at least 5 queries to AI tools every week, and 20% make more than 50 queries per week. These tools offer valuable assistance for tasks like generating practice problems and exploring concepts, along with various non-academic uses, as acknowledged by UMD's own guidelines (UMD Guidelines). However, the current landscape is dominated by a few major online providers, with tools like ChatGPT being the most prominent choice for many students (UIC Survey). This concentration raises concerns about limited consumer choice and, more critically, significant privacy risks. When students use these popular online chatbots, their data is often transmitted to external servers, where it can be reviewed and used to train future models, potentially compromising confidentiality (IBM, Stanford HAI). In my survey, 62% of students said they believe AI poses a moderate or high risk to their privacy. Indeed, when using ChatGPT, you agree to terms of service that allow OpenAI to collect consumer data for use in training (OpenAI Terms & Policies). The extensive data collection practices inherent in developing powerful AI systems create privacy risks that are not adequately addressed by current regulations, placing the onus on consumers to protect their own privacy (IBM, Stanford HAI).
Beyond privacy, reliance on these readily available online tools often occurs without a deep understanding of the underlying technology. Research suggests that lower AI literacy can paradoxically lead to greater receptivity and less critical engagement with AI, sometimes fostering an uncritical sense of "awe" rather than informed understanding (Tully et al.). Sharmila Duppala, a UMD PhD student studying AI models, has noticed that some undergraduate students at UMD over-rely on AI for problem-solving tasks, which keeps them from learning valuable skills. Studies also indicate wide variance in AI literacy among university students, emphasizing the need for broader AI education to prepare students for a future where AI is ubiquitous (Hornberger et al.). Duppala defines AI literacy for a UMD student as understanding how AI works along with its limitations, and being able to use AI tools effectively and responsibly. Educational institutions are increasingly recognizing this need, with initiatives emerging to integrate AI literacy across the curriculum (Southworth et al.).
This guide tackles these dual issues of AI literacy and privacy. We will explain how to run AI models directly on your computer, using this hands-on approach as a venue to demystify what's happening "under the hood" and help you build a more concrete understanding of AI. Specifically, this guide will focus on running LLMs using LM Studio, as they are currently the most common and feasible type of AI model for local deployment. However, the principles can be extended to other models, such as those for vision and image generation. By following this approach, a small, open-source LLM can run entirely on your machine, producing output without communicating with external cloud servers. Beyond protecting your privacy and fostering a deeper understanding of how AI works, this guide aims to broaden your awareness of consumer choices and highlight alternatives to the options often pushed by dominant corporations.
Fundamentally, a large language model is an algorithm that takes words as input and iteratively predicts the next word so well that it can generate full, coherent paragraphs. This guide explains only the very basics of how a simple LLM works, without going into much detail or covering newer variants like chain-of-thought and mixture-of-experts models.
Additionally, LLMs, and machine learning in general, are a very active area of research, and most breakthroughs have happened through trial and error. Many specifics of how they work are engineering decisions arrived at through experimentation, and experts are not sure exactly why or how LLMs work as well as they do. See the book Understanding Deep Learning by Simon J. D. Prince for more details.
If we would like an algorithm that turns input words into output, we need to convert the string of input words into numbers that can be operated on to produce an output. LLMs do this using a process called tokenization, where a dictionary maps numbers to a list of "words" called tokens. We call them tokens because they are not necessarily full words: they can be parts of words, letters, punctuation marks, symbols, or special tokens that mark things like a code block or the end of output (Karpathy, Wolfram).
For example, the following words are tokenized as pictured below, and are represented using the following numbers by OpenAI's GPT-4 model:
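You can reproduce this kind of tokenization yourself. The sketch below uses tiktoken, OpenAI's open-source tokenizer library, to split a sentence into token IDs and back; the exact IDs you see depend on the encoding, and the example sentence is just an illustration.

```python
# A minimal sketch using the open-source "tiktoken" library
# (pip install tiktoken), which implements the tokenizer used by GPT-4.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # the encoding GPT-4 uses

text = "The quick brown fox"
token_ids = enc.encode(text)                 # text -> list of token IDs
print(token_ids)

# Decode each ID individually to see how the text was split into tokens
print([enc.decode([t]) for t in token_ids])  # e.g. ['The', ' quick', ' brown', ' fox']
```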
The next step is to take the list of numbers, called a vector in our case, that represents the tokens of the input, and perform operations on it to get an output vector. The LLM that you use defines the sequence of operations performed to reach the final output vector; modern LLMs typically use the decoder-only transformer architecture (the T in GPT stands for transformer).
It involves several basic mathematical operations, including multiplying the input vector by a matrix of numbers called weights and adding a vector of numbers called biases (seen in the "Feed forward" layer). The weights and biases an LLM uses are called its parameters, because different LLMs have different values that lead to very different outputs. Other operations are performed alongside these, such as normalizing the numbers so that they don't get too large. Another operation is self-attention, where a further series of weights, biases, and other operations is applied to reflect how related each token is to every other token. In the previous image, it would modify the vector to reflect how the word "The" refers to "fox" and how "brown" describes the "fox" (Karpathy, Wolfram).
It is not necessary to understand every detail of the diagram below; the details described here are sufficient.
A more detailed and animated 3D visualization of what happens can be found here.
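To make the feed-forward step described above more concrete, here is a toy sketch in Python using NumPy. The vector size and the random parameter values are invented purely for illustration; a real model uses learned parameters and vectors with thousands of dimensions.

```python
# A toy sketch (not a real LLM) of the "feed forward" step described above.
import numpy as np

d_model = 4                       # size of each token's vector (real models: thousands)
x = np.random.randn(d_model)      # vector representing one token

# The model's parameters: a weight matrix and a bias vector, set by training
W = np.random.randn(d_model, d_model)
b = np.random.randn(d_model)

h = np.maximum(0, W @ x + b)      # multiply by weights, add biases, apply a nonlinearity

# Normalization keeps the numbers from growing too large between layers
h = (h - h.mean()) / (h.std() + 1e-5)
print(h)
```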
The process described is repeated a certain number of times, usually around 16; each time, the vector representation of the tokens is transformed closer to the final output vector. The final output vector assigns a probability to each token in the dictionary for how likely it is to be the next token (Karpathy, Wolfram).
This probability distribution is then sampled to pick a likely next token. The chosen token is appended to the original input and fed back into the model. This process repeats until the special end-of-response token is chosen, at which point the LLM's response is returned. It turns out that if you scale the model to billions of parameters, you can get coherent paragraphs of text out (Karpathy, Wolfram).
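The whole generation loop can be summarized in a few lines of Python. Here, model is a stand-in for the entire transformer described above: a hypothetical function that returns a probability for every token in the dictionary. Real runtimes such as llama.cpp implement this far more efficiently.

```python
# A minimal sketch of the generation loop described above.
import numpy as np

def generate(model, tokens, end_token, max_new_tokens=200):
    for _ in range(max_new_tokens):
        probs = model(tokens)                 # probability per token in the vocabulary
        next_token = int(np.random.choice(len(probs), p=probs))  # sample the distribution
        if next_token == end_token:           # special end-of-response token
            break
        tokens.append(next_token)             # append and feed back into the model
    return tokens
```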
The parameters of an LLM and how they were determined are the main distinguishing characteristics of current LLMs. The mechanism of an LLM is considered to be a "black box" as there is currently little understanding of what parameters represent and what is happening at each stage (Karpathy).
The actual values of the parameters are arrived at through training the model (the P in GPT stands for pre-trained). The parameters start at random values and are iteratively updated as the LLM tries to predict the next tokens in a large, diverse training repository. This training data set is curated to ensure it contains examples of all the types of text an LLM may generate, such as articles, poems, and code. At the small scale, this stage can take days of computing on graphics cards and thousands of dollars; at the enterprise scale, it can take a year of compute and millions of dollars (Karpathy).
Now that you have a basic understanding of how LLMs work, we will use LM Studio, a free app for running LLMs on your computer. It is among the most popular and beginner-friendly options.
First, download LM Studio from the official website for your operating system and complete the installation. Your computer will need to meet the system requirements; according to the survey I conducted of UMD students, all 24 respondents had computers that met them, so this is unlikely to be an issue.
When the installation is done, open LM Studio and, on the left sidebar, click the Discover button with the magnifying glass. If you don't see the left sidebar, you may have to enable "Developer" mode at the bottom.
LM Studio automatically installs an LLM runtime (llama.cpp), which means the LLM algorithm is predefined and only the specific parameters of a model need to be downloaded (LM Studio Docs).
The popup menu that appears lists available models you can download from each model's repository. Clicking on a model gives you more information, such as its specific architecture, total parameters, and download size. Many of the models will be marked "Likely too large," meaning your computer does not have enough memory to run them. This guide will use "gemma 3 1B QAT," an open-source LLM created by Google. The model has 1 billion total parameters whose values have been rounded to reduce memory usage. This rounding process is called quantization, and it has been shown to reduce memory usage far more than it degrades performance. Based on my survey, most students had between 8 GB and 32 GB of RAM, so the 720 MB model should work on most computers.
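To build intuition for quantization, here is a toy Python sketch of the basic idea: rounding floating-point parameters onto a small set of integer levels with a shared scale. Real schemes, such as the 4-bit formats used by llama.cpp, are more sophisticated, but the principle is the same.

```python
# A toy illustration of quantization: compressing 32-bit floats into integers.
import numpy as np

weights = np.random.randn(8).astype(np.float32)       # original parameters (4 bytes each)

scale = np.abs(weights).max() / 7                     # map the range onto integers -7..7
quantized = np.round(weights / scale).astype(np.int8) # stored compactly (packed to 4 bits in practice)
restored = quantized * scale                          # approximate values used when running the model

print(weights)
print(restored)   # close to the originals, at a fraction of the memory
```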
Click the green "Download" button on the bottom right to download the parameters; afterwards, you can click the "Use in new chat" button that appears in the same place. You should now see a chat window with the model loaded into your computer's memory. You can click the blue "Eject" button at the top to remove the model from memory at any time; deleting the model from your computer entirely requires going to the "My Models" tab on the left, above the "Discover" button.
We can now test the model by entering a sample prompt such as "Can you teach me how to solve a Rubik's cube?".
The tokens being generated as output are printed live to the response window as the model runs.
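As an aside, you can also query the model from code rather than the chat window. The sketch below assumes you have started LM Studio's local server, which exposes an OpenAI-compatible API (by default at http://localhost:1234/v1), and installed the openai Python package; the model identifier shown is an assumption, so use whatever name LM Studio displays for your loaded model.

```python
# A sketch of streaming a response from LM Studio's local server.
from openai import OpenAI

# LM Studio's server does not check the API key, but the client requires one
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

stream = client.chat.completions.create(
    model="gemma-3-1b-it-qat",   # hypothetical identifier; copy yours from LM Studio
    messages=[{"role": "user", "content": "Can you teach me how to solve a Rubik's cube?"}],
    stream=True,                 # receive tokens live, as in the chat window
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```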
After the model finishes running, you may notice a few additional pieces of information in the output:
The number of tokens per second the model generates: my computer generated an average of 15.78 tokens per second, which is sufficient, as it is faster than most people can read and process information. If your computer has a graphics card or a processor with strong integrated graphics, you can potentially get an order of magnitude faster performance.
The amount of context that is full: 35.6% in this case. The context is the number of input tokens a model can take; the Gemma model we are using has a maximum context of 32,000 tokens. This matters because, for each follow-up question asked in the chat, the previous questions and responses are included with the new question for context. If the total context is exhausted, the oldest messages are truncated until there is enough space.
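Here is a simplified sketch of that bookkeeping: the chat history grows with each turn, and the oldest turns are dropped once the window is full. The count_tokens helper is hypothetical (a tokenizer like tiktoken could fill that role); LM Studio handles all of this internally.

```python
# A sketch of how a chat's context fills up and is truncated.
MAX_CONTEXT = 32_000   # the Gemma model's maximum context, in tokens

def build_context(history, new_message, count_tokens):
    history.append(new_message)   # every follow-up carries all earlier turns
    # If the window is exhausted, drop the oldest turns until everything fits
    while sum(count_tokens(m) for m in history) > MAX_CONTEXT:
        history.pop(0)
    return history
```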
You can now eject the model from memory, go back to the Discover tab, and try out more powerful models that your computer supports.
According to the survey, most students use AI to generate code, help with studying, learn new subjects, and brainstorm ideas. A model like the one we used may be insufficient for some of these tasks, so a more powerful model, such as the 27-billion-parameter variant of Gemma 3, can be used if your hardware allows.
A more powerful model will typically have more parameters, allowing it to encode more knowledge and more connections between topics. It will be more likely to produce accurate and elegant responses to prompts. Parameter counts are not everything, however: newer models are trained on better data using more efficient techniques, allowing a small model today to potentially outperform a model with 100 times as many parameters from a few years ago (Karpathy).
This raises the question of how LLM performance is measured; unfortunately, there is no silver-bullet answer. Several standard benchmarks and human-evaluation websites are commonly used to compare LLMs; some notable ones are Chatbot Arena, LiveBench, ARC Prize, and SEAL. Domain-specific benchmarks are often used to evaluate performance on particular tasks like math competition problems and medical diagnosis. It is important to note, however, that some AI labs intentionally train LLMs on problems from the benchmarks to game them and make their models appear better than they are. The best rule of thumb is to ask models a few difficult prompts in the same domain as your typical use case and see how they do.
Despite all the benefits of LLMs, there are drawbacks to be aware of when using them and reading their output.
During the training stage, an LLM picks up the patterns in the form of language, such as grammar and the formats of different types of writing, more readily than the semantic meaning of the content. If an LLM produces a response about a subject its training set did not cover well enough for it to pick up the semantic patterns, it will produce a response that sounds correct but has incorrect content. This is called hallucination (Karpathy). For example, when I previously asked the LLM for a guide on Rubik's cubes, it included plausible-looking but fake links to YouTube videos. You should double-check any significant information, especially information of a kind the model has likely seen only enough of to learn the format but not the content, like links.
One way to reduce hallucinations is to enable search functionality, if available, where the LLM can output special tokens to call a search engine and receive additional context from web results. These features are often implemented to show which source was used for each piece of information.
Another significant limitation of LLMs is the presence of bias in their responses. Bias in language models stems largely from the data they are trained on. Since LLMs learn by analyzing vast amounts of text from the internet, books, forums, and other sources, they inevitably absorb the implicit and explicit biases present in those sources. This includes social, political, cultural, and historical biases.
For example, if a large portion of the training data overrepresents certain demographics or viewpoints while underrepresenting others, the model will reflect those imbalances. It might associate certain professions more with one gender, or produce responses that reflect stereotypes or outdated norms. A common example is when LLMs generate text that assumes doctors are male and nurses are female; this reflects historical biases in the data, not current reality.
Bias can also appear in more subtle ways, like prioritizing dominant cultures’ perspectives or failing to recognize minority voices. For instance, an LLM might provide more detailed responses about Western holidays than non-Western ones simply because of the skew in available data.
These biases are not intentional, but they can have real-world consequences, especially if LLM outputs are used in decision-making or public communications. Developers often try to mitigate bias through careful data selection, human feedback, and fairness tuning techniques, but it’s nearly impossible to eliminate bias completely.
Because of this, users should approach model outputs critically—especially in sensitive areas like race, gender, religion, or politics—and seek diverse perspectives to validate the information. Being aware of how bias is introduced helps users better interpret what the model says and why it might say it.
My name is Areg Gevorgyan. I am a first-year student at UMD studying Mathematics and Computer Science. I'm interested in everything related to these fields and their applications, including artificial intelligence and machine learning.
This guide was created as my final project for my technical writing course, with the aim of showing the UMD community that they don't need to rely on online providers, and of helping them protect their privacy and increase their AI literacy.
Feel free to contact me at areg@terpmail.umd.edu and please leave any feedback, comments, or questions on this form.
This section uses MLA format for citations.
“About LM Studio: LM Studio Docs.” LM Studio - Docs, https://lmstudio.ai/docs/app. Accessed 11 May 2025.
Documentation for LM Studio, one of the most popular apps for deploying AI models locally. The documentation contains all the necessary instructions and information for using the app. It is mainly used for the guide's central objective: instructing readers on how to run a model locally.
Karpathy, Andrej. “Deep Dive into LLMs like ChatGPT.” YouTube, 5 Feb. 2025, https://www.youtube.com/watch?v=7xTGNNLPyMI.
Video created by Andrej Karpathy, a widely recognized expert in AI who cofounded OpenAI, currently runs a company focused on using AI in education, and educates the public on how AI works through his videos. The video provides additional information in an accessible manner on how AI language models work and specifically how they are packaged into a chat program. This provides an additional source of information for specific aspects of the guide, such as the process of creating and preparing the models to serve as chat tools. This reduces the sense of magic around AI with more concrete information on how it works.
Gomstyn, A., and A. Jonker. “AI Privacy.” IBM, 24 Jan. 2025, https://www.ibm.com/think/insights/ai-privacy.
Article by IBM, a large technology company with significant AI research division and editorial content for educating the public about technology. The article provides a broad explanation of many ways in which AI can cause privacy to be compromised and current AI privacy policy. This will be used to explain to the reader the many risks that can be averted by running models locally on their computers.
“Guidelines for the Use of Generative Artificial Intelligence (GenAI) Tools at UMD.” AI @ UMD, https://ai.umd.edu/resources/guideline. Accessed 11 May 2025.
Report by UMD detailing the university's stance towards AI tools and acceptable and suggested use cases for AI. This is used to demonstrate that UMD recognizes many legitimate use cases for AI, beyond viewing AI tools as "cheating tools."
Hornberger, Bewersdorff, and Nerdel. “What Do University Students Know about Artificial Intelligence? Development and Validation of an AI Literacy Test.” Computers and Education: Artificial Intelligence, vol. 5, 2023, https://www.sciencedirect.com/science/article/pii/S2666920X23000449.
Peer reviewed paper by professors from University of Munich published in a journal focusing on technology in education. The paper helps define what AI literacy means and sets clear goals for college students who will be the main audience of the guide. This will help set the initial educational goals of the guide in the beginning and give readers an idea of what they are going to gain by reading the guide.
King, J., and C. Meinhardt. Rethinking Privacy in the AI Era. Stanford University Human-Centered AI Institute, 2024, https://hai-production.s3.amazonaws.com/files/2024-02/White-Paper-Rethinking-Privacy-AI-Era.pdf. Accessed 13 Apr. 2025.
Report written by two leading experts in data privacy and policy at the Stanford Human-Centered AI Institute suggesting how policy should be set to protect privacy in the age of AI. The report contains information on current data privacy practices of AI companies and the current regulatory environment. This will be used to explain to the reader of my guide why they should be concerned about using online AI tools.
“Report on Student Attitudes towards AI in Academia.” Learning Technology Solutions, University of Illinois Chicago, https://learning.uic.edu/news-stories/report-on-student-attitudes-towards-ai-in-academia/.
Report published by the University of Illinois Chicago Learning Technology Solutions that synthesizes the responses of their students who took a survey on their usage of AI tools. The report covers frequency of use, types of tools used, use cases, academic integrity, and the need for AI literacy. The report contextualizes the quantitative responses with some qualitative data collected from the students elaborating on their choices. The findings of the report are used to contextualize the similar survey I conducted and to demonstrate to readers of my guide that it is worth reading.
Southworth, Migliaccio, Glover, Reed, McCarty, Brendemuhl, and Thomas. “Developing a Model for AI Across the Curriculum: Transforming the Higher Education Landscape via Innovation in AI Literacy.” Computers and Education: Artificial Intelligence, vol. 4, 2023, https://www.sciencedirect.com/science/article/pii/S2666920X23000061.
Recent peer reviewed journal article by professors at University of Florida, Gainesville published in a journal on using technology for education. The article focuses on the integration of AI into the curriculum to ensure students are literate on AI models, suggesting a hands-on approach to AI education. This will be used to briefly explain to the reader why teaching the user how to set up the model serves a dual purpose of educating them on AI.
“Terms & Policies.” OpenAI, https://openai.com/policies/. Accessed 13 Apr. 2025.
OpenAI's official terms and policies, which describe how user data from services like ChatGPT may be collected and used, including for training models. This is used to show readers what they agree to when using popular online AI tools.
Tully, S., C. Longoni, and G. Appel. “Lower Artificial Intelligence Literacy Predicts Greater AI Receptivity.” Marketing Science Institute Working Paper Series, no. 24-132, 2024, https://thearf-org-unified-admin.s3.amazonaws.com/MSI_Report_24-132.pdf. Accessed 13 Apr. 2025.
Article published by the Marketing Science Institute, a leading organization that supports marketing research, led by many experts in academia and industry. The report describes how students with less knowledge of AI have greater AI receptivity, as they perceive AI to be more magical and feel greater awe when AI completes an assignment for them. I use the results of this report to persuade the readers of my guide to become more educated on the topic.
Wolfram, Stephen. “What Is ChatGPT Doing . . . and Why Does It Work?” Stephen Wolfram Writings, 14 Feb. 2023, https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/.
Article by Stephen Wolfram, a computer scientist, mathematician, and physicist who specializes in computational science. The article clearly explains how AI language models such as ChatGPT work in an accessible manner. The information is used to give brief explanations of what the AI model the reader is installing locally is doing. This both reduces the sense of magic and provides background knowledge that can help guide how readers think AI will impact them and how they choose to utilize AI in their lives.
A survey of 24 students was conducted during late April and early May. Many of the students were either STEM majors I knew or students in my general education courses. The students were asked questions to assess their attitudes towards, and willingness to try, running AI models locally.
Students were surveyed on their typical use cases for AI; common responses were brainstorming ideas, debugging code, answering questions, learning about a topic, and help with studying.
I conducted an interview with UMD PhD student Sharmila Duppala, who does research on LLMs. Since she is currently out of town, the interview was conducted over a shared Google document for ease of access and flexibility. The purpose of the interview was to receive expert feedback and guidance on the guide to better achieve its goals.
Her feedback was supportive of the guide and its direction, "I think your chosen approach (practical guide using LM Studio) is excellent for your stated goals. It's hands-on, directly addresses privacy, and naturally introduces literacy concepts."
From her experience interacting with UMD undergraduates in courses where she was a teaching assistant, she saw that "Students sometimes treat AI outputs as infallible truths without sufficient critical evaluation. This is particularly concerning for tasks like research or problem-solving where accuracy is paramount."
When asked how she would define AI literacy for UMD students she responded, "For the typical UMD student, I'd define AI literacy as possessing the foundational knowledge and critical thinking skills to: Understand what AI (particularly LLMs and generative AI, given their current prevalence) is, how it generally works, and what its core capabilities and limitations are. Evaluate the outputs of AI systems critically, recognizing potential biases, inaccuracies (hallucinations), and ethical implications."