vLLM (Virtual Large Language Model) is a fast and easy-to-use library for LLM inference and serving. Check the vLLM GitHub Page.
Create a dedicated Python virtual environment (refer to the HPC Guide for vLLM) to avoid conflicts with existing packages, and activate (source) it.
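For example (the environment path below is only illustrative; follow the exact module and path conventions in the HPC Guide for vLLM):
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate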
Install the vLLM package. This may take some time.
pip install vllm
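Once the installation finishes, you can optionally verify it by printing the installed version:
python -c "import vllm; print(vllm.__version__)"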
Start the vLLM server with the Qwen2.5-1.5B-Instruct model from the Qwen2.5 series of Qwen large language models (LLMs).
vllm serve Qwen/Qwen2.5-1.5B-Instruct
INFO 07-31 16:40:32 [__init__.py:235] Automatically detected platform cuda.
INFO 07-31 16:41:30 [api_server.py:1818] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO: Started server process [1496641]
Open a new terminal and type the following to query the server for the list of models:
curl http://localhost:8000/v1/models
{"object":"list","data":[{"id":"Qwen/Qwen2.5-1.5B-Instruct","object":"model","created":1753994661,"owned_by":"vllm",
You can also query the models endpoint from a browser:
http://localhost:8000/v1/models (see the output screenshot below)
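Beyond listing models, the server exposes the OpenAI-compatible /v1/completions endpoint, so you can also send a test completion request with curl (the prompt and max_tokens values below are only illustrative):
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "prompt": "San Francisco is a", "max_tokens": 16}'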
Copy the directory "LLM" from /usr/local/doc/LLM, change directory to it, and find the Python script "vLLMCompletion.py".
cp -r /usr/local/doc/LLM ./
cd LLM
Run the Python script:
python vLLMCompletion.py
You will get the text output for the prompt "San Francisco is a", for example: … text=' city with a history of being ahead of its time, and this has led to', …
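For reference, below is a minimal sketch of such a completion request, assuming the openai Python client is installed and the server from the earlier step is still running on port 8000; the actual vLLMCompletion.py shipped in the copied LLM directory may differ.
from openai import OpenAI

# Point the client at the local vLLM server (any non-empty API key string works).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Request a short completion for the same prompt used in this walkthrough.
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
    max_tokens=16,
)

# Print the first returned choice, which contains the generated text.
print(completion.choices[0])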