vLLM (Virtual Large Language Model) is a fast and easy-to-use library for LLM inference and serving. Check the vLLM GitHub Page.
Create a dedicated Python virtual environment (refer to the HPC Guide for vLLM) to avoid conflicts with existing packages, and activate (source) it.
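For example (the environment path below is only illustrative; follow the exact module and path conventions in the HPC Guide for vLLM):
python3 -m venv ~/venvs/vllm
source ~/venvs/vllm/bin/activate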
Install the vLLM package. This may take some time.
pip install vllm
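Once the installation finishes, you can optionally verify it by printing the installed version:
python -c "import vllm; print(vllm.__version__)"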
Start the vLLM server with the Qwen2.5-1.5B-Instruct model from the Qwen2.5 series of Qwen large language models (LLMs).
vllm serve Qwen/Qwen2.5-1.5B-Instruct
INFO 07-31 16:40:32 [__init__.py:235] Automatically detected platform cuda.
INFO 07-31 16:41:30 [api_server.py:1818] Starting vLLM API server 0 on http://0.0.0.0:8000
INFO: Started server process [1496641]
Open a new terminal and type the following to query the server for the list of models:
curl http://localhost:8000/v1/models
{"object":"list","data":[{"id":"Qwen/Qwen2.5-1.5B-Instruct","object":"model","created":1753994661,"owned_by":"vllm",
You can also query the models endpoint from a browser:
http://localhost:8000/v1/models (see the output screenshot below)
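Beyond listing models, the server exposes the OpenAI-compatible /v1/completions endpoint, so you can also send a test completion request with curl (the prompt and max_tokens values below are only illustrative):
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-1.5B-Instruct", "prompt": "San Francisco is a", "max_tokens": 16}'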
Copy the directory "LLM" from /usr/local/doc/LLM, change directory to it, and find the Python script "vLLMCompletion.py".
cp -r /usr/local/doc/LLM ./
cd LLM
Run the Python script:
python vLLMCompletion.py
You will get the text output for the prompt "San Francisco is a", for example: … text=' city with a history of being ahead of its time, and this has led to', …
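For reference, below is a minimal sketch of such a completion request, assuming the openai Python client is installed and the server from the earlier step is still running on port 8000; the actual vLLMCompletion.py shipped in the copied LLM directory may differ.
from openai import OpenAI

# Point the client at the local vLLM server (any non-empty API key string works).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Request a short completion for the same prompt used in this walkthrough.
completion = client.completions.create(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    prompt="San Francisco is a",
    max_tokens=16,
)

# Print the first returned choice, which contains the generated text.
print(completion.choices[0])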