Cache Sharing in LLM Serving Systems: A Hidden Privacy Threat
While optimizations for LLM serving systems are well studied, the privacy risks they introduce remain insufficiently explored. Among these optimizations, cache sharing boosts service efficiency but also creates response-time variations, opening the door to timing side-channel vulnerabilities. We systematically investigate two common caching mechanisms in LLM serving systems, the KV cache and the semantic cache, and, by exploiting their timing side channels, develop the Prompt Stealing Attack (PSA) and the Peeping Neighbor Attack (PNA) to underscore the importance of integrating privacy safeguards alongside performance enhancements.
KV Cache: For each inference request, the LLM maintains an in-memory state called the KV cache, which is reused at every decoding iteration throughout the request's service time. Because of the causal attention mask in LLMs, each token's activations depend only on the preceding tokens in the sequence. Thus, if multiple requests share a common prefix, the key and value entries for those prefix tokens are identical and can be shared across requests; reusing them skips part of the prefill computation, which shows up as a lower time to first token (TTFT).
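To make the timing signal concrete, here is a minimal sketch of how TTFT can be measured against an OpenAI-compatible endpoint such as the one SGLang exposes. The base URL, model name, and prompts are placeholders, not the exact setup used in our demos.

```python
# Minimal sketch (not the exact demo setup): measure TTFT against an
# OpenAI-compatible endpoint, e.g. an SGLang server.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

def measure_ttft(prompt: str, model: str = "meta-llama/Llama-3-8B-Instruct") -> float:
    """Return the time in seconds until the first streamed token arrives."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=8,
    )
    for _ in stream:                      # the first streamed chunk marks the first token
        return time.perf_counter() - start
    return float("inf")

shared_prefix = "You are a helpful travel assistant for a small agency. " * 10
cold = measure_ttft(shared_prefix + "Plan a trip to Paris.")  # prefix not cached yet
warm = measure_ttft(shared_prefix + "Plan a trip to Rome.")   # prefix KV entries reused
print(f"cold TTFT: {cold:.3f}s  warm TTFT: {warm:.3f}s")
```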
Semantic Cache: The semantic cache improves LLM serving performance by caching responses keyed on the semantic content of requests: when a new request is semantically similar to a cached one, the cached response is returned immediately. A sharp drop in TTFT therefore indicates this kind of cache sharing.
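The toy sketch below illustrates the mechanism only; it is not the implementation of any particular system (real semantic caches such as GPTCache use vector stores and more elaborate policies). The encoder choice and similarity threshold are assumptions.

```python
# Toy semantic cache: responses are keyed by a query embedding, and any new
# query whose embedding is similar enough to a cached one is answered from
# the cache in lookup time, which is what collapses its TTFT.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
_cache: list[tuple[np.ndarray, str]] = []       # (query embedding, cached response)

def store(query: str, response: str) -> None:
    _cache.append((encoder.encode(query, normalize_embeddings=True), response))

def lookup(query: str, threshold: float = 0.85) -> str | None:
    q = encoder.encode(query, normalize_embeddings=True)
    for emb, response in _cache:
        if float(np.dot(q, emb)) >= threshold:  # cosine similarity (embeddings normalized)
            return response                     # cache hit: returned immediately
    return None                                 # miss: the backend LLM must be called

store("Best flights from New York to Tokyo in May?", "Here are a few options ...")
print(lookup("Cheapest flight NYC to Tokyo this May?"))   # likely a cache hit
```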
We provide three end-to-end demos below to show how these timing side channels can be exploited to compromise user privacy. For a more detailed analysis of the side channels and how we exploit them, please refer to our submitted paper.
Video Demos:
The captions in these demos introduce each attack, so please turn them on. If the captions still don't show, please open the demo videos and watch them on YouTube. Thank you!😉
Demo1: Prompt Stealing Attack
Scenario:
In this demo, we consider a scenario where a victim develops a popular LLM chatbot using a proprietary system prompt. The LLM backend uses the SGLang API server, which supports automatic KV cache sharing for common prefixes.
Victim and Attacker:
The victim is the developer of the LLM chatbot, who owns the proprietary system prompt, while the attacker is a normal user of the chatbot.
The attacker interacts with the chatbot, measures the TTFT of each request, and attempts to recover the system prompt from the timing differences.
Analysis and Discovery:
The attack begins by activating the system prompt so that its KV entries are cached. The attacker then uses a next-token predictor to generate candidate requests that extend the tokens recovered so far. After each request, the TTFT is measured, and a timing-based classifier determines whether the predicted next token was correct. By repeating this process, the attacker reconstructs the system prompt token by token. Our evaluation shows that the attack achieves 89.0% accuracy in predicting a single token, with a false-positive rate (FPR) of 0.04, and in our experiments up to 81 consecutive tokens of the proprietary system prompt were recovered. Even when the predictor is less accurate, the attack remains robust, as shown in Demo1 v2.
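A hedged sketch of this recovery loop follows. `measure_ttft` is the timing helper from the first sketch, `propose_next_tokens` stands in for the next-token predictor (e.g., a small local LM ranking candidate tokens), and the fixed threshold is a simplification of the timing-based classifier; none of these names come from the paper's implementation.

```python
# Sketch of token-by-token system-prompt recovery via the KV cache timing
# side channel. How the guessed prefix is embedded in a request depends on
# the concrete deployment; here it is sent directly as the prompt.
from typing import Callable, List

def recover_prompt(
    measure_ttft: Callable[[str], float],
    propose_next_tokens: Callable[[str], List[str]],
    hit_threshold: float,                 # TTFT below this is classified as a cache hit
    max_tokens: int = 100,
) -> str:
    recovered = ""
    for _ in range(max_tokens):
        extended = False
        for candidate in propose_next_tokens(recovered):
            # A guess that extends the cached system prompt by the correct
            # next token reuses one more cached prefix token, keeping TTFT low.
            if measure_ttft(recovered + candidate) < hit_threshold:
                recovered += candidate
                extended = True
                break
        if not extended:                  # no candidate classified as a hit; stop
            break
    return recovered
```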
When the predictor is accurate:
When the predictor is less accurate:
Demo2: Peeping Neighbor Attack
Scenario:
In this demo, we consider an LLM-powered chatbot that supports semantic caching and uses OpenAI's GPT-3.5-turbo as its backend service.
Victim and Attacker:
Both the victim and the attacker are normal users of the chatbot.
Analysis and Discovery:
The victim's queries may include private information, such as their "name," "destination," or "medical condition." We found that these private attributes create significant semantic differences between queries. As a result, the attacker can craft a small set of probe queries (four in our demo), each targeting a candidate attribute value, and use the semantic cache's timing behavior to determine whether the victim has submitted queries containing these private attributes. Our evaluation shows that the attack achieves an accuracy of over 95.4%, with an FPR of 0.05.
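A sketch of this probing step is below. The candidate attribute values, probe phrasing, and hit threshold are illustrative only, and `measure_ttft` is again the timing helper from the first sketch.

```python
# Each probe embeds one candidate value of the private attribute. A probe
# that is semantically close to a query the victim already issued hits the
# semantic cache and returns in essentially lookup time.
CANDIDATE_DESTINATIONS = ["Paris", "Tokyo", "Cairo", "Lima"]

def probe_destination(measure_ttft, hit_threshold: float = 0.2) -> list[str]:
    hits = []
    for city in CANDIDATE_DESTINATIONS:
        probe = f"Please plan a weekend trip to {city} for me."
        if measure_ttft(probe) < hit_threshold:   # sharp TTFT drop -> cache hit
            hits.append(city)
    return hits
```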
Demo3: Inferring Documents on Commodity LLM Services
Scenario:
In this demo, we consider an LLM application that generates summaries for user-uploaded documents. The application relies on public LLM services, such as the DeepSeek API.
Victim and Attacker:
Both the victim and the attacker are normal users of the application.
Analysis and Discovery:
We discovered that the automatic KV cache sharing feature in these commodity LLM services reduces the time needed to summarize a document that has been submitted before. By exploiting this timing side channel, the attacker can set a simple time threshold to determine whether the victim has uploaded a particular document. Our evaluation shows that this attack achieves an average accuracy of 89%, with an average FPR of 0.05.
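One way such a threshold could be calibrated is sketched below. `summarize_ttft` is a hypothetical wrapper that submits a document to the summarization application and returns the TTFT, and prepending a random string is just one way to obtain an uncached baseline; neither is part of the paper's implementation.

```python
# Calibrate the decision threshold between "document already cached" and
# "document not cached" by measuring warm and cold TTFTs for a reference
# document, then splitting the difference between the two means.
import statistics
import uuid

def calibrate_threshold(summarize_ttft, reference_doc: str, trials: int = 5) -> float:
    summarize_ttft(reference_doc)                     # prime the provider-side KV cache
    cold, warm = [], []
    for _ in range(trials):
        # A unique prefix defeats prefix sharing and yields an uncached baseline.
        cold.append(summarize_ttft(f"[{uuid.uuid4()}] {reference_doc}"))
        warm.append(summarize_ttft(reference_doc))    # cached prefix -> lower TTFT
    return (statistics.mean(cold) + statistics.mean(warm)) / 2

# At attack time, a TTFT below this threshold suggests the victim has
# already uploaded this document to the service.
```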