Abstract
Web content is an essential resource for large language model (LLM) services, supporting both training and inference. To manage content access by web bots operated by LLM service vendors (i.e., LLM bots), web content publishers are increasingly adding access rules to robots.txt, a long-established web content management protocol. The rise of proprietary LLM bots, such as OpenAI’s ChatGPT-User and Google’s Google-Extended, has raised concerns about the transparency of web content access and whether these bots adhere to robots.txt rules. Yet little is known about how these LLM bots affect web publishers and broader web content governance. To fill this gap, we present a systematic analysis of 18 LLM bots across 582,281 robots.txt files. Our findings reveal a significant increase in robots.txt rules targeting LLM bots, particularly on domains in the finance and news categories. Despite this growing adoption, web publishers struggle to manage robots.txt configurations due to the complexity of the LLM ecosystem and the involvement of third-party brokers. Furthermore, we identified several robots.txt violations, including instances where LLMs memorized web content from restricted domains and where ChatGPT-User ignored robots.txt and accessed restricted content. These results highlight gaps in current web content governance and underscore the need for enforceable content management mechanisms that respect web publishers’ intentions and content control.
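To illustrate the kind of robots.txt rules the abstract describes, the sketch below parses a hypothetical policy restricting LLM bots and checks which paths each bot may fetch. The user-agent tokens (GPTBot, ChatGPT-User, Google-Extended) are real, publicly documented crawler names; the rules, paths, and domain are illustrative assumptions, not taken from the study's dataset.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to restrict LLM bots.
# Agent tokens are documented by OpenAI and Google; paths are made up.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /private/

User-agent: Google-Extended
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(RULES)

# GPTBot is disallowed everywhere on this (hypothetical) site.
print(rp.can_fetch("GPTBot", "https://example.com/article"))          # False
# ChatGPT-User may fetch public pages but not the restricted path.
print(rp.can_fetch("ChatGPT-User", "https://example.com/article"))    # True
print(rp.can_fetch("ChatGPT-User", "https://example.com/private/x"))  # False
```

A compliant bot performs exactly this check before each fetch; the violations reported above correspond to fetches proceeding despite `can_fetch` returning False.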
Code: https://github.com/jiancui-research/llmbot_compliance_ccs25
Dataset: https://drive.google.com/uc?export=download&id=16y_QrENhjra7lCDrRz7yIJqE-bwrGzXr
Our artifact received all three badges in the artifact evaluation phase: Artifact Available, Artifact Functional, and Results Reproduced.