We found that 91 domains explicitly state in their content policies (Terms of Use, Privacy Policy, etc.) that their content may not be used for AI purposes, yet their robots.txt files are configured to allow LLM bots to access that content. The following details the number of domains that fully allow, partially allow, and fully disallow each LLM bot.
As a result, FineWeb, one of the LLM training datasets, contains content from 75 domains that explicitly state their content cannot be used for AI purposes.
We compiled the full list of these domains and the number of URLs from each that are included in FineWeb, and we have reported this issue to Hugging Face.
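To make the three robots.txt categories above concrete, here is a minimal sketch of how a single robots.txt file could be classified for one bot. It is an illustration only, not the procedure used to produce the counts reported here. The function names, the bot tokens in the sample (`GPTBot`, `CCBot`, `ClaudeBot`), and the category definitions are all assumptions: "fully disallow" is taken to mean a `Disallow: /` rule with no `Allow` exceptions, "fully allow" to mean no non-empty `Disallow` rule, and everything else is treated as "partially allow".

```python
def rules_for_agent(robots_txt: str, agent: str) -> list[tuple[str, str]]:
    """Return (directive, path) pairs from the groups that apply to `agent`.

    Groups naming the agent explicitly win; otherwise the `*` groups apply.
    """
    groups = []            # list of (agent names, rules) per group
    agents, rules = [], []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:      # a User-agent line after rules starts a new group
                groups.append((agents, rules))
                agents, rules = [], []
            agents.append(value.lower())
        elif field in ("allow", "disallow") and agents:
            rules.append((field, value))
    if agents:
        groups.append((agents, rules))

    name = agent.lower()
    if any(name in names for names, _ in groups):
        return [r for names, rs in groups if name in names for r in rs]
    return [r for names, rs in groups if "*" in names for r in rs]


def classify(robots_txt: str, agent: str) -> str:
    """Classify access for `agent` under the assumed category definitions."""
    rules = rules_for_agent(robots_txt, agent)
    disallows = [path for kind, path in rules if kind == "disallow" and path]
    allows = [path for kind, path in rules if kind == "allow" and path]
    if not disallows:
        return "fully allow"
    if "/" in disallows and not allows:
        return "fully disallow"
    return "partially allow"


# Hypothetical robots.txt used only to exercise the sketch.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /private/

User-agent: *
Disallow:
"""

for bot in ("GPTBot", "CCBot", "ClaudeBot"):
    print(bot, "->", classify(SAMPLE, bot))
```

The sketch deliberately ignores wildcard patterns such as `Disallow: /*` and does not fetch anything over the network, so a real audit would need a fuller parser and a crawl step.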