Blocking Bytedance "hidden" AI bot #69
Comments
The links provided do not give the impression that these crawlers are for AI purposes. Unless there is evidence that can be cited that these are AI crawlers, they are out of scope for this project.
Just a note on this... We have tried to block using robots.txt and using htaccess, but this has never been successful. It is well known that AI and other crawlers often do not respect robots.txt files, which only serve as a guide for crawlers to understand site structure. Pushing this work to Cloudflare or a similar service has been far more effective.
Bytespider is literally an AI bot; it is listed in this repo: Line 7 in b7f908e
Thanks @bohwaz. @tinaponting It's still worth adding the Bytedance user agents here, for those of us actively blocking these crawlers. I'll look forward to a PR.
Bytedance is trying to hide its traffic, but these user agents are very odd: for example, iOS 11 running Chrome, or old Android devices. The IP addresses are all over the place. They might be using TikTok/Doubao users as proxies?
Here are the user agent strings you want to block:
Here is an example of requests, as they were just hammering my server with thousands of requests every second:
Of course, you can't block them using robots.txt alone; you will need to set up proper user-agent blocking on your web server.
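As a rough illustration of what user-agent blocking at the application layer could look like, here is a minimal sketch written as a Python WSGI middleware. It is not what the thread participants run; most people would do this in the web server or CDN configuration instead. The only pattern taken from this thread is `Bytespider`; the names `BLOCKED_UA_PATTERNS`, `is_blocked`, `BlockUserAgents`, and `demo_app` are hypothetical, and the list would need to be extended with the user-agent substrings actually seen in your own logs.

```python
import re

# Assumption: "Bytespider" is the only token taken from this thread.
# Add the substrings you actually observe in your access logs.
BLOCKED_UA_PATTERNS = [r"Bytespider"]
_BLOCKED_RE = re.compile("|".join(BLOCKED_UA_PATTERNS), re.IGNORECASE)


def is_blocked(user_agent):
    """Return True if the User-Agent header matches any blocked pattern."""
    return bool(user_agent) and _BLOCKED_RE.search(user_agent) is not None


class BlockUserAgents:
    """WSGI middleware that answers 403 for blocked user agents."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if is_blocked(user_agent):
            # Refuse the request before it reaches the wrapped application.
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return self.app(environ, start_response)


def demo_app(environ, start_response):
    """Hypothetical stand-in for the real application."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello\n"]


# Wrap the application; any WSGI server can serve `app`.
app = BlockUserAgents(demo_app)
```

Matching on the User-Agent header only stops crawlers that send a recognizable string. Traffic that disguises itself as ordinary browsers, as described above, would need additional signals such as rate limiting.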
Other sources on this:
https://www.webmasterworld.com/search_engine_spiders/5088284.htm
https://xenforo.com/community/threads/known-bots.148723/page-4