Blocking Bytedance "hidden" AI bot #69

Open
bohwaz opened this issue Jan 9, 2025 · 4 comments

Comments

@bohwaz

bohwaz commented Jan 9, 2025

Bytedance are trying to hide their traffic, but the user agents they send are implausible: for example, iOS 11 running Chrome, or very old Android devices. The IP addresses are all over the place. They might be using TikTok/Doubao users as proxies?

Here are the user agent patterns you want to block (the last one is a regular expression):

MRA58N
OPD3.170816.012
LRX21T
CPU iPhone OS 11_0 like Mac OS X.*Chrome

Here are some example requests; they were hammering my server with thousands of requests every second:

XXXX:443 74.221.151.32 - - [09/Jan/2025:00:01:39 +0100] "GET /xxx/doc/xxxx/www/admin/.htaccess HTTP/1.1" 200 3188 "-" "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.6530.1545 Mobile Safari/537.36"
XXXX:443 75.229.229.57 - - [09/Jan/2025:00:01:39 +0100] "GET /xxx/draft1/tree?ci=yyyy&name=src%2Ftemplates%2Fconfig%2Fcatxxxx&type=tree HTTP/1.1" 200 6970 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.9180.1767 Mobile Safari/537.36"
XXXX:443 75.180.26.109 - - [09/Jan/2025:00:01:40 +0100] "GET /xxx/draft1/finfo?ci=yyyy&name=doc%2Findex.md HTTP/1.1" 200 21074 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.9415.1534 Mobile Safari/537.36"
XXXX:443 68.12.64.235 - - [09/Jan/2025:00:01:40 +0100] "GET /xxx/draft1/finfo?ci=merge-in%3A59f69df&name=doc%2Fadmin%2Fbxxxs.md HTTP/1.1" 200 18713 "-" "Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2451.1463 Mobile Safari/537.36"
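
To check how much of your own traffic matches, a rough Python sketch along these lines could tally log lines per pattern. It assumes the user agent is the last double-quoted field on each line, as in the excerpt above; the function names and the `access.log` path are made up for illustration, and the dots in the `OPD3` build ID are escaped here so they match literally.

```python
import re
from collections import Counter

# Patterns taken from the list above, treated as regular expressions.
# The dots in the OPD3 build ID are escaped (an assumption about intent).
SUSPECT_PATTERNS = [
    re.compile(r"MRA58N"),
    re.compile(r"OPD3\.170816\.012"),
    re.compile(r"LRX21T"),
    re.compile(r"CPU iPhone OS 11_0 like Mac OS X.*Chrome"),
]

# Assumption: the user agent is the last double-quoted field on the line.
UA_FIELD_RE = re.compile(r'"([^"]*)"\s*$')


def user_agent(line):
    """Return the user agent field of a log line, or None if not found."""
    match = UA_FIELD_RE.search(line)
    return match.group(1) if match else None


def matched_pattern(ua):
    """Return the source of the first pattern that matches ua, or None."""
    for pattern in SUSPECT_PATTERNS:
        if pattern.search(ua):
            return pattern.pattern
    return None


def count_suspects(lines):
    """Count log lines per matching pattern; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        ua = user_agent(line)
        if ua is None:
            continue
        hit = matched_pattern(ua)
        if hit is not None:
            counts[hit] += 1
    return counts


# Hypothetical usage; "access.log" is a placeholder path.
# with open("access.log", encoding="utf-8", errors="replace") as fh:
#     print(count_suspects(fh).most_common())
```

Keep in mind that a genuine old device with one of these build IDs would be counted too, so treat the totals as a hint rather than proof.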

Of course, you can't block them using robots.txt alone; you will need to set up proper user-agent blocking on your web server.
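
If your site runs behind a Python WSGI server rather than relying on web server configuration, the same idea could be sketched as a small middleware. This is only an illustration: the `BlockUserAgents` class name is invented here, the patterns are the ones listed above (with the dots in the `OPD3` build ID escaped), and a real device with one of these build IDs would be blocked as well.

```python
import re

# Patterns from the list above, combined into a single regular expression.
BLOCKED_UA = re.compile(
    r"MRA58N"
    r"|OPD3\.170816\.012"
    r"|LRX21T"
    r"|CPU iPhone OS 11_0 like Mac OS X.*Chrome"
)


class BlockUserAgents:
    """WSGI middleware (hypothetical name) that answers 403 to matching UAs."""

    def __init__(self, app, pattern=BLOCKED_UA):
        self.app = app
        self.pattern = pattern

    def __call__(self, environ, start_response):
        # WSGI exposes the User-Agent request header as HTTP_USER_AGENT.
        ua = environ.get("HTTP_USER_AGENT", "")
        if self.pattern.search(ua):
            body = b"Forbidden\n"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Anything else is passed through to the wrapped application.
        return self.app(environ, start_response)
```

Wrapping an existing application is then a one-liner such as `application = BlockUserAgents(application)`. Rejecting at the web server or at a CDN is cheaper, since the request never reaches Python at all.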

Other sources on this:
https://www.webmasterworld.com/search_engine_spiders/5088284.htm
https://xenforo.com/community/threads/known-bots.148723/page-4

@glyn
Contributor

glyn commented Jan 9, 2025

The links provided do not give the impression that these crawlers are for AI purposes. Unless there is evidence that can be cited that these are AI crawlers, they are out of scope for this project.

@raramuridesign

Just a note on this: we have tried blocking with robots.txt and with .htaccess, but this has never been successful.
The only really effective way we have found is to use the Cloudflare WAF.

AI and other crawlers often do not respect robots.txt files. It is only a guide for crawlers, not an enforcement mechanism.
With .htaccess, we have found that if the file becomes too large, it causes load problems when the site has high traffic.

Pushing this work to Cloudflare or a similar service has been far more effective.
Hope this helps.
M

@bohwaz
Author

bohwaz commented Jan 9, 2025

> The links provided do not give the impression that these crawlers are for AI purposes. Unless there is evidence that can be cited that these are AI crawlers, they are out of scope for this project.

Bytespider is literally an AI bot; it is listed in this repo:

User-agent: Bytespider

Bytespider = Downloads data to train LLMS, including ChatGPT competitors.

@glyn
Contributor

glyn commented Jan 10, 2025

Thanks @bohwaz.

@tinaponting It's still worth adding the Bytedance user agents here, for those of us actively blocking these crawlers.

I'll look forward to a PR.
