Blocking Bytedance "hidden" AI bot #69

bohwaz · 2025-01-09T04:29:39Z

Bytedance is trying to hide its traffic, but the user agents it sends are implausible: for example, iOS 11 running Chrome, or very old Android devices. The IP addresses are all over the place. Might they be using TikTok/Doubao users as proxies?

Here are the user-agent patterns you want to block (the first three are plain substrings, the last one is a regular expression):

MRA58N
OPD3.170816.012
LRX21T
CPU iPhone OS 11_0 like Mac OS X.*Chrome

Here is a sample of the requests; they were hammering my server with thousands of requests every second:

XXXX:443 74.221.151.32 - - [09/Jan/2025:00:01:39 +0100] "GET /xxx/doc/xxxx/www/admin/.htaccess HTTP/1.1" 200 3188 "-" "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.6530.1545 Mobile Safari/537.36"
XXXX:443 75.229.229.57 - - [09/Jan/2025:00:01:39 +0100] "GET /xxx/draft1/tree?ci=yyyy&name=src%2Ftemplates%2Fconfig%2Fcatxxxx&type=tree HTTP/1.1" 200 6970 "-" "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.9180.1767 Mobile Safari/537.36"
XXXX:443 75.180.26.109 - - [09/Jan/2025:00:01:40 +0100] "GET /xxx/draft1/finfo?ci=yyyy&name=doc%2Findex.md HTTP/1.1" 200 21074 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.9415.1534 Mobile Safari/537.36"
XXXX:443 68.12.64.235 - - [09/Jan/2025:00:01:40 +0100] "GET /xxx/draft1/finfo?ci=merge-in%3A59f69df&name=doc%2Fadmin%2Fbxxxs.md HTTP/1.1" 200 18713 "-" "Mozilla/5.0 (Linux; Android 8.0; Pixel 2 Build/OPD3.170816.012) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2451.1463 Mobile Safari/537.36"
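
To find out which user agents are responsible for this kind of flood, one option is to tally requests per user agent from the access log. The following is only a rough sketch: it assumes log lines laid out like the sample above (virtual host, client IP, two placeholder fields, timestamp, request, status, size, referer, user agent), and the names `LOG_LINE` and `count_user_agents` are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout, based on the sample lines above:
# vhost:port ip - - [timestamp] "request" status size "referer" "user agent"
LOG_LINE = re.compile(
    r'^(?P<vhost>\S+) (?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)


def count_user_agents(lines):
    """Return a Counter mapping each user-agent string to its request count.

    Lines that do not match the assumed layout are skipped silently.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group("agent")] += 1
    return counts
```

Passing an open log file to `count_user_agents` and calling `.most_common(10)` on the result lists the ten busiest user agents, which is usually enough to spot strings like the ones above.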

Of course, you can't block them with robots.txt alone; you will need to set up proper user-agent blocking on your web server.
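
As one possible illustration of user-agent blocking, here is a minimal sketch written as Python WSGI middleware rather than web-server configuration. It combines the four patterns listed above into a single regular expression and answers matching requests with 403. The names `BLOCKED_AGENT`, `is_blocked`, and `BlockUserAgents` are hypothetical, and the dots in the `OPD3.170816.012` build string are escaped here so they match literally. Note that these patterns could also match genuine visitors on those old devices.

```python
import re

# The four patterns from the list above, combined into one expression.
# Assumption: the dots in the OPD3 build string are meant literally.
BLOCKED_AGENT = re.compile(
    r"MRA58N"
    r"|OPD3\.170816\.012"
    r"|LRX21T"
    r"|CPU iPhone OS 11_0 like Mac OS X.*Chrome"
)


def is_blocked(user_agent):
    """Return True if the user-agent string matches a blocked pattern."""
    return bool(BLOCKED_AGENT.search(user_agent or ""))


class BlockUserAgents:
    """WSGI middleware that rejects blocked user agents with 403."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            body = b"Forbidden\n"
            start_response(
                "403 Forbidden",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                ],
            )
            return [body]
        # Everything else is passed through to the wrapped application.
        return self.app(environ, start_response)
```

An existing WSGI application would be wrapped as `app = BlockUserAgents(app)`. On a busy site, doing the same match in the web server or at a CDN/WAF layer is likely cheaper, since rejected requests then never reach the application.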

Other sources on this:
https://www.webmasterworld.com/search_engine_spiders/5088284.htm
https://xenforo.com/community/threads/known-bots.148723/page-4

glyn · 2025-01-09T05:45:11Z

The links provided do not give the impression that these crawlers are for AI purposes. Unless there is evidence that can be cited that these are AI crawlers, they are out of scope for this project.

raramuridesign · 2025-01-09T07:51:40Z

Just a note on this: we have tried blocking with robots.txt and with .htaccess, but this has never been successful.
The only really effective way we have found is to use the Cloudflare WAF.

It is well known that many AI and other crawlers do not respect robots.txt files. The file is only a guide that tells crawlers which parts of a site they may visit; it enforces nothing.
With .htaccess, we have found that once the file becomes too large it causes load issues when the site has high traffic.

Pushing this work to Cloudflare or a similar service has been far more effective.
Hope this helps.
M

bohwaz · 2025-01-09T15:22:10Z

> The links provided do not give the impression that these crawlers are for AI purposes. Unless there is evidence that can be cited that these are AI crawlers, they are out of scope for this project.

Bytespider is literally an AI bot; it is listed in this repo:

ai.robots.txt/robots.txt, line 7 in b7f908e:

User-agent: Bytespider

Bytespider = Downloads data to train LLMs, including ChatGPT competitors.

glyn · 2025-01-10T08:35:32Z

Thanks @bohwaz.

@tinaponting It's still worth adding the Bytedance user agents here, for those of us actively blocking these crawlers.

I'll look forward to a PR.
