So I've hosted my own personal/family server for years now, with services like Jellyfin, Nextcloud, etc. In light of the recent retroactive editing of information, I thought it might be a good idea to set up a Kiwix mirror as a "just in case" for certain resources that might come under attack in the current political climate. It has done fine and runs without issue (https://kiwix.marcusadams.me if you're curious), but lately I've noticed a humongous influx of AI scraping bots trying to index the contents of that mirror.

On one hand, I don't mind search engines indexing it and directing people to it if the original source of that material is somehow unavailable to someone. I can see in my log file that in all of yesterday, "Googlebot" made 188 requests to it. In that same period, however, "ClaudeBot" made over 800,000. I'd been seeing these requests in the logs for a couple of weeks, but started to get curious because it had been going 24 hours a day for at least a week or two. Upon investigating a little further, I noticed that several AI scrapers from Anthropic (ClaudeBot), Meta, and even ByteDance have been hammering my poor little home server non-stop for a while now.

It would be one thing if they just grabbed a directory listing and used that to inform a search engine, but no, they're trying to fully index the entire contents of the site, which comprises nearly a terabyte of data, at a rate of about 100 kB/s; I'm guessing to stay under the radar of rate limiting and such. The problem is that this is a personal server and the RAID array sits on spinning-rust hard drives, so even at only 100 kB/s, the end result is that they're keeping my read heads moving around constantly to scrape data to train their AI with.
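If you're curious what's hitting your own server, something like this quick one-liner will give you a per-bot tally (default Debian log paths assumed, where access.log.1 is yesterday's rotated log; adjust to your setup):

grep -oE "ClaudeBot|Googlebot|Bytespider|meta-externalagent" /var/log/apache2/access.log.1 | sort | uniq -c | sort -rn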

So I ultimately decided that I need to just block abusive AI scrapers, partly because they steal data en masse to train an AI so you never have to visit the original source, and partly because I don't think any one of those billion-dollar companies is gonna send me a dollar to help pay for my electricity, internet, or replacement drives when they fail, despite them benefiting from all of those things.


First I tried creating a custom filter for Fail2Ban. That works, but my inbox has exploded. Initially I had the limit set to 5 connections within a 10-minute window before triggering a ban. As of writing, I have right at 1,900 emails in my system inbox, all notifications from Fail2Ban about unique IP addresses that have been banned. It seems like as soon as one IP gets banned, they just move to another one and keep going.


It's been almost 24 hours, and Fail2Ban has slowed them down considerably. Claude dropped from over 800,000 requests in yesterday's log to just over 75,000 today. That's probably mostly down to the time it takes them to realize they've been blocked and switch to a new IP address. You would think they'd take the blatant blocking as a sign that their behavior was not welcome. Nope. They're all still going. So this evening I've made two more changes.

First, I reduced the number of requests needed to trigger a ban from 5 to 1. I had originally figured allowing 5 requests within 10 minutes would be plenty, on the off chance an image or something in a search result was pulled from something I'm hosting.

Second, I modified the Apache site config for the archive to return a "403 Forbidden" response to anything matching the same user agents I've blocked in Fail2Ban.

It's been about an hour, and the requests still haven't slowed down, even though they now have to switch IPs after every single request to get around the firewall bans, and even then they don't get the one file they requested, just a 403 error.


It seems to me like the recent release of DeepSeek has instigated some kind of AI arms race where any publicly accessible data is fair game for training your models. It doesn't matter if it's straight from reputable news sites or some random dude's home NAS running on an old PC tower in the backwoods of Kentucky.


I just wanted to share this little anecdote about what's going on right now, and maybe give a heads-up to those of you who host similar services, especially if you have bandwidth caps on your personal connection, or you're hosting on a VPS that charges based on bandwidth.


Put your site behind Cloudflare; it's free, and they have a new anti-AI-bot feature that sends bots into an AI hell spiral of ###### data so they stop indexing your site(s).

  On 24/03/2025 at 04:43, binaryzero said:

Restrict your firewall to only allow connections from a known list of IPs (i.e. the people using the media server); welcome to having your infrastructure open to the world... 

The dreaded 'Any' rule strikes again.


I would, except I do occasionally use my Nextcloud instance to share files with friends and family. My wife and I don't have Facebook, so whenever there's a birthday party or something, I'll make an album on Nextcloud and share a link to it with grandparents and other interested parties. I do have a few things tightened down that way, such as SSH not being port-forwarded and only accepting connections from the LAN/VPN IP ranges, but Apache is one service that needs to remain open.
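For anyone wanting to lock down SSH the same way, a rough sketch of the idea with ufw (the subnets here are placeholders for your actual LAN/VPN ranges):

# Allow SSH only from the LAN and VPN subnets (placeholder ranges)
sudo ufw allow from 192.168.1.0/24 to any port 22 proto tcp
sudo ufw allow from 10.8.0.0/24 to any port 22 proto tcp
# Refuse SSH from everywhere else (ufw evaluates rules in order)
sudo ufw deny 22/tcp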


I just saw an article the other day where iFixit said their website was hit over a million times in a 24-hour period by ClaudeBot. Mine racked up over 800,000. It's getting wild; they're just hoovering up anything they can find to try to stay competitive with Chinese companies.


The suggestion from Mud W1ggle of putting your public mirrors behind Cloudflare is a good one. The data will be cached, so there's less load on your server, and you get the protection of Cloudflare's web application firewall; they're pretty good at blocking bots and unwanted traffic too.

Something you could do is keep any publicly accessible data, like the Wikipedia mirror, on an NVMe drive, and keep only your media library and family photos on your array. That way the array should stay spun down most of the time, unless family are accessing that media.

I have various Docker containers running from an NVMe drive, including a game server, yet my array is idle most of the time unless someone starts playing something via Plex or accesses family photos via an SMB share.
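If your drives don't drop into standby on their own, an idle timeout via hdparm is one way to get there (device name is a placeholder; note the -S scale is a bit odd, with 241 meaning one unit of 30 minutes):

# Spin down after 30 minutes of inactivity (adjust device to suit)
sudo hdparm -S 241 /dev/sdX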

This results in very low idle power usage, despite having quite a few self-hosted services running.


  On 25/03/2025 at 10:32, hornett said:

I've got the same issue. If you get a chance, would you mind sharing your Fail2Ban filter definition and the jail config?

Thanks


Contents of /etc/fail2ban/filter.d/aibots.conf (Rename it whatever you want):

# Fail2Ban filter for misbehaving AI scrapers and bots
# that don't respect robots.txt
# Marcus Dean Adams

[Definition]
failregex = ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*ClaudeBot.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*meta-externalagent.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*meta-externalfetcher.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*Bytespider.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*GPTBot.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*anthropic-ai.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*FacebookBot.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*Diffbot.*$
            ^<HOST> -.*"(GET|POST|PUT|DELETE|HEAD|OPTIONS|CONNECT).*PerplexityBot.*$

Contents of /etc/fail2ban/jail.d/aibots.local:

[aibots]
enabled = true
port = 80,443
filter = aibots
maxretry = 1
bantime = 168h
findtime = 10m
logpath = /var/log/apache2/access.log
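
Once both files are in place, reload Fail2Ban and you can watch the jail do its thing:

sudo fail2ban-client reload
sudo fail2ban-client status aibots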

There are other bots out there with different user-agent strings you may want to add to your filter, but Google and the few others I've seen haven't been spamming the living daylights out of me, so I've left them alone.
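It's probably still worth listing the worst offenders in robots.txt too; well-behaved crawlers honor it even if these particular bots don't. A minimal example, served from the web root:

User-agent: ClaudeBot
User-agent: GPTBot
User-agent: Bytespider
User-agent: meta-externalagent
User-agent: PerplexityBot
Disallow: /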

I also made a change to the Apache site configuration file and added this, so that anything with one of the specified user agents gets a "403 Forbidden" error instead of the file it actually requested.

                #Block AI Bots
                RewriteEngine on

                RewriteCond %{HTTP_USER_AGENT}  ^.*Bytespider.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*ClaudeBot.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*meta-externalagent.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*meta-externalfetcher.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*GPTBot.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*anthropic-ai.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*FacebookBot.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*Diffbot.*$
                RewriteRule . - [R=403,L]

                RewriteCond %{HTTP_USER_AGENT}  ^.*PerplexityBot.*$
                RewriteRule . - [R=403,L]
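
If you'd rather maintain the list in one place, the same thing can be collapsed into a single rule. This should behave the same ([F] is mod_rewrite shorthand for returning 403 Forbidden and implies [L]; [NC] additionally makes the match case-insensitive, which the per-bot rules above are not). Run "apachectl configtest" before reloading Apache, just to be safe:

                #Block AI bots (single-rule variant)
                RewriteEngine on
                RewriteCond %{HTTP_USER_AGENT} (Bytespider|ClaudeBot|meta-externalagent|meta-externalfetcher|GPTBot|anthropic-ai|FacebookBot|Diffbot|PerplexityBot) [NC]
                RewriteRule . - [F]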

 

  On 25/03/2025 at 10:32, hornett said:

I've got the same issue. If you get a chance, would you mind sharing your Fail2Ban filter definition and the jail config?

Thanks


I got tired of being blown up with emails from this jail (4,500+ unique IP addresses banned since turning it on a couple of days ago), partly because of the sheer volume of notifications, and partly because they were drowning out legitimate emails from the server, so I slightly modified the jail file to specify an action that doesn't send an email. I also bumped the ban time up to 4 weeks.

New contents of /etc/fail2ban/jail.d/aibots.local:

[aibots]
enabled = true
port = 80,443
filter = aibots
maxretry = 1
bantime = 672h
findtime = 10m
logpath = /var/log/apache2/access.log
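# %(action_)s bans only; the %(action_mw)s / %(action_mwl)s variants are the ones that send notification emails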
action = %(action_)s

 

The spam has slowed down considerably. I still get a couple of new banned IPs every hour, but after my initial post, where I thought things were slowing down, they picked right back up; ByteDance seemed to pick up the slack once Claude eased off, just hammering me non-stop. Before instituting the block I was getting over a million automated bot requests a day (predominantly Claude at first), and the block has slowed them down considerably by forcing them to switch addresses constantly, but I've still racked up 4,500 unique IP addresses on the block list since Sunday. Since bumping the ban time from 1 to 4 weeks, the rate of new bans has gone from one or more per minute, to one every 3 or 4 minutes this morning, to one every 6 to 10 minutes now, so they're either turning their attention away from me or just straight up running out of IP addresses to swap to.

As long as other folks like Google keep their traffic reasonable, I hopefully won't have to add anybody else to the list.



Anything that is public on my home server I tunnel through Cloudflare using cloudflared (https://github.com/cloudflare/cloudflared). I'd recommend it; you don't need to open any ports, and you also get a lot of security features, caching, etc. for free.
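The named-tunnel flow is roughly this (the tunnel name and hostname here are placeholders, and you also point the tunnel at your local service in its config.yml):

cloudflared tunnel login
cloudflared tunnel create homelab
cloudflared tunnel route dns homelab kiwix.example.com
cloudflared tunnel run homelab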

  On 27/03/2025 at 07:04, SuperKid said:

Anything that is public on my home server I tunnel through Cloudflare using cloudflared (https://github.com/cloudflare/cloudflared). I'd recommend it; you don't need to open any ports, and you also get a lot of security features, caching, etc. for free.


I'll definitely check it out; you're the second person who has mentioned Cloudflare. I've just been busy with other stuff and haven't taken the time to sit down and look at it.
