this post was submitted on 29 Apr 2025
416 points (97.3% liked)

Technology


The one-liner:

dd if=/dev/zero bs=1G count=10 | gzip -c > 10GB.gz

This is brilliant.
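For anyone curious what the one-liner actually produces, gzip can report the ratio itself. A quick check, scaled down to 1 GB here (gzip -l misreports uncompressed sizes above 4 GB):

```shell
# 1 GB of zeroes, compressed the same way as the 10 GB bomb
dd if=/dev/zero bs=1M count=1024 2>/dev/null | gzip -c > 1GB.gz

# Show compressed vs. uncompressed size and the ratio
gzip -l 1GB.gz
```

The compressed file lands around a megabyte, i.e. a roughly 1000:1 ratio for all-zero input.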

top 50 comments
[–] billwashere@lemmy.world 4 points 5 hours ago

I want to know how they built that visualization

[–] fmstrat@lemmy.nowsci.com 13 points 7 hours ago

I've been thinking about making an nginx plugin that randomizes words on a page to poison AI scrapers.

[–] arc@lemm.ee 12 points 9 hours ago (3 children)

Probably only works for dumb bots and I'm guessing the big ones are resilient to this sort of thing.

Judging from recent stories, the big threat is bots scraping for AIs, and I wonder if there is a way to poison content so any AI ingesting it becomes dumber, e.g. text which is nonsensical or filled with counter-information, trap phrases that reveal any AIs that ingested it, garbage pictures that purport to show something they don't, etc.

[–] frezik@midwest.social 15 points 6 hours ago

When it comes to attacks on the Internet, doing simple things to get rid of the stupid bots means kicking 90% of attacks out. No, it won't work against a determined foe, but it does something useful.

Same goes for setting SSH to a random port. Logs are so much cleaner after doing that.
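For anyone wanting to try that, it's a one-line change in sshd_config; a sketch (the port number is arbitrary, and the sed is demonstrated on a scratch copy — point it at the real file as root):

```shell
# Make a scratch copy standing in for /etc/ssh/sshd_config
printf '#Port 22\n' > sshd_config.sample

# Uncomment/replace the Port directive with a non-standard high port
sed -i 's/^#\?Port .*/Port 42022/' sshd_config.sample

grep '^Port' sshd_config.sample   # Port 42022
```

After editing the real file you'd restart sshd (e.g. `systemctl restart sshd`) and update any firewall rules to match.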

[–] echodot@feddit.uk 3 points 5 hours ago* (last edited 5 hours ago)

I don't know about poisoning AI, but one thing I used to do was redirect any suspicious bots, or ones that were hitting the server too much, to a simple HTML page with no JS, CSS, or forward links. Then they would go away.

[–] mostlikelyaperson@lemmy.world 6 points 9 hours ago (1 children)

There have been some attempts in that regard, I don’t remember the names of the projects, but there were one or two that’d basically generate a crapton of nonsense to do just that. No idea how well that works.

[–] moopet@sh.itjust.works 16 points 10 hours ago

I'd be amazed if this works, since these sorts of tricks have been around since dinosaurs ruled the Earth. Most bots will use pretty modern zip libraries which will just return "nope" or throw an exception, which will be treated exactly the same way as any corrupt file - for example a site saying it's serving a zip file when the contents are actually a generic 404 HTML page, which is not uncommon.

Also, be careful because you could destroy your own device? What the hell? No. Unless you're using dd backwards and as root, you can't do anything bad, and even then it's the drive contents you overwrite, not the device you "destroy".

[–] dwt 43 points 14 hours ago (1 children)

Sadly about the only thing that reliably helps against malicious crawlers is Anubis

https://anubis.techaro.lol/

[–] alehel@lemmy.zip 12 points 13 hours ago (7 children)

That URL is telling me "Invalid response". Am I a bot?

[–] doorknob88@lemmy.world 38 points 12 hours ago

I’m sorry you had to find out this way.

[–] xavier666@lemm.ee 3 points 8 hours ago

Now you know why your mom spent so much time with the Amiga

[–] L_Acacia@lemmy.ml 1 points 9 hours ago

https://anubis.techaro.lol/docs/user/known-broken-extensions

If you have JShelter installed, it breaks the proof of work from anubis

[–] Bishma@discuss.tchncs.de 75 points 18 hours ago (1 children)

When I was serving high volume sites (that were targeted by scrapers) I had a collection of files in CDN that contained nothing but the word "no" over and over. Scrapers who barely hit our detection thresholds saw all their requests go to the 50M version. Super aggressive scrapers got the 10G version. And the scripts that just wouldn't stop got the 50G version.

It didn't move the needle on budget, but hopefully it cost them.
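Files like that are trivial to generate with standard tools; a sketch at a smaller scale (names and sizes arbitrary):

```shell
# 50 MB of "no" repeated back to back, newlines stripped
yes no | tr -d '\n' | head -c 50M > no-50M.txt

wc -c no-50M.txt   # confirm the size
```

Scale `head -c` up for the 10G and 50G tiers; served with gzip Content-Encoding, the on-the-wire cost stays tiny.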

[–] sugar_in_your_tea@sh.itjust.works 20 points 16 hours ago (1 children)

How do you tell scrapers from regular traffic?

[–] Bishma@discuss.tchncs.de 39 points 16 hours ago (4 children)

Most often because they don't download any of the CSS or external JS files from the pages they scrape. But there are a lot of other patterns you can detect once you have their traffic logs loaded in a time series database. I used an ELK stack back in the day.
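That no-assets heuristic is easy to rough out over a combined-format access log; a sketch with a made-up two-visitor log (IPs and paths are invented):

```shell
# Two visitors: 1.1.1.1 fetches the CSS, 2.2.2.2 never does
cat > access.log <<'EOF'
1.1.1.1 - - [29/Apr/2025:00:00:00 +0000] "GET /index.html HTTP/1.1" 200 512
1.1.1.1 - - [29/Apr/2025:00:00:01 +0000] "GET /style.css HTTP/1.1" 200 128
2.2.2.2 - - [29/Apr/2025:00:00:02 +0000] "GET /index.html HTTP/1.1" 200 512
EOF

# Print IPs that requested pages but never any .css/.js asset
# ($1 is the client IP, $7 the request path in combined log format)
awk '$7 ~ /\.(css|js)$/ {assets[$1]=1; next}
     $7 ~ /\.html$|\/$/ {pages[$1]=1}
     END {for (ip in pages) if (!(ip in assets)) print ip}' access.log
# prints 2.2.2.2
```

A real detector would add time windows and thresholds, but this is the core of the signal.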

[–] palordrolap@fedia.io 87 points 19 hours ago (3 children)

The article writer kind of complains that they're having to serve a 10MB file, which is the result of the gzip compression. If that's a problem, they could switch to bzip2. It's available pretty much everywhere that gzip is available and it packs the 10GB down to 7506 bytes.

That's not a typo. bzip2 is way better with highly redundant data.
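Swapping the compressor in the original one-liner is a one-word change; scaled down to 100 MB here so it finishes quickly (the full 10 GB works the same way):

```shell
# Same all-zero payload, bzip2 instead of gzip
dd if=/dev/zero bs=1M count=100 2>/dev/null | bzip2 -c > 100MB.bz2

ls -l 100MB.bz2   # dramatically smaller than the gzip equivalent
```

bzip2's run-length pre-pass is why all-zero input collapses so much further than with gzip's 32 KB window.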

[–] just_another_person@lemmy.world 77 points 19 hours ago* (last edited 13 hours ago) (1 children)

I believe he's returning a gzip HTTP response stream, not just a file payload that the requester then downloads and decompresses.

Bzip isn't used in HTTP compression.

[–] sugar_in_your_tea@sh.itjust.works 18 points 16 hours ago* (last edited 15 hours ago)

Brotli is an option, and it's comparable to Bzip. Brotli works in most browsers, so hopefully these bots would support it.

I just tested it, and a 10G file full of zeroes is only 8.3K compressed. That's pretty good, though a little bigger than BZip.

[–] sugar_in_your_tea@sh.itjust.works 19 points 15 hours ago (1 children)

Brotli gets it to 8.3K, and is supported in most browsers, so there's a chance scrapers also support it.

[–] Aceticon@lemmy.dbzer0.com 3 points 8 hours ago (1 children)

Gzip encoding has been part of the HTTP protocol for a long time and every server-side HTTP library out there supports it, and phishing/scraper bots will be built with server-side libraries, not browser engines.

~~Further, judging by the guy's example in his article he's not using gzip with maximum compression when generating the zip bomb files: he needs to add -9 to the gzip command line to get the best compression (but it will be slower).~~ (I tested this and it made no difference at all).

[–] sugar_in_your_tea@sh.itjust.works 2 points 6 hours ago* (last edited 6 hours ago) (1 children)

You can make multiple files with different encodings and select based on the Accept-Encoding header.

[–] Aceticon@lemmy.dbzer0.com 1 points 5 hours ago

Yeah, good point.

I forgot about that.

[–] aesthelete@lemmy.world 19 points 16 hours ago* (last edited 15 hours ago) (1 children)

This reminds me of shitty FTP sites with ratios when I was on dial-up. I used to push them files full of null characters with filenames that looked like actual content. The modem would compress the upload as it transmitted it which allowed me to upload the junk files at several times the rate of a normal file.

[–] MeThisGuy@feddit.nl 1 points 4 hours ago* (last edited 3 hours ago)

that is pretty darn clever

I use a torrent client that will lie about the upload (x10 or x11, or a myriad of other options) so as to satisfy the upload ratio requirements of many members-only torrent communities.

[–] deaddigger@lemm.ee 20 points 16 hours ago (3 children)

At least in Germany, having one of these on your system is illegal

[–] dzso@lemmy.world 13 points 13 hours ago (1 children)

Out of curiosity, what is illegal about it, exactly?

[–] deaddigger@lemm.ee 10 points 13 hours ago* (last edited 13 hours ago) (3 children)

I mean, I am not a lawyer.

In Germany we have § 303b StGB. In short it says that if you hinder someone else's data processing through physical means or malicious data, you can go to jail for up to 3 years. If it is a major process for someone, you can get up to 5, and in major cases up to 10 years.

So if you have a zip bomb on your system and a crawler reads and unpacks it, you committed two crimes: 1. You hindered that crawler's data processing. 2. Some ISP nodes look into it and can crash too. If the ISP is pissed off enough, you can go to jail for 5 years. This applies even if you didn't crash them due to them having protection against it, because trying is also against the law.

Having a zip bomb is part of a gray area. Because trying to disrupt data processing is illegal, having a zip bomb can be considered trying; however, I am not aware of any judgement in this regard.

Edit: btw, if you password protect your zip bomb, everything is fine

[–] raltoid@lemmy.world 1 points 1 hour ago* (last edited 1 hour ago)

TL;DR: It's illegal to have publicly available or to share.

Creating one for research purposes on your own hardware is not illegal as far as I know. And if it is, I wouldn't mind seeing someone challenge that in the EU.

[–] MimicJar@lemmy.world 10 points 10 hours ago

I wonder if having a robots.txt file that said to ignore the file/path would help.

I'm assuming a bad bot would ignore the robots.txt file. So you could argue that you put up a clear sign and they chose to ignore it.
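The robots.txt for that would only need a couple of lines (the decoy path here is hypothetical):

```shell
# Well-behaved crawlers will skip the decoy; anything that ignores
# this and fetches it anyway has disregarded a clear sign
cat > robots.txt <<'EOF'
User-agent: *
Disallow: /10GB.gz
EOF

cat robots.txt
```

Served from the site root, this gives you the "clear sign" argument: only clients that ignored the disallow rule ever receive the bomb.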

[–] barsoap@lemm.ee 19 points 12 hours ago* (last edited 12 hours ago)

Severely disrupting other people's data processing of significant import to them. Disruption by submitting malicious data requires intent to cause harm; physical destruction, deletion, etc. doesn't. This is about crashing people's payroll systems, DDoSing, etc. Not burning some CPU cycles and having a crawler subprocess crash with OOM.

Why the hell would an ISP have a look at this? And even if they did, they're professional enough to detect zip bombs. Which, btw, is why this whole thing is pointless anyway: if you class requests as malicious, just don't serve them. If that's not enough, it's much more sensible to go the Anubis route and demand proof of work, as that catches crawlers which come from a gazillion IPs with different user agents etc.

[–] lka1988@lemmy.dbzer0.com 9 points 14 hours ago (1 children)

Maybe bots shouldn't be trying to install malicious code? Sucks to suck.

[–] lennivelkant@discuss.tchncs.de 5 points 13 hours ago

Still illegal. Not immoral, but a lot of our laws aren't built on morality.

[–] lemmylommy@lemmy.world 62 points 19 hours ago (2 children)

Before I tell you how to create a zip bomb, I do have to warn you that you can potentially crash and destroy your own device.

LOL. Destroy your device, kill the cat, what else?

[–] archonet@lemy.lol 42 points 19 hours ago

destroy your device by... having to reboot it. the horror! The pain! The financial loss of downtime!

[–] Albbi@lemmy.ca 14 points 17 hours ago (3 children)

It'll email your grandmother all of your porn!

[–] cy_narrator@discuss.tchncs.de 27 points 18 hours ago (1 children)

First off, be very careful with bs=1G as it may overload the RAM. You will want to set count accordingly

[–] sugar_in_your_tea@sh.itjust.works 8 points 16 hours ago (1 children)

Yup, use something sensible like 10M or so.

[–] cy_narrator@discuss.tchncs.de 3 points 11 hours ago* (last edited 11 hours ago)

I would normally go much lower:

dd if=/dev/zero bs=4K count=262144 | gzip -c > 1GB.gz

which creates 1G with a 4K block size.

[–] tal@lemmy.today 28 points 18 hours ago (2 children)

Anyone who writes a spider that's going to inspect all the content out there is already going to have to have dealt with this, along with about a bazillion other kinds of oddball or bad data.

[–] catloaf@lemm.ee 21 points 16 hours ago (1 children)

Competent ones, yes. Most developers aren't competent, scraper writers even less so.

[–] idriss@lemm.ee 1 points 7 hours ago

That's true. Scraping is a gold mine for the people that don't know it. I worked for a place which crawls the internet and beyond (fetches some internal dumps we pay for). There is no chance a zip bomb would crash the workers, as there are strict timeouts and smell tests (and even if one gets through, it will crash an ECS task at worst and we will be alerted to fix it within a short time). We were as honest as it gets though: following GDPR, honoring the robots file, no spiders or scanners allowed, only the home page to extract some insights.

I am aware of some big name EU non-software companies very interested in keeping an eye on some key things that are only possible with scraping.

[–] lennivelkant@discuss.tchncs.de 6 points 13 hours ago

That's the usual case with arms races: unless you are yourself a major power, odds are you'll never be able to fully stand up to one (at least not on your own, but let's not stretch the metaphor too far). Often, the best you can do is deter other, minor powers and hope the major ones never have a serious intent to bring you down.

In this specific case, the number of potential minor "attackers" and the low hurdle to "attack" make it attractive to try to overwhelm the amateurs at least. You'll never get the pros; you just hope they don't bother you too much.

[–] mbirth@lemmy.ml 36 points 19 hours ago

And if you want some customisation, e.g. some repeating string over and over, you can use something like this:

yes "b0M" | tr -d '\n' | head -c 10G | gzip -c > 10GB.gz

yes repeats the given string (followed by a line feed) indefinitely - it was originally meant to type "yes" + ENTER into prompts. tr then removes the line breaks, and head makes sure to only take 10GB rather than run indefinitely.

If you want to be really fancy, you can even add an HTML header and footer, stored in files (e.g. header and footer), and then run it like this:

yes "b0M" | tr -d '\n' | head -c 10G | cat header - footer | gzip -c > 10GB.gz
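For completeness, the header and footer files are just plain HTML fragments, e.g. (contents arbitrary); a small-scale check that the result decompresses to something page-shaped:

```shell
# Minimal fragments so the payload looks like a real page
printf '<!DOCTYPE html><html><body><p>' > header
printf '</p></body></html>' > footer

# 1K test run of the same pipeline, then peek at the decompressed start
yes "b0M" | tr -d '\n' | head -c 1K | cat header - footer | gzip -c > test.gz
gzip -dc test.gz | head -c 15
# prints <!DOCTYPE html>
```

cat's `-` argument splices stdin (the b0M stream) between the two files before compression.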
[–] comador@lemmy.world 25 points 19 hours ago* (last edited 19 hours ago)

Funny part is, many of us crusty old sysadmins were using derivatives of this decades ago to test RAID-5/6 sequential read and write speeds.
