Not to mention they have to store the data after they download it. In theory, storing garbage data is costly to them. I have a nagging feeling, though, that the attitude of these scraper operators is that they get paid the same per gigabyte whether it's nonsense or not.

If they even are AI crawlers. It could just as well be exploit scanners searching for endpoints they'd try to attack. That wouldn't require storing the content, only the links.
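
As a rough illustration, a scanner tends to give itself away by what it asks for: it probes well-known vulnerable paths regardless of what the site actually serves. A minimal sketch of that idea (the probe list and threshold here are illustrative assumptions, not a vetted signature set):

    # Paths that exploit scanners commonly probe; illustrative, not exhaustive.
    COMMON_PROBES = {"/.env", "/wp-login.php", "/xmlrpc.php",
                     "/.git/config", "/phpmyadmin/index.php"}

    def looks_like_exploit_scan(paths):
        """Flag a client whose requests are mostly known probe paths."""
        hits = sum(1 for p in paths if p in COMMON_PROBES)
        return hits / max(len(paths), 1) > 0.5

    # e.g. looks_like_exploit_scan(["/.env", "/wp-login.php", "/"]) -> True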

If you look at which pages are hit, and how many pages are hit by any one address in a given period of time, it's pretty easy to identify features that are reliable proxies for e.g. exploit scanners, trawlers, and agents (sketches below). I publish a feed of what's being hit on my servers; contact me for details (you need to be able to make DNS queries to a particular server, directed at a domain which is not reachable from ICANN's root).
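
For concreteness, a minimal sketch of the rate-based part of such a classification, assuming access-log records already parsed into (ip, timestamp, path) tuples; the window and threshold are illustrative guesses, not tuned values:

    from collections import defaultdict

    def flag_bulk_clients(records, window=300, threshold=100):
        """Flag addresses requesting many distinct pages in a short window.

        records: iterable of (client_ip, unix_ts, path) tuples (assumed
        log format). Humans rarely hit 100 distinct pages in 5 minutes;
        crawlers and scanners routinely do.
        """
        bins = defaultdict(set)  # (ip, time bucket) -> distinct paths seen
        for ip, ts, path in records:
            bins[(ip, ts // window)].add(path)
        return {ip for (ip, _), paths in bins.items()
                if len(paths) >= threshold}

And fetching a DNS-published feed would look roughly like this with dnspython; the server address and record name below are hypothetical placeholders, since the real ones are shared on request:

    import dns.resolver

    # Point the resolver at the feed server directly, bypassing the
    # system resolver (the domain isn't reachable from ICANN's root).
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = ["192.0.2.53"]  # hypothetical feed server
    # "feed.hits.internal" is a placeholder for the off-root domain.
    for record in resolver.resolve("feed.hits.internal", "TXT"):
        print(record.to_text())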


