
It won't be long before generalized bots stop requesting URLs that aren't visually rendered as links on the page.
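For what it's worth, that check is already cheap to do. Here's a minimal sketch, assuming Playwright (my choice of tool, not anything the bots are known to use): render the page headlessly and keep only anchors that occupy actual space in the layout, which filters out the classic display:none honeypot links.

    # Sketch: keep only links that are visually rendered on the page.
    # Assumes Playwright: pip install playwright && playwright install chromium
    from playwright.sync_api import sync_playwright

    def visible_links(url: str) -> list[str]:
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()
            page.goto(url)
            # A zero-size bounding box means the anchor was never painted
            # (display:none, zero-height container, etc.). This is a rough
            # heuristic, not a complete visibility check.
            hrefs = page.eval_on_selector_all(
                "a[href]",
                """els => els
                      .filter(e => {
                          const r = e.getBoundingClientRect();
                          return r.width > 0 && r.height > 0;
                      })
                      .map(e => e.href)""",
            )
            browser.close()
            return hrefs

Off-screen positioning and visibility:hidden would slip past a pure bounding-box check, but the point stands: the filtering is the easy part.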

If bots get good enough to know what links they're scraping, chances are they'll also avoid scraping links they don't need to! The problem solves itself!

Maybe you're joking, but assuming you're not: this problem doesn't solve itself at all. If bots get good enough to know which links have garbage behind them, they'll stop scraping those links and go back to scraping your actual content, which is the thing we don't want.

That's sort of the point: almost nobody runs a site as large as Reddit. The average website has a small handful of pages. Even a very active blog has few enough pages that it could be fully scraped in a few minutes. Where scrapers get hung up is processing links that add things like query parameters, or navigating something like a git repository and clicking through every file in every commit. If a scraper has enough intelligence to look at what a link is, it _surely_ has enough intelligence to understand what it does and does not need to scrape.
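And the trap-avoidance half barely needs intelligence. A rough sketch of the kind of URL hygiene that sidesteps both failure modes above (the depth cap of 8 is an arbitrary number picked for illustration):

    # Sketch: refuse crawler-trap URLs before fetching them.
    from urllib.parse import urlparse, urlunparse

    MAX_PATH_DEPTH = 8  # arbitrary; file-per-commit trees blow past this fast

    def normalize(url: str) -> str:
        # Collapse ?sort=new&page=9371 style variants onto one canonical URL
        # by dropping the query string and fragment.
        u = urlparse(url)
        return urlunparse((u.scheme, u.netloc, u.path.rstrip("/"), "", "", ""))

    def should_fetch(url: str, seen: set[str]) -> bool:
        path_depth = len([seg for seg in urlparse(url).path.split("/") if seg])
        if path_depth > MAX_PATH_DEPTH:
            return False  # suspiciously deep, e.g. every file in every commit
        key = normalize(url)
        if key in seen:
            return False  # already fetched a query-parameter variant of this
        seen.add(key)
        return True

Dropping query strings wholesale would break sites that genuinely route through them, which is exactly where the "smart enough to read the link" part comes in.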
