
It's not scraping, you're just consuming an API.

Scraping is when you're parsing human-readable content (HTML) and extracting data, as the parent comment correctly points out.



I don't think that definition is universally agreed upon. I have had many conversations over the years where the term "scraping" referred to activities that didn't involve things like parsing HTML.


Scraping specifically refers to extracting information from source documents/data. Merely downloading them is just retrieval or, when following links, crawling.

“Git scraping” would intuitively refer to extracting specific information from Git repositories. The naming in the article is therefore confusing. “Snapshotting into Git” would be more accurate. (Git itself uses the term “snapshot” for a reason.)
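
For concreteness, the pattern under discussion is roughly this (a minimal sketch in TypeScript; the endpoint URL and file name are hypothetical, and it assumes Node 18+ with ESM, run from inside a Git repo):

    // snapshot.ts - fetch an endpoint and commit the result into Git
    import { writeFileSync } from "node:fs";
    import { execSync } from "node:child_process";

    const res = await fetch("https://example.com/api/incidents.json"); // hypothetical
    const data = await res.json();

    // Pretty-print so line-based Git diffs stay readable.
    writeFileSync("incidents.json", JSON.stringify(data, null, 2));

    execSync("git add incidents.json");
    try {
      execSync('git commit -m "Latest snapshot"');
    } catch {
      // Nothing changed since the last run: git commit exits non-zero.
    }

Run on a schedule, that yields a commit history of how the data changed over time, which is the whole point of the technique, whatever you call it.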


Agree to disagree. To me, scraping implies a level of fragility that hitting an endpoint that returns JSON does not have.


Undocumented endpoints that return JSON are pretty fragile!

One of the benefits of capturing them in a Git repo is that it helps you spot when their structure changes in ways that may break code that you write on top of them.
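
One rough way to automate that spotting: compare the key paths of the last committed snapshot against a fresh fetch. A sketch (file name and endpoint hypothetical; assumes Node 18+ with ESM, run inside the repo; arrays are naively flattened by index):

    import { execSync } from "node:child_process";

    // Collect dotted key paths like "incidents.0.title" from a JSON value.
    function keyPaths(value: unknown, prefix = ""): string[] {
      if (value === null || typeof value !== "object") return [prefix];
      return Object.entries(value as Record<string, unknown>).flatMap(
        ([k, v]) => keyPaths(v, prefix ? `${prefix}.${k}` : k),
      );
    }

    const previous = JSON.parse(
      execSync("git show HEAD:incidents.json", { encoding: "utf8" }),
    );
    const current = await (
      await fetch("https://example.com/api/incidents.json")
    ).json();

    const after = new Set(keyPaths(current));
    const missing = keyPaths(previous).filter((p) => !after.has(p));
    if (missing.length > 0) {
      console.warn("Fields that disappeared:", missing);
    }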


Sure, they are prone to being changed out from under you, but I think we can agree they're not fragile in the same way that parsing HTML for the 3rd div with the id w9j8f (thanks, React!) and the 2nd a href tag under it is. It's very clear when the endpoint or the returned JSON changes, but assuming it's still JSON it should still be fairly readable, and if the data's still in the JSON blob, finding it is quick work. Whereas if the HTML changes, you're in for a slog.
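
The difference in failure mode is easy to show in a toy sketch (endpoint, field names, and selector all hypothetical):

    const data = await (
      await fetch("https://example.com/api/products.json")
    ).json();

    // JSON: optional chaining plus one explicit check makes the breakage
    // obvious, and finding where the field moved to is a quick read.
    const price = data?.product?.price;
    if (price === undefined) {
      throw new Error("price moved - go re-read the JSON blob");
    }

    // The HTML equivalent is something like:
    //   document.querySelector("div#w9j8f > a:nth-of-type(2)")
    // which silently breaks the moment the framework regenerates its
    // id/class soup, and relocating the data means re-reading the DOM.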


From the Wikipedia entry for "data scraping":

> the key element that distinguishes data scraping from regular parsing is that the output being scraped is intended for display to an end-user, rather than as an input to another program

Snapshotting JSON files can be incredibly useful, but I don't think you should call it "scraping".


I think that article actually further supports my position here: https://en.wikipedia.org/wiki/Data_scraping

It has sections covering things called "data scraping" and "web scraping" and "screen scraping" and "report mining", and links to articles about "data mining" and "search engine scraping" as well.

To me, that indicates that the terminology around this stuff is already extremely vague and poorly defined... and the suffix "scraping" is up for grabs for anyone who wants to further define it!

(If you don't like me calling this technique "Git scraping" you're going to /really/ hate the name I picked for my shot-scraper tool https://shot-scraper.datasette.io )


Scraping also has negative associations in some contexts. In a non-profit project I'm involved with (which, coincidentally, started as a remix of some of Simon's code for one of these "Git scraping" projects plus Datasette), I recently made the decision to refer to it strictly as what it is: a crawler.

I'm less warm at this point to the general idea behind the hack of dumping the resulting JSON crawl data to GitHub. It's a very roundabout way of approaching basically what something like TerminusDB was made for. It definitely feels like the main motivation was GitHub-specific stuff and not Git, really—namely, free jobs with GitHub Actions—and everything else flowed from that. It turns out that GitHub Actions proved to be too unreliable for executing on schedule, anyway, so we ported the crawler to JS with an eye towards using Cloudflare Workers and their cron triggers (which also come in a free flavor).
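
For reference, the scheduling half of the Workers version looks roughly like this (a sketch using the module-worker scheduled handler and types from @cloudflare/workers-types; the crawl target and the SNAPSHOTS KV binding are hypothetical, and the cron expression itself goes in wrangler.toml under [triggers] crons):

    interface Env {
      SNAPSHOTS: KVNamespace; // hypothetical KV binding for storing crawl output
    }

    export default {
      async scheduled(
        controller: ScheduledController,
        env: Env,
        ctx: ExecutionContext,
      ) {
        // Keep the Worker alive until the crawl finishes.
        ctx.waitUntil(crawl(env));
      },
    };

    async function crawl(env: Env): Promise<void> {
      const res = await fetch("https://example.com/api/status.json"); // hypothetical
      await env.SNAPSHOTS.put(new Date().toISOString(), await res.text());
    }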


My first implementation of this pattern predated GitHub Actions and used CircleCI - though GitHub Actions made this massively more convenient to build.


Exactly - "scraping" is the final resort when sites don't make data available via an API. It's almost exactly synonymous with "parsing HTML".



