This is nice. It reminds me of how miserable my life is.
— Which HTTP code should I return from my API? I already used 404 and 403, but I need another one. Damn, HTTP is so old and it makes no sense.
— You can't use HTTP codes like that Bob, they're not a free choice. They're for the protocol, not for your app.
— Let's look at the list. Hm... "412 Precondition Failed". Hey, it sounds nice. It fits my use case. I'm gonna document it. It means the account is out of balance.
— What is this garbage? Please read the spec. This is going to make our API gateways, CDNs, everything go crazy. Can't let you move on with this PR.
— Look. I documented it, made an enum with the code, it's clean. I'm an experienced REST developer.
— It... it doesn't work like that Bob. Please, read the spec.
— Hey, got enough approvals, "412 Account Out Of Balance" it is! It passes the tests.
For every dev who knows proper HTTP, there are 10,000 Bobs.
To address the shortcomings of conveying errors strictly through HTTP status codes, consider:
RFC-7807, Problem Details for HTTP APIs[0]
From the introduction:
HTTP [RFC7230] status codes are sometimes not sufficient to convey enough information about an error to be helpful. While humans behind Web browsers can be informed about the nature of the problem with an HTML [W3C.REC-html5-20141028] response body, non-human consumers of so-called "HTTP APIs" are usually not.

This specification defines simple JSON [RFC7159] and XML [W3C.REC-xml-20081126] document formats to suit this purpose. They are designed to be reused by HTTP APIs, which can identify distinct "problem types" specific to their needs.
My coworkers insisted on always returning 200 and having the status code in JSON.
At least at that point it’s clearly not HTTP anymore, and it’s better than pretending to be compliant like your Bob. But something dies inside me whenever I have to work with it.
I'd say returning 200 for all successes is reasonable if the responses are simple.
Returning 200 for an error makes no sense. Having 400s and 500s is the simplest way to have observability over protocol behavior (think logs, error rates, etc.). If you use all 200s, you'd have to re-implement observability yourself, so you lose the simplicity you gained by ignoring those statuses.
It's the same thing with caching stuff. You could implement those outside the protocol, but then you'd be writing your own protocol (trying to be smarter than decades of engineering efforts).
That’s a strong assertion. There’s plenty of status codes that indicate something bad happened at the HTTP level but don’t convey information about the RPC.
I've taken a very operational view of HTTP errors, which is "What do I want things receiving this error to do?" Unfortunately, that's not a clean question, since there's no list you can simply consult to get all behaviors that all HTTP error messages cause. The most important of these is, if this is being accessed by a browser, what will the error code make it do?
Fortunately, for a lot of my API-type work, I also get to not care. I don't want some smart cache thinking it knows how to cache my responses, and I don't care about the sort of infrastructure that thinks it understands HTTP doing anything with my requests.
200 {"error": "..."} is not necessarily invalid from this point of view, either. 200, the request was successfully processed and the successful result of that request as far as HTTP is concerned is an error. There doesn't seem a great need to tell HTTP there's an error, HTTP doesn't really care. Telling the browser there's an error has some marginal utility, but if it's an API and there's no browser involved, that doesn't matter much either. The 200 isn't going to fool it into thinking it should put the error into the history or whatever.
I've also learned to avoid getting too fancy with the codes. You will invoke some weird behaviors from systems you didn't even know cared about your connection. 200 {"error": "..."} may seem "wrong", but it is also generally safe. It will do what you expect.
It might be nice to live in a world where there are HTTP error codes that are suitable for everything I need, instead of a big pile of useless codes for abortive standards that never came to be and things nobody uses, and an underspecified set of codes for the things I actually want and use, but there's no point pretending that the standard is something other than it is, and as it stands now, a lot of times the HTTP result code is almost useless.
Varnish, HAProxy, Apache mod_proxy, nginx all can do similar things. Some of them can do this even if you always return 200 (by having rewrite rules and so on). It is often better to leave this kind of work to some upper abstraction layer. Some of these codes are only applicable in a layered system (502, for example, often seen when nginx can't reach a backend application), so they seem useless to developers, but they're not.
For APIs, other stuff uses those codes. Tools like DataDog and NewRelic work better if you use a generic 400 and a generic 500 for client and server errors respectively. You can make them work with all 200s and a little configuration, but it's extra work.
If you never needed any of this, it's better not to use it.
I should indeed have clarified that the browser web has a much richer set of headers and response codes in use, and they are truly useful, and anyone serving web pages at scale should indeed learn about them. IIRC it's still about 1/3rd to 1/4th of the nominally defined HTTP response codes that are useful, but it's still something.
For the non-browser web, they approach useless. Which I'm not happy about, and I'm not celebrating or advocating for it. It's just how it is.
400 "you screwed up" vs 500 "I screwed up" is always better than 200 "OK, not really".
You can get more specific, and for REST style APIs the correct specific HTTP status code is usually apparent for both successful and unsuccessful requests, but 2xx/4xx/5xx is simple and should be trivial to determine for anything you are using HTTP for even if it's not REST-like.
However, while your mileage may vary, I end up getting the same complaint from the users either way. Even when my 400 contains an exact reason why the input is incorrect.
Granted, on the one hand, this can be fixed on the individual level, but on the other hand, it's the same effect writ small that when writ large makes the response codes nearly useless, so this post is maybe half cathartic grousing. I can't push caring about response codes. I can document it, I can yield detailed errors, and I can be as careful as I like, but this is a "it takes two to tango" situation and at scale, on average, the other end doesn't want to tango.
I think that's one of the key differences between REST and other kinds of RPC architectures.
I've used SOAP and JSON-RPC, both of which (at least in many implementations) send RPCs as HTTP POST requests and receive 200 responses with any error messages in the body. They're just tunneling over HTTP. It's not necessarily wrong, although I'm convinced that leveraging the HTTP verbs and error codes with REST is a fundamentally better design for the use cases I've seen.
Status 200 with "InsufficientFunds" can be correct.
Let's assume your resource is "/account/1234/withdraw-availability"
It's a hypothetical endpoint you can GET to know if you can withdraw money. You hit it, and the request is successful (the server understood and will inform you whether a withdrawal is available or not).
Let's assume your resource is "/account/1234/withdraw"
This is another hypothetical endpoint, to which you POST a request for a money withdrawal. Returning 200 here means the server understood and processed your request, so a 200 that does not withdraw makes no sense.
The same endpoint could also return a success "202 Accepted" (the server understood the request, but has not processed it yet). In the body, there would be a link to "/withdrawals/48957987593845983475/status", a resource specific to this future processing, which you can GET later (maybe 1ms later, within the same socket). This GET could also return a 200 saying that such a withdrawal was not possible (the server successfully understood the request and will inform you about the status of the withdrawal).
For this modeling stuff, the Roy Fielding dissertation about REST is more enlightening than the spec. The spec is still needed though.
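As a rough sketch of those two hypothetical endpoints (Go here; the account number, amount, withdrawal ID and response fields are all made up, and a real handler would validate authentication, the amount, and so on):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// Hypothetical in-memory balance, just to make the sketch runnable.
var balance = 50.0

func main() {
	// GET /account/1234/withdraw-availability
	// Always 200: the request succeeded, and the body says whether a withdrawal would work.
	http.HandleFunc("/account/1234/withdraw-availability", func(w http.ResponseWriter, r *http.Request) {
		amount := 100.0 // imagine this came from a query parameter
		json.NewEncoder(w).Encode(map[string]any{
			"available": balance >= amount,
		})
	})

	// POST /account/1234/withdraw
	// 202 Accepted: the request was understood and queued; the body links to a
	// status resource the client can poll.
	http.HandleFunc("/account/1234/withdraw", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			w.WriteHeader(http.StatusMethodNotAllowed)
			return
		}
		id := "48957987593845983475" // would normally be generated
		statusPath := fmt.Sprintf("/withdrawals/%s/status", id)
		w.Header().Set("Location", statusPath)
		w.WriteHeader(http.StatusAccepted)
		json.NewEncoder(w).Encode(map[string]string{"status": statusPath})
	})

	http.ListenAndServe(":8080", nil)
}
```

The availability check always answers 200 with the answer in the body; the withdrawal request is acknowledged with 202 and a status link the client can poll.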
One more example, if your request contains a batch of operations, you generally have to return 200, or maybe 204, if it was successfully received and should not be retried in full. In the response body, you might give other response codes for specific errors or failures to retry in a new request. So it can easily make sense to return 200 when there are e.g. partial failures and partial success and the request was properly formatted, authorized and acted upon.
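A minimal sketch of that pattern, assuming a made-up /batch endpoint and made-up field names:

```go
// Field names and the /batch path are made up for illustration.
package main

import (
	"encoding/json"
	"net/http"
)

type itemResult struct {
	ID     string `json:"id"`
	Status int    `json:"status"`          // per-item outcome, app-level
	Error  string `json:"error,omitempty"` // only present for failed items
}

// process stands in for whatever each batch item actually does.
func process(id string) error { return nil }

func batchHandler(w http.ResponseWriter, r *http.Request) {
	var items []struct {
		ID string `json:"id"`
	}
	if err := json.NewDecoder(r.Body).Decode(&items); err != nil {
		// The request itself is malformed: a plain 400 is appropriate.
		http.Error(w, "malformed batch", http.StatusBadRequest)
		return
	}

	results := make([]itemResult, 0, len(items))
	for _, it := range items {
		if err := process(it.ID); err != nil {
			results = append(results, itemResult{ID: it.ID, Status: 422, Error: err.Error()})
		} else {
			results = append(results, itemResult{ID: it.ID, Status: 200})
		}
	}

	// 200: the batch was received and acted upon, even if some items failed.
	// The client inspects the body to decide which items to retry.
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(results)
}

func main() {
	http.HandleFunc("/batch", batchHandler)
	http.ListenAndServe(":8080", nil)
}
```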
The problem is that there are multiple status codes which should be used (400, 422, 500, etc.), but others are massive footguns.
It would have been better to have a "technical" status code and an "application" status code, but hey, that would have required better engineering.
- The first line of an HTTP request has its own format (space-separated-ish), and mixes method and URL path.
- The URL path in that first line has its own format and weird escaping, and mixes one path with zero-or-more key value pairs.
- The headers have their own format.
- The body has its own format.
Most (all?) HTTP libraries for clients and servers abstract all that mess away into a neat object that could be easily represented in JSON (or bencode if you want something simple-ish while supporting binary data), but it's like using a nice program while knowing it's written in C[1].
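For example, here's a small sketch (in Go) that parses a hand-written raw request and dumps the same information as a single JSON object; the request contents are obviously made up:

```go
// Sketch: the request line, URL/query string and headers each have their own
// wire syntax, but the library hands you back one plain object.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"strings"
)

func main() {
	raw := "GET /search?q=etag&page=2 HTTP/1.1\r\n" +
		"Host: example.com\r\n" +
		"Accept: application/json\r\n" +
		"\r\n"

	// Parse the several wire formats (request line, URL, headers) in one call.
	req, err := http.ReadRequest(bufio.NewReader(strings.NewReader(raw)))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		return
	}

	// ...and re-express the same information as a single nested structure.
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	enc.Encode(map[string]any{
		"method":  req.Method,
		"path":    req.URL.Path,
		"query":   req.URL.Query(),
		"headers": req.Header,
	})
}
```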
Of course, JSON has its own problems, but in that case at least we only need to deal with the problems of one format with proper nesting support, not 4 different weird formats masquerading as one.
[1]: Disclaimer, I don't like Rust, so don't take this as a RIIR thing.
EDIT: To be clear, I don't agree with how Bob misuses HTTP in your example. I just find it sad that we're locked into this weirdly complex protocol.
Many of these choices are there for backwards compatibility and reuse. I can totally understand why.
An HTTP body has no predefined format. It can be anything. It can be a stream (HTTP into WebSocket upgrade, for example). It is the media type that defines how the body should be interpreted.
HTTP requests are meant to be used before they are fully transmitted, and are formatted in a way that leverages socket communication. JSON, on the other hand, needs the whole document to be read before it can be safely interpreted.
HTTP has more moving parts, but it also does so much more. These two aren't even comparable, they're not in the same layer.
I understand the urge to "improve" on all of this "legacy", however, one must consider how much was built upon these standards and if there's anything real to gain by changing them.
I would love if we could improve on HTTP - even just clearly separating protocol from application would be so great. Maybe dropping some cruft, like the accept headers and so on.
But yes, doing that is total folly.
The separation is your choice. It is not enforced, as this would require limitations that are not worth having.
You can totally drop accept headers if you write your own client and server implementation. HTTP works fine without them. The web as a living organism, not so much.
But hey, we don't need content type negotiation, right? XML will reign forever, mp3 is the ultimate audio format. It's not like new codecs and document types appear all the time and some kind of underlying architecture has to reserve space for that kind of change.
As I said, if you're writing your own client and server, you don't need Accept. You probably use just one homogenized content type.
Using `?format=json` is not offensive. It won't mess up some cache layer like improper status code semantics, so I don't really care that much about these if I see it. I wouldn't block a PR on that.
The overall web, on the other hand, is supposed to be made of many different client and server implementations. Your browser still relies on Accept headers for displaying images, detecting language, uncompressing gzipped responses, resuming paused downloads, showing that JSON API in a nice UI when you open it in a dedicated tab, and so many other things.
To me, it makes sense to leverage the same content negotiation ideas for home grown stuff, even if only one content type is being used.
It starts being a problem if you're working on microservices and one of them uses `.json` while another uses `?format=json`, made by different teams. The standard is the obvious solution. Instead, they'll either create inefficient clients full of complexities or fight until one of the workarounds prevails. So much easier to follow the standard.
Ah yeah no disagreements there. Better something that already works and is widely used with good ecosystem around it, than risk an xkcd 927 situation.
I understand that the people at the time (presumably) did it the best they could with the information and knowledge they had at the moment, while keeping backwards compatibility.
I only know a bit about HTTP/0.9 and can see how we moved from that to what we have today. I just find the current situation sad.
Like when something only supports ASCII, or only supports IPv4, or assumes I only have 1 CPU thread. Or like when some binary file is encoded in base64 before being sent over the network, only to be decoded on the receiving end. Stuff like that.
But I only feel that way thanks to having the huge benefit of hindsight and modern technology.
If ASCII is obsolete compared to Unicode, IPv4 is obsolete compared to IPv6, and single threadness is obsolete compared to multi-threading (I don't think that is a valid comparison, but let's go with it), then which standard makes HTTP obsolete?
HTTP/2 and HTTP/3 don't look like that anymore. The HEADERS frame is just a key-value map where the reserved keys :method, :scheme, :path, etc. are used for the previously top-level message elements.
Have to admit I've never used this code, and didn't know what it was about. Quickly read up about it. So ETag is a hash of the resource. You must provide it with requests that modify the resource. If your hash doesn't match the server hash, then 412 Precondition Failed is returned?
You can provide all sorts of conditions using HTTP headers such as "If-Match", "If-None-Match", "If-Modified-Since", "If-Range", etc. The server can choose an HTTP code to indicate some sort of cache invalidation signal.
304 means "you're good, your cached version satisfies the conditions"
412 means "you're not good, your cached version does not satisfy the conditions"
412 usually applies to modifications, but it could be for reading too, in the case of ranged requests (getting a specific range of bytes from a large representation). See the "Range" header.
These are all interconnected: the headers, the codes, etc. They are very useful for caching and can save a lot of bandwidth. Browsers and CDNs use them extensively. Server-to-server communication could use them as well, but I haven't seen popular implementations (let's say, a web framework that provides abstraction over these mechanisms).
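To make the interplay concrete, here is a minimal sketch of a handler (Go, with a made-up /doc resource and version string) that answers If-None-Match with 304 on reads and If-Match with 412 on writes. Real implementations also handle lists of ETags, weak comparison and "*", which this skips:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
)

var (
	mu   sync.Mutex
	body = "hello"
	etag = `"v1"` // ETags are quoted strings on the wire
)

func handler(w http.ResponseWriter, r *http.Request) {
	mu.Lock()
	defer mu.Unlock()

	switch r.Method {
	case http.MethodGet:
		w.Header().Set("ETag", etag)
		// Simplified: compares a single value, not a list, and ignores weak compare.
		if r.Header.Get("If-None-Match") == etag {
			w.WriteHeader(http.StatusNotModified) // 304: your cached copy is still good
			return
		}
		fmt.Fprint(w, body)

	case http.MethodPut:
		if im := r.Header.Get("If-Match"); im != "" && im != etag {
			w.WriteHeader(http.StatusPreconditionFailed) // 412: you're editing a stale copy
			return
		}
		// ...update body, bump etag...
		w.WriteHeader(http.StatusNoContent)

	default:
		w.WriteHeader(http.StatusMethodNotAllowed)
	}
}

func main() {
	http.HandleFunc("/doc", handler)
	http.ListenAndServe(":8080", nil)
}
```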
Also, other popular HTTP codes have cache implications.
For example, 404 implies "No indication is given of whether the condition is temporary or permanent". Cache invalidation headers don't apply to this code because a 404 means there's something about the _resource_ and not only the _representation_ that could not be found. That client cannot cache the 404 result, not even for a fraction of a second.
404's brother 410 implies "This condition is expected to be considered permanent.". A client that gets a 410 can cache that result, never needing to reach the server again. It means it's gone forever. That client can decide to never look up that URI again.
Very often, "400 Bad Request" is the best HTTP code you can use if you are not sure what to use. Then, describe what the error means using other HTTP components and/or the response body.
HTTP can be very simple. GET (ask for a representation) and POST (send a representation) as methods only. 200 (success), 400 (client error) and 500 (server error) as response codes only. It's the best way to start, then move to more elaborate protocol features as you learn.
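As a sketch of that minimal vocabulary (Go, with a made-up /echo path and a stand-in save function):

```go
package main

import (
	"encoding/json"
	"net/http"
)

// save stands in for whatever the application actually does with the data.
func save(map[string]any) error { return nil }

func echo(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodGet:
		// 200: here is a representation.
		w.Write([]byte(`{"hello":"world"}`))

	case http.MethodPost:
		var payload map[string]any
		if err := json.NewDecoder(r.Body).Decode(&payload); err != nil {
			http.Error(w, "bad request body", http.StatusBadRequest) // 400: you screwed up
			return
		}
		if err := save(payload); err != nil {
			http.Error(w, "internal error", http.StatusInternalServerError) // 500: I screwed up
			return
		}
		w.WriteHeader(http.StatusOK)

	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

func main() {
	http.HandleFunc("/echo", echo)
	http.ListenAndServe(":8080", nil)
}
```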
Bob has since moved on to crypto, leaving the cache invalidation eternally crippled. New Bob decided that everything is useless and wants to rewrite the whole backend using a faster language.
The ETag can be _anything_. I have an API that serves "files" from a backend storage system. Whenever files are written, a revision number is incremented. This is perfect for a weak validator, and so my ETags are also blissfully short and semantically useful, typically:
ETag: W/"750"
This also means the API can just check the revision number and avoid pulling out and decompressing some of the larger payloads that are stored there, and the implementation is absolutely minimal. It's a great standard.
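A stripped-down sketch of that approach (Go, with a made-up in-memory revision table standing in for the real storage backend):

```go
package main

import (
	"fmt"
	"net/http"
)

// revisions stands in for the backend's per-file revision counter.
var revisions = map[string]int64{"/files/report.csv": 750}

func serveFile(w http.ResponseWriter, r *http.Request) {
	rev, ok := revisions[r.URL.Path]
	if !ok {
		http.NotFound(w, r)
		return
	}

	etag := fmt.Sprintf(`W/"%d"`, rev) // e.g. W/"750"
	w.Header().Set("ETag", etag)

	if r.Header.Get("If-None-Match") == etag {
		// Cheap path: no need to pull or decompress the payload at all.
		w.WriteHeader(http.StatusNotModified)
		return
	}

	// Expensive path: actually fetch the file contents from storage.
	fmt.Fprintf(w, "contents of %s at revision %d\n", r.URL.Path, rev)
}

func main() {
	http.HandleFunc("/files/", serveFile)
	http.ListenAndServe(":8080", nil)
}
```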
Can't it be a strong validator if the files don't change at all between revisions? Or does the revision number only increment on "significant" revisions?
The data does not, but certain metadata elements might. It probably could be a strong validator anyway for its use cases, but I made the decision in a hurry.
An approach like https://github.com/benbjohnson/hashfs allows file names to be content-hashed at runtime. This removes the need for the extra "304 Not Modified" API calls from the client. Content-hash-based file renaming is usually done with a build step that renames the files. For applications where static file serving and HTTP request processing are done in the same application, the renaming can be done in memory without a build step.
I am using that approach in my project https://github.com/claceio/clace. It removes the need for a build step while making aggressive static file caching possible.
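For illustration, a rough sketch of the in-memory variant (Go; the naming scheme and paths are made up, and this is not how hashfs or clace actually implement it):

```go
// Hash each static file once at startup, expose it under a name that embeds
// the hash, and let clients cache it forever.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"net/http"
	"os"
	"path/filepath"
)

// hashed maps "app.<hash>.js" -> original file path on disk.
// Templates would link to /static/<hashed name>; a lookup helper from logical
// name to hashed name is omitted here.
var hashed = map[string]string{}

func indexStatic(dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if e.IsDir() {
			continue
		}
		path := filepath.Join(dir, e.Name())
		data, err := os.ReadFile(path)
		if err != nil {
			return err
		}
		sum := sha256.Sum256(data)
		ext := filepath.Ext(e.Name())
		base := e.Name()[:len(e.Name())-len(ext)]
		name := fmt.Sprintf("%s.%s%s", base, hex.EncodeToString(sum[:8]), ext)
		hashed[name] = path
	}
	return nil
}

func serveStatic(w http.ResponseWriter, r *http.Request) {
	path, ok := hashed[filepath.Base(r.URL.Path)]
	if !ok {
		http.NotFound(w, r)
		return
	}
	// The name embeds the content hash, so the response can never go stale.
	w.Header().Set("Cache-Control", "public, max-age=31536000, immutable")
	http.ServeFile(w, r, path)
}

func main() {
	if err := indexStatic("./static"); err != nil {
		panic(err)
	}
	http.HandleFunc("/static/", serveStatic)
	http.ListenAndServe(":8080", nil)
}
```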
I use content hashes for some of the images in parts of my site. And I use the IPFS scheme for it, and have the path be under /ipfs/ or some such.
And so you could find the same file on IPFS if anyone served it there, as the content hash in the url tells you what to look for.
Even though on my side I’ve done this all completely manually, so much so that it’s literally just me calculating the IPFS hash on my machine one time and then having symlinks with those content hashes so that /ipfs/ directory on my sites contains content that is served by their IPFS content hash, even though my server does not run an IPFS node or anything.
A very interesting side effect of this is that one time I loaded one of my pages, the web browser actually picked up on the pattern and offered to load those files over actual IPFS!
I was wondering how you can trust an IPFS gateway. Does the browser verify the file is legit using some checksum? Maybe subresource integrity supports IPFS content hashing or something? How does it generate the CID anyway?
How would you use SRI here to verify the CID (and not an additional out-of-band hash) to make sure the gateway isn’t returning some crap to, say, inject malicious JS?
Brave wrote in 2021 that they had plans to verify the CID. Not sure if they have added that or not yet.
But either way, I run Brave with a local IPFS gateway on my laptop. So I think when it said that it could retrieve those files via IPFS for me, it meant in my case that those files would be retrieved from actual IPFS and not via a public IPFS gateway hosted on the web
I've also been dissatisfied with HTTP caching not utilizing content hashes enough. If you're using server-side templating, one issue is that it's not efficient to calculate the hash while you're running the template; it would need to be precalculated to be efficient enough to use.
So I wrote https://github.com/infogulch/xtemplate to scan all assets at startup to precalculate the hash for templates that use it, and if a request comes in with a query parameter ?hash=sha384-xyz and it matches, then it gives it a 1-year immutable Cache-Control header automatically. If a file x.ext has a matching x.ext.gz/x.ext.zst/x.ext.br file then (after hashing the content to make sure it matches) client requests that support it are sent a compressed version streamed directly from disk with sendfile(2). I call this "Optimal asset serving" (a bit bold perhaps).
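For illustration, the ?hash= part could be sketched roughly like this (Go; the asset table and paths are made up, and this is not xtemplate's actual code, which also handles the precompressed sidecar files):

```go
package main

import (
	"crypto/sha512"
	"encoding/base64"
	"net/http"
	"os"
)

// assetHashes maps URL path -> "sha384-<base64 digest>", precomputed at startup.
var assetHashes = map[string]string{}

func precompute(path, file string) error {
	data, err := os.ReadFile(file)
	if err != nil {
		return err
	}
	sum := sha512.Sum384(data)
	assetHashes[path] = "sha384-" + base64.StdEncoding.EncodeToString(sum[:])
	return nil
}

func serveAsset(w http.ResponseWriter, r *http.Request) {
	want := assetHashes[r.URL.Path]
	if want != "" && r.URL.Query().Get("hash") == want {
		// The URL pins the exact content, so the response can never go stale.
		w.Header().Set("Cache-Control", "public, max-age=31536000, immutable")
	}
	http.ServeFile(w, r, "."+r.URL.Path)
}

func main() {
	precompute("/static/app.js", "./static/app.js")
	http.HandleFunc("/static/", serveAsset)
	http.ListenAndServe(":8080", nil)
}
```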
How is the sample `calculateETag()` function generating a weak ETag? It looks like it will generate a different hash due to any JSON formatting changes.
It seems like generating a weak ETag would take more effort, since you'd need to either ensure consistent ordering and formatting of the JSON, or generate the ETag on the content before converting it to a JSON string.
That's because it isn't really generating a weak ETag. From the article:
> You could make the `calculateETag` function format-agnostic, so the hash stays the same if the JSON format changes but the content does not. The current `calculateETag` implementation is susceptible to format changes, and I kept it that way to keep the code shorter.
They seem to agree, a true weak ETag implementation would probably be trickier and require more code :P I'd be fascinated to see how that might work in practice, though.
I could probably write one pretty quick, under the assumption that we are storing only JS-compatible JSON with no encoding hiccups (JSON sadly isn’t as standard as it appears at first glance…) just hash(JSON.stringify(JSON.parse(fileText))) and you’re done. This assumes the same parse and serialize methods are expected to be used at both ends, that they only normalize formatting and that you don’t have to worry about number representation doing weird things. I wouldn’t actually sort keys as sorting is technically a change in behavior and good browsers today do not re-order object keys for you, though your code can do that, of course. I considered skipping the second JSON serialize, but it makes a buffer out of an object so it’s easy enough to use. One could imagine a more efficient approach would modify the hashing to occur against buffer chunks of the JSON, but intentionally skip the whitespace. This avoids unintentional data serialization but obviously the parsing routine would have to match the recipient exactly to work correctly. And it still assumes you’re receiving oddly formatted but valid JSON, which doesn’t sound like a safe assumption to make. If your JSON varies in format I wouldn’t ever want to assume I’d be able to parse it correctly. I mean, what if a return character slips in by mistake amongst all the newlines?
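A rough Go equivalent of that parse-then-reserialize idea, with one caveat: unlike JSON.stringify on a parsed object, Go's json.Marshal sorts map keys, so key order gets normalized too (names below are made up):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// formatAgnosticETag hashes the parsed value rather than the raw bytes, so
// whitespace and key-order changes don't change the ETag. Like the JS sketch,
// it trusts the default number handling.
func formatAgnosticETag(raw []byte) (string, error) {
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return "", err
	}
	normalized, err := json.Marshal(v) // no whitespace, sorted map keys
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(normalized)
	return `W/"` + hex.EncodeToString(sum[:]) + `"`, nil
}

func main() {
	a := []byte(`{"b": 1, "a": 2}`)
	b := []byte("{\n  \"a\": 2,\n  \"b\": 1\n}")
	ea, _ := formatAgnosticETag(a)
	eb, _ := formatAgnosticETag(b)
	fmt.Println(ea == eb) // true: same content, different formatting
}
```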
I'm kind of thinking now, why bother with generating a weak ETag at all? Unless your backend is doing things that would commonly cause differences in JSON formatting for the same data, this is probably a rare case and not worth the extra effort or processing. Figure it out when you're at a scale that it actually matters, and stick with strong ETags for now.
It's good to know about this option for handling in the frontend if a system returns one though.
Yeah, I’ve never really heard of “weak etags” before in any sort of common usage of the term. Honestly, most people tend to skip etags by embedding hashes in filenames directly; this way you can avoid any bad proxies serving up stale content or dropping headers. It’s rare these days to be an issue given the use of TLS end-to-end encryption, but I’m sure it still occasionally happens. And yes, the more serious approach to possibly poorly formatted JSON is to “normalize it” into the expected format. It’s less about caching and more about ensuring what you serve to your front end is consistent, even if you are liberal in what inputs you can handle. E.g. if someone gives you XML, rather than write a front end that can handle both XML and JSON, pull the data out of both and make your own JSON later.
I've not seen a very convincing use-case for ETags vs Last-Modified date caching.
In the example request, the server still has to do all of the work generating the page, in order to calculate the ETag and then determine whether or not the page has changed. In most situations, it's simpler to have timestamps to compare against, because that gives the server a faster way to spot unmodified data.
e.g. you get an HTTP request for some data that you know is sourced from a particular file or a DB table. If the client sends an If-Modified-Since (or whatever the header name is), you have a good chance of being able to check the modified time of the data source before doing any complicated data processing, and are able to send back a not-modified response sooner.
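A small sketch of that early exit (Go; the data file path is a stand-in for wherever the data actually lives, be it a file or a DB column):

```go
package main

import (
	"net/http"
	"os"
	"time"
)

func report(w http.ResponseWriter, r *http.Request) {
	info, err := os.Stat("data/source.csv")
	if err != nil {
		http.Error(w, "data source unavailable", http.StatusInternalServerError)
		return
	}
	// HTTP dates have second precision, so truncate before comparing.
	modTime := info.ModTime().UTC().Truncate(time.Second)
	w.Header().Set("Last-Modified", modTime.Format(http.TimeFormat))

	// Cheap early exit: if the source hasn't changed since the client's copy,
	// skip the expensive processing entirely.
	if ims := r.Header.Get("If-Modified-Since"); ims != "" {
		if t, err := http.ParseTime(ims); err == nil && !modTime.After(t) {
			w.WriteHeader(http.StatusNotModified)
			return
		}
	}

	// ...expensive processing and response generation would go here...
	w.Write([]byte("freshly generated report\n"))
}

func main() {
	http.HandleFunc("/report", report)
	http.ListenAndServe(":8080", nil)
}
```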
When I use ETags, I'll generate the ETag value once (on startup), then cache and serve that value. When the resource changes, regenerate the ETag value.
Obviously doesn't work if you're using ETags on dynamic resources, but works well for non-dynamic, but unpredictably frequently changing resources.
I recently implemented this, great write-up. Regarding the hashing function, I’m curious about opinions. In my implementation I went for a cheap but weak cryptographic hash at first. Then I got worried that some auditor would flag it and time would be wasted convincing them to change their mind. But then I stumbled upon FNV [1], a non-cryptographic hash and part of Go’s standard library and went for it. Any thoughts?
Also, ETag is exactly the kind of thing non-cryptographic hashes are meant for, but if you can't convince them, Blake3 is a very fast, modern cryptographic hash function.
I feel like people tend to overthink in this regard. If SHA-256 hashing is good enough for GitHub's REST endpoints, it's good enough for me.
If you're implementing weak validation, then you might need to preprocess the payload before running it through the hash function. For example, if your payload is JSON and you want to make it format-agnostic, then you'll need to normalize the payload and then compute the hash.
In either case, the hashing algo probably doesn't matter as much.
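For what it's worth, the FNV route is tiny in Go, since hash/fnv is in the standard library; a sketch:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// etagFor hashes a response payload with 64-bit FNV-1a. Fast and fine for
// cache validation; just don't use it anywhere that needs collision resistance
// against an attacker.
func etagFor(body []byte) string {
	h := fnv.New64a()
	h.Write(body) // hash.Hash writes never return an error
	return fmt.Sprintf(`"%x"`, h.Sum64())
}

func main() {
	fmt.Println(etagFor([]byte(`{"hello":"world"}`)))
}
```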
I was refactoring a project serving user uploaded files yesterday, and had the occasion to test caching. Both Firefox and Chrome used ETag and If-None-Match properly to check and cache queries. Which problems did you encounter?
There was still one thing that surprised me a bit (but also makes sense). Images are fetched only once per page load in my testing. If an image with 60sec of cache is loaded, then removed by JS and added back after 2 minutes, then the browser will reuse the image from the initial load.
1) I (AWS CloudFront) supply the ETag and If-None-Match headers; I can see those headers in responses.
2) Browsers sometimes do respect that (once in a while I see 304 in responses), but 99% of the time they don't include ETag/If-None-Match in requests, and thus I never get 304 responses (even though nothing changed: not CloudFront, not the resource, not the data). Instead they perform some other caching and reload the whole resource again with a TTL that does not seem to come from my headers, totally disregarding the ETag/If-None-Match logic.
For videos it is even worse. Unless you set `preload="none"` in the HTML, Safari, Firefox and Chrome all have different policies and try to preload all videos on screen, ignoring lazy-loading HTML attributes. Worst of all, caching does not work well: videos will be re-requested almost every time, with ETag/If-None-Match totally ignored.
The same happens for all browsers (Chrome, Safari, Firefox) with default settings. But I do agree this is not normal behavior :/
Actually, caching is happening, but it does not follow the ETag or caching policy headers that backends return; instead some in-browser internal caching policy is being applied.
Yep, I tried countless variations of this while testing. Did not work!
My conclusion in the end: this was intentional by browsers. They give priority to their internal caching policies for performance or user experience :/
I was under the impression that LLMs don't generate grammatical errors often, if at all, so "what they needs to know" caught my attention. It did trigger my "empty praise" vacuous-spam detector though.
In the blog post that was shared in this Hacker News post they explore themes such as ETag, and leverage diagrams and examples to present a dynamic presentation that elevates understanding for the reader in a compelling manner.
Thank you for sharing your insights and highlighting the innovative approaches discussed in the blog post, especially in relation to ETag. The utilization of diagrams and real-world examples undoubtedly enhances the reader’s comprehension in a substantive way. We firmly believe that leveraging such dynamic presentations not only facilitates a deeper understanding but also fosters an environment of learning that is both engaging and informative. It’s encouraging to see the community’s positive reception and the valuable discussions that emerge around these key technological concepts.
We appreciate your insightful comment about the benefits of HTTP ETag. It's indeed a powerful mechanism for optimizing web performance and reducing bandwidth usage. Thank you for sharing your expertise on this topic!