
I'm hardly a connoisseur of DER implementations, but my understanding is that there are two main problems with DER. The first is that the format isn't really parseable without using a schema, unlike (say) XML or JSON. This means your generic DER parser needs to have an ASN.1 schema passed into it to parse the DER, and this leads to the second problem, which is that this ends up being complex enough that basically every attempt to do so is full of memory safety issues.




> The first is that the format isn't really parseable without using a schema, unlike (say) XML or JSON.

You can parse DER perfectly well without a schema, it's a self-describing format. ASN.1 definitions give you shape enforcement, but any valid DER stream can be turned into an internal representation even if you don't know the intended structure ahead of time.

rust-asn1[1] is a nice demonstration of this: you can deserialize into a structure if you know your structure AOT, or you can deserialize into the equivalent of a "value" wrapper that enumerates/enforces all valid encodings.
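To make that concrete, here's a rough sketch (in Python, helper names made up) of what a schemaless walk over a DER stream looks like -- every element is just tag, length, value, so you can recurse into constructed values without knowing any schema:

  # Minimal schemaless DER walker (a sketch; long-form tag numbers are ignored).
  def parse_tlv(data, offset=0):
      tag = data[offset]
      offset += 1
      length = data[offset]
      offset += 1
      if length & 0x80:                  # long-form length
          n = length & 0x7F
          length = int.from_bytes(data[offset:offset + n], "big")
          offset += n
      value = data[offset:offset + length]
      return tag, value, offset + length

  def walk(data, depth=0):
      offset = 0
      while offset < len(data):
          tag, value, offset = parse_tlv(data, offset)
          print("  " * depth + f"tag=0x{tag:02x} len={len(value)}")
          if tag & 0x20:                 # constructed: recurse into children
              walk(value, depth + 1)

Run that over any DER file (a certificate, say) and you get the full tree of tags and lengths; what you don't get is what any of it means.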

> which is that this ends up being complex enough that basically every attempt to do so is full of memory safety issues.

Sort of -- DER gets a bad rap for two reasons:

1. OpenSSL had (has?) an exceptionally bad and permissive implementation of a DER parser/serializer.

2. Because of OpenSSL's dominance, a lot of "DER" in the wild was really a mixture of DER and BER. This has caused an absolutely obscene amount of pain in PKI standards, which is why just about every modern PKI standard that uses ASN.1 bends over backwards to emphasize that all encodings must be DER and not BER.

(2) in particular is pernicious: the public Web PKI has successfully extirpated BER, but it still skulks around in private PKIs and more neglected corners of the Internet (like RFC 3161 TSAs) because of a long tail of OpenSSL (and other misbehaving implementations) usage.

Overall, DER itself is a mostly normal looking TLV encoding; it's not meaningfully more complicated than Protobuf or any other serialization form. The problem is that it gets mashed together with BER, and it has a legacy of buggy implementations. The latter is IMO more of a byproduct of ASN.1's era -- if Protobuf were invented in 1984, I imagine we'd see the same long tail of buggy parsers regardless of the quality of the design itself.


> You can parse DER perfectly well without a schema, it's a self-describing format.

If the schema uses IMPLICIT tags then - unless I'm missing something - this isn't (easily) possible.

The most you'd be able to tell is whether the TLV contains a primitive or constructed value.

This is a pretty good resource on custom tagging, and goes over how IMPLICIT works: https://www.oss.com/asn1/resources/asn1-made-simple/asn1-qui...
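As a concrete example of what you lose: a dNSName in SubjectAlternativeName is defined as [2] IMPLICIT IA5String, so "example.com" ends up roughly as:

  82 0b 65 78 61 6d 70 6c 65 2e 63 6f 6d

The 0x82 tag only says "context tag 2, primitive"; the fact that the value is an IA5String is gone from the wire and lives only in the schema.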

> Because of OpenSSL's dominance, a lot of "DER" in the wild was really a mixture of DER and BER

:sweat: That might explain why some of the root certs on my machine appear to be BER encoded (barring decoder bugs, which is honestly more likely).


Ah yeah, IMPLICIT is the main edge case. That's a good point.

Even if there is no use of IMPLICIT, you still have the problem that it's just a bunch of primitive values and composites of them, and you don't know what anything means w/o reference to the defining module. And then there are all the OCTET STRING wrappers of things that are still DER-encoded -- there are lots of these in PKIX; even just in Certificate you'll find:

  - the parameters in AlgorithmIdentifier
  - the attribute values in certificate names
  - all the extensions
  - otherName choices of SubjectAlternativeName
  - certificate policies
  - ...
Look at RFCs 5911 and 5912 and look for all the places where `CLASS` is used, and that's roughly how many "typed holes" there are in PKIX.
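The classic example is AlgorithmIdentifier, which is literally defined with an open hole (roughly the RFC 5280 shape):

  AlgorithmIdentifier  ::=  SEQUENCE  {
       algorithm               OBJECT IDENTIFIER,
       parameters              ANY DEFINED BY algorithm OPTIONAL  }

You can decode `parameters` as a TLV just fine, but what it actually is depends entirely on which OID you just read.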

Sure, but that's the same thing as you see with "we've shoved a base64'd JSON object in your JSON object." Value opacity is an API concern, not evidence that DER can't be decoded without a schema.

For sure. Typed holes are a fact of life.

The wikipedia page on serialization formats[0] calls ASN.1 'information object system' style formalisms (which RFCs 5911 and 5912 make use of, and which Heimdal's ASN.1 makes productive use of) "references", which I think is a weird name.

[0] https://en.wikipedia.org/wiki/Comparison_of_data-serializati...


Is it really because of OpenSSL? Anyways, I don't see much of this in the wild.

You can parse DER, but you have no idea what you've just parsed without the schema. In a software library, that's often not very useful, but at least you can verify that the message was loaded correctly, and if you're reverse engineering a proprietary protocol you can at least figure out the parts you need without having to understand the entire thing.

Yes, it's like JSON in that regard. But the key part is that the framing of DER doesn't require a schema; that isn't true for all encoding formats (notably protobuf, where types have overlapping encodings that need to be disambiguated through the schema).

I'd argue that JSON is still easier as it allows you to reason about the structure and build up a (partial) schema at least. You have the keys of the objects you're trying to parse. Something like {"username":"abc","password":"def","userId":1,"admin":false} would end up something like Utf8String(3){"abc"}+Utf8String(3){"def"}+Integer(1){1}+Integer(1){0} if encoded in DER style.
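Spelled out as bytes (assuming the four values just get wrapped in a SEQUENCE), that would be something like:

  30 10                 -- SEQUENCE, 16 bytes of content
     0c 03 61 62 63     -- UTF8String "abc"
     0c 03 64 65 66     -- UTF8String "def"
     02 01 01           -- INTEGER 1
     02 01 00           -- INTEGER 0

All the key names ("username", "password", ...) exist only in the schema.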

This has the fun side effect that DER essentially allows you to process data ("give me the 4th integer and the 2nd string of every third optional item within the fifth list") without knowing what you're interpreting.


It's really not an advantage that DER can be "parsed" without a schema. (As compared to: XDR, PER, OER, DCE RPC, etc., which really can't be.) It's only possible because of the use of tag-length-value encoding, which is really wasteful and complicates life (by making it harder or impossible to do online encoding: you have to compute the length before you can place the value, and since the length itself is variable length you have to reserve the correct number of bytes for it, and shoot-me-now).
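For anyone who hasn't hit this: with definite lengths you can't emit the header until you know exactly how long the serialized value will be, because the size of the length field itself depends on that number. A sketch of the encoder side:

  def encode_length(n):
      # DER definite length: short form below 128, long form otherwise.
      if n < 0x80:
          return bytes([n])
      body = n.to_bytes((n.bit_length() + 7) // 8, "big")
      return bytes([0x80 | len(body)]) + body

  def encode_tlv(tag, value):
      # The complete value has to be in hand before the header can be written.
      return bytes([tag]) + encode_length(len(value)) + value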

I don't have a strong opinion about whether it's an advantage or not, that was more just about the claim that it can't be parsed without a schema.

(I don't think variable-length-lengths are that big of a deal in practice. That hasn't been a significant hurdle whenever I've needed to parse DER streams.)


Variable length lengths are not a big deal, but they prevent online encoding. The way you deal with that anyways is that you make your system use small messages and then stream those.

> You can parse DER perfectly well without a schema, it's a self-describing format. ASN.1 definitions give you shape enforcement, but any valid DER stream can be turned into an internal representation even if you don't know the intended structure ahead of time.

> rust-asn1[1] is a nice demonstration of this: you can deserialize into a structure if you know your structure AOT, or you can deserialize into the equivalent of a "value" wrapper that enumerates/enforces all valid encodings.

Almost. The "tag" of the data doesn't actually tell you the type of the data by itself (most of the time at least), so while you can say "there is something of length 10 here", you can't say if it's an integer or a string or an array.


> The "tag" of the data doesn't actually tell you the type of the data by itself (most of the time at least), so while you can say "there is something of length 10 here", you can't say if it's an integer or a string or an array.

Could you explain what you mean? The tag does indeed encode this: for an integer you'd see `INTEGER`, for a string you'd see `UTF8String` or similar, for an array you'd see `SEQUENCE OF`, etc.

You can verify this for yourself by using a schemaless decoder like Google's der-ascii[1]. For example, here's a decoded certificate[2] -- you get fields and types, you just don't get the semantics (e.g. "this number is a public key") associated with them because there's no schema.

[1]: https://github.com/google/der-ascii

[2]: https://github.com/google/der-ascii/blob/main/samples/cert.t...
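For reference, a few of the universal tags you'd see in a dump like that:

  02  INTEGER
  03  BIT STRING
  04  OCTET STRING
  05  NULL
  06  OBJECT IDENTIFIER
  0c  UTF8String
  30  SEQUENCE (constructed, so 0x10 | 0x20)
  31  SET      (constructed, so 0x11 | 0x20)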


It's been a long time since I last stared at DER, but my recollection was for the ASN.1 schema I was decoding, basically all of the tags ended up not using the universal tag information, so you just had to know what the type was supposed to be. The fact that everything was implicit was why I qualified it with "most of the time"; it was that way in my experience.

Oh, that makes sense. Yeah, I mostly work with DER in contexts that use universal tagging. From what I can tell, IMPLICIT tagging is used somewhat sparingly (but it is used) in the PKI RFCs.

So yeah, in that instance you do need a schema to make progress beyond "an object of some size is here in the stream."


IMPLICIT tagging is used in PKIX (and other protocols) whenever a context or application tag is needed to disambiguate due to either a) OPTIONAL members, b) members that were inserted as if the SEQUENCEs/SETs were extensible, or c) CHOICEs. The reason for IMPLICIT tagging instead of EXPLICIT is simply to optimize on space: if you use EXPLICIT you add a constructed tag-length in front of the value that already has a tag and length, but if you use IMPLICIT then you merely _replace_ the tag of the value, thus with IMPLICIT you save the bytes for one tag and one length.
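A quick worked example with a made-up field, say a context tag [0] on INTEGER 5:

  EXPLICIT:  a0 03 02 01 05    -- outer [0] wrapper around the full INTEGER TLV
  IMPLICIT:  80 01 05          -- the [0] tag replaces the INTEGER tag

Two bytes saved per tagged field, which is exactly the tag-plus-length overhead described above.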

Kerberos uses EXPLICIT tagging, and it uses context tags for every SEQUENCE member, so these extra tags and lengths add up, but yeah, dumpasn1 on a Kerberos PDU (if you have the plaintext of it) is more usable than on a PKIX value.


DER is TLV. You don't know the specifics ("this integer is a value between 10 and 53") that the schema contains, but you know it's an integer when you read it.

PER lacks type information, making encoding much more efficient as long as both sides of the connection have access to the schema.


> The "tag" of the data doesn't actually tell you the type of the data by itself (most of the time at least)

In my experience it does tell you the type, but it depends on the schema. If implicit types are used, then it won't tell you the type of the data, but if you use explicit, or if it is neither implicit nor explicit, then it does tell you the type of the data. (However, if the data type is a sequence, then you might not lose much by using an implicit type; the DER format still tells you that it is constructed rather than primitive.)


One of my big problems with ASN.1 (and its encodings) is how _crusty_ it is.

You need to populate a string? First look up whether it's a UTF8String, NumericString, PrintableString, TeletexString, VideotexString, IA5String, GraphicString, VisibleString, GeneralString, UniversalString, CHARACTER STRING, or BMPString. I'll note that three of those types have "Universal" / "General" in their name, and several more imply it.

How about a timestamp? Well, do you mean a TIME, UTCTime, GeneralizedTime, or DATE-TIME? Don't be fooled, all those types describe both a date _and_ time, if you just want a time then that's TIME-OF-DAY.

It's understandable how a standard with teletex roots got to this point, but it doesn't lead to good implementations when there is that much surface area to cover.


I think that it is useful to have different types for different purposes.

> You need to populate a string? First look up whether it's a UTF8String, NumericString, PrintableString, TeletexString, VideotexString, IA5String, GraphicString, VisibleString, GeneralString, UniversalString, CHARACTER STRING, or BMPString.

They could be grouped into three groups: ASCII-based (IA5String, VisibleString, PrintableString, NumericString), Unicode-based (UTF8String, BMPString, UniversalString), and ISO-2022-based (TeletexString, VideotexString, GraphicString, GeneralString). (CHARACTER STRING allows arbitrary character sets and encodings, and does not fit into any of these groups. You are unlikely to need it, but it is there in case you do need it.)

IA5String is the most general ASCII-based type, and GeneralString is the most general ISO-2022-based type. For decoding, you can treat the other ASCII-based types as IA5String if you do not need to validate them, and you can treat GraphicString like GeneralString (for TeletexString and VideotexString, the initial state is different, so you will have to consider that). For the Unicode-based types, BMPString is UTF-16BE (although normally only BMP characters are allowed) and UniversalString is UTF-32BE.

When making your own formats, you might just use the most general ones and specify your own constraints, although you might prefer to use the more restrictive types if they are known to be suitable; I usually do (for example, PrintableString is suitable for domain names (as well as ICAO airport codes, etc) and VisibleString is suitable for URLs (as well as many other things)).

> How about a timestamp? Well, do you mean a TIME, UTCTime, GeneralizedTime, or DATE-TIME?

UTCTime probably should not be used for newer formats, since it is not Y2K compliant (although it may be necessary when dealing with older formats that use it, such as X.509); GeneralizedTime is better.
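For a made-up timestamp, the difference is just the year width:

  UTCTime:          250607120000Z      -- two-digit year, needs a windowing rule
  GeneralizedTime:  20250607120000Z    -- four-digit year

(X.509 pins the window: 00-49 is read as 20xx, 50-99 as 19xx.)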

In all of these cases, you only need to implement the types you are using in your program, not necessarily all of them.

(If needed, you can also use the "ASN.1X" that I made up which adds some additional nonstandard types, such as: BCD string, TRON string, key/value list, etc. Again, you will only need to implement the types that you are actually using in your program, which is probably not all of them.)


Eh, for all new things use only UTF8String and you're done. For all old things limit yourself to US-ASCII in whatever kind of string and you're done.

Implementing GeneralString in all its horror is a real pain, but also you'll never ever need it.

This generality in ASN.1 is largely due to it being created before Unicode.


> I'm hardly a connoisseur of DER implementations, but my understanding is that there are two main problems with DER. The first is that the format isn't really parseable without using a schema, unlike (say) XML or JSON.

That's not really the problem. The problem is that DER is a tag-length-value encoding, which is quite redundant and inefficient and a total crutch that people who didn't see XDR first could not imagine not needing, but yeah, they really didn't need it. That crutch made it harder, not easier, to implement ASN.1/DER.

XML is no picnic either, by the way. JSON is much much simpler, and it's true you don't need a schema, but you end up wanting one anyways.


> The first is that the format isn't really parseable without using a schema, unlike (say) XML or JSON.

You can parse DER without using a schema (except for implicit types, although even then you can always parse the framing even if the value cannot necessarily be parsed; for this reason, in my own formats I only use implicit types for sequences and octet strings, and only when an implicit type is needed, which it often isn't). (The meaning of many fields will not be known without the schema, but that is also true of XML and JSON.)

I wrote a DER parser without handling the schema at all.


I wrote an ASN.1 decoder, and since the encoding contains type/size info you can often read a subset and handle the rest as opaque data objects if you need round-tripping. That's required because there can be plenty of data that is unknown to older consumers (like the ETSI EIDAS/PAdES personal information extensions in PDF signatures).

However, to have a sane interface for actually working with the data you do need a schema that can be compiled to a language specific notation.



