> The first is that the format isn't really parseable without using a schema, unlike (say) XML or JSON.
You can parse DER perfectly well without a schema; it's a self-describing format. ASN.1 definitions give you shape enforcement, but any valid DER stream can be turned into an internal representation even if you don't know the intended structure ahead of time.
rust-asn1[1] is a nice demonstration of this: you can deserialize into a structure if you know your structure AOT, or you can deserialize into the equivalent of a "value" wrapper that enumerates/enforces all valid encodings.
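To make that concrete, here's roughly what a schemaless walk over a DER stream looks like. This is a hand-rolled sketch rather than rust-asn1's actual API, and it assumes single-byte tags (which covers all of the universal types; high-tag-number form isn't handled):

```rust
// Schemaless DER TLV walk (sketch). Assumes single-byte tags and the
// definite lengths DER requires.
fn read_tlv(buf: &[u8]) -> Option<(u8, &[u8], &[u8])> {
    let (&tag, rest) = buf.split_first()?;
    let (&first_len, rest) = rest.split_first()?;
    let (len, rest) = if first_len & 0x80 == 0 {
        (first_len as usize, rest) // short form: the length is this byte
    } else {
        let n = (first_len & 0x7f) as usize; // long form: n length octets follow
        if n == 0 || n > 8 || n > rest.len() {
            return None;
        }
        let len = rest[..n].iter().fold(0usize, |acc, &b| (acc << 8) | b as usize);
        (len, &rest[n..])
    };
    if len > rest.len() {
        return None;
    }
    Some((tag, &rest[..len], &rest[len..]))
}

fn dump(mut buf: &[u8], depth: usize) {
    while let Some((tag, value, rest)) = read_tlv(buf) {
        println!("{}tag=0x{:02x} len={}", "  ".repeat(depth), tag, value.len());
        if tag & 0x20 != 0 {
            dump(value, depth + 1); // constructed: the contents are more TLVs
        }
        buf = rest;
    }
}

fn main() {
    // SEQUENCE { INTEGER 5, UTF8String "hi" }, hand-encoded for illustration.
    dump(&[0x30, 0x07, 0x02, 0x01, 0x05, 0x0c, 0x02, b'h', b'i'], 0);
}
```

No schema is involved anywhere: the tags and lengths alone are enough to recover the tree.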
> which is that this ends up being complex enough that basically every attempt to do so is full of memory safety issues.
Sort of -- DER gets a bad rap for two reasons:
1. OpenSSL had (has?) an exceptionally bad and permissive implementation of a DER parser/serializer.
2. Because of OpenSSL's dominance, a lot of "DER" in the wild was really a mixture of DER and BER. This has caused an absolutely obscene amount of pain in PKI standards, which is why just about every modern PKI standard that uses ASN.1 bends over backwards to emphasize that all encodings must be DER and not BER.
(2) in particular is pernicious: the public Web PKI has successfully extirpated BER, but it still skulks around in private PKIs and more neglected corners of the Internet (like RFC 3161 TSAs) because of a long tail of usage of OpenSSL (and other misbehaving implementations).
Overall, DER itself is a mostly normal-looking TLV encoding; it's not meaningfully more complicated than Protobuf or any other serialization form. The problem is that it gets mashed together with BER, and it has a legacy of buggy implementations. The latter is IMO more of a byproduct of ASN.1's era -- if Protobuf had been invented in 1984, I imagine we'd see the same long tail of buggy parsers regardless of the quality of the design itself.
Even if there is no use of IMPLICIT, you still have the problem that it's just a bunch of primitive values and composites of them, but you don't know what anything means w/o reference to the defining module. And then there are all the OCTET STRING wrappers of things that are still DER-encoded -- there are lots of these in PKIX; even just in Certificate you'll find:
- the parameters in AlgorithmIdentifier
- the attribute values in certificate names
- all the extensions
- otherName choices of SubjectAlternativeName
- certificate policies
- ...
Look at RFCs 5911 and 5912 and count the places where `CLASS` is used; that's roughly how many "typed holes" there are in PKIX.
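For a concrete picture of one of those typed holes, here's a basicConstraints extension with the bytes written out by hand (a sketch for illustration). A schemaless parser sees extnValue as an opaque OCTET STRING; only the schema tells you to parse its contents as DER again:

```rust
fn main() {
    // X.509 Extension: basicConstraints (OID 2.5.29.19), critical, CA = TRUE.
    let extension = [
        0x30, 0x0f,                   // SEQUENCE (Extension), 15 content bytes
        0x06, 0x03, 0x55, 0x1d, 0x13, //   OBJECT IDENTIFIER 2.5.29.19 (basicConstraints)
        0x01, 0x01, 0xff,             //   BOOLEAN TRUE (critical)
        0x04, 0x05,                   //   OCTET STRING extnValue: just bytes to a schemaless parser...
        0x30, 0x03, 0x01, 0x01, 0xff, //   ...but actually DER again: BasicConstraints SEQUENCE { cA TRUE }
    ];
    println!("{} bytes", extension.len());
}
```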
Sure, but that's the same thing as you see with "we've shoved a base64'd JSON object in your JSON object." Value opacity is an API concern, not evidence that DER can't be decoded without a schema.
The Wikipedia page on serialization formats[0] calls ASN.1 'information object system' style formalisms (which RFCs 5911 and 5912 use, and which Heimdal's ASN.1 makes productive use of) "references", which I think is a weird name.
You can parse DER, but you have no idea what you've just parsed without the schema. In a software library, that's often not very useful, but at least you can verify that the message was loaded correctly, and if you're reverse engineering a proprietary protocol you can at least figure out the parts you need without having to understand the entire thing.
Yes, it's like JSON in that regard. But the key part is that the framing of DER doesn't require a schema; that isn't true for all encoding formats (notably protobuf, where types have overlapping encodings that need to be disambiguated through the schema).
I'd argue that JSON is still easier, as it allows you to reason about the structure and build up a (partial) schema at least: you have the keys of the objects you're trying to parse. Something like {"username":"abc","password":"def","userId":1,"admin":false} would end up as something like Utf8String(3){"abc"}+Utf8String(3){"def"}+Integer(1){1}+Integer(1){0} if encoded in DER style.
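To put bytes to that, here's roughly what that record looks like DER-encoded once the keys are gone (a sketch; I've used a BOOLEAN for the last field where the rendering above used an INTEGER):

```rust
fn main() {
    // SEQUENCE { "abc", "def", 1, FALSE } -- the field names only exist in the schema.
    let record = [
        0x30, 0x10,                   // SEQUENCE, 16 content bytes
        0x0c, 0x03, b'a', b'b', b'c', //   UTF8String "abc"  (was "username")
        0x0c, 0x03, b'd', b'e', b'f', //   UTF8String "def"  (was "password")
        0x02, 0x01, 0x01,             //   INTEGER 1         (was "userId")
        0x01, 0x01, 0x00,             //   BOOLEAN FALSE     (was "admin")
    ];
    println!("{} bytes, no field names on the wire", record.len());
}
```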
This has the fun side effect that DER essentially allows you to process data ("give me the 4th integer and the 2nd string of every third optional item within the fifth list") without knowing what you're interpreting.
It's really not an advantage that DER can be "parsed" without a schema. (As compared to XDR, PER, OER, DCE RPC, etc., which really can't be.) It's only possible because of the use of tag-length-value encoding, which is really wasteful and complicates life: it makes online encoding harder or impossible, since you have to compute the length before you can place the value, and because the length itself is variable-length you have to reserve the correct number of bytes for it (shoot me now).
I don't have a strong opinion about whether it's an advantage or not, that was more just about the claim that it can't be parsed without a schema.
(I don't think variable-length lengths are that big of a deal in practice. That hasn't been a significant hurdle whenever I've needed to parse DER streams.)
Variable-length lengths are not a big deal, but they prevent online encoding. The way you deal with that anyway is to make your system use small messages and then stream those.
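For anyone who hasn't hand-rolled this, the length field itself is what gets in the way. A sketch of DER's definite-length encoding, showing why the header for a big nested value can't be emitted until you already know how long its contents will be:

```rust
// DER definite-length encoding (sketch). The header can't be written until
// the value's full length is known, which is what blocks one-pass "online"
// encoding of large nested structures.
fn encode_length(len: usize, out: &mut Vec<u8>) {
    if len < 0x80 {
        out.push(len as u8); // short form: one byte
    } else {
        let bytes = len.to_be_bytes();
        let skip = bytes.iter().take_while(|&&b| b == 0).count();
        out.push(0x80 | (bytes.len() - skip) as u8); // long form: count of length octets
        out.extend_from_slice(&bytes[skip..]);
    }
}

fn main() {
    for len in [5usize, 0x80, 0x1234] {
        let mut out = Vec::new();
        encode_length(len, &mut out);
        println!("{len:#x} -> {out:02x?}");
    }
}
```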
> You can parse DER perfectly well without a schema, it's a self-describing format. ASN.1 definitions give you shape enforcement, but any valid DER stream can be turned into an internal representation even if you don't know the intended structure ahead of time.
> rust-asn1[1] is a nice demonstration of this: you can deserialize into a structure if you know your structure AOT, or you can deserialize into the equivalent of a "value" wrapper that enumerates/enforces all valid encodings.
Almost. The "tag" of the data doesn't actually tell you the type of the data by itself (most of the time at least), so while you can say "there is something of length 10 here", you can't say if it's an integer or a string or an array.
> The "tag" of the data doesn't actually tell you the type of the data by itself (most of the time at least), so while you can say "there is something of length 10 here", you can't say if it's an integer or a string or an array.
Could you explain what you mean? The tag does indeed encode this: for an integer you'd see `INTEGER`, for a string you'd see `UTF8String` or similar, for an array you'd see `SEQUENCE OF`, etc.
You can verify this for yourself by using a schemaless decoder like Google's der-ascii[1]. For example, here's a decoded certificate[2] -- you get fields and types, you just don't get the semantics (e.g. "this number is a public key") associated with them because there's no schema.
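For reference, the universal tag numbers are fixed by X.680, so a schemaless decoder can name the type from the tag byte alone whenever the universal class is used. A rough (non-exhaustive) sketch of that mapping:

```rust
// Map a DER tag byte to a universal type name (sketch; universal class only,
// i.e. the top two bits of the tag are 00).
fn universal_name(tag: u8) -> &'static str {
    match tag & 0x1f { // low five bits are the tag number
        0x01 => "BOOLEAN",
        0x02 => "INTEGER",
        0x03 => "BIT STRING",
        0x04 => "OCTET STRING",
        0x05 => "NULL",
        0x06 => "OBJECT IDENTIFIER",
        0x0c => "UTF8String",
        0x10 => "SEQUENCE", // 0x30 on the wire, since SEQUENCE is always constructed
        0x11 => "SET",
        0x17 => "UTCTime",
        _ => "other",
    }
}

fn main() {
    assert_eq!(universal_name(0x30), "SEQUENCE");
    assert_eq!(universal_name(0x02), "INTEGER");
}
```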
It's been a long time since I last stared at DER, but my recollection is that, for the ASN.1 schema I was decoding, basically all of the tags ended up not using the universal tag information, so you just had to know what the type was supposed to be. The fact that everything was implicit was why I qualified it with "most of the time"; it was that way in my experience.
Oh, that makes sense. Yeah, I mostly work with DER in contexts that use universal tagging. From what I can tell, IMPLICIT tagging is used somewhat sparingly (but it is used) in the PKI RFCs.
So yeah, in that instance you do need a schema to make progress beyond "an object of some size is here in the stream."
IMPLICIT tagging is used in PKIX (and other protocols) whenever a context or application tag is needed to disambiguate, due to either a) OPTIONAL members, b) members that were inserted as if the SEQUENCEs/SETs were extensible, or c) CHOICEs. The reason for IMPLICIT tagging instead of EXPLICIT is simply to save space: with EXPLICIT you add a constructed tag-length in front of the value, which already has its own tag and length, while with IMPLICIT you merely _replace_ the tag of the value. So IMPLICIT saves you the bytes for one tag and one length.
Kerberos uses EXPLICIT tagging, and it uses context tags for every SEQUENCE member, so these extra tags and lengths add up, but yeah, dumpasn1 on a Kerberos PDU (if you have the plaintext of it) is more usable than on a PKIX value.
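The byte-level difference, for anyone who wants to see it (hand-encoded sketch: an INTEGER 5 under a [0] context tag):

```rust
fn main() {
    let untagged = [0x02, 0x01, 0x05];             // INTEGER, length 1, value 5
    let explicit = [0xa0, 0x03, 0x02, 0x01, 0x05]; // [0] EXPLICIT: constructed wrapper around the whole TLV
    let implicit = [0x80, 0x01, 0x05];             // [0] IMPLICIT: the INTEGER tag is replaced; two bytes saved
    println!("{}, {}, {} bytes", untagged.len(), explicit.len(), implicit.len());
}
```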
DER is TLV. You don't know the specifics ("this integer is a value between 10 and 53") that the schema contains, but you know it's an integer when you read it.
PER lacks type information, making encoding much more efficient as long as both sides of the connection have access to the schema.
> The "tag" of the data doesn't actually tell you the type of the data by itself (most of the time at least)
In my experience it does tell you the type, but it depends on the schema. If implicit types are used, then it won't tell you the type of the data, but if you use explicit, or if it is neither implicit nor explicit, then it does tell you the type of the data. (However, if the data type is a sequence, then you might not lose much by using an implicit type; the DER format still tells you that it is constructed rather than primitive.)