It can also have embedded binary streams. It was not made for text; it was made for layout and graphics. Your examples are nice, but each of those lines could just as well have been broken up into one call per character, or per word, and emitted out of order.
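To make the out-of-order point concrete, here is a rough sketch of what an extractor has to do when text arrives as positioned fragments. The `Fragment` type, the `reassemble` function and the `line_tolerance` value are all made up for illustration; this is not any particular library's API, and it ignores word spacing, rotation and fonts entirely.

```python
from typing import List, NamedTuple


class Fragment(NamedTuple):
    """One hypothetical 'show text' call: a position plus the string drawn there."""
    x: float
    y: float
    text: str


def reassemble(fragments: List[Fragment], line_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines by y, then order each line by x.

    Assumes y grows upward (as in PDF user space), so larger y means
    higher on the page. line_tolerance is an arbitrary illustrative value.
    """
    # Top of page first; x is only a tie-breaker here.
    ordered = sorted(fragments, key=lambda f: (-f.y, f.x))

    lines: List[List[Fragment]] = []
    for frag in ordered:
        # Same line if close enough in y to the first fragment of the current line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])

    # Within a line, read left to right regardless of emission order.
    return ["".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


# Fragments emitted in an order unrelated to reading order.
shuffled = [
    Fragment(20, 700, "l"),
    Fragment(10, 680, "PDF"),
    Fragment(25, 700, "lo"),
    Fragment(10, 700, "He"),
]
print(reassemble(shuffled))
```

Even this toy version has to guess at a line tolerance, which is the kind of heuristic real extractors end up full of.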
It can also use fonts whose character codes bear no relation to the glyphs they draw, e.g. the visible text "PDF" could be stored as "1#F", and you only really know what it looks like by rendering the page and then viewing it or running OCR.
A nice file won't, but sometimes the best work is in not dealing with nice things.
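As a toy illustration of the font problem: a well-behaved file ships a mapping from character codes back to Unicode (in PDF this is the job of a font's ToUnicode CMap), and without one the stored codes are all you get. The dict below is a stand-in I invented to match the "1#F" example, and `decode_shown_text` is a hypothetical helper, not a real library call.

```python
from typing import Dict, Optional


def decode_shown_text(raw: str, to_unicode: Optional[Dict[str, str]]) -> str:
    """Translate stored character codes to readable text, if a mapping exists.

    With no mapping, the best a text extractor can do is return the raw
    codes, which may look nothing like what is drawn on the page.
    """
    if to_unicode is None:
        return raw
    # Codes missing from the mapping become U+FFFD (replacement character).
    return "".join(to_unicode.get(code, "\ufffd") for code in raw)


# Invented mapping for a font that draws P, D, F for the codes "1", "#", "F".
toy_mapping = {"1": "P", "#": "D", "F": "F"}

print(decode_shown_text("1#F", toy_mapping))  # mapping present
print(decode_shown_text("1#F", None))         # mapping absent: raw codes only
```

When the mapping is missing or wrong, no amount of parsing recovers the text, which is why rendering plus OCR ends up as the fallback.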