
Airbus ditches Microsoft, flies off to Google

Kristian Walsh Silver badge

Re: "UTF-8 is self-clocking so random access is quite trivial"

Apologies for the long post, but...

I used to teach developers about Unicode adoption and localisability, and the "objection" that you can't random-access UTF-8 text came up a lot. It sounds like a problem, but it's based on a flawed conflation of "text fragment" and "byte array". Random access is something you do with byte arrays, not human-readable text.

(You mention configuration keys being English. They're not: they're machine-readable byte sequences. They happen to look like English words, but that's down to the same incorrect conflation of "human-readable text" and "bytes". Anyone who takes what is, in effect, a variable name and shows it directly to the user is already doing the wrong thing, and UTF-8 versus ASCII doesn't come into it.)

I have never seen an application that cared about "characters" at the data-format level. Bytes, yes; characters, never. But because C called its 8-bit integer type "char", generations of programmers think they're one and the same. They are not. Unicode TR-17 (https://www.unicode.org/reports/tr17/) gives a truer picture of what storing textual information actually involves - it's a quick read, but it clears up a lot of misconceptions that you may not even know you had. Note also how Unicode uses the term "code point" rather than "character" because, linguistically, "characters" are not defined by any hard-and-fast rule: is the French word "cœur" four or five letters long? And how many characters is that? Or ask a Dutch speaker how many letters are in the word "IJs"...

The only time I've seen code doing a random-access jump into "strings" of data is in chunked data formats, or Pascal-like strings ({ Length, bytes[] }), but they do so in order to skip over "text" chunks, or copy them wholesale and pass them to a renderer for display. In these cases, the data inside those chunks can be anything - its encoding is assumed to be Someone Else's Problem, as it should be at the data-unpacking layer of any protocol.

If you've got code that's randomly accessing a "string", it's a good sign that you're not actually dealing with text; you're manipulating a byte-array, and any textual content within it must therefore be properly delimited by the rules of the data format (if it wasn't, the data format couldn't even work with ASCII text runs, let alone UTF-8).

UTF-8 data is transparent to any application that's "8-bit clean". Applications that hit problems with UTF-8 are ones that made invalid assumptions about text encodings, and these would also fail with other ISO-8859 codepages than Latin1, or with (horrible) multi-byte encodings like Shift-JIS or Big5. However, there's still nothing to stop you using a byte 0xB9 ("¹" in ISO-8859-1) as a placeholder within some "text" that is otherwise UTF-8 encoded, provided you replace that placeholder with another UTF-8 string (or nothing) before passing it out of your process, and provided you use some kind of escape sequence to allow that byte to be part of the "text" without having special meaning. [Seriously, don't do things like this: it's just storing up a world of pain for yourself. Keep your "special" characters within the ASCII code range, and allow them to be escaped, and you're good. That way, the cases of "is the previous character a backslash" work in UTF-8 just fine, as '\' is one of the ASCII codes, and guaranteed to be a single byte, and no byte that's part of a UTF-8 sequence is allowed to contain that value.]

Random access isn't the same as searching, which is entirely valid to do with human-readable text strings (e.g., to find special formatting tokens). You can find the location of a particular codepoint inside a string of UTF-8 text really easily: strstr() can find any Unicode code point at all (give it the UTF-8 representation of that code point), and strchr() will do it for any codepoint that's below U+0080 (i.e., same as ASCII); or, from the nasty example above, strchr(...,'\xB9') will work fine, as your input is, strictly speaking, a series of UTF-8 text runs, delimited by the byte-value 0xB9, which your process will turn into a single UTF-8 text run. (Differentiating between the "text" 0xB9 values and "special-char" values is your problem, though...)

I would be genuinely interested if you can provide a piece of pseudocode showing an application that needs to reach a character boundary by index within a block of UTF-8 text. I've yet to find an example that isn't already broken for certain valid ASCII inputs.

Nonetheless, all of the above is not an argument in favour of text-based data formats - I still believe that binary streams are quicker and more compact. However, the issue of byte order for multi-byte integers has always made interchange risky, and in the days when the WWW was being constructed there really were still systems that were 7-bit-only when it came to text, and that also stripped any code below U+0020; hence the use of HTML "&" entities for characters like quotes that are actually already encodable in HTML's default character set (ISO-8859-1). Interoperability often means "lowest common denominator", and that's a stream of printable ASCII codes, plus CR, LF, and TAB.
