Reply to post: Re: Ugh!

Temperature of Hell drops a few degrees – Microsoft emits SSH-for-Windows source code

Kristian Walsh

Re: Ugh!

"UTF-16 is horrible and breaks all of the native C/C++ string handling and all legacy text applications. At least UTF-8 is usable, even if you have the unpleasantness of odd characters in old editors and variable-length strings for a fixed number of 'characters' when outside of the ASCII Latin alphabet."

UTF-8 also breaks legacy text applications. The thing is, most of the time the text can pass through such systems unmolested, so that it looks like it's working. But only most of the time. Sooner or later, you'll hit a service that wants ISO-8859 or UTF-8, and you'll give it the other one, and you'll get garbled output, or mysterious runtime exceptions.
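To make the mismatch concrete, here is a minimal C++ sketch of my own (an illustration, assuming a UTF-8 source and execution character set): the two bytes that encode "é" in UTF-8 are read by an ISO-8859-1 consumer as two separate characters.

// The two UTF-8 bytes for "é" (0xC3 0xA9), handed to something that expects
// ISO-8859-1, are treated as two whole characters: 'Ã' and '©'.
#include <iostream>
#include <string>

int main() {
    const std::string utf8_text = "\xC3\xA9";   // "é" encoded as UTF-8
    for (unsigned char b : utf8_text) {
        std::cout << std::hex << std::showbase << int(b) << ' ';
    }
    std::cout << "-> two \"characters\" where the author typed one\n";
}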

The idea of eight-bit text is so deeply ingrained into UNIX and its descendants that it can be hard to explain that this is only one of several possible approaches to text processing, and that the reason for it had more to do with what language the system designers spoke than with any sound technical judgement. (Had the Japanese invented the mainframe, we would have had 12-bit or 16-bit characters, because 8 bits are inadequate for the set of symbols they need to write their language.) The other UNIX thinking trap is conflating the idea of a machine-readable "text file" with the idea of human-readable "text". It's the mental equivalent of "YY/MM/DD" date encoding, or 32-bit IP addresses: fine for the initial use scenario, but seriously limiting in the long term.

On your other criticisms: as pointed out above, C++ strings can have any size of character. I wrote and used a 16-bit Unicode string class for years in my code. It took me less than a morning to write it (it's just a specialisation of the existing basic_string<T> template).
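For illustration only (not that original class; this assumes a C++11-or-later compiler, where char16_t and std::basic_string<char16_t> already exist), the specialisation amounts to little more than this:

#include <iostream>
#include <string>

// A 16-bit string is just basic_string instantiated with a 16-bit character type.
using utf16_string = std::basic_string<char16_t>;   // effectively std::u16string

int main() {
    utf16_string word = u"\u00E9iginnte";            // UTF-16 literal, precomposed é
    // size() reports 16-bit code units (8 here), not glyphs or "characters".
    std::cout << "code units: " << word.size() << '\n';
}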

On the second point, you're raising a non-issue. As a rule, a programmer never needs to know the number of "characters" in a string of natural-language text (i.e., not command-line switches, configuration verbs, or the other stuff that UNIX-like OSes send around), only the amount of storage it occupies... which is just as well, because in most cases you can never know it.

If you're wondering why not, then first you must define what a "character" is, and it's not the same as "byte", or even "Unicode code-point". If this input form has preserved my input correctly, the two sequences "éiginnte" and "éiginnte" will not be the same length: the first uses the precomposed é, the second an e followed by a combining acute accent, so different code sequences produce the same output glyphs. And that's before we get to questions of language. Dutch treats the pair i,j as a single letter in many cases: is the word "rijk" [four codepoints] four or three characters long? Is it the same "length" as "rijk" [three codepoints]? How about "sœur"?

Luckily, all you need to care about is how much space these code-points occupy (8, 9, 4, 3, 4 codepoints, in order). Any idea that this number is easily related to the number of displayed glyphs, or to the "length" of the text the reader sees, is a fallacy.
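Here's a rough C++ sketch of that distinction (my own illustration, assuming a UTF-8 execution character set): it counts UTF-8 code points by skipping continuation bytes, and the counts come out as the 8, 9, 4, 3, 4 above, while the byte counts are different again.

#include <cstddef>
#include <iostream>
#include <string>

// Count Unicode code points in UTF-8 text by counting non-continuation bytes
// (continuation bytes have the bit pattern 10xxxxxx).
std::size_t codepoints(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

int main() {
    const std::string samples[] = {
        "\u00E9iginnte",   // precomposed é: 8 code points
        "e\u0301iginnte",  // e + combining acute: 9 code points, same glyphs
        "rijk",            // i and j separate: 4 code points
        "r\u0133k",        // ĳ ligature: 3 code points
        "s\u0153ur"        // œ: 4 code points
    };
    for (const auto& s : samples) {
        std::cout << s << ": " << s.size() << " bytes, "
                  << codepoints(s) << " code points\n";
    }
}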
