Apple quietly launches next-gen encrypted file system

Kristian Walsh Silver badge

Re: Case..

I'm not kidding, and you miss the point. The two characters I showed ARE the same, but coded differently. The lexical meaning of both codepoint sequences is identical: "a Latin-alphabet lowercase a with acute accent above". That the underlying codepoints are different is only visible when you hexdump the encoded stream. One version is "lowercase a-acute" (a standalone codepoint inherited from ISO-8859-1), the other is "a" then "combining acute accent" (the general case of applying an acute accent to a base letter). The result, however, is the same displayed glyph: "á".
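
If you'd rather see it than take my word for it, here's a minimal Python sketch. Python and the variable names are just my choice for illustration; it says nothing about how any particular file system does its comparison. It dumps the bytes of both sequences and then compares them before and after normalising to NFC:

```python
import unicodedata

precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE (one codepoint)
decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT (two codepoints)

# Different codepoint sequences, so different bytes in the encoded stream
print(precomposed.encode("utf-8").hex(" "))   # c3 a1
print(decomposed.encode("utf-8").hex(" "))    # 61 cc 81

# A naive codepoint-by-codepoint comparison sees two different strings
print(precomposed == decomposed)              # False

# Normalise both to the same form (NFC here) and they compare equal
same_after_nfc = (unicodedata.normalize("NFC", precomposed)
                  == unicodedata.normalize("NFC", decomposed))
print(same_after_nfc)                         # True
```

Both strings render as "á"; only the hexdump tells them apart.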

Glyphs are what you see; characters are the concepts that glyphs represent; codepoints are how you specify the characters in a text stream; bytes are how you represent codepoints. C and its standard library have probably misled you, so a read of this will give a more accurate explanation of how text is actually handled in computer systems: http://www.unicode.org/reports/tr17/ (Unicode TR17: Character Encoding Model).
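
To illustrate the last two layers, here's a rough sketch (again Python, again only my own illustration; the three encodings are ones I picked) that takes one character, prints its codepoint, and then shows how the byte representation depends entirely on the encoding chosen:

```python
char = "\u00e1"  # one character, one codepoint

# The codepoint is an abstract number, independent of any byte layout
codepoint = f"U+{ord(char):04X}"

# The bytes depend on which encoding you pick for that codepoint
utf8 = char.encode("utf-8")
utf16 = char.encode("utf-16-le")
latin1 = char.encode("latin-1")

print(codepoint)        # U+00E1
print(utf8.hex(" "))    # c3 a1
print(utf16.hex(" "))   # e1 00
print(latin1.hex(" "))  # e1
```

Same character, same codepoint, three different byte sequences. Byte=character only holds in the last one.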

Case sensitivity is the dumbest, least-effort way to handle text. In effect, you're not interpreting it at all - it's just a bunch of bytes, with no meaning. Some bytes aren't allowed because you use them to delimit your directory paths, but after that, it's all up to you. The idea of byte=character is fallout from the first commercialised computing systems being developed in the USA, that most monolingual of nations (hence a standardised alphabetic code that couldn't accommodate any language except English). Linux takes this approach, and it's often difficult to convince someone that Ext4 filenames are just a bunch of bytes and don't have any encoding - it's only libraries like glib that assume filenames are UTF-8.
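
Here's the difference between the two views as a small sketch. The function names `same_bytes` and `same_to_a_human` are hypothetical, made up for this post, and the normalise-then-casefold recipe is a simplification I'm assuming for illustration, not the algorithm any real file system uses:

```python
import unicodedata

def same_bytes(a: str, b: str) -> bool:
    """The machine's view: names match only if their encoded bytes match."""
    return a.encode("utf-8") == b.encode("utf-8")

def same_to_a_human(a: str, b: str) -> bool:
    """A rough user's view: ignore case and codepoint-level differences."""
    def key(s: str) -> str:
        # Normalise, fold case, then normalise again, since folding
        # can in principle produce a non-normalised sequence
        folded = unicodedata.normalize("NFC", s).casefold()
        return unicodedata.normalize("NFC", folded)
    return key(a) == key(b)

print(same_bytes("INSTALL", "Install"))           # False
print(same_to_a_human("INSTALL", "Install"))      # True
print(same_bytes("\u00e1.txt", "a\u0301.txt"))    # False
print(same_to_a_human("\u00e1.txt", "a\u0301.txt"))  # True
```

The first function needs no knowledge of text at all, which is exactly why it's the cheap option.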

But that's the machine's view of the world, not the user's: if it's so natural that case should matter, tell me how you would pronounce "INSTALL" differently to "Install"?

Regarding "A.xpm" and "a.xpm": I'd have called them "0041.xpm" and "0061.xpm", because if you ever needed an image for "☞", it's a lot easier to map it to "261E.xpm" than to any alternative. This convention would also let you create a forward-slash glyph without a special case in your code.
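
Something along these lines would do it. It's a sketch only: `glyph_filename` is a name I've invented, and I'm assuming one image per single codepoint, padded to at least four hex digits:

```python
def glyph_filename(ch: str) -> str:
    """Map a single codepoint to a file name such as '0041.xpm'."""
    if len(ch) != 1:
        raise ValueError("expected exactly one codepoint")
    # Upper-case hex, zero-padded to at least four digits
    return f"{ord(ch):04X}.xpm"

print(glyph_filename("A"))   # 0041.xpm
print(glyph_filename("a"))   # 0061.xpm
print(glyph_filename("☞"))   # 261E.xpm
print(glyph_filename("/"))   # 002F.xpm
```

No two names differ only by case, and "/" becomes an ordinary file name like any other.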
