Universal Unix tool AWK gets Unicode support

In Unix terms, this news is akin to Moses appearing and announcing an amendment to the 10 commandments. AWK, a programming language for analyzing text files, is a core part of the Unix operating system, including Linux, all the BSDs and others. For an OS to be considered POSIX compliant, it must include AWK. AWK first appeared …

  1. Pascal Monett Silver badge
    Thumb Up

    Two men

    Two individuals. That's all it took to define an entire industry that still lives under their shadow.

    Salute to the masters.

    1. Anonymous Coward

      Sed, awk and grep...

      a bit of pipework and a large data file used to be all you needed for what's now called 'AI'!

      Just kidding! It's all new now and really clever!

      1. Yet Another Anonymous coward Silver badge

        Re: Sed, awk and grep...

        Unix - it's all a series of pipes !

        1. Anonymous Coward

          Re: Sed, awk and grep...

          The secret is having at least seven kinds of socket to hold it all together.

          1. Vometia has insomnia. Again. Bronze badge

            Re: Sed, awk and grep...

            Yikes, I'm having flashbacks to My First Network Server written 30+ years back and initially being side-tracked into System V's Streams subsystem. I never did get my head around it. When I finally worked out how to do TCP sockets (we didn't have any source code so it took a bit longer than was ideal) I was amazed and very relieved by its simplicity, especially in comparison with that. I wonder why Streams never caught on...?

  2. I Am Spartacus
    Unhappy

    GIT- Aptly named

    I'm so glad that I am not the only one who struggles with the complexities of GIT

    1. m4r35n357

      Re: GIT- Aptly named

      Git is only considered "hard" because it enables really sophisticated operations compared with earlier source control systems. Try setting up a Subversion server yourself, and if you manage that you can marvel at how hard it is to do just a simple merge.

      Git is aptly named _by its own author_, it is no accident!

      1. F. Frederick Skitty Silver badge

        Re: GIT- Aptly named

        Git got its name as a dig at Andrew Tridgell, since his reverse engineering of the proprietary Bitkeeper version control system led to Git's creation. It was always a bit dodgy entrusting the Linux source to a closed product like Bitkeeper, but Linus Torvalds felt it was the only one that offered the features he needed to manage such a massively distributed development effort. That dodginess came home to roost when the owner of Bitkeeper threw the toys out of his pram at Tridgell's efforts to make the source control more accessible and he ended the free licenses to use Bitkeeper for Linux development.

        1. m4r35n357

          Re: GIT- Aptly named

          Not true - Linus named Git after himself!

          Nothing to do with Tridgell: https://en.wikipedia.org/wiki/Git#Naming

          1. F. Frederick Skitty Silver badge

            Re: GIT- Aptly named

            That's somewhat revisionist - at the time it was very clear who the name was directed at

            1. m4r35n357

              Re: GIT- Aptly named

              Not true, I was following the story at the time! Linus at first tried to keep the peace, but when the BK author would not cooperate Linus came down on Tridgell's side.

              1. F. Frederick Skitty Silver badge

                Re: GIT- Aptly named

                Your memory is faulty, as a simple search would show. It was Tridgell who was asked by Linus to stop work on his Sourcepuller tool. Here's a source, an InfoWorld article of the time, which quotes Torvalds's criticism of Tridgell:

                Torvalds begins work on Git

                As an interesting footnote, Bitkeeper was eventually open sourced a few years later.

      2. Dan 55 Silver badge

        Re: GIT- Aptly named

        > Git is only considered "hard" because it enables really sophisticated operations compared with earlier source control systems.

        Well, that and Git concepts have the same name as SVN/CVS/the rest concepts but are actually different things.

        Also if you have several files with similar content and do a bunch of add/delete operations in one commit, it can quite happily get stuff round its neck and decide you've renamed files instead.

        1. Richard 12 Silver badge

          Re: GIT- Aptly named

          That's far better than the alternatives, which decide that all rename operations mean you've deleted and created new files, with no possibility of history or merging across said line.

          Even if you explicitly use the "rename file" tool within said source control software.

          Gods, bringing forward changes across a restructure is basically impossible in, say, Perforce. Yet it "just works" in git, almost every time.

      3. karlkarl Silver badge

        Re: GIT- Aptly named

        >> Try setting up a Subversion server yourself

        I don't know, SVN's svnserve was pretty darn convenient. The alternative for Git (prepping an inetd daemon, an httpd/CGI setup, or locking down a restricted SSH shell account) is in my opinion a little fiddly.

        The rest I agree with though, Git makes very tricky things possible compared to others. Even in small teams where some of the complexity is overkill, it is still worth using Git, if anything to avoid the need to use different RCS systems per project.

      4. the spectacularly refined chap Silver badge

        Re: GIT- Aptly named

        > Git is only considered "hard" because it enables really sophisticated operations compared with earlier source control systems.

        ISTR Torvalds himself admitted the interface sucked at the time of its introduction; he stated it was temporary until something else replaced it. Of course that new interface never appeared. It isn't one of those cases where you can legitimately claim it's the result of power - for source code control there is no excuse for the simple stuff not to be simple.

      5. Michael Wojcik Silver badge

        Re: GIT- Aptly named

        > Try setting up a Subversion server yourself,

        I have. More than one, in fact.

        > and if you manage that you can marvel at how hard it is to do just a simple merge.

        Merges in Subversion are generally trivially easy – certainly since the introduction of merge tracking, and they weren't that difficult before that. Reintegration merges require a grand total of two commands if there are no conflicts, and resolving conflicts with Subversion is certainly no more difficult, and generally more straightforward, than with git. Cross-branch cherry-picked merges are rarely any more effort. I do dozens of Subversion merges of various sorts a month.

        git merges, conversely, can be quite baffling for people who don't understand git's data model and arcane command set. Just look at the unending battles over whether and when rebasing is a good idea.

        git does very well at what it was created for, namely truly distributed change control (of text files; it doesn't do well with non-text formats). When used with a single centralized repository, which is how probably the vast majority of its users use it, it's simply extra complexity and obscurity for little or no benefit.

    2. Tom 38 Silver badge

      Re: GIT- Aptly named

      Git certainly takes some onboarding to get fully up to speed with "what to do when I'm in X situation and want Y to happen", but then I think back to dealing with CVS (shudder) and even SVN. With SVN, we used to set an entire week aside to merge feature branches into production. It was truly horrific: "What did you do this week, Bob?" "Oh, I merged 1700 commits onto production, there were a couple of mismerges but we got there eventually!"

      1. F. Frederick Skitty Silver badge

        Re: GIT- Aptly named

        Subversion's branching and merging support was a "proof of concept", which its author never intended to release in that state. That's why it's such a kludge and makes it so hard to deal with merge conflicts. Things improved a little after Subversion 1.5 was released, but it still built on the terrible foundation of that initial implementation.

        I can't find a link to the discussion about all this that the original author posted to, but it was related to a comparison in the O'Reilly book about Mercurial. On the book's forum, a Subversion fanboi took issue with criticism of his fave version control system until the code's author waded in to confirm his work had been flawed.

    3. bombastic bob Silver badge
      Devil

      Re: GIT- Aptly named

      In at least one case the terminology (especially when directly related to github) seems bass-ackwards to me.

      Specifically, "pull request": usually when you UPload things, or submit things for review, it's more of a PUSH, not a PULL. I think "change request" or "patch request" or "update request" would make more sense.

      And so, the "confusing side" of git, which in THIS case, is really a "GitHub-ism".

      (otherwise to me it is just another source control revisioning system that happens to be popular)

      1. Richard 12 Silver badge

        Re: GIT- Aptly named

        I think it makes sense in the underlying system, but not in the UI that github puts over the top.

        Originally it's "please pull the work I've done from my server into yours"

        But now it's "I've pushed this work to the server we both use, please merge it".

        So "merge request" is a better term. I think that's what gitlab calls it now, tbh.

  3. Pete B Silver badge

    I hope I'm still mentally able to wrangle code at 80!

    1. Gene Cash Silver badge

      Especially wrangling code written over 40 years ago!

      1. A Non e-mouse Silver badge

        How many of us fully understand code we wrote 40 minutes ago, let alone 40 years ago?!?

        1. Anonymous Coward

          Never mind 40 minutes. "Why the hell did I just put that bracket *there*?"

  4. MacroRodent

    Still work in progress

    Doesn't look like the Unicode support is yet merged in the master branch, but there is a working branch unicode-support with notes like "more to do".

    Will be interesting to see how the old Master has managed to retrofit Unicode. I once looked at the problem in the context of another very old piece of code, and thought it too much work. If a program does anything except copy the strings, it has to parse the UFT-8 encoding, and on top of that many old programs assume characters have only 7 bits of data, and use the MSB for something, or get negative array index errors if it is set (often crashing the program).
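
    A minimal Python sketch of the two failure modes described above, assuming a legacy program that either reads bytes as signed 8-bit values or masks off the top bit. The helper names `as_signed_char` and `strip_high_bit` are invented for illustration and are not taken from awk's source:

```python
def as_signed_char(byte: int) -> int:
    """Interpret a 0..255 byte value the way a signed 8-bit C char would."""
    return byte - 256 if byte >= 0x80 else byte


def strip_high_bit(data: bytes) -> bytes:
    """Emulate old code that keeps only the low 7 bits of every byte."""
    return bytes(b & 0x7F for b in data)


encoded = "é".encode("utf-8")  # two bytes: 0xC3 0xA9, both with the top bit set
signed_values = [as_signed_char(b) for b in encoded]
mangled = strip_high_bit(encoded)

print(signed_values)  # negative values, which would index before a C lookup table
print(mangled)        # the accented letter has become two unrelated ASCII bytes
```

    Python itself does not crash on a negative index, so this only shows the values a C program would end up with, not the crash.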

    1. vtcodger Silver badge

      Re: Still work in progress

      One wonders how many mainstream programs are really still 7-bit ASCII only. If they try to use the eighth bit for some internal purpose, they won't work with Latin-1, CP-1252 or other encoding systems that I assume to be in common use in Europe. Surely that'd be a nuisance. But you're surely right about having to deal with UTF-8 multibyte characters without trashing or misunderstanding them.

      1. Richard 12 Silver badge

        Re: Still work in progress

        It Depends.

        If there's no typesetting, it doesn't need to know much at all, just "how to determine the beginning and end of a character" and "what is whitespace".

        That covers trimming, text search/replace and token splitting.

        So if it worked outside of the USA it's mostly a case of fixing everywhere it jumps forward or backwards in the stream.
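
        A rough Python sketch of the "where does a character begin" part, assuming the input is already valid UTF-8. It relies on the fact that continuation bytes always have the bit pattern 10xxxxxx; the function names are made up for this example:

```python
from typing import List


def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_starts(data: bytes) -> List[int]:
    """Byte offsets at which an encoded code point begins."""
    return [i for i, b in enumerate(data) if not is_continuation(b)]


def prev_char_start(data: bytes, pos: int) -> int:
    """Step backwards from byte offset pos to the start of the previous code point."""
    pos -= 1
    while pos > 0 and is_continuation(data[pos]):
        pos -= 1
    return pos


text = "naïve €".encode("utf-8")  # 'ï' takes two bytes, '€' takes three
print(char_starts(text))
print(prev_char_start(text, len(text)))
```

        Jumping forwards or backwards then means moving between these offsets rather than adding or subtracting one byte.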

        1. gnasher729 Silver badge

          Re: Still work in progress

          It depends on whether you want to operate on code points and assume different code points are different, or if you want to operate on characters, where the same character can and will be represented in different ways. For example "café" can be four or five code points, but it is four characters, and they should be considered the same.

          1. alain williams Silver badge

            Re: Still work in progress

            > "café" can be four or five code points, but it is four characters, and they should be considered the same

            Unicode equivalence is a pain: it allows different code point sequences, and therefore different byte sequences, to represent the same character. In the above example the 'é' can either be U+00E9 (a single code point) or 'e' followed by a combining acute accent (U+0065 followed by U+0301). Searching algorithms are thus more complicated.
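
            To make that concrete, here is a small Python sketch using the standard `unicodedata` module. Normalising both strings to NFC before comparing is one common approach, not the only one, and `same_text` is just a name picked for this example:

```python
import unicodedata

composed = "caf\u00e9"     # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"  # 'e' followed by combining acute accent U+0301


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalising both to composed form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(len(composed), len(decomposed))  # code point counts differ
print(composed == decomposed)          # a naive comparison treats them as different
print(same_text(composed, decomposed)) # normalised comparison treats them as equal
```

            A search tool that wants to match both spellings has to normalise the pattern and the input the same way before comparing.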

    2. The Indomitable Gall
      Coat

      Re: Still work in progress

      > it has to parse the UFT-8 encoding,

      Yeah, and that stray U+200F there just shows how problematic UTF-8 can be. ;-D

    3. runt row raggy

      Re: Still work in progress

      if you can't wait, plan9port has Unicode support in its lightly modified awk, and has had for 20 years.

  5. Fonant
    Happy

    Who wrote grep?

    Best bit at the end of the interview: "Who wrote grep?"

  6. karlkarl Silver badge

    I am a big fan of Awk. I tend to use it in scripts that require lists, which are just a little out of reach of /bin/sh scripts. I still find it renders Perl and Python a little unnecessary for most sysadmin tasks.

    Since it is a requirement for SUS and POSIX platforms, and so is pretty much always around in various forms, I am fairly surprised it isn't used more as a Makefile generator. I used it for this not long ago and I was surprised how effective it was:

    https://gitlab.com/osen/openbsd_drmfb_gnuboy/-/blob/main/configure.awk

    1. Flocke Kroes Silver badge

      For a simple one-off query awk is excellent. For anything that might get used again I go straight to python. It might be simple today but an awk script will grow into a maintenance nightmare once feature creep inevitably sets in.

      1. A Non e-mouse Silver badge

        I used to do a lot more with awk in the past but nowadays Python tends to be my go-to scripting language.

        Is Python as fast or as efficient as tool X? Maybe not. But I rarely need absolute performance and, being a jack-of-all-trades, keeping a reasonable understanding of a small number of tools is the way to a simpler life.

      2. Michael Wojcik Silver badge

        I find my complex awk scripts are quite maintainable, but then I use advanced features like functions, comments, and sensible application of whitespace, which seem to be beyond many developers regardless of scripting language.

        I mostly use awk because I have much of it memorized, whereas I use Python so rarely that I have to keep looking things up, and if I'm writing a script it's often to massage diagnostic data to help me diagnose a problem, so I don't feel inclined to spend a lot of time buffing my skills.

        Also I don't find Python particularly attractive as a language, to be honest. I mean, it's better than Perl – but that's faint praise. Scoping-by-indentation is OK for blocks that fit on the page, but problematic if they go longer, so to create maintainable Python I want to do a lot of prefactoring into small abstractions, and that takes time I probably don't want to spend if I'm not writing product code.

  7. Colin Bull 1
    Happy

    a programming language for analyzing text files

    I think this is the understatement of the year. It is more an Extract Transform Load machine.

    Thanks to A and W and K.

    1. Liam Proven (Written by Reg staff) Silver badge

      Re: a programming language for analyzing text files

      [Author here]

      I know... but I have to find a way to keep it relatively simple. ETL isn't, and calling it an Extraction and Reporting Language would make people think of Perl and its name's expansion (even if it's probably a backronym).

  8. Martin-73 Silver badge
    Coffee/keyboard

    Aptly named "Git"

    C>N|K

    1. Liam Proven (Written by Reg staff) Silver badge

      Re: Aptly named "Git"

      [Author here]

      After 5 years working in Linux vendors in Central Europe, the thing is this:

      English is the language of business across much of Europe now. It's the one language you can rely on most people speaking.

      *But* the version they speak is, by and large, *US* English. For very advanced speakers, they use British pronunciation and spelling, but the vocab is US.

      And in US English, the word "git" is meaningless or a variant of "get" in the sense of "go" -- "git out of here".

      So they have no clue that the name _means_ something.

      1. Martin-73 Silver badge

        Re: Aptly named "Git"

        No clue... oof. One would have thought Ron's predilection for calling Harry a GIT in Harry Potter and the Deadly Server or whatever would have resonated somewhat with the now 20-somethings

  9. John Smith 19 Gold badge
    Thumb Up

    AWK has unicode support?

    Interesting.

    What I would have really liked was the ability to store procedure addresses in an array.

    This is a quintessential C idiom that's really handy for implementing fast, responsive UIs.
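
    For anyone who hasn't met the idiom, a rough sketch of the same idea in Python rather than C or awk: a table of handlers indexed by an event code, so dispatch is a single lookup instead of a chain of comparisons. The event codes and handler names are hypothetical:

```python
def on_open(name: str) -> str:
    return f"opening {name}"


def on_save(name: str) -> str:
    return f"saving {name}"


def on_quit(name: str) -> str:
    return "quitting"


# Hypothetical event codes; each one is an index into the handler table.
OPEN, SAVE, QUIT = range(3)

# The "array of procedure addresses": one callable per event code.
HANDLERS = [on_open, on_save, on_quit]


def dispatch(event_code: int, name: str = "") -> str:
    """Look the handler up by index and call it."""
    return HANDLERS[event_code](name)


print(dispatch(SAVE, "notes.txt"))
```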

  10. geoff61

    POSIX conforming awks have supported UTF-8 for almost 30 years

    Interesting that the author mentions POSIX systems are required to include awk but doesn't seem to realise that POSIX also effectively requires awk to support UTF-8, and that all of the systems which were updated to conform to POSIX.2-1992 and which added UTF-8 locales in the early to mid 1990s added UTF-8 support to awk (and all the other POSIX text-processing utilities) at that time. I say "effectively" because it's only required on systems that have at least one UTF-8 locale installed, so there's a loophole there; for some systems there may have been a short time where they supported older multi-byte encodings in awk before they added any UTF-8 locales. (UTF-8 was invented by Ken Thompson and Rob Pike right around the same time that POSIX.2-1992 was approved by IEEE.)

    1. Liam Proven (Written by Reg staff) Silver badge

      Re: POSIX conforming awks have supported UTF-8 for almost 30 years

      That _is_ interesting and I didn't realise that. Thanks!

  11. Anonymous Coward

    Off topic trivia

    > this news is akin to Moses appearing and announcing an amendment to the 10 commandments.

    Well, there are actually around 614 of them. If memory serves right, the difference is that those ten apply to everyone whereas the remaining ones only concern Israelites.

    1. Anonymous Coward

      Re: Off topic trivia

      What I don't recall is whether Bram came down the hill with all 614 in writing.

      If so, I hope it was in very small letters.
