Confused by crazy crashes? Check your Linux kernel virtual Ethernet code

A chunk of code added to the Linux kernel to help inter-container communication turned out to mess up checksum handling on Ethernet networks. Described here, the bug was in veth (Virtual Ethernet). As the description notes, the coding error allowed corrupt packets to get passed to a veth device for delivery to the application …

  1. Bronek Kozicki
    Facepalm

    As Donald Knuth says

    Premature optimization is the root of all evil.

    1. Destroy All Monsters Silver badge
      Holmes

      Re: As Donald Knuth says

      But this one is not premature so the saying doesn't apply.

      It seems to be a missing enum value, rather

      When going to the network:

      > CHECKSUM_UNDEFINED_AS_YET

      > CHECKSUM_UNNECESSARY_BUT_CAN_BE_DONE

      > CHECKSUM_TO_BE_DONE_BY_HARDWARE

      > CHECKSUM_ALREADY_DONE_BY_SOFTWARE

      When coming from the network:

      > CHECKSUM_UNVERIFIED_AS_YET

      > CHECKSUM_VERIFIED_BY_HARDWARE_AND_PASSED

      > CHECKSUM_VERIFIED_BY_HARDWARE_AND_FAILED

      > CHECKSUM_VERIFIED_BY_SOFTWARE_AND_PASSED

      > CHECKSUM_VERIFIED_BY_SOFTWARE_AND_FAILED

      Probably not possible to implement these because of reasons, but still.
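For illustration only, the two lists above could be sketched as a pair of enums. This is Python rather than kernel C; the class names `TxChecksum` and `RxChecksum` and the helper `may_deliver` are hypothetical, and the member names are simply the commenter's suggestions, not the kernel's actual flags.

```python
from enum import Enum, auto


class TxChecksum(Enum):
    """Hypothetical checksum states for a packet going to the network."""
    UNDEFINED_AS_YET = auto()
    UNNECESSARY_BUT_CAN_BE_DONE = auto()
    TO_BE_DONE_BY_HARDWARE = auto()
    ALREADY_DONE_BY_SOFTWARE = auto()


class RxChecksum(Enum):
    """Hypothetical checksum states for a packet coming from the network."""
    UNVERIFIED_AS_YET = auto()
    VERIFIED_BY_HARDWARE_AND_PASSED = auto()
    VERIFIED_BY_HARDWARE_AND_FAILED = auto()
    VERIFIED_BY_SOFTWARE_AND_PASSED = auto()
    VERIFIED_BY_SOFTWARE_AND_FAILED = auto()


def may_deliver(state: RxChecksum) -> bool:
    """Only packets whose checksum was verified and passed reach the application."""
    return state in (
        RxChecksum.VERIFIED_BY_HARDWARE_AND_PASSED,
        RxChecksum.VERIFIED_BY_SOFTWARE_AND_PASSED,
    )
```

With the states split by direction, a transmit-side value such as `UNNECESSARY_BUT_CAN_BE_DONE` has no receive-side equivalent, so it cannot be read as "already verified" on the way in. That appears to be the kind of confusion the comment is getting at.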

      1. Bronek Kozicki

        Re: As Donald Knuth says

        From the article, this very much smells like an optimisation to me, because of "Pandurangan concluded that the bug was intended to be a feature. If two containers are passing packets on the same machine, there's no need for checksums"

  2. Warm Braw

    Why do we accept flaky network hardware?

    Checksums were added to TCP/IP because of bit rot in the early IMPs and we've kind of accepted that network hardware is going to be dodgy ever since. Note that the IP header checksum isn't to protect against transmission errors (that are detected by the link layer checksum), but against errors occurring after an error-free reception.
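As a rough sketch of what the IP header checksum computes, here is the 16-bit ones' complement sum described in RFC 1071, written in Python. The function name `internet_checksum` and the sample header bytes are illustrative assumptions, not taken from the article.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones' complement checksum of data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into 16 bits
    return ~total & 0xFFFF


# A sample 20-byte IPv4 header with its checksum field (bytes 10-11) set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = internet_checksum(header)

# The receiver sums the header with the checksum filled in; an intact header gives 0.
filled = header[:10] + checksum.to_bytes(2, "big") + header[12:]
receiver_result = internet_checksum(filled)
```

For this sample header the computed checksum is `0xb861`, and `receiver_result` is `0`, which is how a receiver would conclude the header arrived intact.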

    Yet we don't extend the same leeway to storage hardware, for example: write a block of data from memory onto a disk and there's no checksum to protect against bus errors in the transfer. If our data were accumulating corruption over time, I think we'd have noticed, so I can only assume these devices simply work better.

    So the question, really, is how are these network card vendors getting away with it?

    1. Hugh McIntyre

      Re: Why do we accept flaky network hardware?

      Re: "Yet we don't extend the same leeway to storage hardware, for example: write a block of data from memory onto a disk and there's no checksum to protect against bus errors in the transfer."

      Systems like ZFS do indeed use checksums (or other techniques in some non-ZFS servers) to protect against errors in server or storage hardware, including errors in data busses, disk firmware, etc.

    2. Bronek Kozicki

      Re: Why do we accept flaky network hardware?

      You cannot have 100% guaranteed error-free data transmission, because it relies on cables and other hardware which, at the lowest level, is analogue. And we like to push it to the limits. Network card vendors are "getting away with" implementing checksums at the hardware level but of course, you cannot use that if you want network-hardware-free communication between two containers sitting on the same box. Unless you really like to use interrupts more than strictly necessary.

    3. Anonymous Coward

      Re: Why do we accept flaky network hardware?

      The shielded coaxial cable used in 10Base2 Ethernet in the 80s and 90s did mitigate errors due to environmental EM factors. Unfortunately money talks and unshielded twisted pair was cheaper and less hassle for lazy network engineers to install and so became the standard. So the error rate went up and instead of a few cables for a room full of computers you have one cheap cable per computer and network switches with more cables coming out of them than a small telephone exchange and ducting thicker than someone's arm. And good luck figuring out where the cable plugs into the switch if they're not labelled and they're all the same colour.

      1. Anonymous Coward

        Re: Why do we accept flaky network hardware?

        "shielded coxail cable used in 10 base 2 ethernet in the 80s and 90s did mitigate against errors due to enviromental EM factors"

        Tell that to a trapped, pinched co-ax cable run under a desk or across a doorway, where heavy foot traffic and damage occurs, with a badly crimped BNC connector in the mix.

        No physical transmission of data can be guaranteed to be free of error, however good the quality of the medium used to transmit that data. ALWAYS assume there is the possibility that an external factor is messing with your data transmission quality and reliability, and therefore code to catch these issues so the data can be rebuilt correctly or requested again.

        Checksums are one quick method to do this. They're not 100% reliable, but they're far better than turning them off altogether because you are working in 'the cloud/container' where data never gets messed up.

        Hope for the best. Code for the worst.
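To make the "not 100% reliable" point concrete, here is a small Python sketch using a 16-bit ones' complement sum of the kind TCP/IP uses. The function name `internet_checksum` and the sample payloads are made up for illustration. Because addition is commutative, swapping two 16-bit words leaves the checksum unchanged, while a single flipped bit is caught.

```python
def internet_checksum(data: bytes) -> int:
    """Return the 16-bit ones' complement checksum of data."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


original = b"\x12\x34\x56\x78"
swapped = b"\x56\x78\x12\x34"      # the same two 16-bit words in the wrong order
bit_flipped = b"\x12\x35\x56\x78"  # one bit changed in the second byte

swap_detected = internet_checksum(swapped) != internet_checksum(original)
flip_detected = internet_checksum(bit_flipped) != internet_checksum(original)
```

Here `swap_detected` is `False` and `flip_detected` is `True`: the reordered payload slips through, the single-bit error does not.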

        1. Anonymous Coward

          Re: Why do we accept flaky network hardware?

          "Tell that to a trapped, pinched co-ax cable run under a desk or across a doorway, where heavy foot traffic and damage occurs, with a badly crimped BNC connector in the mix."

          That's why it had to be installed by someone who knew what they were doing, not just hooked up ad hoc by anyone who could wield a pair of pliers. And twisted pair, if it's not installed properly in the room, becomes a hopeless rat's nest a lot faster.

          I should also add that 10Base-T completely defeats the point of using a CSMA/CD system like Ethernet in the first place; you might just as well wire the machines up using a serial protocol like USB.

          1. BinkyTheMagicPaperclip Silver badge

            Re: Why do we accept flaky network hardware?

            Riiiight - so you want absolutely all networks, from the most basic home network, to the most complex enterprise network to be installed by a professional. That's not going to fly with Mrs Miggins linking her router and two computers at home.

            For all its faults <n>BaseT Ethernet works well for most people, 10Base2 was an abomination.

            I'd also point out that USB in no way compares with Ethernet, especially given its length limits.

            1. Down not across
              Pint

              Re: Why do we accept flaky network hardware?

              10Base2 wasn't that bad. I guess you never had to try to find a fault in 10Base5. One bad vampire tap, either badly cored or not placed at a 2.5 metre interval, could do interesting things to your network.

              Icon. Well because just remembering 10Base5 means I need a few, or something stronger.

            2. Anonymous Coward

              Re: Why do we accept flaky network hardware?

              "I'd also point out that USB in no way compares with Ethernet, especially given its length limits.."

              I said a serial protocol like USB. And properly implemented serial protocols shouldn't have length limits measured in single digits, which tells you all you need to know about the original cheap-as-chips USB design. Christ, even the parallel Centronics interface could go further than USB, and parallel systems are notorious for their length limitations. Besides, 10Base-T essentially IS a serial protocol in all but name, but with pointless extra link layer packet header parsing overhead.

      2. Down not across

        Re: Why do we accept flaky network hardware?

        "Unfortunately money talks and unshielded twisted pair was cheaper and less hassle for lazy network engineers to install and so became the standard. So the error rate went up and instead of a few cables for a room full of computers you have one cheap cable per computer and network switches with more cables coming out of them than a small telephone exchange and ducting thicker than someone's arm."

        It's not all bad though. You get full speed to the switch rather than sharing it with every other workstation on the segment. Also, one bad NIC or cable doesn't take down the whole segment and everything on it.

  3. Anonymous Coward
    Childcatcher

    Code inspection?

    >someone noticed that there was an important difference between the test environments and the live systems<

    That's why we need devops, so there is only one system!
