Tiny Linux kernel tweak could cut datacenter power use by 30%, boffins say

Hardware keeps getting faster, but it's still worth taking a step back periodically and revisiting your code. You might just uncover a little tweak that wrings out more efficiency or extra throughput than you'd expect. That's exactly what researchers at the Cheriton School of Computer Science at the University of Waterloo have …

  1. Mage Silver badge
    Coffee/keyboard

    if hardware is twice as fast next year

    That was very true 1975 to 1995, a bit less 1996 to 2006. Generally not true at all now. The party is over if you care about power and cooling. For laptop users, an SSD vs HDD for the OS now has more impact than a new model vs a five-year-old one.

    1. An_Old_Dog Silver badge
      Windows

      Re: if hardware is twice as fast next year

      I'd find it humorous if designers hit a previously-unknown hardware-speed ceiling, and everyone went back to assembler and optimised FORTRAN to achieve further effective-speed increases.

      1. Herring` Silver badge

        Re: if hardware is twice as fast next year

        I feel old in that I have spent a lot of time getting C++ code to run faster - execution profilers, custom heap management, that sort of stuff. These days even assembler is a high-level language - the processor is performing all sorts of shenanigans with register renaming, out-of-order execution, speculative execution, contributing to the climate breakdown that may kill us all ...

        1. navidier

          Re: if hardware is twice as fast next year

          >I feel old in that I have spent a lot of time getting C++ code to run faster - execution profilers, custom heap management, that sort of stuff. These days even assembler is a high-level language - the processor is performing all sorts of shenanigans with register renaming, out-of-order execution, speculative execution, contributing to the climate breakdown that may kill us all ...

          Have you ever looked at your memory map? I find many people don't recognise that a C++ object is basically just a C struct with added subroutines (methods). This means that the memory layout is *forcibly* constrained by the order of member declarations. The compiler *must* add padding between members of different sizes where needed to maintain "natural" memory alignment. So, e.g. on a modern computer, if you first declare a _char_ object followed by a _double_ the compiler must add 7 bytes of unused padding so that the double is naturally aligned.

          Unfortunately, at least in the software I've had the pleasure/pain to investigate, developers tend to declare object members in order of perceived functionality rather than memory size. I recall one instance where I went through a fairly complex object, re-ordering declarations so that all methods (presumed to be 64 bit pointers) came first, then all doubles, then all ints/floats, and finally byte-sized members (manually adding dummy members where necessary to maintain element alignment). I was staggered at the improvement that one change had on the overall performance of a fairly complex programme - it was in the tens of percent!
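          To make the padding point concrete, here's a minimal sketch (the layouts shown are typical for a 64-bit target; the exact numbers are ABI-dependent):

          #include <iostream>

          // Declared in "perceived functionality" order - padding creeps in.
          struct Unordered {
              char   flag;    // 1 byte + 7 bytes padding so the double is aligned
              double value;   // 8 bytes
              char   tag;     // 1 byte + 3 bytes padding
              int    count;   // 4 bytes
          };                  // typically 24 bytes

          // Same members, declared largest-first - the padding disappears.
          struct Ordered {
              double value;   // 8 bytes
              int    count;   // 4 bytes
              char   flag;    // 1 byte
              char   tag;     // 1 byte + 2 bytes tail padding
          };                  // typically 16 bytes

          int main() {
              std::cout << sizeof(Unordered) << " vs " << sizeof(Ordered) << '\n';
          }

          (GCC and Clang will happily point the padding out with -Wpadded, but they won't reorder the members for you.)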

          I'm not aware whether such refactoring is available in modern toolchains. I hope it is, because the compiler itself is handicapped by having to follow what is laid out in the object definition.

          1. Mage Silver badge
            Boffin

            Re: re-ordering declarations

            That is something that could be done automatically by a compiler. However while it might "waste" a bit of RAM it would have a negligible effect on performance.

            Actually, the early C++ compilers for Xenix, MS-DOS, Atari, Amiga etc. - written in 1987 by Glockenspiel in Dublin - simply converted the source to C and then used a pre-existing C compiler for that platform.

            1. Brewster's Angle Grinder Silver badge

              Re: re-ordering declarations

              "However while it might "waste" a bit of RAM it would have a negligible effect on performance."

              Two words, my friend: "cache lines."

              Although normal methods don't appear in the structure, so their location doesn't matter. And IIRC there are compiler switches and pragmas to deal with efficient packing and ordering of members, because mucking about with a structure will cause ABI issues, so the compiler can't do it by default. (These days you even have _Alignof/alignof to query the alignment.)
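              E.g. a trivial sketch of querying it (standard C++11 onwards, nothing compiler-specific; the numbers in the comments are what a typical 64-bit ABI gives):

              #include <cstddef>
              #include <iostream>

              struct Mixed { char c; double d; int i; };

              int main() {
                  std::cout << alignof(double) << '\n'     // typically 8
                            << offsetof(Mixed, d) << '\n'  // typically 8, i.e. 7 bytes of padding after c
                            << sizeof(Mixed) << '\n';      // typically 24
              }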

            2. bazza Silver badge

              Re: re-ordering declarations

              I remember C++ compilers of that style. Alas, I didn't get into C++ until long after "real" C++ compilers came along, and thought it a pity that I didn't pay attention earlier. Seeing how C++ renders down to C is probably an excellent way for C developers to thoroughly understand what C++ is actually all about. Just what is a v-table and why would I iterate over it? Well, here it is!

              Possibly the same would be true of Rust. It could be interesting in the C vs Rust debate if Rust could be illustrated as "here's the equivalent C", and see what people make of it.

              1. Blazde Silver badge

                Re: re-ordering declarations

                Since you've mentioned it. Modern languages like Rust and Swift - possibly others - do reorder fields to save space. Example: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=da1828bceb324bb7ce62457f0c5bf8e2

                C & C++ are constrained by language rules which say fields can't be reordered, in part for predictable interfacing with external code, and in part because that's the way it's always been done and plenty of hacky, otherwise memory-unsafe legacy code relies on it.

            3. navidier

              Re: re-ordering declarations

              > That is something that could be done automatically by a compiler. However while it might "waste" a bit of RAM it would have a negligible effect on performance.

              It can't, it *MUSTN'T*! Otherwise libraries written on disparate systems would not be able to intercommunicate.

              My awareness of this goes back to the days of the Atari ST. People complained that hand-crafted assembler code crashed when accessing a particular C struct. It turned out that the particular struct was defined as something like (simplified)

              { byte b; int i; }

              The 68000 processor could not access words at odd-numbered addresses. The C compiler correctly added a byte of padding between b and i, but naive assembler implementations tried to access i at the address of b plus 1.

              1. UncleDavid

                Re: re-ordering declarations

                Reminds me: I once wrote an efficient math library for the Atari ST. When I released it, someone compiled it on an Apollo box (also M68000) and complained it didn't work.

                The floating point representations were entirely different.

                Not only that, but he said "and I don't understand the source code". No shit.

          2. Anonymous Coward
            Anonymous Coward

            Re: if hardware is twice as fast next year

            Meanwhile in W11 they do the opposite - make it as slow as possible so you 'need' that new AI PC!

          3. Sceptic Tank Silver badge
            Trollface

            #pragma pack(0)

            It's going to be mayhem if compilers start moving struct members around when calling a shared library that expects that struct to be organised in the way it appears in the header. Would make exposing APIs next to impossible.

            1. yoganmahew

              Re: #pragma pack(0)

              The start of the struct is the boundary, not the contents. Within the struct, you have to optimise manually.

              This is the way, well, this has been the way on IBM mainframes since I was a boy. LTORG will do a bunch of smart organising of literals for you, DSects you do yourself and pay attention to what you're doing.

              Obligatory "don't they teach anything in school these days?"

              1. Spamfast

                Re: #pragma pack(0)

                Using a compiler-specific attribute (such as GCC's __attribute__((packed))) to fully pack a structure removes all alignment padding and so reduces the size in memory and the cache usage.

                However, as mentioned, on some architectures attempting to read/write anything bigger than a byte at a non-aligned address causes a fault, which can abort the program (or hard fault the CPU in the kernel, or on embedded bare metal or an RTOS, usually causing a reboot).

                Some CPUs allow an exception handler to analyse the fault and pull/put the bytes out/in one by one so that the user's assembler or higher-level code doesn't have to worry about it, but that's going to be very CPU-intensive itself and may reset the instruction pipelining and caching.

                Some architectures handle unaligned access transparently entirely within the hardware, but there is generally a bus-cycle penalty if two 32- or 64-bit words have to be read or written across the memory bus, and again this can cause stalls in the hardware optimisers.

                So it's a good idea to have an understanding of your hardware platform even when writing apps in userland on a POSIX, Windows or other high level OS.
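                To put rough numbers on it, a minimal sketch using the GCC/Clang attribute (sizes in the comments are typical for x86-64 and will vary by ABI):

                #include <iostream>

                struct Padded {                          // natural alignment
                    char type;                           // 1 byte + 3 bytes padding
                    int  length;                         // 4 bytes
                };                                       // typically 8 bytes

                struct __attribute__((packed)) Packed {
                    char type;                           // 1 byte
                    int  length;                         // 4 bytes, now mis-aligned
                };                                       // 5 bytes

                int main() {
                    std::cout << sizeof(Padded) << " vs " << sizeof(Packed) << '\n';
                    // Smaller, and friendlier to the cache - but every access to
                    // Packed::length is now unaligned: cheap-ish on x86-64, a fault
                    // on some other architectures, which is the trade-off above.
                }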

          4. really_adf

            Re: if hardware is twice as fast next year

            > I'm not aware whether such refactoring is available in modern toolchains. I hope it is, because the compiler itself is handicapped by having to follow what is laid out in the object definition.

            As has been pointed out by others, the compiler can't reorder things because it would break interoperability. But even ignoring that (ie if the compiler somehow knows it could reorder), the order that is optimal for performance will typically (on systems with caching and prefetching) depend on access patterns and would be extremely difficult, if not impossible, for the compiler to correctly predict.

            I don't see the compiler as "handicapped". Rather I see it just doing what the developer told it to, simultaneously solving interoperability and keeping it simpler. There is an underlying assumption that the developer knows what they are doing, but for C++ (or C) you have bigger problems if that isn't the case.

          5. that one in the corner Silver badge

            Re: if hardware is twice as fast next year

            > re-ordering declarations so that all methods (presumed to be 64 bit pointers) came first

            Ah, no.

            If you have any virtual methods, then you have one pointer at the start of the instance, pointing to the VMT[1]. Any non-virtual methods take up no space in the instance; they are just normal functions in the object file[2].

            [1] hopefully; if you inherited from a non-virtual base class then the VMT pointer is not at the start, and you possibly have more VMT pointers in odd places if you have been playing around with multiple inheritance; good luck trying to spot the padding bytes needed then!

            [2] albeit with funky mangled names to make the magic work.
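            A quick way to watch the VMT pointer appear, as a rough sketch (the exact layout is up to the ABI, but the size difference is the giveaway):

            #include <iostream>

            struct Plain   { double x; void f() {} };          // non-virtual: no hidden members
            struct Virtual { double x; virtual void f() {} };  // one hidden VMT pointer

            int main() {
                std::cout << sizeof(Plain) << " vs " << sizeof(Virtual) << '\n';  // typically 8 vs 16
            }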

      2. Mage Silver badge
        Facepalm

        Re: went back to assembler

        No, that would be worse. Since the late 1980s or early '90s it's been possible to write video games and drivers in an HLL that runs as fast as assembler. Then there is the debugging and maintenance. And Fortran is 1956 technology. Language design and compiler development made that obsolete by the late 1970s. It's just some legacy scientific programs not worth re-writing that has kept Fortran alive.

        1. John Smith 19 Gold badge
          Unhappy

          "It's just some legacy scientific programs not worth re-writing that has kept Fortran alive."

          Except a lot of those programs are in fact whole systems that represent staff centuries of effort and whose numeric parts have been very carefully worked out in order to preserve numeric accuracy, so the results are actually useful.

          The issue with a ground-up re-write is you spend a f**ktonne of effort (which you have to validate against the old code to confirm it does exactly what the old code did correctly) before you add anything new.

          Pretty much the issues with Y2K in fact, for every package.

          So when PHBs look at benefit/cost they think "Naah. Not worth it."

          Note this does not mean FORTRAN development is static; it does mean you need a different approach, focussed more on re-factoring the code and the use of preprocessors, for example (and this is hypothetical) converting real constants into ratios of integers, possibly giving more numeric precision than the reals. For example, 355/113 gives pi to 6 decimal places.

          1. Claptrap314 Silver badge

            Re: "It's just some legacy scientific programs not worth re-writing that has kept Fortran alive."

            If you bring people in with the equivalent education of the ones who did the original work, this really shouldn't be that bad. In fact, a lot of old Fortran predates IEEE-754, which means that a lot of algos can be rewritten to take advantage of the guarantees.

            Sure, I wouldn't trust this unless someone was a mathematician with a background in floating point, but I cannot be the only one....

            1. John Smith 19 Gold badge
              Unhappy

              "this really shouldn't be that bad. "

              Consider this from a management PoV (and with codebases this big there is always someone managing it).

              To replicate this system will take 1 dev x <dev salary where you are in new language X > x 100 years of build and test = fu***onne of money.

              Time to next release = time to rebuild existing codebase in new language + time to build/test new features = A very long time.

              In an ideal world people would periodically rewrite systems from the ground up to flush problems out and allow best practice to be applied to the whole codebase. *

              But I don't know anyone who lives in that world.

              *It has been done occasionally. I think an Israeli company did it with mainframe software and their tool used the Plan Calculus to map the existing code and rewrite in (IIRC) C.

        2. Phil O'Sophical Silver badge

          Re: went back to assembler

          Fortran is 1956 technology.

          I suspect that the people who develop the Fortran-2023 compilers would disagree.

          1. jake Silver badge

            Re: went back to assembler

            "I suspect that the people who develop the Fortran-2023 compilers would disagree."

            I suspect that the people working with Fortran today would just laugh at the ignorance being displayed.

    2. Mage Silver badge

      Re: if hardware is twice as fast next year

      Actually, for cost, reliability and capacity I have Linux on the laptop's PCIe SSD, but /var and /home are two partitions on an added 2TB 2.5″ SATA laptop drive (one small and the other the balance, as the laptop isn't running a humongous web site from /var). A bootable USB stick Linux makes it easy to configure that after the OS is installed on the SSD.

    3. Mike007 Silver badge

      Re: if hardware is twice as fast next year

      > That was very true 1975 to 1995, a bit less 1996 to 2006. Generally not true at all now.

      My personal desktop was built 7 years ago. I went overboard and splashed out on the highest spec machine it was possible to build. So high end that I gave it the hostname "BEAST".

      Recently, for shits and giggles, I did a comparison to the low-spec laptops we buy for office users. They have faster machines than me. No giggles when I saw that. (However, they do at least only get 16GB RAM compared to my desktop's 32GB.)

      (For the record, still a pretty fast machine... Why are we giving office users such ridiculously over spec CPUs?)

      1. HorseflySteve

        Re: if hardware is twice as fast next year

        I had this issue too. I specced my own desktop & spent what for me was a lot of money so I would have a fast, future-proof machine that would see me through a decade, or so I thought.

        Within 3 years, entry level machines outperformed it: Lesson learned.

        As to why everyone has incredibly powerful machines, well, there are some use cases, such as CAD & physical system simulations a.k.a. CAE, that really need them; there are managers that really want them as status symbols; and there are IT departments that want to keep inventory simple by keeping variance to a minimum. Also, just as work increases to fill the available time, so OS and office-tool bloat increases to waste the available resources, and the hi-spec PC, in the majority of cases, produces no more useful output than the DOS/Windows-for-Workgroups PC did but (arguably) looks a lot prettier and has lots of toys & adverts.

  2. Doctor Syntax Silver badge

    The server would have had to have been spending an awful lot of time in kernel to get that sort of advantage.

    1. diodesign (Written by Reg staff) Silver badge

      Kernel

      Well, it is the network stack. Not exactly the device tree parser O:-)

      C.

      1. Doctor Syntax Silver badge

        Re: Kernel

        Indeed it is. But even if the result of the fix is that virtually no time is spent on handling the network stack, prior to that nearly a third of the time was - along with all the other stuff the kernel has to do. I doubt that the time not now spent on handling it will be spent in an idle state using no power; there'll always be more stuff to do on a server.

        1. sabroni Silver badge

          Re: there'll always be more stuff to do on a server.

          Did you read the article?

          This is about stopping the interruptions that a busy network causes when the server is under load. ATM the code processing a network packet gets interrupted because another packet has arrived. That task switch increases the server load without doing anything useful and slows down the processing of both packets.

          This fix makes the server code wait until the first packet is processed before picking up packet two when the server is under load.

          1. Andrew Scott Bronze badge

            Re: there'll always be more stuff to do on a server.

            Doesn't most network hardware support interrupt mitigation?

      2. cyberdemon Silver badge

        Re: Kernel

        Well true, but I am also skeptical about the headline claim, unless it is specifically talking about routers and perhaps fileservers, rather than servers in general.

        It certainly wouldn't make a dent in "datacenter power use" on a global scale, since that figure is utterly swamped by AI malarkey.

        1. jake Silver badge

          Re: Kernel

          I suspect it's 30% of the power used by networking specifically, not the total power used by the data center.

          When you consider the power cost of cooling alone, at around 50% ... well, need I say more?

          Shit, even loss due to power conversion can run to around 10% of the total ...

          1. cyberdemon Silver badge
            Flame

            Re: Kernel

            Exactly, but that's NOT what the headline, the article, or the original linked-to article from the University says.

            > Researchers at the Cheriton School of Computer Science have developed a small modification to the Linux kernel that could reduce energy consumption in data centres by as much as 30 per cent. The update has the potential to cut the environmental impact of data centres significantly, as computing accounts for as much as 5 per cent of the world’s daily energy use.

            That could lead naive and non-technical readers to believe that this could save almost 2% of global energy use, which is of course nonsense.

      3. Miko

        Re: Kernel

        Your mention of the device tree parser reminded me that on Windows, disabling the Microsoft Device Association Root Enumerator is reported to fix many Elden Ring microstutters...

        I have no idea what that thing is actually doing in the background, but kernel interrupts being somehow involved would not surprise me

    2. Anonymous Coward
      Anonymous Coward

      Something like quad 10GbE (40Gb?) network interfaces with an Intel Atom synchronizing things.

    3. TReko Silver badge

      This research smells of academic theoretical BS to me.

      The typical interrupt service routine and kernel context-switch overhead is tiny. Parsing a few lines of JSON strings in user-space probably takes more instructions.

      Besides this, all network hardware in the last 15 years has had buffer coalescing, so interrupts come through at variable rates depending on the load.

    4. DS999 Silver badge

      If you're pushing data out 100 GbE ethernet interfaces

      You're already well past what a single core can accomplish just in the kernel - and that's despite all the stuff like checksumming that's offloaded to the NIC in such high end devices.

      So I would readily believe a 30% speedup for the kind of loads where you have dozens of cores spending 100% of their time in kernel/interrupt processing.

      1. Solviva

        Re: If you're pushing data out 100 GbE ethernet interfaces

        You can flood a 100 GbE interface with a single core, but this requires kernel-bypass techniques - DPDK or libVMA to name a couple. At high packet rates the context switching between userspace and kernel is a significant overhead which these schemes avoid, with various disadvantages along the way, so they aren't complete plug-in replacements.

  3. Solviva

    Wonder how this compares to good old interrupt moderation? Sounds like a similar idea - instead of interrupting on every packet, wait until either X packets have arrived (heavy traffic) or Y time has passed since the last interrupt (light traffic), then fire an interrupt.

    1. bazza Silver badge

      The patch docs talk about NICs coalescing interrupts anyway, and explain that as the NIC knows nothing about the application's behaviour it's "guessing", and therefore suboptimal.

      The mechanism that's been introduced in effect allows an application to give a firm hint as to what degree of coalescing does make sense from the application's point of view, with the side effect that (if judged well) the application will always be asking for data just as the NIC / kernel was beginning to think about raising an interrupt.

      1. Caver_Dave Silver badge

        I've been dynamically changing the interrupt coalescing parameters for (more than 10) years.

        Very simple code means that if there are few packets then the coalescing becomes 0, if there are many the coalescing will approach either the H/W limit or a S/W set maximum response time limit.
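        As a rough sketch of the sort of thing involved, from userspace via the standard ethtool ioctl (the interface name is made up, it needs CAP_NET_ADMIN, and not every driver honours every field):

        #include <linux/ethtool.h>
        #include <linux/sockios.h>
        #include <net/if.h>
        #include <sys/ioctl.h>
        #include <sys/socket.h>
        #include <unistd.h>
        #include <cstdio>
        #include <cstring>

        int main() {
            int fd = socket(AF_INET, SOCK_DGRAM, 0);            // any socket will do as a handle

            ifreq ifr{};
            std::strncpy(ifr.ifr_name, "eth0", IFNAMSIZ - 1);   // hypothetical interface

            ethtool_coalesce ec{};
            ec.cmd = ETHTOOL_GCOALESCE;                         // read the current settings
            ifr.ifr_data = reinterpret_cast<char*>(&ec);
            if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { std::perror("ETHTOOL_GCOALESCE"); return 1; }
            std::printf("rx-usecs=%u rx-frames=%u\n", ec.rx_coalesce_usecs, ec.rx_max_coalesced_frames);

            // "Heavy traffic" policy: hold interrupts back for up to 100us or 64 frames;
            // a light-traffic policy would push these back towards 0.
            ec.cmd = ETHTOOL_SCOALESCE;
            ec.rx_coalesce_usecs = 100;
            ec.rx_max_coalesced_frames = 64;
            if (ioctl(fd, SIOCETHTOOL, &ifr) < 0) { std::perror("ETHTOOL_SCOALESCE"); return 1; }

            close(fd);
        }

        (ethtool -c / -C does the same job from the shell.)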

        1. bazza Silver badge

          It'd be interesting to know if your experiences with this accord with what is being reported for this in-kernel implementation.

  4. Anonymous Coward
    Anonymous Coward

    Economy's kernel

    Similar or larger gains are achievable in the economy, if the tax code etc were properly reviewed and cleaned up.

    Somehow the status quo has been maintained in countries' governance, as if the existing policies were wisdom itself. But so much has changed and the changes are accelerating. So maybe Elon's Dept of Efficiency is not such a bad idea.

  5. Rich 2 Silver badge

    Confused

    I’m confused about how this works in the kernel. Surely your application needs to do the switching between interrupt and polling, because (a) only the application knows when it wants to read more data and (b) it is the application that needs to implement the interrupt handler or polling calls.

    So how is this made “automatic” in the kernel? What am I missing?

    Or does this switching only relate to feeding data into the kernel's internal buffers for the application to read later? If yes, then it still doesn’t answer point (a).

    1. Mage Silver badge

      Re: Confused

      The application doesn't know it wants to read more data till the device driver (in the kernel?) tells it. The application only knows when it wants to send data.

      A simplification

      Obviously a Web server can't know when network requests will happen; it only responds later by sending. A Web browser issues requests (on the whim of the user), which is a small amount of data sent (unless it's a form post or upload), and then it waits for a response. So a server (of any kind of protocol) in a data centre (like a telephone exchange) can only roughly know when busy periods are, but can't anticipate exactly the instant of demand, or the density of it. It's quite a different network load & traffic to a workstation/laptop with a single user on a GUI.

    2. Brewster's Angle Grinder Silver badge

      Re: Confused

      Wild guess: the kernel knows whether a process is actively waiting on the socket or busy eating processor cycles. That's probably enough of a clue.

    3. containerizer

      Re: Confused

      Applications indicate they want to read more data by notifying the operating system via select()/poll()/epoll(). These mean "stop me, then wake me up when more data is available". The details of how to discover whether or not data is available are hidden by the kernel so the application doesn't need to care, it just needs to correctly use the proper API. Note that the poll(), epoll() calls are not a direction to the kernel to use polling instead of interrupts.

      Imagine it's a web browser waiting for a response from a server. Simplified, the application opens a socket and sends data to the server and waits for the response by calling poll() on the socket.

      When that happens, the kernel kicks in. If there is no data on the socket, the kernel won't reschedule the application. Later, the server responds - an interrupt occurs, the kernel reads the data and then places it in the application's socket buffer. The application is then rescheduled and can immediately read the data as soon as it is resumed.

      In the alternative world with polling, the application does not have to change at all. Instead of waiting for an interrupt, a timer inside the kernel periodically wakes up and checks for data. If it is there, the kernel follows the same series of steps as it does when it gets an interrupt.

      Some network cards, typically the higher-end ones, already effectively support this with a feature called "interrupt coalescing" where they will wait for their buffers to fill up, or for a timer to expire, before notifying the CPU.

      The approach mentioned in the article is likely to be beneficial in high throughput scenarios, but not all. There is a crossover point; if your I/O is frequent and regular, polling is more efficient than interrupts due to the extra interrupt servicing overhead. If the I/O is more patchy, polling may waste CPU cycles through doing polling work which rarely finds available data, and it may also introduce latency as a packet will have to wait for the next polling interval before being serviced. According to the article they're adopting a hybrid approach to switch between polling and interrupts depending on the conditions, which is clever. Tune that right and I'd say this feature will end up being enabled by default for most deployments.

      Interrupts are great if your I/O comes and goes at random, relatively infrequent intervals, which might be the case for compute-bound workloads.
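      To make that concrete, a bare-bones sketch of the application side (error handling trimmed, port number arbitrary) - note there is nothing in it that says "interrupt" or "poll"; that choice lives entirely inside the kernel:

      #include <sys/epoll.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <unistd.h>

      int main() {
          int lfd = socket(AF_INET, SOCK_STREAM, 0);
          sockaddr_in addr{};
          addr.sin_family = AF_INET;
          addr.sin_addr.s_addr = htonl(INADDR_ANY);
          addr.sin_port = htons(8080);
          bind(lfd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
          listen(lfd, 16);

          int epfd = epoll_create1(0);
          epoll_event ev{};
          ev.events = EPOLLIN;
          ev.data.fd = lfd;
          epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

          for (;;) {
              epoll_event events[64];
              // Blocks here until the kernel says something is readable; how the
              // kernel found out (IRQ, its own polling, or the new hybrid scheme)
              // is invisible to the application.
              int n = epoll_wait(epfd, events, 64, -1);
              for (int i = 0; i < n; ++i) {
                  if (events[i].data.fd == lfd) {
                      int cfd = accept(lfd, nullptr, nullptr);
                      epoll_event cev{};
                      cev.events = EPOLLIN;
                      cev.data.fd = cfd;
                      epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &cev);
                  } else {
                      char buf[4096];
                      ssize_t r = read(events[i].data.fd, buf, sizeof buf);
                      if (r <= 0) close(events[i].data.fd);   // peer gone
                  }
              }
          }
      }

      Whichever way the kernel learned about the data, the epoll_wait() call looks exactly the same to the application, which is why the kernel can flip strategies underneath without applications changing.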

    4. Rich 2 Silver badge

      Re: Confused

      I’m still confused….

      @Mage

      “ The application doesn't know it wants to read more data till the device driver (in the kernel?) tells it. The application only knows when it wants to send data.”

      > No! The application knows full well when it is READY to process more data. The kernel has no idea. Whether there is any data to read is another issue and is the whole point of this change. If the application is waiting for the NIC driver to tell it that there is data, then the NIC driver must be polling the NIC or waiting for an interrupt from it - which was exactly the question I posed - i.e. does this work by just controlling how the kernel reads data into its internal buffers?

      @Brewsters Angle….

      “ Wild guess: the kernel knows whether a process is actively waiting on the socket or busy eating processor cycles. That's probably enough of a clue”

      > No! That would be just wild speculation by the kernel. Just because the application is busy, it doesn’t mean it wouldn’t rather be processing incoming data and would like to be interrupted and told there is some.

      @containerizer

      “Applications indicate they want to read more data by notifying the operating system via select()/poll()/epoll()”

      > THAT is exactly the point I was making - it is the application that must make this decision. Not the kernel. Of course the application needs to convey that decision to the kernel so the kernel can modify the driver’s behaviour. But it’s still the application that is making the decision

      1. containerizer

        Re: Confused

        > THAT is exactly the point I was making - it is the application that must make this decision. Not the kernel.

        I don't understand. What decision ?

        select/poll/epoll do not make any decision. They simply cause the application to block until certain conditions (eg arrival of data) arise. Whether or not that condition is detected via the arrival of an interrupt or via polling is hidden from the application.

  6. bazza Silver badge

    Have They Measured the Whole Problem?

    It's an interesting idea, but I do wonder.

    One of the reasons to respond to NIC IRQs is to get that packet in, in memory somewhere, fast, so that more packets arriving on the network can all make their way through the limited buffers within the NIC. If the traffic load is such that the application isn't really keeping up, and one is now polling for available packets, it seems to me that there's the potential for network packets to get dropped. There is a parameter irq_suspend_timeout involved, which seems to be a timeout to ensure that the OS will start paying attention to the NIC if the application has taken too long to ask for more data. The suggestion is that this is tuned by the app developer "to cover the processing of an entire application batch"; but, what about the NIC's ability to continue to absorb packets in the meantime?

    The patch documentation doesn't mention packet loss, drop, etc at all, so I'm presuming that actually that's covered off somehow, hence no need to explain the risk of it.

    The thing is, dropped network packets will start having a big impact on the amount of energy consumed by the network itself. It costs quite a lot of power to fire bits down lengths of fibre or UTP, and the energy cost of dropped packets starts being more than a doubling of that power (because there's more network traffic than just re-sending the dropped ones). So that's why I'm interested in whether or not they have got packet dropping covered off somehow.

    However, on the whole, a clever idea and well worthwhile!

    Tuning Architectures?

    Ultimately, if more network traffic is being fired at a host than the host can consume, then the architecture is perhaps wrong, or wrongly scaled. Our networks effectively implement Actor Model systems, which are notable in that a lack of performance gets hidden in increased system latency, because it muddles through the data backlog eventually (or at least, that's the hope). Thus, it's tempting to write off the increased latency as "who cares", and move on. That is often entirely acceptable (which is why all networking and nearly all software kinda works that way).

    However, if one adopts a more Communicating Sequential Processes view of networking (think Golang's go routines / channels, but across networks instead, or a http put), this has the trait that if a recipient of data isn't keeping up, the sender knows all about it (send / receive block until the transfer is complete - an execution rendezvous). There's no hiding a lack of performance in buffers in NICs, networks, because there aren't any (not ones that count, anyway). It sounds like a nightmare, but actually it's quite refreshing; inadequate performance is never hidden, and you know for sure what you have to do to address it. However, if you do get the balance of data / processing right, all the "reading" is started just as the next "send" happens, and the intervening data transport shouldn't find a need to buffer or interrupt anything.

    This new mechanism brings the opportunity to kinda blend both Actor and CSP. It's "Actor", in that data could build up in buffers, but if an application / system developer did tune their architecture scale just right in relation to processing performance, the packets would just keep rolling in and be consumed immediately with barely an interrupt in sight, as if it were a CSP system but without any explicit network transfer to ensure the synchronisation of sending and receiving.

    Of course, achieving that in real life is hard for many applications. There are some where there's constant data rates, e.g. I/Q data streaming from a software defined radio (well, the ADC part of it anyway). This new mechanism paired with the fact that PREEMPT_RT has just become a 1st rank kernel component (another hooray for that!) does some interesting things to the performance that could be achieved.

    1. Caver_Dave Silver badge

      Re: Have They Measured the Whole Problem?

      Nearly all NICs keep their descriptors (which point to the data buffers) and the data buffers themselves in a cache coherent area of RAM. Many now allow a dynamic setting of the number of descriptors. Running out of descriptors should not happen in any reasonably defined system.

    2. containerizer

      Re: Have They Measured the Whole Problem?

      Slight nitpick on an excellent comment : enterprise NICs (and most of the good cheap ones) do DMA, so there is no issue with getting packets out of the NICs internal buffers. This is usually set up as a ring buffer and you can usually configure the size. Of course, you absolutely can get issues if you fail to service the ring buffer quickly enough.

      Avoiding dropped packets then becomes a case of tuning the ring buffer size and the parameters which flip between interrupt and polling mode.

      No, you definitely do not want PREEMPT_RT in most enterprise settings, where maximising workload per watt is prioritised. PREEMPT_RT significantly reduces throughput in order to improve latency. This is rarely what people want.

      1. bazza Silver badge

        Re: Have They Measured the Whole Problem?

        No problems with excellent nitpicking :-)

        I'm well aware of PREEMPT_RT's lack of appropriateness for enterprise settings! I have used it extensively for real time signal processing applications, with excellent results. It's quite good fun loading up a machine to 95% CPU utilisation, leaving it there for a few years, never missing a beat, and still be able to log in via ssh and do things without breaking it. I don't know if one still has to run the application as root to be able to have it real-time scheduled and to set it to an appropriate priority (e.g. more important than all the device driver kernel threads).

      2. John Smith 19 Gold badge
        Unhappy

        "PREEMPT_RT significantly reduces throughput in order to improve latency"

        Depends.

        On anything involving VoIP that's exactly what you want.

  7. Chris Gray 1
    Happy

    Hmm

    My experience here is from long ago, but...

    It may matter as to how long the polling takes. If control registers/whatever have to be used to enable access to whatever can report that work is ready, you have to include that cost (plus the cost of putting the system back to normal operating state, even if that ends up being done in a quite different context) in the cost of polling.

    In a very low-end old-style system, you could have a UART/whatever that makes its status available to the CPU with one access (likely a longer access than to RAM, but still). Polling will be cheap.

    On the other end of the scale, perhaps your device is smart enough that you can give it a bunch of memory resources that it can fill up without needing the host CPU to do anything other than handle whatever is ready, when asked. Streaming right along with few needed interrupts and little overhead. I thought ethernet chips started doing this decades ago. What am I missing?

    Stuff in the middle might need more host attention, so the tradeoffs will differ.

    Given the reported improvements, clearly I'm missing something.

    My direct experience was made a bit simpler by having a CPU with only one core, so full/empty flags in memory were enough. Incoming messages went directly into buffers, all within one interrupt (the hardware wasn't quite smart enough to properly clear the "empty" flag, if I recall correctly). The further-processing thread could take buffers that the interrupt code had marked ready, and could return buffers to that code similarly, all without anything waiting. Only when there was no more work to do in the in-memory buffers from the interrupt code would the thread go to sleep (after setting the "wake me up" flag of course!) Another variant for different hardware had a Linux device driver and my code in our hardware use similar conventions between them. These are essentially combinations of interrupts and polling, which is of course what you likely want.

    1. Roland6 Silver badge

      Re: Hmm

      >” I thought ethernet chips started doing this decades ago. What am I missing?”

      With the focus on the Personal Computer and thus cost, I suspect we have enshrined the idea of dumb NICs, modems, printers, etc. because there is an underutilised and expensive CPU spending much of its time idling…

      Now that we want (particularly in cloud datacentres) to make full use of that CPU power to do stuff, the old model is starting to look less attractive.

  8. Vometia has insomnia. Again.

    VaxClusters

    Sounds similar to the method used by CI-based VaxClusters back in the day, especially the HSC storage control units: DEC knew interrupts were computationally expensive and kept those generated by traffic to and from the HSCs to an absolute minimum whenever possible. I often wondered why e.g. NFS often felt oddly sluggish in comparison, and realised that may be at least one of the reasons.

  9. John Smith 19 Gold badge
    Go

    Well well, polling more efficient* than interrupt driven IO.

    Who knew?

    A few points.

    "Giving the NIC some memory resources" is a thing called Direct Memory Access. it could be enabled by a "Smart" NIC, a specialist DMA chip or a sub-module

    IIRC there's a bunch of useful stuff in the Linux sub-community that works on 1-second-boot versions for embedded systems.

    The classic was merging clock ticks (for each user level app) into a single "global" clock tick. I found the stuff you could do with the ELF loader fascinating in optimising module loading order and memory layout (memory always being a constraint in embedded apps).

    TBH I would never have considered polling to be more efficient than interrupts, so their general point (periodically revisiting the SW architecture based on the assumed HW architecture and the workload you are supporting) is very sound.

    *For certain specific circumstances.

    1. Caver_Dave Silver badge

      Re: Well well, polling more efficient* than interrupt driven IO.

      Polling is the way to ensure determinism and response times in an RTOS.

      TBH networking is a real pain in an RTOS and I always try to place all asynchronous elements like that in a separate partition with closely defined interfaces to the hard real-time partitions.

    2. ChrisC Silver badge

      Re: Well well, polling more efficient* than interrupt driven IO.

      Polling can be more efficient if you know the task you're presently doing can't block for longer than the maximum period you can ignore the event that would otherwise have raised an interrupt, as it means you don't have to spend any time saving/restoring your present task context - you just let that conclude naturally and then run the code that would otherwise have been executed mid-task via the ISR.

      What puzzled me a bit though was "By reducing the number of interrupt requests, or IRQs, the host CPU can spend more time crunching numbers and less time waiting on packets that aren't ready to process.", because why would you be raising IRQs if there's nothing for the system to handle?

      Overall though, reading this article just made me turn a wry smile, because optimisations like this are merely par for the course in the world I inhabit of embedded systems development, and IMO it's long overdue that coders on t'other side of the fence in the world of desktop/server/other big system development start to re-discover some of the techniques their predecessors *had* to be aware of due to the lack of system resources, because simply continuing to throw system resources (CPU cycles, RAM etc) at a problem to cope with increasingly bloated code isn't sustainable.

  10. phuzz Silver badge

    On the complete other end of the scale from big data centres, this little kernel driver seems to drastically improve (some) games performance under Wine/Proton.

  11. Sceptic Tank Silver badge
    Trollface

    Tiny Linux kernel – written in microcode.

    I'm definitely no expert. But I find it strange that this isn't a solved problem. Disk drives have had PIO / DMA for ages now. I would have expected a NIC to have some cache memory for buffering and hook into DMA to keep the cache filled / empty depending on whether it's doing TX or RX, and issue an interrupt when the kernel has to action something. Can't believe that polling is a thing in the year 2025. (Highly efficient sleep states and such.)

    But what do I know?

  12. Ianab

    Tuning on the fly?

    What I got from the description is that there are 2 ways to handle incoming packets.

    1 : Interrupts. The downside of this is that the CPU has to stop what it's doing, handle (stack?) the incoming packet, then complete what it was doing. This obviously delays the completion of the current task, and so the start of the new task.

    2 : Polling. The CPU checks the network for new packets whenever it has no current task running.

    Neither option is "wrong". But under a heavy workload the extra CPU cycles used to deal with the interrupts slow the process down. Conversely, if things are quiet, then the CPU is burning clock cycles polling the network when it could be "sleeping" or running low-power "No-Op" cycles, and get woken up when there is work to do.

    The clever part is being able to dynamically switch between the 2 methods, depending on the load. Exact improvement is going to depend on the actual work, but if you could get 10% better response under heavy load, or a 10% power saving when under light loads, that adds up to significant improvement over a whole data center.
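    As a toy illustration of that switch (a userspace simulation only - the real thing lives in the kernel's NAPI/softirq machinery, this just shows the shape of the trade-off):

    #include <chrono>
    #include <condition_variable>
    #include <cstddef>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    int main() {
        std::queue<int> rx;             // stands in for the NIC's receive ring
        std::mutex m;
        std::condition_variable cv;     // stands in for the interrupt line
        bool done = false;

        // "NIC": bursts of packets separated by quiet spells.
        std::thread nic([&] {
            for (int burst = 0; burst < 5; ++burst) {
                for (int i = 0; i < 1000; ++i) {
                    { std::lock_guard<std::mutex> lk(m); rx.push(i); }
                    cv.notify_one();    // "raise an interrupt"
                }
                std::this_thread::sleep_for(std::chrono::milliseconds(50));
            }
            { std::lock_guard<std::mutex> lk(m); done = true; }
            cv.notify_one();
        });

        // "Host": sleep when idle, drain whole batches when busy.
        const std::size_t busy_threshold = 64;   // arbitrary
        long processed = 0;
        for (;;) {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return !rx.empty() || done; });   // quiet: block until woken
            if (rx.empty() && done) break;
            // Busy: a backlog has built up, so clear it in one go rather than
            // paying for a wake-up per packet.
            std::size_t batch = rx.size() >= busy_threshold ? rx.size() : 1;
            for (std::size_t i = 0; i < batch && !rx.empty(); ++i) { rx.pop(); ++processed; }
        }
        nic.join();
        std::cout << "processed " << processed << " packets\n";
    }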

  13. odyssey

    Garbage collection should be the next to go. Developers have got lazy and dumb, relying on someone else to manage their memory. So much efficiency could be gained by looking at more efficient memory management - Rust, Zig, C, or something else. DeepSeek and now this have shown that there are real software efficiency improvements to be made, but perhaps only 10% of developers understand things well enough to make them.

  14. Numpty

    Nice of Linux to finally catch up with Solaris from 2008.

  15. TeeCee Gold badge
    Facepalm

    Well, well.

    So getting your code optimised by people who have a deep understanding of the language, know how the hardware works and know what they're doing makes it more efficient?

    I'm afraid this missive from The Department of Stating the Bleedin' Obvious is going to go down like a cup of warm sick in India.
