Latest update for 'extremely fast' compression algorithm LZ4 sprints past old versions

The new version of the high-speed compression algorithm LZ4 gets a big speed boost – nearly an order of magnitude. LZ4 is one of the faster compression algorithms in Linux, but the newly released LZ4 version 1.10 significantly raises the bar on its own forerunners. On some hardware, LZ4 1.10 compresses data over five and up to …

  1. Pascal Monett Silver badge
    Thumb Up

    "only very smart humans can do that"

    "writing efficient code to exploit the parallelism of multiple processor cores is very hard"

    Fuck yeah.

    And there is no amount of pseudo-AI bullshit that is going to change that any time soon.

    That said, if and when it does, that will be one hell of an accomplishment.

    1. b0llchit Silver badge
      Alien

      Re: "only very smart humans can do that"

      ...pseudo-AI bullshit...

      Soon you'll be able to buy GI(*) in a box from Amazon. It is a half-large/small box with a feed line, a sink line and just one button. It will perform all the necessary tasks you ask of it via the WiFi connection and will outperform any AI day and night. A recent teardown video of the GI box revealed a small, genetically altered, squarish humanoid inside the box. A brain implant was coupled with a WiFi antenna and electrical power was supplied using a biogenerator.

      The main negative point of the box was the smell when in use for prolonged periods. The major positive point was the use of mainly biologically degradable components that can be recycled quite easily. No exact measurements were given for the feed- and sink-line requirements, properties and dimensions.

      (*) Genuine Intelligence

      1. elDog

        Watch out when it spews diarrhAI

        Nasty bug going around right now.

      2. Robert Carnegie Silver badge

        Re: "only very smart humans can do that"

        It sounds a lot like the Discworld Camera - or the original "Mechanical Turk".

        1. Liam Proven (Written by Reg staff) Silver badge

          Re: "only very smart humans can do that"

          [Author here]

          > It sounds a lot like the Discworld Camera

          At least the imp got breaks. Including for tiny cigarettes, as I recall.

          This is more like the single nastiest SF short story I have ever read, which lives in my head rent-free:

          https://isfdb.org/cgi-bin/title.cgi?45334

    2. Charlie Clark Silver badge

      Re: "only very smart humans can do that"

      Oh, I think we'll see AI being used more and more, but discreetly and cooperatively: learn how to solve the task and then, pace Thoreau, optimise, optimise. I think DeepMind has demonstrated how well this approach can work across domains, and recent developments suggest they're working on a kind of AI assembly line, including teams that monitor quality and performance. But, for the moment, the money is going to be in that holy grail of "business process automation", for which many generative AI systems probably are already good enough™.

      I remember that years ago it became the consensus that code optimisation could generally be left to the compiler, and I think we'll see more and more of this kind of approach.

    3. mahan

      Re: "only very smart humans can do that"

      Just as a side note: Check out the concurrency model in Go (Golang) if you haven't.

      While multithreading is indeed always a challenge, their approach is by far the easiest I've had to wrap my head around.

  2. biddibiddibiddibiddi Bronze badge

    The opening says that the new version is able to compress data faster, but the release notes quote is only about the decompression speed. The full release notes DO show that compression is the primary improvement, but the quote provided is unrelated to what the article is trying to hype.

    1. Liam Proven (Written by Reg staff) Silver badge

      [Author here]

      The comparison of compression speeds is in the paragraph directly after the "more context" links section. You did keep reading that far, didn't you?

      I merely quoted the release notes where it noted that while it _is_ faster, it's fast enough to saturate an NVMe connection on a single core...

      1. david 12 Silver badge

        There is no need to be insulting. The comment to which you are replying is both true and helpful.

        Yes, there is an implicit criticism, a suggestion that readers will benefit both from more information (re the source document) and from less information (over-emphasis of secondary information).

      If you, as author, find yourself suggesting that a reader has misread your article, it's a chance for you to consider re-editing the original text.

      2. biddibiddibiddibiddi Bronze badge

        I never said there was nothing in the article about the compression rate. A big highlighted/bold section stands out more for people just skimming the article; it's meant to be a focal point that emphasizes the important information. So it ought to be directly related to the title and opening of the article, rather than being something that isn't mentioned anywhere else and isn't about what the article mentions repeatedly.

  3. Anonymous Coward
    Anonymous Coward

    "...extremely fast compression...."

    Fine and dandy...............

    ...........but this Linux user has been using and saving backups using zip for a while now.

    Q1: I don't think I need faster compression. Do I?

    Q2: Then there's the problem of previous zip files.....if I were to change from zip to something else.

    Q3: ....and then I wonder about FUTURE updates...........

    ......so I'll be staying with zip. Sorry.....no sale......"new shiny" and all that......

    1. Anonymous Coward
      Anonymous Coward

      Re: "...extremely fast compression...."

      lz4 is very useful in filesystems where the same data is both read and written often, not archive files which are much more static.

      I've used it as the default compression in ZFS, where it could actually improve FS performance because there's usually more CPU available than storage bandwidth.

    2. Liam Proven (Written by Reg staff) Silver badge

      Re: "...extremely fast compression...."

      [Author here]

      > so I'll be staying with zip.

      A few points.

      1. That's fine, of course.

      2. Zip isn't _a_ compression format: it's a collection of lots of them. 7zip has some of the best compression routines these days, but other tools with "zip" in the name, such as bzip2, also work well.

      3. lz4 is not really aimed at the use case of "make a file containing this stuff, which I'll want later". Its strength is in doing compression _on the fly_, for instance when compressing memory into your swap file, or when reading a kernel from a possibly-slow removable medium into RAM. In other words, it acts as a sort of filter in the middle of an ongoing pipe of data: it takes raw data in one side and spits out compressed data the other, _continuously_ and with the minimum possible delay.

      Zipping a file is different. It can take several goes. It can read ahead a bit, pick a good compression algorithm, go back and use it for that chunk, and then maybe repeat that for the next file you're putting in the archive.

      Zip is like a buffet: you can visit repeatedly and pick what you want at that time, and change your mind.

      Lz4 is like a drive-through: you have to know what you want, at the beginning you order it, and soon afterwards you get it and leave, and you can't go back and change your mind.
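      To make the "filter in a pipe" idea concrete, here is a rough Python sketch. It uses the standard library's zlib in place of LZ4 (the lz4 bindings are a third-party package), and the function names are made up for illustration; the point is the shape of the API, not the algorithm: data is compressed incrementally as it arrives, with no seeking and no second pass.

      ```python
      import zlib

      def compress_stream(chunks):
          """Compress an incoming stream chunk by chunk, yielding output
          as soon as it is available -- no seeking, no second pass."""
          comp = zlib.compressobj(level=1)  # low level: favour speed, LZ4-style
          for chunk in chunks:
              out = comp.compress(chunk)
              if out:
                  yield out
          yield comp.flush()

      def decompress_stream(chunks):
          """The matching filter for the read side of the pipe."""
          decomp = zlib.decompressobj()
          for chunk in chunks:
              out = decomp.decompress(chunk)
              if out:
                  yield out

      # Round-trip a stream of chunks through the "pipe".
      source = [b"some data " * 1000, b"more data " * 1000]
      restored = b"".join(decompress_stream(compress_stream(source)))
      assert restored == b"".join(source)
      ```

      Note that neither filter ever sees the whole input: each works on whatever chunk the pipe hands it next, which is exactly the constraint zram, zswap and ZFS impose.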

      1. NATTtrash
        Boffin

        Re: "...extremely fast compression...."

        Simple user trying to understand...

        On some hardware, LZ4 1.10 compresses data over five and up to nearly ten times faster than previous releases by using multiple CPU cores in parallel.

        So how is this different, then, to pigz (https://github.com/madler/pigz) and pxz (https://github.com/jnovy/pxz), which, as far as I know, also do multi-core and have been around for some time?

        1. Liam Proven (Written by Reg staff) Silver badge

          Re: "...extremely fast compression...."

          [Author here]

          Answering out of order intentionally.

          > also do multi-core and we already have some time?

          OK. What follows is simply my best understanding.

          This is *not* about file-based compression algorithms, which are a separate area of R&D. Nothing wrong with either but they are different tools for different jobs.

          LZ4 is about high-performance _on the fly_ compression between a continuously streaming source of data and a sink. LZ4 is used in tools such as ZRAM (compressed swap file in RAM), Zswap (compression of RAM into swap file/partition), and ZFS (compression of filesystem data as it is stored to disk).

          With a file compression tool you can seek forwards and backwards in the data. You can look ahead some distance, work out what will work well, and then go back that distance and start work. You can look at file metadata such as file extensions and choose an algorithm that should work well for that type of data. You can choose one algorithm for text, another for images, another for executable binaries, etc.

          LZ4 is for scenarios where you do not have that luxury: you have a stream of data coming in _right now_ and you have to compress it ASAP and write it out again, and you don't know what the data is or will be.

          > So how is this different then to pigz (https://github.com/madler/pigz)

          1. Not GPL and thus possibly licence-incompatible with the kernel.

          2. Seems not to be a novel algorithm but a novel implementation of an existing algorithm.

          3. Not a streaming compression algorithm but a file-based one.

          > and pxz (https://github.com/jnovy/pxz),

          No docs, no useful readme, just the GPL text. Very lazy.

          But as above: #2 applies.

          This seems to be a multicore implementation of the existing LZMA algorithm, as described here:

          https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm

          It's not a new compression algorithm.

          I am saying "here is a new fabric that you could make garments from" and you are responding "but we already have clothes. What's wrong with the clothes we have?"

          Does that difference make sense?

          1. NATTtrash
            Pint

            Re: "...extremely fast compression...."

            Actually Liam, yeah it does. Cheers!

  4. DaemonProcess

    Not so new

    I implemented parallel zlib compression in 1999 on HP-UX, so not so new. It was easy because the vendor had some lovely simple #PRAGMA options for multi-threaded parallelism in C.

    It showed up to 1600% CPU reported in top when running on 4 cores, which totally destroyed the ability of the OS to account for it properly.

    One problem was that compress (zlib) had a small 64k maximum block size, so it spent more time managing its threads than actually doing stuff. If I could have changed that (which would have made it not backwards compatible) I would have gone for 4MB or so, to limit the I/O ops and reduce the thread management. The actual speed therefore was not 16x faster but maybe 2x.

    You could get 2x faster in the original single-threaded implementation by simply reducing the compression strength, so the whole exercise wasn't worth more than an academic investigation. De-compression was so fast that i/o was the bottleneck and it simply wasn't worth doing that in parallel according to my testing.

  5. Anonymous Coward
    Anonymous Coward

    Parallel compression is hard

    No kidding that this is a hard problem.

    Most compression algorithms compress data according to how the previous data was compressed.

    To oversimplify, to compress the next byte of data, you have to know how the previous byte of data was compressed.

    You can't compress two bytes of data at the same time using two different processor cores. At least not without making significant changes to the algorithm. It would be interesting to know what those changes were.

    1. Anonymous Coward
      Anonymous Coward

      Re: Parallel compression is hard

      They compress independent blocks of data in parallel, not a single stream.
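      A minimal sketch of that idea in Python, using stdlib zlib in place of LZ4 and a thread pool (CPython releases the GIL inside zlib for large buffers, so the threads genuinely overlap); the block size and function names here are invented for illustration. Each block is compressed independently and length-prefixed so the reader can find the boundaries:

      ```python
      import struct
      import zlib
      from concurrent.futures import ThreadPoolExecutor

      BLOCK = 256 * 1024  # block size is a tunable trade-off

      def compress_parallel(data: bytes) -> bytes:
          blocks = [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]
          with ThreadPoolExecutor() as pool:
              compressed = list(pool.map(zlib.compress, blocks))
          # Length-prefix each block so the reader can locate the boundaries.
          return b"".join(struct.pack("<I", len(c)) + c for c in compressed)

      def decompress_parallel(framed: bytes) -> bytes:
          blocks, i = [], 0
          while i < len(framed):
              (n,) = struct.unpack_from("<I", framed, i)
              blocks.append(framed[i + 4:i + 4 + n])
              i += 4 + n
          # Blocks are independent, so decompression parallelises too.
          with ThreadPoolExecutor() as pool:
              return b"".join(pool.map(zlib.decompress, blocks))

      data = b"the quick brown fox " * 100_000
      assert decompress_parallel(compress_parallel(data)) == data
      ```

      The length prefixes are the "encoding of where one block ends" raised below; real formats (the LZ4 frame format, pigz's output) do roughly this with their own framing.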

      1. Anonymous Coward
        Anonymous Coward

        Re: Parallel compression is hard

        So there must be some encoding of where one block ends and another block begins, and they have to be compressed independent of each other, meaning that the beginning of each block can't take advantage of the previous block's data. Meaning that there's going to be some compression inefficiency at the beginning of each block.

        1. talk_is_cheap

          Re: Parallel compression is hard

          > So there must be some encoding of where one block ends and another block begins,

          That is kind of how block storage devices work. On top of this the file system decides how many physical blocks it wants to combine into a logical block.

        2. Anonymous Coward
          Anonymous Coward

          Re: Parallel compression is hard

          Yes, needing to know how the previous block was compressed sounds more like video compression. I'd have thought lossless compression would only require looking at the dictionary.

  6. Steve Aubrey
    Happy

    Funny

    Kernel version 3.11, also nicknamed "Linux for Workgroups". Somebody has some history, and also a sense of humor. Bravo!

  7. Wofer

    7z on my computer makes smaller files than lz4, but I didn't test for speed.

  8. Bill Gray

    I have _got_ to be missing something basic...

    ...and am probably embarrassing myself here. But, in the spirit of "the only dumb question is the one you don't ask"...

    You have (say) five GBytes to compress, and (say) four cores to devote to the job. One core compresses the first 1.25 GBytes, while another simultaneously compresses the next 1.25 GBytes, etc. Write out all four compressed chunks.

    When it comes time to decompress, each of four cores is assigned a 1.25-GByte block to decompress.

    Some very slight amount of compression may be lost if (for example) you're at the two-GByte point (partway through the second block) and some text repeats that was somewhere in the first block. You won't be able to say "copy N bytes from a gigabyte ago". In the pathological case where the file consists of a single one-GByte block repeated five times (or similar), you'll get almost no compression. (The "traditional" method would compress the last four repetitions down to just about zero.)

    Aside from such unusual cases, I'm not seeing the downside.

    In reality, you would say "we'll do 0.5-GByte chunks, feed cores 0-3 a half gig each, and as each finishes, feed them another half gig until the file's done." With the chunk size being a tunable parameter. This also helps in the streaming case; as soon as a half gigabyte is read, a core can start tearing into it while the next half gig is read.

    As noted, if it were this easy, it'd be done this way. So what am I overlooking? (Need a head-scratching icon for this... we had the Paris one, but unlike Bogart, we won't "always have Paris".)
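    That pathological case is easy to demonstrate with stdlib zlib standing in for any window-based compressor (zlib's window is 32 KiB, hence the small block here; LZ4's is 64 KiB, but the principle is the same). Data that is one block repeated compresses brilliantly as a single stream, and barely at all when the blocks are compressed independently, because each worker starts with no history:

    ```python
    import os
    import zlib

    block = os.urandom(16 * 1024)   # 16 KiB of incompressible noise...
    data = block * 8                # ...repeated 8 times (128 KiB)

    # One stream: later repeats become back-references into the window.
    whole = len(zlib.compress(data, 9))

    # Independent chunks: every worker sees "new" random data.
    chunks = [data[i:i + len(block)] for i in range(0, len(data), len(block))]
    chunked = sum(len(zlib.compress(c, 9)) for c in chunks)

    assert whole < chunked // 2  # far smaller as a single stream
    ```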

    1. silent_count

      Re: I have _got_ to be missing something basic...

      G'day, I'm not all that knowledgeable about (de)compression routines but what you're describing sounds reasonable.

      It may be worth going with smaller chunk sizes because, let's say, you compress your 5GB file in four 1.25GB chunks, but when it comes time to decompress, one core is busy with some long-running task. The three available cores will decompress the first three chunks, but then two cores will sit idle while one core does the last chunk. Smaller chunks would enable the decompression task to be more evenly distributed, albeit at the cost of some 'chunk management' overhead.

    2. Liam Proven (Written by Reg staff) Silver badge

      Re: I have _got_ to be missing something basic...

      [Author here]

      > So what am I overlooking?

      I am not a developer, and very definitely not of parallel code. But here are some points you seem to have overlooked that occur to me:

      * How do you allocate a continuous stream to an unknown number of cores?

      * When decompressing how do you coordinate who is writing which parts of the output stream -- bear in mind this is not a file-based system -- without expensive IPC or locking?

      * When compressing, how do you ensure that concurrent threads use the same encoding patterns? Note that all can be assumed to be receiving different data. Also note you can't change the pattern used in existing formats: the format was set 13Y ago.

      * Bonus: Do it without very resource-heavy interprocess communication which would block and offset the benefits of multithreading.

      * Now, implement this for an existing compression format which cannot be altered or you'd break compatibility.

      1. Bill Gray

        Re: I have _got_ to be missing something basic...

        (tl;dr : I think the need to maintain the existing format is probably the killer.)

        > I am not a developer, and very definitely not of parallel code

        I am a developer, but have only had a few projects involving parallel code. Just enough to be dangerous.

        > How do you allocate a continuous stream to an unknown number of cores?

        That part is fairly routine, and there are several ways to do it. You can just pick a number of threads to run. Pick too many, and the OS is quite good at scheduling them to spread out the load evenly. Best not to have more threads than the machine has cores, just to avoid the (small) overhead of added threads.

        > When decompressing how do you coordinate who is writing which parts of the output stream -- bear in mind this is not a file-based system -- without expensive IPC or locking?

        Hmmm... the loop I'm envisioning is :

        While data is coming in :

        Do we have (say) half a gigabyte, or whatever our block size is?

        If so, start a new thread or process (probably thread) to compress said half gigabyte.

        Has the thread processing the "oldest" block completed, with all preceding blocks (if any) written?

        If so, start a new thread to output its data.

        Both threads and forking new processes, by the way, are quite cheap in Linux. The above could be streamlined with a bit of IPC; the only "communication" in the above is a thread signalling completion. Pretty simple as it is, though.
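        The loop above can be sketched with a thread pool whose map() hands results back in submission order, so the writer always emits the oldest finished block next and no explicit locking is needed. (Python with stdlib zlib again, purely illustrative; the helper names are mine, and each chunk becomes its own compressed stream rather than matching any existing format.)

        ```python
        import io
        import zlib
        from concurrent.futures import ThreadPoolExecutor

        def read_chunks(stream, size=512 * 1024):
            """Yield fixed-size chunks as they arrive from the input."""
            while chunk := stream.read(size):
                yield chunk

        def compress_pipeline(instream, outstream, workers=4):
            with ThreadPoolExecutor(max_workers=workers) as pool:
                # map() schedules chunks onto idle threads as they are read,
                # but yields results strictly in input order -- "write the
                # oldest completed block first" comes for free.
                for block in pool.map(zlib.compress, read_chunks(instream)):
                    outstream.write(block)

        def decompress_concat(buf):
            # Each chunk became its own zlib stream; walk them in sequence.
            out = []
            while buf:
                d = zlib.decompressobj()
                out.append(d.decompress(buf))
                buf = d.unused_data
            return b"".join(out)

        payload = b"abcdefgh" * 200_000
        dst = io.BytesIO()
        compress_pipeline(io.BytesIO(payload), dst)
        assert decompress_concat(dst.getvalue()) == payload
        ```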

        > When compressing, how do you ensure that concurrent threads use the same encoding patterns? Note that all can be assumed to be receiving different data.

        Not sure I follow you here. Of course the threads will each be compressing different data; the alternative would be that we've been conveniently handed data that repeats for each block. Different data means a different encoding pattern for each block.

        > Also note you can't change the pattern used in existing formats: the format was set 13Y ago.

        I'd assumed that the format was block-based. Or, at the very least, that you could say "here's a chunk of data to compress/decompress, and here's another chunk, and..." It isn't. To make it such would require a format change. Not much of one, and there's a 'version' flag that would allow for it. As an admittedly minor side benefit, such a change would mean that if you needed to decompress only part of a file, you could do so.

        It would not be completely impossible to do it the way I've described and stay with the existing format. Each block would need some data from the end of the preceding block, because the compression mostly consists of saying "copy N bytes from delta_bytes ago". But it does raise the level of difficulty a fair bit.

  9. Anonymous Coward
    Anonymous Coward

    Argument against hyperthreading?

    "More cores can't make a single-threaded process run any more quickly, and in general, most common apps tend to only use a small number of threads."

    Isn't that one of the core arguments against hyperthreading? A CPU-intensive single-thread process, on a machine that otherwise isn't using a lot of CPU, keeps getting interrupted and passed between virtual processors, making it slower than without hyperthreading. Am I understanding that right?

    Not an academic question. We have a machine at $WORK that takes forever to do certain operations. They're single-thread (and have to be). 12 physical cores, 24 virtual ones, and consistently runs at 5% of total CPU power, mostly due to those "certain operations".

    1. Martin an gof Silver badge

      Re: Argument against hyperthreading?

      Not at all well versed in the subject, but an overall 5% on 24 virtual cores on a machine doing mostly just one thing implies one core at near 100% (100/24 = 4.166) and the other cores doing small amounts of stuff. Most system monitors will let you see what individual cores are up to, but might report individual usage as either 100% per core, or 4% of the total when one core is maxed out.

      Unlike the early hyperthreading days, OS schedulers are supposedly aware of physical -v- virtual and even big -v- LITTLE and avoid putting two threads on virtual cores which occupy the same physical core if there is another physical core free. I have an app at work which is so poorly written it practically crippled the original Pentium boxes it ran on by hogging the processor - it spends nearly all its time waiting so really, really doesn't need to max out a processor. On a 2-core box, no problem at all. It gets one physical core to itself and everything else shares the other.

      As for hyperthreading, the original idea was that most apps will spend some of their time "stalled", waiting for memory or for a lengthy calculation to finish. It's possible to put idle parts of the core to use on another thread, keeping the core busy. Similarly, with the badly-behaved app above, even on a single-core processor, if it has hyperthreading, there is much less of an apparent slowdown for other apps. I did try it once, just to see. Can't remember the processor; possibly an original Intel "Core", or maybe one of those single-core Atoms of a similar era?

      And with many multicore processors these days, the boost speed available if just one core is running hard is better than if you have lots of cores under high use (thermals), so it should perhaps be faster.

      Just my 2p.

      M.

  10. Henry Wertz 1 Gold badge

    zswap

    If you use zswap, do yourself a favor and do a "echo zsmalloc > /sys/module/zswap/parameters/zpool" (zsmalloc is the allocator, so it goes in the zpool parameter).

    I also used zstd "echo zstd > /sys/module/zswap/parameters/compressor" since it uses one of the pretty fast zstd modes (not as fast as lz4) and I preferred better compression over absolute swap speed, to avoid actually having to swap to disk as much as possible.

    On older kernels -- this is useless: zsmalloc didn't support freeing up space, so zswap would fill and be unable to free any of it. They FINALLY fixed this within the last 6 months to a year; zsmalloc now works with zswap, including freeing up space. I found both zbud (2:1 storage) and z3fold (3:1) to be useless: you'd average something like 1.5:1 compression, your swap would be cached, but the kernel would just swap that much harder (since instead of freeing the blocks it expected, data was being stored at only about 1.5:1 compression, not freeing up the memory it expected or anything close to it). I have seen 8:1 compression with zsmalloc, and 3:1 to 5:1 is typical -- apparently plenty good enough to keep things happy. I got a massive performance improvement under memory pressure using zswap with zsmalloc.
