Devs of bcachefs try to get filesystem into Linux again

The lead developer of the bcachefs filesystem is gunning to get it accepted into the Linux kernel… again. The story of bcachefs is quite long-running, and this isn't the first time – nor even the first time this year the project has attempted this. The filesystem has been around for a while; The Reg first reported on it in …

  1. oiseau
    Coat

    I?

    "I and the other people working on bcache realized ..."

    Seems gramatically correct ...

    ... but lacks class.

    I'm sure you understand, it is Friday after all.

    O.

    1. This post has been deleted by its author

    2. ICL1900-G3 Silver badge

      Re: I?

      Well, I agree, though would prefer "grammatically". As an aside, our (useless, moronic) MP invariably says 'Myself and Joe Blogs...'. Not only decidedly lacking in politesse, but just plain wrong.

  2. Doctor Syntax Silver badge

    Layering

    This reminded me of being DBA for an Informix system which used mirroring at Informix level on disks which were mirrored at physical level as well. At least there was no file system involved as it ran on raw disk.

  3. Platypus

    Once they got into it, I'll bet they found that they had a much lower percentage than they thought of what goes into a real filesystem. It's easy to get something that kinda-sorta does the very basics, but there's a *long* tail of obscure features and edge cases. I used to be on the filesystem team adjacent to the LVM folks at Red Hat, so they should have known better than to make such facile claims.

  4. Throatwarbler Mangrove Silver badge
    Windows

    Say what you will about Windows . . .

    . . . NTFS is rock-solid and performs well, and volume management is much, much simpler than under Linux these days.

    1. Abominator

      Re: Say what you will about Windows . . .

      Performs well?

      Rock solid, yes, but it's as slow as shit.

    2. Blackjack Silver badge

      Re: Say what you will about Windows . . .

      Is not rock solid, not even running on Windows.

    3. Pirate Dave Silver badge
      Pirate

      Re: Say what you will about Windows . . .

      eh, I dunno about that. Diskpart doesn't seem any more intuitive than the various Linux utilities.

      For instance, I found there's no easy way to expand a small multi-terabyte volume into a large multi-terabyte volume (i.e. after expanding your RAID array), if you ignorantly ignored the "allocation size" when you originally formatted it and chose the default. Want to increase your 12TB volume above 16TB? Uh oh, the default allocation size of 4KB says "No". So you gotta wipe it and re-format it as a new volume with at least an 8KB allocation size (although there is a utility somewhere that claims to be able to do this for like $300). That is NOT easy unless you're living the high life and have another spare 16TB of disk space somewhere to move the volume data to while you reformat the original volume. This bit me in the ass late last year when we added more space to our Compellent box. $4000 for additional disks that I can't easily use because of the limit on the volume allocation size from when I originally set up the file server. DAMN IT! From here on out, every disk I format will use the 64K allocation size, slack space be damned. I'll be well-retired before we start straining that limit...

      So, no, volume management under Windows isn't really any simpler than under Linux.
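The 16TB ceiling described above falls out of simple arithmetic if one assumes, as this sketch does, that an NTFS volume can address roughly 2^32 clusters. The function names here are illustrative only and are not part of any Windows tool.

```python
# Rough sketch: maximum NTFS volume size as a function of cluster
# ("allocation unit") size, ASSUMING a limit of about 2**32 clusters.
MAX_CLUSTERS = 2**32  # assumption; the exact limit is one cluster less
TIB = 2**40


def max_volume_bytes(cluster_size_bytes: int) -> int:
    """Largest volume addressable with the given cluster size."""
    return cluster_size_bytes * MAX_CLUSTERS


def min_cluster_size_for(volume_bytes: int,
                         sizes=(4096, 8192, 16384, 32768, 65536)):
    """Smallest cluster size from `sizes` that can address the volume."""
    for size in sizes:
        if max_volume_bytes(size) >= volume_bytes:
            return size
    return None  # none of the listed sizes is large enough


if __name__ == "__main__":
    for kib in (4, 8, 16, 32, 64):
        limit = max_volume_bytes(kib * 1024) // TIB
        print(f"{kib:>2} KB clusters: up to {limit} TiB")
```

Under that assumption, 4KB clusters stop at 16 TiB and 8KB clusters at 32 TiB, which matches the commenter's experience of needing at least 8KB to grow past 16TB.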

      1. Nate Amsden

        Re: Say what you will about Windows . . .

        I hope it's better now, about 10 years ago a friend of mine worked at MS in high tier escalation dept for their products(Windows server at least maybe others). He said one of their regular suggestions/processes was helping customers turn off dynamic disks in windows as they were a huge source of problems. I think they had some special tool or thing to flip the bits to make them non dynamic anymore? I don't remember exactly. But was surprised to hear that dynamic disks were so problematic(I had used them on several systems w/o issues though small non critical systems, less than 5% of my work involves dealing with windows).

        As recently as a few years ago I had a conversation with a Windows admin who felt the same: dynamic disks were to be avoided at all costs. Could be they got fixed and that person wasn't aware; I don't know.

        By contrast I don't recall ever hearing of bad things about LVM on linux. I suppose a bad thing could be snapshot performance(I've read there's a 50% performance hit?) Though honestly in the 15+ years I've been using LVM I've never once taken a LVM snapshot. All my snapshots happen on the storage array.

        1. Pirate Dave Silver badge
          Pirate

          Re: Say what you will about Windows . . .

          I've never been a big fan of dynamic disks on Windows, but part of that may be my own reluctance to sit down and read through the tech docs on it ('cause, like, I have so much free time for reading during the day...). I think we have one server with a dynamic disk on it, but it was that way before I got here. Dynamic disks have always, in the back of my mind, seemed like asking for trouble, but that's just a gut feeling with no hard data to back it up. And maybe it's just because they're "new". But, well, you know, I still prefer BIOS booting to UEFI, so I guess I'm stuck in the past, when things were simpler (or at least had many, many utilities to support them).

          Nor have I (intentionally) used LVM that I can recall, although my Linux boxes tend to be fixed size and fixed purpose and pretty simple overall. The exception might be some SLES boxes I setup when we migrated off of Netware back in like 2010-12. But that was all Novell-ish stuff, I didn't really get down into the Linux nuts and bolts on it (not sure if that was more "don't look a gift horse in the mouth" or "don't stare into their eyes"). They worked, they were going to be our file servers, Groupwise servers, and eDirectory servers, so I didn't fiddle around with them much beyond Novell's utilities. I beat the bloody hell out of my old CentOS boxes though.

          I did set up a ZFS (FreeNAS I think) box 6 or 7 years ago to play around with. I was impressed, but good grief, did that thing like RAM. It did some way, way cool stuff though.

    4. Maventi

      Re: Say what you will about Windows . . .

      As a user of Windows and Linux I'd personally rate NTFS as reasonably robust and moderately performant. Nothing more, nothing less.

      Dynamic disks are what nightmares are made of when they go wrong (and they do!). LVM by contrast has a bit more of a learning curve but is generally more robust and flexible.

  5. Chewi

    LVM snapshots are slow

    I'm surprised you didn't mention that LVM snapshots are really slow. Because it operates at the block level, it has to make a copy of each block before it gets modified. I still use them on the odd occasion but only for brief periods.
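To illustrate why block-level copy-on-write snapshots add work to writes, here is a toy model in Python. It is only a sketch of the general idea in the comment above, not LVM's actual implementation; the `Volume` and `Snapshot` classes are hypothetical.

```python
# Toy model of a block-level copy-on-write snapshot. The first write to
# a block after a snapshot must copy the old contents aside, so one
# logical write turns into a read plus two writes.

class Snapshot:
    def __init__(self, origin):
        self.origin = origin
        self.saved = {}  # block index -> contents at snapshot time

    def read(self, index):
        # Unmodified blocks are still read from the origin volume.
        return self.saved.get(index, self.origin.blocks[index])


class Volume:
    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshots = []
        self.copy_ops = 0  # extra copies caused by snapshots

    def snapshot(self):
        snap = Snapshot(self)
        self.snapshots.append(snap)
        return snap

    def write(self, index, data):
        # Preserve the old block for every snapshot that has not saved it yet.
        for snap in self.snapshots:
            if index not in snap.saved:
                snap.saved[index] = self.blocks[index]
                self.copy_ops += 1
        self.blocks[index] = data


if __name__ == "__main__":
    vol = Volume(["a", "b", "c"])
    snap = vol.snapshot()
    vol.write(1, "B")
    vol.write(1, "BB")  # second write to the same block needs no copy
    print(vol.blocks, snap.read(1), vol.copy_ops)
```

In this model the copy cost is paid once per block per snapshot, which is consistent with the advice to keep such snapshots around only briefly.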

  6. Jamie Jones Silver badge

    Ramsey and Oliver in the kernel?

    I'm probably the only one that read that as BCA Chefs.

    1. David 132 Silver badge

      Re: Ramsey and Oliver in the kernel?

      No, you’re not the only one.

      1. Jamie Jones Silver badge
        Thumb Up

        Re: Ramsey and Oliver in the kernel?

        That's a relief!

  7. Norman Nescio

    LVM & LUKS

    You can layer LVM over LUKS, or LUKS over LVM. Each approach has its benefits and disadvantages.

    For my use case, my entire laptop disk, with the exception of the ESP, is a single LUKS partition. I then use LVM within/over LUKS to provide volumes for boot*, root, swap**, home, etc. I can hibernate to swap quite happily.

    Other people prefer to use LVM to give multiple volumes, each of which can have a separate LUKS set up (with separate passwords).

    The nice thing is the flexibility.

    *Yes, encrypted boot means that I'm currently typing in the LUKS passphrase twice when I boot. No big deal. You can get round this by putting a key file in the (encrypted) initramfs, but I haven't bothered.

    **The swap is encrypted by LUKS. The performance is good enough for me, and it means I'm sure that there's no plain text data left on disk. While linux offers encrypted swap without LUKS, it uses a random key each boot, which means you can't (easily) hibernate to the encrypted swap partition.

  8. Norman Nescio

    Snapshots

    I'd really like to set up designated areas for continuous snapshotting - for example /home - so that if I accidentally delete a file or muck up an edit, rolling back is a possibility integrated into the file system. Essentially, make use of the COW feature to put every write into a data-structure that allows me to pick a file and roll back to any point in the file's history held by the log.

    1. VoiceOfTruth

      Re: Snapshots

      So use an OS with easy snap shots. I recommend FreeBSD.

    2. Nate Amsden

      Re: Snapshots

      might be easier to use something like rsnapshot. I use rsnapshot for most of my home backups(seems I have 16 linux servers on my home/personal colo network). I back up things like /etc /var /home /usr etc. Though I think it's MAYBE once a year I go to the backup to look for something.

      I deployed ZFS on my home file server many years ago because I wanted snapshots for situations like this. Turns out in the ~5-6 years I was using ZFS I never used the snapshots once. So I decided to stop using it, as that was my most important use case for it(at home anyway). I'm perfectly happy with hardware RAID 10 (largest array I've had at home for the past 20 years has been just 4 drives, and yes when I ran ZFS it ran on top of my 3Ware RAID card).

      It certainly won't catch every write just runs at regular intervals. But if you lose data that frequently then you have bigger problems to solve.

    3. Norman Nescio

      Re: Snapshots

      Having been weaned on VAX/VMS with the inbuilt file versioning, I would like something that implements the same functionality. Currently, I use NILFS2, where all writes are checkpointed, and any checkpoint can be converted to a snapshot, mounted readonly and a file state recovered.

      Kernel.org » Filesystems in the Linux kernel » NILFS2

      NILFS2 creates a number of checkpoints every few seconds or per synchronous write basis (unless there is no change). Users can select significant versions among continuously created checkpoints, and can change them into snapshots which will be preserved until they are changed back to checkpoints.

      There is no limit on the number of snapshots until the volume gets full. Each snapshot is mountable as a read-only file system concurrently with its writable mount, and this feature is convenient for online backup.

      My wish is for another filesystem - like ext<n> or bcachefs to implement the same behaviour (ext3cow and WaybackFS are defunct, as far as I can tell, and tux3 development is not going quickly). Note that when the NILFS checkpoint log fills the volume, things don't stop, you merely lose the oldest checkpoints. If you keep snapshots, or fill up the volume with files NILFS2 will struggle, so it is advisable to manage the volume to assure a sufficiency of free space, but I've had no NILFS generated problems in many years of use.

  9. VoiceOfTruth

    ZFS is superior

    I have used bootadm and zfs on Solaris for years: rock solid, stable, reliable, predictable.

    I like the idea of btrfs, but is it ever actually going to be all of the above any time soon? I'm not convinced by amateur Linux users who insist that it is, when I know they do not speak from any important experience. Their home 'server' with some ironic hostname is not the same thing as 24x7x365 requirements. Basically I don't quite trust btrfs, and that for me makes it a big no no.

    Sure... ZFS requires a chunk of memory, but I am prepared to trade that (add it as a cost, if necessary) for what it offers. And the what it offers is substantial.

    1. Nate Amsden

      Re: ZFS is superior

      If you have 24x7x365 requirements you'll be using a proper storage array with redundant controllers and hot swap everything. ZFS as a storage system in the best case is a tier 2 storage solution, though most ZFS configs would be hard pressed to even be tier 3 storage. Nothing against ZFS, it's better than other general file systems certainly. It's more the design/architecture of the underlying hardware.

      I wrote a blog post covering this back in 2010, the general design of ZFS is pretty much the same since. I referenced this email thread as the source:

      https://www.mail-archive.com/zfs-discuss@opensolaris.org/msg18898.html

      The author of the email makes a great point regarding data availability(ZFS can't help much here) vs data reliability(ZFS does good here).

      1. hoola Silver badge

        Re: ZFS is superior

        This is exactly what I was thinking. There is ever more functionality being stuffed into the OS to do funky things with disks and bluntly, none of them are actually up to the job. Windows Dynamic Disks was a fudge to try and emulate LVM, which in itself is not that great. It works and does the job.

        ZFS has some nice features and works well at scale but as we move forward where is the support going?

        It was taken out of the kernel due to wrangling over security and such like.

        NTFS works well other than the block size limitation; just why the hell the default is still 4KB beats me, it is just stupid beyond belief. Where NTFS wins hands down is the security integration with AD. Once you move into enterprise storage, a lot of the funky stuff is carefully hidden away, but AD integration is there.

        NTFS may be "slow" as some state, but compared to what? NFSv3 is fast, I grant you, but if you need granular permissions that integrate with AD you are sunk. NFSv4 can provide that mapping with some horrible LDAP stuff, but the driver here is which clients are accessing it. If it is predominantly Linux, then NFSv4; but if you have limited Linux, what is the advantage over a commercial storage appliance? Very little as long as you have the expertise to support it.

        So if you are at small scale you will probably use something in the OS, but hardware RAID is not that expensive and is far better. You don't have to have all the fancy hot plug stuff, just protect against disk failures in a way that means you are not grovelling in the OS to tie things back together.

        Do we actually need yet another filesystem bundled into any OS?

      2. Liam Proven (Written by Reg staff) Silver badge

        Re: ZFS is superior

        [Article author here]

        > If you have 24x7x365 requirements you'll be using a proper storage array with redundant controllers

        > and hot swap everything.

        Surely this is just moving the question somewhere else?

        So that the question becomes: what OS does that storage array run, and what FS?

        At some level, somewhere on the network, there is software storing stuff on disk in the stack. That software is running from a filesystem and it's very probably putting stuff in a filesystem.

        My impression is that the real goal of these FSes is to bring the sort of functionality and robustness to Linux FSs that those SAN boxes have, be they NetApp or PowerVault or Hitachi Vantara or whatever.

        Many of them run proprietary OSes. We know, for example, that NetApp's is based on BSD.

        The goal here is to have everything on the network able to be hosted on Linux. Compute, storage, switching, whatever.

        Whether that is the right thing to do or not, I am not here to say.

        So, yes, you can respond "pah, if you're serious, you have dedicated kit for that," but if the answer is that *those* boxes run proprietary software, you haven't actually addressed the core point here.

        Yes, there is proprietary tech to do this. Some of it can run on Linux. Some of it is even FOSS, but not GPL FOSS.

        The point of the exercise here is to answer this need with GPL FOSS tools.

  10. Henry Wertz 1 Gold badge

    btrfs and bcachefs

    So, there's already a "bcache" block cache subsystem in the linux kernel. You can (for instance) have a slow block device, use a faster block device as a cache, the result is a 3rd block device that uses the cache to speed up access. The design is simple and reliable, and has full failure paths -- if the cache device fails, you don't have your device drop dead, it just goes to accessing the slow device directly. I have not used bcachefs, but it looks to also have a solid design that should both have good performance, and be able to be resilient to failure.

    btrfs? I've attempted to use it twice. First time, I turned on data compression and it corrupted my data. Second time, after plenty of assurances floating around online that all those bugs have been fixed and it's fine... I had the filesystem self-destruct within a few months. I was not running a UPS and all that good stuff, and found that btrfs seriously cannot deal with power cuts. At all. I was not doing anything special (large file transfer, snapshot, etc.) when power cut, but that was enough to make it flip out -- the transaction log, checksums everywhere, and sequence numbers in critical data structures, sure make it DETECT when there's a filesystem inconsistency. But there's no automatic fsck AND no manual procedure to just have it roll back like 30 seconds, lose the last transaction or two and be consistent again. It's read-only use only from there on until you reformat!
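The bcache behaviour described in the first paragraph of the comment above (a fast device caching a slow one, with fallback to the slow device if the cache dies) can be sketched as follows. This is a read-path-only toy under stated assumptions; the class names are hypothetical and it says nothing about how the kernel's bcache handles writes or dirty data.

```python
# Toy read cache in front of a slow backing device. If the cache device
# fails, reads keep working by going straight to the backing device.

class BackingDevice:
    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.reads = 0  # counts slow reads

    def read(self, n):
        self.reads += 1
        return self.blocks[n]


class CacheDevice:
    def __init__(self):
        self.blocks = {}
        self.failed = False  # flip to True to simulate a dead cache device

    def get(self, n):
        if self.failed:
            raise OSError("cache device failed")
        return self.blocks.get(n)

    def put(self, n, data):
        if self.failed:
            raise OSError("cache device failed")
        self.blocks[n] = data


class CachedDevice:
    """The combined 'third' block device that callers actually use."""

    def __init__(self, backing, cache):
        self.backing = backing
        self.cache = cache
        self.cache_ok = True

    def read(self, n):
        if self.cache_ok:
            try:
                data = self.cache.get(n)
                if data is not None:
                    return data  # cache hit
            except OSError:
                self.cache_ok = False  # stop using the cache from now on
        data = self.backing.read(n)
        if self.cache_ok:
            try:
                self.cache.put(n, data)
            except OSError:
                self.cache_ok = False
        return data


if __name__ == "__main__":
    backing = BackingDevice({0: b"hello"})
    cache = CacheDevice()
    dev = CachedDevice(backing, cache)
    dev.read(0)          # miss: goes to the slow device
    dev.read(0)          # hit: served from the cache
    cache.failed = True
    print(dev.read(0), backing.reads)  # still readable after cache failure
```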

    1. VoiceOfTruth

      Re: btrfs and bcachefs

      -> plenty of assurances floating around online that all those bugs have been fixed and it's fine

      That would be the Linux 'community', which it turns out has many blinkered fanboys and rather fewer honest reviewers in its number. To them: Linux is good, Windows (and everything else, actually) is bad.

      I wrote in another post that I don't quite trust btrfs. No fanboy is going to convince me otherwise.

    2. Ace2 Silver badge

      Re: btrfs and bcachefs

      Same with APFS, apparently… mistakenly unplugged my Time Machine drive and found that it moved permanently into read-only mode. This stuff is considered production quality!

  11. Henry Wertz 1 Gold badge

    s3qlfs

    I was going to add this to my last post but decided it deserves its own. I recommend s3qlfs! Pretty fast and easy way to get compression and deduplication.

    I'm using s3qlfs for some of my storage now. As the "s3" suggests it's meant for cloud storage (Amazon S3 supported plus several other cloud systems) but one backend is "local", I just make a "s3ql-data" directory on my ext4 file system, a "s3ql-fs" mount point, and have a choice of several compression methods (also encryption, but not sure that's useful when it's already sitting on my local disk as opposed to "in the cloud", so I have that off.) It has built-in deduplication. I have it set to use a 50GB cache. Fast, reliable, and I have had power cuts and such and have had it recover just how you'd expect (Run a slow'ish fsck, lose the last couple writes if it was in the middle of writing something, and away you go.) I have a 4TB disk with like 6GB files on it and still 1TB free disk space so I can guarantee it's effective.
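For readers unfamiliar with how compression and deduplication can combine, here is a minimal content-addressed block store in Python. It is a sketch of the general technique only and makes no claim about S3QL's internals; the `DedupStore` class, the fixed block size and the choice of SHA-256 plus zlib are all assumptions made for illustration.

```python
import hashlib
import zlib

# Toy deduplicating, compressing block store: each block is keyed by the
# hash of its contents, so identical blocks are stored only once.

class DedupStore:
    def __init__(self, block_size=4):
        self.block_size = block_size  # tiny on purpose, for the demo
        self.objects = {}             # hash -> compressed block
        self.files = {}               # file name -> ordered list of hashes

    def put(self, name, data: bytes):
        hashes = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.objects:  # store each unique block once
                self.objects[digest] = zlib.compress(block)
            hashes.append(digest)
        self.files[name] = hashes

    def get(self, name) -> bytes:
        return b"".join(zlib.decompress(self.objects[h])
                        for h in self.files[name])


if __name__ == "__main__":
    store = DedupStore()
    store.put("a", b"AAAABBBBAAAA")  # three blocks, two of them identical
    store.put("b", b"BBBBCCCC")      # shares one block with file "a"
    print(len(store.objects), store.get("a"))
```

Five logical blocks end up as three stored objects in this demo, which is the effect that lets a deduplicating store hold more logical data than the space it occupies.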
