Hi, I'm the "Google kernel contributor" that this article is about.
Let me introduce myself. I'm Steven Rostedt (some people jokingly call me "Roasted", a nickname that started when I was in third grade; I sometimes use it myself).
tl;dr: Google had very little to do with the email thread. I didn't copy a function that I didn't understand; the function was overkill for my use case, which is why Linus said I didn't understand it. And Linus was having a bad week when a bug report came in against my code.
Sorry for the long post. Most people will not read this, and I'm fine with that. The article will be forgotten in a month, but it will live on the internet forever, and I wanted my response to live with it. For the few of you who care about the background, including why Linus blew up, feel free to continue.
My background: I started Linux kernel development while working on my Masters in 1998. I fell in love with the work and sought out a job as a kernel developer. In 2001, I landed a job at TimeSys porting their version of the Linux kernel to embedded boards. In 2003, I became a contractor focusing on using Linux in real-time environments. In 2006, I joined Red Hat and helped them create their Real Time Linux offerings. In 2017, I left Red Hat and joined VMware, as I was asked to help them "Convert an Open Source hostile company into an Open Source friendly one". I was mostly there to consult on how to interact with open source communities, and it was there that we made it policy that any change to an open source project must benefit the community and not just the company. About 20% of my time went to consulting for the company and 80% to continuing my upstream contributions. After 5 years with VMware, my contributions were becoming less relevant to the company, so I left in 2022 and joined Google to work on ChromeOS, where I help improve performance on their low-end Chromebooks. Basically, I work to improve the laptops your kids use in school.
In 2008 (at Red Hat), I wrote the tracing infrastructure (aka ftrace) of the Linux kernel. As my heart has always been with embedded development, I wanted the tracing infrastructure to be easily usable on embedded devices, so I chose a file-system-based interface that works with nothing more than BusyBox (a minimal set of embedded shell utilities). My kernel expertise is in the scheduler, interrupts, real-time, a bit of memory management, and obviously tracing; I've never worked much in the file system layer. At the time, I decided to use the debugfs file system (/sys/kernel/debug) as it was the easiest to implement, and I added a directory there (/sys/kernel/debug/tracing) to interact with my infrastructure. I've been thanked several times by the embedded community for making the interface so simple to use, including by the lead of Ingenuity, the Mars helicopter, as my code was heavily used in debugging it. My scheduler work happens to be on the helicopter too, so I know that my code was running (and flying) on Mars! ;-)
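To give a sense of how simple that interface is: everything is a plain file, so from a BusyBox shell all you need is echo and cat. For those who prefer C, here's a rough sketch of the same interaction; it assumes tracefs is mounted at /sys/kernel/tracing, that the sched_switch event exists on your kernel, and that you're running as root:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static void write_file(const char *path, const char *val)
    {
        FILE *f = fopen(path, "w");

        if (!f) {
            perror(path);
            exit(1);
        }
        fputs(val, f);
        fclose(f);
    }

    int main(void)
    {
        char line[4096];
        FILE *trace;

        /* Enable one scheduler event and turn the trace buffer on. */
        write_file("/sys/kernel/tracing/events/sched/sched_switch/enable", "1");
        write_file("/sys/kernel/tracing/tracing_on", "1");

        sleep(1);  /* let a few context switches get recorded */

        /* The recorded data is just another file. */
        trace = fopen("/sys/kernel/tracing/trace", "r");
        if (!trace) {
            perror("trace");
            exit(1);
        }
        while (fgets(line, sizeof(line), trace))
            fputs(line, stdout);
        fclose(trace);

        write_file("/sys/kernel/tracing/tracing_on", "0");
        write_file("/sys/kernel/tracing/events/sched/sched_switch/enable", "0");
        return 0;
    }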
As my code started being used in production environments, I was asked (not by Google, but by others) if I could move the tracing interface out of debugfs. debugfs is an all-or-nothing file system, and being a debug interface, it can carry vulnerabilities with it. So I created tracefs. As I stated before, I didn't know much about file systems, and after talking with some file system folks, they just told me to clone debugfs and start with that. I did, and it wasn't that hard. Since 2016 (while I was still at Red Hat), the tracing infrastructure appears in /sys/kernel/tracing with no dependency on debugfs. For backward compatibility, mounting debugfs automatically mounts tracefs in its original location.
While I was at VMware, someone outside of VMware complained to me that tracefs had a very high memory footprint. Investigating, I found that it was due to the trace event files and directories. Trace events are created by kernel developers all over the tree, and we now have close to 2,000 of them, which creates close to 20,000 files in tracefs. These event files are created at boot even if you never mount tracefs. Here's where my lack of knowledge of the Linux virtual file system (VFS) layer became a problem. I had based tracefs on debugfs, which was actually doing things wrong: it used a "dentry" as a handle to the interface. A dentry is just a VFS cache element; it should never be used outside of VFS, as it is a critical element of that layer. Because tracefs copied debugfs, it inherited the same issue, and I still didn't know this was wrong until the blow-up with Linus. I realized that the dentries and their backing inodes were the cause of the memory footprint: they weren't being used as a cache for the file system but were being created up front for every event. In early 2020, I started looking at converting the "events" directory in tracefs over to something that would only allocate the dentry and inode when referenced. I got a prototype semi-working but ran out of time to finish it. Another engineer at VMware, who was mostly doing Linux kernel backports, asked me if I had any TODO items he could work on to become a more established kernel engineer, and I gave him the eventfs work. He got it working, but as I was his only interface to the kernel community, I guided him incorrectly, still thinking it was OK to use a dentry as a handle. The incorrect code was my fault, not his.
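To illustrate the allocate-on-reference idea in a self-contained way (this is a toy userspace analogue, not the real eventfs code, which deals with actual dentries and inodes):

    #include <stdio.h>
    #include <stdlib.h>

    /* Stand-in for the dentry + inode pair the VFS allocates per file. */
    struct heavy {
        char name[64];
        char bookkeeping[1024];   /* locks, list heads, cache state, ... */
    };

    /* All that eventfs really needs to keep per file up front. */
    struct descriptor {
        const char *name;
        struct heavy *cached;     /* NULL until the file is first looked up */
    };

    static struct heavy *lookup(struct descriptor *d)
    {
        if (!d->cached) {         /* first reference: materialize it now */
            d->cached = calloc(1, sizeof(*d->cached));
            if (!d->cached)
                exit(1);
            snprintf(d->cached->name, sizeof(d->cached->name), "%s", d->name);
        }
        return d->cached;
    }

    int main(void)
    {
        /* ~20,000 event files, but no heavy objects exist yet. */
        static struct descriptor events[20000];

        for (size_t i = 0; i < 20000; i++)
            events[i].name = "enable";

        /* Only files that are actually referenced pay the memory cost. */
        printf("materialized: %s\n", lookup(&events[42])->name);
        return 0;
    }

The savings come from exactly this pattern: hardly anyone ever touches more than a handful of the ~20,000 event files, so hardly any of the heavyweight objects ever need to exist.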
I presented this at the Linux Storage, Filesystem, and Memory Management Summit (LSFMM) in 2022 to get some feedback. I was informed about kernfs, which does pretty much the same thing (but correctly). But when I looked into that code (still thinking it was OK to use a dentry as a handle), it didn't make sense to me, and there was virtually no documentation on how to use it. It's what /sys uses in general, but I couldn't easily see the connection to what I was doing, so continuing with my working prototype using dentries seemed the easier path. Note, all this work was done in the open, where I even posted patches to the file system mailing list. Nobody said I was doing it wrong. I don't blame them; they are just as busy as I am, and my work didn't affect theirs.
When I finally had eventfs passing all my tests, it saved over 22 megabytes per instance. Not a big deal for data centers, but my focus is on low-end Linux devices, where 22 megs makes a difference. This is 22 megs of memory that is totally wasted: it can't be swapped out, so it's basically like telling the kernel never to use that memory for anything. I broke the changes up into two parts, where half went into 6.6 and the other half into 6.7. During the 6.8 development cycle, Linus noticed my use of dentries and told me I was doing it "wrong". Having worked on this for 4 years without anyone once telling me that using a dentry was bad, and seeing that debugfs did the same, I never fully understood what problem Linus had with it. This miscommunication escalated to the point where I was starting to annoy Linus. One of the changes Linus told me to make was to give all the inode numbers the same value. An inode number is a unique number that every file and directory gets in a file system; for real file systems it makes sense, but for a pseudo file system like tracefs it's meaningless. I was concerned that this might break user space, and was rather surprised when Linus told me not to worry about that. Basically, he told me to try it and see what breaks.
Then Linus had a very bad week. The Linux kernel development process starts with a two-week "merge window" where maintainers may send Linus all their new features. After the merge window closes, only bug fixes are allowed; the release-candidate process starts and goes on for 7 to 8 weeks, and when the new release is out, the merge window for the next release starts. During this merge window, Linus pulled in a scheduler change (not my code) that caused a regression on his machine: his builds took twice as long to finish. The regression only appeared on his machine, and others were not able to reproduce it. I could tell Linus was debugging it himself, because the rate of pull requests going into Linux was drastically slower than in other merge windows. Then Linus lost power for 4 days, right in the middle of the merge window. When his power came back, the scheduler bug was fixed, and he rushed to get all the other pull requests in without extending the merge window, as a lot of people depend on this cycle. After the merge window closed, one of the first bug reports to come in was against that "same inode number" change I did for Linus. It broke the find utility: find checks directory inode numbers to make sure it's not walking in circles, and with all directories having the same inode number, find thought it was looping and complained about it.
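To make the breakage concrete, here is roughly how a directory walker can detect file system loops (a simplified sketch, not GNU find's actual code): remember the (device, inode) pair of every directory on the current path, and bail if a subdirectory repeats a pair already seen. If every directory reports the same inode number, the very first subdirectory already "matches" its parent:

    #include <stdio.h>
    #include <sys/stat.h>

    #define MAX_DEPTH 128

    static struct { dev_t dev; ino_t ino; } path[MAX_DEPTH];

    static int enter_dir(const char *name, int depth)
    {
        struct stat st;
        int i;

        if (stat(name, &st) < 0 || depth >= MAX_DEPTH) {
            perror(name);
            return -1;
        }

        /* Seen this (device, inode) pair higher up the current path? */
        for (i = 0; i < depth; i++) {
            if (path[i].dev == st.st_dev && path[i].ino == st.st_ino) {
                fprintf(stderr, "%s: filesystem loop detected\n", name);
                return -1;
            }
        }
        path[depth].dev = st.st_dev;
        path[depth].ino = st.st_ino;
        return depth + 1;
    }

    int main(void)
    {
        int depth;

        /* With all-equal inode numbers (as in the broken tracefs), the
         * second call would already report a bogus loop. */
        depth = enter_dir("/sys/kernel/tracing/events", 0);
        if (depth > 0)
            enter_dir("/sys/kernel/tracing/events/sched", depth);
        return 0;
    }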
So the first thing Linus told me to do was to use a simple counter to create the inode numbers. I did that, but I also took a look at the inode number generator my code was originally using, called get_next_ino(). I noticed that it does a nifty trick: it keeps per-CPU counters and hands each CPU a batch of numbers at a time. The batch size is a power of two, so to avoid races between CPUs it only needs an atomic_add_return() (an expensive CPU operation) when a CPU exhausts its batch. I saw that and, being a real-time embedded developer at heart, thought to myself "Damn, that's nifty", and replaced my simple counter, which did an atomic_add_return() for every new inode number, with that. Well, atomic_add_return() may be an "expensive" CPU operation, but that just means it takes several CPU cycles. For my use case, the difference would never show up as an improvement, so using it was complete overkill and added complexity. Because of that, Linus said I didn't understand the function. It wasn't that I didn't understand how it worked; it was that I didn't understand that this wasn't the right place to use it. He was right. I shouldn't have used it, but I did so out of habit of reaching for optimized code when I can.
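Here's the shape of that trick as a userspace sketch (the real get_next_ino() lives in fs/inode.c and uses per-CPU variables; a per-thread counter stands in for them here, and I've left out the detail that the kernel version also skips inode number 0 on wraparound):

    #include <stdatomic.h>
    #include <stdio.h>

    #define BATCH 1024   /* a power of two, so the mask test below works */

    static atomic_uint shared_last_ino;
    static _Thread_local unsigned int last_ino;

    /*
     * Each thread reserves BATCH numbers at a time, so the expensive
     * atomic operation happens once per BATCH allocations instead of on
     * every call.  (The simple counter this replaced is just
     * atomic_fetch_add(&shared_last_ino, 1) on every call.)
     */
    static unsigned int get_next_ino(void)
    {
        unsigned int res = last_ino;

        if ((res & (BATCH - 1)) == 0)   /* batch exhausted (or first call) */
            res = atomic_fetch_add(&shared_last_ino, BATCH);

        last_ino = res + 1;
        return last_ino;
    }

    int main(void)
    {
        for (int i = 0; i < 5; i++)
            printf("%u\n", get_next_ino());
        return 0;
    }

For a counter that is only bumped when a tracing file is created, the per-call atomic version is already plenty fast; the batching only pays off when many CPUs are allocating inode numbers concurrently, which is exactly why it was overkill here.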
Steven "Roasted" Rostedt