Re: Machine learning
Try to copy each folder, and the OS dies in millions of file lookup table edits and writes.
The issue here is that AT ALL TIMES the system needs to ensure that the filesystem is in some semblance of a consistent state. So for each file, it needs to (sketched in code after this list):
• find some empty space and allocate it - and make sure that anything else looking for space now knows that it's not free
• copy the data into that free space
• add all the information that the filesystem needs in order to know where the file is, etc.
• add in all the other stuff about the file - attributes, directory entry, and so on.
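To make that concrete, here's a rough Python sketch of the "do the standard process for each file" loop - the paths are made up, and the comments just map each call onto the steps above:

```python
import os
import shutil

def copy_tree_naively(src_root, dst_root):
    """Copy every file under src_root to dst_root, one at a time.

    Illustrative sketch only: src_root and dst_root are placeholder paths.
    """
    for dirpath, dirnames, filenames in os.walk(src_root):
        rel = os.path.relpath(dirpath, src_root)
        target_dir = os.path.join(dst_root, rel)
        os.makedirs(target_dir, exist_ok=True)   # directory entries for the tree

        for name in filenames:
            src = os.path.join(dirpath, name)
            dst = os.path.join(target_dir, name)

            # Creating and writing the file forces the filesystem to allocate
            # free space, copy the data into it, and record where the file
            # lives - once per file, a million times for a million files.
            with open(src, "rb") as fin, open(dst, "wb") as fout:
                shutil.copyfileobj(fin, fout)

            # The remaining metadata: timestamps, permission bits, and so on.
            shutil.copystat(src, dst)
```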
Yes, in theory you could write code that would recognise that you are copying a million tiny files - and do each step a million times before moving on to the next step - but that would be relatively complicated code compared to doing "for each of a million files, do this standard process".
So yes, if you copy your ZIP file, there's one space allocation, one data copy, one metadata creation and update. But when you copy the files individually, each of those steps will be performed for each file.
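If you want to see that per-file overhead for yourself, a toy comparison like this (sizes and filenames made up, and the numbers will vary wildly with filesystem, disk and cache state) writes the same amount of data once as one big file and once as thousands of small ones:

```python
import os
import time

def time_one_big_file(path, total_bytes):
    # One allocation, one data stream, one set of metadata updates.
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(b"x" * total_bytes)
    return time.perf_counter() - start

def time_many_small_files(directory, count, size_each):
    # The same amount of data, but every file repeats the whole
    # allocate / write / metadata cycle.
    os.makedirs(directory, exist_ok=True)
    start = time.perf_counter()
    for i in range(count):
        with open(os.path.join(directory, f"file_{i:06d}"), "wb") as f:
            f.write(b"x" * size_each)
    return time.perf_counter() - start

if __name__ == "__main__":
    total = 10_000 * 4096          # roughly 40 MB either way
    print("one big file :", time_one_big_file("big.bin", total))
    print("many small   :", time_many_small_files("small_files", 10_000, 4096))
```

On most setups the small-file case comes out far slower, even though the total amount of data is identical.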
If you take a step back, it would be "rather difficult" to reliably handle your "copy a million files more efficiently" process. The main issue is that the filesystem code (the bit that's doing all these complicated updates to the filesystem data structures) doesn't know what's being copied. Typically you'll invoke a user space program that does the "for each of a million files; copy it" part - and there are a number of different user space programs you might use. Underneath that, the filesystem code just gets a call via a standard API that (in effect) says "here's a chunk of data, please write it to a file to be called ..." - so all it can do is create each file when it's told to.
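To illustrate how narrow that API really is, here's roughly what any of those user space programs boils down to per file - just POSIX-style open/read/write/close via Python's os module; real tools add error handling, sparse-file support and so on:

```python
import os

def copy_one_file(src, dst, bufsize=1 << 20):
    """All the filesystem layer ever sees: 'create this file, here is its data'.

    Whether the caller is cp, rsync, or a script looping over a million paths,
    each file arrives at the filesystem through calls like these.
    """
    fd_in = os.open(src, os.O_RDONLY)
    fd_out = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        while True:
            chunk = os.read(fd_in, bufsize)
            if not chunk:
                break
            os.write(fd_out, chunk)
    finally:
        os.close(fd_in)
        os.close(fd_out)
```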
The more efficient "allocate space for a million files, copy the data into that space, create all the file metadata for those files" process could be done - but it would be harder to get right and hence more error prone.
You could write your copy program so that it will (see the sketch after this list):
• Call the appropriate API to create each file and specify how big it will be
• Copy the data into each file
• Set the metadata on each file
• Close all the files
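A very rough Python sketch of that hypothetical batched copier might look like this - `pairs` is a made-up list of (source, destination) paths, and posix_fallocate() is Linux-specific, so treat it as an illustration of the idea rather than production code:

```python
import os
import shutil

def batched_copy(pairs):
    """Do each step for every file before moving on to the next step."""
    handles = []

    # 1. Create every destination file and reserve its full size up front.
    for src, dst in pairs:
        size = os.stat(src).st_size
        fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        if size:
            os.posix_fallocate(fd, 0, size)   # Linux-only space reservation
        handles.append(fd)

    # 2. Copy the data into each reserved file.
    for (src, _dst), fd in zip(pairs, handles):
        with open(src, "rb") as fin:
            while True:
                chunk = fin.read(1 << 20)
                if not chunk:
                    break
                os.write(fd, chunk)

    # 3. Set the metadata on each file.
    for src, dst in pairs:
        shutil.copystat(src, dst)

    # 4. Close all the files.
    for fd in handles:
        os.close(fd)
```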
Sounds simple - but you need to consider system constraints such as limits on open files. So next you end up having to split large lists of files into smaller groups, and the task gets more and more corner cases to handle.
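For example (a sketch only - the headroom number is plucked out of the air), you could read the process's open-file limit and chop the work into groups that fit under it:

```python
import resource

def batches_bounded_by_fd_limit(pairs, headroom=64):
    """Split (src, dst) pairs into groups small enough to hold open at once.

    The real limit also has to cover descriptors the process already holds
    (stdin/stdout, logs, the source files while copying, ...), which is what
    the 'headroom' fudge factor hand-waves over.
    """
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    group_size = max(1, soft_limit - headroom)
    for i in range(0, len(pairs), group_size):
        yield pairs[i:i + group_size]
```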
Caching complicates the process, as does journalling. Once you have caching, there's the risk that data you have written into the filesystem doesn't get written to the disk (e.g. on power failure) - and worse, that some of it might be written while other parts aren't, potentially with the data handled out of order.
So for example, it would be possible for the directory entry to be created etc - but without all the data actually having been written into the file. So the user sees a file (after things have been cleaned up after the power failure) - but what's in it isn't what should be in it. This is just a simplistic example.
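This is why programs that really need data on the disk call fsync(). A minimal Python sketch, assuming a POSIX system:

```python
import os

def write_durably(path, data):
    """Force the data (and the directory entry) out of the cache to the disk.

    Without the fsync calls, a power failure can leave the directory entry
    present but the file contents missing or partial - exactly the situation
    described above.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)                      # flush the file's data and metadata
    finally:
        os.close(fd)

    # On many filesystems the new directory entry is only durable once the
    # directory itself has been synced as well.
    dir_fd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```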
So journalling deals with this by writing a journal of what's being changed, so that after a crash the journal can be used to either complete an operation or roll back the bits that did happen - so either the file was copied or it wasn't, with no "it's there but it's corrupt" option. Again, somewhat oversimplified, but you should get the idea.
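Here's a toy illustration of that write-ahead idea - it's an application-level journal with made-up file names, not how a real filesystem journal is implemented, but the "record the intent, do the work, record the commit" shape is the same:

```python
import json
import os

JOURNAL = "copy.journal"   # made-up name for this toy example

def journalled_copy(src, dst):
    """Toy write-ahead journal around a single file copy."""
    # 1. Record what we are about to do, and make the record durable.
    with open(JOURNAL, "a") as j:
        j.write(json.dumps({"op": "copy", "src": src, "dst": dst,
                            "state": "begin"}) + "\n")
        j.flush()
        os.fsync(j.fileno())

    # 2. Do the actual work. Write to a temporary name, then rename into
    #    place, so a crash mid-copy never leaves a half-written dst visible.
    tmp = dst + ".tmp"
    with open(src, "rb") as fin, open(tmp, "wb") as fout:
        fout.write(fin.read())
        fout.flush()
        os.fsync(fout.fileno())
    os.replace(tmp, dst)

    # 3. Record that the operation completed.
    with open(JOURNAL, "a") as j:
        j.write(json.dumps({"op": "copy", "src": src, "dst": dst,
                            "state": "commit"}) + "\n")
        j.flush()
        os.fsync(j.fileno())
```

On restart, any "begin" record without a matching "commit" tells you which copies to redo or clean up - and notice how many extra syncs and writes the journal itself costs.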
But journalling adds considerably to the disk I/O needed - there's no free lunch: what you gain in filesystem resilience you lose in performance. Different types of filesystem make different tradeoffs in things like this.