Interesting... how does this compare to other file systems? (i.e. non-windows)
Working with 60 million files pushes the boundaries of any storage. Windows underpins most of my storage, so the difference between the theoretical and practical limits of NTFS and Distributed File System Replication (DFSR) on the number and size of files they handle is important …
"Linux does not fragment files as much as NTFS, so even if other limits are similar then the lag of fragmentation is a bonus surely"
Linux not fragmenting files is the worst MYTH/FUD known to man.
If it was true the next gen filesystems (ext4, btrfs et al) wouldn't have online defrag. Oh wait! They do!
Anybody who repeats such rubbish really doesn't have a clue. The fact is that all filesystems fragment; the question is how much, how good the algorithms for preventing it are, and whether there are tools and opportunities for sorting it out. Linux historically has been a bad, no and no respectively. Like you say - no metrics.
Giving an example that's only useful in 1964 doesn't prove anything, move outside data sizes that still work on punch cards and it all breaks down.
Don't confuse "nobody really cares" with "no, it doesn't".
Sure, a lot of filesystems make an effort, but if you try to write a 0.5TB file to a disk that only has a bunch of 50GB spaces, guess what happens. It gets chopped up into tiny little pieces and scattered around, on any filesystem.
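To put a number on the 0.5TB example, here is a minimal sketch of the arithmetic. It assumes a made-up disk with twelve free gaps of 50GB each; the function name and the gap list are purely illustrative, and no real filesystem's allocator is being modelled. It only shows the lower bound on the number of pieces, whatever the filesystem does.

```python
def min_fragments(file_size_gb, free_extents_gb):
    """Return the fewest pieces a file must be split into.

    Greedy: fill the largest free extents first, which minimises the
    number of pieces needed for a given file size.
    """
    remaining = file_size_gb
    pieces = 0
    for extent in sorted(free_extents_gb, reverse=True):
        if remaining <= 0:
            break
        remaining -= extent
        pieces += 1
    if remaining > 0:
        raise ValueError("not enough free space for the file")
    return pieces


# Hypothetical disk: twelve free gaps of 50 GB each, one 500 GB (0.5 TB) file.
print(min_fragments(500, [50] * 12))  # 10
```

So the file ends up in at least ten pieces no matter how clever the allocator is. Whether ten 50GB extents count as "tiny little pieces" is a separate argument.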
Like I said, linux filesystems are getting online defrag tools for a reason - cos there's finally filesystem developers with a clue trying to sort through the fanboy (non-dev) bullshit. I'm not selling any filesystem, I'm just saying there's actually nothing you can do about it in real-world data use.
The question really is how much it matters.
Not for nothing but Microsoft didn't stop writing NTFS after v1 either...
@streaky:
Wrong. ext* allocators are much more sensible than NTFS ones. NTFS still uses braindead DOS-aged allocator mechanisms a lot of the time (e.g. put the file in the first available block). Linux, OTOH, tends to put files in block groups. Files aren't put consecutively on disk, so there is plenty of empty space between where one file ends and another begins (free space allowing, of course). That means that the files themselves aren't internally fragmented, which is the big problem (the fact that files aren't contiguous is much less of a problem, especially on a multi-processing multi-user system). You have to hit FS usage of 90%+ before fragmentation on ext* starts to rise above low single figures. On NTFS, fragmentation will go into double figures much more quickly.
You're so right! Unfortunately in the real world the sysadmins aren't in charge of such decisions in every scenario. My evil schemes for Linux (or better yet, Solaris) file servers are constantly shot down. I get away with Openfiler VMs for specific purposes only in circumstances where nobody but IT will be using the file system.
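The difference between the two placement policies described above can be shown with a toy model. This is only a sketch under loud assumptions: a 64-block "disk", four files growing one block at a time in turn, and two made-up policies named "first_fit" and "grouped". It is not the actual NTFS or ext* allocator code, just the idea of "first available block" versus "leave room after each file".

```python
def count_fragments(blocks):
    """Count contiguous runs in an ascending list of block numbers."""
    if not blocks:
        return 0
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)


def simulate(policy, n_files=4, appends=8, disk_size=64):
    """Grow n_files files one block at a time, round-robin, on a toy disk.

    policy "first_fit": every append takes the lowest-numbered free block.
    policy "grouped":   each file starts in its own region and appends go
                        to the next free block after the file's last block.
    Returns {file_id: fragment_count}.
    """
    disk = [None] * disk_size            # None marks a free block
    files = {f: [] for f in range(n_files)}
    group_size = disk_size // n_files    # toy stand-in for a block group

    for _ in range(appends):
        for f in range(n_files):
            if policy == "first_fit":
                start = 0
            elif files[f]:
                start = files[f][-1] + 1
            else:
                start = f * group_size
            # Scan forward from the preferred position for a free block.
            idx = next(i for i in range(start, disk_size) if disk[i] is None)
            disk[idx] = f
            files[f].append(idx)

    return {f: count_fragments(blocks) for f, blocks in files.items()}


print(simulate("first_fit"))  # {0: 8, 1: 8, 2: 8, 3: 8}
print(simulate("grouped"))    # {0: 1, 1: 1, 2: 1, 3: 1}
```

With interleaved writers, first-fit leaves every file in eight pieces, while spreading the files out keeps each one contiguous until its region fills up. That last condition is the "free space allowing" caveat.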
CTO doesn’t understand Linux /at all/ and so doesn’t believe that you can secure it (Active-Directory-integrated share and file permissions) the same way that you can Windows. So for myself and others in such situations…there are articles on NTFS.
In the end though, ZFS > *
I know it's being worked on, but Linux doesn't have ZFS yet. Also, please don't advocate ext3 or ext4. Both are just terrible hacks bolted on top of ext2, which is itself a terrible hack. Granted, this is Linux you're talking about, so I suppose all this talk of terrible hacks is superfluous.
There are, of course, a few other things that *do* have ZFS already. Perhaps you should go advocate one of those.
Can you explain what about ext2 is a hack? It is based on the basic principles of UNIX file systems going back to year dot. Its design is actually quite similar to UFS. ext3's journalling works pretty much the same way as UFS's logging (using UFS as an example since your comment indicates some Solaris fanboyishness), and ext4 brings along extent based file allocations. There is nothing at all hacky about the design of ext* file systems.
ZFS also isn't particularly unique. BTRFS is almost cooked, and will be in the next RHEL release. Nor is ZFS particularly performant. Convenient, sure, but not performant. For actually stressful loads (e.g. heavy DB usage), UFS is considerably faster.
But can you use other FS's on Windows? Not in a hacky "yes, it's just about possible" way - but are there production-ready alternatives? Coming from a more Linux-centric world where you can often choose a filesystem to suit the workload - I'm just curious. I seem to remember that Veritas had a version of vxfs for Windows - but I haven't really looked at that for years.
Anyway - just curious :-)
Definitely not. It takes a *lot* of testing over a long period to prove that a file-system is safe, so if there was an alternative FS out there with a user base large enough to make using it on an important server anything other than a career limiting move, then you'd have heard about it already.
I use the IFS freeware for when I sometimes need to pull stuff from my Linux system when I've dual-booted into XP. It works well and mounts the file system as a new drive letter. It's been fine for me and it's very easy to use, but I don't know if it's good enough for professional purposes or not as I haven't used it that much.
Dual boot is becoming such a pain (I used to use Linux exclusively, but I'm finding the latest MS Office too good to pass up), that I'm actually whacking on a Linux install in a VM on my system and letting *it* mount my /home/h4rm0ny folder.
I want ext4 in Windows 7, or BTRFS at some point. When I have a spare couple of months and the stomach for it, maybe I'll write it myself. ;)
Anyway, check out the IFS plugin at: http://www.fs-driver.org/. It's actually for ext2, but obviously that still has some utility with ext3.
Very briefly (in the days of NT4, IIRC, maybe w2k) you could get VxFS for Windows, I think it was part of the Veritas Foundation Suite. Someone from Veritas told me that the file system was pulled because MS weren't very impressed and Veritas were writing the disk subsystem for MS so didn't want to make any trouble. How true this is is another matter, but I have no reason to believe it isn't. Also Windows is designed to be able to access multiple filesystems, in the days of NT4, it only natively supported FAT16, but you could get FAT32 drivers from Sysinternals (again IIRC).
defragment with consolidate before each backup
use smaller volumes
zip the user's data up
or just ftp the files to a Linux backup server as suggested in the previous article; it's got to be cheaper than buying a Windows-based backup application
rearrange the following phrase "goats. Microsoft blows"
Linux seems to scale far better than Windows with large numbers of files. This article on LWN covers experiments to put 1 billion files onto Linux filesystems.
http://lwn.net/Articles/400629/
The basic conclusion is that you can put 1 billion files on a Linux filesystem, but you require a lot of memory to check the filesystem (10-30 GB depending on filesystem type). That is far less than the requirement listed above for a mere 60 million files on Windows.
The one Linux filesystem, notorious for its capability to work with myriads of small files, is ReiserFS. There are downsides though: the stable ReiserFS v3, included in the vanilla Linux kernel, has a volume size limit of 16 TB. ReiserFS v4 is not in the mainline kernel (in some part due to "functionality redundancy" reasons = inappropriate code structure) and its future is somewhat uncertain - but it is maintained out of tree and source code patches (= "installable package") for current Linux kernel versions are released regularly. Both versions also have other grey corners, just like everything else...
When working with a filesystem that large, I'd be concerned about IOps capability of the underlying disk drives (AKA "spindles"). The question is, how often you need to access those files, i.e. how many IOps your users generate... This problem is generic, ultimately independent of the filesystem you choose.
It's been a while since I read something like this (an article about something where the user has learned the product and THEN read the manual to see what it's lying about).
Nice to have real-world experience and observations like this. Really useful and I want more - what about EXT3, JFS...... let the FS fight begin here.
Do people stop to think before blurting out "use Linux"? If it's a Microsoft shop then introducing Linux into the environment will have additional overheads (staff training, additional monitoring tools, cost of integration). It's cheaper to simply heed the practical advice in the article and manage the file system accordingly going forwards.
In the days when computer professionals were either System Engineers or users this conversation would never have come up.
If it didn't exist and it was needed, you were expected to create it, not bitch about how rubbish the OS development team are.
M$ have managed to create a generation of people who are responsible for maintenance and control of expensive hardware but don't know or care how it really works. In most UK system administration departments internal development is treated as suspect and hence they rarely produce more than scripts or download some badly fitting outside application to almost meet the need.
Well, this is the price industry pays if they go the M$ route, yes support is cheap but as the saying goes "you get what you pay for". M$ compatibility has reduced bespoke software to configuration of M$ products and we all know how reliable they are. If you want it done right get someone in who knows how it works from the bottom up.
It should not come as a shock to anyone that M$ products do not do what it says on the tin; they never did.
My my - looks like somebody is used to having god-control over their network, and likes to be paid obscene amounts of money to look over highly tweaked setups since no-one else would have a clue what they had done (documentation is doubtful, as that would mean someone could replace you!)
Not every company has £60K+ (minimum) for a top-end *nix IT Admin whose main goal in life is reading man pages and writing drivers. And they don't want to outsource it since:
1. They will charge a fortune in the beginning to "inventory" their network
2. Rip-n-Replace with a completely new setup that the customer is unfamiliar with, and is black-box to all but the new incumbent
3. Bugger all once the money for support contracts dries up, leaving them in the sh*t
People use the tools they are comfortable with, with the skills and budget they have, to achieve what they need to do on a daily basis... it's not always the same choice as your scenario - get over it.
FYI documentation was always part of the job, and to replace me they would need someone at least equally knowledgeable or a few months later they are buying in expensive consultants. I know the consultants are the result as this has happened; it wasn't because I set them up and hoarded knowledge, it was simply that no one else was willing to read and implement my docs. I know that on just one system they paid out £80K over 8 months in consultants before they managed to find 4 "experts" to replace me. Needless to say the manager who thought up the idea of my leaving is no longer with the company.
I do however agree that not every company can afford £60K+ to be certain that IT is not something they have to worry about. I am sure that there are many companies where data loss/theft is not a problem and a web/email presence is not necessary, but then again these companies would probably be better off using a paper based office.
When it comes to a company of any real size, however, paper based offices are no longer financially viable. Larger companies should see £60K+ as nothing against the potential losses of bad IT, bearing in mind that this cost includes not only the maintenance/security of infrastructure but ongoing individual training for their other employees.
I too agree getting in many of the outsourcing companies is simply throwing money, experience and ownership away, worse most are staffed by the very point and click "experts" I was referring to and are learning the job as they go.
"People use the tools they are comfortable with", true, however, as the UK education system has gone from computer science to computer studies to ICT the majority of people you are referring to have only been exposed to M$ products and this includes many IT professionals. The answer to the problem in companies is training and good software that meets the user's requirements coded by someone who has met the intended audience and implemented their needs. The answer to the educational system is stop training our kids to be M$ data input clerks if they can be something more.
The real question is whether a company of any size can continue to afford not to buy in real expertise along with the other insurance they need to protect them.
Our Unix guy isn't here right now, so I can't get all the grisly details, but our fileserver gave us equally unexpected and unpleasant results when we started tripping over its hidden limits.
We have a Unix box attached to a SAN with approximately 4TB storage, which at the time we considered ample capability for its job. It didn’t take long before we hit the dreaded inode limit.
As I say, I can’t remember all the precise details, but the EXT filesystem, formatted over a certain size (1 or 2 TB), assumes that the average size of each file will be around 1GB, and therefore allots an appropriate number of inodes for this assumption.
If your average filesize is closer to a few KB, your inodes run out loooong before you reach the drive's space capacity. When he investigated this, he naturally assumed he had chosen an incorrect option when creating the partition. After significant research, he discovered this was the default, and only, configuration.
"After significant research, he discovered the was the default, and only configuration."
-i bytes-per-inode
Specify the bytes/inode ratio. mke2fs creates an inode for
every bytes-per-inode bytes of space on the disk. The larger
the bytes-per-inode ratio, the fewer inodes will be created.
This value generally shouldn’t be smaller than the blocksize of
the filesystem, since then too many inodes will be made. Be
warned that is not possible to expand the number of inodes on a
filesystem after it is created, so be careful deciding the correct
value for this parameter.
@Psymon: if you can't remember the details correctly, and the person who actually did it isn't around for you to ask, why bother posting at all? Clearly you haven't got useful information.
ext3 (and I think ext2) filesystems default to 16KB per inode for large filesystems, and less for smaller ones. Specifically, to 4 blocks per inode, and 4KB block size (max). Meaning that if your average file size is over 16KB, you will never run out of inodes (since you will fill the disk first instead). If your filesystem was created with 1GB per inode, it must have been created as some bizarre experiment, since after all the disk space saved by pushing the inode limit down is minimal.
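For a sense of scale, the bytes-per-inode arithmetic can be sketched as below. The helper names are made up for illustration, the 4TB figure is taken from the earlier post, and the sums ignore filesystem overhead and block rounding, so treat the results as ballpark figures rather than what mke2fs would print.

```python
KIB = 1024
GIB = 1024 ** 3
TIB = 1024 ** 4


def inode_count(fs_bytes, bytes_per_inode):
    """Approximate inodes created for a given bytes-per-inode ratio."""
    return fs_bytes // bytes_per_inode


def runs_out_first(avg_file_bytes, bytes_per_inode):
    """Which resource is exhausted first for a given average file size."""
    return "inodes" if avg_file_bytes < bytes_per_inode else "disk space"


fs_size = 4 * TIB  # the roughly 4TB volume mentioned earlier in the thread

print(inode_count(fs_size, 16 * KIB))      # 268435456
print(inode_count(fs_size, 1 * GIB))       # 4096
print(runs_out_first(4 * KIB, 16 * KIB))   # inodes
print(runs_out_first(64 * KIB, 16 * KIB))  # disk space
```

At one inode per 16KB a 4TB volume gets roughly 268 million inodes; at one per 1GB it would get about four thousand, which is why that ratio is hard to believe as a default.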
Also "After significant research, he discovered the was the default, and only configuration" is utter claptrap. See the -i option of mke2fs.
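If you want to know how close a mounted filesystem is to its inode limit before it bites, something like the following works on Unix-like systems. It is a small sketch using Python's standard os.statvfs call; the function name and the returned dictionary layout are my own choices, and some filesystems report zero total inodes, which is why the percentage can come back as None.

```python
import os


def inode_usage(path="/"):
    """Report inode totals for the filesystem containing path (Unix only)."""
    st = os.statvfs(path)
    total = st.f_files   # total inodes on the filesystem
    free = st.f_ffree    # free inodes
    used = total - free
    # Some filesystems allocate inodes dynamically and report 0 here.
    percent = (used / total * 100) if total else None
    return {"total": total, "used": used, "free": free, "percent_used": percent}


if __name__ == "__main__":
    print(inode_usage("/"))
```

This reads the same counters that `df -i` reports, so it is easy to drop into a monitoring script.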
I use ImDisk quite a lot for this reason. Anything that involves lots of small files tends to be directed to one of a few virtual hard disks. There's costs, and it won't work for everyone, but it solves certain problems quite well. In particular, it's OK for a desktop system with a lot of files - but I can't see it being useful on a server. Sure, servers are often virtual, but that's a different thing.
Before I started doing this, deleting an old set of Doxygen files for a large app/library might take several minutes. Now, I just unmount the virtual drive and delete the image file - a batch file can do that in a fraction of a second.
Also, I mostly don't have any of these virtual drives mounted - so no RAM wasted on caching something I'm not using, MFTs etc included. It's *very* rare that I have more than one mounted at a time.
Uh, can we have some references for the assumptions made in this article? I think you're confusing cache hits with loading the entire MFT into RAM. I've never heard of this being an issue before so I had a search in Google and the only place I can find it mentioned is... this article.
I nearly had a career limiting event caused by an unknown (to me) ZFS limitation recently.
ZFS performance drops off considerably when using snapshots AND your used space is greater than 50% of the total pool capacity.
So when/if you buy a storage product using ZFS as the file system, don't listen to the 'Experts'; buy big, much bigger than you need, or it will bite you.
Anon - We settled and I promised not to bad mouth them.
ZFS performance drops off when using snapshots. Any snapshot-capable file system (or LVM for that matter) has that problem, especially if you are mounting your snapshots elsewhere, and especially if you are using them as writeable. The reason for this is that for every snapshot you have, you have to write additional undo-logs for every FS write. Performance degradation with snapshots is linear with the number of snapshots you have.
"Do people stop to think before blurting out "use Linux""
Yes I stop to think. And I say it anyway, because it doesn't have these ridiculous limits. I don't have any filesystems with 60,000,000 files, but I do have one with 1.4 million, the machine has 448MB of effective RAM (512, and it's pulling 64MB of that for video RAM.) I can assure you it doesn't use 100s of MB of RAM to keep track of those files, and there's no speed problems accessing files. I'm using ext3.
From a design standpoint, ext2/ext3/ext4 uses inodes and trees, there's not some huge bitmap that has to be crammed into RAM. From a practical standpoint:
http://events.linuxfoundation.org/slides/2010/linuxcon2010_wheeler.pdf
Using some custom-built disk array Red Hat had, they tested a few filesystems with 1 billion files. (1000 directories with 1 million files apiece.) With ext4, mkfs took 4 hours, it took 4 days to make 1 billion files (ext3 made files about 10x faster), and fsck'ing the filesystem with 1 billion files took 2.5 hours and 10GB of RAM. They are now working on a patch to fsck to hugely cut RAM usage; it turns out there's nothing inherent in fsck that needs that much RAM, it just hasn't been optimized to reduce RAM usage as yet. The actual usage of the filesystem did not use abnormal amounts of RAM.
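Turning the ext4 figures quoted above into rates makes them easier to compare. This is plain arithmetic on the numbers in this post; the one assumption is that "10GB" means 10 GiB.

```python
FILES = 1_000_000_000

create_seconds = 4 * 24 * 3600      # 4 days to create the files
fsck_seconds = 2.5 * 3600           # 2.5 hours to fsck them
fsck_ram_bytes = 10 * 1024 ** 3     # assumption: "10GB" read as 10 GiB

create_rate = FILES / create_seconds          # files created per second
fsck_rate = FILES / fsck_seconds              # files checked per second
fsck_ram_per_file = fsck_ram_bytes / FILES    # bytes of fsck RAM per file

print(round(create_rate))           # 2894
print(round(fsck_rate))             # 111111
print(round(fsck_ram_per_file, 1))  # 10.7
```

So even the unoptimized fsck works out to roughly 11 bytes of RAM per file, and that is only needed while checking, not in normal operation.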
http://old.nabble.com/2GB-memory-limit-running-fsck-on-a-%2B6TB-device-td17737880.html
Here's someone that had (at the time) 113 million files on an ext3 filesystem in a box with 4GB of RAM. No sweat in normal operation -- they had trouble with fsck though because they had a 32-bit kernel and fsck wanted to use >2GB of RAM. But they did have 2 ways around it (first, run a 64-bit kernel, fsck in fact only needed a few hundred MB more RAM. Second, fsck has an option (that I didn't know about 8-) ) to write temp files out to a filesystem (obviously not the one being checked..) instead of keeping stuff in RAM. Apparently this is not accessed too randomly so it only slows the check by about 25% even using a regular hard disk.)
Sorry, but if NTFS really needs 1GB of RAM per million files (more or less)... well, wow. Just wow.
"Sorry, but if NTFS really needs 1GB of RAM per million files (more or less)... well, wow. Just wow."
NTFS does not *need* 1GB of RAM per million files. As I said in the article, it will work just fine without that RAM. While NTFS can of course work just fine with as little RAM as any other file system, you suffer a (fairly massive) performance penalty for not following the stated 1GB per Million files rule of thumb. The scenarios I find this to be true in are:
a) You generally access all of the files on your drive in a reasonably short timeframe. (For example, a nightly backup crawl, or a large website where virtually everything will get read at least once during the course of a day.) Remember that your first access to a file will incur a hit, because the MFT must be read from disk in order to find the file. Subsequent access to that file will naturally be significantly faster. (NTFS MFT records are huge!)
b) You have lots of medium-sized files. Small files (less than 2K?) are actually stored WITHIN the MFT itself. This is optimal from an IOPS standpoint: read the MFT record and you read the file. The real advantage shines when Windows caches the MFT information into RAM; by doing so it’s also caching the data for that file! By contrast, larger files (say JPEGs or other things in the 10s of kB or higher) don’t live inside their MFT records. Reading them means loading a huge MFT record (or more, if they are heavily fragmented) as well as the data. The proportion of MFT to data on medium-sized files is enough that you will seriously notice the speed difference of “enough RAM” versus “not enough RAM” in any scenario where you are reading the same files more than once, or in a multi-user environment. Large files (100M+) are largely immune to this unless heavily fragmented because the proportion of MFT to data is so small.
c) You have random accesses to files because many users are constantly hammering the same volume. While there are limitations based entirely on the spindles themselves, not having to send the heads flying back to the beginning of the drive every few milliseconds to read new MFT information makes all the difference in the world. Straight-line read time for linear access may well not be all that affected by lack of RAM; modern spindles can probably feed the drive’s cache fast enough to compensate for the flying heads. Get even five users accessing large numbers of files on different areas of the drive, however, and the performance penalty of these enormous MFT records becomes apparent.
d) You aren’t using flash or at least 10K SAS for your spindles. The more you utilise technologies that have stupidly low latency, the less anything I’ve talked about here matters. In fact, I’d go so far as to say that everything I’ve talked about is almost meaningless when using flash. Flash has no real seek/random I/O penalty. 10K SAS penalties are low enough that you might not notice enough of a difference to be worth putting in that RAM. If your arrays are 7.2K disks, however…you’ll notice it. The longer your seek times, the more everything I’ve talked about makes a huge difference.
Remember; take this all (and the article itself) with a grain of salt. These aren’t the whitepaper numbers or figures. They aren’t the Official Guidance from Microsoft. Microsoft will tell you NTFS can run with virtually no RAM and work just fine. They are indeed correct. You must however be prepared to accept the performance penalty that comes from merely “working” instead of “working remotely close to optimally.”
My rules of thumb for NTFS and DFSR are not the textbook answers. They are the practical ones from over a decade of experience in trying to push my hardware to the absolute max. Not because I am obsessed with getting every erg of performance out of my gear…but because I can very rarely afford new gear. I stand by them, and I would love to see someone prove or disprove them in a real production environment. One where you are hammering the arrays underlying the volumes in question with multiple random accesses 24/7. Most especially one in which almost all files are accessed more than once during the course of a day.
Oh, and one where 60M files live on several volumes located on the same array. 60M files, multiple NTFS volumes, one physical array. Transactions off that array in the 10s of TB/day. This is an extreme example, but one that shows exactly how I can arrive at the data I have. That’s *my* practical environment.
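For anyone who wants to plug their own numbers in, the 1GB per million files rule of thumb reduces to a one-liner. This is only a sketch of the rule as stated in this thread; the function name and the adjustable gb_per_million parameter are illustrative, and it is a performance guideline from experience, not an NTFS requirement or Microsoft guidance.

```python
def rule_of_thumb_ram_gb(file_count, gb_per_million=1.0):
    """RAM suggested by the '1GB per million files' rule of thumb."""
    return file_count / 1_000_000 * gb_per_million


print(rule_of_thumb_ram_gb(60_000_000))  # 60.0
print(rule_of_thumb_ram_gb(1_400_000))   # 1.4
```

The first figure is the 60M-file environment described above; the second is the 1.4 million file ext3 box mentioned earlier in the comments, for comparison.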