Anyone can build something small
Try building storage at scale with a distributed file system and then bet your business on it. Go ahead Martin, build it for your company and show us how it's done. Wait, let me pop some popcorn.
Any half-way competent storage administrator or systems administrator should be able to build a storage array themselves these days. It’s never really been easier and building yourself a dual-head filer that does block and network-attached storage should be a doddle for anyone with a bit of knowledge, a bit of time and some …
I don't know why you posted Anon, because what you've said is spot on.
I mean the author is a homebrew hobbyist. Now there's nothing wrong with that. I've been a homebrew type of guy since my youth and I still run a serious home network.
But at scale... you definitely need some serious $$$ and talent which you get when you buy from a vendor.
10GbE is so last year. In a large-scale clustered build-out, you need to be much faster. Faster still when you start to look at the use of flash outside of the SATA bottleneck and the emergence of ReRAM. (Though don't hold your breath.)
Moore's law may not actually be dead; it just hit a flat spot. Now it's smaller, faster, denser, using less energy and producing less heat.
"....But at scale....." Fair point, but the problem (for the monolithic array manufacturers) is that there aren't enough "scale" customers to go round. It also comes back to the core point of ALL storage considerations - the end user actually wants access to and safe storage of their data and couldn't give a rat's arse if it is on block or file or disk or flash or over Ethernet or FC. Their primary concern is usually cost. If an old PC running Linux will deliver that data with "good enough" performance then the users will accept the old PC. The 10TB of data I used to need a pair of EMC arrays and a 1Gb FC SAN for can now be delivered faster and cheaper by a cluster of two bog-standard x64 servers running something like MS Windows Storage Server over 10Gb Ethernet, and I don't need to pay for an EMC admin either. And x64-based solutions nowadays can provide scale and performance to cover 90% of business solutions, with professional 24x7 support on offer as well. So, yes, scale is still important for the few, but not so important for the majority.
... Also, if you build your own... your storage vendor's SEs and sales reps won't be taking you out for lunch anymore, where they'll tell you about their recent kick-off event in Vegas and the strippers.
And while you (the customer) chew on your Wagyu steak, you hope you too will be taken to their overseas executive briefing center one day, where you can learn about your vendor's vision - in ways that would just not be possible locally. So you buy their product hoping that one day it'll be your turn.
You'll have to have some influence on the procurement decision-making process or it'll never happen. If you work for government, you'll never go. They'll have to make it look like training...
Oh - I forgot - we were talking about data storage?!
Folks, it's all doable, just remember that something as seemingly simple as automatically managed online drive firmware updates can be of paramount importance.
Especially in this age of SSDs, drive firmware is updated rapidly - lots of corruption issues are resolved.
Not being aware of the issues is one problem. Not being able to update the firmware live is a different problem.
Find the release notes for firmware updates for some popular SSDs out there and you'll quickly see what I mean.
The FUD/blame reasoning will always seem to be true for everything one does oneself instead of outsourcing it.
On the other hand, I built three Supermicro Linux storage servers using big tower cases, 3PAR controllers and WD 500GB disks in 2006 for use with AVID video software. Two of them are still in use. The most common issues are fan failures, and 2 disks had to be replaced. Loss of data never occurred.
When building such servers, do not use brand systems like HP/Dell etc., but build one yourself using high-quality parts from Asus, MSI or Supermicro and a full-size tower case. Supermicro offers professional-grade service on its products.
Big tower PC cases offer lots of space to place disks in 3.5inch cages.
Everything like disk cages and SATA cables can be bought.
Also a separate UPS should be included, with software to automatically shutdown the storage server in case of a power failure.
Or if you want a half-Petabyte filestore on a budget, you can just head over to those friendly folks at Backblaze, who built a business on build-it-yourself hardware and even tell the rest of us how to buy one and where to get the parts from.
Buying COTS like Supermicro is a good idea, since it means you can replace/upgrade parts more easily (standard PSU, standard boards, etc). However, this post seems to be advocating bigger chassis being better: that's just not true. You want to move air past your devices and out of the case: bigger is not better. (It's also true that disks still don't dissipate much heat compared to CPUs.)
Any half-way competent storage administrator or systems administrator should be able to build a storage array themselves these days.
It will be down fast though. And then you will find out the management interface is missing, the documentation is missing, the disks have unspecified interface trouble, or the power supply mysteriously doesn't work and there is no maintenance contract. Aieeee!
Of course you could buy an Oracle storage appliance and pay for a system where the management interface is buggy and locks up during problems (not fixed over 5 years of support), where the documentation is incomplete (and then they move/withdraw Sun blogs that answered some of this), the disks have interface problems (oh dear, yes the SATA ones are like that, no fix provided) and the power supplies and other hardware show phantom faults that are, once again, never really explained or fixed.
They had all the bits to make a great and reasonably priced system, but pulled defeat from the jaws of victory by shipping a prototype version and then (largely because of the Oracle take-over) losing key staff and failing to invest enough in fixing it, instead adding tick-box features that the sales folk were asking for.
Now of course Oracle has no interest in the lower priced end of the market, or even of selling storage as an item instead of part of a large profitable database deal. Others have stepped in with the same idea of a ZFS based appliance, but have any of them really sorted out the management and recovery aspects to make it reliable and painless to use?
Also we are seeing longer and longer rebuild times on bigger and bigger HDD, which are still your best bet for GB/£, and ZFS has not got anything like the Dell "data pools" where in effect your RAID stripes are randomly spread over disks in a much bigger pool. Then a failed HDD results in a parallel rebuild of all affected RAID stripes onto other HDDs and you don't have the bottleneck of a single spare/replacement HDD, whose write speed has not kept up with its capacity.
> Also we are seeing longer and longer rebuild times on bigger and bigger HDD,
Ah yes. There's a point.
Though, is that still a problem when you do RAIDZ2?
I usually only do 6-disk RAIDZ2. I've yet to see a failure in the arrays with 6TB disks...
Like RAID-6 it gives you an extra degree of redundancy during a rebuild. And for all of you out there who have seen RAID-5 rebuilds cough blood on sector errors only found during the rebuild and with no parity remaining to correct them, that is vital.
But if you are looking at a week's rebuild time on an 8TB disk under real-life conditions, you still have an uneasy window for something else to go wrong.
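To put rough numbers on that window, here is a back-of-the-envelope sketch. It assumes rebuild time is simply disk capacity divided by the sustained rebuild rate; the two rates are illustrative guesses (an idle array versus one still serving production I/O), not measurements from any particular system.

```python
def rebuild_hours(capacity_tb: float, rate_mb_s: float) -> float:
    """Hours needed to rewrite a whole disk at a sustained rate (decimal units)."""
    capacity_mb = capacity_tb * 1_000_000  # 1 TB = 1,000,000 MB
    return capacity_mb / rate_mb_s / 3600


# Assumed rebuild rates: hypothetical figures chosen only for illustration.
for label, rate in [("idle array, 150 MB/s", 150.0), ("busy array, 15 MB/s", 15.0)]:
    hours = rebuild_hours(8, rate)
    print(f"8 TB disk, {label}: {hours:.1f} h ({hours / 24:.1f} days)")
```

Under those assumptions the idle case finishes in well under a day, while the busy case lands at roughly six days, which is where the "week" figure comes from once the array has to keep serving users.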
I used Openfiler a decade ago with some spare HP JBODs for some dev workloads. As long as it worked there was no issue. Forget about upgrading though (I recall the upgrade path at the time was basically full data migration to avoid loss).
I tried Nexenta a few years ago as JUST an NFS solution for a small file set (under 1TB). On paper it seemed OK. In reality it sucked hard, made worse by non-existent support (yes, I paid for their support and professional services to certify the configuration). I have since heard nasty insider stories about ZFS solutions in general which make me glad I have never considered that as a viable solution. I use ZFS at home and it works fine. And I'm sure it has its use cases at larger scale as well. Keep it away from my VMware and MySQL databases though.
Not even solutions like pure storage are mature enough for my liking.
The more I have learned about storage over the past decade that I have been using it more closely, the more conservative I have become in deploying solutions.
I just need it to work without fuss, and for me and my org's mission-critical data that means 3par on Fibre Channel (3par customer for 10 years now). My experience with 3par is certainly not flawless by any stretch (no solution is perfect). That just reinforces not being interested in taking risks with any other block storage system. (My feature usage of 3par is quite limited, which probably means I encounter fewer bugs.) I love the core of the platform, which is very, very solid.
Now if only HP had a decent NFS offering (StoreEasy and StoreAll don't count). 3PAR NFS is not mature enough and also requires special controller versions that only 1 of my 4 arrays happens to have. Even if all my arrays had them I don't believe it would do the job for what I want.
If I were at a larger org we would have more flexibility in testing other things. As it is, every piece of storage, server and networking is mission critical (all workloads, whether development, QA, testing or production, are consolidated). I have no lower tier of stuff. Maybe at some point, but not yet.
For a while we were pumping 200 million in revenue through 8 DL385G7s (384GB with vmware) and a single small 3par array. Today we are bigger for sure. Not big enough to justify segmented servers or storage though.
1) Don't use de-dupe unless you have absolutely masses of RAM and something like multiple VMs that share a lot in common.
2) Fail over - just don't go there.
So far we have used the Oracle fail-over feature that sucked donkey balls big time. Others have said of other fail-over software that it causes as much down-time as it is supposed to solve. Stopping the "split brain" risk is very hard to do.
You might be better served by having a small separate arbiter (like a Raspberry Pi, etc.) whose sole job is to spot an unusable system, power it down (ILOM command, or network-controlled power strip) and bring up the 2nd head. Syncing the 2nd head's status is another area of pain; again, it is maybe best if the arbiter acts to configure both machines on boot from a central configuration. Yes, you just got a difficult job to implement and form your own start-up...
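The ordering that matters in that arbiter idea can be sketched in a few lines. This is only a sketch of the decision logic: the three callables (`primary_healthy`, `fence_primary`, `promote_standby`) are hypothetical hooks you would have to write yourself around your own ILOM or power-strip commands, and a real arbiter would also pause between probes.

```python
from typing import Callable


def arbitrate(
    primary_healthy: Callable[[], bool],
    fence_primary: Callable[[], bool],
    promote_standby: Callable[[], None],
    checks: int = 3,
) -> str:
    """Promote the standby only after the primary is confirmed powered off."""
    # Require several consecutive failed probes so that a single missed
    # health check does not trigger a failover.
    for _ in range(checks):
        if primary_healthy():
            return "primary-ok"
    # Fence first. If the power-off cannot be confirmed, do nothing:
    # two active heads on the same disks is the split-brain case.
    if not fence_primary():
        return "fence-failed"
    promote_standby()
    return "failed-over"


if __name__ == "__main__":
    # Stub hooks standing in for real probes and power control.
    print(arbitrate(lambda: False, lambda: True, lambda: print("standby promoted")))
```

The design choice is that a failed fence leaves the service down rather than risking both heads writing to the same disks; whether that trade-off suits you depends on how much you fear downtime versus corruption.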
Fail over - just don't go there.
It has always puzzled me that Digital (VAX/VMS) solved this so long ago with the VAXCluster and yet it seems to have been a festering sore for everyone else ever since.
VMSClustering was (and still is) actually far more advanced than mere failover. And the source code was (briefly) there on microfiche for anyone to learn the finer points of the technology, just before the lawyers and corporate types moved in and Digital entered its long slide into oblivion. Copyright, not open source, but even so ....
Best NFS head ever is a Linux server (either physical or virtual) where the storage is presented from a LUN on a SAN.
Not because a Linux box is not going to give you a headache every now and then (Linux diehard here); it is because the Linux box gives you flexibility and troubleshooting ability beyond any proprietary solution.
For the file-system you can mix and match.
As with many things, the first level is easy but then things get much harder. Can I build a simple database? Sure I can. Can I build a fully SQL-compliant database with a sophisticated query planner and good benchmark numbers? Not without some help. Can I build an interpreter for a simple language? No problem. Can I build a 99.9% gcc-compatible compiler that spits out correct high-performing code for dozens of CPU architectures? Um, no. Similarly, building a very simple storage system is within reach for a lot of people and is a great learning exercise. Then you add replication/failover, try to make it perform decently, test against a realistic variety of hardware and failure conditions, make the whole thing maintainable by someone besides yourself . . . this is still a simple system, no laundry list of features to match (let alone differentiate from) competitors, but it's a lot harder than a "one time slowly along the happy path" hobby project.
I'm not saying that the storage vendors deserve every dollar they charge. I'm pretty involved with changing those economics, because the EMCs and the NetApps of the world have been gouging too much for too long. What I'm saying is that "build it yourself" is a bit of an illusion except at the very smallest of scales and most modest of expectations. "Build it with others" is a better answer. Everyone contributes, everyone gets to benefit. If you really want to help speed those dinosaurs toward their extinction, there are any number of open-source projects that are already engaged in doing just that and could benefit from your help.
Assuming you're using Linux at some point you will come across those words that make even the most hardened systems/storage admin tremble.
I swear to fucking god they designed that thing just to mess with us. Anyone recognise this?
You need to delete this directory because $USER doesn't work here any more.
# umount /pool/home/$USER
umount: /pool/home/$USER: device is busy.
(In some cases useful info about processes that use
the device is found by lsof(8) or fuser(1))
cannot unmount '/pool/home/$USER': umount failed
Wait... mounted? No they bloody don't. $USER has had their account deactivated. They've left the building. Their machine has been returned to the pool. No one other than the IT team could have mounted their share anyway - what gives?
Oh, wait, they turned the machine off at the wall, didn't they? Well, now we're screwed, because nfs-fucking-kernel-server is going to sit there and await an unmount from the client and the only way to stop it is to restart the daemon - kicking everyone who actually IS using it.
(Honestly tho - if anyone has any ideas how the HELL you kick a user from a Linux NFS server in order to cause the kernel to release its lock on an exported filesystem, this is a bane-of-my-life type problem)
I have tried bloody everything :/
It's a well-defined problem at least. The nfs-kernel-server process owns the lock on the file. Since nfs-kernel-server isn't kind enough to provide a human-parsable entry in /proc to let you know which instance of nfsd is actually holding a given file (and since each nfsd process doesn't map 1:1 to a particular NFS export, that wouldn't help even if it did), there's no way to know which instance you can kill to get your lock back.
The only thing to do is shut down all of nfs-kernel-server and kick the 99% of your users who aren't causing problems.
Possibly we'd be better served with a user space NFS server, but they all seem to have their own problems.
> Since nfs-kernel-server isn't kind enough to provide a human-parsable entry in /proc to let you know which instance of nfsd is actually holding a given file (and since each nfsd process doesn't map 1:1 to a particular NFS export, that wouldn't help even if it did), there's no way to know which instance you can kill to get your lock back.
This sounds like something you could fix right quick, if you had the source code.
"Possibly we'd be better served with a user space NFS server, but they all seem to have their own problems."
As one of the miscreants partially responsible for the nfs-kernel-server clusterfuck, I agree with the first and second parts of that statement - and it's not helped by the userspace server not having had any substantial work since 1996.
The original userspace nfs server was - to be blunt - a piece of utterly slow shit. That's why nfs ended up in the kernel.
The other part about it being in kernel space that you missed is that IT WILL NOT PLAY NICE with _anything_ else accessing the same disk blocks. If you NFS export a filesystem, then the _only_ access to it had better be via that NFS export or you risk trashing the data.
Putting nfs into the kernel more than 20 years ago was a solution to a problem (painfully slow exports and PCNFS being almost unusably slow) at a time when the people implementing it hadn't even thought of the possibility of something accessing XYZ file via NFS at the same time as something else doing it via SAMBA or something doing it at local level. If we had, then perhaps we'd have been more careful.
ZFS depends on how you arrange the disks and what your use case is. A single process writing tiny files will still suck, even with a good disk subsystem, good amounts of RAM and a sensible application of compression (yes/no) or de-dupe (yes/no).
Give it a different task with multiple processes and large reads/writes and it can shine, as it can then leverage all the spindles, break the writes down into segments and span them.
It's too easy to think "I'll add de-dupe, compression and an L2ARC to make it faster" when in reality you don't have the RAM to store the de-dupe table or the metadata. The result is that the RAM is not caching but holding the map for the SSD/de-dupe.
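For a feel of how quickly the de-dupe table eats RAM, here is a rough estimate. It assumes the commonly quoted rule of thumb of about 320 bytes of RAM per unique block and treats every block in the pool as unique (the worst case); the pool size and block sizes are made-up inputs, so treat the output as an order of magnitude only.

```python
def ddt_ram_gib(pool_tib: float, avg_block_kib: float = 64.0,
                bytes_per_entry: int = 320) -> float:
    """Rough RAM (GiB) to hold a dedup table if every block is unique."""
    blocks = pool_tib * 1024**4 / (avg_block_kib * 1024)
    return blocks * bytes_per_entry / 1024**3


# Assumed 10 TiB of data at three hypothetical average block sizes.
for block_kib in (128.0, 64.0, 8.0):
    ram = ddt_ram_gib(10, block_kib)
    print(f"10 TiB at {block_kib:g} KiB blocks: about {ram:.0f} GiB of RAM")
```

Smaller average blocks mean more table entries, which is why lots of small blocks combined with de-dupe is the combination that hurts most.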
About 3 years ago I built a Debian+ZFS+SCST SAN and export LUNs over Fibre Channel to my VM host and desktop, and over iSCSI to my living room PVR. All for home.
I've considered a few HA versions of it for its replacement.
I would need to set up replication of the file system below SCST and be able to "shoot the other node in the head". I could use Ceph for the replication between nodes with direct InfiniBand connections.
Then one node would be the primary and one a slave. Using NPIV on the switch to hide this from the clients.
At home I would probably not want to duplicate all my disks, so I would use a shelf with two controllers connected to both file system heads and import with the -f (force) option if a node went down.
As for backup I have another HP micro server with big disks that runs Bacula but to backup the data on my VMs not the SAN.
To do this commercially, ask yourself:
1. Am I trying to save money?
____To do this well will require good kit and more than one.
2. How long can a recovery of a file system node take? What is my downtime limit?
____Build your solution around this time limit. Zero downtime can be done, but only with sufficient replicas. Use good resilient hardware (dual PSUs, hot-swap fans). Keep spares. Have a care agreement. All of that will impact 1.
As a thought, instead of doing the replication below the SCST layer, how about exporting the raw LUNS from each of the storage servers to the VM host, then doing the mirroring there? That should maintain the full io / transfer rate of things, instead of being (potentially) slowed down by storage server side replication.
So the VM host process can write to two SANs simultaneously? That would be a cool feature and simplify things.
A quick google finds this for VMware:
"Protection against LUN failure allows applications to survive storage access faults. Mirroring can accomplish that protection. Mirroring designates a second non‐addressable LUN that captures all write operations to the primary LUN. Mirroring provides fault tolerance at the LUN level. LUN mirroring can be implemented at the server, SAN switch, or storage array level."
Every day is a school day
Is the mirror driver actually available as a thing to use for real-time SAN mirroring now? Been a while since I was a VMwarrior. This tells me that it was used internally for svMotion... http://www.yellow-bricks.com/2011/07/14/vsphere-5-0-storage-vmotion-and-the-mirror-driver/
Unsure about VMware, as I haven't personally used it in ages. Other host platforms (eg Linux + KVM) definitely do this, as Linux provides the mirroring natively. (you just need to configure it)
Haven't yet tried this with FreeBSD, but it would be kind of surprising if it didn't work.
And just to point out, if VMware itself doesn't let you mirror directly on the host, you could pass the LUNs through to the VM itself to do the mirroring there.
The thought makes me kind of nervous around failure scenarios though. Would do decent testing. ;)
I do the same you do at home, but small scale.
With a single £50 ITX motherboard, single ATX motherboard, 16GB RAM and Debian.
4 x 4TB Drives, 2 RAID10 using MDADM, LVM ext4 and xfs volumes.
I have done horrible things to the set-up in the quest for science, while the server runs every single conceivable service under the sun, including VMs.
It is not very fast, but it is not slow either; I know companies with older kit that have many more problems and way less functionality.
It is really good for the price.
* RAID or enterprise grade hard disks
* A server grade 64-bit motherboard supporting ECC RAM (e.g. Asrock, Supermicro), some cheap mini-ATX ones even have SAS on board!
* An Intel CPU supporting ECC RAM
* Lots of ECC RAM, never ever non-ECC unless you like doing ZFS read-only recovery, been there!
* At least 6-disk RAID arrays, i.e. RAIDZ2.
* Possibly some SSDs for ZFS read and/or write buffering.
The tiny OS runs off Flash Sticks (supports mirrored flash sticks), supported OpenZFS properly for ages (unlike Linux), is dead easy to set-up, has a web interface, ZFS makes lots of stuff easier, needs no messing around like Linux, and gets frequent updates.
FreeNAS 10 sounds like it will be even easier.
I'm looking at replacing my storage solution at the moment...
Currently using one 6-year-old SAN (big ticket) and a pro-grade NAS for a total of about 60TB, but would be very interested in talking with ANYONE who is using FreeNAS / TrueNAS or similar in the UK... I can't find examples of critical front-line deployment on this side of the pond outside universities.
I believe this type of FOSS-based solution has potential, but I can't procure it if I can't find support or a way to demonstrate capability...
Sigh - I can see a big name "All-flash" vendor in my near future.
"I built a block-storage array using an old PC, a couple of HBAs and Linux about five years ago; it was an interesting little project..." – for home use
I notice you didn't say that your whole company depends on this homegrown array.
You're absolutely right, with a little time and research, you can build anything. But what percentage of a company's development budget goes into building a storage array (the fun exciting part) vs. testing, documenting, fixing, supporting, upgrading and generally "doing the important stuff" ?
"Then you add replication/failover, try to make it perform decently, test against a realistic variety of hardware and failure conditions, make the whole thing maintainable by someone besides yourself "
As one who is in the storage industry but not a storage manufacturer, I'd say it is the old 80:20 rule of product development. You get 80% of the functionality in 20% of the development time, but that last 20% takes 80% of the time. And that 20% - it's all about what happens when there is an error - is the really tough stuff to do, as every server / HBA / storage device behaves differently. Standards! Don't make me laugh! Everyone has their own interpretation of a standard, and the bigger they are, the worse they are. One SE once told me after I pointed out their non-conformance, "This is HP SCSI, we don't care". The customer had the best answer I have heard in this situation: "Look! I have the money and you have the problem." In a flash the salesman butted in with "It will be fixed shortly" - this was a very big deal.
Currently looking to procure a storage solution to replace what started out as a SAN-only consolidated storage strategy and now includes V-SAN and multiple NAS units to meet throughput needs. I briefly considered a "brew-your-own" (and my browser history covers the examples mentioned) but don't have the staff time to devote to such a critical project.
As mentioned consistently in comments, I *need* support, Real, 24/7/365, drop-it-all and turn up type support for my front line SAN. I would however be quite happy for my supplier to provide white-box FreeNAS servers/shelves at a reasonable cost if they would then contract for the support (and could prove themselves capable of providing *sufficient* service). I'd even pay to carry whitebox spares given where our boxen are racked.
More than happy: I'd be *delighted* to find a way to help meet my strategic FOSS commitment (it's harder than people imagine). I really would not grudge a penny of the support costs, because my staff's time is already more than FULLY committed running the infrastructure, and THAT'S my problem - it needs to be plug and play, not an on-going build and test project!
All (UK based) pointers welcome!
I agree with some of the criticisms, but time for an anecdote...
We had the option of paying $20K+ for a fault-tolerant iSCSI array with dual controllers, plus putting our own redundant filers on top, or $60K+ for a NetApp NFS server solution. Now I'll admit the latter would have been great.
But since at our site everyone is an experienced sysadmin and/or developer, we simply bought commodity hardware, set up CentOS, DRBD, rgmanager, CLVM, XFS and Kerberized NFS, all over multiply redundant 10 Gb ethernet fabric, with proper failover, fencing etc. obviously, no single point of failure, and for a fraction of the TCO of a NetApp, we have a storage system more reliable than any cheap redundant controller array I've ever used (which often have their own cryptic bugs, and would still be more expensive than our system, including implementation time).
We do routine tests and have experienced several actual hardware and software failures in the last 5 years, and the downtime is measured in seconds. 10 GbE is enough for us at present; if it wasn't we could move up to 40 GbE or simply widen the LACP links. The latency introduced by 10 GbE is nothing compared to the disk seek latency itself, so performance is fantastic. This on a production site serving 200 desktops plus numerous virtual machines and 'BYOD' clients.
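The network-versus-seek claim is easy to sanity-check with some arithmetic. This sketch only counts the time to push the bits onto the wire, ignoring switch and protocol overhead, and the 8 ms seek figure is an assumed ballpark for a spinning disk rather than anything measured on this system.

```python
def wire_time_us(payload_bytes: int, link_gbit_s: float) -> float:
    """Microseconds to serialise a payload onto a link, ignoring protocol overhead."""
    return payload_bytes * 8 / (link_gbit_s * 1e9) * 1e6


SEEK_MS = 8.0  # assumed average seek plus rotational delay for a spinning disk

for size in (4096, 1024 * 1024):
    t = wire_time_us(size, 10)
    print(f"{size} bytes over 10 GbE: {t:.1f} us, versus about {SEEK_MS * 1000:.0f} us per seek")
```

Even a full 1 MiB transfer spends under a millisecond on a 10 Gb/s wire, so for seek-bound workloads on spinning disks the link is not where the time goes. With flash behind it the comparison would look very different.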
Moreover, we have the fantastically good feeling that *when* something does go wrong, we fully understand the system and can diagnose and correct the problem far more rapidly and efficiently than I have ever known a vendor to do. And trust me, I have *plenty* of experience with doing things that way...
So it can be done and done well. You just have to know what you're doing and have the in-house expertise to do it right.
Good anecdote. :)
As a thought, is there any chance your company would be ok with having that officially known? eg Some of the software projects you mention could make use of that as good Case Study material, for mutual promo benefit. Can point you to the right people if that's useful. :)
No objection in principle, would have to get management approval. We are actually somewhat behind the times as rgmanager has now been replaced by Pacemaker in RHEL 7, and DRBD is up to version 9 which is a complete rewrite. So I'm not sure how valuable our case study would be. Still, feel free to get in touch with me privately if you wish (I assume there is some way of doing this on The Register!?)
I also know others who have made a very successful business of selling cost-effective HA systems based on these technologies and are more up-to-date on it than I am.
I built my own storage a while ago. It was cobbled together with a couple of Supermicro JBODs, some cheapy LSI controllers and a Fibre Channel HBA. Used SCST to present the disks over Fibre Channel and integrated it into a StorNext file system. Was only meant to be a proof of concept but it ended up in production. Worked quite nicely. Might even still be alive....
Disclaimer: I work as a storage guy; used to be a pure tech, now more like an architect.
I've done business with all of them: NetApp, HP, EMC, IBM...
I've managed more than five different platforms.
I've joked about building it myself for years, and done so for myself and friends.
The monster under the bed is *when* shit hits the fan and BigBoss™ wants to know exactly when all services will be running normally again, with SLA payback hiding behind the curtains. Then you cannot hide behind the usual "Well, we have a 4-hour fix-time deal with HP".
Building a ZFS-based storage system is quite simple: decent hardware, LOTS of ECC RAM, simple HBAs and JBODs. Then you soon have a storage system capable of decent NFS and possibly FC LUNs. But hardware and service monitoring are lacking, billing systems are lacking, and even if you have gotten parts from an alright provider there is no autosupport. Yes, you can get HA with the ZFS plugin from RSF-1, but it will cost you. And like all software, it's not flawless.
Adding SMB to it will complicate things a lot; Samba is IMHO crap. And all vendors who make good converged storage systems have paid *a lot* of money to Microsoft and written a proper SMB server.
Even if you consider ZFS a good filesystem, which it is, creating a storage environment is a lot of effort, and you will be responsible for services as well as outages. I've had battles with NetApp over quality of releases; I got hit quite hard a few years back with the toasters not freeing up dedup tables, scrubbing spiraling down the drain, panics and so on. Still, it's the best to manage. I've had data loss with 3par: mirrors that went bad, tunesys gone haywire and more.
Ah, did I forget that ZFS has no block pointer rewrite? That means no releveling vdevs, so size it right from the beginning. And you get a *huge* overhead if you're thinking stretched cluster (four-way mirror).
Do you really want that extra heat?
Biting the hand that feeds IT © 1998–2021