Want
Something every small business has needed for some time: quality, reliable shared network drives... in the cloud.
El Reg reader Greg Koss reckons the Hadoop filesystem can be used to build an iSCSI SAN. Er, say what? Here's his thinking: we could use Hadoop to provide generic storage to commodity systems for general purpose computing. Why would this be a good idea? Greg notes these points: Commodity storage pricing is roughly $60/TB in …
HDFS lacks a couple of features which people expect from a "full" POSIX filesystem:
1. The ability to overwrite blocks within an existing file.
2. The ability to seek() past the end of a file and add data there.
That is: you can only write to the end of a file, be it new or existing.
Ignoring details like low-level OS integration (there's always the NFS gateway for that): without the ability to write anywhere in a file, things are going to break.
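For anyone who hasn't used the Java client, here's a minimal sketch of what that restriction looks like against the stock FileSystem API (the path is made up, and this assumes appends are enabled on the cluster): you can create a file or append to it, and that's the whole menu.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendOnlyDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/demo/data.log");   // hypothetical path

            // Writes always land at the end: create a new file...
            try (FSDataOutputStream out = fs.create(p, true)) {
                out.writeBytes("first record\n");
            }
            // ...or reopen it and keep appending. There is no call to seek an
            // output stream back into the middle of the file and overwrite,
            // and no way to seek past the current end and leave a hole.
            try (FSDataOutputStream out = fs.append(p)) {
                out.writeBytes("second record\n");
            }
        }
    }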
There's also a side issue: HDFS hates small files. They're expensive in terms of metadata stored in the namenode, and you don't get much in return. That's why filesystem quotas include the number of entries (I believe; not my code).
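As a rough illustration of the entries point, the same client API will report how many namespace entries a directory consumes against any quota set on it (the directory name here is hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.ContentSummary;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class QuotaCheck {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Every file and directory is an entry in the namenode's memory,
            // which is why lots of tiny files hurt.
            ContentSummary cs = fs.getContentSummary(new Path("/backups/laptop"));
            long entries = cs.getFileCount() + cs.getDirectoryCount();
            System.out.println("namespace entries: " + entries
                    + ", namespace quota: " + cs.getQuota());   // -1 if no quota set
        }
    }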
What then? Well, I'd be interested in seeing Greg's prototype. Being OSS, Hadoop is open to extension, and there's no fundamental reason why you couldn't implement sparse files (feature #2) and writing within existing blocks via some log-structured, writes-are-always-appends scheme: it just needs someone to do it. Be aware that the HDFS project is one of the most cautious groups when it comes to changes; protection against data loss comes way, way ahead of adding cutting-edge features.
Without that, you can certainly write code to snapshot anything to HDFS; in a large cluster they won't even notice you backing up a laptop regularly.
One other thing Hadoop can do is support an object store API (Ozone is the project there), so anything which uses object store APIs can work with data stored in HDFS datanodes, bypassing the namenode (not for failure resilience, but to keep the cost of billions of blob entries down). Anything written for a compatible object storage API (I don't know what is proposed there) could back up to HDFS, without carrying any expectation that this is a POSIX FS (an expectation which would be wrong anyway: I regularly have to explain to people why Amazon S3 isn't a replacement for HDFS).
To close then: if this is a proof of concept, stick the code on github and everyone can take a look.
stevel (hadoop committer)
Dunno *why* it waited for me to get you an upvote, Steve.
I've *read* part of the code, since there is an app team in my shop that thinks HDFS is a good place for SAS temp files and report data.
Update and edit are not options on HDFS at large scale - I noticed the hook for object stores, but even there it looks to me like it's more WORM-style than anything else.
Explaining that HDFS doesn't do these things takes some time - especially with some of the moronic marketing crap out there. (I've seen stuff that should have made the sales techs blush..... and I think I may have made one or two of them actually blush.)
Disclaimer: I know nothing about HDFS. However:
The only point I can see as an issue with the idea he suggests is #1. He is suggesting iSCSI use, in which case we are talking about a disk image. Therefore small files, seeking past the end (unless you are creating sparse images) and the other points you mention don't apply. We are talking about exporting a single large file over iSCSI to be used as a disk on another system, which will partition and format it with its own filesystem.
Since iSCSI is block-based, we're talking about storing a ton of (say) 4K blocks in Hadoop, and the iSCSI service asks for or updates block #12309854? It doesn't sound terrible. I could see some concurrency issues with multiple iSCSI services, but I'm not sure they're worse than iSCSI to traditional block storage, as they'd still need a cluster-aware file system on the clients.
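To make that concrete, here's a hedged sketch of the read half of such a target against the HDFS client API (the image path, block size and block number are just placeholders). Positional reads at arbitrary offsets work fine; the matching positional write is exactly the piece stevel says HDFS doesn't have.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class LunReader {
        static final int BLOCK_SIZE = 4096;   // assumed logical block size

        // Read one 4K logical block out of a single large backing image file:
        // the mapping is simply offset = blockNo * BLOCK_SIZE.
        static byte[] readBlock(FileSystem fs, Path image, long blockNo) throws Exception {
            byte[] buf = new byte[BLOCK_SIZE];
            try (FSDataInputStream in = fs.open(image)) {
                in.readFully(blockNo * BLOCK_SIZE, buf);   // positional read: supported
            }
            return buf;
            // An "updateBlock" twin would need to overwrite the same 4K region
            // in place, which is the write-anywhere capability HDFS lacks.
        }

        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            readBlock(fs, new Path("/luns/vol0.img"), 12309854L);
        }
    }

In practice you'd keep the stream open rather than reopening it per block, but it shows the shape of the mapping.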
The big question is: how/why is this better than something like DRBD?
Stevel's comments might point to a real problem -- in particular, overwriting blocks within an existing file is something a lot of software expects to be able to do. And if you have disk images served via iSCSI, I would think that software would be particularly likely to want to scribble into either a disk image or a differences file.
Otherwise, I think this sounds quite reasonable -- why spend the huge bucks on specialized SAN hardware when this will do the same thing? The one reason ordinarily would be stability, but a) HDFS is proven software with known stability, and b) SAN vendors have flubbed it now and then too.
Reminds me of a suggestion my VP of technology had at a company I was at several years ago: he wanted to use HDFS as a general-purpose file storage system. "Oh, let's run our VMs off of it," etc. I just didn't know how to respond. It's like someone asking for motor oil as the dressing for their salad. I left the company not too long after, and he was more or less forced out about a year later.
There are distributed file systems which can do this kind of thing but HDFS really isn't built for that. Someone could probably make something work but I'm sure it'd be ugly, slow, and just an example of what not to even try to do.
If you want to cut corners on storage, there are smarter ways to do it and still have a fast system; one might be to just ditch shared storage altogether (I'm not about to do that, but my storage runs a $250M/year business with high-availability requirements and a small team). I believe both vSphere and Hyper-V have the ability to vMotion (or the equivalent) without shared storage, for example (maybe others do too).
Or you can go buy some low-cost SAN storage; not everyone needs VMAX- or HDS VSP-class connectivity. Whether it's a 3PAR 7200, or low-end Compellent, Nimble or whatever... lots of lower-cost options are available. I would stay away from the real budget shit, e.g. the HP P2000 (OEM'd Dot Hill, I believe) - just shit technology.
I mean, let's be honest, HDFS isn't a generalised block storage system. It's not even particularly well designed for the job it was intended to do.
If you want to cluster bits of block storage into one name space, there are many, many better ways of doing it.
For a start, HDFS only really works for large files - large files that you want to stream. Random IO is not your friend here, so that makes it useless for VMs.
If you want VM storage, and you want to host critical stuff on it you need to do two things:
* capex some decent hardware (two 4U 60-drive iSCSI targets) and let VMware sort out the DR/HA (which it can do very well): 60 grand for hardware plus licences. That'll do 300K IOPS and stream 2/3 gigabytes a second.
* capex some shit hardware, run ZFS, block-replicate, and spend loads of money on staff to support it. And the backups.
Seriously, there are some dirt-cheap systems out there that'll do this sort of thing without the testicle ache of trying to figure out why your data has been silently corrupting for the last three weeks and your backup has rotated out any good data.
So you want a custom solution:
1) GPFS, and get support from Pixit (don't use IBM, they can hardly breathe they are that stupid) <- fastest
2) try Ceph, but don't forget the backups <- does fancy FEC and object storage
3) Gluster, but that's just terrible <- supported by Red Hat, but lots of userspace bollocks
4) Lustre, however that's a glorified network RAID0, so don't use shit hardware <- really fast, not so reliable
5) ZFS with some cheap JBODs (zfs send for DR) <- default option, needs skill or support
Basically you need to find a VFX shop and ask them what they are doing. You have three schools of thought:
1) NetApp <- solid, but not very dense
2) GPFS <- needs planning, not off the shelf, great information lifecycle and global namespacing
3) Gluster <- nobody likes Gluster.
Why VFX? Because they are built to a tight budget, to be as fast as possible and as reliable as possible, because we have to support that shit, and we want to be in the pub.
I wondered recently whether HDFS would provide commodity iSCSI storage for an ETL staging layer using an extract-once, read-many approach, but ISTR the random reads and variable file sizes (and other stuff I can't recall) didn't make it very friendly. The more obvious GPFS appears to be a better choice.
Why go from file system to blocks to (on the client) file system? Seems a bit inefficient to me, and I would probably go for straight file level protocols right away and "do away" with some of the complexity.
If block-level protocols are a must, I'd say there are way better choices for the backend. Just go with the standard file system of the server, or a fancy one like ZFS that has lots of bells and whistles.
When it comes to backup & DR, I'd look into whether that could be solved at the application level, or whether the server has some tools that could be used.
To say anything useful and not overly generic like the above, it would help to have more information about the environment and the setup.
Ceph does: you specify in the CRUSH map where you want replicas placed. The CRUSH map is a tree; ultimately the leaves are individual HDDs/SSDs. Branches represent things like data centres, buildings, rooms, racks, nodes … whatever your physical topology is.
The default is to consider all nodes as branches of the root, and the disks as leaves off those branches. It'll pick a branch (node) that presently does not hold a copy of the object to be stored, then choose a leaf (disk). It repeats this for the number of replicas.
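For illustration only, here's a toy model of that selection loop (this is not Ceph's actual CRUSH algorithm, which is a deterministic pseudo-random function over the map rather than a naive random pick; the host and disk names are made up):

    import java.util.*;

    public class ReplicaPlacement {
        // Pick a host that does not yet hold a replica, then pick one of its
        // disks; repeat until we have the requested number of replicas.
        static List<String> place(Map<String, List<String>> disksByHost,
                                  int replicas, long objectId) {
            List<String> chosen = new ArrayList<>();
            Set<String> usedHosts = new HashSet<>();
            List<String> hosts = new ArrayList<>(disksByHost.keySet());
            Random rng = new Random(objectId);   // stand-in for CRUSH's hashing

            while (chosen.size() < replicas && usedHosts.size() < hosts.size()) {
                String host = hosts.get(rng.nextInt(hosts.size()));
                if (!usedHosts.add(host)) {
                    continue;                    // this host already holds a copy
                }
                List<String> disks = disksByHost.get(host);
                chosen.add(host + "/" + disks.get(rng.nextInt(disks.size())));
            }
            return chosen;
        }

        public static void main(String[] args) {
            Map<String, List<String>> cluster = new HashMap<>();
            cluster.put("node1", Arrays.asList("sda", "sdb"));
            cluster.put("node2", Arrays.asList("sda", "sdb"));
            cluster.put("node3", Arrays.asList("sda", "sdb"));
            System.out.println(place(cluster, 3, 42L));   // three replicas, three distinct nodes
        }
    }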
iSCSI is a BLOCK protocol
Look at the scale-out block folks - Ceph, SwiftStack or MapR.
HDFS is a file protocol; it was built to ingest web-crawling info and MapReduce it for web search.
For that task it works great.
HDFS is woefully short on data protection, because it was built on the assumption that if data was lost, it would be refreshed from the upstream source (like another web crawl).
HDFS zealots will say "rep count 3 cures every data protection situation", but it doesn't.
Don't get me started on Java behavior during node failure.