Re: But not for long.....
Why would they bother changing the timing on lights when they can just crash your car directly?
68 posts • joined 9 Oct 2009
For all of its good points, Ceph is very demanding wrt the hardware it's running on. In particular it's an atrocious CPU hog. Running it on systems with underpowered CPUs is not something I would want to be involved in. TBH this seems more like a publicity stunt than a serious offering.
Disclosure: I'm still technically a maintainer for Gluster, though I no longer work on it day to day.
Dr. Peuto's name seems to be spelled correctly in the story, but not in the headline.
My first processor was the 6502, but I always kind of wished I'd have a chance to try both the Z80 and 6809. Maybe I'll find a simulator and putter around a bit. Low-level programming on the old 8-bit processors with their paucity of registers and addressing modes etc. always felt more like puzzle solving than most of what I've done since. Got a second taste of it with the 88K, which was an early exposed-pipeline RISC, and it's something I still remember fondly.
I worked on such a thing in 1996 or so. The company was Mango, the software was Medley. It worked pretty much like you describe, and it was a *real filesystem* unlike blb. About the only thing blb has that it didn't was erasure coding, because that was hardly a thing back then (I personally didn't hear about it for another four years or so). Wouldn't have been hard to graft on, though.
Every year or two there's a new blob store based on the same old ideas, claiming unprecedented levels of scalability, reliability, and convenience. These claims *always* turn out to be exaggerated. Some of the offerings are fine, just not as fantastic as they claim to be, because it turns out this stuff is actually hard. (This is what I do for a living, BTW.) If it hasn't actually been run on a multi-thousand-node cluster, it *won't* run at that scale and probably won't run for very long even at hundreds before it loses data. FWIW, I wrote a bit more yesterday about the *absolutely predictable* issues/omissions I found in five minutes of looking at the blb source.
There's nothing actually wrong with blb. It looks like a decent starting point if you want to build a truly high-scale object store for fun or for internal use. It's just not as All That as they claim.
Startups, tech or otherwise, do not exist to create jobs. They exist to make money. They are *funded*, sometimes massively, to make money. If they can do so with greater automation, they will. Even for those startups that create physical things, it's more cost-effective to use pre-manufactured parts (including microcontrollers) and automated assembly than to hand-craft everything from small pieces. It's more cost-effective to sell on the internet than fight for space in a kazillion brick-and-mortar stores. Many logistics and even HR systems can be outsourced. So it's no surprise that even startups that intend to grow rapidly employ what once would have been regarded as a skeleton crew.
The thing is, if startups hire fewer people each, then maybe there should be more startups. Unfortunately, the funding models aren't geared for that. They want to fund a few unicorns not many minnows. Tax laws, accounting requirements, insurance requirements, and regulatory systems are increasingly hostile to smaller businesses, as they each tend to exact a fixed cost per company that's in the noise for a larger company but significant for a smaller one. (BTW I'm way over on the progressive end of the spectrum and support most of these things in principle, but even I recognize that their implementation has had ill effects.) If we want to undo the over-centralization (and over-financialization) of our economy and go back to a time when regular people were creating jobs for other regular people, we need to stop putting startups in the shadiest part of the garden and letting the giants stomp all over the fresh shoots.
The silver lining is that those who were laid off might be better able to apply their talents elsewhere, and might ultimately be paid more for it. There was a time when Veritas did innovative things. That time is long gone. For the last $forever they've seemed intent on *strangling* innovation. When they've wrung the last dollar out of products that were innovative twenty years ago, they'll just fold the whole thing up and they won't be missed.
So Facebook takes this suggestion and hires a bunch of editors, who at some point inevitably turn out to all be Trumpians who brand anything they don't like as fake news. Or Bernians who do likewise from a different direction. Then the same people continue to lambaste them for doing The Wrong Thing because it was *never* about doing what's right or protecting democracy or anything like that. A lot of it is competitors in the information business doing what competitors do, along with a big dose of Tall Poppy Syndrome.
Dealing with fake news doesn't mean giving any one group editorial control - not Facebook itself, not El Reg, sure as hell not any government. It means allowing multiple rating or collaborative-filtering services to flourish, and giving users a *choice* of which ones to trust, much as we do for spam filters and ad blockers. It's a market, not a planned information economy. Facebook's role is to help users find anti-fake-news filters *they* trust, and to honor the results as they're displaying an individual user's feed.
There are basically two reasons. One is that competition is a strong motivator. For a lot of people, including me, leaderboards are a reason to go out more, or to push faster/further than we might have otherwise. Another is helping to cheer each other on. I have three friends on MapMyRun, and I know that the encouragement I get from them is helpful when I'm not doing so well. I certainly hope it works the other way too.
That said, there are good and bad ways to share this data. For example, on MMR those three friends are the only ones who get to see exactly where I've gone, or whether I've gone at all on runs that don't earn me a place on a leaderboard. All anyone else sees is first name, last initial, time on that segment, and date. I *could* open up full sharing, but it's not a default. No heatmaps or anything like that, though I've kind of wished for that as a way to help people find routes worth trying. Overall, I'm pretty comfortable with MMR's approach. If I used Strava, I think I'd be a bit less comfortable.
As a Gluster developer, I don't mind the omission this time (though other times I've found it unjustifiable and irksome). This article seems to be about new or at least significantly changed companies. Neither Gluster nor Ceph fits that template. There's no new funding, new products, new partnerships, or C-suite drama to generate headlines. We're kind of boring TBH.
Mage, you're either woefully misinformed yourself, or deliberately misinforming others. Google has brought in or copied everything? Besides the obvious exceptions of the PageRank concept and map/reduce, here are plenty more exceptions.
Facebook hasn't innovated technically? More exceptions.
Never heard of Spanner or Borg/Kubernetes or Cassandra or HHVM or Open Compute? That's just ignorance. Look, I'm sensitive to the issues of privacy and market dominance and so on, but the specific claim here is that these two companies are bad for *innovation* and that's clearly false. Name a company you like better. Let's see if they contribute as much to innovation. Highly unlikely.
There are plenty of reasons to criticize Google and Facebook, but lack of innovation is not one of them. A large part of the reason is this thing we call open source, to which both contribute a great deal. The author would do well to read about open source, specifically how it prevents the kind of enclosure and stagnation he's worried about. The fact that one site can copy another's superficial features is a *good* thing, because the alternative is exactly the kind of intellectual-property regime that leads to the worst kinds of monopoly. Would the world be better if Amazon (which should be the true target of his screed) had prevailed on the one-click patent, or Apple on all the "look and feel" stuff? Hardly.
It's refreshing to see a little skepticism about such vague claims by storage upstarts. Without an actual apples-to-apples comparison - not their NVMe gear vs. someone else's spinning disks as usually happens - it's impossible to say whether they have anything or not. Maybe they really have come up with some great innovation. If so, I'd wonder why they're not bragging about the patents they've already filed. Absent that, this looks like a pitch for investors/acquirers rather than customers. If they're still standing and still independent a year from now, maybe we can have a discussion about the technology then.
Isn't this true of other groups as well? We all fight over the issues with which we're most intimately and constantly familiar. Emacs vs. vi is our version of angels dancing on the head of a pin. Another metaphor is the infamous bikeshed. Nobody wants to argue about the design of a nuclear power plant, because that's complicated and hard and requires a lot of knowledge, but everyone has an opinion on what color the bikeshed at that plant should be. Perhaps the general tendencies of programmers - highly focused, introverted, a bit brittle - make this somewhat worse than elsewhere, but mostly it just seems like human nature.
Coho ran into an all-too-common problem for storage startups: storage customers are hard to evangelize. They have pressing, immediate problems. They want solutions to those problems, and ideally solutions that fit into their current paradigm. Getting them to look forward to the *next* set of problems is really really hard. Coho had looked ahead, seen a problem on the horizon, developed some interesting technology to address it ... and then found themselves too far ahead of the customers to get any revenue out of all that. As with other technologies (*ahem* CDP), it won't be the originators who benefit but some late-arriving copycats who hit the right market window. With luck, some of the people who actually had the vision and made the efforts will get to ride the gravy train a few years from now, but in my experience that happens all too rarely.
I've seen way too many cases where the preinstalled firewall crap at a cloud provider interfered with the operation of the distributed systems I was installing. Often the tools and documentation available to resolve the issue were miserable too. I did not appreciate it. I'm perfectly capable of locking down my own system, without making it unusable, all by myself. IMO it's perfectly reasonable for a provider to avoid the complexity and cost and aggravation associated with trying to do what any competent Linux administrator can and should do themselves.
At a previous job we had a similar - but not identical - problem with a machine in Boulder. In our case there was one more step. Because of the thinner air, we got less cooling. The warmer temps made the PSU less efficient, causing brownouts which manifested as transient errors on our internal communications links. The fix turned out to be a slight adjustment to the ratio between temperature and fan speed. I was the guy on-site, but kudos to the hardware folks back on the east coast for figuring it out.
So all of those accusations against Hillary, or the claim that there were millions of illegal immigrants voting, should also be ignored until proven, right? Ditto with your accusation of lying. But you're missing one important thing: some information is dangerous to disclose. The evidence has been given to those whose need to know exceeded the risk of that disclosure, which does not include you. It takes a tremendous ego for someone to believe they are the sole arbiter of truth, and that they personally must be convinced of a statement's truth before others are allowed to consider it. Nobody's being thrown in jail based on rumor. It's OK for people to claim and believe what a preponderance of evidence - both public and vetted but not disclosed by our elected representatives - suggests.
As far as I can tell, this is just EC2 with features removed to enable a simpler pricing model. The fact that many of these features become available again through VPC peering suggests that it's a separate (someone else's?) data center. But the price isn't really going to destroy Digital Ocean etc. Looking at the 2GB level, which is the lowest they all have in common and is what really constitutes a starter system:
* Digital Ocean - $20/month for two cores and 40GB SSD
* Linode - $20/month for one core and 24GB SSD
* Vultr - $20/month for two cores and 45GB SSD
* Lightsail - $20/month for one core and 40GB SSD
Lightsail is below median for cores, at median for storage, all for exactly the same price. Without benchmarks - especially storage benchmarks which IMX have shown a 2-3x difference between providers or even instances within one provider - it's hard to know which is really the better deal. The real take-away here seems to be that Amazon was feeling pressure at the low end.
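For anyone who wants to check the "below median / at median" arithmetic, here is a small sketch using only the four data points listed above. The `position` helper is just a name made up for this example, and the figures are copied from the list, not fetched from any provider.

```python
from statistics import median

# (cores, SSD in GB) at the $20/month, 2GB tier, copied from the list above.
plans = {
    "Digital Ocean": (2, 40),
    "Linode": (1, 24),
    "Vultr": (2, 45),
    "Lightsail": (1, 40),
}

median_cores = median(cores for cores, _ in plans.values())
median_ssd = median(ssd for _, ssd in plans.values())


def position(value, mid):
    """Describe where a value sits relative to the median."""
    if value < mid:
        return "below median"
    if value > mid:
        return "above median"
    return "at median"


cores, ssd = plans["Lightsail"]
lightsail_summary = (position(cores, median_cores), position(ssd, median_ssd))
print(lightsail_summary)
```

This only compares the headline specs. It says nothing about actual CPU or storage performance, which is the part that needs benchmarks.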
The overcommit at issue on a storage server is probably not VM overcommit (or oversubscription) but process-memory overcommit. If you allow memory overcommit what you're saying is that the system can allocate more virtual pages to processes than it can actually back up with physical memory plus swap. It's kind of like fractional-reserve banking, and we've all seen what happens when that goes too far. Everything works great until there's a "run on the bank" and every process actually tries to touch the pages allocated to it. Since it's not actually possible to satisfy all of those requests, the kernel picks a victim, kills it, and reaps its pages to pay other debts. It's just as evil as it sounds. It works to a degree and/or in some cases, but IMO it's an irresponsible default made worse by the fact that the Linux implementation has always tended to make the absolute worst choices of which process to scavenge.
In a virtual environment, things get even more interesting. You can allow memory overcommit either within VMs or on the host, or both, and that's all orthogonal to how you size your VMs. Where most people get in trouble is that they oversubscribe/overcommit at multiple levels. Each ratio might seem fine in isolation, but combined they compound into disaster. The OOM killer within a VM might take down a process, the OOM killer within the host might take down a VM, you can get page storms either within a VM or on the host, etc. It's much safer to overcommit in only one or two places, and then only modestly, but those aren't the defaults.
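To put rough numbers on how per-level ratios stack up, here is a back-of-the-envelope sketch. It assumes the worst case, where each level really does hand out its full overcommitted amount, so the ratios multiply. The two ratios are hypothetical values picked for illustration, not measurements or defaults from any hypervisor or kernel.

```python
from math import prod

# Hypothetical per-level overcommit ratios, for illustration only.
levels = {
    "process memory promised inside each VM vs. the VM's RAM": 1.5,
    "VM RAM promised on the host vs. physical RAM": 2.0,
}


def effective_overcommit(ratios):
    """Worst-case end-to-end ratio when each level overcommits independently."""
    return prod(ratios)


combined = effective_overcommit(levels.values())
print(f"virtual memory promised per byte of physical RAM: {combined:.1f}x")
```

Neither 1.5x nor 2x looks alarming by itself, but under this simple model processes have been promised three times the memory that physically exists.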
Way to play the false-dichotomy and appeal-to-authority cards, Nate. I've been a Linux user just as long as you claim, and a UNIX user for a decade before that. There are other options besides a crash or hang. I even mentioned one already: don't overcommit. If there's no swap (really paging space BTW but I don't expect you to know the difference since you don't even seem to realize that allowing overcommit increases the page/swap pressure you so abhor) then memory allocation fails. The "victim" is statistically likely to be the same process that's hogging memory, to a far greater degree of accuracy than any OOM-killer heuristic Linux has ever implemented. If you want to avoid paging, limit your applications' memory usage and don't run them where the sum exceeds memory by more than a tiny amount (to absorb some of the random fluctuations, not steady-state usage). If you fail to follow that rule, adding overcommit will just push the problem around but not solve it.
There are cases where overcommit makes sense. At my last job we had users who'd run various scientific applications that would allocate huge sparse arrays. Since these arrays were guaranteed to be very thinly populated, overcommit was safe and useful. However, for general-purpose workloads overcommit makes a lot less sense. For the semi-embedded use case of a storage server, which is most relevant to this discussion, it makes absolutely no sense at all. Unconstrained memory use is the bane of predictable performance. Turning performance jitter into something that's easier to recognize and address is actually pretty desirable in that environment, and that's what disabling overcommit will do.
I feel bad for everyone involved. For the customer, the reasons are obvious. For Maxta, this is all too reminiscent of experiences I had working at small companies, and especially in storage. One of the main culprits seems to have been bad controller firmware. Even companies that control the hardware sometimes have trouble with that one. When you ship software to run on hardware the customer controls, the situation becomes impossible. The second issue sounds like the good old Linux "OOM Killer" which was an incredibly stupid idea from the day it was conceived. At both of my last two startups, we ended up having to disable memory overcommit because of the havoc that would result when the OOM Killer started running around like a deranged madman shooting random processes in the head. To be sure, Maxta probably could have done a better job controlling/minimizing resource use, but I know that's a difficult beast to fight so I'll cut them some slack. Put both of these problems in a context of confused business relationships and expectations, and it's no surprise that a disaster ensued. The lesson I take away from this is that vendors need to keep the list of Things To Avoid complete and up to date, while customers need to be clear and open about what they're doing to make sure they don't fall afoul of that list. Amateurs and secret-keepers have no place in production storage deployments.
I think the proper analogy is XP, not NT. NT was a new architecture, separate from the legacy 3.x/95/98 codebase. XP was the reunification of these divergent streams. Similarly, Android represented a bit of an architectural departure with its unique JVM-based userspace. Andromeda will represent the reunification of that with the more traditional architecture of ChromeOS (so traditional that I'm running full Ubuntu in another window on this Chromebook right now).
Still trying to get a handle on what Andromeda will mean for us Chromebook users, BTW. *That* would be an interesting story to delve into.
Containers are pretty useful, but the idea that they should all be stateless has always been STUPID. Any non-trivial application has state that has to be stored somewhere. Making it "somebody else's problem" only creates a new problem of how to coordinate between the containers and whatever kind of persistent storage you're using. If one provisioning system with one view is responsible for both, subject to the constraint that the actual disks etc. physically exist in one place, then it actually does simplify quite a bit of code.
It's misleading to say Red Hat Gluster Storage will be available this summer, or to imply that it's just now competing with Portworx et al. RHGS has been available for years, since before some of those others issued their first press release - let alone wrote their first line of code. It's just the new version that's coming.
Why do you insist on comparing vaping only to cigarette smoking? That's pure cherry-picking. Nobody has disputed that the all-vaping world is better than the all-cigarette world, but neither is the world we actually live in. Vaping needs to be considered *on its own merits* and not just in comparison to something we all know is bad. Doing X and vaping carries some risks that doing X alone does not, for all X. Those risks, which are and are likely to remain better known/understood or controlled by vendors than by consumers, are a legitimate subject of legal/regulatory interest. If you think these particular regulations are too draconian, the constructive response would be to suggest alternatives. Trying to dismiss all possible regulation makes you seem like an ideologue, and trying to suggest that vaping is a net public-health positive makes you look delusional.
I'm not going to disagree with you, there. Centralized trust doesn't work any better than centralized anything else. The only thing I'll say is that the browser makers have made the whole thing even less secure than the design allows by shipping certs for all these shady companies - many of which are clearly just arms of equally shady governments in various forsaken parts of the world. A chain of trust can still be strong if the links are all strong. It's a problem that this becomes hard to guarantee as the chains get longer, but it's also a problem that the browser vendors *knowingly* include weak links in the bags they provide.
Thanks for clarifying that.
The one nugget of truth in the article is that the list of CAs built in to browsers etc. is ridiculous. I had occasion to look recently. I'll bet at least half of those organizations are corrupt or compromised enough that I wouldn't even trust them to hold my hat - let alone information I actually value. Anybody who wants a signing cert for MITM can surely get one. That really does cast doubt on whether HTTPS is really doing us all that much good, but it's important to understand exactly where the weak link in that chain is.
As with many things, the first level is easy but then things get much harder. Can I build a simple database? Sure I can. Can I build a fully SQL-compliant database with a sophisticated query planner and good benchmark numbers? Not without some help. Can I build an interpreter for a simple language? No problem. Can I build a 99.9% gcc-compatible compiler that spits out correct high-performing code for dozens of CPU architectures? Um, no. Similarly, building a very simple storage system is within reach for a lot of people and is a great learning exercise. Then you add replication/failover, try to make it perform decently, test against a realistic variety of hardware and failure conditions, make the whole thing maintainable by someone besides yourself . . . this is still a simple system, no laundry list of features to match (let alone differentiate from) competitors, but it's a lot harder than a "one time slowly along the happy path" hobby project.
I'm not saying that the storage vendors deserve every dollar they charge. I'm pretty involved with changing those economics, because the EMCs and the NetApps of the world have been gouging too much for too long. What I'm saying is that "build it yourself" is a bit of an illusion except at the very smallest of scales and most modest of expectations. "Build it with others" is a better answer. Everyone contributes, everyone gets to benefit. If you really want to help speed those dinosaurs toward their extinction, there are any number of open-source projects that are already engaged in doing just that and could benefit from your help.
Enrico, the problem with the idea of high-performance object storage is that the S3-style APIs are not well suited to it. Whole-object GET and PUT are insufficient. Most have added reading from the middle of an object; writing likewise has been claimed/promised for a long time, but is still not something developers can count on being able to do. The stateless HTTP protocol is also inherently less efficient than what you get with file descriptors and a better pipelining model. Frankly, a lot of the object-store implementations aren't up for a performance game either. The most charitable way to put it is that the developers were prioritizing other features such as storage efficiency. I'll be a bit less charitable and say the whole reason most of them got into object stores was because they're easy, so they wrote their code with inefficient algorithms and languages/frameworks. That lets them get to market earlier, but the downside is darn-near-unfixable performance issues. The main exception is Ceph's RADOS, which has an API more like NASD/T10 than S3 and which was designed from day one to support upper-layer protocols that demand higher performance.
Throwing flash at an object store won't let it catch up with block or file storage that's also flash based. It might be higher performance than it is now, but it will still be slower than contemporaries. It's going to be really hard for anyone in that mire to get beyond the tertiary role.
"If I have more resources than required"
Might as well stop there. That never happens for long. Where there's capability to spare, new workloads will be added until that's no longer the case. It happens with CPU, it happens with memory, and it happens with storage. Always has and always will. The real question is how to maximize the value of the IOPS you're providing when you're providing as many as you can. That means letting higher-value IOPS (e.g. for higher-value apps or tenants) take priority over lower-value IOPS, and that's QoS.
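A minimal sketch of that idea: a fixed IOPS budget per tick, with queued requests served highest-value first. The class name, the numeric "value" per tenant, and the tenant names are all made up for the example. Real QoS implementations usually add minimum shares or rate limits as well, since strict priority like this can starve the low-value tenants.

```python
import heapq
from itertools import count


class IopsScheduler:
    """Serve queued I/O requests highest-value first, up to a fixed budget per tick."""

    def __init__(self, budget_per_tick):
        self.budget = budget_per_tick
        self._queue = []
        self._seq = count()  # tie-breaker: FIFO order within the same value

    def submit(self, tenant, value):
        # heapq is a min-heap, so negate the value to pop the highest first.
        heapq.heappush(self._queue, (-value, next(self._seq), tenant))

    def run_tick(self):
        """Return the tenants whose requests get served in this tick."""
        served = []
        while self._queue and len(served) < self.budget:
            _, _, tenant = heapq.heappop(self._queue)
            served.append(tenant)
        return served


sched = IopsScheduler(budget_per_tick=3)
for tenant, value in [("batch", 1), ("billing", 10), ("batch", 1),
                      ("web", 5), ("billing", 10)]:
    sched.submit(tenant, value)

first_tick = sched.run_tick()
second_tick = sched.run_tick()
print(first_tick, second_tick)
```

With more demand than budget, the billing and web requests go out in the first tick and the batch requests wait for the second.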
Besides the fact that what they're doing is no different than what GlusterFS (which I work on) and Ceph have done for years, they start off with two lies.
(1) Their FAQ claims that GlusterFS uses a centralized server, which is not true.
(2) They claim to be open-source, but when you follow the link a big fat "coming soon" is all you'll get.
Outfits like this come along every damn month. And they disappear every month too, when they find out that gaining and retaining users is harder than getting a few mentions in the trade press. There's no reason so far to suspect this one will rise above that vile crowd.
The application containers themselves might be stateless, but they almost always need access to shared persistent data somewhere - web pages, customer records, calculation inputs and outputs. That can be a whole separate island of specialized hardware or bare-metal servers, but why not use the storage already within the container infrastructure? That gives your storage servers the same benefits as your application containers, and allows seamless sharing/balancing of resources between them.
BTW, Gluster (on which I work) has been able to do this since approximately forever, and we have many enterprise customers using this approach. Some of them have even presented publicly about their experience. Nice to see Portworx following our lead.
Especially for startups. It's one of the first places that enterprises look to cut costs, and one of the last places they're willing to experiment. And it has become a crowded space. The folks at Coho are great, but I could say the same about a dozen other startups of the same vintage. They can't all succeed. In a way, this is a side effect of lowering the barrier to entry. Now that scale-out software on top of commodity hardware (even if it has a fancy faceplate) is more competitive with specialized hardware, it seems like everybody and their brother has a storage startup with a new take on where the "real" storage problems are and how to solve them. Some of those ideas are truly new, and truly great. Some aren't. The problem is that it's hard to tell which is which, so when the lifeboat's too crowded and companies start getting thrown overboard it's not always the ones who should have been. Sadly, technical merit and business value don't usually count as much as cozy relationships with investors, analysts, journalists, and (just once in a while) "whale" customers.
Why do these articles only ever seem to compare against *proprietary* solutions? Another basis of comparison for semi-open-source RozoFS would be truly-open-source Gluster (on which I work) or truly-open-source Ceph, both of which already have erasure coding too. Based on experience with that, I'd say *it doesn't matter* which erasure-coding algorithm involves more addition or multiplication because those calculations are only a minor factor in overall performance. The amount of data that must be transferred, either during normal I/O or during repair, matters far more. The coordination overhead matters even more than that. If you have two clients trying to write overlapping blocks, and they don't coordinate properly, then half of the servers get erasure-coded pieces of one write and half get erasure-coded pieces of the other. This isn't even "last writer wins"; anyone who tries to read that data subsequently gets *garbage* back. The #1 determinant of performance in such systems is how they avoid this issue for every kind of operation (including both data and metadata with all of the atomicity/durability guarantees that must be met to keep users from screaming).
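Here is a toy illustration of that overlapping-write problem. It uses a 2+1 XOR parity code as a stand-in for a real erasure code. This is an assumption made for brevity, and it is not how RozoFS, Gluster, or Ceph actually encode data. The "servers" are just a Python list, and the two payloads are arbitrary.

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(block):
    """Split a block into two data fragments plus one XOR parity fragment."""
    half = len(block) // 2
    d0, d1 = block[:half], block[half:]
    return [d0, d1, xor(d0, d1)]


def decode(d0=None, d1=None, parity=None):
    """Rebuild the block from any two of the three fragments."""
    if d0 is None:
        d0 = xor(d1, parity)
    elif d1 is None:
        d1 = xor(d0, parity)
    return d0 + d1


write_a = b"ALPHA-01"
write_b = b"bravo=99"
frags_a = encode(write_a)
frags_b = encode(write_b)

# Two uncoordinated writers hit the same block. Servers 0 and 2 end up with
# fragments from writer A, server 1 ends up with a fragment from writer B.
servers = [frags_a[0], frags_b[1], frags_a[2]]

# Normal read from the two data fragments: half of A glued to half of B.
torn = decode(d0=servers[0], d1=servers[1])

# Degraded read with server 0 down: B's data fragment combined with A's parity.
garbage = decode(d1=servers[1], parity=servers[2])

print(torn, garbage)
```

Neither read returns what either client wrote, and the two reads don't even agree with each other. Avoiding that state requires locking or ordering across the servers for every write, which is where the coordination cost comes from.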
If the Rozo folks want to brag about their erasure-coding efficiency, let's see some actual performance data. While we're at it, let's talk about the scale at which things have really been tested. Anybody can claim hundreds of nodes and multiple exabytes but AFAIK no project in this space has ever successfully run at that scale on the first try. They *always* run into new failure modes and performance anomalies that never appeared at smaller scale and that often require substantial new subsystems to address. Then they find out that customers at this scale are going to want tons of other features as well. Some of these are still only on Rozo's roadmap, after having been shipped years ago by competitors. Others, especially related to multi-tenancy, are still missing entirely.
I think what Rozo is doing is very cool, and I wish them all the success in the world, but let's not lose sight of the fact that there's a *long* row to hoe before even the best ideas turn into a competitive storage solution. They sound a lot like the Ceph folks did *five years ago*, but Ceph (with far more resources at hand) is just now making the transition from bleeding-edge to enterprise-ready. It's not because they lack talent, I can assure you of that. It's just that these problems are *hard*, and solving them takes a lot longer than Evenou and Courtoy seem to think. I'd love to hear from the RozoFS developers about when *they* think RozoFS will be competitive with what's already out there.
Let's not overgeneralize. *This time* he didn't name names. On the other hand, he still did make some pretty strong inferences about "whoever" wrote the code, and "whoever" isn't hard to discover. That's well beyond just criticizing the code.
On the other other hand, I've been on too many projects that *didn't* lay down the law this firmly. Developers are a sneaky lot, and they tend to have their own agendas. They'll keep sneaking in code that they know is crap, if it lets them mark more of their personal tasks complete. If nobody is watching, or nobody responds strongly enough to put the fear of God into them, the result is a codebase that slowly rots into irrelevance. I do think Linus and (even more so) certain other Linux kernel developers behave in some pretty toxic ways sometimes, but as we try to improve that situation we still need to remember that bringing the hammer down once in a while is strictly necessary to maintain any kind of quality. It's all in how it's done, not whether it should be done at all.
This particular case involves siblings (as of last week), but I suspect we'll be seeing a lot more of this kind of thing even among non-siblings - yes, even among current rivals - in the next few years. Among other things, it means folks like Isilon will be forced to compete on the basis of software quality instead of relying on custom-tuned hardware to give them an edge in performance comparisons. Bring it.
(Disclaimer: I'm a Gluster developer)
"Amazon S3 is designed for 99.999999999% durability" (i.e. every put has 11 9s durability)
That's really about availability. It says nothing at all about when data is guaranteed to hit stable storage. You do know what "durability" means in data storage, don't you?
"few year old Beta level Ceph benchmarks are not a good measure,"
Ah yes, that's no true Scotsman all right. You asked for citations, I provided them, now you demand different ones. At least those actually compared Ceph to Gluster, on the same hardware. The document you cite only compares Ceph to itself. Why would you assume Gluster has been standing still, and wouldn't also perform better? That's convenient, I suppose, but hardly realistic. Making comparisons across disparate versions and disparate hardware tells us absolutely nothing.
"as the Gluster architect you are not clean from bias"
And I disclosed that association right at the beginning, because I believe in being honest with people. You're still moving the goalposts, citing "evidence" that's unrelated to the actual topic at hand, ducking the issue of how NFS overhead *plus* impedance-mismatch overhead can be less than NFS overhead alone. You haven't even begun to address the problems inherent in trying to provide true file system semantics on top of a system that has only GET and PUT, different metadata and permissions models, etc. This isn't personal, but misleading claims often lead to wasting a lot of people's time if they're not challenged. If you think object-store based file systems are such a great idea then you need to grapple with the issues and provide some facts instead of just slinging mud.
"Object even S3 provides Atomicity & Durability as base attributes"
Simply untrue. You were talking about making the file store sync *on every write*. Object stores provide no guarantees on every write, because they don't even have a concept of every write. That's the flip side of any API based on PUT instead of OPEN+WRITE. At the very worst, an apples to apples comparison would require only an fsync *per file*, and even that would be requiring more of the file store than the object store. Can you actually cite the API description or SLA for any S3-like object store that makes *any claims at all* about immediate durability at the end of a PUT? Amazon's certainly doesn't, and that's the API that most others in this category implement.
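To make the fsync-per-write versus fsync-per-file distinction concrete, here is a minimal sketch of the two ways a benchmark could drive the file side. The function names are made up for illustration and nothing here comes from any real benchmark harness; it only assumes a POSIX-like local file system.

```python
import os

def _write_all(fd, data):
    # os.write may write fewer bytes than requested, so loop until done.
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]

def write_sync_every_write(path, chunks):
    """Hypothetical 'sync on every IO' mode: one fsync per small write."""
    syncs = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for chunk in chunks:
            _write_all(fd, chunk)
            os.fsync(fd)  # force this write to stable storage before the next
            syncs += 1
    finally:
        os.close(fd)
    return syncs

def write_sync_per_file(path, chunks):
    """Hypothetical per-file mode: buffer all writes, then one fsync at the end."""
    syncs = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        for chunk in chunks:
            _write_all(fd, chunk)
        os.fsync(fd)  # single durability point, closest analogue to one PUT
        syncs += 1
    finally:
        os.close(fd)
    return syncs
```

Both functions leave identical bytes on disk. The only difference is how many times the writer stops to wait for stable storage, which is exactly the knob that skews a comparison if only one side is made to turn it.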
"Would be happy if you can point me to a benchmark to back your thesis which can shows Gluster significantly knocks out Ceph"
"not fare to pick on a cloud archiving product like S3 to make perf claims."
Except that such "archiving products" are the subject of the article we're discussing. What's unfair is comparing a file system to an object store alone, on a clearly object-favoring workload, when the subject is file systems *layered on top of* object stores. All of those protocol-level pathologies you mention for NFS will still exist for an NFS server layered on top of an object store, *plus* all of the inefficiencies resulting from the impedance mismatch between the two APIs. If the client does an OPEN + STAT + many small WRITEs, the server has to do an OPEN + STAT + many small WRITEs. The question is not how a file system implemented on top of an object store performs when it has freedom to collapse those, because it doesn't. The question is how it performs when it executes each of those individual operations according to applicable standards and user expectations, which set definite requirements for things like durability.
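As a rough illustration of that impedance mismatch, here is a toy model, not any real product's design. `ToyObjectStore` and `file_write` are hypothetical names, and the store is assumed to support only whole-object GET and PUT. A real gateway might chunk objects or cache, so treat this as the naive worst case rather than a measurement.

```python
class ToyObjectStore:
    """Hypothetical in-memory store exposing only whole-object GET and PUT."""

    def __init__(self):
        self.objects = {}
        self.requests = 0      # number of GET/PUT calls issued
        self.bytes_moved = 0   # bytes transferred in either direction

    def get(self, key):
        data = self.objects.get(key, b"")
        self.requests += 1
        self.bytes_moved += len(data)
        return data

    def put(self, key, data):
        self.requests += 1
        self.bytes_moved += len(data)
        self.objects[key] = bytes(data)


def file_write(store, key, offset, payload):
    """Apply one small file-style WRITE without collapsing it with others.

    With no partial-update call available, the layer has to GET the whole
    object, patch it in memory, and PUT the whole object back.
    """
    current = bytearray(store.get(key))
    end = offset + len(payload)
    if len(current) < end:
        current.extend(b"\0" * (end - len(current)))  # zero-fill any gap
    current[offset:end] = payload
    store.put(key, current)


# Ten sequential 4-byte WRITEs, each executed individually.
store = ToyObjectStore()
for i in range(10):
    file_write(store, "file.dat", i * 4, b"abcd")
```

In this toy model, 40 bytes of payload cost 20 requests and 400 bytes moved, and the amplification grows with file size. That is the overhead that disappears from a benchmark if the layered file system is allowed to collapse the writes into a single PUT.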
The only "religion" here is faith in the assumptions that support your startup's business model. It's not my fault if those assumptions run contrary to fact. I'm just pointing out that they do.
"if you disable the client cache or sync() on every IO to be on par with object atomicity/durability (required for micro-services)"
S3-style object stores make *no* guarantee about consistency or durability. There's a word for the kind of tuning you speak of, hamstringing one side to meet a requirement for which the other is held exempt. It's called cheating. It's a way of *massively* skewing the results to favor one side, and it's why methodological disclosure is so important. Please compare apples to apples, then get back to us.
Interesting piece, Chris. If you don't mind, I'll try to add on a bit based on my perspective as a developer in this area.
Traditional big-box on-premise storage vendors also face another pair of closely related threats: open source and roll-your-own. The relationship between something like Isilon and something like Gluster (which I work on) is obvious, so I won't dwell on it. The relationship between something like Isilon and something like AWS is also obvious: more AWS usage means fewer Isilon sales. The relationship between Gluster and AWS, or any of several similar things on either side, is more nuanced. Sometimes people abandon their own open-source scale-out storage in favor of AWS services. Sometimes they deploy that same software within EC2. It's both a threat *and* an opportunity.
That brings us to roll-your-own. If you were to look under the covers at Amazon's storage offerings, I'm sure they'd look an awful lot like what's out there in open source. Ditto for Google. Ditto for Facebook. And Twitter, and LinkedIn, and so on. The fact is that the techniques for doing a lot of this are now pretty well known. Many of those techniques were developed or refined at the aforementioned companies, each of which has rolled their own not once but several times to address various needs and tradeoffs. I've seen a public presentation from GoDaddy - not generally regarded as a company in the vanguard of storage research - about their own home-grown object store. I know of many more that I can't talk about. Perhaps the biggest threat to both traditional storage vendors and someone like me (or my employer) is not some one new product or project but the general idea that scale-out storage software can be assembled rather than developed. That doesn't mean there'll be no place for people who know this stuff and can assemble those parts into a smoothly functioning stack, but we'll be providing less of a product and more of a service. As in so many other areas, increasing levels of automation might put customization in the hands of more than the elite.
"Good morning, madam. What kind of storage system would you like me to build for you today?"