WekaIO pulls some Matrix kung fu on SPEC file system benchmark

CheesyTheClown

Re: Marketing Bull

Hi Liran,

Nice to see someone in your position actually commenting on the article.

I'm a long-time file system and storage protocol developer. I spent many years trying to solve storage problems at the file system level, and I've since moved further up the stack, because I believe there are rarely cases where a high-performance distributed file system is really the answer rather than a better design further up the stack.

For example, one of the SPEC SFS workloads is a software build, which is obviously quite a heavy task. I spend most of my life waiting for compiles and would always welcome better solutions. But I have already seen huge improvements by moving away from poor languages like C and C++ towards managed languages, which have endless performance and security benefits over compiled languages.

Now, compiling code has always been a heavy process. Consider that most development houses have a complete rat's nest of header file dependencies in their code. Simply using a library like Boost or the standard C++ library can cause decades of programmers' lives to be lost. Of course, the local operating system will generally cache most files in RAM once they've been read... making the file system largely irrelevant. But compiling something that produces a large number of object files (such as the Linux kernel) on a system with anti-malware protection will kill performance in general.

To distribute the task of compilation across multiple systems, there are many solutions, but tools like Incredibuild handle this in a far more intelligent manner than placing a large burden on the file system. Testing file access in this context is therefore meaningless, because it presents a higher-performance file system, rather than a distributed compilation environment, as the solution. Simply precompiling the headers and distributing them, along with the code to be built, to the other systems is far more intelligent; a rough sketch of what I mean follows.
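To make that concrete, here's a minimal sketch of the precompiled-header half of that argument. The file name, the includes, and the exact g++ invocation are illustrative assumptions on my part, not anyone's actual build:

    // pch.h -- a single umbrella header for the heavy, rarely-changing includes.
    //
    // Built once up front with something like:
    //   g++ -x c++-header pch.h -o pch.h.gch
    //
    // Every translation unit then begins with #include "pch.h", and the
    // compiler loads the binary pch.h.gch instead of re-parsing the whole
    // header forest. Ship pch.h.gch alongside the source to each build node
    // (assuming identical compiler versions and flags everywhere) and every
    // remote compile skips the rat's nest entirely.
    #include <map>
    #include <string>
    #include <vector>
    // #include <boost/asio.hpp> // stand-in for a typical heavy dependency

The point being: the expensive header parsing happens once, not once per file per node, and no amount of file system bandwidth buys you that.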

Then there's the case of data storage and manipulation. Your product makes a big point of running side by side with compute on large nodes which also hold storage. In terms of making file I/O perform better, building a better distributed file system that implements the POSIX APIs makes a lot of sense... if you're interested in treating the symptoms but not the underlying problem.

When working with huge numbers of nodes and huge data sets, the data in question is generally structured in at least some way that can be considered object-oriented. It may not be relational, but it can usually be broken down into smaller units of computation.

Consider mapping a DNA strand. We could have hundreds of terabytes of data if we store more than simple ribosome classification. If we stored the molecular composition of individual ribosomes, the data set would be massive. In this case, each ribosome can be structured as an object, which can be distributed and scheduled far more intelligently by a database that handles hot and cold data distribution across the cluster through either sharding or share-nothing record replication. A toy sketch of the idea follows.
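As an illustration, here's a toy version of hash-based sharding with a naive neighbour replica. Every name in it is hypothetical, and a real system would use consistent hashing and proper replica placement rather than a fixed modulus:

    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <string>

    // Hypothetical record: one ribosome, keyed by ID.
    struct Record {
        std::string id;      // e.g. a ribosome identifier
        std::string payload; // molecular composition data, elided here
    };

    constexpr std::size_t kShards = 16; // assumed cluster size

    // The record's "home" (hot) shard, chosen by hashing its key.
    std::size_t primary_shard(const Record& r) {
        return std::hash<std::string>{}(r.id) % kShards;
    }

    // A share-nothing replica on the neighbouring shard -- deliberately naive.
    std::size_t replica_shard(const Record& r) {
        return (primary_shard(r) + 1) % kShards;
    }

    int main() {
        Record r{"ribosome-42", "...composition elided..."};
        std::cout << "primary shard: " << primary_shard(r)
                  << ", replica shard: " << replica_shard(r) << '\n';
    }

Once records hash to shards like this, the scheduler can send the compute to whichever node already holds the object, instead of dragging the bytes across a POSIX file system.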

Consider the data from a collision in an LHC experiment. The data is a highly structured representation of energy readings which are not themselves structured... or at least not until we've identified their patterns. As such, the same general principle of shared-nothing database technology makes sense.

To have a single distributed file system store this data would be quite silly, as the data itself is far better represented as a massive number of database records or objects than as files.

The only case I know of anymore where a large-scale file system makes sense is virtual machine image storage. And in this case, since VMware has one of the most impressively stupid API licensing policies EVER... you can't generally depend on supporting them in a meaningful way. They actually wanted to charge me $5,000 and make me sign NDAs blocking me from open-sourcing a VAAI NAS driver for VMware. I simply moved my customers away from VMware instead... that was about $5,000,000 lost for them. In addition, if I instead had to install a VIB to support a new file system, I'd be nervous, since VMware famously crashes in flames constantly due to either storage API or networking API VIBs.

That said, VM storage for Hyper-V, KVM and Xen is a great place to be. If I'm using Hyper-V, I'll use Storage Spaces Direct; for KVM or Xen, I can see room for a good replacement for Gluster and the others.

So, now that I've hit you with a book... I'm interested in hearing where your product fits.

I read your entire web page because you sounded interesting, and I found the technology itself genuinely intriguing. Under different circumstances, I'd probably even ask for a job as a programmer to have some fun (it's sad, but I find writing distributed file systems fun). But I simply don't see the market segment this technology targets. Is it meant as file storage for containers? Is there something which makes it suitable for map/reduce environments other than better database-tier distribution?

I look forward to hearing back. I get the feeling you and I could have some absolutely crazy (and generally incomprehensible) conversations at a pub.

P.S. - I'm working on a system now that would probably benefit from technologies like yours if I weren't trying to solve the problem higher up the stack. I may still need something like this later on if you start looking towards FaaS in the future.
