
Press release
I thought, briefly, that we'd stopped regurgitating storage company press releases with basically zero analysis.
Seems I was wrong again.
Swiss startup balesio, staffed by all of nine people, has devised a penalty-free way of reducing unstructured data file sizes without altering the original file format, meaning no rehydration or decompression is needed to read the reduced size files. Its Native File Optimisation (NFO) software technology analyses unstructured …
In the demo I saw there was a balesio run against a couple of images. The resulting output files were much smaller than the input files. Their on-screen dimensions were the same and their on-screen appearance to my eyes was the same as well.
As far as I can see the optimisation technology is visually lossless (to human eyes). It does what it says on the can.
Also, just to enjoy a tart comment for a second, there was no press release, the story being based on an interview.
Chris.
"In the demo I saw there was a balesio run against a couple of images. The resulting output files were much smaller than the input files. "
I can show this with compression and file optimizations for certain files, but not in general. There's a reason for that.
"Also, just to enjoy a tart comment for a second, there was no press release, the story being based on an interview."
Fair enough - I'll adjust my comment to "sounding like a press release". If you know anything about compression, file formats and/or information theory - which I'm sure you do - I'd hope you could see the lack of any useful analysis of this "technology", and why it doesn't (and cannot) work in general as well as portrayed.
I don't know why they keep repeating the word "unstructured", though, and the term is definitely misleading. But the compressor is obviously aware of the file format. I can see no claim to the contrary - they actually say that they started with Microsoft Office files and moved on to PDFs.
That Office XML format is hideous. In addition to all the Microsoft-only stuff, the same complex style tags are used over and over. Turning a document into plain HTML reliably reduces its file size by 80%. It sounds like they've found a way to automate that sort of process.
'Structured' files are typically binary formats, where data is stored at fixed offsets within the file. Unlike XML, there's no way to shorten those without corrupting the file.
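To make that concrete, here is a rough sketch of what "optimising inside the format" can look like for Office files - my own toy example in Python, nothing to do with balesio's actual code. A .docx or .pptx is just a ZIP container of XML parts, so even trivially repacking the container with stronger compression shrinks the file while leaving it a perfectly valid document that Word opens as-is:

import zipfile

def repack_office_file(src_path: str, dst_path: str) -> None:
    # Rewrite an OOXML container (.docx/.pptx/.xlsx) with maximum deflate
    # compression. Every part is copied through unchanged, so the file
    # format and the content are untouched - only the packing improves.
    with zipfile.ZipFile(src_path, "r") as src, \
         zipfile.ZipFile(dst_path, "w",
                         compression=zipfile.ZIP_DEFLATED,
                         compresslevel=9) as dst:
        for item in src.infolist():
            dst.writestr(item.filename, src.read(item.filename))

repack_office_file("report.docx", "report_optimised.docx")

Real "native file optimisation" presumably goes much further than this (re-encoding embedded images, stripping redundant style definitions), but the principle is the same: no new format, no rehydration step.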
So the software does change the data - presumably irreparably - by downscaling image sizes in documents. This is crazy talk for automated enterprise use. Imagine the support calls - "Hey, Helpdesk! Who the heck reduced my high-resolution image to a tiny JPEG?" Crazy talk, hence Paris.
Chris, allow me to point out a few things here:
1) "Scaling back" colour and resolution attributes may not be desirable, especially in regulatory and compliance instances.
2) The 5%-10% savings attributed to NetApp dedupe is based on one customer's comment, which hardly represents an installed base of tens of thousands of dedupe users.
3) Penalty-free? Dedupe's value in primary storage is providing reasonable capacity savings without degrading performance. Dissecting files, rescaling, and removing duplicate images seems like some mighty heavy lifting to me - why no mention of the performance penalty?
Larry Freeman aka DrDedupe
(a NetApp Employee)
Larry,
True and not true.
1) "Scaling back" is not what we are doing because it would mean we treat every object in the same exact way. No, what we are doing is recognizing the contents (if you wish "interpreting" correctly the elements and objects there) and optimize them according to what they are. the result is a visually lossless file. If we were to scale back attributes, we would not be visually lossless.
2) True, that is a customer comment. What is the true ratio for these kinds of unstructured files with internally compressed content (PowerPoint, images, etc.)?
3) It is penalty-free because you do not need a reader or any rehydration of an optimized file. The optimization itself takes processing effort, but only one time. Once the file is optimized it is smaller and no longer needs to be rehydrated by any application or system. And a smaller file also loads faster, so after optimization less performance is required to handle that file.
In general, our approach is totally different from dedupe. We don't look across files but INSIDE files to optimize capacity. By doing so, we create an open form of capacity savings, and users can do primary dedupe and everything else in the same way after optimization.
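To give readers a feel for what "looking inside files" means, here is a deliberately crude Python sketch - an illustration only, not our algorithm (we recognize content and treat each object for what it is, rather than applying a blanket quality setting). It opens a PowerPoint container, re-encodes the embedded JPEG images, and writes the file back in the same format, so it opens in PowerPoint with no special reader and no rehydration step:

import io
import zipfile
from PIL import Image  # Pillow, assumed installed for this illustration

def recompress_pptx_jpegs(src_path: str, dst_path: str, quality: int = 80) -> None:
    # Walk the .pptx ZIP container, re-encode embedded JPEGs at a lower
    # quality, and keep the re-encoded copy only if it is actually smaller.
    # Everything else is copied through unchanged, so the result is still
    # an ordinary .pptx file.
    with zipfile.ZipFile(src_path, "r") as src, \
         zipfile.ZipFile(dst_path, "w",
                         compression=zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename.startswith("ppt/media/") and \
               item.filename.lower().endswith((".jpg", ".jpeg")):
                img = Image.open(io.BytesIO(data)).convert("RGB")
                buf = io.BytesIO()
                img.save(buf, format="JPEG", quality=quality, optimize=True)
                if buf.tell() < len(data):
                    data = buf.getvalue()
            dst.writestr(item.filename, data)

recompress_pptx_jpegs("deck.pptx", "deck_optimised.pptx")

The point of the sketch is only the shape of the approach: the savings happen inside the file, and the optimized file stays in its native format.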
Best,
Chris