There is a major difference between Dropbox and Mega.
I am a Dropbox customer.
Kim Dotcom's comeback cloud storage service, Mega, has responded to criticism about its approach to cryptography and password security after security researcher Steve Thomas (@Sc00bz) released his MegaCracker tool, which cracks hashes embedded in emailed password confirmation links. In a blog post designed to reassure users, …
"Mega was already catching up with Dropbox in daily usage."
That's.... so stupid it's not even worth mentioning. That's like saying increasing a number from 0 to 1 makes it closer to infinite.
If I open a new shop and it sells a single spork in the first week of trade, then I'm 'catching up' with Amazon.
Well, if you're right about Psyx's point then the second example conflicts with the first.
Sales of 1 are effectively zero when compared to Amazon. But they're still a finite distance away, and if I keep increasing my sales by one I will eventually get on par. Adding 1 to 0 makes me absolutely no closer to infinity.
So Psyx's first example is about absolutes. The second about practicalities. They're very, very different things and make very different points. So either sloppiness or hyperbole.
Agreed. I might not have the same levels of security as a Swiss Bank, but then Mega is not intended for the storage of billions and billions and billions of Francs/Pounds/Euros/Dollars.
As long as the resources required to break in are several orders of magnitude more than the value of the encrypted data, it's good enough. And remember, even if there is some juicy stuff on Mega, it's still swamped by crap by a very large ratio (needle in a haystack, etc).
Well I got burned for suggesting they simply chunked the encrypted files and looked for matches on the last post, but it seems to me that the other suggestion, that they hash the files before they're encrypted, isn't going to work either, as Mega are saying they're not keeping any keys and user b's key isn't going to decrypt user a's file even if they were the same before uploading.
Here's my optimistic thinking: The majority of the files that get uploaded to a service like this are already tightly compressed, and that the encrypted versions of these files are going to be bigger and a sliver more compressible, and that given the scale they expect to be working with, any dedupe, no matter how minor is going to reduce costs.
Here's my realistic thinking: The announcement that they will be able to change your password and re-encrypt your files suggests bullshitting. If all files can be decrypted with a user's private key and Mega's master key, then everything they've said they're doing is possible except the lie about not having access to users' files.
Hash before encryption is it. Nobody will know what is in your original and personally created data but the hash matches will allow for reverse lookup of known files. Very small files could be brute-force decoded. It's not great privacy.
Big hashes do create false positives sometimes so there can be data loss. Sure, it's a chance of 1 in a nearly infinitely big number, but the amount of data in the world is nearly infinite too. Math says that a smaller number of bits can't represent all the patterns of a larger number of bits.
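For a sense of scale, here is a rough sketch of the birthday-bound approximation for accidental hash collisions. The function name, the 256-bit digest size and the item count are assumptions picked for illustration, not anything Mega has published.

```python
import math

def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items random digests collide.

    Uses the birthday-bound approximation p ~ 1 - exp(-n(n-1) / 2^(bits+1)).
    """
    exponent = num_items * (num_items - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(-exponent)

if __name__ == "__main__":
    # Assumed example: 2^64 stored items under a 256-bit hash.
    print(collision_probability(2 ** 64, 256))
```

Under these assumptions the chance is not zero, but it stays far below any realistic hardware failure rate until the number of items approaches 2^128.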
You dedupe the encrypted data, so if you and I both upload a 1MB file, and the encrypted versions are exactly the same (significantly less likely than the cleartext versions being identical, I know), they get deduped. And you can raise the chances of successful deduping by doing it on chunks rather than whole files - now our files are different when encrypted, but the third page of (say) 4KB of mine matches the 100th page of yours, so those pages get deduped.
With 4KByte chunks of what is essentially random (encrypted) data, the probability of finding an identical chunk is even less than 1/2^(8*4000) in theory. This is vanishingly (and ridiculously) small. That can't be the technique that Mega uses.
(I realise that in practice, no chunk will be all zeros or all ones, etc; but it will still be a tiny probability.)
A chunk has just as much chance of being all zeroes or ones as it has being any other combination :-)
However, I'm not convinced dedupe works on such random data (which in effect it will be). The key for reconstituting the deduped data would end up being the same size as the original data to begin with.
I've recently deduped 3TB of storage down to 2 bits, just a 1 and a 0. Reindexing it all is going to be a bit of a chore though...
I think that's what the flatulent herbivore was getting at with "Or chunkify(TM) the data such that..." i.e. make the blocksize sufficiently small (a byte or word perhaps!?) and there'll be a good spread of matching blocks in even rather small datasets. As long as the chaining method isn't TOO clever you should be able to use simple heuristics to match the patterns even though the actual data forming those patterns wouldn't match. Bit of a crypto disaster under normal conditions but perhaps advantageous here? Trivial to defeat though... compression or, better, pre-encryption with password as filename leap to mind.
Personally I think it's all a Dotcom get-out-of-DMCA bluff though. In that I doubt he's done anything like that at all... I expect the statement is just there for the lawyers and means something like "if you can demonstrate a match, I'll happily delete offending files (but you never will, 'cos your keys will differ :P)"
"the third page of (say) 4KB of mine matches the 100th page of yours"
This doesn't work. The chances of getting any matches are astronomically small even for 16B chunks. For 4KiB chunks it's as close to zero as you will ever get in any practical measurement situation.
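To put rough numbers on that, here is a back-of-envelope sketch. It assumes the ciphertext behaves like uniformly random bytes and that the service holds an exbibyte of it; both figures and the function name are assumptions for illustration only.

```python
import math

def log2_expected_matches(total_bytes: int, chunk_bytes: int) -> float:
    """log2 of the expected number of identical chunk pairs in uniformly random data."""
    num_chunks = total_bytes // chunk_bytes
    pairs = num_chunks * (num_chunks - 1) // 2
    # Each pair of chunks matches with probability 2^-(8 * chunk_bytes).
    return math.log2(pairs) - 8 * chunk_bytes

EXBIBYTE = 2 ** 60  # assumed total amount of stored ciphertext

if __name__ == "__main__":
    for size in (16, 4096):
        print(size, log2_expected_matches(EXBIBYTE, size))
```

Under these assumptions the expected number of matching 16-byte pairs is already well below one, and the 4KiB case is smaller by tens of thousands of binary orders of magnitude.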
My money is on the impossibility of doing what he claims they're doing. If the encryption method obeys certain constraints it is possible, but those constraints *seem* to imply a trivial plaintext attack revealing the key. Strong crypto algorithms don't have trivial plaintext attacks revealing the key. I look forward to a real cryptanalysis of the claim, but most cryptographers appear to have the same gut feeling as I do. *If* he has cracked this problem, someone in his organization is a genius, and that seems less likely than that he's simply lying his ass off.
The way to de-dupe as suggested above is to run a hash of the original, say SHA-256, and store it "in the clear". This does allow an identical file to be matched to yours, and copyright owners could make a lookup table of the hashes of all their stuff and detect copies, but it only takes one bit to be different, say in the metadata, and the hash will be greatly different. This same property will cause any attempt at de-dupe to fail also, since it needs bit-identical duplicate files.
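A small sketch of that property, using SHA-256 from Python's standard library. The sample data and helper names are made up for illustration.

```python
import hashlib

def fingerprint(data: bytes) -> bytes:
    """Plaintext fingerprint that would be stored in the clear."""
    return hashlib.sha256(data).digest()

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

original = b"some file contents, including a metadata tag" * 100
copy = bytes(original)

tweaked = bytearray(original)
tweaked[0] ^= 0x01  # flip a single bit, e.g. in the metadata
tweaked = bytes(tweaked)

if __name__ == "__main__":
    print(fingerprint(original) == fingerprint(copy))     # bit-identical copies match
    print(fingerprint(original) == fingerprint(tweaked))  # one flipped bit breaks the match
    print(differing_bits(fingerprint(original), fingerprint(tweaked)), "of 256 bits differ")
```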
Perhaps a method a little like Shazam's - taking a "fingerprint" of the file before encryption, so things that sound similar measure similarly. A corresponding process for video might look at the overall structure of the compressed video, its entropy versus time or something - again allowing similar fingerprints to be matched.
Of course, these approaches allow the rights-holders to trawl through the hashes (that would be extracted under court order) and identify stuff that looks a bit like theirs. Proving it is another matter - for that you need to be given a key, so for it to work as a file-sharer service then Mega must never own the keys, you have to ask the folder owner each time. So what if big copyright set up a load of shill accounts?
You can't hash before encryption for dedupe. That will allow you to identify identical blocks, yes. But if your block is deleted as a dupe, you can't decrypt the other copy as it's encrypted with someone else's key.
Either there's no deduplication and it's in the license agreement just-in-case, or there's a per-block master key which is accessible to multiple users. And if that's the case, your data is no more secure than on dropbox.
Either way Mr Schmitz, convicted fraudster, is talking shit. Which shouldn't come as much of a surprise.
@Fuzz The trick here is that the files are _not_ encrypted with different keys. Each file has a per-file symmetric key which is generated when the file is first uploaded. When the uploader wants to share the file, they share this key using PKI to protect it. Since the PKI transaction is all done client side, Mega have no way of intercepting the per-file key and decrypting the files - but do end up with two files on their system which have the same contents and the same key which can therefore be deduped.
As for the no password recovery - the whole point about this system is that Mega _never see_ the password to a user's master key because it is all generated client side. The fact that they can't do password recovery is actually a good sign here (modulo the entropy issues).
Whatever you might think of Kim Dotcom, I can't help thinking that he's got some smarter people working for him than many of the self-appointed security experts who seem incapable of understanding these basic points...
The only way there could be meaningful deduplication is to use a scheme broadly similar to Freenet and Entropy. You split the file into blocks of equal size, compute the hash of the block, and encrypt the contents of the block with that hash. You end up with a bunch of encrypted blocks, and an equal number of hashes of plaintext that can be used to decrypt those blocks. You take those hashes and you encrypt them with the user-provided symmetric key.
So each "file" consists of a number of encrypted blocks and a key chain to decrypt the blocks and glue them back into the original file, and the key chain is encrypted with the password that only the user knows.
The problem, of course, is that it is not beyond a powerful attacker to enumerate all of the files they believe they own copyright to, chunk the files in exactly the same way, compute the hashes, and encrypt each block with the hash. There is a possibility that they could then persuade a judge somewhere to produce a court order that demands that the following specific blocks of cyphertext and all files referencing those blocks be deleted and the owners of the accounts containing the files identified. In other words, a sufficiently well resourced entity could relatively easily identify the files and still issue takedown notices, it would just take more computing resources to do so compared to simply searching the metadata for file names.
Of course, if it were properly encrypted, this couldn't be done - but the data would also be completely undedupeable and uncompressible. If Mega really does use deduplication, I rather expect they might regret it.
Of course, the reality is somewhere in between. Mega are unable to decrypt the content, and they definitely don't own the rights to the content, which means that they would have to engage in piracy in order to police the content - so in theory, they might be off the hook for not policing the said content. OTOH, if the well resourced rights owners check the contents and hashes of most of the versions of their content that is pirated, they can provide enough identifying information on the file blocks to issue takedown notices. It shifts the policing burden toward the copyright owners, which is probably all the goal was in the first place. From there on the copyright owners can go after the users as they could traditionally - business as usual.
Thinking about it, Mega would have probably done better if they just kept quiet about the deduplication features.
@Gordan That's one way. The other possibility-- perhaps mentioned in someone else's comment; quite a lot of chaff has been posted with the wheat-- is that deduplication is enabled but effectively applied on a per-user basis.
That is, if we accept that user data is being encrypted with the user's master key, and that only that single instance of the encrypted data is being stored by Mega (e.g. a second copy, encrypted with a Mega-owned key, is not also being stored), then the only *likely* instances of duplication the system will see will come from the user him/herself, either in the form of entire duplicate files or identical data chunks within those files (assuming the data chunks are encrypted independently of each other).
Data savings might be large enough to justify this, if we consider that there is a possibility for users to maintain multiple copies of the same music file (for example), either as identical tracks from different albums or as part of playlists. Yes, I know it is much more efficient to maintain playlists as text files pointing to member tracks, but it's often more convenient to copy the playlist tracks to their own directory. Of course, metadata for the tracks will probably be different-- different album names, publish dates, etc-- so deduplication is only likely if independent encryption of data chunks is performed.
Such deduplication ought to be impossible if Mega truly didn't know the contents of uploaded content, according to critics.
If A+B => X and C+D => X there seems no reason they cannot say X is the same and deduplicate without knowing anything about A, B, C or D.
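As a rough sketch of that idea, a server can dedupe opaque blobs purely by comparing their digests, keeping a reference count so a shared blob survives until its last owner deletes it. The `BlobStore` class and its methods are hypothetical names for illustration.

```python
import hashlib

class BlobStore:
    """Stores opaque blobs once each, keyed by the SHA-256 of the stored bytes."""

    def __init__(self):
        self.blobs = {}      # blob id -> stored bytes
        self.refcounts = {}  # blob id -> number of references

    def put(self, blob: bytes) -> str:
        blob_id = hashlib.sha256(blob).hexdigest()
        if blob_id not in self.blobs:
            self.blobs[blob_id] = blob  # first copy is kept
        self.refcounts[blob_id] = self.refcounts.get(blob_id, 0) + 1
        return blob_id

    def delete(self, blob_id: str) -> None:
        self.refcounts[blob_id] -= 1
        if self.refcounts[blob_id] == 0:  # last reference gone, free the data
            del self.blobs[blob_id]
            del self.refcounts[blob_id]
```

Nothing here needs to know what produced the bytes. The catch, as others have said, is that two uploads only match if the stored bytes are identical.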
"Knowing that two files are the same, even without knowing the content, nevertheless leaks information about the data".
Does it leak any useful or usable information though? I suspect not. If it does then surely the fact I have an encrypted file already means I can theoretically know every other file that could encrypt to the same end result.
Given that the main use case that got the predecessor service shut down was sharing big content's precious assets, it is reasonable to assume that is the main use case of the all new service. If it isn't, why so much effort aimed at saying to the law "we don't know what's in the files"?
So you can guarantee a high level of de-dupe efficiency because everybody is uploading the same stuff and knowing what it is, or you can hope for some lesser degree of de-dupe based on a chunk/block level process and remaining ignorant. A smart person would go for the latter, but the thrust of the article is how incredibly naive/dumb/reckless these guys are (no password recovery process, really?). It wouldn't be much of a stretch to think they may be doing something far stupider that does allow one to de-dupe based on the unencrypted content in the interests of saving costs to pay the bail money.
Even if it is de-dupe after encryption, any decent forensic investigator would be able to join the dots by looking at patterns of usage of shared folders/keys stitched together with IP address logs to track the *really* popular stuff being uploaded, downloaded and re-uploaded again. I suspect it would not take too long to provide sufficient evidence for the big content lawyers to have a once-more-round the block with this guy.
Really, the whole thing is the cloudified equivalent of a two-year old covering their eyes and thinking they are invisible because they can't see you. It'd be funny if it wasn't so tragic. No wait, it is just funny.
"Does it leak any useful or usable information though? I suspect not. If it does then surely the fact I have an encrypted file already means I can theoretically know every other file that could encrypt to the same end result."
If I made a film and I want to see everyone who has it, surely I just upload a popular torrent version of my own film and let the de-duping software flag up everyone else who uploaded the same file, to give the feds a basis to start on.
or something like that anyway..
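A minimal sketch of that scenario, assuming (and it is a big assumption) that the same file uploaded by different accounts ends up as identical stored bytes, and that whoever holds the dedupe index can query it. The account names and function names are invented.

```python
import hashlib
from collections import defaultdict

# Hypothetical server-side index: blob fingerprint -> accounts referencing it.
owners_by_blob = defaultdict(set)

def record_upload(account: str, stored_blob: bytes) -> str:
    """Register an upload and return the fingerprint used for deduplication."""
    blob_id = hashlib.sha256(stored_blob).hexdigest()
    owners_by_blob[blob_id].add(account)
    return blob_id

def who_else_has(account: str, stored_blob: bytes) -> set:
    """Accounts other than `account` whose uploads deduped against this blob."""
    blob_id = hashlib.sha256(stored_blob).hexdigest()
    return owners_by_blob.get(blob_id, set()) - {account}
```

If the stored bytes differ per user, for instance because each copy is encrypted under a different key, the lookup returns nothing and the trick fails.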
You have sharing keys - keys which decrypt just whatever you've shared with whomever you give the key to.
However it doesn't seem to specify anywhere if you can give the same key out multiple times (print it on a website) or if it's on a per user basis (so someone writes a script to do it for you).
You can certainly dedupe encrypted data if it's a copy of the same file uploaded into the same account, but the recurrence of an encrypted block of data of any appreciable size is infinitesimally likely. So either Mega's using encryption that's somehow dedupe-friendly (i.e. insecure), their dedupe feature is just crap, or they know more about your data than they should.
It's little wonder people are deriding Mega's marketing as disingenuous, at best.
"It's probably not even important in the overall scheme."
Their business model relies on a third party uploading files that neither they, nor Mega, have the rights to and then selling Mega users access to those infringing files by the MB. Their previous business got raided. If they want to attract pirates to their new business they need to make them feel secure in the knowledge that they won't be caught should the new business get raided as well. They also need to convince the feds that this time they really don't know if a file infringes somebody's copyright. Hence the 'ZOMG, we have encryption!' spiel.
Assuming they actually de-dupe the data right now of course, it might just be in there to give them the opportunity to dedupe in the future (however they decided to do it) without getting everyone to re-agree to the T&Cs.
If I were going to be doing a file hosting service of that size, I'd certainly want the opportunity to save space at some point in the future.
"even after encryption you're going to hit some duplicates"
Not any time soon you're not.
4kB = 4*1024*8 = 32768 bits. Not 32768 possible values, 32768 bits. So basically you're flipping a coin 32768 times, repeatedly, and hoping you get the same pattern of heads or tails on multiple attempts.
"the recurrence of an encrypted block of data of any appreciable size is infinitesimally likely"
I was thinking that, but if you've got enough data in small enough blocks the odds get better. I guess someone better at maths than me can work out those odds. They might be able to dynamically apply an additional level of encoding to make a file/chunk more likely the same, carry that around as metadata, which could improve the chance of a match.
Dedupe or not; it doesn't make much difference to me as I really don't care how much disk space Mega are using or saving. Maybe they've got it and maybe it doesn't work very well in saving disk space. Not my problem.
The second line of concern arises from Mega's terms of service. These explain that the service "may automatically delete a piece of data you upload or give someone else access to where it determines that that data is an exact duplicate of original data already on our service". Such deduplication ought to be impossible if Mega truly didn't know the contents of uploaded content, according to critics.
This doesn't seem right to me. AFAICS the concern would only be legitimate if Mega is talking about different users uploading the same file. But the same user, using the same encryption key, would generate the same message digest on encryption, meaning Mega could compare message digests of files from the same user and delete one if it is a duplicate. A sensible rule, possibly.
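One caveat worth sketching: the digests only match if the encryption is deterministic for a given key. In the toy example below, `toy_encrypt` is a SHA-256 keystream invented for illustration (not a real or secure cipher, and not Mega's), and the nonce handling is an assumption used to show the contrast.

```python
import hashlib
import os

def toy_encrypt(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """XOR data with a keystream derived from key and nonce. Toy cipher; NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return nonce + bytes(out)  # the nonce is stored alongside the ciphertext

def digest(blob: bytes) -> str:
    return hashlib.sha256(blob).hexdigest()

user_key = b"same user, same key"
file_data = b"holiday photos" * 1000

# Fixed nonce: the same file encrypts to the same bytes, so the digests match.
fixed_a = toy_encrypt(user_key, b"\x00" * 16, file_data)
fixed_b = toy_encrypt(user_key, b"\x00" * 16, file_data)

# Fresh random nonce per upload: the same file encrypts differently each time.
random_a = toy_encrypt(user_key, os.urandom(16), file_data)
random_b = toy_encrypt(user_key, os.urandom(16), file_data)

if __name__ == "__main__":
    print(digest(fixed_a) == digest(fixed_b))
    print(digest(random_a) == digest(random_b))
```

So per-user dedupe by comparing ciphertext digests works only if the client derives any nonce or IV from the key and content rather than picking it at random.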
It is most odd that there is all this fuss about possible dodgy content being securely uploaded and stored in Mega vaults and yet there is no hassle at all and no media and security attention paid to the physical equivalent which has possible dodgy goods and ill-gotten riches stored in secretive safety deposit facilities which banks offer to customers with no questions asked.