Deduping supposedly encrypted data? Yet it can spot 20 copies of the same file from different users?
You want to patent that. No really...
Start-up Bitcasa invites you to shove all your data into the cloud and use your hard drive as a cache. It's offering infinite storage capacity, it says, for 10 bucks a month. Really. Bitcasa says it is different from Nirvanix, Mozy and others because it is not a cloud-based backup company. It's different from Dropbox because …
They use the same cereal box decoder ring for each encryption. Because of this they can't "know anything about files and folders" but they can still spot that template you and 10 others copied from the internet for your PowerPoint slide deck. Template spotted; 10 copies not stored.
See? Easy peasy.
Too bad for them that Cap'n Crunch and I have prior art dating to the early '80s.
Even if they use the same template, the resulting file will be different, so file-level matching is out based on hashing of the encrypted data. It says if you share even a single slide between ppts, they dedup it, which means perhaps block-level dedup with understanding of filetypes to a degree. Likely they "stripe" intelligent chunks (slides, pages, or simply data-blocks) across hdds and match those chunk signatures. They may not know what's in the file server-side, but I'm sure the client-side is quite aware to pull this off.
Take 10 punters with DIFFERENT powerpoint presentations. Hopefully the encryption will turn them into the same bit stream that can then be de-duplicated and only one copy stored.
Either that or everything is sent to /dev/null and is able to be retrieved as long as it is cached on your hard drive....
I'm not even going to list the companies. I'd run out of room. Unlimited bandwidth deals. Unlimited storage deals. Infinite (or @#$#$%$^ Unlimited) any bloody thing.
I'd rant. I'd rave. But I really can't be bothered. Apart from saying that the appearance of 'infinite', 'unlimited' or 'forever' in advertising for just about anything these days is a quick way to stop me ever becoming a customer.
In order to do deduplication they have to know what your data is --- obviously; they need to be able to match the hashes of a block from one user against a block from another user. And, naturally, since they're only storing one instance of the block, that block must be accessible by both users. The only way I can conceive of this working in an encrypted environment is if all users have the same encryption key... which rather defeats the purpose of encryption.
Or am I missing something?
Having read the "learn more" link on their site, they appear to suggest that the data is encrypted client side and stored unchanged, i.e. encrypted, on their storage.
If users do not have the same encryption key then I'd have thought the likelihood of finding any duplicated data that users could share would be minimal. If they do have the same key, then identical files encrypted by 2 users could result in the same encrypted data block. As they have no access to the unencrypted data, they can't know if user1's powerpoint slide is the same as user2's unless they use the same key.
I suppose because they are looking below file level then blocks of data could be duplicated across users, and the smaller the block size the likelier that would be, peaking at about 50% when the blocksize is 1bit (assuming a random average).
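As a rough check on that 50% figure, here is a minimal sketch. It assumes blocks are independent and uniformly random, which is what well-encrypted data should look like; `match_probability` is a name made up for illustration.

```python
from fractions import Fraction

def match_probability(block_bits: int) -> Fraction:
    """Chance that two independent, uniformly random blocks of this size are identical."""
    return Fraction(1, 2 ** block_bits)

# 1-bit blocks match half the time; anything block-sized essentially never does.
for bits in (1, 8, 64, 4096 * 8):
    print(f"{bits:>6}-bit blocks match with probability 1 in 2^{bits}")
```

So chance matches between differently keyed ciphertexts only help at block sizes far too small to be worth indexing.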
Perhaps I am missing something too. I share Mr Given's Misgivings.
The link has many comments that explain it fairly well.
Let's see, if you have encrypted my data it is because it does not bear any reasonable resemblance to the original. Hence, you cannot compare it with someone else's files and determine if I have an identical copy.
Unless of course, you store a signature (hash) of each of my files. That avoids you having to look at the file contents, you can just compare the hashes and determine if two files are identical.
But if you do that you have not encrypted my data, at least given any file I can tell if you have it or not.
But it's even worse than that and I don't know why nobody has highlighted it. For encryption to be secure, it has to be based on something shared between the cloud and me, plus something else that only I know. Hence, if encryption keys are shared among clients, you're actually encrypting everyone's data with the same key.
Which is enormously risky, because if the private key is revealed, all the encrypted data, not only that user's but everyone's, can be decrypted. Ask the HDMI and Sony guys about that.
This is such a scam that I don't know why even The Reg is giving it visibility. Oh, well, to allow for some anonymous coward to post.
All the posts I've read so far seem to talk about asymmetric encryption, or symmetric encryption. It is entirely possible to do what Bitcasa are claiming. Although I wonder if they got it right. Anyways, the classic Needham-Schroeder protocol (assuming you replace nonces with timestamps) provides a good basis for it.
It works like this:
Alice is a subject that submits a file
Bob is a subject that has shared access to Alice's file
Sam is the Bitcasa server
Alice calls Sam and says she'd like to share a file with Bob.
Sam makes up a session key message consisting of Alice's name, Bob's name, a key for them to use, and a timestamp.
Sam encrypts all this under the key he shares with Alice, and he encrypts another copy of it under the key he shares with Bob.
He gives both ciphertexts to Alice.
Alice retrieves the key from the ciphertext that was encrypted for her, and passes on to Bob the ciphertext that was encrypted for him.
Next, Alice creates a hash of the unencrypted file, and sends that to Sam for indexing.
Alice now uploads her file to Sam, encrypted using the key from the ciphertext that was encrypted for her.
Bob has access anytime he likes, using the key from the ciphertext that was encrypted for him.
Simples. If, as I said, they got a). the protocol right and b). the implementation actually reflects the protocol.
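The exchange above can be walked through in a few lines. This is only a sketch under loud assumptions: `toy_cipher` is an insecure XOR keystream standing in for a real cipher, and `shared_keys`, `sam_issue_tickets` and `open_ticket` are hypothetical names, not anything Bitcasa has published.

```python
import hashlib
import json
import os
import time

def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream. Toy only, NOT secure."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))

# Long-term keys Sam shares with each subject (assumed to exist already).
shared_keys = {"Alice": os.urandom(32), "Bob": os.urandom(32)}

def sam_issue_tickets(sender: str, receiver: str) -> tuple[bytes, bytes]:
    """Sam builds the session-key message and encrypts one copy per subject."""
    message = json.dumps({
        "from": sender,
        "to": receiver,
        "key": os.urandom(32).hex(),
        "timestamp": time.time(),
    }).encode()
    return (toy_cipher(shared_keys[sender], message),
            toy_cipher(shared_keys[receiver], message))

def open_ticket(long_term_key: bytes, ticket: bytes) -> bytes:
    """A subject recovers the file key from the ticket encrypted for them."""
    return bytes.fromhex(json.loads(toy_cipher(long_term_key, ticket))["key"])

# Alice asks Sam for tickets, opens hers, and forwards Bob's unopened.
ticket_for_alice, ticket_for_bob = sam_issue_tickets("Alice", "Bob")
file_key = open_ticket(shared_keys["Alice"], ticket_for_alice)

plaintext = b"quarterly slides"
index_hash = hashlib.sha256(plaintext).hexdigest()  # sent to Sam for indexing
stored_blob = toy_cipher(file_key, plaintext)       # what Alice uploads to Sam

# Later, Bob opens his ticket and decrypts the stored blob.
bob_key = open_ticket(shared_keys["Bob"], ticket_for_bob)
recovered = toy_cipher(bob_key, stored_blob)
print(recovered == plaintext)
```

Note that in this walkthrough Sam generates the file key himself, so whether Sam "doesn't know" the contents depends entirely on him forgetting it.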
...when your hashes are computed from the unencrypted source. You could split the file into block-sized chunks and hash those. Or you could treat file contents (e.g. individual PowerPoint slides) separately. In fact you could templatize file chunking based on a new policy downloaded from the server for each and every session. If you wanted.
How the hash values (if that's what they're using) are computed will determine the granularity of deduplication. From there the problem is one of indexing and content management.
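To make the granularity point concrete, here is a minimal sketch assuming fixed-size chunks hashed with SHA-256 on the client before encryption. The two "decks" are made-up byte strings that share one copied block; `chunk_hashes` and `shared_chunks` are illustrative names.

```python
import hashlib

def chunk_hashes(data: bytes, chunk_size: int) -> list[str]:
    """Hash each fixed-size chunk of the unencrypted data."""
    return [hashlib.sha256(data[i:i + chunk_size]).hexdigest()
            for i in range(0, len(data), chunk_size)]

def shared_chunks(a: bytes, b: bytes, chunk_size: int) -> int:
    """Count distinct chunk hashes that appear in both inputs."""
    return len(set(chunk_hashes(a, chunk_size)) & set(chunk_hashes(b, chunk_size)))

template = b"T" * 4096            # stand-in for a slide both users copied
deck_a = template + b"A" * 4096   # user A's own content
deck_b = template + b"B" * 4096   # user B's own content

whole_file = shared_chunks(deck_a, deck_b, 8192)  # one hash per file
per_block = shared_chunks(deck_a, deck_b, 4096)   # one hash per 4 KiB block
print(whole_file, per_block)  # prints "0 1"
```

Whole-file hashing sees two different files; 4 KiB hashing finds the shared template. Real files rarely line up this neatly on block boundaries, which is presumably where the filetype-aware chunking comes in.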
The real security issue such a system faces is key management. There will be a public/private keypair (asymmetric) for every user, and another for every user's device. There will be a symmetric key for every file. That's a lot of key management.
I guess the final point is that using a well-thunk-through combination of asymmetric, symmetric and one-way encryption, it's entirely possible to compare segments of files you don't know the contents of.
The point was about how encryption and deduplication are basically incompatible. Your example only addresses the sharing scenario, and assumes that one knows before uploading who one is going to share things with. It does not address the situation where one wants to share the file after it is uploaded.
Nor does it address how the system is going to know which two files from different users are identical. Hashing prior to encryption is at best a leak, because the system knows who else has something that hashes to the same value and thus has a degree of knowledge of the content (i.e., the RIAA/MPAA can ask them who has a file by providing a hash), and at worst a terrible security nightmare, since the hash is calculated by an untrusted party: the client PC.
The possibilities for wreaking havoc are endless. So yeah, unlimited storage (patently false) without knowing what we are storing (false). Pretty much invalidates the whole product.
Dude, without prejudice, I imagine you're probably not familiar with asymmetric and symmetric encryption. If you're interested, check out how PGP managed session keys. Similar concept, different application.
If I upload a file it is encrypted using a symmetric key I own. If I then share that file at a later stage all I have to do is to share the symmetric key using my friend's public key. This would be done at the point where I instruct Bitcasa to "Share file.ptt with UserX".
The system knows two files (or file parts) are identical because their hash values are identical. Yes. This means the hashing password must come from the server.
I did not specify where keys are generated - I don't know that. Either client or server are a good choice, depending on your objective.
There are some very sensitive documents that use a similar protocol to make encrypted content searchable. Your risk analysis will highlight the impact and probability of any weakness. It is then a business decision to mitigate (manage), transfer or accept those risks.
Many problems I've worked on choose to both mitigate and accept - i.e. in the search implementation I did, the search index was also encrypted. AES was fast enough for that. The threat model showed that accepting the remainder of the risk (shared symmetric key for the index) had business legs.
Technology is easy. People and process are not.
I'm not an expert in encryption, but still not convinced that their claims are false. You seem well versed so maybe it's a good time to learn something. Let's see.
Sharing after uploading: so each shared file with each individual has an associated key? Looks like the right way to do it from a security point of view. However, that's an awful lot of keys to handle, does not look very scalable to me. Way less than "unlimited", which the simple laws of physics say was false from the start.
Hashing to de-duplicate: if the hashing password comes from the server, I have to upload the file unencrypted, right? Hence, the service knows the unencrypted contents of my file. Fails on "we don't know what you're storing" part.
To avoid that, the client can encrypt before uploading and then upload the hashes as part of say, metadata. Then the system will know the hash of the raw content only because it trusts the client, so I can make up whatever hash I want and check if the server has it. Great for content providers, I guess, but fails again in the "we don't know what you're storing".
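A minimal sketch of that probing problem, assuming the server keeps nothing but an index of client-supplied hashes. `DedupIndex` and its methods are hypothetical, invented here for illustration.

```python
import hashlib

class DedupIndex:
    """Hypothetical server-side index: it sees only hashes the clients claim."""

    def __init__(self) -> None:
        self._known: set[str] = set()

    def upload(self, claimed_hash: str, encrypted_blob: bytes) -> None:
        # The server cannot check the hash against the plaintext; it just trusts it.
        self._known.add(claimed_hash)

    def has(self, claimed_hash: str) -> bool:
        return claimed_hash in self._known

server = DedupIndex()

# An ordinary user uploads an encrypted copy of a well-known file.
popular_file = b"contents of some widely shared file"
server.upload(hashlib.sha256(popular_file).hexdigest(), b"<opaque ciphertext>")

# Anyone else who holds the same file can now probe for it without any key.
probe = hashlib.sha256(popular_file).hexdigest()
found = server.has(probe)
missing = server.has(hashlib.sha256(b"a file nobody uploaded").hexdigest())
print(found, missing)  # prints "True False"
```

The ciphertext never gets touched; the yes/no answer is the leak.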
I'm deeply suspicious of this too. The encryption just cannot be for everything.
The techcrunch article even says it "doesn’t know anything about the file itself, really. It doesn’t see the file’s title or know its contents." And you "can share a link (file, folder) with other users". Really? Share a file it "really" doesn't know about? Oh rly! Sounds like marketing bull to me.
If it is all encrypted client side before upload and the hosts "can't access or see" the encrypted data, then passing a link to it when it is sitting "in the cloud" gets decrypted how? Unless, the other user also has to have Bitcasa client and all clients have to use the same key to encrypt/decrypt...
Or it is indeed a bucket of still steaming, finest marketing.
"Thank you for signing up for the Bitcasa beta. Space is extremely limited and you are at the back of the queue.
To move yourself up in line, send a tweet or post to Facebook including your personal sharing link below. The more people you get to sign up, the sooner you get Bitcasa."
so they want you to spam FB and twitter and get people after you to sign up, how the fuck does that move you up a FIFO Q?
They'll basically tell you to spam all your friends in order to increase your chances of actually getting on the beta.
 'Tell all your Facebook friends and paste the following URL into your Twitter feed' - and what about people that don't have farcebook and twitter?
Neither encryption nor dedupe are my strong points, but why couldn't you encrypt at file level and dedupe at block level? If a block of encrypted data looks like "01010111", for example, and you dedupe any blocks with that same sequence, surely that is completely abstracted (and therefore irrelevant) to the encryption keys/vectors etc that are in use?
I signed up for beta and got the email
"Space is extremely limited and you are at the back of the queue. The more people you get to sign up, the sooner you get Bitcasa. Use the link below to share with friends or post to your social networks."
I think someone attended too many web 2.0 marketing seminars.
The problem is that the number of possible permutations balloons with just one additional byte. Each one multiplies the total possible combinations by 2^8, or 256. Put in perspective, to actually store the two-byte words of every single 16-bit possibility from 0 to 65535 would require 2 x 65536B, or 128KiB (this from just 16 bits--double it and you leapfrog Mebibyte into Gibibyte territory).
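The arithmetic is easy to check. A small sketch, where `bytes_to_store_all_values` is just an illustrative name:

```python
def bytes_to_store_all_values(bits: int) -> int:
    """Bytes needed to store every possible value of a word this wide, once each."""
    return (bits // 8) * 2 ** bits

kib_for_16_bit = bytes_to_store_all_values(16) // 1024       # 2 B x 65536 values
gib_for_32_bit = bytes_to_store_all_values(32) // 1024 ** 3  # 4 B x 2^32 values
print(kib_for_16_bit, "KiB")  # prints "128 KiB"
print(gib_for_32_bit, "GiB")  # prints "16 GiB"
```

At 64 bits the same sum comes to 2^67 bytes, so a dictionary of "every possible block" is a non-starter at any realistic block size.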
Just keep sending the data round and round "the internet". Using the cache on all the servers and routers it passes through to store it. Then when you want to retrieve it, just wait until it comes round through your servers on its next "orbit".
It's a mashup of mercury delay line memory, logistics companies using their lorries as warehouses (while they're on the road, delivering your stuff) and the standard internet/cloud marketing BS.
If they dedup at the hard drive level then there are only so many sequences of bytes you can have. If they do not know what is file and what is directory then your client must hold all of that information & they would not be able to use hash keys to identify files.
With several Petabytes of data even "random" content such as encryption is going to have a few matching patterns; true they cannot identify that you and your Facebook friend have the same slide in a power point slide but they don't need to. I don't see why every client needs to share any keys, I think the power point analogy in the article has sent people down the wrong mental street.
...on the subject of infinite space availability... They don't have to actually PROVIDE infinite capacity, since they'll have finite customers. They just need to have enough capacity per-customer to meet the average customer - who probably needs very little storage. So they have to add the drive space (and infrastructure) for, say, 50gb for every new customer - or whatever it averages out to; your usual person hasn't got 100gb of MKVs - and really that's pretty cheap. You amortize it and it comes out OK, I bet.
You can make educated guesses about the transfer costs, too; they already say that high-volume stuff is 'cached' on the local drive. So you store DSCN4012.JPG for six months and they download it to show grandma. You've used 4mb of bandwidth for that chunk for a year.
As long as you can scale your storage to match each customer, and your averages are correct (and presumably they'll get better the more customers you have) there isn't any reason you shouldn't have effectively infinite storage. The service won't be practical for storing big-ass media libraries unless you have unlimited, *fast* network; it won't be practical for things like movie or audio editing or game development; it won't be practical for loading Crysis Tournament and Conquer of War: Lost Coast.
So what's going to go on there? Pictures of grandma and the cat, email folders, Word docs, and the odd music collection - tiny, in the scheme of things.
I don't see why the storage aspect won't work. Encryption, of course, is another matter.
Infinite storage is impossible
However there will be limited bandwidth, and limited time.
Like an unlimited mobile phone contract, you cannot talk for more than (60*24*31=) 44640 mins a month, and most people (they hope) talk for about 150-600.
Unlike voice, you can throttle bandwidth for heavy users (as isps do) - or ask heavy users to pay more for more bandwidth - but the storage is still free :)
Deduplication I leave to others, but would wonder how many of us encrypt our MP3/4 collections.
Of ~60GB of files, I have 30GB mp3, 20GB mp4/avi, 8GB photos and 2gb other
So, 1/6th cannot be deduped.
Getting clever, you could look at my music collection and make good guesses on things I would like if there was enough of a user base. Privacy is a subject for historians to invade.
1) checksum local file
2) if checksum does not exist in online storage area, transfer checksum and encrypted file as a pair to online storage area.
3) if checksum does exist in online storage area, store a pointer
Only problem being, querying the checksum catalogue is gonna get slower the more data gets stored. Presumably all clients download updates to the checksum catalogue for faster local querying.
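Those three steps might look something like this. It is a sketch only: `OnlineStore` and its `put` method are invented names, the `encrypt` argument is whatever client-side cipher you like, and a plain dict stands in for the checksum catalogue.

```python
import hashlib

class OnlineStore:
    """Hypothetical online storage area keyed by checksum."""

    def __init__(self) -> None:
        self.blobs = {}      # checksum -> encrypted bytes, stored once
        self.pointers = {}   # (user, filename) -> checksum

    def put(self, user: str, filename: str, plaintext: bytes, encrypt) -> bool:
        """Store a file; return True if the encrypted bytes had to be transferred."""
        checksum = hashlib.sha256(plaintext).hexdigest()   # 1) checksum local file
        transferred = checksum not in self.blobs
        if transferred:
            self.blobs[checksum] = encrypt(plaintext)      # 2) send checksum + encrypted file
        # 3) record a pointer (also needed for the first uploader, to find it again)
        self.pointers[(user, filename)] = checksum
        return transferred

def placeholder_cipher(data: bytes) -> bytes:
    return data[::-1]   # NOT encryption, just a stand-in so the sketch runs

store = OnlineStore()
first = store.put("user1", "deck.ppt", b"same slides", placeholder_cipher)
second = store.put("user2", "copy.ppt", b"same slides", placeholder_cipher)
print(first, second, len(store.blobs))  # prints "True False 1"
```

The second user ends up with a pointer to bytes encrypted by the first user's cipher, which is exactly where the key-sharing argument elsewhere in this thread starts.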
Will be interesting to see if they've done their maths correctly, but when I see broadband being provisioned for £3.25pcm and the price of S3, it's not out of the question.
5) User later discovers that although the checksum matched, there was in fact a hash collision, and when retrieving their files, gets someone else's document. Probability goes up the more files you store.
6) You are then sued by either customer.
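How fast that probability goes up depends heavily on the checksum width. A rough sketch using the standard birthday-bound approximation, assuming the hash behaves like a uniform random function; `collision_probability` is an illustrative name and the item counts are made up.

```python
def collision_probability(items: int, hash_bits: int) -> float:
    """Birthday-bound estimate n(n-1) / 2^(bits+1), capped at 1.

    Only a good approximation while the result is well below 1.
    """
    return min(1.0, items * (items - 1) / 2 ** (hash_bits + 1))

for bits in (32, 128, 256):
    p = collision_probability(10 ** 12, bits)
    print(f"{bits}-bit checksum, 10^12 stored blocks: about {p:.3g}")
```

With a 32-bit CRC the lawsuit is guaranteed; with a 256-bit cryptographic hash the estimate is vanishingly small, though not zero.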
In dedupe, hashes are just used as an indicator for which files/blocks to do a bit compare on, if you don't want to lose data anyway.
I guess a case could be made for a checksum, the size of which was as large or larger than the file itself, which could mathematically guarantee no collisions while not being subject to decryption to reveal the file's contents. Not sure such an algorithm exists.
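A minimal sketch of that hash-then-compare approach, assuming the store can see the actual bytes it is comparing. `VerifyingStore` is a made-up name, and the deliberately weak digest in the usage lines exists only to force a collision.

```python
import hashlib

def sha256_hex(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

class VerifyingStore:
    """Dedupe on digest, but confirm with a byte-for-byte compare before sharing."""

    def __init__(self, digest_fn=sha256_hex) -> None:
        self.digest_fn = digest_fn
        self.buckets = {}   # digest -> list of distinct blocks with that digest

    def put(self, block: bytes):
        """Return (digest, index) identifying the stored copy of this block."""
        digest = self.digest_fn(block)
        bucket = self.buckets.setdefault(digest, [])
        for index, existing in enumerate(bucket):
            if existing == block:        # the bit compare
                return digest, index     # genuine duplicate, reuse it
        bucket.append(block)             # new data, or a colliding block kept separately
        return digest, len(bucket) - 1

# Force collisions with a useless digest (the block length) to show the safety net.
store = VerifyingStore(digest_fn=lambda b: str(len(b)))
ref_a = store.put(b"aa")
ref_b = store.put(b"bb")        # same digest "2", different bytes
ref_a_again = store.put(b"aa")
print(ref_a, ref_b, ref_a_again)
```

Nobody gets someone else's document here, but the compare needs both copies in a comparable form, which is the hard part once everything is encrypted.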
What kind of infinity are they allowing? I'll bet it's only aleph-0.
Well, I'll go one better than that! I'll give you aleph-1 of storage!
Not just enough storage to store ALL the counting numbers, but enough to store ALL the REAL numbers, including the irrational ones! Act now, and I'll bump it to allow you to store all the COMPLEX numbers!
That's right! Not just countably infinite storage, but UNCOUNTABLY INFINITE Storage!
I can now make a fortune re-selling this service to CERN so they can store all those petabytes of Higgs boson related data.
IMNSHO the word "infinite" will now go the same way as "unlimited", meaning whatever the seller pleases. Use of either in advertising should be automatic grounds for withdrawal of the ad, much as any reference to perpetual motion invalidates a patent application.
A useful rule of thumb.. "If it sounds too good to be true, it probably IS too good to be true".
This reminds me dimly of a wonder storage product some years ago, which purported to use some kind of holographic storage method to provide vast levels of data storage in a shoebox sized device.
If this turns out to perform as advertised I will eat any kind of headcovering you care to mention.
No one has mentioned Livedrive who have been offering unlimited storage for some time now. They use similar (disclaimer, it seems similar to me..) hashing techniques to avoid file duplication and as other posters have said, I think simply rely on the fact that the vast majority of people signing up will probably use less than 50GB. And if you've got security concerns then no service which can be accessed with a username and password via the web is ever going to be safe enough for you, in my opinion.
Lots of discussion has been around how they'd store unique files. I think a missing piece is that they can and probably do dedupe at the block or sector level and not only at the whole file level. I wrote a post on how I think they do it. "Understand how Bitcasa can do what they say." http://t.co/nzoSOFy.
So, these duplicate file hashes... will the MPAA/RIAA be able to get hold of said hashes / create them, then request Bitcasa to hand over everyone who has an MP3/video matching that hash? Possible instant mass lawsuit ensues?
Other than that, it sounds a great plan, but I suspect a lot of freetards would get potentially nailed.