Dropbox Hashing

If you add a file to your Dropbox folder that already exists on Dropbox’s servers, you don’t have to send the file.  They just mark the existing file as also being owned by you.  In fact, Dropbox works on files in 4 MB chunks, so if you modify a large file and most of it remains unchanged, they only need you to upload any 4 MB chunks that changed and don’t already exist somewhere else on Dropbox.  A lot of people are amazed or confused by this technology.  How can Dropbox’s server and your computer know that two 4 MB chunks are the same without comparing them side-by-side?

They use a technique called hashing, specifically SHA-256 (a member of the SHA-2 family), which produces 256-bit fingerprints of data.  The fingerprints (known as hashes) look like this: ca978112ca1bbdcafac231b39a23dc4da786eff8147c4e72b9807785afee48bb
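To make the idea concrete, here is a minimal Python sketch of chunk-level deduplication.  It is only an illustration, not Dropbox's actual client or protocol: the function names chunk_hashes and chunks_to_upload, and the known_hashes set (standing in for whatever the server already stores), are made up for this example.  The last line hashes the single letter "a", which produces the fingerprint shown above.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, the chunk size described in this post


def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks and return each chunk's SHA-256 hex digest."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def chunks_to_upload(data: bytes, known_hashes: set, chunk_size: int = CHUNK_SIZE) -> list:
    """Return the indexes of chunks whose hashes are not in known_hashes.

    known_hashes is a hypothetical stand-in for the set of fingerprints
    the server already has; only the chunks at these indexes need sending.
    """
    return [
        index
        for index, digest in enumerate(chunk_hashes(data, chunk_size))
        if digest not in known_hashes
    ]


# The fingerprint of the one-byte input "a".
print(hashlib.sha256(b"a").hexdigest())
```

Only the fingerprints travel to the server at first; a chunk's actual bytes are sent only if its fingerprint is unknown.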

The thing that seems unbelievable is that each hash could potentially represent somewhere around 10 ^ 10000000 different 4 MB chunks.  How is it that your chunk won’t be mistaken for one of those countless other chunks with the same hash?

The better way of thinking about it is, how many 4 MB chunks exist in the world and how many possible hashes are there?  Imagine 10 billion people, each with 4 terabytes of unique data.  That makes 1 million chunks per person, or 10 ^ 16 chunks in total.  The number of possible hashes is 2 ^ 256, or approximately 10 ^ 77.  What’s the probability that any two of the 10 ^ 16 chunks share the same hash given that there are 10 ^ 77 possible hashes?  There are fewer than 10 ^ 32 pairs of chunks, and the probability that any single pair matches is about 10 ^ -77, so the probability that any of the pairs matches is less than 10 ^ -45.  To put that in perspective, if the probability of one of your houses burning down on a given day is 10 ^ -9, it’s more likely (10 ^ -36) that your main home, your vacation home, your office, and your parents’ home all coincidentally burn down on the same day by accident without being targeted.  And that’s compared to someone, anyone, in the world losing a file on Dropbox, not you losing a file on Dropbox.
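If you want to check the arithmetic yourself, the short Python sketch below works everything out as powers of ten.  The variable names are my own, and the inputs (10 billion people, 1 million chunks each, a 10 ^ -9 daily chance of a fire) are just the assumptions from this thought experiment.  Computing 2 ^ 256 directly gives about 1.16 x 10 ^ 77.

```python
import math

CHUNK_BITS = 4 * 1024 * 1024 * 8  # bits in one 4 MB chunk
HASH_BITS = 256                   # bits in one SHA-256 fingerprint

# How many distinct chunks map to each hash on average, as a power of ten.
chunks_per_hash_log10 = (CHUNK_BITS - HASH_BITS) * math.log10(2)

# Thought-experiment inputs: 10 billion people, 1 million chunks each.
total_chunks = 10**10 * 10**6
pairs = total_chunks * (total_chunks - 1) // 2  # number of distinct pairs
hash_space = 2**HASH_BITS

# Upper bound on the chance that any pair collides: pairs / hash_space.
collision_bound_log10 = math.log10(pairs) - math.log10(hash_space)

# Four independent fires on the same day, each with probability 10^-9.
four_fires_log10 = 4 * math.log10(1e-9)

print(f"chunks per hash      ~ 10^{chunks_per_hash_log10:,.0f}")
print(f"possible hashes      ~ 10^{math.log10(hash_space):.1f}")
print(f"any collision at all < 10^{collision_bound_log10:.1f}")
print(f"four fires in a day  = 10^{four_fires_log10:.0f}")
```

The bound used here is the simple one from the paragraph above: add up the chance of a match for every pair, which can only overestimate the true probability.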

An obvious question is, “What if someone tries to corrupt Dropbox by uploading a file chunk that has the same hash as a popular file?”  Well, the way SHA-256 works, you cannot reverse-engineer a file from a hash, and you cannot construct a different file that produces a hash you have chosen in advance.  Online security and data encryption depend on that fact.  It’s currently, and for the foreseeable future (we hope), computationally infeasible to find two different files that generate the same hash.  Try it, and let me know how long it takes before you give up!  http://www.xorbin.com/tools/sha256-hash-calculator
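If you would rather let your computer do the trying, here is a rough Python sketch.  It does not attack SHA-256 itself; it searches for two inputs whose hashes agree only in their first few bits, which is a much easier toy problem.  The function name find_truncated_collision and the choice of counting numbers as inputs are my own, purely for illustration.

```python
import hashlib
from itertools import count


def find_truncated_collision(prefix_bits: int):
    """Find two different inputs whose SHA-256 hashes share the first prefix_bits bits.

    Returns (first_input, second_input, attempts). Inputs are the decimal
    strings "0", "1", "2", ... encoded as bytes.
    """
    seen = {}  # maps a hash prefix to the first input that produced it
    for n in count():
        message = str(n).encode()
        digest = hashlib.sha256(message).digest()
        # Keep only the leading prefix_bits bits of the 256-bit digest.
        prefix = int.from_bytes(digest, "big") >> (256 - prefix_bits)
        if prefix in seen:
            return seen[prefix], message, n + 1
        seen[prefix] = message


for bits in (8, 16, 24, 32):
    first, second, attempts = find_truncated_collision(bits)
    print(f"{bits} bits: {first!r} and {second!r} after {attempts} attempts")
```

The number of attempts grows roughly as 2 ^ (bits / 2), so every 8 extra bits makes the search about 16 times longer.  Matching all 256 bits this way would take on the order of 2 ^ 128 attempts, which is why nobody expects you to succeed.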
