Rolling hash

Alright kiddo, imagine you want to check if two words are the same but you don't want to compare them letter by letter because that takes too long. What can you do?

Here's where rolling hash comes in! Rolling hash is a way to quickly hash a string of characters (like a word), allowing you to compare the hash values of two strings if they're the same, they are likely the same string, without having to compare them letter by letter.

So, let's say you have the word "cat". The first thing you do with rolling hash is assign a number to each letter of the alphabet, for example: a=1, b=2, c=3, and so on. So in this case, "cat" becomes "3-1-20" (since "c" is the third letter, "a" is the first, and "t" is the 20th).

Now, let's say you want to check if the word "bat" is the same as "cat" without comparing each letter. You can use a hash function to produce a single number from the "bat" string as well (in this case "2-1-20"). If the two hash values are the same (in this case, "23"), then you know there is a high probability that the words are the same without even checking each letter individually.

But, what if you want to compare a longer string, like a sentence or paragraph? You still follow the same process of assigning numbers to each letter, but instead of hashing the entire string at once, you break the string up into smaller segments and hash each one separately (these smaller segments are called "windows"). You can then use these smaller hash values to compare with other hash values, to see if two strings are the same.

For example, let's say you have the sentence "the quick brown fox jumps over the lazy dog" and you break it up into windows of three letters. The first three letters are "the", which gets hashed into some number, then you slide the window over by one letter so that it now includes "he ", and hash again. You keep sliding the window over and hashing each segment until you cover the entire sentence.

Rolling hash is useful because it allows you to quickly compare whether two strings are likely the same without having to compare each character individually. It's commonly used in things like plagiarism detection or identifying duplicate files.

Related topics others have asked about:

MinHash, W-shingling