Google Interview Question

How would you implement Google spelling correction algorithms?

Interview Answers

Anonymous

May 10, 2014

Building the database: Use Google’s bot as raw data - detect each page’s language and build a dictionary by language. for each language, Each word is scored according to the number of instances of this word in all pages together. a word that appears in less than N different pages gets a score of 0. The dictionary structure key: word value: linked list of (url, show count) We can save the dictionary per domain and do a map-reduce to get aggregated results. Then, when rescanning a domain, we can easily clear results for this domain and start-over. Pre-processing: Do a map-reduce to convert the dictionary to string->score Spell check Request if a word is in the dictionary - return OK Else try to replace each letter with similar letters and get scores. if not enough options, try to replace each letter with keyboard-close letters and get scores. if not enough options, try to replace each letter other letters and get scores. sort all alternative by scores and return results. *CACHE* the query and result with expiration date. Optimization: When possible, send the selected suggestion back to google. save all selections with score by number of selections: wrong-word -> sorted list of (# of selections, correct word). if a word is found in this dictionary, return top results from there.

3

Anonymous

May 16, 2015

For the very basic version: Write a function variations(s, i), which takes a string s and returns a collection of all strings that are within an edit distance of i of that string, using standard edit operations of adding a character, deleting a character, or transposing two adjacent characters. The function is implemented by generating all strings with an edit distance of 1 and recursing. For the purposes of spell-check i=2 or i=3 is probably sufficent, which should probably give between a few hundred to a few thousand potential corrections. For a given input string s, calculate all the variations as given by the above function, sort them by frequency from Google's index of pages, and divide each frequency by the frequency of the input string, and if any of the variations is much more frequent than the input string (for an experimentally determined cutoff point), suggest that as a spelling correction. Once you have the basic version working, you can start to fine tune it using things like weighted edit distance based on the distance between key on a keyboard, or even machine learning of which edit operations are more likely, etc., or weight words by how likely they are given the other words in the search query, and so on.

Anonymous

Aug 18, 2010

What about keeping a dictionary of all words in all acceptable spellings. Then hashing all the words to create a list of valid hashes to link to matching words. Every time one finishes to type a word, the hash of the new word would be found in the dictionary, then the word match confirmed.

1