## Interview Question

Software Engineer Interview

## Develop an algorithm for finding the shortest distance

between two words in a document. After the phone interview is over, take a few hours to develop a working example in C++ and send it to the manager.

## Interview Answer

11 Answers

Assume that we will be passed two Strings that represent the words we’d like to find. Assume “distance” means the number of characters between occurrences – that is, line breaks will be ignored. Assume the document is represented by a long string. We need to find the two occurrences of these words that are closest together.

Simple solution:

- Walk through the string and find each occurrence of word A. Store the middle index of this occurrence

- Walk through again and find each occurrence of word B. Store the middle index of this occurrence.

- Compare the arrays of indicies. Find the smallest difference between each middle position of A and each middle position of B.

- Running time = O(n + ab) where a and b are the number of occurrences of the input strings A and B.

Walk through the document, character by character, looking to match <space>word<space>. If we find such a match, put the index of the middle character into the array. Continue on our walk. Repeat for the second word. Now we have two arrays of integers. Our new task is to find the smallest difference between the elements in each array. Use the mergesort merge method, keeping track of the smallest difference. Return this distance.

Optimization:

Instead of walking one character at a time, when we find a mismatch, skip ahead by an intelligent number of characters – that is, to the next index at which we might find a good match. We can use the Boyer-Moore algorithm for this approach.

Further optimization:

Look for both words in one pass by using Boyer-Moore tables that account for both words. As we walk, keep track of the current shortest distance. Running time is O(n).

Does anyone have a simpler or clearer answer?

@ ellemeno

Perfect!

Thank you! I have my first technical phone interview with Google on Monday. Wish me luck!

I would clarify with the interviewer, but it doesn't seem to me that finding the words in the document is the principal focus of the question, so would concentrate on the basic algorithm assuming an efficient word finder (which could be implemented independently).

So here's my take on the algorithm:

1. Find the first instance of word A or B (assume A going forward). Continue searching (e.g. iterating) and each time you find another A, discard last recorded position of A until you reach and record the first B. (If B is the first word reverse A and B in the last sentence). The distance is the first try.

2. Continue searching until you reach the next instance of A or B. Measure the distance to the last recorded instance of the other word. If it's less than the current smallest distance, discard the last recorded instance of the word and record the distance.

3. Repeat 2. until the distance between the words found is greater than the current smallest distance.

4. Keeping track of the distance, start 1 again from current index until you find the next smallest distance or until you run out of words. Replace the recorded smallest distance each time you find one smaller.

The shortest distance between two words is a space. My algorithm would simply search for the first occurrence of a space.

Grant, I whole-heartedly agree. It doesn't say that they are two specific words, just the shortest distance between two words... Which would naturally be the one space character between each word. In this case, the algorithm could simply return 1, and not even bother with searching the document at all. After all, the question doesn't really ask where the shortest distance is located in the document.

Since this is labeled a C++ programming question asked at a top-notch software company solving difficult problems, handing in a program that always returns 1 probably isn't a good idea. If I were a hiring manager at Google I'd want to see the algorithms, data structures, & design skills you have at your fingertips. If I got this question during an interview at Google with no chance to ask additional questions, I'd assume they wanted the answer to a hard problem. And if I were interviewing for a C++ programming position but never asked such a question, I'd worry about the competence of current staff.

P.S. - If I could ask a question, it would be whether the document was already indexed ('cuz I'm interviewing at Google after all). Then try to show a bit of linguistic sensitivity, such as whether intervening sentence and paragraph boundaries should count as "extra" distance. After the interviewer left the room, I'd Google the Web for an answer - why reinvent the wheel? ;-)

index1 = -1

index2 = -1

minDistance = -1

Scan the string for words.

For each word:

1. If it matches 1st word, set index1 to index of word.

2. If it matches 2nd word, set index2 to index of word.

3. If it matches neither word, continue

4. If index1 == -1 or index2 == -1 continue

5. distance = abs( index1 - index2 )

6. minDistance = (minDistance == -1) ? distance : min( distance, minDistance )

Min Distance : O(1) solution

1) Maintain a variable 'prev' to indicate the index of the word (Assuming the text file is an array of words).

2) Start traversing from the beginning of file to end .

3) Whenever any of the two words are found set the 'prev' to the index of that string, keep traversing if the next found word is the same as prevously found word update 'prev' to new index .

4) If the next match is the other string then update 'min_difference_so_far' which initially would have been initialised to infiinty.

5) If this is followed the Min Distance should be captured in the variable.

Let me know if any one of you think this would fail.

## Add Answers or Comments

To comment on this question, Sign In with Facebook or Sign Up

Can you do a greedy approach? find the first occurrence of either of the 2 words. Then go through the document until you find the first occurrence of the other word. you need to do this twice. each word will take O(n), which gives you still O(n).

neakoronNov 11, 2009