Google Interview Question
1,069 Interview Reviews |
Back to all Google Interview Questions & Reviews
Interview questions and reviews posted anonymously by interview candidates
Interview Question for Software Engineer at Google:
There were 2 questions: 1 design and 1 implementation. The design was something like the following: you have a billion google searches a day, design a data structure which lets you pull out the top 100 unique ones at the end of the day.
Helpful Question?
Yes |
No
Inappropriate?
Answers & Comments (6)
0 of 2 people found this helpful
in terms of memory, you only need 100 elements in the queue and the queue itself. in terms of performance, you get the typical nlogn insertion of binary search trees.
Helpful Answer?
Yes |
No
Inappropriate?
if at some point the search will be smth like:
search 1
search 2
search 1
search 2
and so on
your idea will not work
Helpful Answer?
Yes |
No
Inappropriate?
We can use a hash table to hold the search strings. We’d want to use a table of one billion slots and choose a hash function that assures a uniform distribution of our strings over the slots. In terms of memory consumption, we must store one billion strings and one integer (32-bit, since 10^9 = ~2^30) for each of those strings. Assume that the average length of the string is 100 characters or less. This is probably a good estimate; the actual average length is probably more like 10 or 15 than 100 since most of the time people search using only one or two words. Assume each character requires two bytes of memory. Therefore, we need:
(2 bytes/char * 100 char/str * 10^9 str) + (4 bytes/int * 10^9 int)
= 200 GB + 4GB = 204 GB memory
Lookup and insertion into the hash table run in constant time with respect to n, the number of search strings.
Another data structure we can use to store the strings and their counts is a radix tree or a Patricia trie. A radix tree is a tree in which the edges are labeled with characters and internal nodes are shared by strings that have their prefixes in common. The leaf nodes hold a special termination character and, in this case, our count. This data structure will take up less space than the hash table since the strings are not stored explicitly and instead share nodes. In the worst case, where every search string is unique, the radix tree becomes quite large and unwieldy, but of course we do not expect such a case to happen since the input is not at all random.
The data structure that we can use to store our top 100 unique searches is a priority queue. A priority queue is a queue in which the elements are ordered by some priority value that is provided at insert time. In this case, we’d use the search count for each string as its priority and the last element in the queue would be the string with the highest search count. We could also maintain this highest search count value separately to add processing.
So, for every incoming string, check if it’s in the hash table. If it is, check if its old count is less than or equal to the highest search count in the priority queue. If that is true, the string may be in the priority queue, so we remove it from the priority queue if it is there. Next, increment its count. If its new count is still less than than highest search count in the priority queue, add it to the queue. If the item is not in the hash table, add it to both the hash table and the priority queue with a count of 1. Remove the 1001th item from the priority queue and reset the highest search count value.
---------------------------
Obviously this approach requires a *ton* of memory. Does anyone have a better solution?
Helpful Answer?
Yes |
No
Inappropriate?
Each search query can be represented by a combination of word hashs, so you create another hash table containing (combined hash -> count). A billion non-repeating search queries, assuming an average of five words per query, would be stored in 10GB.
At the end of the day, you can iterate through the data and use a min heap to store the top results. If the next value is greater than the current min, insert it in the min heap and remove the current min (root node). Finally, sort the heap.
Helpful Answer?
Yes |
No
Inappropriate?
Helpful Answer?
Yes |
No
Inappropriate?
Members can
answer or comment on this question
–
Join Now (It's Free) or
Sign In
0 of 0 people found this helpful
by Interview Candidate: