About Us

Glassdoor is your free inside look at Google interview questions and advice. All interview reviews posted anonymously by Google employees and interview candidates.

Search

for

in

Interview Question for Software Engineer at Google:

Oct 19, 2009

There were 2 questions: 1 design and 1 implementation. The design was something like the following: you have a billion google searches a day, design a data structure which lets you pull out the top 100 unique ones at the end of the day.

Add Tags [?]
Helpful Question?  
Yes | No
Share: Link Digg

Answers & Comments (4)

Oct 19, 2009

by Interview Candidate:

I started off with an implementation on the file system, when he got me to compute the size of memory required to maintain a billion numbers (where I screwed up majorly). Finally we figured out we need 14 32-bit computers. After this I started using a tree of tries, which maintains the top 100 throughout : like in insertion sort(because maintaining all billion and at the end computing the top 100 takes a lot of time, a few years). He did not like my design and told me I was complicating the whole thing big time.
Helpful Answer?  
Yes | No
Inappropriate?

0 of 2 people found this helpful

Nov 7, 2009

by neakor:

Why can't you just maintain a priority queue or a binary search tree where you put the smallest node on top. then every time a new query comes in, you just check if the queue is full. if it is full, replace the smallest which would be the root. if it's not full, just directly insert into the queue and do a sort.

in terms of memory, you only need 100 elements in the queue and the queue itself. in terms of performance, you get the typical nlogn insertion of binary search trees.
Helpful Answer?  
Yes | No
Inappropriate?

Nov 12, 2009

by Anonymous:

neakor,
if at some point the search will be smth like:
search 1
search 2
search 1
search 2
and so on

your idea will not work
Helpful Answer?  
Yes | No
Inappropriate?

Nov 14, 2009

by ellemeno:

I think the best way to approach this problem is to use two data structures: one to hold the searches with their individual counts and one to hold the current top 100 unique ones.

We can use a hash table to hold the search strings. We’d want to use a table of one billion slots and choose a hash function that assures a uniform distribution of our strings over the slots. In terms of memory consumption, we must store one billion strings and one integer (32-bit, since 10^9 = ~2^30) for each of those strings. Assume that the average length of the string is 100 characters or less. This is probably a good estimate; the actual average length is probably more like 10 or 15 than 100 since most of the time people search using only one or two words. Assume each character requires two bytes of memory. Therefore, we need:

(2 bytes/char * 100 char/str * 10^9 str) + (4 bytes/int * 10^9 int)
= 200 GB + 4GB = 204 GB memory

Lookup and insertion into the hash table run in constant time with respect to n, the number of search strings.

Another data structure we can use to store the strings and their counts is a radix tree or a Patricia trie. A radix tree is a tree in which the edges are labeled with characters and internal nodes are shared by strings that have their prefixes in common. The leaf nodes hold a special termination character and, in this case, our count. This data structure will take up less space than the hash table since the strings are not stored explicitly and instead share nodes. In the worst case, where every search string is unique, the radix tree becomes quite large and unwieldy, but of course we do not expect such a case to happen since the input is not at all random.

The data structure that we can use to store our top 100 unique searches is a priority queue. A priority queue is a queue in which the elements are ordered by some priority value that is provided at insert time. In this case, we’d use the search count for each string as its priority and the last element in the queue would be the string with the highest search count. We could also maintain this highest search count value separately to add processing.

So, for every incoming string, check if it’s in the hash table. If it is, check if its old count is less than or equal to the highest search count in the priority queue. If that is true, the string may be in the priority queue, so we remove it from the priority queue if it is there. Next, increment its count. If its new count is still less than than highest search count in the priority queue, add it to the queue. If the item is not in the hash table, add it to both the hash table and the priority queue with a count of 1. Remove the 1001th item from the priority queue and reset the highest search count value.

---------------------------

Obviously this approach requires a *ton* of memory. Does anyone have a better solution?
Helpful Answer?  
Yes | No
Inappropriate?
Post your anonymous review or sign in to answer or comment on this question
Google Overview (GOOG )
Web
www.google.com
Industries
Size
5000+ Employees, $21B+ Revenue
HQ
Mountain View, CA
Competitors



Tags are like keywords that help categorize interview questions that have something in common.

Flag this {0} as inappropriate

Would you like us to review something? Please describe the problem with this {0} and we will look into it.

Flag this {1} Cancel

Advanced Search Reset

What

Where

How

or Cancel