Yelp Interview Question: We have a fairly large log fi... | Glassdoor

## Interview Question

Software Engineer New Grad Interview San Francisco, CA

# We have a fairly large log file, about 5GB. Each line of

the log file contains an url which a user has visited on our site. We want to figure out what's the most popular 100 urls visited by our users.

0

Possible Solution: not necessary the right one.

Notes: Will recommend use non recursive version of quick sort since 5GB log file could be considered to generate URL data to 500 MB ball park and worst estimate. That would blow the below solution.

Also keeping the file open all the time while reading rather than putting it in buffer, since 5GB is a large file size and that may not hold in the memory provided for the program. Will have to ensure that file remains untouched in those moments.

# Program to find the URLS that have been most referenced and used from a log file.
import sys

# Function that will quicksort the list of tuples that have occurence number as first
# element in tuple and the URL as the second element in the list. The list is sorted
# from the least referenced to most referenced as done in classical quicksort.
# From: www.algolist.net/Algorithms/Sorting/Quicksort
def sortDict(impurllist, left, right):
i, j = left, right - 1
pivot = impurllist[int((left + right) / 2)]
while (i pivot[0]):
j = j - 1
if (i len(urllist):
for element in reversed(urllist):
print(str(element[1]), '\t', str(element[0]))
else:
j = -1
for i in range(num):
print(str(urllist[j][1]), '\t', str(urllist[j][0]))
j = j - 1

# Definition of main.
def main():
try:
int(sys.argv[1])
except ValueError:
print('Please input a integer for the number of URLs you want to view.')
return # Terminate program execution if the arguments provided are not right.
if len(sys.argv) != 3:
print('Usage: python3.2 HighURLFinder.py #URLs_you_want Log_file')
return
else:
if logdict is None:
return # Terminate execution since the file read program encountered error.
else:
impurllist = sortLogDict(logdict)
showUrlLast(impurllist, int(sys.argv[1]))

# The starting point of the program
main()

Blah on Apr 23, 2012
6

Quick Sort is O ( n log n ) avg case, so we could do better O (n)

Now, onto design : Do the lines have other information other than URL? If so, we need to filter them out using Regular Expressions.

Possible O (n) solution:
A hashmap with URL's as keys and number of hits as values would yield a O(n) run-time to populate it.

Algo:
while input from file
getURL (or filter out URLs from line)
if URL exists in hashmap, increment its count
else add URL to hashmap

Ofcourse, being a large data set, we might get a ton of random disk seeks due to paging of memory. After profiling, if we really find that to be a problem then we could do something like encode the urls into something smaller that takes less space and hence has a lesser probability of causing paging. I can think of few more solutions but I think this should be sufficient.

Better solution on May 7, 2012
8

I would add a min-heap of size 100 to BetterSolution's answer (in order to get the top 100).

dopey on May 13, 2012
20

cat log | sort | uniq -c | sort -k2n | head 100

David on Oct 4, 2012
1

Well, I believe, we can create a HashMap with as the key and visit times as Value. Once we get all the values(i.e. URL visited), we can do a Min Heap Sort. Its like saying I have n number in an array and I need to find top k number (n>>k) ,So applying Min Heap would cost O(nlog k), whereas if we use the best sorting technique, we get O(n log n).
But Hey @ David Response is a smart one in relevance to FrontEnd Engg position.

cooladi806 on Mar 16, 2014
0

If you don't need 100% accuracy, which in this case you probably don't, you can use a Count-Min-Sketch data structure to come up with a pretty accurate estimate ( with some acceptable degree of error ) in a streaming fashion. Very little memory needs to be used in this approach.

Daniel Popescu on Oct 29, 2015