Yelp

www.yelp.com
Engaged Employer

## Interview Question

Interview San Francisco, CA

# We have a fairly large log file, about 5GB. Each line of

the log file contains an url which a user has visited on our site. We want to figure out what's the most popular 100 urls visited by our users.

0

Possible Solution: not necessary the right one. Notes: Will recommend use non recursive version of quick sort since 5GB log file could be considered to generate URL data to 500 MB ball park and worst estimate. That would blow the below solution. Also keeping the file open all the time while reading rather than putting it in buffer, since 5GB is a large file size and that may not hold in the memory provided for the program. Will have to ensure that file remains untouched in those moments. # Program to find the URLS that have been most referenced and used from a log file. import sys # Function that will quicksort the list of tuples that have occurence number as first # element in tuple and the URL as the second element in the list. The list is sorted # from the least referenced to most referenced as done in classical quicksort. # From: www.algolist.net/Algorithms/Sorting/Quicksort def sortDict(impurllist, left, right): i, j = left, right - 1 pivot = impurllist[int((left + right) / 2)] while (i <= j): while (impurllist[i][0] < pivot[0]): i = i + 1 while (impurllist[j][0] > pivot[0]): j = j - 1 if (i <= j): temp = impurllist[i] impurllist[i] = impurllist[j] impurllist[j] = temp i = i + 1 j = j - 1 if (left < j): sortDict(impurllist, left, j) if (i < right): sortDict(impurllist, i, right) # This function will take the dictionary and then convert the same to sorted list of # tuple of URL and its respective number. def sortLogDict(logdict): impurllist = [] for entries in logdict: # Get the entries from dictionary into a tuple. entry = (logdict[entries], entries) impurllist.append(entry) sortDict(impurllist, 0, len(impurllist)) # The tuple of URL list sent for sorting. return impurllist # This function reads the log file and then returns the dictionary of URL as the key # and the number of its appearence as the value in that dictionary. Note it is assumed # that the log file is new line separated list of URLs. def readLogFile(logfile): try: logfilehandle = open(logfile,'r') # The read mode for reading of the file. except IOError: if IOError.errno == ENOENT : print("The said log file does not exist.") return None # Since you cannot find file, return a None to say error. elif IOError.errno in (EACCES, EPERM): print('You do not have the permissions to read that file.') return None else: return None logfiledict = {} for lines in logfilehandle: line = lines.rstrip('\n') # Make sure you remove the if line in logfiledict: urlhit = logfiledict.get(line) urlhit += 1 logfiledict[line] = urlhit else: logfiledict[line] = 1 logfilehandle.close() # Close the log file return logfiledict # Return the dictionary that is related to that log file. # Show the URL from the last to the said number of URLS. def showUrlLast(urllist, num): print('URL \t \t \t References') if num > len(urllist): for element in reversed(urllist): print(str(element[1]), '\t', str(element[0])) else: j = -1 for i in range(num): print(str(urllist[j][1]), '\t', str(urllist[j][0])) j = j - 1 # Definition of main. def main(): try: int(sys.argv[1]) except ValueError: print('Please input a integer for the number of URLs you want to view.') return # Terminate program execution if the arguments provided are not right. if len(sys.argv) != 3: print('Usage: python3.2 HighURLFinder.py #URLs_you_want Log_file') return else: logdict = readLogFile(sys.argv[2]) if logdict is None: return # Terminate execution since the file read program encountered error. else: impurllist = sortLogDict(logdict) showUrlLast(impurllist, int(sys.argv[1])) # The starting point of the program main()

Blah on Apr 23, 2012
6

Quick Sort is O ( n log n ) avg case, so we could do better O (n) Now, onto design : Do the lines have other information other than URL? If so, we need to filter them out using Regular Expressions. Possible O (n) solution: A hashmap with URL's as keys and number of hits as values would yield a O(n) run-time to populate it. Algo: while input from file getURL (or filter out URLs from line) if URL exists in hashmap, increment its count else add URL to hashmap Ofcourse, being a large data set, we might get a ton of random disk seeks due to paging of memory. After profiling, if we really find that to be a problem then we could do something like encode the urls into something smaller that takes less space and hence has a lesser probability of causing paging. I can think of few more solutions but I think this should be sufficient.

Better solution on May 7, 2012
8

I would add a min-heap of size 100 to BetterSolution's answer (in order to get the top 100).

dopey on May 13, 2012
13

cat log | sort | uniq -c | sort -k2n | head 100

David on Oct 4, 2012
0

Well, I believe, we can create a HashMap with <site URL> as the key and visit times as Value. Once we get all the values(i.e. URL visited), we can do a Min Heap Sort. Its like saying I have n number in an array and I need to find top k number (n>>k) ,So applying Min Heap would cost O(nlog k), whereas if we use the best sorting technique, we get O(n log n). But Hey @ David Response is a smart one in relevance to FrontEnd Engg position.