Yelp

  www.yelp.com

Yelp Software Engineer New Grad Interview Question

I interviewed in San Francisco, CA and was asked:
"We have a fairly large log file, about 5GB. Each line of the log file contains an url which a user has visited on our site. We want to figure out what's the most popular 100 urls visited by our users."
Answer

Part of a Software Engineer New Grad Interview Review - one of 673 Yelp Interview Reviews

Answers & Comments

0
of 2
votes
Possible solution (not necessarily the right one).

Notes: I would recommend a non-recursive quicksort here. In a ballpark, worst-case estimate, a 5GB log file could yield around 500MB of distinct-URL data, and recursing at that scale would blow up the solution below.

Also, read from the open file handle line by line rather than buffering the whole file, since 5GB may not fit in the memory available to the program. We have to ensure the file is not modified while it is being read.

# Program to find the URLS that have been most referenced and used from a log file.
import sys

# Function that quicksorts a list of tuples that have the occurrence count as the
# first element and the URL as the second. The list is sorted from the least
# referenced to the most referenced, as in classical quicksort.
# From: www.algolist.net/Algorithms/Sorting/Quicksort
def sortDict(impurllist, left, right):
    i, j = left, right - 1
    pivot = impurllist[int((left + right) / 2)]
    while (i <= j):
        while (impurllist[i][0] < pivot[0]):
            i = i + 1
        while (impurllist[j][0] > pivot[0]):
            j = j - 1
        if (i <= j):
            temp = impurllist[i]
            impurllist[i] = impurllist[j]
            impurllist[j] = temp
            i = i + 1
            j = j - 1
    if (left < j):
        sortDict(impurllist, left, j + 1) # right is exclusive, so include index j.
    if (i < right):
        sortDict(impurllist, i, right)

# This function will take the dictionary and then convert the same to sorted list of
# tuple of URL and its respective number.
def sortLogDict(logdict):
    impurllist = []
    for entries in logdict: # Get the entries from dictionary into a tuple.
        entry = (logdict[entries], entries)
        impurllist.append(entry)
    sortDict(impurllist, 0, len(impurllist)) # The tuple of URL list sent for sorting.
    return impurllist

# This function reads the log file and returns a dictionary with the URL as the key
# and the number of its appearances as the value. Note it is assumed that the log
# file is a newline-separated list of URLs.
def readLogFile(logfile):
    try:
        logfilehandle = open(logfile, 'r') # The read mode for reading the file.
    except IOError as ioe:
        import errno # Imported here so this handler stands on its own.
        if ioe.errno == errno.ENOENT:
            print("The said log file does not exist.")
            return None # Since the file cannot be found, return None to signal an error.
        elif ioe.errno in (errno.EACCES, errno.EPERM):
            print('You do not have the permissions to read that file.')
            return None
        else:
            return None
    logfiledict = {}
    for lines in logfilehandle:
        line = lines.rstrip('\n') # Make sure you remove the trailing newline.
        if line in logfiledict:
            urlhit = logfiledict.get(line)
            urlhit += 1
            logfiledict[line] = urlhit
        else:
            logfiledict[line] = 1
    logfilehandle.close() # Close the log file
    return logfiledict # Return the dictionary that is related to that log file.

# Show the URLs from the end of the sorted list, up to the requested number.
def showUrlLast(urllist, num):
    print('URL \t \t \t References')
    if num > len(urllist):
        for element in reversed(urllist):
            print(str(element[1]), '\t', str(element[0]))
    else:
        j = -1
        for i in range(num):
            print(str(urllist[j][1]), '\t', str(urllist[j][0]))
            j = j - 1

# Definition of main.
def main():
    if len(sys.argv) != 3:
        print('Usage: python3.2 HighURLFinder.py #URLs_you_want Log_file')
        return
    try:
        int(sys.argv[1])
    except ValueError:
        print('Please input an integer for the number of URLs you want to view.')
        return # Terminate program execution if the arguments provided are not right.
    logdict = readLogFile(sys.argv[2])
    if logdict is None:
        return # Terminate execution since reading the log file failed.
    impurllist = sortLogDict(logdict)
    showUrlLast(impurllist, int(sys.argv[1]))

# The starting point of the program
main()
- Blah on Apr 23, 2012
4
of 5
votes
Quicksort is O(n log n) in the average case, so we could do better: O(n).

Now, on to design: do the lines contain information other than the URL? If so, we need to extract the URLs, e.g. with regular expressions.

Possible O (n) solution:
A hashmap with URLs as keys and hit counts as values can be populated in O(n) time.

Algo:
while input from file
   getURL (or filter out URLs from line)
   if URL exists in hashmap, increment its count
   else add URL to hashmap

Of course, with such a large data set we might get a ton of random disk seeks due to memory paging. If profiling shows that is really a problem, we could encode the URLs into something smaller that takes less space and is therefore less likely to cause paging. I can think of a few more solutions, but I think this should be sufficient.
- Better solution on May 7, 2012
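The counting loop sketched in the pseudocode above might look like this in Python (a minimal sketch; the function name and the assumption that each line is a bare URL are mine):

```python
from collections import Counter

def count_urls(logpath):
    # Stream the log line by line so the 5GB file never sits in memory;
    # only the distinct URLs and their counts are kept.
    counts = Counter()
    with open(logpath) as log:
        for line in log:
            url = line.strip()
            if url:  # skip blank lines
                counts[url] += 1
    return counts
```

Counter is just a dict specialized for counting; a plain dict with `counts[url] = counts.get(url, 0) + 1` behaves the same way.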
4
of 4
votes
I would add a min-heap of size 100 to BetterSolution's answer (in order to get the top 100).
- dopey on May 13, 2012
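A sketch of that bounded min-heap step with Python's heapq (the counts dict and function name are assumptions; `heapq.nlargest` maintains the same kind of heap internally):

```python
import heapq

def top_k_urls(counts, k=100):
    # Keep a min-heap of at most k (count, url) pairs: the root is the
    # smallest count kept so far, so any larger count evicts it.
    heap = []
    for url, n in counts.items():
        if len(heap) < k:
            heapq.heappush(heap, (n, url))
        elif n > heap[0][0]:
            heapq.heapreplace(heap, (n, url))
    return sorted(heap, reverse=True)  # most popular first
```

Each URL costs at most O(log k) heap work, giving the O(n log k) total mentioned below.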
9
of 9
votes
cat log | sort | uniq -c | sort -rn | head -n 100
- David on Oct 4, 2012
0
of 0
votes
Well, I believe we can create a HashMap with the site URL as the key and the visit count as the value. Once we have all the values (i.e. URLs visited), we can apply a min-heap. It is like saying: I have n numbers in an array and I need to find the top k (n >> k). Applying a min-heap costs O(n log k), whereas even the best sorting technique costs O(n log n).
But hey, @David's response is a smart one in relation to a front-end engineering position.
- cooladi806 on Mar 16, 2014

