# Data Scientist Interview Questions

Data Scientist IV interview questions shared by candidates

## Top Interview Questions

Data Scientist Intern was asked...February 25, 2012

### Find the second largest element in a Binary Search Tree

The above answer is also wrong. A corrected version:

```java
Node findSecondLargest(Node root) {
    // If the tree is null or a single node, there is no second largest.
    if (root == null || (root.left == null && root.right == null))
        return null;

    Node parent = null, child = root;
    // Walk to the rightmost node (the largest).
    while (child.right != null) {
        parent = child;
        child = child.right;
    }
    // If the rightmost node has no left child, its parent is second largest.
    if (child.left == null)
        return parent;

    // Otherwise, the second largest is the rightmost node of its left subtree.
    child = child.left;
    while (child.right != null)
        child = child.right;
    return child;
}
```

Find the rightmost element. If it has no left child, return its parent; otherwise, return the largest element of its left subtree.

One addition: the tree may have no right branch at all (the root is the largest), in which case the largest node has no parent. So it is better to track both parent and current pointers. If they differ, the candidate's original method works; if they are the same (the root case), find the largest node of the left branch.

### Case interview: basic business problem (if product X costs Capital One \$4.00 per unit, with a \$800 sunk cost, and we charge X amount of dollars along with a \$10 annual fee, how many do we need to sell to break even, etc). Followed by a longer discussion of more complex problems that the situation might entail.

To correct the above response: if the selling price for one product is \$794 (which covers the \$4 cost of the product itself), only one unit needs to be sold, since \$794 + \$10 = \$800 + \$4.

Say we sell y units. Break-even requires 800 + 4y = xy + 10y, so 800 = y(x + 10 − 4) and y = 800/(x + 6). Rounding up to a whole unit, we need ⌈800/(x + 6)⌉ units to break even.

How is it possible to answer this without knowing the selling price per product? The number of units needed to break even depends on the selling price. Since the price is given only as X, you could sell at \$794 and break even with a single product (\$794 per product + \$10 annual fee, assuming the fee is charged per product). But I may be misunderstanding the information provided. Any feedback?
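The algebra in the comments above can be sketched in a few lines of Python. The helper below is hypothetical and follows the commenters' assumption that the \$10 annual fee is collected per unit sold:

```python
import math

def breakeven_units(price, fee=10.0, unit_cost=4.0, sunk=800.0):
    """Smallest whole number of units at which revenue covers total cost.

    Assumes (as the comments above do) that the $10 fee is per unit sold:
    revenue per unit = price + fee; total cost = sunk + unit_cost * units.
    """
    margin = price + fee - unit_cost
    if margin <= 0:
        raise ValueError("per-unit margin must be positive to ever break even")
    return math.ceil(sunk / margin)

print(breakeven_units(794))  # margin is $800, so one unit covers the sunk cost
print(breakeven_units(90))   # margin is $96 -> ceil(800 / 96) = 9 units
```

This matches the \$794 example above: the per-unit margin is 794 + 10 − 4 = 800, exactly the sunk cost, so a single sale breaks even.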

Data Analyst was asked...December 23, 2012

### If you have 10 bags of marbles with 10 marbles each, and one bag has marbles that weigh differently than the others, how would you figure out which bag with one weighing?

Assume the standard marble weighs 10 g and the heavy ones weigh 11 g. Take 1 marble from bag 1, 2 from bag 2, 3 from bag 3, ..., and 10 from bag 10, and weigh them all together; call the result W. If every marble weighed 10 g, the total would be 10 + 20 + ... + 100 = 550 g. If, say, bag 5 contains the heavy marbles, then W = 10 + 20 + 30 + 40 + 55 + 60 + 70 + 80 + 90 + 100 = 555 g. So the heavy bag's index is always W − 550. Mathematically, with N bags, the heavy bag X is found from the single weighing as X = W − 10·N(N+1)/2.
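A quick simulation of the trick, under the same assumption the answer makes (10 g standard marbles, 11 g heavy ones):

```python
def find_heavy_bag(per_marble_weights, standard=10, heavy=11):
    """One weighing: take i marbles from bag i (1-indexed) and weigh them all.

    per_marble_weights: one entry per bag, the weight of a marble in that bag.
    Returns the 1-indexed bag holding the heavy marbles.
    """
    n = len(per_marble_weights)
    # The single weighing: i marbles from bag i.
    measured = sum(i * w for i, w in enumerate(per_marble_weights, start=1))
    expected = standard * n * (n + 1) // 2   # 550 g for 10 bags of 10 g marbles
    return (measured - expected) // (heavy - standard)

bags = [10] * 10
bags[4] = 11                  # bag 5 holds the 11 g marbles
print(find_heavy_bag(bags))   # -> 5
```

Note that this only works because both weights are known, which is exactly the objection raised in the comments below.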

Terribly worded question: it never specifies that we are given the normal and heavy weights. Without that information, the top-voted solution here doesn't work.

Using the series sum N(N+1)/2 will not work here, because we have not been given enough specifics (such as the standard weight of a marble). In fact, any solution that depends on the summed weight has an overlapping solution space and fails. Suppose the standard weight is 10 ounces per marble and bag 1 is the exception at 12 ounces per marble: the total weight is 552. Now suppose instead that bag 2 is the exception at 11 ounces per marble: the total is also 552. There is no way to distinguish the two cases.

Given the setup, it appears the interviewer did expect some variation on the summed series, and I suspect the poster paraphrased the question and dropped some key language; as best I can tell, the problem is not solvable as stated. Google will expect you to provide a generalized solution that can be automated, so "add the bags one at a time", while simple and clever, would probably not be accepted as a "weigh only once" solution, and even if accepted it would be a weaker answer because it does not provide a codable business solution.

### How would you test if survey responses were filled at random by certain individuals, as opposed to truthful selections?

This is a very basic psychometrics question: calculate Cronbach's alpha for the survey items. If it is low (below 0.5), it is very likely that the questions were answered at random.
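Cronbach's alpha is short enough to compute by hand from the standard formula α = k/(k−1) · (1 − Σ item variances / variance of total scores). A minimal sketch on made-up response data:

```python
def cronbach_alpha(responses):
    """Cronbach's alpha; responses is one list of k item scores per respondent."""
    k = len(responses[0])

    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [var([r[i] for r in responses]) for i in range(k)]
    total_var = var([sum(r) for r in responses])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical data: in the first set the items move together (consistent
# respondents); in the second the items disagree (random-looking answers).
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]
scattered  = [[1, 4, 2], [4, 1, 3], [2, 3, 1], [3, 2, 4]]
print(round(cronbach_alpha(consistent), 3))  # high: internally consistent
print(round(cronbach_alpha(scattered), 3))   # low/negative: looks random
```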

I would design the survey so that certain information is asked in two different ways. If the two answers disagree with each other, I would seriously doubt the validity of the responses.

Plot a histogram of the answers to each question to see the distribution of responses. If the selections are truthful, each question's histogram will likely be approximately normal. If more than half of a respondent's answers fall outside the 95% confidence interval of the corresponding histograms, categorize that response as random.

### You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it?

We cannot say what caused the spike, since a causal relationship cannot be established from observational data alone. But we can compare the monthly averages with a hypothesis test, rejecting the null hypothesis if the F statistic is significant.

The photos are most likely Halloween pictures. Segment by country and date and check for a continual rise in photo uploads leading up to October 31st, plus a few days after for the lag. There are also a lot of product questions like this on InterviewQuery.com for data scientists.

Hypothesis: the photos are Halloween pictures. Test: look at upload trends in countries that do not observe Halloween as a kind of counterfactual analysis.
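One way to make the "test it" part concrete is a simple permutation test on daily upload counts: does October's mean differ from other months' by more than chance reshuffling would produce? The counts below are entirely made up for illustration:

```python
import random

def permutation_pvalue(group_a, group_b, n_iter=5000, seed=0):
    """Two-sided permutation test for a difference in means."""
    rng = random.Random(seed)
    mean = lambda xs: sum(xs) / len(xs)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    k = len(group_a)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        # Count reshuffles at least as extreme as the observed split.
        if abs(mean(pooled[:k]) - mean(pooled[k:])) >= observed:
            hits += 1
    return hits / n_iter

october = [120, 131, 127, 140, 135]          # hypothetical daily uploads
baseline = [100, 98, 102, 101, 99, 103, 97]  # hypothetical other-month days
print(permutation_pvalue(october, baseline))  # small p -> spike is unlikely by chance
```

Run the same test on Halloween-observing vs. non-observing countries to get the counterfactual comparison suggested above.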

### How would you build and test a metric to compare two users' ranked lists of movie/TV show preferences?

1) Develop a list of shows/movies representative of different taste categories (more on this below). 2) Obtain each user's ranking of the items in the list. 3) Use Spearman's rho (or another test that works with rankings) to assess the congruence between the two users' rankings. To choose shows/movies for the measurement instrument, one could run a cluster analysis on a large number of viewers' viewing habits.
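Step 3 is a one-liner once the two rankings are collected. For rankings without ties, Spearman's rho reduces to 1 − 6·Σd² / (n(n² − 1)), where d is the per-item rank difference:

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman correlation between two users' rankings of the same n items.

    ranks_a[i] and ranks_b[i] are the ranks each user gave to item i
    (no ties assumed, so the simple closed form applies).
    """
    n = len(ranks_a)
    d_sq = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_sq / (n * (n * n - 1))

print(spearman_rho([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]))  # -> 1.0 (identical taste)
print(spearman_rho([1, 2, 3, 4, 5], [5, 4, 3, 2, 1]))  # -> -1.0 (opposite taste)
```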

Look at the mean average precision of the movies the users actually watch out of the ranked recommendations. If, out of 10 recommended movies, one user prefers the third and the other prefers the sixth, the recommendation engine of the user who preferred the third is doing better. InterviewQuery.com has a more in-depth answer.

It's essential to demonstrate that you can really go deep; there are plenty of follow-up questions and (sometimes tangential) angles to explore. There are many senior data scientists who've worked at Netflix and offer this sort of practice through mock interviews; a curated list is on Prepfully: prepfully.com/practice-interviews

### If you have a 5×5×5 cube, what is the outside surface area?

It's surface area, not volume, so the answer should be 6 × (5 × 5) = 150.

The first answer makes me wonder what "a cubic" means here. If it is simply a cube, then the answer is 6 × (5 × 5) = 150, as mentioned above.

If the question instead means a 5×5×5 arrangement of unit cubes, the simplest answer is that the number of cubes on the outside surface is 5³ − 3³ = 125 − 27 = 98.

### How do you take millions of users with hundreds of transactions each, across tens of thousands of products, and group the users into meaningful segments?

You can group similar users and similar items by calculating the distance between them. Jaccard distance is a common approach when building graphs of item × user relationships. For each user you have the set of items they had the potential to buy; for each product you have the set of users who bought it. From these you can build a user × user and a product × product distance matrix.

The Jaccard similarity between users u1 and u2 is f(u1, u2) = |u1 ∩ u2| / (|u1| + |u2| − |u1 ∩ u2|), and likewise for products: f(p1, p2) = |p1 ∩ p2| / (|p1| + |p2| − |p1 ∩ p2|). Compute this for each of the N² user pairs and M² product pairs, then rank each row of the two matrices. This gives a ranking per user and per product, e.g. "product p1's closest products are p4, p600, p5, ...". These rankings reflect purchase behavior, similar to Amazon's "people who bought this also bought...".

This only uses the purchase graph. You could also segment users by the price of the items they buy: someone who bought a MacBook with a Retina display can probably afford another expensive laptop, while kids who only paid \$30 for headphones probably cannot.
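The pairwise similarity described above is just intersection over union. A minimal sketch with hypothetical purchase sets:

```python
def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b| between two purchase sets (users or products)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0                       # convention: two empty sets match
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

u1 = {"macbook", "headphones", "mouse"}
u2 = {"macbook", "mouse", "keyboard"}
print(jaccard_similarity(u1, u2))        # 2 shared / 4 distinct -> 0.5
print(1 - jaccard_similarity(u1, u2))    # Jaccard distance       -> 0.5
```

At the stated scale (millions of users), the full N² pairwise matrix is impractical, which is why the follow-up comment suggests clustering instead.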

That is one way, but clustering algorithms can also do this more efficiently.

### The three data structure questions are: 1. the difference between a linked list and an array; 2. the difference between a stack and a queue; 3. describe a hash table.

Wow... pathetically easy
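For anyone who still wants the answers spelled out, all three can be sketched in a few lines of Python (list as a dynamic array, deque as a linked structure, dict as a hash table):

```python
from collections import deque

# 1. Array vs linked list: a Python list is a dynamic array -- O(1) random
#    access, but O(n) insertion in the middle. A deque is a linked block
#    structure -- O(1) appends/pops at both ends, O(n) indexing deep inside.
array = [10, 20, 30]
assert array[1] == 20            # constant-time random access

# 2. Stack (LIFO) vs queue (FIFO): same inserts, opposite removal order.
stack = []
stack.append("a"); stack.append("b")
assert stack.pop() == "b"        # last in, first out

queue = deque()
queue.append("a"); queue.append("b")
assert queue.popleft() == "a"    # first in, first out

# 3. Hash table: keys are hashed into buckets, giving average O(1)
#    insert/lookup/delete. Python's dict is a hash table.
table = {"alice": 3}
table["bob"] = 7
assert table["bob"] == 7
print("all three checks passed")
```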