Associate scientist Interview Questions
associate scientist interview questions shared by candidates
Top Interview Questions
Data Scientist Intern at LinkedIn was asked...
Find the second largest element in a Binary Search Tree. 16 Answers

Find the rightmost element. If this is a right node with no children, return its parent. If it is not, return the largest element of its left child.

One addition is the situation where the tree has no right branch (the root is the largest). In this special case, it does not have a parent. So it's better to keep track of parent and current pointers: if they are different, the original method by the candidate works well; if they are the same (which means the root situation), find the largest of its left branch.

    if (root == null || !root.hasRightChild()) { return null; }
    else return findSecondGreatest(root, root.getValue());

    value findSecondGreatest(Node curr, value oldValue) {
        if (curr.hasRightChild()) {
            return findSecondGreatest(curr.getRightChild(), curr.value);
        } else return oldValue;
    }

The above answer is also wrong;

    Node findSecondLargest(Node root) {
        // If tree is null or is a single node only, return null (no second largest)
        if (root == null || (root.left == null && root.right == null)) return null;
        Node parent = null, child = root;
        // find the rightmost child
        while (child.right != null) { parent = child; child = child.right; }
        // if the rightmost child has no left child, then its parent is second largest
        if (child.left == null) return parent;
        // otherwise, return left child's rightmost child as second largest
        child = child.left;
        while (child.right != null) child = child.right;
        return child;
    }

Soln by "mindpower" works. Thank you. I am trying to solve a similar problem: find the 2nd nearest high (in in-order traversal) value for a given node. Eg: given nums 12 7 14 3, construct a BST. If the given value is 7 then we should return 14 (in the sort order: 3, 7, 12, 14). If the given value is 3 then we should return 12 (in the sort order: 3, 7, 12, 14).

Generic solution in C# for any k. Notice that this example can be easily changed to find the k-th smallest node by doing a depth-first recursion on root.Left first, and then a tail recursion on root.Right.

    public Node GetKthLargest(int k) { return GetKthLargest(ref k, this.Root); }

    Node GetKthLargest(ref int k, Node root) {
        if (root == null || k < 1) return null;
        var node = GetKthLargest(ref k, root.Right);
        if (node != null) return node;
        if (--k == 0) return root;
        return GetKthLargest(ref k, root.Left);
    }

Recursion is not needed.

    SecondLargest(Node root, Node secondLarge) {
        if (root.right == null) return root.left;
        Node secondLargest = root;
        while (secondLargest.right.right != null) secondLargest = secondLargest.right;
        return secondLargest;
    }

    int getmax(node *root) {
        if (root->right == NULL) { return root->d; }
        return getmax(root->right);
    }

    int secondmax(node *root) {
        if (root == NULL) { return -1; }
        if (root->right == NULL && root->left != NULL) { return getmax(root->left); }
        if (root->right != NULL) {
            if (root->right->right == NULL && root->right->left == NULL) { return root->d; }
        }
        return secondmax(root->right);
    }

In-order traverse the tree. The second-to-last element in the array is the answer.

In Python:

    def find_second_largest_bst_element(root, parent=None):
        if parent is None:
            # BST root
            if root.right is None:  # no right subtree
                if root.left is not None:  # if a left subtree exists...
                    return root.left
                else:  # root is the only element of the BST
                    return False
        else:
            if root.right is None:  # right-most element
                if root.left is not None:  # left subtree exists
                    return root.left
                else:  # leaf
                    return parent
            else:  # check right subtree
                return find_second_largest_bst_element(root.right, root)

    find_second_largest_bst_element(root)

For kth smallest, descend the left subtree first.

    class Node:
        def __init__(self, value, left=None, right=None):
            self.value = value
            self.left = left
            self.right = right

    def findKthLargest(root, k):
        global count
        if root is None:
            return
        findKthLargest(root.right, k)
        count += 1
        if count == k:
            print(root.value)
            return
        findKthLargest(root.left, k)

    count = 0
    r = Node(10, Node(5, Node(2), Node(7)), Node(30, Node(22), Node(32)))
    findKthLargest(r, 3)

    // solution in java
    // main routine
    Node findSecondMax(Node root) {
        if (root == null || (root.left == null && root.right == null)) return null;
        else {
            Node max = findMax(root);
            return (max.parent == null) ? findMax(max.left) : max.parent;
        }
    }
    // helper routine, recursive implementation.... can also be done non-recursively
    Node findMax(Node root) { return (root.right == null) ? root : findMax(root.right); }

Find the largest number in the binary tree and delete it. And again find the largest number. Short and fast.

Reverse in-order traversal of the BST, keeping a count of the number of visited nodes. This method works great to return the kth largest element in a BST.

mindpower's solution looks right.

One or more comments have been removed. |
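For reference, here is a minimal Python sketch of the parent/child approach that the thread credits to "mindpower" (walk to the rightmost node, then either return its parent or the rightmost node of its left subtree). The `Node` class and the name `find_second_largest` are assumptions made for illustration, and the sketch assumes a valid BST without duplicate keys.

```python
from typing import Optional


class Node:
    """Hypothetical BST node used only for this illustration."""

    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right


def find_second_largest(root: Optional[Node]) -> Optional[Node]:
    # An empty tree or a single node has no second largest element.
    if root is None or (root.left is None and root.right is None):
        return None

    # Walk to the rightmost node (the maximum), remembering its parent.
    parent, current = None, root
    while current.right is not None:
        parent, current = current, current.right

    # If the maximum has no left subtree, its parent is the second largest.
    if current.left is None:
        return parent

    # Otherwise the second largest is the rightmost node of that left subtree.
    current = current.left
    while current.right is not None:
        current = current.right
    return current


tree = Node(10, Node(5, Node(2), Node(7)), Node(30, Node(22), Node(32)))
print(find_second_largest(tree).value)  # 30
```

The loop runs in O(h) time for a tree of height h and uses no extra space, which is why several answers above prefer it over a full in-order traversal.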
Senior Data Scientist at Glassdoor was asked...
How would you test if survey responses were filled at random by certain individuals, as opposed to truthful selections? 5 Answers

I would design the test in a way that certain information is asked two different ways. If two answers disagree with each other, I would seriously doubt the validity of the answers.

This is a very basic psychometrics question. Calculate Cronbach's alpha for the survey items. If it is low (below .5), it is very likely that the questions were answered at random.

We need to find the histograms of the questions in the survey to see the distribution of each answer in each question. All question histograms will likely follow the normal distribution if they are truthful selections. If a response has more than half of its total answers located outside of the 95% confidence interval in each histogram, the response will be categorized as random fall out of mean plus tw

Similar to Cronbach's alpha, calculate corrected item-total correlations. Since the item is part of the total, you will need to remove it from each estimate to correct for this. Otherwise, you will get inflated estimates. Drop items with very low item-total correlations (either 1.5. Good luck.

One or more comments have been removed. |
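As a rough illustration of the Cronbach's alpha suggestion, here is a small Python sketch using the usual formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). The function name `cronbach_alpha` and the toy response matrices are made up for this example; rows are respondents and columns are survey items, and the sketch assumes all items are scored in the same direction.

```python
from statistics import variance


def cronbach_alpha(responses):
    """responses: list of rows, one row of numeric item scores per respondent."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # transpose to per-item columns
    item_variance_sum = sum(variance(col) for col in items)
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Hypothetical data: respondents who answer every item consistently...
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4], [5, 5, 5]]
# ...versus respondents whose item scores do not move together.
inconsistent = [[1, 5, 3], [5, 1, 3], [3, 3, 1], [2, 4, 5], [4, 2, 3]]

print(cronbach_alpha(consistent))
print(cronbach_alpha(inconsistent))
```

Note that alpha describes the item set over a group of respondents, so to flag specific individuals you would likely compare alpha with and without the suspected respondents, or combine it with the repeated-question check from the first answer.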
Data Scientist at Yammer was asked...
You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it? 3 Answers

Hypothesis: the photos are Halloween pictures. Test: look at upload trends in countries that do not observe Halloween as a sort of counterfactual analysis.

We cannot say what has caused the spike, since a causal relationship cannot be established with observed data. But we can compare the averages of all the months by performing a hypothesis test and rejecting the null hypothesis if the F-statistic is significant.

The photos are definitely Halloween pictures. Segment by country and date and check for a continual rise in photo uploads leading up to October 31st and a few days after for the lag. There are also a ton of product questions like this on InterviewQuery.com for data scientists. |
Data Scientist at Apple was asked...
How do you take millions of users with 100's of transactions each, amongst 10k's of products, and group the users together in meaningful segments? 3 Answers

Of course there are many ways to separate the market. But Apple has already got several segments that I believe work. First is the Mac line; within this is the education market. This includes 3 segments: Instructors, Students, and Schools. Instructors will be more likely to spend more on a single product and buy software relevant to their subjects, and these decisions will influence their students to do the same, but generally students will seek a "value" product and will buy software based on requirements. Schools, on the other hand, will buy a large amount of computers and software at once, which also affects instructor and student purchases. So selling to schools will raise the sales in both other categories, and selling to instructors will raise the sales for students. This is just the first segment. You also have corporate industries, which are similar to Education.

Now let's move to the iPhone segment. Within this segment you have to ask, why do people buy iPhones? There is the High-Tech segment, meaning those who always want the newest and best. Then you have the Mid-Tech segment. These are those that don't feel it is logical to switch phones each year; they wait for two years before buying a phone.

Now let's move into iPad. Interestingly, this segment can move from business to leisure. The business segment seeks to have an iPad because it allows them to get work done faster and easier. The leisure market seeks to have an iPad because it brings them entertainment and helps them relax.

Then let's go to iPod. The wonder of the iPod, the product that sent Apple on a crash course to stardom. I believe the greatest segment for the iPod would be parents wanting to get a gift for kids / something to keep kids entertained.
Because the iPhone acts as an iPod, there is a spill of sales that goes to iPhone, although the iPod touch does offer an affordable alternative to those who do not want an iPhone, and the iPod Nano captures the convenience segment. These are just the segments for the main products of Apple.

You can group similar users and similar items by calculating the distance between like users and items. Jaccard distance is a common approach when building graphs of items x users relationships. For each user you have a vector of N items that they had the potential to buy. For each product you have a vector of M users that bought that product. You can calculate a similarity matrix of user x user pairs and product x product pairs using these vectors.

Calculating the Jaccard similarity between u1 and u2 (the Jaccard distance is 1 minus this value):
f(u1, u2) = intersection(u1, u2) / (len(u1) + len(u2) - intersection(u1, u2))
Same with products:
f(p1, p2) = intersection(p1, p2) / (len(p1) + len(p2) - intersection(p1, p2))

You do this for each of the N^2 and M^2 pairs. Then you rank each row of the similarity matrices for the product matrix and the users matrix. This will give you rows of rankings for each user or product. Example: "product p1's closest products are p4, p600, p5, etc..." These rankings are according to purchase behavior, similar to Amazon's "people who bought this also bought..." This is only working with the purchase graph.

You could segment users by price of item bought. Someone who bought a MacBook Retina probably has enough money to buy another expensive laptop, but kids who only paid $30 for headphones probably don't. That is one way, but clustering algorithms can also help in doing it in a more efficient way. |
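The pairwise formula from the second answer can be sketched in Python roughly as follows. The `purchases` dictionary and the helper names `jaccard_similarity` and `rank_neighbors` are hypothetical; each user is represented as the set of product ids they bought, and the all-pairs loop here is only meant for small examples, not for millions of users.

```python
def jaccard_similarity(a, b):
    """|a ∩ b| / (|a| + |b| - |a ∩ b|), as in the f(u1, u2) formula above."""
    a, b = set(a), set(b)
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def rank_neighbors(target, data):
    """Return the other keys in data, most similar to target first."""
    scores = [
        (jaccard_similarity(data[target], items), other)
        for other, items in data.items()
        if other != target
    ]
    # Sort by descending similarity, breaking ties by name for determinism.
    return [other for _, other in sorted(scores, key=lambda s: (-s[0], s[1]))]


# Made-up purchase sets for three users.
purchases = {
    "u1": {"p1", "p4", "p5"},
    "u2": {"p1", "p4", "p600"},
    "u3": {"p9"},
}

print(jaccard_similarity(purchases["u1"], purchases["u2"]))  # 0.5
print(rank_neighbors("u1", purchases))  # ['u2', 'u3']
```

The same two functions work for products if you invert the dictionary so that each product maps to the set of users who bought it. At the scale in the question you would probably need an approximate method rather than comparing every pair.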
Senior Data Scientist at Netflix was asked...
How would you build and test a metric to compare two users' ranked lists of movie/TV show preferences? 3 Answers

Probably incorrectly. 1) Develop a list of shows/movies that are representative of different taste categories (more on this later). 2) Obtain a ranking of the items in the list from 2 users. 3) Use Spearman's rho (or another test that works with rankings) to assess dependence/congruence between the 2 people's rankings. * To find shows/movies to include in the measurement instrument, maybe do cluster analysis on a large number of viewers' viewing habits.

Look at the mean average precision of the movies that the users watch out of the rankings. So if out of 10 recommended movies one user prefers the third and the other user prefers the sixth, the recommendation engine of the user who preferred the third would be better.

InterviewQuery.com has a more in-depth answer. |
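To make step 3 of the first answer concrete, here is a minimal Python sketch of Spearman's rho for two users who rank the same set of titles with no ties, using rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)). The function name `spearman_rho` and the example titles are placeholders, and tied or partially overlapping lists would need a different treatment.

```python
def spearman_rho(ranking_a, ranking_b):
    """Each ranking is a list of the same items, most preferred first, no ties."""
    n = len(ranking_a)
    if n < 2 or len(ranking_b) != n:
        raise ValueError("rankings must have the same length of at least 2")
    if len(set(ranking_a)) != n or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same distinct items")

    # Position of every item in the second user's list.
    position_b = {item: rank for rank, item in enumerate(ranking_b)}
    # Sum of squared rank differences across items.
    d_squared = sum((rank - position_b[item]) ** 2
                    for rank, item in enumerate(ranking_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


user_1 = ["drama", "comedy", "thriller", "documentary"]
user_2 = ["documentary", "thriller", "comedy", "drama"]

print(spearman_rho(user_1, user_1))  # 1.0
print(spearman_rho(user_1, user_2))  # -1.0
```

A value near 1 means the two users order the titles almost identically, and a value near -1 means their orderings are reversed. Testing the metric would still require checking it against some outside signal, such as whether high-rho pairs actually watch similar things.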
What made you interested in this position and this field of work? 1 Answer

My education, previous work in the military, and the opportunity to work both indoors and outdoors, travel, and intermix with the public. |
Why Regeneron? Are you prepared for a job or a career? What unique lab skills will you bring to the department? 1 Answer

Why Regeneron? - "a stable company working on pharmaceuticals"
Are you prepared for a job or a career? - "that should be evident from the resume, and it was your job to read it"
What unique lab skills will you bring to the department? - "I do not know people's skills in the department, so how would I know mine are unique? You can tell and I cannot" |
The three data structure questions are: 1. the difference between linked list and array; 2. the difference between stack and queue; 3. describe hash table. 1 Answer

Wow... pathetically easy |
Scientist at Georgia Tech was asked...
Describe past research and what you want to research in the future. 1 Answer

Best if these answers involve cutting edge stuff. |
Data Scientist at Facebook was asked...
You're about to get on a plane to Seattle. You want to know if you should bring an umbrella. You call 3 random friends of yours who live there and ask each independently if it's raining. Each of your friends has a 2/3 chance of telling you the truth and a 1/3 chance of messing with you by lying. All 3 friends tell you that "Yes" it is raining. What is the probability that it's actually raining in Seattle? 38 Answers

Bayesian stats: you should estimate the prior probability that it's raining on any given day in Seattle. If you mention this or ask, the interviewer will tell you to use 25%. Then it's straightforward:
P(raining | Yes,Yes,Yes) = Prior(raining) * P(Yes,Yes,Yes | raining) / P(Yes,Yes,Yes)
P(Yes,Yes,Yes) = P(raining) * P(Yes,Yes,Yes | raining) + P(not-raining) * P(Yes,Yes,Yes | not-raining) = 0.25*(2/3)^3 + 0.75*(1/3)^3 = 0.25*(8/27) + 0.75*(1/27)
P(raining | Yes,Yes,Yes) = 0.25*(8/27) / ( 0.25*8/27 + 0.75*1/27 )
**Bonus points if you notice that you don't need a calculator, since all the 27's cancel out and you can multiply top and bottom by 4.
P(raining | Yes,Yes,Yes) = 8 / ( 8 + 3 ) = 8/11
But honestly, you're going to Seattle, so the answer should always be: "YES, I'm bringing an umbrella!" (yeah yeah, unless your friends mess with you ALL the time ;)

I thought about this a little differently, from a non-Bayes perspective. It's raining if any ONE of the friends is telling the truth, because if they are telling the truth then it is raining. If all of them are lying, then it isn't raining, because they told you that it was raining. So what you want is the probability that any one person is telling the truth, which is simply 1-Pr(all lie) = 26/27. Anyone let me know if I'm wrong here!

Here's another perspective on how to answer a question like this: Bring an umbrella. It's Seattle - if it's not raining right now, it probably will be by the time you get there.
I flagged Nub data scientist's answer as useful, because it shows an interesting flaw in reasoning. The 3 random variables are not to be treated as intrinsically independent. Only conditioned on the truth (raining/not raining) are they independent.

Isn't the answer 2/3? The key thing is that they are ALL saying "Yes". You can't have all 3 say yes and have some people lying and some people telling the truth. It either is raining or it isn't. Not both. They either are all lying or all telling the truth. Since they are all in agreement (all lying or all truthful), they are essentially voting as one person. What is the probability that one person is telling the truth? 2/3

Answer from a frequentist perspective: Suppose there was one person. P(YES|raining) is twice (2/3 / 1/3) as likely as P(LIE|notraining), so the P(raining) is 2/3. If instead n people all say YES, then they are either all telling the truth, or all lying. The outcome that they are all telling the truth is (2/3)^n / (1/3)^n = 2^n times as likely as the outcome that they are not. Thus P(raining | ALL YES) = 2^n / (2^n + 1) = 8/9 for n=3. Notice that this corresponds exactly to the Bayesian answer when prior(raining) = 1/2.

TLP and nub data scientists, your answers include possibilities which are not feasible; we cannot have any combination of 2/3 and 1/3 together... what about (2/3)^3?

I agree with TLP and nub scientist. For me, the question is really (1 - the odds that all three of your friends are lying to you). Clearly 1 - 1/3 * 1/3 * 1/3. It's convenient that they all gave the same answer, otherwise it would be more difficult.

Let Y denote rain, N denote no rain.
Actual | Answer probability
Y      | 8/27 YYY, 1/27 NNN, 12/27 YYN, 6/27 YNN
N      | 1/27 YYY, 8/27 NNN, 6/27 YYN, 12/27 YNN
So, P(Y|YYY) = 8/(8+1) = 8/9

The probability of raining is that they are all telling the truth, therefore, (2/3)^3.

26/27 is incorrect.
That is the number of times that at least one friend would tell you the truth (i.e., 1 - the probability that they would all lie: 1/27). What you have to figure out is the odds of it raining | (i.e., given) all 3 friends told you the same thing. Because they all say the same thing, they must all either be lying or all be telling the truth. What are the odds that they would all lie and that they would all tell the truth? In 1/27 of the time they would all lie, and in 8/27 of the time they would all tell the truth. So there are 9 ways in which all your friends would tell you the same thing. And in 8 of them (8 out of 9) they would be telling you the truth.

There is an obvious conceptual reason as to why several answers here (ones that don't use Bayes' formula) are incorrect. The probability in question has to depend on the probability of rain in Seattle. If, for the sake of discussion, it ALWAYS rains in Seattle, i.e. P(rain)=1, then the required prob. is always 1 as well. Likewise if it's a place where it never rains, or if the question asks about the prob. of it raining elephants given the 3 friends said yes, it'd still be 0. I believe this is a std. textbook example of Bayes' formula; anything short of that I don't think will work out.

Please correct me if incorrect. But I would just prefer to condition: either they are all telling the truth and it is raining, or they are all lying and it is not raining.
P(rain)=P(rain|truth,truth,truth)*P(truth,truth,truth)+P(rain|lie,lie,lie)*P(lie,lie,lie)
Notice that truth does not mean yes it is raining, it simply corresponds to them telling the truth. Since they said yes, IF they were lying and we knew they were lying then the probability of rain would be zero, thus eliminating the second term.
P(rain)=P(rain|3xtruth)*P(3xtruth)
and the probability of the truth is (2/3)^3 and the probability of rain if they are telling the truth is 1.
I did a little skipping of steps, since truth doesn't equal yes, but I just sort of meshed it together towards the end.
YES=yes,yes,yes T=truth,truth,truth L=lie,lie,lie
P(Rain|YES)=P(Rain|YES,T)*P(T)+P(Rain|YES,L)*P(L)
P(Rain|YES,L)=0 ==> what's the probability of rain given we know that they are lying and they've told us it is raining.
P(Rain|YES)=P(Rain|YES,T)*P(T)
P(Rain|YES,T)=1 ==> what's the probability of it raining given that they are telling the truth and have told us it's raining.
Then P(T)=(2/3)^3. It's obvious. Why in the world would I do Bayesian methods when it's certain?

I agree with (2/3)^3.

Interview Candidate solves this problem using Bayesian stats despite the fact that not enough information is given to do Bayesian probability analysis, i.e. he had to pull the probability of it raining in Seattle out of thin air when it was not given in the interview question. With only the information from the interview question, we have to assume that friends are either all lying or all telling the truth. Let truth=T and lie=L. P(TTT)=8/27, P(LLL)=1/27, P(TLL)=2/27, P(TTL)=4/27. But we know that they all had the same answer, so we must compare P(TTT) to P(LLL). P(TTT) is 8 times more likely than P(LLL), so we have P(TTT|All same answers)=8/9, P(LLL|All same answers)=1/9. Therefore the solution given ONLY THE INFORMATION GIVEN is P(Rain)=8/9, P(Dry)=1/9.

This problem requires the marginal probability of rain to solve, following Interview Candidate's answer. M.B. provides the rationale behind why the Bayes approach is necessary: if pr(rain) = 0, then pr(rain|y, y, y) = 0 (maybe it is July in Seattle). A few conceptual problems in many answers that I want to point out:
1) There is lots of conflation between Pr(truth) and Pr(Y). Pr(truth) = Pr(Y|R) does not equal Pr(Y).
2) Consider there is only a single friend and they say yes. The logical conclusion from a lot of these answers is that Pr(Rain|Yes) = Pr(Yes|Rain) = 2/3, which is not correct.
Bayes' rule is very clear in this simpler case.
3) The friends' answers are conditionally independent assuming no collusion. The combinations of their honesty/lying add no additional information. The marginal probabilities are not independent: Pr(y,y,y) does not equal pr(y)^3, it equals pr(y,y,y,rain) + pr(y,y,y,no rain), the integration of the joint space over rain. Using conditional independence and Bayes' rule, this becomes: pr(y|rain)^3*pr(rain) + pr(y|no rain)^3*(1-pr(rain)).
A more general solution using Pr(rain) = r:
Pr(rain|y,y,y) = Pr(y,y,y|rain)*pr(rain)/pr(y,y,y) #Bayes' formula
pr(y,y,y|rain) = pr(y|rain)^3 = (2/3)^3 #conditional independence
pr(y,y,y) = pr(y|rain)^3*pr(rain) + pr(y|no rain)^3*pr(no rain) #by definition, see point 3
The answer: r*(2/3)^3 / [r*(2/3)^3 + (1 - r)*(1/3)^3]

It should be (2/3)^3, I think zen and todo are correct.

Most of the answers/comments made unconditional assumptions, except a few lines of reasoning that lead to the 8/9 probability. Note that the question states that "Each of your friends has a 2/3 chance of telling you the truth". This essentially means P(raining, yes) + P(non-raining, no) = 2/3. Any attempts to interpret this as a conditional probability P(raining | yes) = 2/3 or P(yes | raining) = 2/3 are making other assumptions.

8/27 is not the answer. For the weather to be nice in this case, all 3 of your friends NEED to have lied to you. Therefore the odds are 1/27.

What if the answer is 50%, since the chance of rain and not rain does not depend on what your friends tell you?

In the absence of further information, the only correct answer is the posterior probability of rain p is in the interval (0, 1). In the absence of further information any prior is as good as any other, so by implication the posterior can take any value as well.
The interval for p can be restricted to [0, 1] on the assumption that the question to the friends would not be posed if the prior is absolute certainty whether it will rain or not. With the further assumption that the prior probability is measured with limited precision (e.g. rounded to a percentage point), the posterior would be in the interval (0.075, 1). If the alternative assumption is made that information from the friends will be requested only if it had any chance to move the posterior below or above 0.5, the posterior interval for the probability is (0.5, 1). Any more precise answer than that requires further information about the prior, which is not supplied in the original problem formulation. Also note that even a precise answer about the probability of rain is not sufficient to answer the question whether an umbrella should be brought or not.

The probability that each of the friends says "YES" is 2/3 * 2/3 * 2/3 = 8/27. Now the probability that it is actually raining in Seattle depends on how I select them to phone. There are only three ways to select and phone them. So, the probability that it is actually raining in Seattle is 3 * (8/27) = 8/9.

The rule of conditional probability states P(A|B) = P(A & B) / P(B). Reformulating to this case, P(Rain | 3Y) = P(R & 3Y) / P(3Y).
P(R & 3Y) = (2/3)^3 (if it is raining, then they must all speak the truth) = 8/27 (one could multiply by the probability of rain here; I assumed it as prior)
P(3Y) = all truth or all lie = (2/3)^3 + (1/3)^3 = 9/27
Hence P(R | 3Y) = 8/9

Let X be the probability it's raining. Obviously we want P(X|all three say yes). Now let Y be the probability at least one of them is lying. If Y = 0 it's easy to solve, if not then not so easy. Now you keep going.

Obvious, Bayesian is a way to go...

There is a way to easily confirm the right answer. Just write a computer simulation and run it a few million times, which I did.
If the long term chance of rain in Seattle is 25%, the chance it is raining now, given the YYY answers and the 2/3 truth, 1/3 lying, is 73% (rounded to a whole number), which is the same as 8/11, so the reasoning with the Bayesian math is correct.

This can easily be solved without Bayes. There are two cases:
Case 1: It is raining and all friends are telling the truth: 0.25*(2/3)^3 = 1/4*8/27
Case 2: It is not raining and all friends are lying: 0.75*(1/3)^3 = 3/4*1/27
Probability: P(E) = Case1 / (Case1+Case2) = (1/4*8/27) / (3/4*1/27 + 1/4*8/27) = 2 / (11/4) = 8/11

Closest points

One or more comments have been removed. |
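The disagreement in this thread between 8/11 and 8/9 comes down to the prior. A small Python sketch of the general formula quoted above, r*(2/3)^3 / [r*(2/3)^3 + (1 - r)*(1/3)^3], makes that visible. The function name `posterior_rain` and its parameters are my own naming, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction


def posterior_rain(prior, p_truth=Fraction(2, 3), n_yes=3):
    """P(rain | n_yes friends all say yes), assuming answers are
    conditionally independent given the weather."""
    like_rain = p_truth ** n_yes          # all friends tell the truth
    like_dry = (1 - p_truth) ** n_yes     # all friends lie
    evidence = prior * like_rain + (1 - prior) * like_dry
    return prior * like_rain / evidence


print(posterior_rain(Fraction(1, 4)))  # 8/11, the interviewer's 25% prior
print(posterior_rain(Fraction(1, 2)))  # 8/9, an implicit 50/50 prior
print(posterior_rain(Fraction(0)))     # 0, it never rains
```

So the 8/9 answers are what you get if you silently assume rain and no rain are equally likely beforehand, while answers such as (2/3)^3 or 26/27 do not correspond to any prior in this formula.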