# Data Analyst Interview Questions in Hartford, CT

In a data analyst interview, employers ask questions that help ascertain your level of technical skills, including familiarity with data analysis software, data analysis terminology, and statistical methods. Interviewers may also use guesstimate questions to learn about your problem-solving skills and creative thinking.

53,583 Data Analyst interview questions shared by candidates

## Top Data Analyst Interview Questions & How to Answer

Here are three data analyst interview questions and how to answer them:

### Question #1: What data analysis software have you worked with?

How to answer: Data analysts use software to perform many of their job duties. Interviewers ask this question to determine your current skill level and how much training you may require. List and discuss any data analysis software you have used. Be sure to mention specific software training and give examples of how you used the software on the job.

### Question #2: Which statistical methods are most beneficial for data analysis?

How to answer: With this question, interviewers are testing your understanding of the most commonly used statistical methods, such as the simplex algorithm, imputation, the Bayesian method, and the Markov process. List and discuss data analysis methods, including the specific benefits of each process. If possible, include examples of how you used specific statistical methods on the job.

### Question #3: How would you estimate the number of pairs of rain boots sold in Seattle in May?

How to answer: Guesstimate questions are often used in data analyst interviews to test your analytical thinking and problem-solving methods. Share how you would work through the problem, including the data sets you would need, how you would go about finding data, and the methods you would use to calculate an estimated answer.

## Top Interview Questions

Data Scientist Intern was asked...February 25, 2012

### Find the second largest element in a Binary Search Tree

The above answer is also wrong; a corrected version:

```java
// Returns the second-largest node in a BST, or null if there are fewer than two nodes.
Node findSecondLargest(Node root) {
    // If the tree is null or a single node, there is no second largest.
    if (root == null || (root.left == null && root.right == null)) return null;
    Node parent = null, child = root;
    // Walk to the rightmost (largest) node, tracking its parent.
    while (child.right != null) {
        parent = child;
        child = child.right;
    }
    // If the rightmost node has no left child, its parent is the second largest.
    if (child.left == null) return parent;
    // Otherwise, the second largest is the rightmost node of its left subtree.
    child = child.left;
    while (child.right != null) child = child.right;
    return child;
}
```

Find the rightmost element. If it is a right child with no children of its own, return its parent. If not, return the largest element of its left subtree.

One addition is the situation where the tree has no right branch (the root is the largest). In this special case the largest node has no parent. So it's better to keep track of parent and current pointers: if they differ, the candidate's original method works well; if they are the same (the root case), find the largest node of the left branch.

### Case interview: basic business problem (if product X costs Capital One \$4.00 per unit, with an \$800 sunk cost, and we charge X dollars along with a \$10 annual fee, how many do we need to sell to break even, etc.). Followed by a longer discussion of more complex problems that the situation might entail.

Erin - it probably shouldn't be. The revenue side should be (unit price + 10) * number of units sold. Dapo has assumed that unit price and number of units sold are the same because the letter X is used to describe both variables. That might be correct, but it would be an odd quirk of the question; I'd bet that you can ask what the unit price is.

How is it possible to answer this without knowing the selling price per product? The number of units to sell in order to break even depends on the selling price. Since the selling price is X, you could sell the product for \$790 and break even by selling just one (\$790 per product + \$10 annual fee, assuming the fee is per product). But I may be misunderstanding the information provided. Any feedback?

Let's say we need to sell y units to break even. Setting cost equal to revenue:

800 + 4y = xy + 10y
800 = y(x + 10 - 4)
y = 800 / (x + 6)

So we break even at 800/(x + 6) units, and selling one more unit turns a profit.
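The break-even algebra above can be checked with a short script. This is a sketch under the same assumptions as the answer (the \$10 fee is earned per unit sold); the unit price is left as a parameter since the question leaves it open, and the \$20 example price is hypothetical.

```python
def break_even_units(unit_price, sunk_cost=800.0, unit_cost=4.0, annual_fee=10.0):
    """Units needed so total revenue covers the sunk cost plus per-unit costs.

    Per-unit margin = unit_price + annual_fee - unit_cost.
    Break-even: sunk_cost = y * margin  =>  y = sunk_cost / margin.
    """
    margin = unit_price + annual_fee - unit_cost
    if margin <= 0:
        raise ValueError("never breaks even: per-unit margin is not positive")
    return sunk_cost / margin

# With a hypothetical unit price of $20: 800 / (20 + 10 - 4) = 30.77...,
# so 31 whole units must be sold to break even.
print(break_even_units(20.0))
```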

Data Analyst was asked...December 24, 2012

### If you have 10 bags of marbles with 10 marbles each and one bag has marbles that weigh differently than the others, how would you figure it out from one weighing

Assume all normal marbles weigh 10 g and the heavier ones weigh 11 g. Take 1 marble from bag 1, 2 from bag 2, 3 from bag 3, 4 from bag 4, 5 from bag 5, 6 from bag 6, 7 from bag 7, 8 from bag 8, 9 from bag 9, and 10 from bag 10, and weigh them together; call the total W. If all marbles were 10 g, W would be 550. If, say, bag 5 contains the heavy marbles, then W = 10 + 20 + 30 + 40 + 55 + 60 + 70 + 80 + 90 + 100 = 555. So the number of the heavy bag is always MeasuredWeight - 550. Mathematically, if bag X is the heavy one, X can be found from one weighing as W - 10 * N(N+1)/2.
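The take-k-marbles-from-bag-k trick is easy to simulate. This is a sketch under the same assumption as the answer above: normal marbles weigh 10 g and marbles from the odd bag weigh 11 g.

```python
def find_heavy_bag(heavy_bag, normal=10, heavy=11, n_bags=10):
    """Simulate one weighing: take k marbles from bag k (1-indexed)."""
    # Total weight of the sample of 1 + 2 + ... + n_bags marbles.
    weight = sum(k * (heavy if k == heavy_bag else normal)
                 for k in range(1, n_bags + 1))
    expected = normal * n_bags * (n_bags + 1) // 2  # weight if all bags were normal
    # Each extra gram of excess corresponds to one marble from the heavy bag.
    return (weight - expected) // (heavy - normal)

# If bag 5 is heavy, the sample weighs 555 g vs. 550 g expected, so excess = 5.
print(find_heavy_bag(5))  # → 5
```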

Terribly worded question; you never specified that we are given the normal and heavy weights. Without that information, the voted-up solution here doesn't work.

Using a series sum [N(N+1)/2] will not work in this case, as we've not been given enough specifics (such as "what's the standard weight of the marbles?"). In fact, any solution that depends on the summed weight has an overlapping solution space and fails. Consider: if the standard weight is 10 ounces per marble and bag 1 is the exception at 12 ounces per marble, the total weight is 552. Now consider bag 2 being the exception at 11 ounces per marble: also a total weight of 552. There is no way to distinguish between the two cases. Given the setup, it appears the interviewer did expect the answer to use some variation of a summed series, and I suspect the poster has paraphrased the question and dropped some key language; as best I can tell, it is not solvable as stated. Google will expect you to provide a generalized solution that can be automated, so "add the bags one at a time", while simple and clever, would probably not be accepted as a "weigh only once" solution, and even if accepted it would be a lesser answer because it does not provide a codeable business solution.

### How would you test if survey responses were filled at random by certain individuals, as opposed to truthful selections?

This is a very basic psychometrics question. Calculate Cronbach's alpha for the survey items. If it is low (below 0.5), it is very likely that the questions were answered at random.
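One way to compute Cronbach's alpha is the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). A minimal sketch, assuming numeric (e.g. Likert-scale) responses; the sample data below is made up.

```python
from statistics import pvariance

def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents' item scores.

    responses: one row per respondent, each row a list of item scores.
    """
    k = len(responses[0])                       # number of items
    items = list(zip(*responses))               # one column per item
    item_vars = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_vars / total_var)

# Internally consistent answers push alpha toward 1; random answers push it
# toward 0 or below.
consistent = [[1, 1, 2], [3, 3, 3], [5, 5, 4], [2, 2, 2]]
print(round(cronbach_alpha(consistent), 2))  # → 0.96
```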

I would design the survey so that certain information is asked in two different ways. If the two answers disagree with each other, I would seriously doubt the validity of the response.

We could plot a histogram of the answers to each question in the survey to see the distribution of responses. If the selections are truthful, the question histograms will likely be roughly normally distributed. If more than half of a respondent's answers fall outside the 95% confidence interval of each question's distribution (roughly the mean plus or minus two standard deviations), that response can be categorized as random.

### You are compiling a report for user content uploaded every month and notice a spike in uploads in October. In particular, a spike in picture uploads. What might you think is the cause of this, and how would you test it?

We cannot say what caused the spike, since a causal relationship cannot be established from observational data alone. But we can compare the monthly averages by performing a hypothesis test and rejecting the null hypothesis if the test statistic is significant.

The photos are likely Halloween pictures. Segment by country and date and check for a continual rise in photo uploads leading up to October 31st, plus a few days after for the lag.

Hypothesis: the photos are Halloween pictures. Test: look at upload trends in countries that do not observe Halloween as a sort of counterfactual analysis.

### The three data structure questions are: 1. the difference between linked list and array; 2. the difference between stack and queue; 3. describe hash table.

Arrays are more efficient for accessing elements, while linked lists are better for inserting or deleting elements; the choice between the two data structures depends on the specific requirements of the problem being solved.

Stacks and queues differ in their order of processing (LIFO vs. FIFO), their operations for adding and removing elements, and their usage scenarios. The choice between the two data structures depends on the specific requirements of the problem being solved.

A hash table is a data structure that allows for efficient insertion, deletion, and lookup of key-value pairs. It is based on the idea of hashing, which involves mapping each key to a specific index in an array using a hash function. The hash function takes a key as input and returns an index in the array.

In order to handle collisions (when two or more keys map to the same index), some form of collision resolution mechanism is used, such as separate chaining or open addressing. In separate chaining, each index in the array holds a linked list, and each key-value pair is stored in a node of the corresponding list; when a collision occurs, the new key-value pair is added to the end of the list at that index. In open addressing, when a collision occurs, a different index in the array is searched for to store the new key-value pair. There are several techniques for open addressing, such as linear probing, quadratic probing, and double hashing.

Hash tables have an average-case time complexity of O(1) for insertion, deletion, and lookup, making them a highly efficient data structure for many applications, such as database indexing, caching, and compiler symbol tables. However, their worst-case time complexity can be as bad as O(n) in rare cases, such as when there are many collisions and the hash table needs to be resized.
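The separate-chaining scheme described above can be sketched in a few lines. This is a toy illustration (fixed bucket count, no resizing), not a production hash table:

```python
class ChainedHashTable:
    """Toy hash table using separate chaining: each slot holds a list of pairs."""

    def __init__(self, n_buckets=8):
        self.buckets = [[] for _ in range(n_buckets)]

    def _slot(self, key):
        # Map the key to a bucket index via the hash function.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._slot(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key (or collision): append to chain

    def get(self, key, default=None):
        for k, v in self.buckets[self._slot(key)]:
            if k == key:
                return v
        return default

t = ChainedHashTable()
t.put("movies", 42)
t.put("shows", 7)
t.put("movies", 43)          # overwrites the earlier value
print(t.get("movies"))       # → 43
```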

### How would you build and test a metric to compare two users' ranked lists of movie/TV show preferences?

Look at the mean average precision of the movies the users actually watch out of the rankings. So if, out of 10 recommended movies, one user prefers the third and the other user prefers the sixth, the recommendation engine of the user who preferred the third would be better.

1) Develop a list of shows/movies that are representative of different taste categories (more on this later). 2) Obtain rankings of the items in the list from the two users. 3) Use Spearman's rho (or another test that works with rankings) to assess the dependence/congruence between the two people's rankings. To find shows/movies to include in the measurement instrument, one could do cluster analysis on a large number of viewers' viewing habits.
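For two complete, tie-free rankings of the same n items, Spearman's rho reduces to rho = 1 - 6 * sum(d^2) / (n(n^2 - 1)), where d is the rank difference per item. A minimal sketch; the show titles and rankings below are made up.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two tie-free rankings of the same items.

    rank_a, rank_b: dicts mapping item -> rank (1 = most preferred).
    """
    n = len(rank_a)
    d_sq = sum((rank_a[item] - rank_b[item]) ** 2 for item in rank_a)
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))

user1 = {"Stranger Things": 1, "The Crown": 2, "Ozark": 3, "Narcos": 4}
user2 = {"Stranger Things": 2, "The Crown": 1, "Ozark": 3, "Narcos": 4}
print(spearman_rho(user1, user1))  # identical rankings → 1.0
print(spearman_rho(user1, user2))  # one adjacent swap → 0.8
```

With real data (possible ties, partial lists), a library implementation such as SciPy's `spearmanr` would be a safer choice than this shortcut formula.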

It's essential to demonstrate that you can really go deep; there are plenty of follow-up questions and (sometimes tangential) angles to explore.

### If you have a 5*5*5 cube, what is the outside surface area?

It's surface area, not volume, so the answer should be 6*(5*5) = 150.

Looking at the first answer makes me doubtful about what "cubic" means. If it's just a cube, then the answer should be 6*(5*5), as mentioned above.

The simplest answer will be 5*5*5 - 3*3*3 = 98, though note that this counts the unit cubes with at least one exposed face rather than the surface area, which is 6*(5*5) = 150.
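Both numbers mentioned above are quick to verify: the surface area of an n*n*n cube, and the count of exposed unit cubes (all cubes minus the hidden inner core).

```python
n = 5
surface_area = 6 * n * n             # six faces of n*n each
exposed_cubes = n**3 - (n - 2)**3    # all unit cubes minus the hidden 3*3*3 core
print(surface_area, exposed_cubes)   # → 150 98
```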

### How do you take millions of users with hundreds of transactions each, across tens of thousands of products, and group the users into meaningful segments?

You can group similar users and similar items by calculating the distance between them. Jaccard distance is a common approach when building graphs of item-user relationships. For each user you have a vector of N items that they had the potential to buy; for each product you have a vector of M users who bought that product.

From these vectors you can calculate a Jaccard similarity matrix of user-user pairs and product-product pairs (the distance is 1 minus the similarity). The similarity between u1 and u2 is f(u1, u2) = intersection(u1, u2) / (len(u1) + len(u2) - intersection(u1, u2)), and the same for products: f(p1, p2) = intersection(p1, p2) / (len(p1) + len(p2) - intersection(p1, p2)). You do this for each of the N^2 and M^2 pairs, then rank each row of the product matrix and the user matrix. This gives you a ranking row per item, e.g. "product p1's closest products are p4, p600, p5, ...". These rankings reflect purchase behavior, similar to Amazon's "people who bought this also bought...".

This only works with the purchase graph. You could also segment users by the price of items bought: someone who bought a MacBook with a Retina display probably has enough money to buy another expensive laptop, but kids who only paid \$30 for headphones probably don't.
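The user-to-user Jaccard computation described above can be sketched with Python sets; the product ids below are made up.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity between two sets of purchased product ids."""
    inter = len(a & b)
    union = len(a) + len(b) - inter   # same as len(a | b)
    return inter / union if union else 0.0

def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)

u1 = {"p1", "p2", "p3", "p4"}
u2 = {"p3", "p4", "p5"}
# 2 products in common out of 5 distinct products overall.
print(jaccard_similarity(u1, u2))  # → 0.4
```

In practice one would compute this for every user pair (or use locality-sensitive hashing to avoid the quadratic blow-up) and feed the distances to a clustering algorithm.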

That is one way, but clustering algorithms can also help do it in a more efficient way.