Sales and analytics intern Interview Questions
sales and analytics intern interview questions shared by candidates
Top Interview Questions
How would I validate the data for integrity of results? 1 AnswerCreate quick pivot and then connect to the totals for different reports, using simple logical function in Excel. |
Write a SQL query to compute a frequency table of a certain attribute involving two joins. What if you want to GROUP or ORDER BY some attribute? What changes would you need to make? How would you account for NULLs? 26 AnswersIn my case this question was like: 'you have a table Submissions with the submission_id, the body, and the parent_id. Submissions can be posts, or comments to a post. In posts, parent_id is null, and in comments, the parent_id is the post the comment is commenting about. How would you go and make a histogram of number of posts per comment_count?' I think i solved it along the lines of: SELECT comment_counts.n_comments, count distinct(n_comments.submission_id) ( select s1.submission_id, COUNT DISTINCT(s2.parent_id) as n_comments OUTER join submissions on s1.submission_id = s2.parent_id group by submission_id) comment_counts GROUP BY comment_counts.n_comments Can you explain why you would even need the self-join here? Can you not just group by parent_id and do the COUNT() on each group, since the parent_id values correspond to the post values when they're not null? Show More Responses If you group by parent_id, you'll be leaving out all posts with zero comments. select number_comments, count(submission_id) as number_posts from ( # more than zero comments select submission_id, count(post_id) as number_comments from ( select submission_id, case when parent_id is null 1 else 0 end as post, case when parent_id is not null parent_id else null end as post_id, body from Submissions )k where post =0 group by submission_id ) k1 group by number_comments union select number_comments, count(submission_id) as number_posts from ( # comments= 0 select submission_id, 0 as number_comments from ( select submission_id, case when parent_id is null 1 else 0 end as post, case when parent_id is not null parent_id else null end as post_id, body from Submissions )k where post =1 group by submission_id ) k1 group by number_comments @ RLeung shouldn't you use left join? You are effectively losing all posts with zero comment. select k.post_id, count(submission_id) -1 from (select submission_id, case when parent_id is null then submission_id else parent_id end as post_id from submissions) t group by post_id select t.post_id, count(t.submission_id) -1 from (select submission_id, case when parent_id is null then submission_id else parent_id end as post_id from submissions) t group by post_id select parent_id as post, count(parent_id) as num_of_comments from submissions group by parent_id union select submission_id as post, 0 as num_of_comments from submissions where parent.id=null select comments_count, count(submission_id) as post_count from ( select submission_id, count( distinct parent_id) as comments_count from Table A group by submission_id )A group by comments_count I think all of the Posts are missing Parent_ID. I am editing the code shared above. This will solve the duplicate problem select parent_id as post, count(parent_id) as num_of_comments from submissions group by parent_id union select submission_id as post, 0 as num_of_comments from submissions where parent.id not in (select submission_id from submissions) Here is the solution. You need a left self join that accounts for posts with zero comments. Select children , count(submission_id) from ( Select a.submission_id, count(b.submission_id) as children from Submissions a Left Join submissions b on On a.submission_id=b.parent_id Where a.parent_id is null Group by a.submission_id ) a Group by children I've tested all these on a mock data set and none of them work! Does anyone have the correct solution? I'm stuck on this one.. Show More Responses Posts and comments in the same table looks weird. Here's my attempt (made easy with CASE) to exclude all the posts from the table and grouping/counting comments. SEL parent_id ,COUNT(*) as comment_count ( SEL * ,CASE WHEN perent_id IS NULL THEN 'Post' ELSE 'comment' END as post_or_comment FROM Submissions ) a WHERE post_or_comment = 'comment' I think it is pretty straight forward. All the posts will have null parent_id. Considering the table schema to be something like this: CREATE TABLE submissions ( submission_id INT, body VARCHAR(500), parent_id INT ); SELECT DISTINCT nvl(parent_id::TEXT,'Post with no comments') AS post_id, COUNT(CASE WHEN parent_id IS NOT NULL THEN submission_id ELSE 0 END) AS number_of_comments_or_post FROM submissions GROUP BY 1; This will give results like this: post_id number_of_comments_or_post Post with no comments 8 1 10 7 11 13 8 19 9 25 7 So, the first row will give the number of posts with no comments which is 8 and remaining rows tell the number of comments per post. Is there a flaw in this? Not the shortest answer but I think much clearer than anything posted here. Also gives output table that could actually be fed directly into a histogram which was part of the question. SELECT CASE WHEN num_comments IS NULL THEN 0 ELSE num_comments END AS num_comments, COUNT(parent_post_id) AS cnt_posts FROM ( SELECT submission_id AS parent_post_id, comment_count.num_comments FROM Submissions WHERE parent_id IS NULL LEFT JOIN ( SELECT parent_id, COUNT(parent_id) AS num_comments FROM Submissions WHERE parent_id IS NOT NULL GROUP BY 1 ) comment_count ON submission_id = comment_count.parent_id ) GROUP BY 1 ORDER BY 1 select p.parent_id as posts, count(c.submission_id) as commentcount from submissions c inner join submissions p on c.parent_id = p.submission_id group by p.parent_id; select case when parent_id is not null then parent_id else sub_id end as post_id, sum(case when parent_id is not null then 1 else 0 end) as comment_count from submissions group by case when parent_id is not null then parent_id else sub_id end; Create table: create table submissions ( submission_id int null, body varchar(500) null, parent_id int null ); Insert records: (change your database name) INSERT INTO employees.submissions (submission_id, body, parent_id) VALUES (1, 'POST1', null); INSERT INTO employees.submissions (submission_id, body, parent_id) VALUES (null, 'C1', 1); INSERT INTO employees.submissions (submission_id, body, parent_id) VALUES (2, 'POST2', null); INSERT INTO employees.submissions (submission_id, body, parent_id) VALUES (3, 'POST3', null); INSERT INTO employees.submissions (submission_id, body, parent_id) VALUES (null, 'C2', 3); INSERT INTO employees.submissions (submission_id, body, parent_id) VALUES (null, 'C3', 3); Solution: SELECT a.submission_id AS post_id, a.body, sum(CASE WHEN t.parent_id > 0 THEN 1 ELSE IFNULL(t.parent_id,0) END) AS comment_id FROM submissions AS a LEFT JOIN (SELECT b.parent_id FROM submissions AS b) t ON a.submission_id = t.parent_id WHERE a.submission_id IS NOT NULL GROUP BY post_id; Results: 1 POST1 1 2 POST2 0 3 POST3 2 CREATE TABLE users( sid INT , pid INT , body Varchar(255)); insert into users Values ( 2,null, "cover"), (1,2,"Ami is"),(3,2,"hi"),(4,2,"good pic"),(5,null ,"profil pic"),(6,5,"nice"); (select pid , COUNT(pid) as total from users where pid is not null group by pid) create table subs( sub_id integer, parent_id integer ) insert into subs values(1,null); insert into subs values(2,null); insert into subs values(3,null); insert into subs values(4,null); commit; insert into subs values(5,1); insert into subs values(6,1); insert into subs values(7,1); insert into subs values(8,1); insert into subs values(9,2); insert into subs values(10,2); insert into subs values(11,3); insert into subs values(12,3); insert into subs values(12,4); commit; select * from subs select cc, count(sub_id) from ( select a.sub_id, count(b.sub_id) cc from subs a inner join subs b on(b.parent_id = a.sub_id) group by 1) group by 1 I found it easier to explain when I broke it out into named sub tables to handle the case when there are no comments on a post and you want the end result to be the histogram of the number of comments per post: with parent_comment_ct as ( SELECT parent_id, COUNT(parent_id) AS num_comments FROM submissions WHERE parent_id IS NOT NULL GROUP BY parent_id ), submission_comment_ct as ( SELECT su.submission_id AS parent_post_id, pcc.num_comments AS num_comments FROM submissions su LEFT JOIN parent_comment_ct pcc ON su.submission_id = pcc.parent_id WHERE su.parent_id IS NULL ) SELECT CASE WHEN scc.num_comments IS NULL THEN 0 ELSE scc.num_comments END AS num_comments, COUNT(scc.parent_post_id) AS cnt_posts FROM submission_comment_ct scc GROUP BY 1 ORDER BY 1 SELECT recommended_page FROM (SELECT f.user1_id as users, f.user2_id as freinds, l.page_id as recommended_page FROM friendship f INNER JOIN likes l ON f.user2_id = l.user_id WHERE f.user1_id = 1 UNION ALL SELECT f.user2_id as users,f.user1_id as friends,l.page_id as recommended_page FROM friendship f INNER JOIN likes l ON f.user1_id = l.user_id WHERE f.user2_id = 1) MINUS (SELECT page_id as recommended_page FROM likes WHERE user_id = 1); Show More Responses SELECT recommended_page FROM (SELECT f.user1_id as users, f.user2_id as freinds, l.page_id as recommended_page FROM friendship f INNER JOIN likes l ON f.user2_id = l.user_id WHERE f.user1_id = 1 UNION ALL SELECT f.user2_id as users,f.user1_id as friends,l.page_id as recommended_page FROM friendship f INNER JOIN likes l ON f.user1_id = l.user_id WHERE f.user2_id = 1) MINUS (SELECT page_id as recommended_page FROM likes WHERE user_id = 1); select c.subID as SubmissionID, count(c.body)-1 as Counts_Comments from subm c LEFT JOIN subm b ON c.subID = b.pID where b.pID is null AND c.pID is NULL group by c.subID UNION ALL select a.pID as SubmissionID, count(a.body) as Counts_Comments from ( select *, case when pID IS NULL then 'P' Else 'C' END as P_O_C from subm)a where P_O_C = 'C' group by a.pID Order by SubmissionID; select a.user_name,b.user_name,page_liked from services_db.pages_liked a, services_db.user_friends b where 1=1 and a.user_name = b.friend_user and a.page_liked not in ( select page_liked from services_db.pages_liked c where 1=1 and c.user_name = b.user_name ) ; One or more comments have been removed. |
Data Scientist - Analytics at Facebook was asked...
There are two mobile restroom stalls at a construction site where I work. There are also three situations that have an equal chance of occurrence: a. none of them is occupied b. only one of them is occupied c. both are occupied 1. If I were to pick one at random, what is the probability that it is occupied? 2. If it turns out that that first one I go to is occupied and I decide to try the other one, what is the probability that the second one is also occupied? 12 AnswersProbability of selecting a restroom out of two restrooms is 0.5. Then probability of being occupied the selected restroom is 0.5. So, the total probability is 0.5 X 0.5 = 0.25 Above answer is wrong, the answer to the first and second question are both 1/2. the answer to the first question is - 1/3 + 1/3*1/2 = 1/2 the answer to the second question require the formula of conditional probability. Let's say: P(A) - probability that second stall is occupied P(B) - probability that the first stall is occupied P(A\B) = P(AandB) / P(B) P(B) = 1/2 (first question) P(AandB) = 1/3 P(A\B) = (1/3) / (1/2) = 2/3 Show More Responses This is not clear at all. Can you please explain how do you derive this mathematically? "the answer to the first question is - 1/3 + 1/3*1/2 = 1/2" Can you please explain how you derive this? I think a more rigorous approach to problem #1 would be: P(choosing restroom A) * P(restroom A being occupied) + P(choosing restroom B) * P(restroom B being occupied) " "the answer to the first question is - 1/3 + 1/3*1/2 = 1/2" Can you please explain how you derive this? " There are also three situations that have an equal chance of occurrence - meaning each have probability 1/3 to occur. 1/3 - the probability that I chose the option where they are both occupied. 1/3 * 1/2 - the probability that if one occupied and the other isn't , I chose the one that is occupied. Situation 1: They're both occupied (P = 1/3) A B 1 1 ====================================================== Situation 2: Only one is occupied (P = 1/3) A B 1 0 OR A B 0 1 ====================================================== Situation 3: None of them is occupied A B 0 0 Situation 1: They're both occupied (P = 1/3) A B 1 1 ====================================================== Situation 2: Only one is occupied (P = 1/3) A B 1 0 OR A B 0 1 ====================================================== Situation 3: None of them is occupied (P = 1/3) A B 0 0 The given equally likely conditions , i.e: a. none of them is occupied b. only one of them is occupied c. both are occupied Tells us the probability of any stall being occupied is 0.5. Hence, shouldn't the solution be Sol_1 = 1/2 and Sol_2= 1/2 *1/2 =1/4 ?? One or more comments have been removed. |
Lets say the population on Facebook clicks ads with a click-through-rate of P. We select a sample of size N and examine the sample's conversion rate, denoted by hat{P}, what is the minimum sample size N such that Probability( ABS(hat{P} - P) < DELTA ) = 95%. In other words (this is my translation), find the minimum sample size N such that our sample estimate hat{P} is within DELTA of the true click through rate P, with 95% confidence. 6 AnswersInterpret the question this way: we want to choose an N such that P_hat is an element of [P - delta, P + delta] with probability 95%. First, note that since P_hat is the sum of N Bernoulli trials with some common parameter (by assumption) that we are trying to estimate, we can safely assume P_hat to be normally distributed with mean equal to the true mean (P) and variance equal to (P)(1 - P) / N. Now, we when does a normally distributed random variable fall within delta of it's mean with 95% probability? The answer depends on how big delta is. Since P_hat is normally distributed, we know from our statistics classes that 95% of the time it will fall within 2 standard deviations of its mean. So in other words, we want [P - delta, P + delta] = [P - 2*SE(P_hat), P + 2*SE(P_hat)]. That is, we want delta = SE(P_hat). So what is the SE ("standard error") of P_hat? Well that's just the square root of its (sample) variance, or Sqrt(P_hat * (1 - P_hat) / N). But wait! We haven't run the experiment yet! How can we know what P_hat is? We can either (a) make an educated guess, or (b) take the "worst" possible case and use that to upper bound N. Let's go with option (b): P_hat * (1 - P_hat) is maximized when P_hat is .5, so the product is 0.25. To put it all together: delta = 2 * Sqrt(0.25) / Sqrt(N) = 2 * .5 / Sqrt(N) => N = (1 / delta) ^ 2. So when N is greater than (1 / delta)^2, we can rest assured that P_hat will fall within the acceptable range 95% of the time. Hi, I am not sure I understand your solution. Could anyone explain more? Why is the variance P(1-P) / N. Isn't it NP(1-P), because it is the binomial distribution (sum of Bernoulli trials)? Show More Responses Use Chebyshev's inequality Rate has Poisson distribution, not Bernoulli. The mean equals the variance, SE = sqrt(P/N). Doen't the answer imply N >=1? |
Given two binary strings, write a function that adds them. You are not allowed to use any built in string to int conversions or parsing tools. E.g. Given "100" and "111" you should return "1011". What is the time and space complexity of your algorithm? 7 AnswersIn Python: def normalize_length(str1, str2): len1 = len(str1) len2 = len(str2) if (len1 = 0): if (input2[i] == "1") and (input1[i] == "1"): if(carry): result = "1" + result carry = 1 else: carry = 1 result = "0" + result i -= 1 if (input2[i] == "1") and (input1[i] == "0"): if (carry): result = "0" + result else: result = "1" + result i -= 1 if (input2[i] == "0") and (input1[i] == "1"): if (carry): result = "0" + result else: result = "1" + result i -=1 if (input2[i] == "0") and (input1[i] == "0"): if (carry): result = "1" + result carry = 0 else: result = "0" + result i -=1 if(carry): result = "1" + result carry = 0 return(result) str1 = "111" str2 = "1011" print(normalize_length(str1, str2)) print(add_binary(str1, str2)) Obviously there are better ways to do this, but hey: my solution is O(N). Ignore the answer above - didn't realize that Glassdoor would cut off parts of my answer for being too long. Assuming you already wrote the normalizing code to make the input lengths the same by adding zeros: def add_binary(input1, input2): normalized = normalize_length(input1, input2) input1 = normalized[0] input2 = normalized[1] length = len(input1) result = "" carry = 0 i = length-1 while(i >= 0): if (input2[i] == "1") and (input1[i] == "1"): if(carry): result = "1" + result carry = 1 else: carry = 1 result = "0" + result i -= 1 if (input2[i] == "1") and (input1[i] == "0"): if (carry): result = "0" + result else: result = "1" + result i -= 1 if (input2[i] == "0") and (input1[i] == "1"): if (carry): result = "0" + result else: result = "1" + result i -=1 if (input2[i] == "0") and (input1[i] == "0"): if (carry): result = "1" + result carry = 0 else: result = "0" + result i -=1 if(carry): result = "1" + result carry = 0 return(result) def calc_bin_sum(bin1, bin2): ## bin1 conversion to a number based in 10 b1 = 0 for i in range(len(bin1)): b1 = b1 + int(bin1[i]) * (2**i) ## bin2 conversion to a number based in 10 b2 = 0 for j in range(len(bin2)): b2 = b2 + int(bin2[j]) * (2**i) ## Add two numbers corr_based_10 = b1 + b2 ## Change it back to binary def trans(x): binary = [] while x: binary.append(x % 2) x >>= 1 return binary return ''.join(map(str, trans(corr_based_10))) Show More Responses A good perl programmer can write perl in any language reduce( lambda st_car, nxt : (st_car[0] + str((st_car[1] + nxt) % 2), int((st_car[1] + nxt)/2)), map(lambda a:sum(map(int, a)), zip_longest(reversed('11'),reversed('101010'), fillvalue='0')), ('', 0) )[0] Using Python, def binaryAdd(str1, str2): len_str1 = len(str1) len_str2 = len(str2) if len_str1 > len_str2: max_len = len_str1 min_len = len_str2 max_str = str1 min_str = str2 else: max_len = len_str2 min_len = len_str1 max_str = str2 min_str = str1 # resulting str res = '' carry_over = 0 # 0 or 1 at all times for i in range(min_len-1,-1,-1): temp_sum = int(min_str[i]) + int(max_str[i+(max_len-min_len)]) + carry_over res = str(temp_sum % 2) + res # modulo 2; remainder is the digit carry_over = int(temp_sum/2) for i in range(max_len-min_len-1,-1,-1): temp_sum = int(max_str[i]) + carry_over res = str(temp_sum % 2) + res carry_over = int(temp_sum/2) if carry_over == 1: res = '1' + res print(res) Here's how I would've done it: def add_bin(str1, str2): len1 = len(str1) len2 = len(str2) diff = len2 - len1 #padded zeros if diff 0: for zero in range(diff): str1 = '0'+str1 results = [] carry = 0 #binary algebra for bit1,bit2 in zip(reversed(str1), reversed(str2)): if bit1 == '1' and bit2 == '1': if carry == 1: results.append('1') carry = 1 else: results.append('0') carry = 1 elif (bit1 == '0' and bit2 == '1') or (bit1 == '1' and bit2 == '0'): # print('hi') if carry == 1: results.append('0') carry = 1 else: results.append('1') carry = 0 elif (bit1 == '0' and bit2 == '0'): if carry == 1: results.append('1') carry = 0 else: results.append('0') carry = 0 #last carry: if carry == 1: results.append('1') return ''.join(reversed(results)) str1 = "100" str2 = "111" print(add_bin(str1, str2)) One or more comments have been removed. |
We have a table called ad_accounts(account_id, date, status). Status can be active/closed/fraud. A) what percent of active accounts are fraud? B) How many accounts became fraud today for the first time? C) What would be the financial impact of letting fraud accounts become active (how would you approach this question)? 6 AnswersA) what percent of active accounts are fraud? SELECT COUNT(DISTINCT t2.account_id)/COUNT( DISTINCT t1.account_id) AS perc_fraud FROM ad_accounts AS t1 INNER JOIN ad_accounts AS t2 ON t1.account_id = t2.account_id AND t2.status = 'fraud' AND t2.date > t1.date WHERE t1.account_id = 'active' B) How many accounts became fraud today for the first time? SELECT COUNT(DISTINCT t1.account_id) AS fraud_today FROM ad_accounts AS t1 INNER JOIN ad_accounts AS t2 ON t1.account_id = t2.account_id AND t2.status 'fraud' AND t2.date < t1.date WHERE t1.status = 'fraud' AND DATE_TRUNC('day, t1.date) = '2019-04-20'::timestamp ^ You need to left join Yep, should be A) what percent of active accounts are fraud? SELECT COUNT(DISTINCT t2.account_id)/COUNT( DISTINCT t1.account_id) AS perc_fraud FROM ad_accounts AS t1 LEFT JOIN ad_accounts AS t2 ON t1.account_id = t2.account_id AND t2.status = 'fraud' AND t2.date > t1.date WHERE t1.status = 'active' Show More Responses For question B, if I assume i have today's data ans yesterday's data in the table, would this work? Select Count (distinct a.Account_id) From ad_accounts A Inner join ad_accounts b On a.account_id=b.account_id Where a.date=current_data and b.date=date_add (‘day’, -1, current_date) And a.status=’fraud’ And b.status!=’fraud’ A) what percent of active accounts are fraud? Select sum(Case when status = ‘fraud’ then 1 else 0 end)/count(*) as Fraud_percentage from ad_accounts where status ‘closed’; B) How many accounts became fraud today for the first time? select count(*) from ( select account_id, min(date) as First_fraud from ad_accounts where status = 'fraud' group by account_id having First_fraud = current_date() ); ^ For the first query, status DIFFERENT than 'closed'. |
what is logistic regression? How to perform variable selection 4 AnswersSee answers for Capital One statistician questions A Measurement of Variables Logistic regression is a predictive analysis. To explain the relationship between the one dependent variable with another independent variable. Show More Responses Logistic regression is a form of predictive modeling where the variable you're predicting has a binary or yes/no answer. Variable selection can be exploratory or confirmatory. In confirmatory analysis, pre-planned models are created and compared for fit and provide relative significance to the model for each value. The best model is the one which balances prosody with explanatory power. In an exploratory analysis, starting from either a model containing all variables or no variables, perform a stepwise model creation to ensure you have good model fit and that the compared models are nested and therefore valid for comparison to one another. Methodology for selecting which variable to include or exclude is often implicit, but this is likely based on the significance (p-value) of each variable, and the correlation between that variable and the outcome variable. |
Business Analytics Intern at Enova was asked...
Imagine a cube 1x1x1, then imagine that you form a cube of 10x10x10 with all these little cubes of 1x1x1. How many cubes do you have to remove to get rid of the surface of cubes. 6 Answers(10*10*10)-(8*8*8) 10*10*2 + 10*8*2 + 8*8*2 2*(10*10)+2*(10*8)+2*(8*8)=488 Show More Responses n^3 - x = (n-2)^3 x = n^3 - (n-2)^3 x = 6n^2 - 12n +8 if n = 10 then x = 488 surface cubes to remove One or more comments have been removed. |
Analytics - MBA Intern at eBay was asked...
Mini Case: eBay is getting complaints from customers that the site doesn't have enough product selection. To increase product selection eBay runs a promotion and decides to waiver the listing fees for all sellers. eBay's earns revenue from the listing fees and as a % of sales transaction and the bulk of revenue comes as a commission over sales transactions. The promotion instead of increasing the revenue results in a drop in revenue for eBay. Explain 4 AnswersI asked a few questions about what percent of revenue comes from each of the revenue stream. Came up with a few recommendations on running a more focused promotion campaign to incentive sellers with the right kind of products. This would ensure that promotion budget is optimized and also help buyers see only the kind of products that they were interested in purchasing. Ebay has actually tried this a few times. The listing fee keeps auctions that have no chance of selling off the site. If there is no listing fee, ebay loses that revenue and still only sees the same amount of auctions close with a sale. Buyers also tire of looking through a glut of items up for sale that they aren't interested in. The decrease in revenue was expected because eBay will not earn listing fees. However, the objective was better "customer experience", which might lead to more transaction and revenue. Now, despite availability of better listing, revenue had not decreased implying customer experience has not improved - perhaps their search engine or product listing segmentation is not good. Show More Responses Let's keep this equation, Revenue = Listing fee + % commision * (orders * Avg order value) i'm assuming the hypothesis of removing the listing fee is to enable more products are introduced into the market and hence likely more orders. This could lead to the following outcome the # of orders are going up, but the AOV is dropping - It's possible that people are starting to sell items which they might have not before, because the listing fee eats into their margins. |
If you had 2 6-sided dice, what's the probability you get a 7? 2 Answershere are the different ways you can get 7 1-6 6-1 2-5 5-2 3-4 4-3 prob(7) = [Prob(1)*prob(6) + prob(6)*prob(1).......................+prob(4)*Prob(3)] = 1/6 You should ask if the dice are fair. |
See Interview Questions for Similar Jobs
- Engineering
- Dental Assistant
- Hospitality
- Social Worker
- Cashier
- Truck Driver
- Security Guard
- Administrative Assistant
- Receptionist
- Accountant
- Graphic Designer
- IT