Amazon Data Engineer Interview Questions

Data engineers are IT professionals who are needed in almost every industry. Data engineers monitor data trends to determine the best next steps for companies. A critical part of a data engineer's job is processing raw data into usable data by creating data pipelines and building data systems.

591 Amazon Data Engineer interview questions shared by candidates

Top Data Engineer Interview Questions & How To Answer

Here are three top data engineer interview questions and how to answer them:

Question #1: Can you describe in detail your level of expertise with programming languages?

How to answer: Before the interview, review your resume and/or portfolio and make a list of the programs you are most proficient with. If you find that you lack expertise in a program the company predominantly uses, describe yourself as a highly motivated self-starter who will work tirelessly to learn the program(s).

Question #2: Explain data engineering in your own words.

How to answer: Highlight your role in relation to the larger organization and to other roles, such as data scientists, to clearly define your contribution to the business as a whole. Clarify the difference between a database-centric engineer and a pipeline-centric engineer.

Question #3: Can you describe your experience working with Apache Hadoop and cloud data management environments?

How to answer: Research the company's software, cloud data products, and use of Apache Hadoop to be prepared for this inquiry. Data engineers must be fluent in the programming languages and data management systems used throughout the industry, such as Apache Hadoop.

Top Interview Questions

Amazon
Data Scientist & Machine Learning was asked...October 19, 2013

The three data structure questions were: 1. the difference between a linked list and an array; 2. the difference between a stack and a queue; 3. describe a hash table.

4 Answers

Arrays are more efficient for accessing elements, while linked lists are better for inserting or deleting elements; the choice between the two data structures depends on the specific requirements of the problem being solved.

Stacks and queues differ in their order of processing, in the operations for adding and removing elements, and in their usage scenarios. The choice between the two data structures depends on the specific requirements of the problem being solved.

A hash table is a data structure that allows for efficient insertion, deletion, and lookup of key-value pairs. It is based on the idea of hashing, which involves mapping each key to a specific index in an array using a hash function. The hash function takes a key as input and returns an index into the array.

To handle collisions (when two or more keys map to the same index), some form of collision resolution is used, such as separate chaining or open addressing. In separate chaining, each index in the array holds a linked list, and each key-value pair is stored in a node of the corresponding list; when a collision occurs, the new key-value pair is added to the end of the list at that index. In open addressing, when a collision occurs, a different index in the array is searched for to store the new key-value pair; techniques include linear probing, quadratic probing, and double hashing.

Hash tables have an average-case time complexity of O(1) for insertion, deletion, and lookup, making them a highly efficient data structure for many applications, such as database indexing, caching, and compiler symbol tables. However, their worst-case time complexity can be as bad as O(n), for example when there are many collisions and the hash table needs to be resized.
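
As a rough illustration of the separate-chaining approach described above, here is a minimal Python sketch (the class and method names are illustrative, not taken from any library):

    class ChainedHashTable:
        def __init__(self, capacity=8):
            self.capacity = capacity
            self.size = 0
            self.buckets = [[] for _ in range(capacity)]   # one chain (list) per slot

        def _index(self, key):
            # Map the key to a bucket using Python's built-in hash function.
            return hash(key) % self.capacity

        def put(self, key, value):
            bucket = self.buckets[self._index(key)]
            for i, (k, _) in enumerate(bucket):
                if k == key:                 # key already present: overwrite its value
                    bucket[i] = (key, value)
                    return
            bucket.append((key, value))      # new key (possibly a collision): append to the chain
            self.size += 1
            if self.size > 0.75 * self.capacity:   # keep chains short by resizing
                self._resize()

        def get(self, key):
            for k, v in self.buckets[self._index(key)]:
                if k == key:
                    return v
            raise KeyError(key)

        def _resize(self):
            items = [pair for bucket in self.buckets for pair in bucket]
            self.capacity *= 2
            self.size = 0
            self.buckets = [[] for _ in range(self.capacity)]
            for k, v in items:
                self.put(k, v)

    t = ChainedHashTable()
    t.put("apple", 3)
    t.put("banana", 5)
    print(t.get("apple"))   # 3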

Amazon

They asked probability and SQL questions: 1) The probability that an item is at location A is 0.6, and 0.8 at location B. What is the probability that the item would be found on the Amazon website? 2) I have Table 1 with 1 million records and columns ID and AGE, and Table 2 with 100 records and columns ID and SALARY. The interviewer gave me the following SQL script (and asked me to modify the WHERE line): SELECT A.ID, A.AGE, B.SALARY FROM TABLE1 A LEFT JOIN TABLE2 B ON A.ID = B.ID WHERE B.SALARY > 50000. How many records would be returned? 3) Given a CSV file with ID and QUANTITY columns, 50 million records, and about 2 GB of data, write a program in any language of your choice to aggregate the QUANTITY column.

10 Answers

P(A) = 0.6, P(B) = 0.8. P(A∪B) = P(A) + P(B) - P(A and B) = 0.6 + 0.8 - 0.48 = 0.92 [independent events], i.e., there is an 8% probability the item is at neither A nor B, so there is a 92% probability the item listed on the website is at A or B. This assumes that the items available at A and B are listed on the website. We don't have the conditional probability of the item being found on the website given it is at A or B; P(W|A) and P(W|B) are not available. Hence, there is insufficient data to say exactly what the probability is of the item being found on the website.

Insufficient data to make an inference on what the probability is that an item is on Amazon, given that the probability of the item being at A is 0.6 and the probability that it is at B is 0.8. Are A and B associated with Amazon? Is the vendor who owns the item an exclusive Amazon vendor? Is the item easily available?

The simple explanation is to take the probability of the item not being found at either location (0.4 * 0.2 = 0.08). So the probability of the item not showing up at all is 8%, which means there is a 92% probability of it showing up in at least one location.
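
None of the answers address the third part of the question (aggregating the QUANTITY column of a 50-million-row, roughly 2 GB CSV). One minimal sketch, assuming a file named data.csv with a header row containing ID and QUANTITY columns, is to stream the file so it never has to fit in memory:

    import csv
    from collections import defaultdict

    per_id = defaultdict(int)   # running total of QUANTITY per ID
    grand_total = 0

    # Stream the file row by row; only the running totals are kept in memory.
    with open("data.csv", newline="") as f:
        for row in csv.DictReader(f):
            qty = int(row["QUANTITY"])
            per_id[row["ID"]] += qty
            grand_total += qty

    print(grand_total)
    print(len(per_id), "distinct IDs")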

Amazon

In Python, given a JSON object with nested objects, write a function that flattens all the objects into a single key-value dictionary. Do not use a library that already performs this function. {a: {b: c, d: e}} becomes {a_b: c, a_d: e} (not {a: "b:c,d:e"}).

7 Answers

def separate(x):
    # Note: this only flattens one level of nesting and assumes every value is a dict.
    y = {}
    for key, value in x.items():
        for key1, value1 in value.items():
            y[key + '_' + key1] = value1
    return y

x = {'a': {'b': 'c', 'd': {'e': {'f': 'g'}}}}

def get_key_val(arg_dict):
    return_dict = {}
    for arg_key, arg_val in arg_dict.items():
        if isinstance(arg_val, dict):
            for in_key, in_val in arg_val.items():
                for return_key, return_val in get_key_val({arg_key + '_' + in_key: in_val}).items():
                    return_dict[return_key] = return_val
        else:
            return_dict[arg_key] = arg_val
    return return_dict

get_key_val(x)

def flatten_json(j):
    flattened = {}

    def helper(key, val, s=""):
        if isinstance(val, dict):
            for x, y in val.items():
                helper(x, y, s + "_" + x)
        else:
            flattened[s[1:]] = val   # strip the leading "_"

    helper(None, j)
    return flattened

Amazon

SQL: select the value of one column based on the max of a different column within each group of a third column. Given Column A, Column B, and Column C: for each group based on Column A, give the value of Column B from the row where Column C is the max for that group.

6 Answers

select ColumnA, ColumnB
from (
    select ColumnA, ColumnB, ColumnC,
           ROW_NUMBER() OVER (PARTITION BY ColumnA ORDER BY ColumnC desc) as rn
    from Table
) A
where rn = 1

1: select A from T1
   where B in (select max(B) from T1 group by C)

2: select B from T1 p
   where C = (select max(c) from T1 q where q.A = p.A)
   group by A

select b
from t
inner join (select a, max(c) as c from t group by a) as grouped_t
    on t.a = grouped_t.a and t.c = grouped_t.c

Amazon

1. What difference have you made in your current team apart from regular work?
2. What are the steps you follow to rebuild a table in a database?
3. How did you do performance tuning?
4. How do you find the skewness of data in a table?
5. Difference between RDBMS and Dimensional Modeling

SQL

1) purchase

customer_id  product_id  quantity  purchase_date
1            111         1         01/01/2017
1            111         2         01/02/2017
1            222         2         01/02/2017
2            111         3         01/04/2017
2            222         1         01/03/2017
3            222         1         01/05/2017
3            222         1         01/06/2017
3            111         1         01/06/2017
3            111         1         01/04/2017

Q: How many customers bought each product how many times during the week?

Product_Id  Number_of_Customers  Number_of_Times
111         2                    2
111         1                    1
222         2                    1
222         1                    2

2) daily_usage

account_id  usage_amount  usage_date
1            10           1
1            20           2
1            15           3
1            30           4

Q: a) How do you print the usage_amount of previous/consecutive rows? b) How do you do it without using window functions?

6 Answers

SQL 1)

with c1 as (
    select product_id, customer_id, count(distinct purchase_date) as number_of_times
    from temporary.purchase
    group by product_id, customer_id
)
select product_id,
       count(distinct customer_id) as number_of_customers,
       number_of_times
from c1
group by product_id, number_of_times
order by product_id, number_of_customers desc;

2) a) Use Lag/Lead functions
   b) Not able to answer

For Question No 2:

create table daily_usage (account_id number, usage number, pur_date date);

insert into daily_usage values (1, 10, to_date('10-Jan-2017','DD-MON-YYYY'));
insert into daily_usage values (1, 20, to_date('11-Jan-2017','DD-MON-YYYY'));
insert into daily_usage values (1, 30, to_date('12-Jan-2017','DD-MON-YYYY'));
insert into daily_usage values (1, 50, to_date('13-Jan-2017','DD-MON-YYYY'));
commit;

select today.account_id, today.usage now, prev.usage prev
from daily_usage today
left outer join daily_usage prev
    on today.pur_date = prev.pur_date + 1
order by now;

Answer to second question:

;with CTE as (
    select *, row_number() over (order by amount) as NewId
    from #Usage
)
select a.*, b.amount, c.amount
from CTE a
left join CTE b on a.NewId = b.NewId + 1
left join CTE c on a.NewId = c.NewId - 1

Amazon

Coding test: moving average
Input: 10, 20, 30, 10, ...
Output: 10, 15, 20, 17.5, ...

6 Answers

x = [10, 20, 30, 10]
sum = x[0]
arr = []
arr.append(sum)
for i in range(1, len(x)):
    sum = sum + x[i]
    arr.append(sum / (i + 1))
    # print(sum, i, arr)
print(arr)

num <- c(10, 20, 30, 10)
output <- c()
for (i in 1:length(num)) {
  moving_avg <- sum(num[1:i]) / i
  output[i] <- moving_avg
}
print(output)

x = [10, 20, 30, 10]
for i in range(0, len(x)):
    # Note: this happens to reproduce the sample output for this particular input,
    # but it is not a general running-average calculation.
    print((float(x[i]) / (i + 1)) + (5 * i))
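
Note that the expected output in the prompt (10, 15, 20, 17.5) is a cumulative (running) mean rather than a fixed-window moving average. A concise sketch of that calculation using the standard library:

    from itertools import accumulate

    values = [10, 20, 30, 10]
    # Running sum divided by the count of elements seen so far.
    running_means = [total / (i + 1) for i, total in enumerate(accumulate(values))]
    print(running_means)   # [10.0, 15.0, 20.0, 17.5]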

Amazon

How to count occurrences of a word in a sentence [python]

5 Answers

text = open('data/file.txt', 'r')
word_dict = {}
for line in text:
    for word in line.split():
        if word.lower() not in word_dict:
            word_dict[word.lower()] = 1
        else:
            word_dict[word.lower()] += 1
for i, j in word_dict.items():
    print(i + ' : ' + str(j))

def count_occurance(s):
    # Count how many times each word appears in the sentence.
    d = {}
    for word in s.split(' '):
        if word not in d:
            d[word] = 1
        else:
            d[word] += 1
    return d

def Count_the_word(se, word):
    b = se.split()
    c = b.count(word)
    return c
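
As an alternative sketch (not one of the posted answers), the standard library's collections.Counter produces the same per-word tally in one step:

    from collections import Counter

    sentence = "the quick brown fox jumps over the lazy dog the end"
    counts = Counter(sentence.lower().split())
    print(counts["the"])         # 3
    print(counts.most_common(3)) # the three most frequent words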

Amazon

First round (written): 3 SQL questions and one about what you would do to improve the speed of an insert on a huge table.

Second round: get the players with the highest streak; get the details of the employee who has the maximum members in their team; Python - return the numbers which have the maximum count in a list.

Round 3: behavioral questions and 1 question on Python lists: from the 2 lists, get the numbers that are common, and return them in the following way: [1,2,3,3,1,1,1], [1,1,2,2,3] - return [1,1,2,3].

5 Answers

def common(lst1, lst2):
    res = []
    for i in lst1:
        if i in lst2:
            res.append(i)
            lst2.remove(i)
    res.sort()
    return res

a, b = [1, 2, 3, 3, 1, 1, 1], [1, 1, 2, 2, 3]
common = []
for x in a:
    if x in b:
        b.remove(x)
        common.append(x)
print(sorted(common))

a = sorted([1, 2, 3, 3, 1, 1, 1])
b = sorted([1, 1, 2, 2, 3])
res = []
i, j = 0, 0
while i < len(a) and j < len(b):
    if a[i] == b[j]:
        res.append(a[i])
        i += 1
        j += 1
    elif a[i] > b[j]:
        j += 1
    else:
        i += 1
res
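
Another way to express the same multiset intersection (a sketch using the standard library, not one of the posted answers) is with collections.Counter, whose & operator keeps each number the minimum number of times it appears in both lists:

    from collections import Counter

    a, b = [1, 2, 3, 3, 1, 1, 1], [1, 1, 2, 2, 3]
    common = sorted((Counter(a) & Counter(b)).elements())
    print(common)   # [1, 1, 2, 3]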

Amazon

Real time business problems and solutions

5 Answers

I cannot share the specifics as I have signed the NDA. All the best.

Hi, could you please help me with the interview questions? I have an interview scheduled in a few days and need to prepare. This would be a great help and I would really appreciate your guidance.

plz!!!!!!

Amazon

SQL question: 1 table with player_id, login date, and 2 other fields; asked to calculate the first-day retention rate. First-day retention rate is defined as the share of players who log in on the 2nd straight day after the first time they ever log into the game.

4 Answers

with cte as (
    select player_id, min(login_date) + 1 as next_day
    from player_details
    group by player_id
)
select count(distinct cte.player_id) / count(distinct xyz.player_id) as retention_rate
from player_details xyz
left join cte
    on xyz.player_id = cte.player_id
    and xyz.login_date = cte.next_day

-- calculating retention rate for the users
drop table if exists playerinfo;
create table playerinfo (player_id int, login_date timestamp);

insert into playerinfo values
    (1, '2012-01-01'), (1, '2012-01-02'), (1, '2012-01-03'),
    (2, '2012-01-02'), (2, '2012-01-04'), (2, '2012-01-05'),
    (3, '2012-01-02'), (3, '2012-01-03'), (3, '2012-01-05');

with cte as (
    select player_id, login_date,
           -- DATE_PART('day', end - start)
           date_part('day', login_date - min(login_date) over (partition by player_id)) as loggedinafterfirstday
    from playerinfo
    group by player_id, login_date
)
select (select count(distinct player_id) from cte where loggedinafterfirstday = 1)
       / (select count(distinct player_id) from cte)::float as firstdayretentionrate

create table playerinfo (player_id int, login_date date);
select * from playerinfo;

insert into playerinfo values
    (1, '2012-01-01'), (1, '2012-01-02'), (1, '2012-01-03'),
    (2, '2012-01-02'), (2, '2012-01-04'), (2, '2012-01-05'),
    (3, '2012-01-02'), (3, '2012-01-03'), (3, '2012-01-05');

with next_val as (
    select player_id, login_date,
           lead(login_date) over (partition by player_id order by player_id, login_date) as nextday
    from playerinfo
)
select distinct a.player_id, a.login_date, nextday
from next_val, playerinfo a
where dateadd(day, 1, a.login_date) = nextday


Glassdoor has 591 Amazon interview questions and reports from Data Engineer interviews at Amazon. Prepare for your interview. Get hired. Love your job.