Data engineers are IT professionals who are needed in almost every industry. They monitor data trends to determine the best next steps for companies. A critical part of a data engineer's job is to turn raw data into usable data by creating data pipelines and building data systems.
Here are three top data engineer interview questions and how to answer them:
How to answer: Before the interview, review your resume and/or portfolio and make a list of the programs you are most proficient with. If you find that you lack expertise in a program the company predominantly uses, describe yourself as a highly motivated self-starter who will work tirelessly to learn it.
How to answer: Highlight how your role relates to the larger organization and to adjacent roles such as data scientists, so that your contribution to the business as a whole is clearly defined. Clarify the difference between a database-centric engineer and a pipeline-centric engineer.
How to answer: Research the company's software, cloud data products, and use of Apache Hadoop to prepare for this question. Data engineers must be fluent in the programming languages and data management systems used throughout the industry, such as Apache Hadoop.
↳
Arrays are more efficient for accessing elements, while linked lists are better for inserting or deleting elements. The choice between the two data structures depends on the specific requirements of the problem being solved.
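To make the trade-off concrete, here is a minimal Python sketch (the Node class is illustrative, not from the original answer): indexing an array-backed list is O(1) but inserting at the front shifts every element, while a linked list prepends in O(1) but must walk nodes to reach a position.

class Node:
    # One cell of a singly linked list
    def __init__(self, value, next=None):
        self.value = value
        self.next = next

arr = [10, 20, 30]
print(arr[1])         # O(1): direct index into contiguous storage
arr.insert(0, 5)      # O(n): every later element shifts one slot right

head = Node(10, Node(20, Node(30)))
head = Node(5, head)  # O(1): prepend by rewiring a single pointer
node = head
for _ in range(2):    # O(n): reaching position 2 walks the nodes
    node = node.next
print(node.value)     # 20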
↳
Stacks and queues differ in their order of processing, their operations for adding and removing elements, and their usage scenarios. The choice between the two data structures depends on the specific requirements of the problem being solved.
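A short illustration of the two processing orders, using Python's built-in list as a stack and collections.deque as a queue:

from collections import deque

# Stack: last in, first out
stack = []
stack.append(1)
stack.append(2)
stack.append(3)
print(stack.pop())      # 3 -- the most recently added element leaves first

# Queue: first in, first out
queue = deque([1, 2, 3])
print(queue.popleft())  # 1 -- the earliest added element leaves first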
↳
A hash table is a data structure that allows for efficient insertion, deletion, and lookup of key-value pairs. It is based on the idea of hashing, which involves mapping each key to a specific index in an array using a hash function. The hash function takes a key as input and returns an index into the array.

In order to handle collisions (when two or more keys map to the same index), some form of collision resolution mechanism is used, such as separate chaining or open addressing. In separate chaining, each index in the array is a linked list, and each key-value pair is stored in a node in the corresponding linked list. When a collision occurs, the new key-value pair is added to the end of the linked list at the corresponding index. In open addressing, when a collision occurs, a different index in the array is searched for to store the new key-value pair. There are several techniques for open addressing, such as linear probing, quadratic probing, and double hashing.

Hash tables have an average-case time complexity of O(1) for insertion, deletion, and lookup operations, making them a highly efficient data structure for many applications, such as database indexing, caching, and compiler symbol tables. However, their worst-case time complexity can be as bad as O(n) in rare cases, such as when there are many collisions and the hash table needs to be resized.
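As a rough sketch of the separate-chaining scheme described above (the class name, bucket count, and methods are illustrative assumptions, not a production design):

class ChainedHashTable:
    # Toy hash table: each bucket holds a list of (key, value) pairs,
    # so colliding keys simply share a bucket.
    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))       # new key, or a collision

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put('a', 1)
table.put('b', 2)
print(table.get('a'))  # 1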
↳
P(A) = 0.6, P(B) = 0.8. P(A ∪ B) = P(A) + P(B) - P(A and B) = 0.6 + 0.8 - 0.48 = 0.92 [independent events]. That is, there is an 8% probability the item is from neither A nor B, so there is a 92% probability the item listed on the website is from A or B. This assumes that the items available in A and B are listed on the website. We do not have the conditional probability of an item being found on the website given that it is from A or B; P(W|A) and P(W|B) are not available. Hence, there is insufficient data to answer what the probability is that the item is found on the website.
↳
There is insufficient data to infer the probability that an item is on Amazon given only that the probability of the item being in A is 0.6 and the probability of it being in B is 0.8. Are A and B associated with Amazon? Is the vendor who owns the item an exclusive Amazon vendor? Is the item easily available?
↳
The simple explanation is to multiply the probabilities of the item not being found in either source (0.4 * 0.2 = 0.08). So the probability of the item appearing in neither is 8%, which means there is a 92% probability of it appearing in at least one.
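Both routes to 92% can be checked with a few lines of Python (the variable names are illustrative):

p_a, p_b = 0.6, 0.8

# Inclusion-exclusion, assuming independence: P(A or B) = P(A) + P(B) - P(A)P(B)
print(p_a + p_b - p_a * p_b)      # ≈ 0.92

# Complement route: 1 - P(in neither) = 1 - 0.4 * 0.2
print(1 - (1 - p_a) * (1 - p_b))  # ≈ 0.92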
↳
def separate(x):
    # Flattens one level of nesting: {'a': {'b': 1}} -> {'a_b': 1}
    y = {}
    for key, value in x.items():
        for key1, value1 in value.items():
            y[key + '_' + key1] = value1
    return y
↳
x = {'a': {'b': 'c', 'd': {'e': {'f': 'g'}}}}

def get_key_val(arg_dict):
    # Recursively flattens nested dicts, joining keys with '_'
    return_dict = {}
    for arg_key, arg_val in arg_dict.items():
        if isinstance(arg_val, dict):
            for in_key, in_val in arg_val.items():
                for return_key, return_val in get_key_val({arg_key + '_' + in_key: in_val}).items():
                    return_dict[return_key] = return_val
        else:
            return_dict[arg_key] = arg_val
    return return_dict

get_key_val(x)
↳
def flatten_json(j):
    # Recursively flattens nested dicts; path keys are joined with '_'
    flattened_json_table = {}
    def helper(i, j, s=""):
        if type(j) is dict:
            for x, y in j.items():
                helper(x, y, s + "_" + x)
        else:
            flattened_json_table[s[1:]] = j  # s[1:] strips the leading '_'
    helper(None, j)
    return flattened_json_table
↳
select ColumnA, ColumnB
from (
    select ColumnA, ColumnB, ColumnC,
           ROW_NUMBER() OVER (PARTITION BY ColumnA ORDER BY ColumnC desc) as rn
    from Table
) A
where rn = 1
↳
1:
select A from T1
where B in (select max(B) from T1 group by C)

2:
select B from T1 p
where C = (select max(C) from T1 q where q.A = p.A)
↳
select b
from t
inner join (select a, max(c) as c from t group by a) as grouped_t
    on t.a = grouped_t.a and t.c = grouped_t.c
↳
SQL
1)
with c1 as (
    select product_id, customer_id, count(distinct purchase_date) as number_of_times
    from temporary.purchase
    group by product_id, customer_id
)
select product_id, count(distinct customer_id) as number_of_customers, number_of_times
from c1
group by product_id, number_of_times
order by product_id, number_of_customers desc;

2) a) Use Lag/Lead functions
   b) Not able to answer
↳
For Question No 2:

create table daily_usage (account_id number, usage number, pur_date date);
insert into daily_usage values (1, 10, to_date('10-Jan-2017','DD-MON-YYYY'));
insert into daily_usage values (1, 20, to_date('11-Jan-2017','DD-MON-YYYY'));
insert into daily_usage values (1, 30, to_date('12-Jan-2017','DD-MON-YYYY'));
insert into daily_usage values (1, 50, to_date('13-Jan-2017','DD-MON-YYYY'));
commit;

select today.account_id, today.usage now, prev.usage prev
from daily_usage today
left outer join daily_usage prev on today.pur_date = prev.pur_date + 1
order by now;
↳
Answer to second question:

;WITH CTE as (
    Select *, Row_number() over (order by amount) NewId
    From #Usage
)
Select a.*, b.amount, c.amount
From CTE a
Left join CTE b on a.NewId = b.NewId + 1
Left join CTE c on a.NewId = c.NewId - 1
↳
x = [10, 20, 30, 10]
total = x[0]
arr = [total]
for i in range(1, len(x)):
    total = total + x[i]
    arr.append(total / (i + 1))
print(arr)
↳
num <- c(10, 20, 30, 10)
output <- c()
for (i in 1:length(num)) {
    moving_avg <- sum(num[1:i]) / i
    output[i] <- moving_avg
}
print(output)
↳
x = [10, 20, 30, 10]
for i in range(0, len(x)):
    # Note: this formula happens to reproduce the running average for
    # this particular input; it is not a general solution.
    print((float(x[i]) / (i + 1)) + (5 * i))
↳
text = open('data/file.txt', 'r')
word_dict = {}
for line in text:
    for word in line.split():
        if word.lower() not in word_dict:
            word_dict[word.lower()] = 1
        else:
            word_dict[word.lower()] += 1
for i, j in word_dict.items():
    print(i + ' : ' + str(j))
↳
def count_occurrence(s):
    # Counts how many times each word appears in the string
    d = {}
    for word in s.split(' '):
        if word not in d:
            d[word] = 1
        else:
            d[word] += 1
    return d
↳
def Count_the_word(se, word):
    b = se.split()
    c = b.count(word)
    return c
↳
def common(lst1, lst2):
    res = []
    for i in lst1:
        if i in lst2:
            res.append(i)
            lst2.remove(i)
    res.sort()
    return res
↳
a, b = [1, 2, 3, 3, 1, 1, 1], [1, 1, 2, 2, 3]
common = []
for x in a:
    if x in b:
        b.remove(x)
        common.append(x)
print(sorted(common))
↳
a = sorted([1, 2, 3, 3, 1, 1, 1])
b = sorted([1, 1, 2, 2, 3])
res = []
i, j = 0, 0
# Two-pointer walk over the sorted lists
while i < len(a) and j < len(b):
    if a[i] == b[j]:
        res.append(a[i])
        i += 1
        j += 1
    elif a[i] > b[j]:
        j += 1
    else:
        i += 1
res
↳
I cannot share the specifics as I have signed the NDA. All the best.
↳
Hi, could you please help me with the interview questions? I have an interview scheduled in a few days and need to prepare. This would be a great help, and I would really appreciate your guidance.
↳
plz!!!!!!
↳
With cte as (
    select player_id, min(login_date) + 1 as next_day
    from player_details
    group by player_id
)
Select count(distinct cte.player_id) / count(distinct xyz.player_id) as retention_rate
from player_details xyz
left join cte on xyz.player_id = cte.player_id and xyz.login_date = cte.next_day
↳
-- calculating retention rate for the users
drop table if exists playerinfo;
create table playerinfo (player_id int, login_date timestamp);
insert into playerinfo values
    (1,'2012-01-01'), (1,'2012-01-02'), (1,'2012-01-03'),
    (2,'2012-01-02'), (2,'2012-01-04'), (2,'2012-01-05'),
    (3,'2012-01-02'), (3,'2012-01-03'), (3,'2012-01-05');

with cte as (
    select player_id, login_date,
           date_part('day', login_date - min(login_date) over (partition by player_id)) as loggedinafterfirstday
    from playerinfo
    group by player_id, login_date
)
select (select count(distinct player_id) from cte where loggedinafterfirstday = 1) /
       (select count(distinct player_id) from cte)::float as firstdayretentionrate
↳
create table playerinfo (player_id int, login_date date);
insert into playerinfo values
    (1,'2012-01-01'), (1,'2012-01-02'), (1,'2012-01-03'),
    (2,'2012-01-02'), (2,'2012-01-04'), (2,'2012-01-05'),
    (3,'2012-01-02'), (3,'2012-01-03'), (3,'2012-01-05');

with next_val AS (
    select player_id, login_date,
           lead(login_date) over (partition by player_id order by player_id, login_date) as nextday
    from playerinfo
)
SELECT distinct a.player_id, a.login_date, nextday
from next_val, playerinfo a
where dateadd(day, 1, a.login_date) = nextday