Phone Screen 1. Asked general questions about me, what do I know about the company? 2. Provided information about the company 3. Asked about projects I have worked on before. 4. Technical screening from hiring team, rapid fire questions a. In a relational database, which operator is used to get the unique values of a column Ans: DISTINCT b. Which process would benefit from columnar data. MCQ: A. OLTP B. OLAP C. Quick read Ans: B (OLAP) - think data warehouse and analytics. Can ignore whole columns that aren’t needed. c. In python, which data structure consists of a key that can be looked up. Ans: dictionary d. Given an ordered list, what would be the time complexity of doing binary search on it? Ans: O(log n) - keyword is “ordered” here. Hiring Manager Interview 1. What am I looking for in my next role? 2. Provided information about company and team. Talked about the perks of being part of that team in particular. Lots of learning opportunities which is great. 3. Asked about a project I was proud of/successful in. a. Followed up with technical questions on it, for example, why did you choose to architect it that way versus another way. 4. I talked a lot about bronze and silver tables (medallion architecture in Databricks) a. Followed up with technical question: should a bronze table be read optimized or write optimized? Ans: write optimized - we are capturing large volumes of data from potentially multiple sources. 5. Asked about if I am comfortable with technologies like AWS, databricks, python, spark 6. Asked if my vanilla python skills are good? (vanilla as in not pyspark...or similar) Coding Challenge 1. Quick introduction then presented me with the problem. 2. Problem snippet Q1: Overall Goal: You want to pre-process this data such that you can look up a loan_id and transaction_id and efficiently get the latest data returned Context: The data is unique by loan_id-transaction_date Assumptions: for a loan_id, a row that comes later has the most update data, previous row can be ignored For example if we look up loan_id=123 and transaction_date='2022-12-01', we get 100.0 and 'def' back. file = [ (123, 50.0, '2022-11-01', 'xyz'), (234, 70.0, '2022-10-01', 'abc'), (123, 100.0, '2022-12-01', 'def'), (234, 50.0, '2022-11-01', 'xyz'), (567, 30.0, '2022-09-01', 'abc'), (123, 70.0, '2022-11-01', 'xyz'), ] # Q2: Which (primitive) data structure would you use to pre-process the data above to make it efficient to retrieve a certain loan_id on a certain date? Ans: dictionary (because of O(1) lookup) # Q3: Let's assume that any loan_id and transaction_date desired are present in the data, how does this change your approach or simplify the problem? Ans: This simplifies the problem because we can directly lookup the key we want # Q4: Let's assume that the above assumption is NOT true, that is we might need to lookup dates that are not present in the data and we want to find the closest date seen so far. For example if we look up loan_id=123 and transaction_date='2022-11-15', we get 70.0 and 'xyz' because the closest lower-bound date is '2022-11-01'. In the data, there are two records for that loan_id and transaction_date but we take the most recent one (the lower one in the data) Let's actually get to coding it now import pytest def file_preprocessing(file): transaction_dict = {} loan_date_dict = {} for loan_id, amount, transaction_date, application_id in file: if loan_id in transaction_dict: transaction_dict[loan_id][transaction_date] = (amount, application_id) loan_date_dict[loan_id].add(transaction_date) else: transaction_dict[loan_id] = {transaction_date: (amount, application_id)} loan_date_dict[loan_id] = {transaction_date} return transaction_dict, loan_date_dict def get_transaction(loan_id, transaction_date, transaction_dict, loan_date_dict): closest_date = max({tr_d for tr_d in loan_date_dict[loan_id] if tr_d <= transaction_date}) return transaction_dict[loan_id][closest_date] def test_file_processing(): assert len(file_preprocessing(file)) == 3 def test_file_processing_type(): assert isinstance(file_preprocessing(file), dict) def test_file_processing_1(): assert len(file_preprocessing(file)[123]) == 2 def test_get_transaction(): # did not reach this part pass if __name__ == "__main__": pytest.main()