VMware Interview Question: How do you validate your data... | Glassdoor

Interview Question

Senior Data Analyst Interview Palo Alto, CA

How do you validate your data

2 Answers

Check whether it makes sense when compared against a different source.

Interview Candidate on Oct 3, 2011

1. Look at the keys (UPC, SSN, driver's ID, bank account number, etc.). Identify which keys should be unique and check whether any of them contain NULL values. If they do, those are most likely bogus values.

2. Look at the time columns and check whether the times make sense. Values far in the past or future, say 1900 or 2500, are most likely placeholders. Ask why they are there, and see if you can find other discrepancies.

3. Look at the facts in your data. Calculate the mean and median, identify the outliers, and check whether they are valid or so far off that they are most likely transcription or data collection errors. Build graphs to visualize them.

4. Sum up all facts and check whether the totals are reasonable. For example, if you have a sales-by-item-by-day file, sum the sales for a day and check against some other source whether that figure makes sense. Say VMware sells 10 licenses a day on average; if your file says it sold 1,000 on a given day, then there is something wrong with the data in your file.
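
The four checks above can be sketched in a few lines of Python. This is only a rough illustration, not a definitive implementation: the sample rows, the column layout (UPC, sale date, units sold), the plausible date window, and the "10 times the expected daily volume" threshold are all assumptions made up for the example. For the outlier check it swaps in a median absolute deviation (MAD) rule with the commonly used 3.5 cutoff, because a single extreme value distorts the mean.

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical sample rows: (upc, sale_date, units_sold).
ROWS = [
    ("0001", date(2015, 6, 1), 9),
    ("0002", date(2015, 6, 1), 11),
    (None,   date(2015, 6, 2), 10),    # missing key
    ("0003", date(1900, 1, 1), 10),    # placeholder-looking date
    ("0004", date(2015, 6, 3), 1000),  # suspicious spike
    ("0005", date(2015, 6, 4), 12),
    ("0006", date(2015, 6, 5), 8),
]


def null_key_rows(rows):
    """Check 1: rows whose key is missing."""
    return [r for r in rows if r[0] is None]


def implausible_date_rows(rows, earliest=date(2000, 1, 1), latest=date(2100, 1, 1)):
    """Check 2: rows dated outside an assumed plausible window."""
    return [r for r in rows if not (earliest <= r[1] <= latest)]


def outlier_rows(rows, cutoff=3.5):
    """Check 3: rows far from the median, using a modified z-score (MAD rule)."""
    values = [r[2] for r in rows]
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [r for r in rows if 0.6745 * abs(r[2] - med) / mad > cutoff]


def suspicious_days(rows, expected_per_day, factor=10):
    """Check 4: days whose total exceeds `factor` times the expected volume."""
    totals = defaultdict(int)
    for _, day, units in rows:
        totals[day] += units
    return sorted(d for d, total in totals.items() if total > factor * expected_per_day)


if __name__ == "__main__":
    print("Null keys:", null_key_rows(ROWS))
    print("Implausible dates:", implausible_date_rows(ROWS))
    print("Outliers:", outlier_rows(ROWS))
    print("Suspicious days:", suspicious_days(ROWS, expected_per_day=10))
```

The expected daily volume passed to `suspicious_days` should come from an independent source, as in point 4; here it is simply the "10 licenses a day" figure from the example. Anything these functions flag is a candidate for a closer look, not proof of an error.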

This is a limited means of validation, and validation is the toughest aspect of data analysis. There are multiple biases associated with data (e.g., measurement bias: paying too much attention to the things you can measure), and in ambiguous situations we don't know how much proof is proof enough.

Some good books:

http://www.amazon.com/Bad-Data-Handbook-Cleaning-Back/dp/1449321887

http://www.amazon.com/Naked-Statistics-Stripping-Dread-Data/dp/0393071952/ref=tmm_hrd_title_0?_encoding=UTF8&sr=&qid=

The chapter on biases is really good.

Anonymous on Jun 20, 2015
