Data Scientist Personas: What Skills Do They Have and How Much Do They Make?
Data scientist is a job in high demand. Boasting a median base salary of $110,000, as well as a job satisfaction score of 4.4 out of 5, it is no wonder that it has claimed the top spot on Glassdoor’s Best Jobs in America list in 2017 and 2016. Despite its increasing popularity, what do we know about what it takes to be a data scientist? What are the responsibilities or the skills required to do the job? And, how much do data scientists earn?
In this study, we use data from Glassdoor to shed light on these questions. The data show that the landscape of data science jobs is indeed characterized by a wide diversity of skills. We find that data science jobs can be grouped into three main personas: core data scientists, researchers, and big data specialists. We then break down the skills required and the differences in average pay across the three groups.
What We Did
We started by extracting a sample of about 10,000 data scientist job listings from the millions of jobs listed on Glassdoor from January through July, 2017. Along with job descriptions, we looked at the estimated median base pay by job to better understand the market value of different kinds of data science skills.(1)
Next, we constructed a skills “dictionary” containing more than three dozen separate coding skills relevant to data science.(2) Armed with this dictionary, we searched job description text for the skills listed in each job posting. For simplicity, we focused on the job listings containing the ten most frequently mentioned skills:
- Hive, and
This resulted in a sample of 7,785 data science jobs. We then grouped these jobs into “archetypes” of data science roles using a data mining algorithm known as K-modes clustering.(3) This technique groups job postings into similar groups based on the skills present in job descriptions.
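The dictionary search described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: the short SKILLS list, the function names, and the matching rules (whole-token matching, case-sensitive for single-letter skills such as R) are all assumptions made for the example.

```python
import re

# Hypothetical mini-dictionary; the study's dictionary holds more than three dozen skills.
SKILLS = ["Python", "R", "SQL", "Hadoop", "Java", "SAS", "Spark", "Matlab", "Hive"]

def build_patterns(skills):
    """Compile one whole-token regex per skill."""
    patterns = {}
    for skill in skills:
        # Single-letter skills such as "R" are matched case-sensitively so that
        # ordinary lowercase letters are not counted as mentions.
        flags = 0 if len(skill) == 1 else re.IGNORECASE
        # Lookarounds stop "Java" from matching inside "JavaScript"
        # and "R" from matching inside longer words.
        patterns[skill] = re.compile(
            r"(?<![A-Za-z0-9+#])" + re.escape(skill) + r"(?![A-Za-z0-9+#])",
            flags,
        )
    return patterns

def extract_skills(text, patterns):
    """Return a 0/1 indicator per skill for one job description."""
    return {skill: int(bool(p.search(text))) for skill, p in patterns.items()}

patterns = build_patterns(SKILLS)
example = "We need Python, R and SQL; Spark is a plus. JavaScript not required."
row = extract_skills(example, patterns)
print(row)
```

Each job description becomes one row of 0/1 indicators, which is the kind of input a clustering step on skills would need.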
Let’s have a look at the landscape of skills required for today’s data science jobs.
The Data Science Skills Landscape
Which skills are most common in data science job postings? Which skills appear together most often? These are key questions that any job seeker looking to enter the field of data science will want answered.
A useful way to show what types of skills are common in data science job postings today is through a co-occurrence network graph. The graph, which is shown below in Figure 1, has two main features. The first is the size of the circle, or node, which is proportional to the number of job descriptions that included that skill. The second feature is the lines connecting each node, which grow wider and darker if the two skills they connect are commonly listed together in the same job description. In other words, a large circle will indicate that the skill is popular, and circles connected by thick lines will indicate that the two skills often appear together in the same job description.
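The two quantities behind such a graph, node size and edge weight, are simple counts. The sketch below shows one way to compute them from a list of postings; the toy postings and the function name are hypothetical, and drawing the graph itself is left out.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(postings):
    """postings: list of sets of skill names.

    Returns (node_counts, edge_counts), where node_counts[skill] is the number
    of postings listing the skill and edge_counts[(a, b)] is the number of
    postings listing both a and b (with a < b alphabetically).
    """
    node_counts = Counter()
    edge_counts = Counter()
    for skills in postings:
        node_counts.update(skills)
        # Sorting gives every unordered pair a single canonical key.
        for pair in combinations(sorted(skills), 2):
            edge_counts[pair] += 1
    return node_counts, edge_counts

# Toy data for illustration only; not taken from the study's sample.
toy_postings = [
    {"Python", "R", "SQL"},
    {"Python", "Spark", "Hadoop"},
    {"Python", "R"},
    {"R", "SAS"},
]
nodes, edges = cooccurrence(toy_postings)
print(nodes.most_common(3))
print(edges.most_common(3))
```

In a plot, node_counts would set the size of each circle and edge_counts the width and darkness of each connecting line.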
Figure 1: Co-Occurrence Network Visualization of Data Science Skills
The most notable aspect of the above graph is the size of the three largest nodes, which dominate the landscape: Python, R, and SQL. Python is the most common skill, listed in 72 percent of job descriptions, with R close behind at 64 percent and SQL at 51 percent. Not surprisingly, nine out of every 10 job postings in our sample require at least one of these skills.
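Both kinds of figures quoted above, the share of postings listing a given skill and the share listing at least one of the core skills, are straightforward proportions. A small sketch follows, using made-up postings rather than the study's data; the function names are hypothetical.

```python
def share_with_skill(postings, skill):
    """Fraction of postings that mention the given skill."""
    return sum(skill in skills for skills in postings) / len(postings)

def share_with_any(postings, core):
    """Fraction of postings that mention at least one skill in `core`."""
    return sum(bool(skills & core) for skills in postings) / len(postings)

# Toy data for illustration only; not taken from the study's sample.
toy_postings = [
    {"Python", "R", "SQL"},
    {"Python", "Spark", "Hadoop"},
    {"Python", "R"},
    {"R", "SAS"},
    {"Java", "Hadoop"},
]
python_share = share_with_skill(toy_postings, "Python")
core_share = share_with_any(toy_postings, {"Python", "R", "SQL"})
print(f"Python: {python_share:.0%}, any core skill: {core_share:.0%}")
```

Note that the "at least one" share cannot be obtained by adding the individual shares, because many postings list more than one core skill.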
However, the above graph also reveals the popularity of several other skills like Hadoop (39 percent of jobs), Java (33 percent), and SAS (30 percent). The full list of skills and their prevalence in job descriptions can be found below:
Table 1. The Ten Most Common Data Science Skills in Job Postings
Skills That Go Hand-in-Hand
The lines that connect skills in Figure 1 also provide some valuable insights. For example, the three most popular data science skills — Python, R, and SQL — are closely interconnected, as reflected by the width of the lines between them. The popularity of those three skills, along with their interconnectedness, makes them the bread and butter skills that every data science job seeker should know.
Further, there are close ties between many other skills shown in Figure 1. For instance, Java is closely connected with R and Python, and SAS is closely connected with R. Similarly, Hadoop, Spark, and Python all commonly appear together in job postings. For job seekers looking for advice on which skills to learn, this analysis shows which “bundles” of data science skills are best learned together, and which skills are unrelated to each other.
Data Science Personas
Next, we explored the connections between skills in a more systematic way. Using what is known as a “clustering” algorithm, we identified three main types — or “personas” — of data scientists that appear in job listings today. Clustering is a machine learning algorithm that organizes objects into groups so that the members of a group are similar to each other. In our case, we grouped job descriptions into personas based on the actual skills listed in them by employers.
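To make the idea concrete, here is a minimal K-modes sketch on binary skill indicators. It is an illustration under stated assumptions, not the study's implementation: it seeds the clusters with the first k distinct rows (a simplification, not the Cao et al. initialization cited in the footnotes), breaks ties toward the smaller value, and runs on invented toy rows.

```python
from collections import Counter

def hamming(a, b):
    """Number of positions at which two rows differ."""
    return sum(x != y for x, y in zip(a, b))

def column_modes(rows):
    """Most frequent value in each column; ties go to the smaller value."""
    modes = []
    for col in zip(*rows):
        counts = Counter(col)
        best = max(counts.values())
        modes.append(min(v for v, c in counts.items() if c == best))
    return tuple(modes)

def k_modes(rows, k, max_iter=100):
    """Cluster tuples of categorical values; returns (labels, centers)."""
    # Assumption: seed with the first k distinct rows (NOT the Cao et al. method).
    centers = []
    for row in rows:
        if row not in centers:
            centers.append(row)
        if len(centers) == k:
            break
    if len(centers) < k:
        raise ValueError("need at least k distinct rows")
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest center by Hamming distance.
        new_labels = [
            min(range(k), key=lambda j: hamming(row, centers[j])) for row in rows
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each center becomes the column-wise mode of its members.
        for j in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == j]
            if members:
                centers[j] = column_modes(members)
    return labels, centers

# Toy rows; columns are (Python, R, SQL, SAS, Spark, Hadoop). Illustration only.
toy_rows = [
    (1, 1, 1, 0, 0, 0),
    (1, 1, 0, 1, 0, 0),
    (1, 0, 0, 0, 1, 1),
    (1, 1, 1, 0, 0, 0),
    (0, 1, 1, 0, 0, 0),
    (1, 1, 0, 1, 0, 0),
    (0, 1, 0, 1, 0, 0),
    (1, 0, 0, 0, 1, 1),
    (1, 0, 1, 0, 1, 1),
]
labels, centers = k_modes(toy_rows, k=3)
print(labels)
print(centers)
```

The final centers are the "modes" of each group: a 1 in a column means that most postings in that group list the corresponding skill, which is how a cluster can be read as a persona.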
After grouping job postings into three groups, the algorithm characterizes them by showing which skills are most likely to appear in job descriptions for each.(4) We then use Glassdoor’s salary estimates to calculate average estimated pay for each group in order to illustrate which skills — or groups of skills — have a higher market value.(5) The key results are shown below in Table 2.
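Once every posting carries a cluster label and a pay estimate, the per-group summary is a grouped average. The sketch below uses hypothetical labels and salary figures, not Glassdoor's estimates, and the function name is an assumption for the example.

```python
from collections import defaultdict

def summarize_clusters(labels, pay):
    """Return {cluster: (share_of_postings, mean_pay)}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label, salary in zip(labels, pay):
        totals[label] += salary
        counts[label] += 1
    n = len(labels)
    return {
        label: (counts[label] / n, totals[label] / counts[label])
        for label in counts
    }

# Hypothetical cluster labels and pay estimates, for illustration only.
toy_labels = [0, 0, 1, 2]
toy_pay = [100_000, 120_000, 90_000, 130_000]
summary = summarize_clusters(toy_labels, toy_pay)
for label, (share, mean_pay) in sorted(summary.items()):
    print(f"cluster {label}: {share:.0%} of postings, mean pay ${mean_pay:,.0f}")
```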
Table 2: Three Data Scientist Personas and What They Earn
In the above table, the most common type of data scientist is type one, which we’ve labeled “core data scientist.” Seventy-one percent of data science job postings fall into this category. Type one jobs are likely to require three core skills: Python, R, and SQL. This large cluster comes as little surprise, given how popular those three data science skills are today. The average estimated salary for this type of data science job is about $116,000 per year. A few examples of companies hiring for these types of roles today are Google, Aetna, and Microsoft.
The second most popular data science persona is type two, which we’ve labeled “the researcher.” These jobs comprise 15 percent of all data science job postings in our sample. These roles are likely to require core skills like Python and R, as well as other skills often used by think tanks, consultants, and academic researchers like SAS and Matlab. Jobs in this group have an average estimated salary of around $112,000 per year, about $4,000 less than our “core data scientist” group. A few examples of employers hiring for these roles today are KPMG, Bank of America, and Allstate.
Finally, the third type of data science persona is type three, which we’ve labeled “big data specialist.” These roles are the least common in our sample, comprising 14 percent of data science job postings. However, while these roles are less common, they are also the most highly paid of the three groups, with an average estimated pay of about $121,000 per year — $5,000 more than our “core data scientist” group. Aside from core skills like Python, this type of data scientist job requires more specialized skills used for big data analysis like Spark, Hive, and Hadoop. A few examples of companies hiring for these roles today are Experian, Amazon, and Zillow.
Bottom Line for Job Seekers and Employers
Because data science is such a new field, there’s very little consensus on what work data scientists do or what skills the job requires. Based on our analysis, this lack of consensus is partly because there isn’t just one type of data scientist: there are several distinct personas within the field today, each requiring different skills and earning different pay.
There are two main takeaways from our analysis. First is that there are three core skills that are critical for most data science jobs today: R, Python, and SQL. These three skills form the core of the data science skill set and at least one of them is present in nine out of 10 job postings on Glassdoor. Second is that there are potentially important differences in pay across the three types of data science jobs, with a gap of $9,000 per year on average between the lowest and highest paid data scientist jobs today.
For job seekers looking for data science roles, this suggests, first, that mastering just three core skills (Python, R, and SQL) can provide a solid foundation for seven out of 10 job openings in data science today. Second, it suggests that the skills you bring to the table can have a big impact on pay. Some skills have a higher market value than others, and our analysis above can help job seekers make a smart cost-benefit calculation about which skills are worth spending time learning and which aren’t.
For employers, our analysis suggests they should think carefully about what skills they need from a data scientist before hiring. Does your company need a big data expert, a core data scientist, or a researcher? The answer to that question will determine which skills you need, and what you should budget to pay for them in today’s data science job market. When writing job descriptions, employers may want to focus on only the skills they need, rather than an exhaustive list. Doing so may help attract better-fit candidates and improve the match between candidates and open data science positions.
(1) More information on Glassdoor salary estimates can be found here: http://help.glassdoor.com/article/What-are-Salary-Estimates-in-Job-Listings.
(2) In order to construct the skills dictionary, we draw from previous work by Jesse Steinweg-Woods (https://jessesw.com/Data-Science-Skills/) and Yuanyuan Shi (https://github.com/yuanyuanshi/Data_Skills).
(3) K-modes, introduced by Huang (1998), is an adaptation of the widely used K-means algorithm to categorical data, such as the binary skill indicators used here. More information on this technique can be found here: http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf. In order to initialize the algorithm, we use the method outlined in Cao et al. (2009), which can be found here: http://www.sciencedirect.com/science/article/pii/S089812210900323X. Importantly, this initialization method does not depend on a random seed.
(4) The number of groups is a parameter of the algorithm which is set by the user and not decided upon by the algorithm itself.
(5) It is worth noting that our results are only illustrative of the field of data science. We cannot shed light on the value of these different skill sets in other occupations, like software engineering.
Cao, F., Liang, J., Bai, L., “A new initialization method for categorical data clustering,” Expert Systems with Applications 36(7) 2009, pp. 10223-10228.
Huang, Z., “Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values,” Data Mining and Knowledge Discovery 2(3) 1998, pp. 283-304.