Homework: SQL Joins, Salaries, and Baby Names

This is a walkthrough as exercise, trying out some SQL JOINs, and slight practice on data-wrangling outside of the database.

Deadline

Thursday, November 3 at 1:30 PM

Points

5 Homework points

Deliverables

A Google doc named sql-joins-salaries with the SQL and results for every query (screenshot if you must)
Every query gets its own page.
On the cover page, list what city/jurisdiction you used.
Here’s a simple example

Requirements

Create a new Google Document (i.e. Word document) and name the file: sql-joins-salaries
Choose a California city/jurisdiction and download its 2015 salary data from Transparent California.
Clean the payroll data (spreadsheet recommended) so that it has a first_name column
Import it into a table in your own SQLite database
Answer the queries at the end of the walkthrough, paste the answers (SQL and results table) into the Word doc.

The database of gendered baby names

For this homework, you'll probably want to use this provided database:

thr-gendered-babynames.sqlite

You'll need the gendered_babynames table; you can ignore the thr_power_list table.

For context, read the lesson /tutorials/sql-joins-example-thr-power-gender

Remember that you have to produce a Google Doc with all the code/answers to the query at the bottom.

Thinking about salary data

This exercise is meant to acquaint you with the characteristics and quirks of salary data. The queries are meant to be straight forward, but you should be pondering the kinds of things you could analyze if you had more than one year's worth of one city's data to look at…

If you pick a city that has seemingly outrageous overtime or benefits packages, an easy approach is to see how much of an outlier they are among other cities and across years.

Transparent California

While California state publishes payroll data at publicpay.ca.gov, Transparent California pushes the ball by requesting and publishing payroll data with names on it. The difference in structure, and pluses and minuses between the 2 datasets could be its own lecture.

But it's enough to say that having names with the data allows for more worthwhile analysis, even if you don't personally know anyone on payroll.

Here's the top-paid Palo Alto city employees from TC's 2015 upload:

Downloading bulk data

TC's website is useful and valuable enough on its own. But they go the extra mile by providing bulk CSV files for download. For any given city or entity, there should be a Download records, which reveals links to each year's CSV for a given city:

For this homework assignment, download the latest dataset from a city, preferably 2015. Unfortunately, the data format changed over the years, and this is to brief an exercise to get into the fun of cleaning/normalizing legacy data.

Breaking names apart

To be sure, names are way more complicated than we've been led to believe. Naming things (and dealing with how others have named things) will likely be the hardest part of your data science/computational journalism life.

That said, one of my biggest pet peeves in day-to-day spreadsheets is when people don't take the time to split out the first and last name (if not the middle). This is what's going on in payroll data hosted by Transparent California:

At a glance, it seems we can probably get what we want by assuming that for most rows, the first name and last name are split by a space (though there are obvious exceptions, such as Mary Jo Smith). To get separate first and last name columns we just need to apply a function to that name column.

Unfortunately, we won't be doing it in SQLite.

Wrangling data using a spreadsheet

SQLite is an amazing database (checkout this Changelog podcast with its creator) and will likely outlive all of us. That said, it most definitely is not a tool for the efficient wrangling and cleaning of data.

That it lacks a SPLIT function is reason enough to export data from SQLite into spreadsheet land.

We'll get into SQLite soon, but first, let's do data work with Google Sheets. Up

Splitting and making room

This should feel familiar now (reference for a refresher). Something like:

  =SPLIT(A2, " ")

However, SPLIT is a little funky in that executing it will spill its results across adjacent cells. If cell A had 8 spaced words, then 8 cells in that row are going to get overwritten.

Since some names may have many spaces, we don't know how many columns to insert. The easiest way to get around this is to do the SPLIT at the other end of the spreadsheet (i.e. not next to the A) column.

Then you have all the room in the world:

Cutting out the dead weight

If you scroll, you'll see that names are indeed more complicated than just first name and last name:

Databases, as we know, aren't know for being flexible. It's going to be a real pain to deal with n-number of partial names.

So let's make it easy. Let's denote the first part of the name to be the first_name column. And then delete everything else. Cleaning up names requires more sophisticated tools (mostly, determination), which is not important to us now.

So let's make the import-to-database as easy as possible.

Make sure all columns to the right of the first_name column are wiped out:

Google Sheets quirk: The undying column

If you're like me, in which the first_name column is column M, and which also holds the SPLIT formula, you'll find that columns N onward will forever regenerate.

That's because column M contains a formula, which is always executing, including filling out columns N and so forth.

The easiest way to fix this (replace the column with literal values):

1. Highlight your `first_name` column

2. Select Copy and copy the entire column

Leave the column highlighted.

This should overwrite first_name, but with the same values:

If you inspect the first_name cells, you'll see that it contains just literal values – no formulas or other meta-spreadsheet content.

Export to CSV

Finally, export and download your salaries table as CSV.

Importing into SQLite

By this point, you should have downloaded the database with the table of gendered_names: thr-gendered-babynames.sqlite

I recommend using DB Browser as your client. Open up the above-downloaded SQLite database. You should see 2 tables, of which only gendered_names is relevant.

The import popup should be straightforward. I'm assuming you named your table salaries:

SQLite and its painful love of TEXT

OK, this might be the most painful part of the lesson. Depending on what was in your spreadsheet, and how it was imported/exported, DB Browser's import process may be inclined to every column as TEXT, even the ones dealing with salary.

What does this have to do with you? Try doing a sort by money, such as total_pay_benefits

(note that I alias the cumbersome column name to total_pay_benefits because who knows how SQL will interpret the ampersand)

SELECT EmployeeName, JobTitle, 
  "TotalPay&Benefits" AS total_pay_benefits
FROM salaries
ORDER BY total_pay_benefits DESC
LIMIT 10;

When numbers are treated as text, they are sorted alphabetically:

employee_name	job_title	total_pay_benefits
Mary Sekator	Administrative Associate	99914.43
Thanh-Tran Nguyen	Payroll Analyst	99778.22
Richard Hirano	Desktop Technician	99720.17
Vincent Vasquez	Gas System Tech	99699.05
Julie Weiss	Environmental Spec	99695.69
Gregory Smith	WQC Plt Oper II	99659.76
Jason Griffin	Recreation Leader II	9948.74
Joe Berkey	Meter Reader	99479.61
Ariel Berson	Prod Arts/Sci Prog	99405.11
Mary Grace Castor	Human Rsrce Asst Cnf	99340.08

Note that a non-Bay-Area resident might look at this dataset and think it was working — 999915 could buy a lot in other parts of the country. But those of you who live here, you know that 99914 is not a high wage for Palo Alto. Or that among Palo Alto's highest paid employees would be a meter reader.

Consider CAST and being explicit

The proper SQLite way to handle this would be to wrap all your number fields in a CAST function, in which you tell SQLite, Hey, this is a number!. Learning CAST is actually nice because it further acquaints you with how things work in computer-land, while giving you more practice with being explicit with how you code.

Here's the SQLite documentation on CAST.

Here it is in action:

SELECT EmployeeName, JobTitle, 
  CAST("TotalPay&Benefits" AS NUMERIC) AS total_pay_benefits
FROM salaries
ORDER BY total_pay_benefits DESC
LIMIT 10;

And there's the big salaries:

EmployeeName	JobTitle	total_pay_benefits
Kenneth Denson	Police Lieut-Adv	477703.45
Mark Venable	Police Captain-Adv	380998.8
Jason Amdur	Fire Cap Haz Mat EMT	378926.32
James Keene	City Manager	349964.47
John Parks	Fire Inspector EMT	334311.31
Douglas Keith	Management Spec	325115.7
Scott Wong	Management Spec	311450.19
Dennis Burns	Police Chief-Adv	309138.22
Michael Verga	Fire Cap Haz Mat EMT	294068.42
Thomas Newland	Fire Inspector EMT	289664.24

Learning CAST and using it regularly will help you in the long run, so don't shun it because of the extra typing…

DBrowser table alteration

That said, people don't really like writing extra function calls if they can help it.

One of the conveniences that DB Browser has, but that SQLite in general does not – is the ability to point-and-click and change a table's structure. This is nice when you've created a table and screwed a small thing up.

It's helpful in our situation because we can alter our salaries table to have proper numeric columns, which is something you can't really do in base SQLite without wiping out the table and re-importing.

That said, it's highly dependent on DB Browser, the fancy client, i.e. it's not a core SQLite skill. Today it works, maybe it won't later on.

You can get started by going to the Database Structure panel and right-clicking the salaries entry:

Then, for each field you want to be a number, change the dropbox from TEXT to NUMERIC:

Hit OK

Remember to save in DB Browser

(just another quirk, don't get bogged down in this detail)

Another piece of trivia for DB Browser that doesn't exist in other SQLite implementations, for better or worse. Anytime you change up a database or table, such as add a table, import data, delete data, etc, be sure to save your work (like a Word document) with Command-S. DB Browser has this feature in which it doesn't write things immediately. For people who are prone to making mistakes, this is great. For longtime SQL users who are used to the database getting things done on command, it can be quite confusing. It shouldn't impact you hugely as you're learning SQL.

Query the payroll

By now, you should have a database file with:

A salaries table, which includes 1 year of data and a first_name column
The gendered_names table.

Here are some casual queries to explore the data.

(These queries assume your salaries data table has proper NUMERIC columns, or if not, I assume you know how to use the CAST function

Instructions

For all of these queries, I want you to type out the queries, get a result. And then, in the Google doc that you turn in, I want to see the data. For most of you, the easiest way to do this is to screenshot the results and paste it into the Word doc.

Some of the queries are already written – that's OK, type them out on your own. Other queries need to be completed.

Number of full time employees

SELECT COUNT(*)
FROM salaries 
WHERE Status = 'FT';

Show the number of Full-Time vs Part-Time employees

Same as above, but WHERE is unnecessary. And, we need a GROUP BY. Here's a start:

SELECT COUNT(*)
FROM salaries
...

The 5 employees who make the most `Total Pay` compared to their base pay

Include a column that shows how high that ratio is
We'll have to filter out part-time employees for this

SELECT EmployeeName, JobTitle, 
  BasePay, TotalPay, 
  (TotalPay / BasePay) AS overbase_ratio
FROM salaries
ORDER BY overbase_ratio DESC
WHERE STATUS = 'FT'
LIMIT 5;

How many full-time employees have a (TotalPay/BasePay) ratio over 1.5?

Similar to the above query, except you just need 1 column:

SELECT COUNT(*)
FROM salaries
...

What are the 5 most common job titles

Just do a GROUP BY on job title:

SELECT GROUP BY(JobTitle), COUNT(*) as jcount
FROM salaries
ORDER BY jcount DESC
LIMIT 5;

Of the employees with 'Police' in the title, what is their average salary, overtime, etc?

Here's a start:

SELECT AVG(OverTimePay), AVG(OtherPay), AVG(Benefits), AVG(BasePay)
FROM salaries
...

The result shouldn't be more than one row.

How many employees are management level

There's no right or wrong answer to this. Look at the various job titles (I'd start with the highest paid people) and see what terms are used, e.g. Chief, City Manager, Director, and so forth.

Your query will probably have a lot of LIKE clauses and OR keywords.

What is the percentage of full-time employees who make over $150,000 in total pay?

This is a nice example of how to use a subquery to store a scalar value (typically, a single number).

The scalar in this case is the total number of FT employees:

SELECT COUNT(*) FROM salaries WHERE Status = 'FT';

In fact, the number of $150K FT employees is a scalar, i.e. just a count:

SELECT COUNT(*) FROM salaries 
WHERE Status = 'FT' AND TotalPay > 150000;

You can plug that query right into the main query that counts the $150K-making employees. But yeah, that's not too pretty:

SELECT 100.0 * COUNT(*) / 
       (SELECT COUNT(*) FROM salaries WHERE Status = 'FT') as ratio
FROM salaries 
WHERE Status = 'FT' 
  AND TotalPay > 150000;

Query and join the payroll

Assuming you're relatively new to JOINS, this section will take it easy. Make sure you do the queries. And pay attention to the details.

List the name, the probable sex, and the total salary + benefits of 5 highest paid employees

SELECT EmployeeName, primary_sex,
  "TotalPay&Benefits" AS tbpay
FROM 
  salaries
INNER JOIN
  gendered_names
    ON gendered_names.name = salaries.first_name
ORDER BY 
  tbpay DESC
LIMIT 5;

Quick tip about specifying tables

Note that it's possible to write the above query more vaguely, i.e. removing the table-prefix for gendered_names.name and salaries.first_name:

SELECT EmployeeName, primary_sex,
  "TotalPay&Benefits" AS tbpay
FROM 
  salaries
INNER JOIN
  gendered_names
    ON name = first_name
ORDER BY 
  tbpay DESC
LIMIT 5;

Why does that work? Because name is only used in gendered_names, likewise, with first_name and salaries. It's generally a good habit to include table names, but there's a definite balance between being explicit and adding a ton of unneeded clutter:

SELECT salaries.EmployeeName, 
       gendered_names.primary_sex,
       salaries."TotalPay&Benefits" AS tbpay
FROM 
  salaries
INNER JOIN
  gendered_names
    ON gendered_names.name = salaries.first_name
ORDER BY 
  tbpay DESC
LIMIT 5;

List the gender breakdown for employees paid over $150K

Same query as a previous query, except we use a GROUP BY, and we're getting an aggregate count (no need to get the names or salary):

SELECT primary_sex, 
    COUNT(*) AS pcount
FROM 
  salaries
INNER JOIN
  gendered_names
    ON gendered_names.name = salaries.first_name
WHERE TotalPay > 150000;

The only thing that's missing is the GROUP BY, which always comes after the WHERE, which means it also always comes after any JOINs:

SELECT primary_sex, 
    COUNT(*) AS pcount
FROM 
  salaries
INNER JOIN
  gendered_names
    ON gendered_names.name = salaries.first_name
WHERE TotalPay > 150000
GROUP BY primary_sex;

Count of FT employees by gender, and other salary metrics

Just the same kind of JOIN and GROUP BY as before, but with more metrics:

SELECT primary_sex, 
    COUNT(*) AS pcount,
    AVG(TotalPay) AS avg_total_pay,
    MIN(TotalPay) AS min_total_pay,
    MAX(TotalPay) AS max_total_pay
FROM 
  salaries
INNER JOIN
  gendered_names
    ON gendered_names.name = salaries.first_name
WHERE Status = 'FT'
GROUP BY primary_sex;

Average over-base-pay by gender

Let's close up with a fairly easy and derivative query: get an average of TotalPay/BasePay by gender for full-time employees. See if you can work this out on your own.

It may even help starting out a basic SELECT and starting with the gendered_names table instead of salaries:

SELECT primary_sex
FROM gendered_names;

Add the GROUP BY, though there's nothing to agg:

SELECT primary_sex
FROM gendered_names
GROUP BY primary_sex;

Let's add the JOIN:

SELECT primary_sex
FROM gendered_names
JOIN salaries
  ON salaries.first_name = gendered_names.name
GROUP BY primary_sex;

Now we have money-fields to AVG…and we're done:

SELECT primary_sex, 
  AVG(TotalPay/BasePay)
FROM gendered_names
JOIN salaries
  ON salaries.first_name = gendered_names.name
GROUP BY primary_sex;

Public Affairs Data Journalism at Stanford | Fall 2016

Homework: SQL Joins, Salaries, and Baby Names

The database of gendered baby names

Thinking about salary data

Transparent California

Downloading bulk data

Breaking names apart

Wrangling data using a spreadsheet

Splitting and making room

Cutting out the dead weight

Google Sheets quirk: The undying column

1. Highlight your `first_name` column

2. Select Copy and copy the entire column

3. Go to the Edit menu, then the Paste Special submenu, then Paste Values

Export to CSV

Importing into SQLite

SQLite and its painful love of TEXT

Consider CAST and being explicit

DBrowser table alteration

Remember to save in DB Browser

Query the payroll

Instructions

Number of full time employees

Show the number of Full-Time vs Part-Time employees

The 5 employees who make the most `Total Pay` compared to their base pay

How many full-time employees have a (TotalPay/BasePay) ratio over 1.5?

What are the 5 most common job titles

Of the employees with 'Police' in the title, what is their average salary, overtime, etc?

How many employees are management level

What is the percentage of full-time employees who make over $150,000 in total pay?

Query and join the payroll

List the name, the probable sex, and the total salary + benefits of 5 highest paid employees

List the gender breakdown for employees paid over $150K

Count of FT employees by gender, and other salary metrics

Average over-base-pay by gender

Homework: SQL Joins, Salaries, and Baby Names

The database of gendered baby names

Thinking about salary data

Transparent California

Downloading bulk data

Breaking names apart

Wrangling data using a spreadsheet

Splitting and making room

Cutting out the dead weight

Google Sheets quirk: The undying column

1. Highlight your first_name column

2. Select Copy and copy the entire column

3. Go to the Edit menu, then the Paste Special submenu, then Paste Values

Export to CSV

Importing into SQLite

SQLite and its painful love of TEXT

Consider CAST and being explicit

DBrowser table alteration

Remember to save in DB Browser

Query the payroll

Instructions

Number of full time employees

Show the number of Full-Time vs Part-Time employees

The 5 employees who make the most Total Pay compared to their base pay

How many full-time employees have a (TotalPay/BasePay) ratio over 1.5?

What are the 5 most common job titles

Of the employees with 'Police' in the title, what is their average salary, overtime, etc?

How many employees are management level

What is the percentage of full-time employees who make over $150,000 in total pay?

Query and join the payroll

List the name, the probable sex, and the total salary + benefits of 5 highest paid employees

List the gender breakdown for employees paid over $150K

Count of FT employees by gender, and other salary metrics

Average over-base-pay by gender

1. Highlight your `first_name` column

The 5 employees who make the most `Total Pay` compared to their base pay