SQL Joins for estimating gender and Hollywood power

THR released a list of Most Powerful People in Entertainment. How can we analyze that as data?

The data

The first part of the tutorial goes over how I manually spreadsheetified the Hollywood Reporter feature, The THR 100: Hollywood Reporter's Most Powerful People in Entertainment

To do the SQL in this lesson, download the database here thr-gendered-babynames.sqlite

Checking out Hollywood Reporter's 100 Most Powerful Entertainers list

In 2016, THR put out its first power ranking of Hollywood entertainers: It's [a great read:

The THR 100: Hollywood Reporter's Most Powerful People in Entertainment

Here's their methodology – THR 100: How the Staff Put Together This Power List – though you may want to skip it until we run our SQL analysis, as to not spoil the surprise of accuracy.

Wrangling a story-list into a spreadsheet

I love lists. Lists are a great way for displaying and organizing data, and THR's feature is no different. That said, anything that can be a list can be a better spreadsheet at some point. What if there were other ways to look at this list other than a ranking of 100 to 1? What if we wanted to count number of actors vs. producers vs. CEO types, for example?

Even this straightforward narrative list can be separated into enough interesting parts to warrant a spreadsheet. Unfortunately, because this was a very narrative feature, there is no easy program that will suck up the data in this list and make it structured.

Doing things by hand

So I did it the old fashioned way: I read the entire list, and then I hand-entered it into a spreadsheet, which you can view here on Google Drive.

Here's the final version of the spreadsheet, ready to be imported into a database:

And here's how it was first entered into the spreadsheet – basically copying-and-pasting efficiently:

Tidying up narrative data

The THR feature made for a great read, but things got messy when trying to break it down to data. For example, each rank does not represent a single person. For rank 11, they stuffed 3 people – Jeff Shell, Donna Langley and Ron Meyer – and their respective titles into a single slot:

It makes editorial sense – the team of Shell, Langley, and Meyer is what makes them the 11th most powerful Hollywood entertaining force. But it causes a lot of inconveniences in data, especially with databases when we want one piece of information per cell. Here's what I ended up doing – each person gets their own line, and I split up the job titles as best I could:

When each person has their own line, there are 125 individual humans in this list. You can view it on Google Spreadsheet.

Or, you can download the SQLite database I've created, which contains this simple table of THR entertainers and also, a table of gendered babyname data. You can probably guess what comes next.

Download the database here: thr-gendered-babynames.sqlite

About the gendered_names table

Once you've download the SQLite database – thr-gendered-babynames.sqlite – you'll notice that besides the THR table, there's a table named gendered_babynames which looks like this:

It's a transformed version of the Social Security Baby Names for All States, 2015 that we've come across. The original database contains a row for male and female count of a given name in a given state:

SELECT state, year, name, sex, count 
FROM babynames 
WHERE name = "Parker" 
  AND state = "US";

Take note of how there are substantially more male Parkers than female.

state	year	name	sex	count
US	2015	Parker	M	5383
US	2015	Parker	F	1439

The raw baby name data has always had potential for gender-analysis use. To simplify things for SQL beginners, I've taken the liberty of aggregating/transforming the data so that names like Parker only get one row:

SELECT *
FROM gendered_names
WHERE name = "Parker"
  AND state = 'US';

year	state	name	total_babies	primary_sex	primary_sex_ratio	per_100k_in_state
2015	US	Parker	6822	M	78.9	186

The original data is still (mostly) there – to calculate number of females named Parker in 2015, we multiply 6822 by (100 - 78.9).

But the main reason to have the baby name data in this simplified format is so that we can do queries like these:

What are the most popular gender-neutral names that have at least 2,000 total babies nationwide?

SELECT name, total_babies, primary_sex, primary_sex_ratio
FROM gendered_names
WHERE total_babies >= 2000
  AND primary_sex_ratio < 70
LIMIT 10;

name	total_babies	primary_sex	primary_sex_ratio
Hayden	4415	M	61.4
Charlie	3216	M	51.7
Rowan	2781	M	64.7
River	2440	M	61.4
Amari	2035	M	66.9
Emerson	2971	F	59.9
Finley	2653	F	60.2
Dakota	2254	F	58.7
Skyler	2013	F	55.2

What are the most male-skewed names that begin and end in vowels?

(Making a crude judgment that guy names have "harsh" sounds)

SELECT name, total_babies, primary_sex_ratio
FROM gendered_names
WHERE SUBSTR(name, 1, 1) IN ('A', 'E', 'I', 'O', 'U')
  AND SUBSTR(name, -1, 1) IN ('a', 'e', 'i', 'o', 'u')
  AND primary_sex = 'M'
ORDER BY primary_sex_ratio DESC, total_babies DESC
LIMIT 10;

As it turns out, vowelly names for boys is a popular thing outside of the traditional U.S.:

name	total_babies	primary_sex_ratio
Antonio	2708	100
Alejandro	2303	100
Emiliano	1660	100
Eduardo	1440	100
Andre	1310	100
Angelo	1200	100
Emilio	1182	100
Enzo	1039	100
Armando	749	100
Eli	741	100

OK, now that we have an idea of what's in the THR table and what's in gendered_names, let's put the two together for a quick analysis of THR's list.

Gender analysis of THR's Powerful Entertainers

The THR 100 is new this year, but THR has traditionally done separate 100 lists, including favorite films, lawyers, and women in entertainment

No longer pitting women vs. women

Last year, THR decided to stop doing a separate Entertainment Power 100 List for women, and wrote a great essay explaining their thoughts:

And yet today, in legend and reality, women fight for position on these lists in ways that don't always make them, or us, comfortable. THR's Power 100 list, by its nature, pits the town's most impressive females against one another. I can't help but think of a telling passage I read from Lansing's upcoming biography that describes her ascension in an era when men felt there was room for just one alpha woman at a time at the studios.

I've come to believe that something as simple as our ranked women's lists contributes to keeping that sense alive, that we accidentally created a beauty pageant of brains where only one woman gets crowned. Some women have publicly cried upon seeing their rankings. That is funny to some people. But it's depressing as hell to me.

In their methodology, THR admits that it's not exactly about hard numbers:

For some, power is the ability to get a green light. Others say it's the ability to give a green light. Access to gigantic sums of money was a part of the equation, as were fame (turning heads when you enter a restaurant is practically a primal form of alpha-ism) and smarts (information is power). The key criterion was influence in the transactional machinery of day-to-day Hollywood. Boiled down, the simplest test of determining the differential between people became this: Whose phone call would you return first? Not because you like them but because your financial future and career just might, might, be altered by the time you hung up?

As one metric for whether or not the THR 100 reflects rankings, THR has a nice chart ranking the THR 100 by how much ink they received:

But what about THR's desire to recognize the influence and power of women entertainers? Was that reflected in the THR 100. What metric would we use to decide that?

Well, the best metrics are sometimes the simplest: let's just count how many women are on the list.

Joining gender to THR names

The THR 100 list is short enough. But with SQL, we can actually do this calculation in mere seconds. How?

By connecting the gender-of-name data found in the gendered_names table to each name in the thr_power_list table.

Let's review what thr_power_list contains: names, ranks, and job info

SELECT * 
FROM thr_power_list
ORDER BY rank
LIMIT 10;

SELECT * 
FROM thr_power_list
ORDER BY rank
LIMIT 10;

rank	full_name	first_name	last_name	occupation	company
1	Bob Iger	Bob	Iger	Chairman and CEO	Walt Disney Co.
2	James Murdoch	James	Murdoch	CEO	21st Century Fox
2	Lachlan Murdoch	Lachlan	Murdoch	Executive chairman	21st Century Fox
2	Rupert Murdoch	Rupert	Murdoch	Executive chairman	21st Century Fox
3	Leslie Moonves	Leslie	Moonves	President and CEO	CBS Corp.
4	Steve Burke	Steve	Burke	CEO	NBC Universal
5	Ted Sarandos	Ted	Sarandos	Chief content officer	Netflix
6	Sumner Redstone	Sumner	Redstone	President, national amusements, and vice chair,	Viacom and CBS Boards
6	Shari Redstone	Shari	Redstone	Chairman emeritus	Viacom and CBS Corp.
7	Jeff Bewkes	Jeff	Bewkes	Chairman and CEO	Time Warner

Pretend we we knew nothing about typical American names, such as the traditional gender for people named Bob, James, Lachlan, and Leslie.

How would we find the gender-ratio for each of these names, using just the database that we have?

Just do a simple SELECT/WHERE on gendered_names for each of the first_name values in thr_power_list, and tally up the males and females:

SELECT name, primary_sex, primary_sex_ratio, total_babies 
FROM gendered_names
WHERE name = 'Bob';

(wow, only 14 babes named Bob in the entire U.S. for 2015?)

name	primary_sex	primary_sex_ratio	total_babies
Bob	M	100	14

SELECT name, primary_sex, primary_sex_ratio, total_babies 
FROM gendered_names
WHERE name = 'Leslie';

This is where our data is wrong. For a more robust gender detector, we should be using data from past decades, back when Leslie was still a common male name, like Leslie Nielsen and, in this case, Leslie Moonves. Alas, our gendered_names data only includes 2015 data, and apparently, Leslie is now almost exclusively a female name:

name	primary_sex	primary_sex_ratio	total_babies
Leslie	F	94.2	930

Let's JOIN

OK, now that we've got the idea, let's see how the JOIN clause can make our analysis so much quicker (the accuracy problem will have to be dealt with later).

Start with the non-JOIN SQL we know

Let's start with what we know we want and what we know we can do, at the very least: getting a list of first names from thr_power_list:

SELECT first_name
FROM thr_power_list;

Which looks like this:

first_name
Chris
Jason
Robert
Emma

Given the above table, let's now think (but not overthink) about the very next data point we need to count the genders of this list.

Matching sex/gender to name

To count the genders-by-name in this list, we need to know what gender each name is. Even if we don't know how to get to this result, at least we know we need this result:

first_name	primary_sex
Chris	M
Jason	M
Robert	M
Emma	F

primary_sex – the column that states whether a given name is mostly-male or mostly-female based on baby count – is in the gendered_names table.

To get the results table we want, it would seem we would have to join the two tables, thr_power_list and gendered_names.

The JOIN syntax

OK, finally, getting to the point of writing JOIN in a SQL query:

Starting with the answer

The section goes down a rabbit hole of what happens when you misunderstand the JOIN concept and/or syntax. It's important to go down this rabbit hole. But let's not miss the big picture or what best practice is.

So, if this is the result we want and need (and it is) to count up gender per name in the THR list:

first_name	primary_sex
Chris	M
Jason	M
Robert	M
Emma	F

This is the SQL JOIN statement we write:

SELECT thr_power_list.first_name,
       gendered_names.primary_sex
FROM 
  thr_power_list
INNER JOIN 
  gendered_names ON
    thr_power_list.first_name = gendered_names.name;

Run the query in your own database. See if you can figure out what's needed to get an aggregate count of males and females…

Otherwise, back to exploring JOIN, slowly and painfully:

The FROM part of a JOIN

A JOIN involves joining two tables. That first table comes from the FROM clause:

SELECT first_name
FROM thr_power_list;

The JOIN part of a JOIN

To include a second table, we use the JOIN keyword – it comes after the FROM clause, and we give it a table name, i.e. the second table we want to join:

Note – do not run this query. Nothing will go wrong, per se, but it may take a very long time:

SELECT first_name
FROM thr_power_list
JOIN gendered_names;

Always write `INNER JOIN` instead of just `JOIN`

Second thing I want to note: even though JOIN by itself defaults to a INNER JOIN operation, I believe it is best practice to always write out INNER JOIN and will do so for every lesson and example going forward. It's just one more word for a lot more clarity:

SELECT first_name
FROM thr_power_list
INNER JOIN gendered_names;

The INNER JOIN is just cosmetic. Don't run the above query unless you want to tie up your computer.

Blowing things up when joining without a condition

Let's take a step back, why is the previous SQL query so bad?

First of all, it won't be an error because it's perfectly good SQL. And the database engine will just run it. Unfortunately, running this type of query – a JOIN without an ON conditiona – can be extremely computationally expensive.

The illogical WHERE

The closest analogy I can think of is when we write a WHERE clause, but give it a boolean expression that is always true (or false)

SELECT *
FROM mytable
WHERE 1;

1 is considered a truthy value, all the time, so this query retrieves all the rows and columns from mytable, just as if we wrote:

SELECT *
FROM mytable;

So what's the point of that WHERE, except to confuse everyone who reads the code? Or, more likely, it's a human-created error, but one that doesn't cause the query to crash with an error response.

So going back to the bad JOIN query:

SELECT first_name
FROM thr_power_list
INNER JOIN gendered_names;

We need to give it an ON clause with a boolean expression. This boolean expression tells the interpreter:

If this ON condition is true for a row in thr_power_list, and another row in gendered_names, then the results should contain a row consisting of the two rows JOINED together.

Specifically, in the example below, the ON condition tests if a given row in thr_power_list has a first_name equal to a given row in gendered_names. If true, then the result set will have a row consisting of the two joined rows.

Try it out:

SELECT first_name
FROM 
  thr_power_list
INNER JOIN 
  gendered_names
    ON 
    gendered_names.name = thr_power_list.first_name;

Think of ON as almost the same usage as WHERE, except we never use WHERE inside a JOIN clause. WHERE belongs where it does, later in the query, outside of the JOIN.

Note: I like adding extra return lines and indentation to emphasize that two tables are being accessed:

– which feels a lot like WHERE except that we just never use WHERE, the keyword

– has a similar logic flaw. We almost always want to JOIN based on a conditional statement, just like we almost always use WHERE to filter out results.

In the situation of WHERE, when there isn't a real conditional clause/test on how to filter/reject results from a query, the database engine will return all ows in a result.

In the situation of JOIN, when we omit the conditional clause/test, the database engine will proceed to JOIN the two tables in the most broadly, brute way possible.

People and suits example

It's easier to show using a small example:

people

name	size
Dan	M
Bobbi	XL
Ted	L

suits

color	size
Red	S
Green	M
Pink	L
Blue	XL

I want to join people and suits, so that each person row is associated with a corresponding suit row. How do we match people with suits? With size. A proper join query would look like this:

SELECT * 
FROM  people
INNER JOIN  suits
  ON people.size = suits.size;

And the result, everyone gets a suit (notice how size from both tables is repeated, thanks to my lazy use of *):

name	size	color	size
Dan	M	Green	M
Bobbi	XL	Blue	XL
Ted	L	Pink	L

Blowing up the JOIN

OK, what does a bad JOIN and ON condition look like? How about when ON is always False?

SELECT * 
FROM  people
INNER JOIN  suits
  ON 1 = 2;

The result: 0 rows. But nothing too crazy happened.

Let's do the JOIN that I warned about: a JOIN with no ON condition, and thus, a JOIN that will mash up every combination of row:

SELECT * 
FROM  people
INNER JOIN  suits;

The result is a non-sensical table with as many rows as multiplying people by suits. Because that's basically what happened:

name	size	color	size
Dan	M	Red	S
Dan	M	Green	M
Dan	M	Pink	L
Dan	M	Blue	XL
Bobbi	XL	Red	S
Bobbi	XL	Green	M
Bobbi	XL	Pink	L
Bobbi	XL	Blue	XL
Ted	L	Red	S
Ted	L	Green	M
Ted	L	Pink	L
Ted	L	Blue	XL

Analyzing the THR list with SQL and INNER JOIN

That was enough off meandering over syntax-details. Let's get to the real work: using a JOIN to quickly count the male and female entrants to THR's Top 100 Most Powerful in Hollywood.

Returning to an earlier query:

SELECT thr_power_list.first_name,
       gendered_names.primary_sex
FROM 
  thr_power_list
INNER JOIN 
  gendered_names ON
    thr_power_list.first_name = gendered_names.name;

And breaking it down:

Specify the 2 columns we want, first_name and primary_sex
Specify the first table to select FROM: thr_power_list
Specify the second table to JOIN wit: gendered_names
For that join, specify the condition that a row from gendered_names joins with a row from thr_power_list (when first_name is equal to name)

Joining sex to THR's list of names

And here are the results of that join query – looks like a lot of guys in the first dozen:

first_name	primary_sex
Chris	M
Jason	M
Robert	M
Neal	M
Brian	M
Alejandro	M
Scott	M
Kevin	M
Aaron	M
David	M
Doug	M
Amy	F
Thomas	M
Mary	F
Zack	M
Jamie	F

Doing a GROUP BY aggregate count to sum up males vs females

Joins add a seemingly significant amount of syntax and cognitive load – though it's worth it for what they do. However, once you get used to seeing them, you'll realize that all of the other things you've learned still do what they used to, and JOINs are just another piece.

Case in point, how do we get a count by gender for the THR list? Just add a GROUP BY and aggregate function:

SELECT gendered_names.primary_sex, 
  COUNT(*) AS the_count
FROM 
  thr_power_list
INNER JOIN 
  gendered_names ON
    thr_power_list.first_name = gendered_names.name
GROUP BY gendered_names.primary_sex
ORDER BY the_count DESC;

primary_sex	the_count
M	99
F	19

How close were we? According to Hollywood Reporter's methodology, we were pretty much on the dot:

And lastly, yes, we know that a lot of this list is white guys. We won't need social media, thanks anyway, to drop an anvil on our heads to realize this. Out of the 124 names on the list (due to some shared groupings), there are only 19 women and 10 people of color (six African-Americans, two Asians, two Latinos). I'd like to think we cover the issues of gender and diversity that plague Hollywood with a critical eye and will continue to. And I hope and expect that one day in the near future this list has a wholly improved and reflective composition.

…though obviously some of that was luck. And I'm not sure why we ended up with 128 people instead of 124, but we can clear out some of the hiccups (usually dealing with GROUP BY) at another point.

Public Affairs Data Journalism at Stanford | Fall 2016

SQL Joins for estimating gender and Hollywood power

The data

Checking out Hollywood Reporter's 100 Most Powerful Entertainers list

Wrangling a story-list into a spreadsheet

Doing things by hand

Tidying up narrative data

About the gendered_names table

What are the most popular gender-neutral names that have at least 2,000 total babies nationwide?

What are the most male-skewed names that begin and end in vowels?

Gender analysis of THR's Powerful Entertainers

No longer pitting women vs. women

Joining gender to THR names

Let's JOIN

Start with the non-JOIN SQL we know

Matching sex/gender to name

The JOIN syntax

Starting with the answer

The FROM part of a JOIN

The JOIN part of a JOIN

Always write `INNER JOIN` instead of just `JOIN`

Blowing things up when joining without a condition

The illogical WHERE

People and suits example

Blowing up the JOIN

Analyzing the THR list with SQL and INNER JOIN

Joining sex to THR's list of names

Doing a GROUP BY aggregate count to sum up males vs females

References

SQL Joins for estimating gender and Hollywood power

The data

Checking out Hollywood Reporter's 100 Most Powerful Entertainers list

Wrangling a story-list into a spreadsheet

Doing things by hand

Tidying up narrative data

About the gendered_names table

What are the most popular gender-neutral names that have at least 2,000 total babies nationwide?

What are the most male-skewed names that begin and end in vowels?

Gender analysis of THR's Powerful Entertainers

No longer pitting women vs. women

Joining gender to THR names

Let's JOIN

Start with the non-JOIN SQL we know

Matching sex/gender to name

The JOIN syntax

Starting with the answer

The FROM part of a JOIN

The JOIN part of a JOIN

Always write INNER JOIN instead of just JOIN

Blowing things up when joining without a condition

The illogical WHERE

People and suits example

Blowing up the JOIN

Analyzing the THR list with SQL and INNER JOIN

Joining sex to THR's list of names

Doing a GROUP BY aggregate count to sum up males vs females

References

Always write `INNER JOIN` instead of just `JOIN`