Homework: Election Predictions with SQL
Today is Election Day. Learn new SQL statement, compete with your classmates, and win the prediction pool by copying Nate Silver.
-
Create a new spreadsheet named
elections-predictions
in yourpadjo-2016
folder. -
The first sheet should consist of the
predictions
table, with 52 rows. -
One row should have a
state
value ofNationwide
and what you’ve guessed forClinton
andTrump
-
The other 51 rows are the 50 states + District of Columbia.
-
The second sheet should be a simple list of SQL queries, with a semi-colon after each one. I should be able to paste in your queries as is and recreate your table.
-
Download my database election2016.sqlite
-
Make a copy of my Google Spreadsheet of data and example queries and try those queries out.
-
Delete all the pre-generated data in the
predictions
table -
Start writing queries to SELECT data from the other data tables and INSERT, UPDATE, and even DELETE your way to a
predictions
table of election forecasts by state. -
Before exporting your data from the
predictions
table, run these SQL queries to make sure the data is at least logical. -
Every data row in the
predictions
table and exported to yourelections-predictions
table must be created with SQL queries, not manual entry. Whether you choose to fill in each state one-by-one or by one massively generalized analysis, it’s up to you.
The Contest
Why guess the future of our country at the national level when we can efficiently play that game at every state? To celebrate the occasion, this is an assignment to use SQL queries to automate a crude prediction of the electoral results for every state.
Winner/winners will be chosen, based on an algorithm to calculate how accurately each student predicted the outcome.
If you hate prizes and/or competition, then just write the 2 or 3 queries needed to fill the required spreadsheet with (logical) junk. But you probably want to practice the spreadsheet of queries before the midterm: Elections data queries
The deliverable
Read on for more details on the data and examples of how to make calculations. But the end goal is that you turn in a simple 52-row spreadsheet that looks like this, with a Trump vs. Clinton popular vote prediction:
I don't care how you actually come up with the numbers. You could fill every row for Trump with 99
and Clinton with 0.5
. I just want you to do it automatically. If you're pressed for time, maybe you'll turn that in, but I think you'll find automating the analysis of data easier than expected.
Automate the boring analysis
In this assignment, you'll download the election2016.sqlite database Your two main predictions are:
- Predict what Trump and Clinton will receive in terms of popular vote nationwide. This can be just your gut feeling or ripping off Nate Silver. But you'll
INSERT
it via SQL code. - Predict the popular vote for Trump and Clinton for all 50 states and D.C. You can do it row-by-row, like your nationwide prediction. Or you can use the JOINs and WHEREs and GROUP BYs to efficiently and intelligently fill out each prediction, using historical election and poll data.
Each of the rows and values in my predictions table) was generated by a SQL query. There is no hand-entry of data for this exercise. Let's make SQL do all the grunt work. This is our first time altering a database – e.g. inserting, updating, and deleting rows – so we'll learn a bit here. But all the logic we've learned before applies just the same.
Your spreadsheet will have a second sheet that looks like this:
Because every data point was created by code, we can recreate it without worry or work. It's just copy and paste. And that includes when we want to recreate the data and explore different options and analysis.
Practice those queries
Because you can complete this assignment in 2 queries as far, there's not much point in going over very detailed way you could built out your prediction data. That's fine, the main goal is to provide more practice before the midterm. So here's a spreadsheet of SQL queries specific to this data, categorized, and rated:
These queries were meant to give you practice for the midterm and to introduce you to the different tables.
Getting started
Download the database
The election2016.sqlite database is a relatively small database of 4 tables, including predictions
, which you fill out on your own.
So, by all means, start out by downloading this:
See the elections2016 spreadsheet with queries on Google Sheets
This homework is meant to also practice the variety of queries that you might see on the midterm. The midterm won't be dealing with this exact dataset. But rather than write another walkthrough of a new dataset, I've posted a list of queries relevant to this homework.
Here's what the queries tab looks like:
Many of those queries are highly relevant to this homework…So make a copy of this spreadsheet, sort and filter the queries table to your heart's delight. Get an idea of what this dataset contains.
Just to remind you of the link here it is, don't forget it!
Google Spreadsheet "padjo-2016-election2016-with-queries"
What your elections-predictions
spreadsheet should look like
Remember that you have to create a new spreadsheet, elections-predictions
, and turn it in (the elections-queries spreadsheet is just for reference.)
Here's my version:
List of queries with optional commentary
Let's just get it done
OK, admittedly, I assigned this homework very late because I was trying to mix it into Thursday's midterm. So you just have a few hours to get acquainted with the dataset and run some queries, which are very similar in logic to what we've done before (hence, the practice), but are also completely nothing that we've done in class or will for the midterm (you won't have to insert, update, or delete rows).
So I'll just do 3 quick walkthroughs to get the rhythm. Down later in this section, I show a more fuller example with more talk about the new syntax.
It's up to you to get ideas from my Google Spreadsheet "padjo-2016-election2016-with-queries", which contains queries directly relevant to this dataset and to the midterm…
OK, assuming you've downloaded election2016.sqlite to get started…
MUST GET IT DONE BY 5PM
The bare minimum of work to fill in a complete spreadsheet, using SQL alone:
1. Delete every record from predictions
DELETE FROM predictions;
2. Insert a nationwide prediction
INSERT INTO predictions(sunetid, state, Trump, Clinton)
VALUES('mystanfordid', 'Nationwide', 99, 1);
3. Fill in the rest of predictions
with a list of states from any of the other tables:
INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT 'mystanfordid', state, 2, 88
FROM results;
4. Making sure the data makes sense
Jump down to the checklist of things – SQL queries to run – to ensure your data is good. Not necessarily that your data is right, but that you have 52 rows, having no funny numbers add up out of no where, and so forth.
4. Exports to Google spreadsheet
And you're pretty much done. Export predictions
to CSV, upload to the first tab of elections-predictions
.
Then, starting with DELETE FROM predictions;
, copy and paste each query into a separate second sheet of elections-predictions
.
Here's my example on Google Spreadsheets.
Let's practice joins and other intellectual fun
Hopefully you want to try winning the contest, while further testing your SQL knowledge. The new INSERT
, UPDATE
. DELETE
statements are things that are very useful in real-life, but not so much for the midterm or for this homework. But we can still use what we've learned.
1. Delete every record from predictions
Start clean:
DELETE FROM predictions;
2. Insert a nationwide prediction
Go ahead and make a prediction, any prediction. It's meant to be independent from the statewide data.
INSERT INTO predictions(sunetid, state, Trump, Clinton)
VALUES('mystanfordid', 'Nationwide', 46.3, 47.7);
3. Join and include data from last year's election
The results
table the state by state results for 2012, including popular vote for each state. Both results
and predictions
uses the same state
column, so we can join on it (romney
and obama
are the columns used to label who got what in terms of votes):
SELECT predictions.state, romney, obama
FROM predictions
INNER JOIN
results ON
results.state = predictions.state;
The trick to learn (which I've never learned, just Googled-on-demand) is that we can INSERT INTO
a batch of data, including data that results from a query. It's not an INNER JOIN, per se, it's more of a subquery that's dumped into the empty table.
INSERT INTO
predictions(sunetid, state, Trump, Clinton)
SELECT 'mynameisdan', results.state,
romney, obama
FROM results;
The result is that predictions
has 51 new rows with relevant political data.
4. Add demographic data penalty/benefits
We could finish here, but let's try to include demographic insights from the acs_1yr
table, which is a selection of Census demographic data.
Let's try to express that claim that minority groups, particularly Hispanic and Latino communities, are wary of Trump.
Find all states in which the Hispanic population is at least 25% of the white population:
SELECT name
FROM acs_1yr
WHERE Hispanic >= White * 0.25;
How do we join this with predictions
. We can't with UPDATE
.
But we do can perform a sneaky kind of a subquery:
- get a list of state names
- Use
IN
on this list, in aWHERE
clause
This won't be on the test, but it's useful to know and kind of intuitive. Below, I check if predictions.state
is IN
a list of state names that happen to be states with a large Hispanic population. If true, the Clinton value of the given state's row gets 3 points and Trump gets -4.5 points.
UPDATE predictions
SET Clinton = Clinton + 3
Trump = Trump - 4.5
WHERE state IN(
SELECT name
FROM acs_1yr
WHERE Hispanic >= White * 0.25
);
5. An anti-immigration boost for Trump
Let's give Trump a boost. He thinks the lower class has been left behind. And he thinks the Democrats are giving too many resources to immigrants. So for every state in which
But let's go with what Trump argues, that the lower class isn't being helped by the Democrats. And that an increase in foreign-born population has angered people.
States where the Hispanic + Asian + Foreign-born are 3 times the population below poverty line – keep in mind that there's double counting here – have Clinton losing 5.4 points and Trump gaining 4.2 points. Whatever.
UPDATE predictions
SET Clinton = Clinton - 5.4,
Trump = Trump + 4.2
WHERE state IN(
SELECT name
FROM acs_1yr
WHERE
(foreign_born_population + hispanic + asian)
> below_poverty_line * 3
);
5. Counting electoral votes
At this point, we can move on to the Google Spreadsheet phase and be done.
Instead, let's figure out how much things have changed from last year, given our demographic-political-based tinkering on the Obama-vs-Romney data.
Let's do a JOIN between predictions
and results
. Then add the electoral_votes
for rows in which the Trump
column beats Clinton
:
SELECT SUM(results.electoral_votes)
FROM predictions
INNER JOIN
results ON
results.state = predictions.state
WHERE (Trump - Clinton) > 0;
The answer:
SUM(results.electoral_votes) |
---|
250 |
Not a bad margin for Trump!
Trying our hand at polling
The most complicated prediction to make involves using polling data – quality of pollsters varies, as does time: a poll a day before the election is much different than a February poll.
And pollsters run many polls for the same states. The fact that some states have many more polls can be used to deal specifically with competitive races.
I won't walk through queries and SQL and setting up the predictions
table from scratch as before, I'll just show how easy it is (and how complex it can be) to add data via JOINS.
(Start off with deleting predictions on any new analysis)
DELETE FROM predictions;
1. Fill in the poll numbers from the most comprehensive polls since October 2016
Not all polls are created alike. I'm going to define "high-quality" as any poll that started in October or later and has gone on for at least 10 days.
For a major state like Ohio, you can imagine that several companies and their polls fall in this criteria. We group them by state and average what they recorded for dem
and rep
, the approval ratings for Clinton and Trump, respectively:
This query fines 7 states:
INSERT INTO
predictions(sunetid, state, Trump, Clinton)
SELECT 'stanford', state, AVG(rep), AVG(dem)
FROM polls
WHERE date > '2016-10'
AND number_of_days >= 10
GROUP BY state;
2. Fill in the states that are missing with fun SurveyMonkey polls
So we have 44 other states to plug-in. Rather than dig around the polls
database, I'm just going to do something very lazy and broad and hope it all cancels out: let's just get survey from a place that does a lot of them and average those together.
It's easy enough to do a group-by-count-sort and count the polling company with the most polls since September 2016:
SELECT poll_name,
COUNT(*) AS ct
FROM polls
WHERE date > '2016-09'
GROUP BY poll_name
ORDER BY ct DESC
LIMIT 10;
It's SurveyMonkey by a landslide, though Emerson College, which shows up as 2 aggregates because of inconsistent naming, looks to be plentiful, too:
poll_name | ct |
---|---|
SurveyMonkey | 120 |
Emerson Coll. | 39 |
PPP | 24 |
Monmouth U. | 20 |
Quinnipiac U. | 20 |
SurveyUSA | 20 |
YouGov | 20 |
Marist Coll. | 16 |
Opinion Research | 12 |
Emerson College | 9 |
Nate Silver doesn't seem to be a huge fan of SurveyMonkey, giving it a C. But if we do a group-by state for recent SurveyMonkey polls, thus getting a list of all states for which SM has polled recently, we can see that SurveyMonkey has total coverage of all 51 states:
SELECT COUNT(*)
FROM(
SELECT state,
COUNT(*) AS ct
FROM polls
WHERE date > '2016-10'
AND poll_name = 'SurveyMonkey'
GROUP BY state
);
Sounds good enough to me.
Now we get an average of rep
and dem
per state…
SELECT
state, AVG(rep), AVG(dem)
FROM polls
WHERE date > '2016-10'
AND poll_name = 'SurveyMonkey'
GROUP BY state
…but filter out states that are already listed in predictions
, using this jury-rigged subquery with NOT IN
:
SELECT
state, AVG(rep), AVG(dem)
FROM polls
WHERE date > '2016-10'
AND poll_name = 'SurveyMonkey'
AND state NOT IN (SELECT state FROM predictions)
GROUP BY state;
The insertion is just a little more code…
INSERT INTO
predictions(sunetid, state, Trump, Clinton)
SELECT
'stanford', state, AVG(rep), AVG(dem)
FROM polls
WHERE date > '2016-10'
AND poll_name = 'SurveyMonkey'
AND state NOT IN (SELECT state FROM predictions)
GROUP BY state;
OK, enough examples for now. Keep reading for more details. But your best bet is to try out the spreadsheet of queries.
The data
Let's look more closely at the data contained in the elections data tables
Guessing the national popular vote
The first requirement of this exercise is that you make a guess about what the percentage popular vote will be for Clinton and Trump (to keep things simple, we're keeping it between these two and not third-party candidates).
Start off with the results
table of the database, which looks like this on Google Sheets:
romney
and obama
indicate how much of the state's popular vote each candidate received. Pretty easy to figure out who won which state and by what margins and so forth.
For 2016, it's a guess. You can use your gut: Trump: 88%, Clinton 5%. Just make sure the numbers add up to 100% or less (probably less).
But maybe you don't like guessing. That's OK, that's what Google and news sites are for.
Here's a few resources for getting the national (and local) picture:
New York Times Latest Election Polls 2016
FiveThirtyEight is of course, polling central
But be careful that you don't interpret "Chance of winning" as their polling prediction, i.e., 538 doesn't think that Clinton will get 71.9% of the popular vote:
Instead, check out its table for Popular Vote:
Huffington Post has a wonderful page and data (though I couldn't find its popular vote):
And then you have the somewhat (in)famous Daybreak Poll from the Los Angeles Times and USC, "infamous" because several people really don't like seeing its trend line:
Note that the Daybreak poll isn't a typical poll. It doesn't ask people to make a decision, but to rate on scale from 0 to 100 on how likely they are to vote. The poll was used in 2012 and apparently was one of the most accurate, guessing Obama's victory margin to be 3.32 points (it was 3.85).
State data
The state data – particularly the polls
data – used for this lesson comes courtesy of Electoral-Vote.com, which you can peruse for a lot of interesting historical data.
Check out the polls
in your database. Here's what it looks like in Google Spreadsheets:
It's not just one poll per state per race. Pollsters almost all run more than one poll per state, and the timing is important. For some states, major pollsters don't even bother checking much (think of how some states aren't of much interest when it comes to predicting how they'll sway).
538's Pollster Ratings
Not all Pollsters are equal, and Nate Silver at FiveThirtyEight has put together a great table showing his analysis of the bias and relative success of each pollster:
I would have loved to include that data in our exercise; in fact, 538 has a repo for the raw CSV, so feel free to import it yourself.
However, it requires the hardest work of all – data cleaning and reconciliation of names – before you can easily join it to polls
. For instance, "PPP" in the Electoral-Vote.com data corresponds to, "Public Policy Polling" with 538.
That said you, can still just glance at 538's table
Learning SQL beyond SELECT
There's not enough time in this course to cover the SQL statements that, unlike SELECT, modify the database. So this is a good time to shove that in. Luckily, most of the syntax and concepts are the same, so it's an opportunity to practice what we've learned with JOINS and aggregations.
The INSERT statement
To insert a row into a table, we use this syntax: INSERT INTO
to specify the table and columns (and their exact order). And then VALUES
to specify the row of values:
INSERT INTO some_table(x, y, z)
VALUES(111, 222, 'hello');
– where x
, y
, z
are columns in some_table
.
To insert the row for the required "Nationwide" guess, with the columns sunetid
, state
, Trump
, and Clinton
being set, here's how to do it manually:
INSERT INTO predictions(sunetid, state, Trump, Clinton)
VALUES('dun', 'Nationwide', 49, 48);
What if you hit the Execute button too many times? Then you get too many rows:
The DELETE statement
Sometimes you just have to wipe things out. To wipe out every row in the predictions
table:
DELETE FROM predictions;
Deleting conditionally with WHERE
Sometimes you need a bit more precision. Pretend you have most of your data in the table and you make one mistake – you have the wrong value for Trump/Clinton for the New York
row. Then add a WHERE clause:
DELETE FROM predictions
WHERE state = 'New York';
Of course, the above condition will delete every row
The UPDATE statement
UPDATE predictions
SET sunetid = 'dun';
Set Trump all the way!
UPDATE predictions
SET Trump = 100;
UPDATE on a conditional
Target a row with WHERE
and update a field:
UPDATE predictions
SET Clinton = 55
WHERE predictions.state = 'Nationwide';
Target a row and update 2 fields at once:
UPDATE predictions
SET Clinton = 30, Trump = 65
WHERE predictions.state = 'California';
And of course, you can dumbly update multiple rows with any condition(s) you want. The following will update all states with a name longer than "ClintonTrump" with the Clinton and Trump columns set to 99:
UPDATE predictions
SET Clinton = 99, Trump = 99
WHERE LENGTH(state) > LENGTH('Clinton') + LENGTH('Trump');
Combining INSERT INTO with SELECT
What if we wanted to fill predictions
with the values for romney
and obama
from results
, but for Trump
and Clinton
, because we think that nothing ever changes etc etc? Then it would seem convenient to dump rows from one table into another.
No JOIN/UPDATE in SQLite
You might think of it as simply SELECTing the pertinent columns from results
:
SELECT state, romney, trump
FROM results;
And joining to an UPDATE
statement on predictions
and doing a SET
like before:
UPDATE predictions
INNER JOIN
results ON
results.state = predictions.state
SET predictions.state = results.state,
Trump = romney,
Clinton = obama;
Not a bad guess, because that's what I guessed. But SQLite doesn't support this syntax.
INSERT INTO then SELECT
Luckily, INSERT INTO
can work in conjunction with a standard SELECT
statement. Here's how to insert my SUNETID, state names from results
, and 50
, 50
, respectively into Trump
and Clinton
:
(Note how we don't use VALUES
)
INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT 'dun', state, 50, 50
FROM results;
That SELECT
statement can be any query. Here's how to fit each state row with a corresponding rep
and dem
group average from polls
:
INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT 'dun', state, AVG(rep), AVG(dem)
FROM polls
GROUP BY state;
Sample JOIN with ethnic demographic data
But why stop with a single SELECT statement? What if we want to use the demographic data in the ACS-1-year table, i.e. acs_1yr
, as well as the 2012 results
for Obama and Romney.
Let's buy into the assumption that Hispanic voters will be reluctant to vote for Trump. Hispanics' affinity towards Obama and Democrats is already baked into the Obama vs. Romney 2012 votes. But maybe we think that Hispanic votes will be even more spurred to vote Democrat.
Here's how we would quantify that:
- If the Hispanic population is greater than 20% of the total population for a given state, set the
Clinton
andTrump
to1.015
. and0.97
of the correspondingobama
andromney
values.
Selecting, joining Hispanic states and voting results
(remember that the state name is in name
for acs_1yr
)
SELECT name
FROM acs_1yr
WHERE (100 * hispanic / total_population) > 10;
Now, the basic select for results
:
SELECT state, romney, obama
FROM results;
Joining:
SELECT results.state,
romney * 0.97,
obama * 1.015
FROM results
INNER JOIN acs_1yr
ON acs_1yr.name = results.state
WHERE (100 * hispanic / total_population) >= 10;
And then inserting:
INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT
'dun', results.state,
romney * 0.97, obama * 1.015
FROM results
INNER JOIN acs_1yr
ON acs_1yr.name = results.state
WHERE (100 * hispanic / total_population) >= 10;
- Else, keep
Clinton
andTrump
the same asObama
andRomney
for now – i.e. insert all other states that don't have a Hispanic population +10%
INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT
'dun', results.state,
romney, obama
FROM results
INNER JOIN acs_1yr
ON acs_1yr.name = results.state
WHERE (100 * hispanic / total_population) < 10;
Updates using subqueries
(Note that doing a subquery to generate a list of things to use IN
with won't be on the Midterm)
Nate Silver told ABC News that he doesn't see the same turnout from African-American voters as in 2012, which could be pivotal for some major states.
So let's use that bit of info to subtract from Clinton's turnout.
Selecting from all states in which blacks make up 10+ percent of the population:
SELECT *
FROM acs_1yr
WHERE (100 * black / total_population) >= 10;
We can't join with an UPDATE
statement, but we can use a subquery like so; the subquery acts as a list of state names for the WHERE condition:
UPDATE predictions
SET Clinton = Clinton * 0.99
WHERE predictions.state IN(
SELECT name
FROM acs_1yr
WHERE (100 * black / total_population) >= 10
)
Individual updates
And of course you can always set values manually. The North Carolina results, being mostly based off of Romney's win in 2012, gives poor chances to Clinton. But you see in the New York Times that it's basically a statistical dead heat at 44%
So you could manually tweak it:
UPDATE predictions
SET Clinton = 44.0, Trump = 44.1
WHERE state = 'North Carolina';
Or, you could pull in from the polls
data, such as selecting the average of the North Carolina polls since mid-October:
UPDATE predictions
SET
Clinton = (SELECT AVG(dem) FROM polls WHERE date > '2016-10-15'
AND state = 'North Carolina'),
Trump = (SELECT AVG(rep) FROM polls WHERE date > '2016-10-15'
AND state = 'North Carolina');
…but actually, don't do the above. Because you need to make sure that state
inside the subquery, and inside the main query, are equal to 'North Carolina':
(confusing, I know…which is why this is not on the test)
UPDATE predictions
SET
Clinton = (SELECT AVG(dem) FROM polls WHERE date > '2016-10-15'
AND state = 'North Carolina'),
Trump = (SELECT AVG(rep) FROM polls WHERE date > '2016-10-15'
AND state = 'North Carolina')
WHERE state = 'North Carolina';
If you didn't mess up, and you haven't already manually inputted your Nationwide
estimate, here's how to manually insert a new row:
INSERT INTO
predictions(sunetid, state, Clinton, Trump)
VALUES('dun', 'Nationwide', 48.8, 46.3);
Messing up is OK because playback is fun!
But pretend you did mess up. Well that's the fun part of keeping track of your queries. I recommend throwing them into a spreadsheet for easier copy-and-paste playback. Plus, you already have to do that for the requirements of this assignment.
All over again, with a paste
Starting over is as easy as copying-and-pasting all the steps in, all at once. That semi-colon which doesn't seem to do anything? Well, when you need to do a bunch of things in a batch, especially non-selecting queries, it's efficient to paste all the lines in. The semi-colons tell the interpreter when one commands and the other begins.
Note: if you're using a client like DB Browser, it's best, after you've pasted the query, to Select All (Cmd-A) and then execute. Sometimes I hit the "execute line" by accident.
DELETE FROM predictions;
/* add states with high numbers of hispanics*/
INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT
'dun', results.state,
romney * 0.97, obama * 1.015
FROM results
INNER JOIN acs_1yr
ON acs_1yr.name = results.state
WHERE (100 * hispanic / total_population) >= 10;
/* insert all the other states */
INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT
'dun', results.state,
romney, obama
FROM results
INNER JOIN acs_1yr
ON acs_1yr.name = results.state
WHERE (100 * hispanic / total_population) < 10;
/* adjust Clinton's states where there's a large black population
inspired to vote for Obama */
UPDATE predictions
SET Clinton = Clinton * 0.99
WHERE predictions.state
IN(
SELECT name
FROM acs_1yr
WHERE (100 * black / total_population) >= 10
);
/* Adjust North Carolina with this ugly ugly code
...yeah, probably don't do this on your own*/
UPDATE predictions
SET
Clinton = (SELECT AVG(dem) FROM polls WHERE date > '2016-10-15'
AND state = 'North Carolina'),
Trump = (SELECT AVG(rep) FROM polls WHERE date > '2016-10-15'
AND state = 'North Carolina')
WHERE state = 'North Carolina';
/* finally, insert my national predictions */
INSERT INTO
predictions(sunetid, state, Clinton, Trump)
VALUES('dun', 'Nationwide', 48.8, 46.3);
Checks and balances list
OK, before finishing up, let's make sure you have all of these things done in your predictions
table:
- There should be 52 rows.
- There should be 538 electoral votes total (like 538 get it?)
- There should be one row in which the
state
isNationwide
(with proper capitalization) sunetid
should be straight-filled with your lowercase Stanford SUnet ID, e.g.bobby
.- The total of
Trump
andClinton
for a given row should be between 0 and 100. - No rows should have a negative
Trump
orClinton
Can you figure out the queries needed to confirm all of the above? See if you can write them, then see how you compare to me.
There should be 52 rows
Easy:
SELECT COUNT(*) FROM predictions;
There should be 538 total electoral votes
The electoral_votes
is in results
, so this will add up:
SELECT SUM(electoral_votes) FROM results;
But we already knew that. We need to know if that corresponds to what we have in predictions
:
SELECT SUM(electoral_votes) FROM results
INNER JOIN predictions
ON predictions.state = results.state;
If your predictions
has mismatched/misspelled names, then INNER JOIN won't result in a 52-row match, nor an accurate sum of electoral_votes
Optional: count by candidate
If you're interested in how your electoral votes per candidate held up, it's a simple SUM
plus JOIN
and filter with WHERE
.
To get Trump's count of electoral votes, do a SUM of results.electoral_votes
(we only need one column), joining predictions to results on the state
field.
Then filter out rows on the condition that Trump
(his share of the popular vote) is bigger than Clinton
SELECT SUM(results.electoral_votes)
FROM predictions
INNER JOIN
results ON
results.state = predictions.state
WHERE (Trump - Clinton) > 0;
CASE is cool
If you want to be fancy and practice a very useful thing not on the test, use the CASE expression, which is SQLite's version of a spreadsheet's if/else.
With CASE
, you effectively can create a category column, i.e. Trump
when he has more votes, Clinton
otherwise. In the previous query, we would have to run it twice.
Note how the CASE
expression is aliased AS winner
…that big thing is just a column value in the end.
The query below, we run it once and do good ol' GROUP BY
:
SELECT
CASE WHEN Trump > Clinton
THEN 'Trump'
ELSE 'Clinton'
END AS winner,
SUM(results.electoral_votes) AS evotes
FROM predictions
INNER JOIN
results ON
results.state = predictions.state
GROUP BY winner
ORDER BY evotes DESC;
winner | evotes |
---|---|
Clinton | 347 |
Trump | 191 |
One row with Nationwide
as state
The following query should equal 1
:
SELECT COUNT(*) FROM predictions
WHERE `state` = 'Nationwide';
All sunetid
values should be your SUNet ID
The following query should equal 0
:
SELECT COUNT(*) FROM predictions
WHERE sunetid != 'your_sunet_id_here';
0 <= Trump + Clinton <= 100
There should be 0 rows in which the following condition is true:
SELECT COUNT(*) FROM predictions
WHERE (Trump + Clinton)
NOT BETWEEN 0 AND 100;
Trump >= 0 AND Clinton >= 0
No rows should have either Trump nor Clinton below 0:
SELECT COUNT(*) FROM predictions
WHERE
Trump < 0 OR Clinton < 0;
All done
If you got here, then your predictions
table is ready to be exported to CSV as is, i.e. do SELECT * FROM predictions;
Then import it into your elections-predictions
spreadsheet.
Don't forget to create a second spreadsheet in elections-predictions
, in which every row is just a single column of a query (ended by a semi-colon) needed to build your table. Feel free to add notes in adjacent columns.
You can see how I did it here:
List of queries with optional commentary
More queries to try
If you haven't already, check out my spreadsheet of election queries, which also has copies of the three data tables: padjo-2016-election-queries
Make a copy of the spreadsheet, or at least the queries page, and sort-and-search and try the queries on your own: