Homework: Election Predictions with SQL

Today is Election Day. Learn new SQL statement, compete with your classmates, and win the prediction pool by copying Nate Silver.

Deadline

Tuesday, November 8 at 5:00 PM

Points

5 Homework points

Deliverables

Create a new spreadsheet named elections-predictions in your padjo-2016 folder.
The first sheet should consist of the predictions table, with 52 rows.
One row should have a state value of Nationwide and what you’ve guessed for Clinton and Trump
The other 51 rows are the 50 states + District of Columbia.
The second sheet should be a simple list of SQL queries, with a semi-colon after each one. I should be able to paste in your queries as is and recreate your table.

Requirements

Download my database election2016.sqlite
Make a copy of my Google Spreadsheet of data and example queries and try those queries out.
Delete all the pre-generated data in the predictions table
Start writing queries to SELECT data from the other data tables and INSERT, UPDATE, and even DELETE your way to a predictions table of election forecasts by state.
Before exporting your data from the predictions table, run these SQL queries to make sure the data is at least logical.
Every data row in the predictions table and exported to your elections-predictions table must be created with SQL queries, not manual entry. Whether you choose to fill in each state one-by-one or by one massively generalized analysis, it’s up to you.

The Contest

Polling is serious business:

Why guess the future of our country at the national level when we can efficiently play that game at every state? To celebrate the occasion, this is an assignment to use SQL queries to automate a crude prediction of the electoral results for every state.

Winner/winners will be chosen, based on an algorithm to calculate how accurately each student predicted the outcome.

If you hate prizes and/or competition, then just write the 2 or 3 queries needed to fill the required spreadsheet with (logical) junk. But you probably want to practice the spreadsheet of queries before the midterm: Elections data queries

The deliverable

Read on for more details on the data and examples of how to make calculations. But the end goal is that you turn in a simple 52-row spreadsheet that looks like this, with a Trump vs. Clinton popular vote prediction:

I don't care how you actually come up with the numbers. You could fill every row for Trump with 99 and Clinton with 0.5. I just want you to do it automatically. If you're pressed for time, maybe you'll turn that in, but I think you'll find automating the analysis of data easier than expected.

Automate the boring analysis

In this assignment, you'll download the election2016.sqlite database Your two main predictions are:

Predict what Trump and Clinton will receive in terms of popular vote nationwide. This can be just your gut feeling or ripping off Nate Silver. But you'll INSERT it via SQL code.
Predict the popular vote for Trump and Clinton for all 50 states and D.C. You can do it row-by-row, like your nationwide prediction. Or you can use the JOINs and WHEREs and GROUP BYs to efficiently and intelligently fill out each prediction, using historical election and poll data.

Each of the rows and values in my predictions table) was generated by a SQL query. There is no hand-entry of data for this exercise. Let's make SQL do all the grunt work. This is our first time altering a database – e.g. inserting, updating, and deleting rows – so we'll learn a bit here. But all the logic we've learned before applies just the same.

Your spreadsheet will have a second sheet that looks like this:

Because every data point was created by code, we can recreate it without worry or work. It's just copy and paste. And that includes when we want to recreate the data and explore different options and analysis.

Practice those queries

Because you can complete this assignment in 2 queries as far, there's not much point in going over very detailed way you could built out your prediction data. That's fine, the main goal is to provide more practice before the midterm. So here's a spreadsheet of SQL queries specific to this data, categorized, and rated:

Elections data queries

These queries were meant to give you practice for the midterm and to introduce you to the different tables.

Getting started

Download the database

The election2016.sqlite database is a relatively small database of 4 tables, including predictions, which you fill out on your own.

So, by all means, start out by downloading this:

election2016.sqlite database

See the elections2016 spreadsheet with queries on Google Sheets

This homework is meant to also practice the variety of queries that you might see on the midterm. The midterm won't be dealing with this exact dataset. But rather than write another walkthrough of a new dataset, I've posted a list of queries relevant to this homework.

Here's what the queries tab looks like:

Many of those queries are highly relevant to this homework…So make a copy of this spreadsheet, sort and filter the queries table to your heart's delight. Get an idea of what this dataset contains.

Just to remind you of the link here it is, don't forget it!

Google Spreadsheet "padjo-2016-election2016-with-queries"

What your `elections-predictions` spreadsheet should look like

Remember that you have to create a new spreadsheet, elections-predictions, and turn it in (the elections-queries spreadsheet is just for reference.)

Here's my version:

My exported predictions

List of queries with optional commentary

Let's just get it done

OK, admittedly, I assigned this homework very late because I was trying to mix it into Thursday's midterm. So you just have a few hours to get acquainted with the dataset and run some queries, which are very similar in logic to what we've done before (hence, the practice), but are also completely nothing that we've done in class or will for the midterm (you won't have to insert, update, or delete rows).

So I'll just do 3 quick walkthroughs to get the rhythm. Down later in this section, I show a more fuller example with more talk about the new syntax.

It's up to you to get ideas from my Google Spreadsheet "padjo-2016-election2016-with-queries", which contains queries directly relevant to this dataset and to the midterm…

OK, assuming you've downloaded election2016.sqlite to get started…

MUST GET IT DONE BY 5PM

The bare minimum of work to fill in a complete spreadsheet, using SQL alone:

1. Delete every record from `predictions`

DELETE FROM predictions;

2. Insert a nationwide prediction

INSERT INTO predictions(sunetid, state, Trump, Clinton)
VALUES('mystanfordid', 'Nationwide', 99, 1);

3. Fill in the rest of `predictions` with a list of states from any of the other tables:

INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT 'mystanfordid', state, 2, 88
FROM results;

4. Making sure the data makes sense

Jump down to the checklist of things – SQL queries to run – to ensure your data is good. Not necessarily that your data is right, but that you have 52 rows, having no funny numbers add up out of no where, and so forth.

4. Exports to Google spreadsheet

And you're pretty much done. Export predictions to CSV, upload to the first tab of elections-predictions.

Then, starting with DELETE FROM predictions;, copy and paste each query into a separate second sheet of elections-predictions.

Here's my example on Google Spreadsheets.

Let's practice joins and other intellectual fun

Hopefully you want to try winning the contest, while further testing your SQL knowledge. The new INSERT, UPDATE. DELETE statements are things that are very useful in real-life, but not so much for the midterm or for this homework. But we can still use what we've learned.

1. Delete every record from `predictions`

Start clean:

DELETE FROM predictions;

2. Insert a nationwide prediction

Go ahead and make a prediction, any prediction. It's meant to be independent from the statewide data.

INSERT INTO predictions(sunetid, state, Trump, Clinton)
VALUES('mystanfordid', 'Nationwide', 46.3, 47.7);

3. Join and include data from last year's election

The results table the state by state results for 2012, including popular vote for each state. Both results and predictions uses the same state column, so we can join on it (romney and obama are the columns used to label who got what in terms of votes):

SELECT predictions.state, romney, obama 
FROM predictions
INNER JOIN 
  results ON
  results.state = predictions.state;

The trick to learn (which I've never learned, just Googled-on-demand) is that we can INSERT INTO a batch of data, including data that results from a query. It's not an INNER JOIN, per se, it's more of a subquery that's dumped into the empty table.

INSERT INTO 
  predictions(sunetid, state, Trump, Clinton)
  SELECT 'mynameisdan', results.state, 
    romney, obama 
  FROM results;

The result is that predictions has 51 new rows with relevant political data.

4. Add demographic data penalty/benefits

We could finish here, but let's try to include demographic insights from the acs_1yr table, which is a selection of Census demographic data.

Let's try to express that claim that minority groups, particularly Hispanic and Latino communities, are wary of Trump.

Find all states in which the Hispanic population is at least 25% of the white population:

SELECT name
FROM acs_1yr
WHERE Hispanic >= White * 0.25;

How do we join this with predictions. We can't with UPDATE.

But we do can perform a sneaky kind of a subquery:

get a list of state names
Use IN on this list, in a WHERE clause

This won't be on the test, but it's useful to know and kind of intuitive. Below, I check if predictions.state is IN a list of state names that happen to be states with a large Hispanic population. If true, the Clinton value of the given state's row gets 3 points and Trump gets -4.5 points.

UPDATE predictions
  SET Clinton = Clinton + 3
    Trump = Trump - 4.5
  WHERE state IN(
    SELECT name
    FROM acs_1yr
    WHERE Hispanic >= White * 0.25
  );

5. An anti-immigration boost for Trump

Let's give Trump a boost. He thinks the lower class has been left behind. And he thinks the Democrats are giving too many resources to immigrants. So for every state in which

But let's go with what Trump argues, that the lower class isn't being helped by the Democrats. And that an increase in foreign-born population has angered people.

States where the Hispanic + Asian + Foreign-born are 3 times the population below poverty line – keep in mind that there's double counting here – have Clinton losing 5.4 points and Trump gaining 4.2 points. Whatever.

UPDATE predictions
  SET Clinton = Clinton - 5.4,
  Trump = Trump +  4.2
  WHERE state  IN(
      SELECT name
      FROM acs_1yr
      WHERE 
        (foreign_born_population + hispanic + asian) 
          > below_poverty_line * 3
  );

5. Counting electoral votes

At this point, we can move on to the Google Spreadsheet phase and be done.

Instead, let's figure out how much things have changed from last year, given our demographic-political-based tinkering on the Obama-vs-Romney data.

Let's do a JOIN between predictions and results. Then add the electoral_votes for rows in which the Trump column beats Clinton:

SELECT SUM(results.electoral_votes) 
FROM predictions
INNER JOIN
  results ON
    results.state = predictions.state
WHERE (Trump - Clinton)  > 0;

The answer:

SUM(results.electoral_votes)
250

Not a bad margin for Trump!

Trying our hand at polling

The most complicated prediction to make involves using polling data – quality of pollsters varies, as does time: a poll a day before the election is much different than a February poll.

And pollsters run many polls for the same states. The fact that some states have many more polls can be used to deal specifically with competitive races.

I won't walk through queries and SQL and setting up the predictions table from scratch as before, I'll just show how easy it is (and how complex it can be) to add data via JOINS.

(Start off with deleting predictions on any new analysis)

DELETE FROM predictions;

1. Fill in the poll numbers from the most comprehensive polls since October 2016

Not all polls are created alike. I'm going to define "high-quality" as any poll that started in October or later and has gone on for at least 10 days.

For a major state like Ohio, you can imagine that several companies and their polls fall in this criteria. We group them by state and average what they recorded for dem and rep, the approval ratings for Clinton and Trump, respectively:

This query fines 7 states:

INSERT INTO 
  predictions(sunetid, state, Trump, Clinton)
SELECT 'stanford', state, AVG(rep), AVG(dem)
FROM polls
WHERE date > '2016-10'
  AND number_of_days >= 10
GROUP BY state;

2. Fill in the states that are missing with fun SurveyMonkey polls

So we have 44 other states to plug-in. Rather than dig around the polls database, I'm just going to do something very lazy and broad and hope it all cancels out: let's just get survey from a place that does a lot of them and average those together.

It's easy enough to do a group-by-count-sort and count the polling company with the most polls since September 2016:

SELECT poll_name, 
  COUNT(*) AS ct
FROM polls
WHERE date > '2016-09'
GROUP BY poll_name
ORDER BY ct DESC
LIMIT 10;

It's SurveyMonkey by a landslide, though Emerson College, which shows up as 2 aggregates because of inconsistent naming, looks to be plentiful, too:

poll_name	ct
SurveyMonkey	120
Emerson Coll.	39
PPP	24
Monmouth U.	20
Quinnipiac U.	20
SurveyUSA	20
YouGov	20
Marist Coll.	16
Opinion Research	12
Emerson College	9

Nate Silver doesn't seem to be a huge fan of SurveyMonkey, giving it a C. But if we do a group-by state for recent SurveyMonkey polls, thus getting a list of all states for which SM has polled recently, we can see that SurveyMonkey has total coverage of all 51 states:

SELECT COUNT(*)
FROM(
      SELECT state,
        COUNT(*) AS ct
      FROM polls
      WHERE date > '2016-10'
        AND poll_name = 'SurveyMonkey'
      GROUP BY state
);

Sounds good enough to me.

Now we get an average of rep and dem per state…

SELECT 
  state, AVG(rep), AVG(dem)
FROM polls
WHERE date > '2016-10'
  AND poll_name = 'SurveyMonkey'
GROUP BY state

…but filter out states that are already listed in predictions, using this jury-rigged subquery with NOT IN:

SELECT 
  state, AVG(rep), AVG(dem)
FROM polls
WHERE date > '2016-10'
  AND poll_name = 'SurveyMonkey'
  AND state NOT IN (SELECT state FROM predictions)
GROUP BY state;

The insertion is just a little more code…

INSERT INTO 
  predictions(sunetid, state, Trump, Clinton)
SELECT 
  'stanford', state, AVG(rep), AVG(dem)
FROM polls
WHERE date > '2016-10'
  AND poll_name = 'SurveyMonkey'
  AND state NOT IN (SELECT state FROM predictions)
GROUP BY state;

OK, enough examples for now. Keep reading for more details. But your best bet is to try out the spreadsheet of queries.

The data

Let's look more closely at the data contained in the elections data tables

Guessing the national popular vote

The first requirement of this exercise is that you make a guess about what the percentage popular vote will be for Clinton and Trump (to keep things simple, we're keeping it between these two and not third-party candidates).

Start off with the results table of the database, which looks like this on Google Sheets:

romney and obama indicate how much of the state's popular vote each candidate received. Pretty easy to figure out who won which state and by what margins and so forth.

For 2016, it's a guess. You can use your gut: Trump: 88%, Clinton 5%. Just make sure the numbers add up to 100% or less (probably less).

But maybe you don't like guessing. That's OK, that's what Google and news sites are for.

Here's a few resources for getting the national (and local) picture:

New York Times Latest Election Polls 2016

FiveThirtyEight is of course, polling central

But be careful that you don't interpret "Chance of winning" as their polling prediction, i.e., 538 doesn't think that Clinton will get 71.9% of the popular vote:

Instead, check out its table for Popular Vote:

Huffington Post has a wonderful page and data (though I couldn't find its popular vote):

And then you have the somewhat (in)famous Daybreak Poll from the Los Angeles Times and USC, "infamous" because several people really don't like seeing its trend line:

Note that the Daybreak poll isn't a typical poll. It doesn't ask people to make a decision, but to rate on scale from 0 to 100 on how likely they are to vote. The poll was used in 2012 and apparently was one of the most accurate, guessing Obama's victory margin to be 3.32 points (it was 3.85).

State data

The state data – particularly the polls data – used for this lesson comes courtesy of Electoral-Vote.com, which you can peruse for a lot of interesting historical data.

Check out the polls in your database. Here's what it looks like in Google Spreadsheets:

It's not just one poll per state per race. Pollsters almost all run more than one poll per state, and the timing is important. For some states, major pollsters don't even bother checking much (think of how some states aren't of much interest when it comes to predicting how they'll sway).

538's Pollster Ratings

Not all Pollsters are equal, and Nate Silver at FiveThirtyEight has put together a great table showing his analysis of the bias and relative success of each pollster:

I would have loved to include that data in our exercise; in fact, 538 has a repo for the raw CSV, so feel free to import it yourself.

However, it requires the hardest work of all – data cleaning and reconciliation of names – before you can easily join it to polls. For instance, "PPP" in the Electoral-Vote.com data corresponds to, "Public Policy Polling" with 538.

That said you, can still just glance at 538's table

Learning SQL beyond SELECT

There's not enough time in this course to cover the SQL statements that, unlike SELECT, modify the database. So this is a good time to shove that in. Luckily, most of the syntax and concepts are the same, so it's an opportunity to practice what we've learned with JOINS and aggregations.

The INSERT statement

To insert a row into a table, we use this syntax: INSERT INTO to specify the table and columns (and their exact order). And then VALUES to specify the row of values:

INSERT INTO some_table(x, y, z)
VALUES(111, 222, 'hello');

– where x, y, z are columns in some_table.

To insert the row for the required "Nationwide" guess, with the columns sunetid, state, Trump, and Clinton being set, here's how to do it manually:

INSERT INTO predictions(sunetid, state, Trump, Clinton)
VALUES('dun', 'Nationwide', 49, 48);

What if you hit the Execute button too many times? Then you get too many rows:

The DELETE statement

Sometimes you just have to wipe things out. To wipe out every row in the predictions table:

DELETE FROM predictions;

Deleting conditionally with WHERE

Sometimes you need a bit more precision. Pretend you have most of your data in the table and you make one mistake – you have the wrong value for Trump/Clinton for the New York row. Then add a WHERE clause:

DELETE FROM predictions
WHERE state = 'New York';

Of course, the above condition will delete every row

The UPDATE statement

UPDATE predictions
SET sunetid = 'dun';

Set Trump all the way!

UPDATE predictions
SET Trump = 100;

UPDATE on a conditional

Target a row with WHERE and update a field:

UPDATE predictions
SET Clinton = 55
WHERE predictions.state = 'Nationwide';

Target a row and update 2 fields at once:

UPDATE predictions
SET Clinton = 30, Trump = 65
WHERE predictions.state = 'California';

And of course, you can dumbly update multiple rows with any condition(s) you want. The following will update all states with a name longer than "ClintonTrump" with the Clinton and Trump columns set to 99:

UPDATE predictions
SET Clinton = 99, Trump = 99
WHERE LENGTH(state) > LENGTH('Clinton') + LENGTH('Trump');

Combining INSERT INTO with SELECT

What if we wanted to fill predictions with the values for romney and obama from results, but for Trump and Clinton, because we think that nothing ever changes etc etc? Then it would seem convenient to dump rows from one table into another.

No JOIN/UPDATE in SQLite

You might think of it as simply SELECTing the pertinent columns from results:

SELECT state, romney, trump 
FROM results;

And joining to an UPDATE statement on predictions and doing a SET like before:

UPDATE predictions
INNER JOIN
  results ON 
    results.state = predictions.state
SET predictions.state = results.state,
  Trump = romney,
  Clinton = obama;

Not a bad guess, because that's what I guessed. But SQLite doesn't support this syntax.

INSERT INTO then SELECT

Luckily, INSERT INTO can work in conjunction with a standard SELECT statement. Here's how to insert my SUNETID, state names from results, and 50, 50, respectively into Trump and Clinton:

(Note how we don't use VALUES)

INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT 'dun', state, 50, 50
FROM results;

That SELECT statement can be any query. Here's how to fit each state row with a corresponding rep and dem group average from polls:

INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT 'dun', state, AVG(rep), AVG(dem)
FROM polls
GROUP BY state;

Sample JOIN with ethnic demographic data

But why stop with a single SELECT statement? What if we want to use the demographic data in the ACS-1-year table, i.e. acs_1yr, as well as the 2012 results for Obama and Romney.

Let's buy into the assumption that Hispanic voters will be reluctant to vote for Trump. Hispanics' affinity towards Obama and Democrats is already baked into the Obama vs. Romney 2012 votes. But maybe we think that Hispanic votes will be even more spurred to vote Democrat.

Here's how we would quantify that:

If the Hispanic population is greater than 20% of the total population for a given state, set the Clinton and Trump to 1.015. and 0.97 of the corresponding obama and romney values.

Selecting, joining Hispanic states and voting results

(remember that the state name is in name for acs_1yr)

SELECT name
FROM acs_1yr
WHERE (100 * hispanic / total_population) > 10;

Now, the basic select for results:

SELECT state, romney, obama
FROM results;

Joining:

SELECT results.state,
  romney * 0.97,
  obama * 1.015 
FROM results
INNER JOIN acs_1yr
  ON acs_1yr.name = results.state
WHERE (100 * hispanic / total_population) >= 10;

And then inserting:

INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT 
  'dun', results.state,
  romney * 0.97, obama * 1.015
FROM results
INNER JOIN acs_1yr
  ON acs_1yr.name = results.state
WHERE (100 * hispanic / total_population) >= 10;

Else, keep Clinton and Trump the same as Obama and Romney for now – i.e. insert all other states that don't have a Hispanic population +10%

INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT 
  'dun', results.state,
  romney, obama
FROM results
INNER JOIN acs_1yr
  ON acs_1yr.name = results.state
WHERE (100 * hispanic / total_population) < 10;

Updates using subqueries

(Note that doing a subquery to generate a list of things to use IN with won't be on the Midterm)

Nate Silver told ABC News that he doesn't see the same turnout from African-American voters as in 2012, which could be pivotal for some major states.

So let's use that bit of info to subtract from Clinton's turnout.

Selecting from all states in which blacks make up 10+ percent of the population:

SELECT * 
FROM acs_1yr
WHERE (100 * black / total_population) >= 10;

We can't join with an UPDATE statement, but we can use a subquery like so; the subquery acts as a list of state names for the WHERE condition:

UPDATE predictions
SET Clinton = Clinton * 0.99
WHERE predictions.state IN(
  SELECT name
  FROM acs_1yr
  WHERE (100 * black / total_population) >= 10
)

Individual updates

And of course you can always set values manually. The North Carolina results, being mostly based off of Romney's win in 2012, gives poor chances to Clinton. But you see in the New York Times that it's basically a statistical dead heat at 44%

So you could manually tweak it:

UPDATE predictions
SET Clinton = 44.0, Trump = 44.1
WHERE state = 'North Carolina';

Or, you could pull in from the polls data, such as selecting the average of the North Carolina polls since mid-October:

UPDATE predictions
SET 
  Clinton = (SELECT AVG(dem) FROM polls WHERE date > '2016-10-15'
    AND state = 'North Carolina'),
  Trump = (SELECT AVG(rep) FROM polls WHERE date > '2016-10-15'
    AND state = 'North Carolina');

…but actually, don't do the above. Because you need to make sure that state inside the subquery, and inside the main query, are equal to 'North Carolina':

(confusing, I know…which is why this is not on the test)

UPDATE predictions
SET 
  Clinton = (SELECT AVG(dem) FROM polls WHERE date > '2016-10-15'
             AND state = 'North Carolina'),
  Trump = (SELECT AVG(rep) FROM polls WHERE date > '2016-10-15'
           AND state = 'North Carolina')  
WHERE state = 'North Carolina';

If you didn't mess up, and you haven't already manually inputted your Nationwide estimate, here's how to manually insert a new row:

INSERT INTO  
  predictions(sunetid, state, Clinton, Trump)
  VALUES('dun', 'Nationwide', 48.8, 46.3);

Messing up is OK because playback is fun!

But pretend you did mess up. Well that's the fun part of keeping track of your queries. I recommend throwing them into a spreadsheet for easier copy-and-paste playback. Plus, you already have to do that for the requirements of this assignment.

All over again, with a paste

Starting over is as easy as copying-and-pasting all the steps in, all at once. That semi-colon which doesn't seem to do anything? Well, when you need to do a bunch of things in a batch, especially non-selecting queries, it's efficient to paste all the lines in. The semi-colons tell the interpreter when one commands and the other begins.

Note: if you're using a client like DB Browser, it's best, after you've pasted the query, to Select All (Cmd-A) and then execute. Sometimes I hit the "execute line" by accident.

DELETE FROM predictions;

/* add states with high numbers of hispanics*/
INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT 
  'dun', results.state,
  romney * 0.97, obama * 1.015
FROM results
INNER JOIN acs_1yr
  ON acs_1yr.name = results.state
WHERE (100 * hispanic / total_population) >= 10;

/* insert all the other states */
INSERT INTO predictions(sunetid, state, Trump, Clinton)
SELECT 
  'dun', results.state,
  romney, obama
FROM results
INNER JOIN acs_1yr
  ON acs_1yr.name = results.state
WHERE (100 * hispanic / total_population) < 10;

/* adjust Clinton's states where there's a large black population
inspired to vote for Obama */
UPDATE predictions
SET Clinton = Clinton * 0.99
WHERE predictions.state 
  IN(
    SELECT name
    FROM acs_1yr
    WHERE (100 * black / total_population) >= 10
);


/* Adjust North Carolina with this ugly ugly code 
...yeah, probably don't do this on your own*/
UPDATE predictions
SET 
  Clinton = (SELECT AVG(dem) FROM polls WHERE date > '2016-10-15'
             AND state = 'North Carolina'),
  Trump = (SELECT AVG(rep) FROM polls WHERE date > '2016-10-15'
           AND state = 'North Carolina')  
WHERE state = 'North Carolina';

/* finally, insert my national predictions */
INSERT INTO  
  predictions(sunetid, state, Clinton, Trump)
  VALUES('dun', 'Nationwide', 48.8, 46.3);

Checks and balances list

OK, before finishing up, let's make sure you have all of these things done in your predictions table:

There should be 52 rows.
There should be 538 electoral votes total (like 538 get it?)
There should be one row in which the state is Nationwide (with proper capitalization)
sunetid should be straight-filled with your lowercase Stanford SUnet ID, e.g. bobby.
The total of Trump and Clinton for a given row should be between 0 and 100.
No rows should have a negative Trump or Clinton

Can you figure out the queries needed to confirm all of the above? See if you can write them, then see how you compare to me.

There should be 52 rows

Easy:

SELECT COUNT(*) FROM predictions;

There should be 538 total electoral votes

The electoral_votes is in results, so this will add up:

SELECT SUM(electoral_votes) FROM results;

But we already knew that. We need to know if that corresponds to what we have in predictions:

SELECT SUM(electoral_votes) FROM results
INNER JOIN predictions
  ON predictions.state = results.state;

If your predictions has mismatched/misspelled names, then INNER JOIN won't result in a 52-row match, nor an accurate sum of electoral_votes

Optional: count by candidate

If you're interested in how your electoral votes per candidate held up, it's a simple SUM plus JOIN and filter with WHERE.

To get Trump's count of electoral votes, do a SUM of results.electoral_votes (we only need one column), joining predictions to results on the state field.

Then filter out rows on the condition that Trump (his share of the popular vote) is bigger than Clinton

SELECT SUM(results.electoral_votes) 
FROM predictions
INNER JOIN
  results ON
    results.state = predictions.state
WHERE (Trump - Clinton)  > 0;

CASE is cool

If you want to be fancy and practice a very useful thing not on the test, use the CASE expression, which is SQLite's version of a spreadsheet's if/else.

With CASE, you effectively can create a category column, i.e. Trump when he has more votes, Clinton otherwise. In the previous query, we would have to run it twice.

Note how the CASE expression is aliased AS winner…that big thing is just a column value in the end.

The query below, we run it once and do good ol' GROUP BY:

SELECT
  CASE WHEN Trump > Clinton 
       THEN 'Trump'
       ELSE 'Clinton' 
    END AS winner,
  SUM(results.electoral_votes) AS evotes
FROM predictions
INNER JOIN
  results ON
    results.state = predictions.state
GROUP BY winner
ORDER BY evotes DESC;

winner	evotes
Clinton	347
Trump	191

One row with `Nationwide` as state

The following query should equal 1:

SELECT COUNT(*) FROM predictions 
WHERE `state` = 'Nationwide';

All `sunetid` values should be your SUNet ID

The following query should equal 0:

SELECT COUNT(*) FROM predictions 
WHERE sunetid != 'your_sunet_id_here';

0 <= Trump + Clinton <= 100

There should be 0 rows in which the following condition is true:

SELECT COUNT(*) FROM predictions
WHERE (Trump + Clinton) 
  NOT BETWEEN 0 AND 100;

Trump >= 0 AND Clinton >= 0

No rows should have either Trump nor Clinton below 0:

SELECT COUNT(*) FROM predictions
WHERE 
  Trump < 0 OR Clinton < 0;

All done

If you got here, then your predictions table is ready to be exported to CSV as is, i.e. do SELECT * FROM predictions;

Then import it into your elections-predictions spreadsheet.

Don't forget to create a second spreadsheet in elections-predictions, in which every row is just a single column of a query (ended by a semi-colon) needed to build your table. Feel free to add notes in adjacent columns.

You can see how I did it here:

My exported predictions

List of queries with optional commentary

More queries to try

If you haven't already, check out my spreadsheet of election queries, which also has copies of the three data tables: padjo-2016-election-queries

Make a copy of the spreadsheet, or at least the queries page, and sort-and-search and try the queries on your own:

Homework: Election Predictions with SQL

The Contest

The deliverable

Automate the boring analysis

Practice those queries

Getting started

Download the database

See the elections2016 spreadsheet with queries on Google Sheets

What your elections-predictions spreadsheet should look like

Let's just get it done

MUST GET IT DONE BY 5PM

1. Delete every record from predictions

2. Insert a nationwide prediction

3. Fill in the rest of predictions with a list of states from any of the other tables:

4. Making sure the data makes sense

4. Exports to Google spreadsheet

Let's practice joins and other intellectual fun

1. Delete every record from predictions

2. Insert a nationwide prediction

3. Join and include data from last year's election

4. Add demographic data penalty/benefits

5. An anti-immigration boost for Trump

5. Counting electoral votes

Trying our hand at polling

1. Fill in the poll numbers from the most comprehensive polls since October 2016

2. Fill in the states that are missing with fun SurveyMonkey polls

The data

Guessing the national popular vote

State data

538's Pollster Ratings

Learning SQL beyond SELECT

The INSERT statement

The DELETE statement

Deleting conditionally with WHERE

The UPDATE statement

UPDATE on a conditional

Combining INSERT INTO with SELECT

No JOIN/UPDATE in SQLite

INSERT INTO then SELECT

Sample JOIN with ethnic demographic data

Selecting, joining Hispanic states and voting results

Updates using subqueries

Individual updates

Messing up is OK because playback is fun!

All over again, with a paste

Checks and balances list

There should be 52 rows

There should be 538 total electoral votes

Optional: count by candidate

CASE is cool

One row with Nationwide as state

All sunetid values should be your SUNet ID

0 <= Trump + Clinton <= 100

Trump >= 0 AND Clinton >= 0

All done

More queries to try

What your `elections-predictions` spreadsheet should look like

1. Delete every record from `predictions`

3. Fill in the rest of `predictions` with a list of states from any of the other tables:

1. Delete every record from `predictions`

One row with `Nationwide` as state

All `sunetid` values should be your SUNet ID