SQL Part 1: Select, Sort, and Transform Data
This this first unit of SQL lessons, we cover the syntax and concepts to fetch, sort, and – as we’ve done using spreadsheet functions – transform data values.
This article is part of a series:
An overview of selecting, sorting, and transforming data
This tutorial has two roles:
1. A walkthrough of basic SQL syntax and concepts, with a variety of exercises and real-world data to practice
We are using Census data for this SQL walkthrough. I won't be spending much time explaining the details or the foundational basics, because we don't need to worry about them yet. Instead, we'll focus on the "big picture" of learning and using SQL. The details can come later.
A landing page for the first unit of SQL lessons
The details – from how to write "Hello, world" in SQL (and re-write, break, and re-write), to the difference between single-quote-marks and double-quote-marks – will be covered in standalone lessons, to ease the cognitive overload. For example, last year, SELECT
and FROM
were in the same lesson. Now, they're in 2 separate lessons.
Note: These are still being refactored and cleaned from when they were just one, giant book of a webpage. Apologies for the delay, the rewrite/revisions ended up being more in-depth than anticipated:
- The SELECT statement: our most common kind of query. Virtually every program we write starts off with
SELECT
. - The FROM clause: A simple keyword used to specify exactly what datatable to select data from.
- LIMIT and ORDER: Sort results in the order you need; limit the distraction and damage from data overload.
- Using functions to transform data values: Remember doing functions in spreadsheets. Same deal with SQL.
- Aliasing: real-world databases have awful, confusing names. We can specify our own labels for tables, columns, and calculations, to write more readable queries.
Follow the walkthrough
So please read along and do the exercises, as knowledge and competency is not required. You'll get some experience from just typing out the queries.
- Start off with this backgrounder on Census data
- Download the database in SQLite format (here's the link) and open it in a SQLite client,
- e.g. DB Browser.
- Look at a copy of the data on Google Sheets. Anticipate interesting questions. Virtually all of the walkthrough's exercises can be done in a spreadsheet.
And here's the walkthrough
Each header is a new question. Some questions have multiple queries and examples
For what it's worth, the format of this walkthrough is a lot like the take-home midterm from last year, except a lot easier.
1. A simple list of states
Make a simple list of U.S. states with 3 columns: the name of the state, total population, median household incomes,
This is a nice warmup: print 3 columns for each of the states
records. This simple exercise also makes clear some fundamental differences in how work we work with spreadsheets versus databases.
You know how to solve this challenge in a spreadsheet. Make a copy of my Google spreadsheet. Open it in a webbrowser. Boom, you're pretty much done – the first tab in the spreadsheet is states
Now, with a database, we can't even open it in a web browser. We have to download and install a special client – such as DB Browser. Open the database file with DB Browser. And even with the niceties that DB Browser has, we have to work to just even see our data.
And in the client view that we'll be spending most of our time: it's just a blank text box:
To get data, we have to give commands:
SELECT name, total_population, median_household_income
FROM states;
This query tells the database engine that we want 3 columns: name
, total_population
, and median_household_income
. And we want these columns from the states
table, because unlike a spreadsheet, there is no default table. And database engine won't assume for us, even when the database has just one table.
If you're following along on your own computer and you're using a user-friendly client, just re-type the query into the query box. Then press the "Execute button":
For the remainder of this walkthrough, I won't be using screenshots of running SQLite queries using a client like DB Browser. It doesn't matter what client we use, it's all about the code we write and the results we get.
I'll be portraying the results as a table to keep things simple; in this example, I've chopped the list from all 50+ states to just the first few, for brevity's sake.
name | total_population | median_household_income |
---|---|---|
Alabama | 4858979 | 44765 |
Alaska | 738432 | 73355 |
Arizona | 6828065 | 51492 |
Arkansas | 2978204 | 41995 |
California | 39144818 | 64500 |
Colorado | 5456574 | 63909 |
Connecticut | 3590886 | 71346 |
OK, writing that SQL query was undoubtedly more upfront work just to get data, compared to the average spreadsheet. Learning a new language is tough. Learning grammar is always a pain. It might take you some time when you're at the point where it doesn't take you 10 minutes to go from blank scree to usable code.
But forget about that now. On the positive side, when we wrote the query and executed it, we got exactly the list we wanted. A row for each state, and 3 columns in the exact order.
Meanwhile, with the spreadsheet – just because you opened a spreadsheet doesn't mean you accomplished the task. The data is supposed to be a 3 column list, not the entire spreadsheet.
To do the work on the spreadsheet version, we would have to:
- Make a new copy of the
states
spreadsheet, assuming we want to keep using its data. - Manually hide or delete every column that is not
name
,total_population
, ormedian_household_income
. - Rearrange the existing columns as necessary.
Writing SQL will feel slower at first because…it's a new language. But with a little practice and familiarity, your database programming can match, if not surpass, your speed with spreadsheets.
2. Sort states by population
List the names of each state and their total population. Sort the list in descending order of total population.
The states
table seems to have been created by inserting records in alphabetical order of names. Not a bad default, but now we want to see states listed in descending order (biggest first) of total_population
.
Doing this in a spreadsheet is pretty easy: you sort the total_population
column. One drawback, though, is that the spreadsheet is changed. Sure, you can change it back by sorting it by name. But, not only is that more work, humans can be notoriously bad at undoing changes to data.
Consider the SQL approach.
Again, we do have to write out a query. We have to specify each column name and the specific table. And, for sorting, there's a new SQL keyword and clause ORDER BY (somevalue) ASC (or DESC)
to learn, memorize, and screw up in the future.
Here's one way to write the query (there's not much room for creativity):
SELECT name, total_population
FROM states
ORDER BY total_population DESC;
The top rows of the truncated results:
name | total_population |
---|---|
California | 39144818 |
Texas | 27469114 |
Florida | 20271272 |
New York | 19795791 |
Illinois | 12859995 |
Pennsylvania | 12802503 |
Ohio | 11613423 |
Georgia | 10214860 |
North Carolina | 10042802 |
Michigan | 9922576 |
New Jersey | 8958013 |
And again, unlike with a spreadsheet, we didn't have to do a bunch of work to hide all the unwanted column.
In fact, an important though subtle aspect of how SQL handles SELECT queries is that the queries do not alter the data, nor anything else about the tables.
For all intents and purposes, our queries create new tables of results. We can tweak queries as much as we want without worrying that we've accidentally changed something, anywhere in our database. And the table that we've queried for results, we can feel confident that, extreme scenarios aside, its integrity and consistency will be intact.
In fact, reliability and consistency are the features that make databases so invaluable and ubiquitous in all aspects of our information and transaction-based society. Ironically, these features have almost no bearing on the data work we do as journalists, and they aren't even the most interesting features when it comes to data work. Though, obviously it is nice to run queries on a database, and get fast, reliable results. But our query needs are just small potatoes to the average database.
3. The 10 oldest states
List the names and median ages of the top 10 states, sorted by the median age in descending order
Pretty similar to the previous query. But we add one more keyword and clause: LIMIT
, which comes after ORDER BY
and specifies the number of results to show.
SELECT
name, median_age
FROM
states
ORDER BY
median_age DESC
LIMIT
10;
I didn't truncate the results; there are exactly 10 rows because we specified it via LIMIT 10
:
name | median_age |
---|---|
Maine | 44.6 |
Vermont | 43.1 |
New Hampshire | 42.8 |
West Virginia | 42.2 |
Florida | 41.8 |
Pennsylvania | 40.7 |
Connecticut | 40.6 |
Puerto Rico | 40.0 |
Montana | 39.9 |
Delaware | 39.7 |
I love LIMIT
. So simple, and so straight to the core of data work. We live in a world of increasing data but non-increasing attention spans. We need as many tools to cut to the point as possible.
On a sidenote: this is where I show how little I know about a modern spreadsheet. Is it fairly routine in a workflow to sort data, then limit it to the top X rows? I suppose it can be as easy as hiding columns. But you always have to do the work to unhide things later.
4. Congressional districts, ranked by difference in number of Hispanic vs. white residents
List the name of each Congressional district, as well as the difference between number of Hispanic and white residents. Show only the top 10 districts as ranked by this calculated difference
OK, let's turn our attention to the congressional_districts
table. The data columns are the same as they are in states
. We're just dealing with different, smaller geographies. Also, there are a lot more Congressional districts than U.S. states.
The difference between number of white residents and Hispanic residents is calculated by, subtracting white
from hispanic
. Or is it the other way around?
Let's try a warmup query that selects just the name of each district:
We can add hispanic
and white
as 2 separate columns:
SELECT name, hispanic, white
FROM congressional_districts
limit 10;
name | hispanic | white |
---|---|---|
Congressional District 1 (114th Congress), Alabama | 21301 | 459077 |
Congressional District 2 (114th Congress), Alabama | 20532 | 427396 |
Congressional District 3 (114th Congress), Alabama | 21104 | 479091 |
Congressional District 4 (114th Congress), Alabama | 43398 | 574436 |
Congressional District 5 (114th Congress), Alabama | 35846 | 516031 |
Congressional District 6 (114th Congress), Alabama | 29320 | 538323 |
Congressional District 7 (114th Congress), Alabama | 21369 | 209722 |
Congressional District (at Large) (114th Congress), Alaska | 51825 | 452472 |
Congressional District 1 (114th Congress), Arizona | 173716 | 367186 |
Congressional District 2 (114th Congress), Arizona | 198714 | 442634 |
So how do we get a column containing the result of subtracting hispanic
from white
? Let's strip out what we SELECT
in our query. Forget about name
, hispanic
, and white
.
Let's select for 99 - 1
, as ridiculous as that might sound:
SELECT 99 - 1
FROM congressional_districts
LIMIT 10;
The predictable, but somewhat strange results (notice the header vs the 10 result rows):
99 - 1 |
---|
98 |
98 |
98 |
98 |
98 |
98 |
98 |
98 |
98 |
98 |
Let's get back to our previous query, now that we have a little assurance that a column can consist of a mathematical operation between two columns. And let's ditch the separate columns for hispanic
and white
: we just need the difference:
SELECT name, hispanic - white
FROM congressional_districts
limit 10;
name | hispanic - white |
---|---|
Congressional District 1 (114th Congress), Alabama | -437776 |
Congressional District 2 (114th Congress), Alabama | -406864 |
Congressional District 3 (114th Congress), Alabama | -457987 |
Congressional District 4 (114th Congress), Alabama | -531038 |
Congressional District 5 (114th Congress), Alabama | -480185 |
Congressional District 6 (114th Congress), Alabama | -509003 |
Congressional District 7 (114th Congress), Alabama | -188353 |
Congressional District (at Large) (114th Congress), Alaska | -400647 |
Congressional District 1 (114th Congress), Arizona | -193470 |
Congressional District 2 (114th Congress), Arizona | -243920 |
We still haven't figured out what order to subtract hispanic
from white
. Nor have we sorted the results, which by default lead with the state of Alabama.
There might not be a "correct" answer to how we subtract the values. Perhaps we're interested in districts where Hispanic residents have the most parity (or outnumber) white residents. Subtracting hispanic - white
and sorting by biggest value first would answer our curiosity.
Because hispanic - white
isn't a column name, per se, how do we sort it? Let's just roll with it for now:
SELECT name, hispanic - white
FROM congressional_districts
ORDER BY hispanic - white DESC
limit 10;
Apparently, that weird ORDER BY
class is valid SQL:
name | hispanic - white |
---|---|
Resident Commissioner District (at Large) (114th Congress), Puerto Rico | 3410957 |
Congressional District 40 (114th Congress), California | 596400 |
Congressional District 29 (114th Congress), Texas | 513683 |
Congressional District 34 (114th Congress), Texas | 499374 |
Congressional District 15 (114th Congress), Texas | 495643 |
Congressional District 16 (114th Congress), Texas | 483470 |
Congressional District 15 (114th Congress), New York | 479809 |
Congressional District 27 (114th Congress), Florida | 452811 |
Congressional District 28 (114th Congress), Texas | 451496 |
Congressional District 44 (114th Congress), California | 446237 |
That said, the syntax should feel weird. So, let's introduce a new concept: Aliasing.
No mystery here: aliasing a column or table name means renaming it. This includes giving a friendlier name to ad-hoc calculated columns. Pay special note to how I use the hdiff
alias in the ORDER BY
clause, too:
SELECT name, (hispanic - white) AS hdiff
FROM congressional_districts
ORDER BY hdiff DESC
limit 10;
Let's reverse the subtraction and find districts with the most white residents compared to Hispanic residents. You can keep the alias of hdiff
, but it's a good habit to properly rename things as you tweak your queries:
SELECT name, (white - hispanic) AS wdiff
FROM congressional_districts
ORDER BY wdiff DESC
limit 10;
name | wdiff |
---|---|
Congressional District (at Large) (114th Congress), Montana | 857243 |
Congressional District 3 (114th Congress), Missouri | 691121 |
Congressional District (at Large) (114th Congress), South Dakota | 680602 |
Congressional District 5 (114th Congress), Kentucky | 671852 |
Congressional District 8 (114th Congress), Missouri | 666221 |
Congressional District 7 (114th Congress), Ohio | 663250 |
Congressional District 1 (114th Congress), Iowa | 657752 |
Congressional District 6 (114th Congress), Ohio | 653298 |
Congressional District 4 (114th Congress), Kentucky | 651205 |
Congressional District 2 (114th Congress), Missouri | 648838 |
Another way to deal with the "which number should be subtracted from the other" is to calculate the absolute value of the difference. This is a slightly different question: we no longer care which districts are more white than Hispanic, and vice versa. We just want to see the districts with the biggest differences, either way.
To get the absolute value of any column, use the ABS()
function:
SELECT name, ABS(white - hispanic) AS diff
FROM congressional_districts
ORDER BY diff DESC
limit 10;
Now, with one query, we can see what the biggest gaps are, instead of running two queries with 2 different sorts:
name | diff |
---|---|
Resident Commissioner District (at Large) (114th Congress), Puerto Rico | 3410957 |
Congressional District (at Large) (114th Congress), Montana | 857243 |
Congressional District 3 (114th Congress), Missouri | 691121 |
Congressional District (at Large) (114th Congress), South Dakota | 680602 |
Congressional District 5 (114th Congress), Kentucky | 671852 |
Congressional District 8 (114th Congress), Missouri | 666221 |
Congressional District 7 (114th Congress), Ohio | 663250 |
Congressional District 1 (114th Congress), Iowa | 657752 |
Congressional District 6 (114th Congress), Ohio | 653298 |
Congressional District 4 (114th Congress), Kentucky | 651205 |
But there is a loss of information. For example, we see the districts with the biggest disparity. But we don't know which way the disparity went.
Nothing stopping us from using the same subtraction operation as before, and putting it in a different column. In the example below, a positive number means there were more Hispanic residents. A negative number means more white residents:
SELECT name, ABS(white - hispanic) AS diff,
hispanic - white AS hdiff
FROM congressional_districts
ORDER BY diff DESC
limit 10;
Only Puerto Rico has a large ratio of Hispanic-to-White residents, at least in the top 10:
name | diff | hdiff |
---|---|---|
Resident Commissioner District (at Large) (114th Congress), Puerto Rico | 3410957 | 3410957 |
Congressional District (at Large) (114th Congress), Montana | 857243 | -857243 |
Congressional District 3 (114th Congress), Missouri | 691121 | -691121 |
Congressional District (at Large) (114th Congress), South Dakota | 680602 | -680602 |
Congressional District 5 (114th Congress), Kentucky | 671852 | -671852 |
Congressional District 8 (114th Congress), Missouri | 666221 | -666221 |
Congressional District 7 (114th Congress), Ohio | 663250 | -663250 |
Congressional District 1 (114th Congress), Iowa | 657752 | -657752 |
Congressional District 6 (114th Congress), Ohio | 653298 | -653298 |
Congressional District 4 (114th Congress), Kentucky | 651205 | -651205 |
We're almost done here, but the last query resulted in a table with the information we wanted, but with additional, distracting noise in the results; diff
and hdiff
look equal, and they are, other than being positive/negative inverses.
We don't need hdiff
as a number: we just need it as a string literal H
or W
. There's a few ways to do this. I'm going to use what seems most straightforward: the CASE
statement.
Google Sheets has an IF
function so that we can fill a cell with one value or another, depending on a true/false condition.
SQLite has the CASE
function, which is more versatile than the binary either/or results of a typical IF
function. It's also incredibly more cumbersome to right.
But once you know how it works, it's just a matter of looking up an example – here's StackOverflow to the rescue – and adapting it to your own code.
Here's one way to do it: notice that whitespace, in the form of consecutive space characters, indentation, or newlines, doesn't affect the query. Use that to your advantage in writing and organizing more readable queries:
SELECT name, ABS(white - hispanic) AS diff,
CASE WHEN (hispanic - white) > 0
THEN 'H' ELSE 'W' END
FROM congressional_districts
ORDER BY diff DESC
limit 10;
I can clean this up with an alias of that really cumbersome CASE
statement. At the end of its work, it's just another column full of values:
SELECT name, ABS(white - hispanic) AS diff,
(CASE WHEN (hispanic - white) > 0
THEN 'H' ELSE 'W' END) AS h_or_w
FROM congressional_districts
ORDER BY diff DESC
limit 10;
name | diff | h_or_w |
---|---|---|
Resident Commissioner District (at Large) (114th Congress), Puerto Rico | 3410957 | H |
Congressional District (at Large) (114th Congress), Montana | 857243 | W |
Congressional District 3 (114th Congress), Missouri | 691121 | W |
Congressional District (at Large) (114th Congress), South Dakota | 680602 | W |
Congressional District 5 (114th Congress), Kentucky | 671852 | W |
Congressional District 8 (114th Congress), Missouri | 666221 | W |
Congressional District 7 (114th Congress), Ohio | 663250 | W |
Congressional District 1 (114th Congress), Iowa | 657752 | W |
Congressional District 6 (114th Congress), Ohio | 653298 | W |
Congressional District 4 (114th Congress), Kentucky | 651205 | W |
One thing I have to point: until just now, I don't think I've ever written a successful CASE
expression. Which goes to show how even experienced programmers can screw code up. And, that I've apparently been able to dabble in SQLite for years without needing the very useful functionality that CASE
provides.
4. Average number of households by state
Just more calculation.
For each state, list the name, the median household income, and the average number of people per household. Show the top 10 states by most number of people per household.
Back to the states
table:
SELECT "name", "median_household_income",
ROUND("total_population" * 1.0 / "total_households", 2) as "avgpplperhouse"
FROM "states"
ORDER BY "avgpplperhouse" DESC
LIMIT 10;
name | median_household_income | avgpplperhouse |
---|---|---|
Utah | 62912 | 3.22 |
Hawaii | 73486 | 3.21 |
California | 64500 | 3.04 |
Alaska | 73355 | 2.95 |
Texas | 55653 | 2.92 |
Puerto Rico | 18626 | 2.84 |
New Jersey | 72222 | 2.81 |
Georgia | 51244 | 2.79 |
Arizona | 51492 | 2.77 |
Idaho | 48275 | 2.77 |
5. The average income per person per household
List the name of each state and the per capita income, and then the ratio of median household income versus average people per home, and list the top 10 states with the highest ratio
OK, we're kind of veering into questionable statistics, though I don't know enough statistics to explain why quantifying the amount of median income per average-people-per-home. That's fine, this is just exercise.
It's not any more difficult than the previous queries. Just a little more calculation. And more punctuation to correctly line up.
This is more a test of how well you can keep track of arithmetic, order of operations, and parentheses. So, yeah, the query is going to look ugly, but it's just arithmetic:
SELECT "name", "per_capita_income",
ROUND("median_household_income" /
("total_population" * 1.0 / "total_households"), 2)
AS "avg_income_per_person_in_house"
FROM "states"
ORDER BY "avg_income_per_person_in_house" DESC
LIMIT 10;
What is the meaning of the big difference between per_capita_income and the crazy ratio we calculated? Good question, I tried looking it up in the Census website. Per capita income is kind of just division, but there may be other factors in the calculation. Our average income per person in house most definitely involves dividing household income by anyone in the house, including children.
name | per_capita_income | avg_income_per_person_in_house |
---|---|---|
District of Columbia | 75628 | 31702.02 |
Maryland | 75847 | 27502.29 |
New Hampshire | 70303 | 27348.32 |
Connecticut | 71346 | 26697.54 |
Massachusetts | 70628 | 26610.68 |
New Jersey | 72222 | 25702.25 |
North Dakota | 60557 | 25079.14 |
Alaska | 73355 | 24853.1 |
Minnesota | 63488 | 24833.42 |
Virginia | 66262 | 24557.94 |
One easy way to confirm the impact of children on our calculated ratio: reverse the sort. If Utah, which has the highest birth rate in the Union, is on top – i.e. it has the lowest household income as averaged by household members – then yeah, we can make a safe assumption that children are the factor here.
So quick to flip that query
The previous query was admittedly painful to write. But now that it works, it is trival to flip the sort, or do other alterations, such as add new columns. For now, let's just see if Utah is on the other end of the sort:
SELECT "name", "per_capita_income",
ROUND("median_household_income" /
("total_population" * 1.0 / "total_households"), 2)
AS "avg_income_per_person_in_house"
FROM "states"
ORDER BY "avg_income_per_person_in_house" ASC
LIMIT 10;
name | per_capita_income | avg_income_per_person_in_house |
---|---|---|
Puerto Rico | 18626 | 6550.66 |
Mississippi | 40593 | 14981.53 |
Arkansas | 41995 | 16140.64 |
New Mexico | 45382 | 16580.37 |
West Virginia | 42019 | 16736.62 |
Alabama | 44765 | 17010.5 |
Louisiana | 45727 | 17014.35 |
Idaho | 48275 | 17427.02 |
Kentucky | 45215 | 17535.58 |
South Carolina | 47238 | 17923.74 |
What do we know: Utah isn't on the list when it's sorted the other way. Perhaps Utah's overall high standard of living and incomes more than outweighs the extra children in its denominator.
- Top 10 most Asian-populated Congressional Districts
List the names and percentage of Asian populations for the top 10 Congressional districts, nationwide, in order of highest percentage of Asian populations.
Percentage of Asians of total population is just dividing
asian
overtotal_population
. Because we want to sort the results by this derived value, we give the derived column an alias, e.g.asian_pct
– and refer to it in theORDER BY
clause:
Simple query, though this time we'll make a few small changes to the query to see how SQLite can, with the best of intentions, kind of scre4w things up.
SELECT name,
(asian / total_population) AS asian_pct
FROM congressional_districts
ORDER BY asian_pct DESC
LIMIT 10;
name | asian_pct |
---|---|
Congressional District 1 (114th Congress), Alabama | 0 |
Congressional District 2 (114th Congress), Alabama | 0 |
Congressional District 3 (114th Congress), Alabama | 0 |
Congressional District 4 (114th Congress), Alabama | 0 |
Congressional District 5 (114th Congress), Alabama | 0 |
Congressional District 6 (114th Congress), Alabama | 0 |
Congressional District 7 (114th Congress), Alabama | 0 |
Congressional District (at Large) (114th Congress), Alaska | 0 |
Congressional District 1 (114th Congress), Arizona | 0 |
Congressional District 2 (114th Congress), Arizona | 0 |
Hmmm, strange. The highest percentage of Asian seems to be…0?
Of course that's untrue. This is a case of how we need to be mindful of our data types – integers versus decimals – and also, how a query can go very wrong, yet the database engine happily processes our query and returns results without a warning or complaint.
The reason why everything was a 0
was because the number of Asians is always smaller than the total population: the result is always going to be less than 1.
To compound the confusion, SQLite's behavior can be…inconsistent. Clearly, it guessed that the values in the total_population
and asian
columns are integers. So, it returned a 0
– the closest integer value to the result of dividing a small number by a much bigger number.
I call SQLite's behavior "inconsistent", because SQLite is very lax about enforcing data types. We haven't gotten into schemas and data types, which involve telling SQLite, "Hey, this column should be treated as numbers. And that column of numbers should be treated as text", etc. Unlike other, more business-minded databases, such as PostgreSQL, SQLite will not break down if incompatible datatypes interact. This is one example of where we really, really wish SQLite would crash, rather than truck along.
Anyway, back to the division problem. The most straightforward workaround is to multiply the numerator, asians
by 100. The result will be a number big enough to be an integer (unless the percentage of Asians truly is 0, of course):
SELECT name,
(asian * 100.0 / total_population) AS asian_pct
FROM congressional_districts
ORDER BY asian_pct DESC
LIMIT 10;
There we go; as we might have expected, lots of districts in California:
name | asian_pct |
---|---|
Congressional District 17 (114th Congress), California | 52.7798233663491 |
Congressional District 1 (114th Congress), Hawaii | 48.2852135176608 |
Congressional District 6 (114th Congress), New York | 40.1535399335448 |
Congressional District 27 (114th Congress), California | 37.896238470676 |
Congressional District 15 (114th Congress), California | 33.504534853589 |
Congressional District 14 (114th Congress), California | 33.3057266115589 |
Congressional District 12 (114th Congress), California | 32.2335424870079 |
Congressional District 39 (114th Congress), California | 30.8399560249629 |
Congressional District 19 (114th Congress), California | 26.7918545945006 |
Congressional District 45 (114th Congress), California | 24.3962154105372 |
One more fix: the number of decimal points in asian_pct
seems…excessive. Let's use the ROUND
function and specify asian_pct
to be rounded to 2 decimal places:
SELECT name,
ROUND(asian * 100.0 / total_population, 2) AS asian_pct
FROM congressional_districts
ORDER BY asian_pct DESC
LIMIT 10;
It's not necessary, but I like throwing in an extra set of parentheses for the ROUND
function, to make things slightly more readable, and to make it more obvious that total_population
is not a value passed into the ROUND
function:
SELECT name,
ROUND((asian * 100.0 / total_population), 2) AS asian_pct
FROM congressional_districts
ORDER BY asian_pct DESC
LIMIT 10;
name | asian_pct |
---|---|
Congressional District 17 (114th Congress), California | 52.78 |
Congressional District 1 (114th Congress), Hawaii | 48.29 |
Congressional District 6 (114th Congress), New York | 40.15 |
Congressional District 27 (114th Congress), California | 37.9 |
Congressional District 15 (114th Congress), California | 33.5 |
Congressional District 14 (114th Congress), California | 33.31 |
Congressional District 12 (114th Congress), California | 32.23 |
Congressional District 39 (114th Congress), California | 30.84 |
Congressional District 19 (114th Congress), California | 26.79 |
Congressional District 45 (114th Congress), California | 24.4 |
7. Congressional districts ranked by poverty level
List the name, median household income, and percentage of population living below the poverty line of the top 10 congressional districts, as ranked by most median household income
This is straightforward: poverty percentage is the below_poverty_line
value divided by total_population
.
SELECT name, median_household_income,
ROUND(below_poverty_line * 100 / total_population, 1) AS povertypct
FROM congressional_districts
ORDER BY median_household_income DESC
LIMIT 10;
Do the results – that districts with the highest household incomes also have low proportion of poor people – seem surprising. Not really, then again, it's not unheard of of societies where there are a lot of rich people but a lot of poor people – ignoring the fact that the median household income measurement mitigates the chance that a district's wealth comes from a class of top 0.5% citizens. It's easier to believe that most people in Silicon Valley are just wealthy.
name | median_household_income | povertypct |
---|---|---|
Congressional District 18 (114th Congress), California | 120089 | 7.0 |
Congressional District 10 (114th Congress), Virginia | 114793 | 5.1 |
Congressional District 17 (114th Congress), California | 111024 | 6.7 |
Congressional District 11 (114th Congress), Virginia | 105031 | 7.4 |
Congressional District 3 (114th Congress), New York | 102602 | 5.6 |
Congressional District 11 (114th Congress), New Jersey | 102205 | 4.6 |
Congressional District 7 (114th Congress), New Jersey | 101746 | 4.8 |
Congressional District 8 (114th Congress), Virginia | 100644 | 7.5 |
Congressional District 15 (114th Congress), California | 100634 | 7.2 |
Congressional District 14 (114th Congress), California | 98129 | 9.4 |
But because it's so easy to alter a SQL query, no harm in reversing the sort and seeing the poverty rates in districts with low household incomes.
SELECT name, median_household_income,
ROUND(below_poverty_line * 100.0 / total_population, 1) AS povertypct
FROM congressional_districts
ORDER BY median_household_income ASC
LIMIT 10;
In retrospect, the result seems pretty predictable:
name | median_household_income | povertypct |
---|---|---|
Resident Commissioner District (at Large) (114th Congress), Puerto Rico | 18626 | NULL |
Congressional District 15 (114th Congress), New York | 26047 | 37.5 |
Congressional District 5 (114th Congress), Kentucky | 31339 | 27.9 |
Congressional District 13 (114th Congress), Michigan | 31789 | 31.8 |
Congressional District 2 (114th Congress), Mississippi | 32152 | 28.9 |
Congressional District 7 (114th Congress), Alabama | 33928 | 27.1 |
Congressional District 6 (114th Congress), South Carolina | 34566 | 23.6 |
Congressional District 5 (114th Congress), Florida | 35116 | 23.8 |
Congressional District 1 (114th Congress), North Carolina | 35148 | 24.4 |
Congressional District 7 (114th Congress), Arizona | 35243 | 32.1 |
Let's try one more iteration. Instead of focusing on the top 10 richest or poorest, let's look at the data trend across all 400+ Congressional districts. Brilliant plan, but the point of learning data analysis is because, for humans, observing data across 400 rows is mind-numbing.
So let's take the opportunity to create a visualization. Specifically, a scatter plot. We don't care so much about the actual values, just the correlation between median household income and poverty rate.
Re-execute the previous query, but remove the LIMIT
clause so that all results are returned.
SELECT name, median_household_income,
ROUND(below_poverty_line * 100.0 / total_population, 1) AS povertypct
FROM congressional_districts
ORDER BY median_household_income ASC;
Databases don't have visualization as a feature. But spreadsheets do. And the result of our query looks like spreadsheet data.
I'll skip the spreadsheet steps, but here's the resulting visualization. Seems like a pretty clear connection between the median household income and the percentage of impoverished persons in a district.
That said, don't rule out the possibility of outliers: for example, a district in which median income is low, but poverty rate is also very low. Or, a seemingly wealthy district with an abnormally high poverty rate. As we get better at SQL, there's less excuse to not put hypotheses to the test, because SQL (hopefully) reduces the work time for analyzing data.
8. Poverty rate and percentage of white population
Similar to Query 7, but include a 4th column for the percentage of population that is white. Show the top 10 results, sorted by the white-population-pct column in descending order:
SELECT name, median_household_income,
ROUND(below_poverty_line * 100.0 / total_population, 1) AS povertypct,
ROUND(white * 100.0 / total_population, 1) AS whitepct
FROM congressional_districts
ORDER BY whitepct DESC
LIMIT 10;
name | median_household_income | povertypct | whitepct |
---|---|---|---|
Congressional District 5 (114th Congress), Kentucky | 31339 | 27.9 | 96.0 |
Congressional District 6 (114th Congress), Ohio | 43964 | 15.9 | 94.4 |
Congressional District 2 (114th Congress), Maine | 44560 | 16.0 | 94.0 |
Congressional District 1 (114th Congress), West Virginia | 44180 | 16.3 | 93.5 |
Congressional District (at Large) (114th Congress), Vermont | 56990 | 9.8 | 93.4 |
Congressional District 1 (114th Congress), Maine | 59452 | 10.1 | 93.2 |
Congressional District 3 (114th Congress), West Virginia | 35913 | 21.5 | 93.1 |
Congressional District 9 (114th Congress), Pennsylvania | 47413 | 14.4 | 92.7 |
Congressional District 18 (114th Congress), Pennsylvania | 62254 | 8.4 | 92.6 |
Congressional District 27 (114th Congress), New York | 62566 | 8.9 | 92.3 |
Conclusion
Here's some abrupt goodbye text: later, the detailed lessons will be hosted on this page. The exercises so far published will probably be edited for length, and so forth. If you got through the exercises, even by just copying the available code, that's a great start! If there is one thing I'm embarrassed about in this walkthrough is that the SQL covered so far doesn't let us ask really interesting questions, even as I'm already bored of this Census data.
But doing things in SQL is pretty easy. As you've noticed, you can just copy my code and run it easier than you can replicate how I've edited a spreadsheet. Rather than worrying about getting things done in SQL, focus on understanding and appreciating the syntax, so that when we get to the exciting commands, you'll find them even easier to master than the basic SQL features covered here.
This article is part of a series: