SQLite Simple Folks: GROUP BY and Aggregate Functions
How to look for the averages, highs, lows, and outliers within groups of data. And, how to even just count what’s in a large dataset.
This article is part of a series: SQLite Simple Folks: Overview
Get the data
- Download the simplefolks.sqlite data file.
- Preview it as a Google Spreadsheet
Once you've downloaded simplefolks.sqlite data file onto your computer, open it with your SQLite client and be ready to write some queries.
Aggregate functions
I may break this up into a separate lesson, but before we get into the GROUP BY
s, let's take a look at SQLite's aggregate function, which are also documented on SQLite's homepage.
The SQLite aggregate functions are fairly similar to what you'd find on pivot tables: you can find averages, sums, counts, max and min, and…that's about it.
Like other core functions, aggregate functions can be called in the SELECT
query. We'll start there.
Counting rows
To get a count of how many rows are in a table:
SELECT COUNT(*) FROM people;
The result is a value, inside a single row (ignoring the header) and single column.
COUNT(*) |
---|
15 |
What does that *
mean inside the parentheses? It's just a convention to follow. Here's how SQLite's documentation describes it, though it doesn't make things much clearer:
The count(*) function (with no arguments) returns the total number of rows in the group.
I think of aggregate functions as summarizing – collapsing individual data rows to find a bigger picture. With this simplefolks data, there's only 15 rows so it's no big deal, but think of how necessity and value in reducing the hundreds of thousands of SFPD crime incident reports to get a bigger picture of crime trends.
Averaging with AVG
Unlike COUNT
, which basically just wants to be invoked as COUNT(*)
, we need to be more particular with AVG
.
Does it make sense, for example, to get an average of people.name
?
SELECT AVG(name) FROM people;
Surprisingly, an error wasn't thrown. But the result is non-sensical:
AVG(name) |
---|
0.0 |
The only numerical column in people
is age
. So, let's calculate the average age across all of people
:
SELECT AVG(age) AS averageage
FROM people;
averageage |
---|
41.8666666666667 |
Let's clean that up with ROUND
:
SELECT ROUND(AVG(age), 1) AS averageage
FROM people;
averageage |
---|
41.9 |
Be careful with selection when aggregating
You might have noticed that for each of the previous queries in which I used COUNT
and AVG
to aggregate the data, I selected only a single column, e.g. AVG(age)
.
What happens if select other columns, such as name
? Does more data get returned, rather than a single scalar value?
Let's try it:
SELECT name, ROUND(AVG(age), 1)
FROM people;
This was completely unexpected to me:
name | ROUND(AVG(age), 1) |
---|---|
Zed | 41.9 |
The record for Zed
is the very last record in `people, because the data was loaded in alphabetically. But what is the meaning of this result?
Is Zed's average age 41.9? What does that even mean? Here his record by itself:
name | sex | age |
---|---|---|
Zed | M | 42 |
Is it a coincidence that Zed's listed age is extremely close to 41.9
? Or is that why he was selected in the results?
Let's try the same query but add COUNT
as well, and sex
and age
, unaggregated:
SELECT name, sex, age,
COUNT(*),
ROUND(AVG(age), 1)
FROM people;
Again, bizarre:
name | sex | age | COUNT(*) | ROUND(AVG(age), 1) |
---|---|---|---|---|
Zed | M | 42 | 15 | 41.9 |
Don't chase the mystery
Rather than spend time figuring out why the results are what they are, it's just far easier to dismiss the results as nonsensical. That is, the query doesn't make sense. And despite SQLite just going along with it and returning what seems to be results, the best advice is to ignore it. Instead, focus on not writing queries that SELECT every single column whether it makes sense or not.
Also, as we'll see in the next chunk of syntax to learn, things will become complicated enough that adding non-useful columns to a result set can lead to disaster, or at least, disastrous confusion.
Using GROUP BY
Let's revisit the query we ran to count up the rows in people
:
SELECT COUNT(*) from people;
The result was a top line number, a number that summarizes the entire dataset:
COUNT(*) |
---|
15 |
But there's more detail even in this simple data. For example, each person had gender recorded in the sex
column. How do we get a count by sex
?
Make facets with GROUP BY
Enter the GROUP BY
clause. Like ORDER BY
, we give it a series of columns to use. In this case, GROUP BY
will group by the given columns.
Easier to show than explain. To get a group count of people per sex, we want to GROUP BY sex
:
SELECT sex, COUNT(*)
FROM people
GROUP BY sex;
Note the two things that changed about the query:
- I added a
sex
column - I added the
GROUP BY
clause:
sex | COUNT(*) |
---|---|
F | 8 |
M | 7 |
Select the same columns that you GROUP BY
What would have happened if I left out the SELECT sex
? The query would still work. The results would not have much context:
SELECT COUNT(*)
FROM people
GROUP BY sex;
COUNT(*) |
---|
8 |
7 |
Grouping by things that don't have a group
What happens if we group by something that isn't really groupable by…such as name
? Every person in people
has a unique name. What kind of groupings could possibly be made?
Can't hurt to try out. Let's throw in a few aggregates across age
:
SELECT
name, COUNT(*),
AVG(age), MAX(age), MIN(age)
FROM people
GROUP BY name;
Because there is exactly one name per person, each group is just a group of one, which is a lot of effort for nothing:
name | COUNT(*) | AVG(age) | MAX(age) | MIN(age) |
---|---|---|---|---|
Austin | 1 | 33.0 | 33 | 33 |
Blair | 1 | 90.0 | 90 | 90 |
Carolina | 1 | 28.0 | 28 | 28 |
Dani | 1 | 41.0 | 41 | 41 |
Donald | 1 | 70.0 | 70 | 70 |
Eliza | 1 | 37.0 | 37 | 37 |
Farida | 1 | 23.0 | 23 | 23 |
Georgina | 1 | 19.0 | 19 | 19 |
Hillary | 1 | 68.0 | 68 | 68 |
Leland | 1 | 16.0 | 16 | 16 |
Liam | 1 | 22.0 | 22 | 22 |
Michael | 1 | 48.0 | 48 | 48 |
Phoebe | 1 | 52.0 | 52 | 52 |
Sherry | 1 | 39.0 | 39 | 39 |
Zed | 1 | 42.0 | 42 | 42 |
Yes, we can group by derived columns
But if we have a dataset in which we can extract common features from each record, then it's fun to attempt a GROUP BY
. There's not a lot to GROUP BY
in the raw people
data. So let's derive a value: we'll GROUP the data by last initial of name.
Yep, we can group by aliased columns:
SELECT
SUBSTR(name, -1) AS last_letter,
COUNT(*) AS ncount,
AVG(age), MAX(age), MIN(age)
FROM people
GROUP BY last_letter
ORDER BY ncount DESC;
A lot of pointless insights, but it's a proper GROUP BY
:
last_letter | ncount | AVG(age) | MAX(age) | MIN(age) |
---|---|---|---|---|
a | 4 | 26.75 | 37 | 19 |
d | 3 | 42.6666666666667 | 70 | 16 |
y | 2 | 53.5 | 68 | 39 |
e | 1 | 52.0 | 52 | 52 |
i | 1 | 41.0 | 41 | 41 |
l | 1 | 48.0 | 48 | 48 |
m | 1 | 22.0 | 22 | 22 |
n | 1 | 33.0 | 33 | 33 |
r | 1 | 90.0 | 90 | 90 |
Let's look at the homes
table. Let's do a simple SUM
to get the total value of homes owned:
SELECT SUM(value)
FROM homes;
SUM(value) |
---|
4247000 |
Now, let's break it up to get sum of value
by area
. This requires a GROUP BY
:
SELECT
area,
COUNT(*) AS properties,
SUM(value) AS totalvalue
FROM homes
GROUP BY area;
area | properties | totalvalue |
---|---|---|
country | 6 | 830000 |
suburbs | 5 | 815000 |
urban | 9 | 2602000 |
Unlike people
, the owner_name
column is not necessarily unique; i.e. an owner can own more than one home. Let's GROUP by owner_name
and get a sorted count of totalvalues
and properties
:
SELECT
owner_name,
COUNT(*) AS properties,
SUM(value) AS totalvalue
FROM homes
GROUP BY owner_name
ORDER BY totalvalue DESC;
owner_name | properties | totalvalue |
---|---|---|
Donald | 3 | 1370000 |
Hillary | 2 | 780000 |
Carolina | 2 | 410000 |
Michael | 2 | 242000 |
Eliza | 1 | 210000 |
Sherry | 1 | 210000 |
Farida | 1 | 180000 |
Zed | 1 | 177000 |
Liam | 1 | 160000 |
Austin | 1 | 145000 |
Blair | 1 | 95000 |
Georgina | 1 | 82000 |
Phoebe | 1 | 77000 |
Dani | 1 | 67000 |
Leland | 1 | 42000 |
Know what you SELECT
OK, now that we've covered GROUP BY
, it's easier to take this stance:
When aggregating, never include a column for selection unless:
- You are aggregating by that column, e.g.
AVG(age)
- The column is being used in a
GROUP BY
clause
Here's a good example:
SELECT sex, MIN(age), MAX(age)
FROM people
GROUP BY sex;
And this is bad:
SELECT name,
SUBSTR(name, 1, 1) AS first_letter,
COUNT(*) AS n
FROM people
GROUP BY first_letter
ORDER BY n DESC;
It won't throw an error, but the name
column is useless to the point of harmful confusion:
name | first_letter | n |
---|---|---|
Donald | D | 2 |
Liam | L | 2 |
Austin | A | 1 |
Blair | B | 1 |
Carolina | C | 1 |
Eliza | E | 1 |
Farida | F | 1 |
Georgina | G | 1 |
Hillary | H | 1 |
Michael | M | 1 |
Phoebe | P | 1 |
Sherry | S | 1 |
Zed | Z | 1 |
Future tip: in PostgreSQL, which we will be using very soon, the database engine will throw an error when non-aggregated columns are included.
This article is part of a series: SQLite Simple Folks: Overview