GROUP BY
One row is one thing, e.g. a row in an employees
table would represent one employee.
So far, we haven't deviated from this at all.
However, what if we want insights on groups of rows?
Imagine you are a data scientist for a website similar to levels.fyi, called devils.fyi.
Your website has a bunch of data on employees of various companies, including their self-reported salaries.
This is in the employees
table.
column_name | type |
---|---|
id | INTEGER |
name | VARCHAR |
company | VARCHAR |
department | VARCHAR |
job_title | VARCHAR |
salary | INTEGER |
start_date | DATE |
You are doing an analysis to see which companies pay the best.
First you want to know the highest paid salary at each company.
How would you do this?
We know we can get the highest of something using ORDER BY
and LIMIT
, but that won't work per company.
We would have to use a WHERE
clause to look at 1 specific company
SELECT *
FROM employees
WHERE company = 'Poogle'
ORDER BY salary DESC
LIMIT 1;
id INTEGER | name VARCHAR | company VARCHAR | department VARCHAR | job_title VARCHAR | salary INTEGER | start_date DATE |
---|---|---|---|---|---|---|
10 | Marianne Schamberger | Poogle | Marketing | Future Infrastructure Assistant | 300000 | 2017-03-27 |
We can repeat this for every single company, but that isn't the best way. We want 1 query that can compute this for each company.
The problem is that all the relational algebra operations we learned so far deal with rows.
Since this is the employees
table, the rows represent employees.
It takes many rows, i.e. a group of rows, to represent all the data we have for a particular company.
Since we're dealing with groups of rows, we need a new type of operation that works on groups: aggregation.
Aggregation is the process of grouping rows together, then calculating individual values for those groups.
In our case, we want to group the rows within each company, and then get the maximum salary from each group.
In SQL, this is expressed in 2 parts: a GROUP BY
clause and an aggregate function.
GROUP BY
does the grouping part, and the aggregate function gets the maximum value.
SELECT max(salary) AS top_salary
FROM employees
GROUP BY company;
top_salary INTEGER |
---|
254000 |
277000 |
300000 |
This doesn't show us which top salary belongs to which company, so let's add company
to the SELECT
clause.
SELECT
company,
max(salary) AS top_salary
FROM employees
GROUP BY company;
company VARCHAR | top_salary INTEGER |
---|---|
Footbook | 254000 |
Wamazon | 277000 |
Poogle | 300000 |
Great! We have our insight. However, since we're a little nosy, let's also see the name of each person with that salary.
SELECT
company,
name,
max(salary) AS top_salary
FROM employees
GROUP BY company;
error: column "employees.name" must appear in the GROUP BY clause or be used in an aggregate function
Wait.. we can't do that?
We can select company
, but not name
.
What's going on?
The differece is that company
appeared in the GROUP BY
, while name
did not.
To understand this distinction, it helps to visualize the aggregation process.
Think of the GROUP BY
as transforming the table in a special way.
It becomes an intermediate table with 2 types of columns: the grouped columns (green) and un-grouped columns (red).
In this case, company
is grouped, as it appears in GROUP BY company
, and salary
and name
are not.
The values in the company
column which are equal to one another aggregate to form the groups, becoming the new green column containing all the unique values.
It is basically a normal column now that you can select.
The rest of the values in the non-grouped columns glob together in a sort of collection within each group/row.
Think of these collections as complex values that can't be selected directly,
you need to aggregate the collection, making it a single value per row, in order to select it.
That's where max(salary)
comes in.
max()
is an aggregate function.
Unlink normal functions, which take in individual values and return individual values,
aggregate function take in collections/groups of values and return individual values.
Each aggregate function has its own way of converting the group to a single value.
max()
does this by picking the 1 largest number from the group.
Selecting name
directly doesn't work because name
is still a collection of names.
We have to apply an aggregate function on it.
What if we do max(name)
instead?
SELECT
company,
max(name) AS max_name,
max(salary) AS top_salary
FROM employees
GROUP BY company;
company VARCHAR | max_name TEXT | top_salary INTEGER |
---|---|---|
Footbook | Wilson Maggio | 254000 |
Wamazon | Sheri Labadie | 277000 |
Poogle | Tami Glover | 300000 |
It's not an error anyome, but the max_name
isn't the name of the person who has the top salary per company.
Taking max(name)
just takes the maximum value within the names of the group (the company),
and since the values are text, "maximum" means alphabetically last.
If we want to get the person's name with the top salary, we'll have to use a subquery or a window function. We'll re-visit that later.
Multiple Aggregations
Notice that within 1 query we can perform multiple aggregations on different columns. Furthermore, it is also possible to do different aggregations on the same column.
SELECT
company,
max(salary) AS top_salary,
avg(salary) AS average_salary,
count(salary) AS number_of_salaries_reported
FROM employees
GROUP BY company;
company VARCHAR | top_salary INTEGER | average_salary NUMERIC | number_of_salaries_reported BIGINT |
---|---|---|---|
Footbook | 254000 | 154750.000000000000 | 12 |
Wamazon | 277000 | 145923.076923076923 | 13 |
Poogle | 300000 | 186461.538461538462 | 13 |
These should be pretty self-explanatory.
You calculate an average (arithmetic mean) using avg()
.
You can count the number of non-null values in the collection using count()
.
All aggregate functions
You can see more aggregate functions available in Postgres here.
Selecting a Collection
I told you to think of collections as not being directly selectable without aggregating.
"But what if my app really needs to select all the elements of a collection in a group?" you ask.
There are a couple ways to do that, but it still requires an aggregate function, like string_agg()
.
SELECT
company,
string_agg(name, ',') AS all_names
FROM employees
GROUP BY company;
company VARCHAR | all_names TEXT |
---|---|
Footbook | Mathew Harber,Toni Fahey,Sean Wintheiser II,Mrs. Carmen Blick,Frankie Glover,Wilson Maggio,Santiago Lockman,Rex Treutel,Laura Raynor,Whitney Goyette,Gina DuBuque,Ronald Graham |
Wamazon | Alberto Dicki,Erik Vandervort,Billie Vandervort,Eula Price,Michele Larson,Candice Konopelski,Essie Legros,Sheri Labadie,Jessie Hodkiewicz,Rosemary Hilll,Ollie Murphy,Ramon Rohan,Dana Ernser |
Poogle | Claude Barton,Christy McLaughlin,Tami Glover,Ana Koepp,Mae Leannon,Norma Krajcik,Cameron Jacobson,Ruby Connelly,Antonio Treutel,Marianne Schamberger,Shirley Prohaska,Jan Wiza,Alberto Monahan |
string_agg()
turns a collection of strings into 1 string, putting a delimiter between the values.
We use a comma ','
in this example.
Another way is with array_agg()
, but we'll look at that later when we study denormalization.
Aggregate Functions and NULL
Aggregate functions ignore NULL
.
For example, if one of the companies had 20 rows, and 10 employees had NULL
in the salary column,
then count(salary)
will return 10
.