GROUP BY

One row is one thing, e.g. a row in an employees table would represent one employee. So far, we haven't deviated from this at all. However, what if we want insights on groups of rows?

Imagine you are a data scientist for a website similar to levels.fyi, called devils.fyi. Your website has a bunch of data on employees of various companies, including their self-reported salaries. This is in the employees table.


employees
column_name type
id
INTEGER
name
VARCHAR
company
VARCHAR
department
VARCHAR
job_title
VARCHAR
salary
INTEGER
start_date
DATE

You are doing an analysis to see which companies pay the best. First you want to know the highest paid salary at each company. How would you do this? We know we can get the highest of something using ORDER BY and LIMIT, but that won't work per company. We would have to use a WHERE clause to look at 1 specific company

SELECT *
FROM employees
WHERE company = 'Poogle'
ORDER BY salary DESC
LIMIT 1;
id
INTEGER
name
VARCHAR
company
VARCHAR
department
VARCHAR
job_title
VARCHAR
salary
INTEGER
start_date
DATE
10
Marianne Schamberger
Poogle
Marketing
Future Infrastructure Assistant
300000
2017-03-27

We can repeat this for every single company, but that isn't the best way. We want 1 query that can compute this for each company.

The problem is that all the relational algebra operations we learned so far deal with rows. Since this is the employees table, the rows represent employees. It takes many rows, i.e. a group of rows, to represent all the data we have for a particular company. Since we're dealing with groups of rows, we need a new type of operation that works on groups: aggregation.

Aggregation is the process of grouping rows together, then calculating individual values for those groups. In our case, we want to group the rows within each company, and then get the maximum salary from each group. In SQL, this is expressed in 2 parts: a GROUP BY clause and an aggregate function. GROUP BY does the grouping part, and the aggregate function gets the maximum value.

SELECT max(salary) AS top_salary
FROM employees
GROUP BY company;
top_salary
INTEGER
254000
277000
300000

This doesn't show us which top salary belongs to which company, so let's add company to the SELECT clause.

SELECT
  company,
  max(salary) AS top_salary
FROM employees
GROUP BY company;
company
VARCHAR
top_salary
INTEGER
Footbook
254000
Wamazon
277000
Poogle
300000

Great! We have our insight. However, since we're a little nosy, let's also see the name of each person with that salary.

SELECT
  company,
  name,
  max(salary) AS top_salary
FROM employees
GROUP BY company;
error: column "employees.name" must appear in the GROUP BY clause or be used in an aggregate function

Wait.. we can't do that? We can select company, but not name. What's going on? The differece is that company appeared in the GROUP BY, while name did not. To understand this distinction, it helps to visualize the aggregation process.

company name salary Poogle Lindsay 293000 Poogle François 131000 Poogle Jenny 162000 Wamazon Remmy 149000 Wamazon Milton 42000 company GROUP BY company name salary Poogle (Lindsay, François, Jenny) (293000, 131000, 162000) Wamazon (Remmy, Milton) (149000, 42000)

Think of the GROUP BY as transforming the table in a special way. It becomes an intermediate table with 2 types of columns: the grouped columns (green) and un-grouped columns (red). In this case, company is grouped, as it appears in GROUP BY company, and salary and name are not. The values in the company column which are equal to one another aggregate to form the groups, becoming the new green column containing all the unique values. It is basically a normal column now that you can select. The rest of the values in the non-grouped columns glob together in a sort of collection within each group/row. Think of these collections as complex values that can't be selected directly, you need to aggregate the collection, making it a single value per row, in order to select it.

company salary , max() SELECT company name salary Poogle (Lindsay, François, Jenny) (293000, 131000, 162000) Wamazon (Remmy, Milton) (149000, 42000) company max(salary) Poogle 293000 Wamazon 149000

That's where max(salary) comes in. max() is an aggregate function. Unlink normal functions, which take in individual values and return individual values, aggregate function take in collections/groups of values and return individual values. Each aggregate function has its own way of converting the group to a single value. max() does this by picking the 1 largest number from the group. Selecting name directly doesn't work because name is still a collection of names. We have to apply an aggregate function on it.

What if we do max(name) instead?

SELECT
  company,
  max(name) AS max_name,
  max(salary) AS top_salary
FROM employees
GROUP BY company;
company
VARCHAR
max_name
TEXT
top_salary
INTEGER
Footbook
Wilson Maggio
254000
Wamazon
Sheri Labadie
277000
Poogle
Tami Glover
300000

It's not an error anyome, but the max_name isn't the name of the person who has the top salary per company. Taking max(name) just takes the maximum value within the names of the group (the company), and since the values are text, "maximum" means alphabetically last.

If we want to get the person's name with the top salary, we'll have to use a subquery or a window function. We'll re-visit that later.

Multiple Aggregations

Notice that within 1 query we can perform multiple aggregations on different columns. Furthermore, it is also possible to do different aggregations on the same column.

SELECT
  company,
  max(salary) AS top_salary,
  avg(salary) AS average_salary,
  count(salary) AS number_of_salaries_reported
FROM employees
GROUP BY company;
company
VARCHAR
top_salary
INTEGER
average_salary
NUMERIC
number_of_salaries_reported
BIGINT
Footbook
254000
154750.000000000000
12
Wamazon
277000
145923.076923076923
13
Poogle
300000
186461.538461538462
13

These should be pretty self-explanatory. You calculate an average (arithmetic mean) using avg(). You can count the number of non-null values in the collection using count().

All aggregate functions

You can see more aggregate functions available in Postgres here.

Selecting a Collection

I told you to think of collections as not being directly selectable without aggregating. "But what if my app really needs to select all the elements of a collection in a group?" you ask. There are a couple ways to do that, but it still requires an aggregate function, like string_agg().

SELECT
  company,
  string_agg(name, ',') AS all_names
FROM employees
GROUP BY company;
company
VARCHAR
all_names
TEXT
Footbook
Mathew Harber,Toni Fahey,Sean Wintheiser II,Mrs. Carmen Blick,Frankie Glover,Wilson Maggio,Santiago Lockman,Rex Treutel,Laura Raynor,Whitney Goyette,Gina DuBuque,Ronald Graham
Wamazon
Alberto Dicki,Erik Vandervort,Billie Vandervort,Eula Price,Michele Larson,Candice Konopelski,Essie Legros,Sheri Labadie,Jessie Hodkiewicz,Rosemary Hilll,Ollie Murphy,Ramon Rohan,Dana Ernser
Poogle
Claude Barton,Christy McLaughlin,Tami Glover,Ana Koepp,Mae Leannon,Norma Krajcik,Cameron Jacobson,Ruby Connelly,Antonio Treutel,Marianne Schamberger,Shirley Prohaska,Jan Wiza,Alberto Monahan

string_agg() turns a collection of strings into 1 string, putting a delimiter between the values. We use a comma ',' in this example.

Another way is with array_agg(), but we'll look at that later when we study denormalization.

Aggregate Functions and NULL

Aggregate functions ignore NULL. For example, if one of the companies had 20 rows, and 10 employees had NULL in the salary column, then count(salary) will return 10.

© 2022 Andrew Carlson. All rights reserved.