GROUP BY

One row is one thing, e.g. a row in an employees table would represent one employee. So far, we haven't deviated from this at all. However, what if we want insights on groups of rows?

Imagine you are a data scientist for a website similar to levels.fyi, called devils.fyi. Your website has a bunch of data on employees of various companies, including their self-reported salaries. This is in the employees table.

Table Name:

employees

`column_name`	`type`
id	INTEGER
name	VARCHAR
company	VARCHAR
department	VARCHAR
job_title	VARCHAR
salary	INTEGER
start_date	DATE

You are doing an analysis to see which companies pay the best. First you want to know the highest paid salary at each company. How would you do this? We know we can get the highest of something using ORDER BY and LIMIT, but that won't work per company. We would have to use a WHERE clause to look at 1 specific company

SELECT *
FROM employees
WHERE company = 'Poogle'
ORDER BY salary DESC
LIMIT 1;

`id` `INTEGER`	`name` `VARCHAR`	`company` `VARCHAR`	`department` `VARCHAR`	`job_title` `VARCHAR`	`salary` `INTEGER`	`start_date` `DATE`
10	Marianne Schamberger	Poogle	Marketing	Future Infrastructure Assistant	300000	2017-03-27

We can repeat this for every single company, but that isn't the best way. We want 1 query that can compute this for each company.

The problem is that all the relational algebra operations we learned so far deal with rows. Since this is the employees table, the rows represent employees. It takes many rows, i.e. a group of rows, to represent all the data we have for a particular company. Since we're dealing with groups of rows, we need a new type of operation that works on groups: aggregation.

Aggregation is the process of grouping rows together, then calculating individual values for those groups. In our case, we want to group the rows within each company, and then get the maximum salary from each group. In SQL, this is expressed in 2 parts: a GROUP BY clause and an aggregate function. GROUP BY does the grouping part, and the aggregate function gets the maximum value.

SELECT max(salary) AS top_salary
FROM employees
GROUP BY company;

`top_salary` `INTEGER`
254000
277000
300000

This doesn't show us which top salary belongs to which company, so let's add company to the SELECT clause.

SELECT
  company,
  max(salary) AS top_salary
FROM employees
GROUP BY company;

`company` `VARCHAR`	`top_salary` `INTEGER`
Footbook	254000
Wamazon	277000
Poogle	300000

Great! We have our insight. However, since we're a little nosy, let's also see the name of each person with that salary.

SELECT
  company,
  name,
  max(salary) AS top_salary
FROM employees
GROUP BY company;

error: column "employees.name" must appear in the GROUP BY clause or be used in an aggregate function

Wait.. we can't do that? We can select company, but not name. What's going on? The differece is that company appeared in the GROUP BY, while name did not. To understand this distinction, it helps to visualize the aggregation process.

Think of the GROUP BY as transforming the table in a special way. It becomes an intermediate table with 2 types of columns: the grouped columns (green) and un-grouped columns (red). In this case, company is grouped, as it appears in GROUP BY company, and salary and name are not. The values in the company column which are equal to one another aggregate to form the groups, becoming the new green column containing all the unique values. It is basically a normal column now that you can select. The rest of the values in the non-grouped columns glob together in a sort of collection within each group/row. Think of these collections as complex values that can't be selected directly, you need to aggregate the collection, making it a single value per row, in order to select it.

That's where max(salary) comes in. max() is an aggregate function. Unlink normal functions, which take in individual values and return individual values, aggregate function take in collections/groups of values and return individual values. Each aggregate function has its own way of converting the group to a single value. max() does this by picking the 1 largest number from the group. Selecting name directly doesn't work because name is still a collection of names. We have to apply an aggregate function on it.

What if we do max(name) instead?

SELECT
  company,
  max(name) AS max_name,
  max(salary) AS top_salary
FROM employees
GROUP BY company;

`company` `VARCHAR`	`max_name` `TEXT`	`top_salary` `INTEGER`
Footbook	Wilson Maggio	254000
Wamazon	Sheri Labadie	277000
Poogle	Tami Glover	300000

It's not an error anyome, but the max_name isn't the name of the person who has the top salary per company. Taking max(name) just takes the maximum value within the names of the group (the company), and since the values are text, "maximum" means alphabetically last.

If we want to get the person's name with the top salary, we'll have to use a subquery or a window function. We'll re-visit that later.

Multiple Aggregations

Notice that within 1 query we can perform multiple aggregations on different columns. Furthermore, it is also possible to do different aggregations on the same column.

SELECT
  company,
  max(salary) AS top_salary,
  avg(salary) AS average_salary,
  count(salary) AS number_of_salaries_reported
FROM employees
GROUP BY company;

`company` `VARCHAR`	`top_salary` `INTEGER`	`average_salary` `NUMERIC`	`number_of_salaries_reported` `BIGINT`
Footbook	254000	154750.000000000000	12
Wamazon	277000	145923.076923076923	13
Poogle	300000	186461.538461538462	13

These should be pretty self-explanatory. You calculate an average (arithmetic mean) using avg(). You can count the number of non-null values in the collection using count().

All aggregate functions

You can see more aggregate functions available in Postgres here.

Selecting a Collection

I told you to think of collections as not being directly selectable without aggregating. "But what if my app really needs to select all the elements of a collection in a group?" you ask. There are a couple ways to do that, but it still requires an aggregate function, like string_agg().

SELECT
  company,
  string_agg(name, ',') AS all_names
FROM employees
GROUP BY company;

`company` `VARCHAR`	`all_names` `TEXT`
Footbook	Mathew Harber,Toni Fahey,Sean Wintheiser II,Mrs. Carmen Blick,Frankie Glover,Wilson Maggio,Santiago Lockman,Rex Treutel,Laura Raynor,Whitney Goyette,Gina DuBuque,Ronald Graham
Wamazon	Alberto Dicki,Erik Vandervort,Billie Vandervort,Eula Price,Michele Larson,Candice Konopelski,Essie Legros,Sheri Labadie,Jessie Hodkiewicz,Rosemary Hilll,Ollie Murphy,Ramon Rohan,Dana Ernser
Poogle	Claude Barton,Christy McLaughlin,Tami Glover,Ana Koepp,Mae Leannon,Norma Krajcik,Cameron Jacobson,Ruby Connelly,Antonio Treutel,Marianne Schamberger,Shirley Prohaska,Jan Wiza,Alberto Monahan

string_agg() turns a collection of strings into 1 string, putting a delimiter between the values. We use a comma ',' in this example.

Another way is with array_agg(), but we'll look at that later when we study denormalization.

Aggregate Functions and NULL

Aggregate functions ignore NULL. For example, if one of the companies had 20 rows, and 10 employees had NULL in the salary column, then count(salary) will return 10.