How to easy understand Analytics Functions on Google BigQuery

Get all the major concepts and practices in a simple and clear way.

In this post you’ll easily understand analytics functions on BigQuery, concepts like “window”, “over” and “partition by” won’t be a pain for you anymore.

There is a point when working with databases where you need some advanced analysis on your data. You can achieve complex analysis in many ways using SQL operators, and Analytics Functions came to extend the power of SQL making it possible to write in a few lines queries that otherwise would require a lot of self-joins and aggregations. This article is for you that have a basic to intermediate experience with SQL and want to understand straight to the point analytics functions.

Scenarios

For our journey, let’s consider the following scenarios related to a list of students in a school categorized by classes and their sales on a local event.

The student’s list

Scenarios:

  • The average age of all students.
  • The average Sales by class.
  • Compare each student’s sales to the average sales within its Class.
  • The average student’s sales sum with its predecessor and successor by Class.

Aggregate Functions

Before talking about Analytics Functions, consider for the first scenario we are calculating the average age of all students, it means, we are aggregating/combining a group of rows in a single result.

SELECT AVG(Age)as Average_age FROM Students=> 20.1

For this scenario the group of rows combined contains all rows.

The result is 20.1

The second scenario is very similar to the first one, except that the group of rows combined won’t be all rows, but same class rows. Then we’ll use the GROUP BY clause to aggregate the Students by class before calculating the Sales average.

SELECT Class, AVG(Sales) as Average_Sales_by_Class FROM Students GROUP BY Class

The result is:

A: 80.0 (45+101+94)/3
B: 108.75 (100+124+24+187)/4
C: 64(98+17+77)/3

For this scenario each group of rows combined (3 aggregations / 3 Classes) results each an average output.

Concluding, in a universe of n individuals (rows), each combination of grouping will result in a single output value. If you split the individuals in 79 groups, you’ll get 79 outputs, or if you just put all indivuduals in one group you’ll get 1 output as shown in our first scenario. When aggregating data you can apply several functions like: MAX, AVG, COUNT, SUM.

You can read more about aggregate functions in: https://cloud.google.com/bigquery/docs/reference/standard-sql/aggregate_functions

Analytics Functions

Now that we understand what is and how works an aggregate function, let’s think about the third scenario: We need in the same table each individual student’s Sales and the average sales within its Class. This scenario requires both non-aggregated and aggregated data. “An analytics function returns a single value for each row by computing a function over a group of input rows.”

Class_AVG is the result of an analytic function

Using them you can achieve the same result as the aggregation in the second scenario, but apply it for each individual row instead of aggregating the result in three output rows, you can see the result in the left table. Having said all of that, let’s see how to write an analytics function in BigQuery.

That’s the analytics function syntax in Google Big Query, and we’ll modify it to get our desired query and result.

FUNCTION(args) OVER (
[ PARTITION BY partition_column ]
[ ORDER BY expression [{ ASC | DESC }] [, ...] ]
[ window_frame_clause ]
)

The FUNCTION could be an average, sum, rank, or other supported aggregate function (find more here). We’ll use the AVG function and PARTITION BY for the third scenario, we don’t need ORDER BY or WINDOW FRAME. We just need to partition the rows by Class before calculating the average and output it for every single row.

SELECT
Name,
Age,
Class,
Sales,
AVG(Sales) OVER (PARTITION BY Class) as Class_AVG
FROM
Students
ORDER BY
Class, Age ASC

The query above is enough to get the results we saw on the previous table. The PARTITION BY Class is basically saying: If the Student’s Class is “A” calculates the average of all “A’s Sales” and output it.

Windowing

What if we need to calculate a sum for each row, and that sum depends on the successor row value ?

Analytics functions besides calculating aggregated data as shown in the previous scenario, can also do so through a window. A window frame that moves towards the rows and process an operation, as you can see in the following generic example that sums the current value with its successor.

A sum analytics function with window size 2. The results in red.

Note that as the window moves, it sums the current and the successor number. Also be aware that the element’s order interferes in the result. That’s the basic idea of windowing, and you can configure multiple sizes of window.

Now, we can resolve the fourth scenario: “The average student’s sales sum with its predecessor and successor by Class”. You already know what is an aggregate function, analytic function, and windowing. Let’s put it all togheter and make an advanced query.

Remember the Analytic Fuction syntax:

SUM/AVG OVER (
[ PARTITION BY partition_column ]
[ ORDER BY expression [{ ASC | DESC }] [, ...] ]
[ window_frame_clause ]
)

The first piece PARTITION BY is basically saying “How the data will be grouped before calculate the function?” In our case we’ll use the Class column, and we can visualize the data as three groups (You’re familiar with it at this point)

Let’s replace the “PARTITION BY” clause

AVG(Sales) OVER (
PARTITION BY Class
[ ORDER BY expression [{ ASC | DESC }] [, ...] ]
[ window_frame_clause ]
)
The data PARTITIONED (grouped) BY Class

Attention: remember that the order affects the results when dealing with windowing. In our hypothetical scenario we need to order the data inside each group based on Sales. That’s the ORDER BY clause on SQL.

AVG(Sales) OVER (
PARTITION BY Class
ORDER BY Sales ASC
[ window_frame_clause ]
)
The data PARTITIONED (grouped) BY Class and GROUPED BY Sales

Look, now we have almost all the pieces filled, we just need to define the window_frame_clause. Since the third scenario says “The average student’s sales sum with his predecessor and successor” we need to create a window that will contain one predecessor and one successor row.

SELECT
Name,
Age,
Class,
Sales,
AVG(Sales) OVER (
PARTITION BY Class
ORDER BY Sales ASC
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING
)AS Sales_avg
FROM
Students

Awsome! Now the analytics function is complete. Attention on the windowing that is evidenced in the right.

An average analytic function partitioned by Class with windowing

The Janice’s Sales_avg is the average of her predecessor Sales: Amanda; 100, her own Sales ;124, and her successor; Sabrina, 187. Resulting in 137. That window is in dark red in the right of Partition B Data.

But what if we need a larger window frame that considers two elements preceding and two elements following instead of one? you can change the windowing configuration as:

.. ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING ..
window frame with size 2 following and 2 preceding

Concluding

Analytics functions are very usefull in many cases and can be customized with several options that aren’t described in this article, besides aggregate functions we have also navigation functions and numbering functions for BigQuery (find more here).

I hope that this article helped you understand this complex feature of BigQuery, I always try to draw and make graphics in my articles to simplify the understanding.

You can connect me via Gmail: rodolfo.marcos07@gmail.com or Linkedin: https://www.linkedin.com/in/rodolfo-marcos-41ab3198/

Thank you so much! Pardon my mistakes, I hope it will be usefull for you.

Photo by Nathan Dumlao on Unsplash

Full-Stack developer and Data-Engineer. Experienced in e-commerce / analytics for big brands. https://www.linkedin.com/in/rodolfo-marcos-41ab3198/