This workbook contains all the SQL concepts required to effectively query GA4 data in Google BigQuery.
Analytic functions
Introduced in topic Identify The Exit Page Of A Single Session
Analytic functions take a function expression (e.g. FIRST_VALUE), and they run it over a partition of the data, returning a single result for each row being queried in the source table.
The partition is the window of rows around the current row that is being queried.
If you look at the image above, you can see how this works.
FIRST_VALUE specifies that we want the function to return the first event_timestamp value in the partition.
The OVER keyword specifies that what follows is the window we’ll query.
When the query engine runs the PARTITION BY user_pseudo_id, it creates a “temporary table”, where it orders all hits from the current user_pseudo_id (that’s what the data is partitioned by) by event_timestamp in ascending order.
Finally, FIRST_VALUE picks the first value from this partition, which would be the event with the lowest event_timestamp value for the current user_pseudo_id.
Here are what the partitions look like:
-- How the data is aligned with "PARTITION BY user_pseudo_id ORDER BY event_timestamp"
+----------------+------------+-----------------+
| user_pseudo_id | event_name | event_timestamp |
+----------------+------------+-----------------+
| user123        | page_view  | 12345           |
| user123        | page_view  | 12346           |
| user123        | page_view  | 12348           |
| user234        | page_view  | 12349           |
| user234        | page_view  | 12350           |
+----------------+------------+-----------------+
As you can see, it’s different from the source table. Hits are “grouped” by user, and ordered by event timestamp in an ascending order. Then, when FIRST_VALUE is evaluated, it fetches the first value of the partition where the user_pseudo_id matches the current row.
Thus, all rows that have user_pseudo_id of user123 will have the analytic function return 12345 (the event_timestamp of the first row in the partition), and all rows that have user_pseudo_id of user234 will have the analytic function return 12349.
It’s a difficult concept. But try to look past the syntax into what you’re trying to achieve here. You want to explore the data beyond the current row. In this example, you want to look for all hits from the user_pseudo_id of any given row, and fetch the first event_timestamp for this user.
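Putting the pieces together, a minimal query might look like this (the GA4 column names are real, but `<table>` is a placeholder for your events table):

```sql
-- For each event row, fetch the first event_timestamp of that user
select
  user_pseudo_id,
  event_name,
  event_timestamp,
  first_value(event_timestamp) over (
    partition by user_pseudo_id
    order by event_timestamp asc
  ) as first_timestamp
from <table>
```

Every row for a given user_pseudo_id gets the same first_timestamp value: the smallest event_timestamp in that user's partition.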
ARRAY_AGG
Introduced in topic Fuzzy Matching With UDFs
ARRAY_AGG is another aggregation function. It lets you create a nested structure from a range of given inputs.
The type of this nested structure is an array.
ARRAY_AGG is particularly useful when you want to create a list of all related values for some column for further analysis. You can use ARRAY_AGG to replace a window function, for example, by first creating the array and then accessing its first item (replaces FIRST_VALUE).
-- Gets the first source for all user_pseudo_id
select
  user_pseudo_id,
  array_agg(source order by date)[offset(0)] as first_source
from source_table
group by user_pseudo_id
AS
Introduced in topic Count Sessions Of A Single Day
Use AS to “rename” a column in the SELECT list. This is useful if there are duplicate names (e.g. when SELECTing a handful of columns from different tables) or if you just want to name your columns in a more meaningful way.
-- Rename the event_timestamp_micros column as event_timestamp
select event_timestamp_micros as event_timestamp
BETWEEN … AND
Introduced in topic Count Sessions Across A Static Date Range
Use BETWEEN and AND to specify a range for the values. The values can be strings, numbers, or dates.
When used with the _TABLE_SUFFIX column, it’s a powerful way of querying data present in only certain (contiguous) date partitions.
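For example, a sketch of a date-range query against the daily GA4 export tables (the project and dataset names are placeholders):

```sql
-- Query all daily tables from 1st to 7th June, 2021
select count(*) as events
from `project.analytics_123456789.events_*`
where _table_suffix between '20210601' and '20210607'
```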
CASE WHEN
Introduced in topic Count Sessions By Part Of Week
The CASE WHEN statement is similar to IF…THEN in many programming languages.
You can add multiple WHEN statements for different mappings. Use the THEN keyword to indicate what happens if the WHEN condition evaluates to true.
You can provide a fallback with ELSE, and you need to close the CASE WHEN with the END keyword.
-- Returns "h1" as the current date is within the first half of the year select case when extract(quarter from current_date()) in (1, 2) then 'h1' else 'h2' end
Code comments
Introduced in topic Count Sessions Of A Single Day
You can add comments in the SQL editor (these will be ignored in the query) by prefixing the row with two dashes:
-- this is a comment
CONCAT
Introduced in topic Group Users By Acquisition Campaign
Use CONCAT to join two values together. CONCAT joins any values that can be cast into string.
Note! CONCAT was originally used in the topic video, but in a later draft we decided to drop it from this topic to avoid adding extra complication. The example below is what was in the original query, and you can use it if you want to familiarize yourself with CONCAT.
-- The CONCAT joins source and medium together, resulting in e.g. "google / cpc"
select
  concat(traffic_source.source, " / ", traffic_source.medium) as source_medium,
  count(distinct user_pseudo_id) as users
from `simoahava-com.analytics_206575074.events_20210601`
group by source_medium
order by users desc
COUNT
Introduced in topic Count Sessions Of A Single Day
COUNT is an aggregation function, which means that it can be used to count the occurrences of something across rows. In the video, Johan shows you how to count all the occurrences of a specific event name. This works because the count is the only result returned by the query.
-- Count all rows where event name is "session_start"
select count(*) as sessions
from <table>
where event_name = 'session_start'
COUNTIF
Introduced in topic Combine User Properties With Other User Data
COUNTIF is an aggregate function that takes a condition expression as its argument. If the condition evaluates to true, then the counter is incremented.
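A minimal sketch, using the standard GA4 export columns with a placeholder table name:

```sql
-- Count page_view events per user in a single pass
select
  user_pseudo_id,
  countif(event_name = 'page_view') as page_views
from <table>
group by user_pseudo_id
```

This is handy when you want several conditional counts side by side without writing a separate filtered query for each.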
CROSS JOIN
Introduced in topic Unnest All User Properties
A CROSS JOIN combines every row in the source table with every row in the joined table.
In the case of CROSS JOIN with an unnested structure, related values are kept together. So the CROSS JOIN only unites rows of the unnested table with the rows where the unnested structure was originally embedded in.
Note that you can replace the words “CROSS JOIN” with a comma.
-- These two queries do exactly the same thing, first:
select props.key
from table
cross join unnest(user_properties) as props

-- And second:
select props.key
from table, unnest(user_properties) as props
In the example above, iphone12 and nokia3310 are joined together only with user123, because that was the only user out of the two who had those devices. Similarly, iphone11 is only joined with user234, because that’s the row where the iphone11 was originally embedded in.
CURRENT_DATE
Introduced in topic Count Sessions Across A Dynamic Date Range
Returns the current date in the format YYYY-MM-DD (e.g. 2021-05-21 for 21st May, 2021).
Use FORMAT_DATE to format it into another type of date string.
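For example, to turn the current date into the same YYYYMMDD shape the GA4 export tables use:

```sql
-- Returns e.g. "20210521" if run on 21st May, 2021
select format_date('%Y%m%d', current_date()) as today
```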
DATE_DIFF
Introduced in topic Calculate User Retention 2/2
DATE_DIFF takes three arguments: two date objects and a date part specifier. The function returns the difference (as an integer) between the first two dates (date_1 – date_2) in intervals of the given date part.
-- Returns 1, as that's the difference in YEAR. date_diff returns the number of
-- date part *boundaries* between the two dates. There's a year boundary between
-- 1st Jan, 2021 and 31st Dec, 2020.
select date_diff(date '2021-01-01', date '2020-12-31', year)
DATE_SUB
Introduced in topic Count Sessions Across A Dynamic Date Range
This function can be used to subtract days from any given date. It’s used together with the INTERVAL keyword.
The function takes two arguments: the date itself and the INTERVAL expression.
The INTERVAL expression takes a number and then a “date part”, which can be one of DAY, WEEK, MONTH, QUARTER or YEAR.
select date_sub(current_date(), interval 4 week) as four_weeks_ago
DISTINCT
Introduced in topic Count Unique Users
When you run COUNT(DISTINCT column), only unique instances of each value in column are counted, and duplicates are ignored.
Note that counting distinct values exactly can get expensive over large spreads of values. COUNT(DISTINCT) in standard SQL is exact, but BigQuery also offers APPROX_COUNT_DISTINCT, which uses statistical approximation to save on performance when an estimate is close enough.
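A minimal sketch with a placeholder table name:

```sql
-- Count unique users rather than rows
select count(distinct user_pseudo_id) as users
from <table>
```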
EXCEPT
Introduced in topic Flatten GA4 Schema For Relational Databases
The EXCEPT keyword is very simple but very useful.
With it, you can specify which columns to exclude from your SELECT list.
Any column names you add to the parentheses after EXCEPT will be excluded from the columns included in the query data. Thus, it’s a great way to run the controversial SELECT * while still leaving out the most query-intensive columns, for example.
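For example, a sketch that leaves out the two heavily nested GA4 columns (the table name is a placeholder):

```sql
-- Select all columns except the heavily nested ones
select * except (event_params, user_properties)
from <table>
```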
EXTRACT
Introduced in topic Count Sessions By Part Of Week
Use this to EXTRACT a date part FROM a date expression. Valid date parts are listed in the BigQuery documentation.
-- Returns the quarter (1, 2, 3, or 4) for the current date
select extract(quarter from current_date())
FORMAT
Introduced in topic Calculate User Retention By Cohort
FORMAT can be used to format any BigQuery data type into a string.
It takes two arguments: a format specifier, and the value you want to format.
In the example above, the specifier %03d is used. This is read as:
– % is always required as a prefix in format specifiers.
– 0 means that the padding is done with zeroes rather than spaces
– 3 means that the padded length of the string should always be at least 3 characters in length. Strings of three or more characters in length will not be padded.
– d means that you’re working with integers.
Thus if the input is 1, then the output is 001. If the input is 15, the output is 015. If the input is 1554, the output is 1554.
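The padding example from above as a runnable query:

```sql
-- Returns the string "015"
select format('%03d', 15) as padded
```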
FORMAT_DATE
Introduced in topic Count Sessions Across A Dynamic Date Range
You can use this function to translate any date type into a different string format. See the BigQuery documentation for the full list of available format elements.
-- Outputs "Monday, 25 Jun, 2021" if current_date is June 25th, 2021. select format_date('%A, %d %b, %Y', current_date())
FROM
Introduced in topic The First SQL Query
A required clause in all SQL queries. FROM specifies the source table(s) which is/are being queried by the SELECT statement.
GROUP BY
Introduced in topic Count Sessions Across A Static Date Range
Use GROUP BY to group values in a given column. Without GROUP BY, an SQL query goes over the data source row-by-row, and returns each row that matches the SELECT query.
However, if you want to do aggregations such as COUNT all instances of a given value, or SUM all values for a given instance together, you need to group the values first.
To group a result by a specific column, you need to specify the column name (or its ordinal number, e.g. 1 for the first column in the SELECT statement) after GROUP BY.
If there are multiple columns that are not already aggregated (e.g. due to using COUNT), you need to group ALL of them once you start using GROUP BY.
select
  user_pseudo_id,
  event_name,
  count(*) as number_of_events
from table
group by user_pseudo_id, event_name
The example above returns a count of all events from all users, but it groups the results first by user, then by event name.
You could achieve the same result by using group by 1,2 as user_pseudo_id is the first (1) column, and event_name is the second (2) column in the SELECT statement.
If you tried to group by number_of_events, you’d get an error as COUNT() is already an aggregation.
HAVING
Introduced in topic Group Users By Year Of Their First Visit
The HAVING expression lets you add conditions for displaying rows in the results table. It must always come after the GROUP BY statement.
You can use similar checks that you’d use with WHERE, such as is not null or value > 5. As long as the check is a statement that evaluates to true or false, it’s valid in HAVING.
Unlike WHERE, you can reference column aliases in HAVING.
-- Would not work:
select user_pseudo_id as user
from table
where user is not null

-- Would work:
select user_pseudo_id as user, count(*) as hits
from table
group by user
having user is not null
IFNULL
Introduced in topic Explore Session Traffic Sources
The IFNULL function takes two arguments. The first is the value whose nullness is checked, and the second is what is returned if the first is null.
This is a great way to “cast” those null values into something useful.
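For example, a sketch that mirrors how GA traditionally labels missing traffic sources (the table name is a placeholder):

```sql
-- Replace a null source with "(direct)"
select ifnull(traffic_source.source, '(direct)') as source
from <table>
```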
IN
Introduced in topic Count Sessions By Part Of Week
IN can be used to compare a source value against a set of values. It returns true if the source value is among the values in the parentheses and false otherwise.
-- Returns true
select 2021 in (2019, 2020, 2021)
LAG and LEAD
Introduced in topic Identify Page Pathing For A Single Session
As mentioned above, LAG fetches the preceding row from the partition relative to the current row, and LEAD fetches the following row from the partition.
The function takes two arguments: first, the column whose value is pulled from the partition, and second, the offset.
The offset is 1 by default, meaning the previous (LAG) or next (LEAD) row is fetched. If you specify offset to be 2, for example, then LAG fetches the row two behind from the current one, and LEAD fetches the row two ahead from the current one.
If you look at the image above, you can see LAG and LEAD in action. For any given row in the source table, LAG fetches the value of event_timestamp of the previous row. Because the first row (of each user_pseudo_id respectively) doesn’t have a previous row, null is returned instead.
LEAD does the same thing but in the opposite direction. And because the last row (of each user_pseudo_id respectively) doesn’t have a next row, null is returned instead.
Look closely at the example and try to understand why the result table is populated as it is.
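The pattern described above as a query sketch (GA4 column names are real, `<table>` is a placeholder):

```sql
-- For each event, fetch the timestamps of the previous and next events of the same user
select
  user_pseudo_id,
  event_timestamp,
  lag(event_timestamp) over (
    partition by user_pseudo_id order by event_timestamp
  ) as previous_timestamp,
  lead(event_timestamp) over (
    partition by user_pseudo_id order by event_timestamp
  ) as next_timestamp
from <table>
```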
LEFT JOIN
Introduced in topic Flatten GA4 Schema For Relational Databases
Because BigQuery often has nested data, we’ve mainly been concerned with CROSS JOINs in this course. If you remember, CROSS JOIN takes all the rows in both tables and combines them together.
Well, it’s arguable that LEFT JOIN is still the most popular join in SQL.
With a LEFT JOIN, two tables are joined together, just as with CROSS JOIN. However, with LEFT JOIN, all the rows in the left table (i.e. literally the table name to the left of the “LEFT JOIN” operator) are kept, whereas only those rows in the right table that join with the rows in the left table (using a join key) are preserved.
In other words, a LEFT JOIN will include all the rows in the first table, and then null values for columns of the right table where there wasn’t a match between the two tables. If there are rows in the right table that don’t have a match in the left table, they are discarded.
-- Table_1                     Table_2
+------+----------------+      +------+-------------+
| user | transaction_id |      | user | year_active |
+------+----------------+      +------+-------------+
| simo | t12345         |      | simo | 2021        |
| mari | t23456         |      | mari | 2021        |
| john | t34567         |      | anne | 2018        |
+------+----------------+      +------+-------------+

-- Query
select
  table_1.user,
  table_1.transaction_id,
  table_2.year_active
from Table_1 as table_1
left join Table_2 as table_2
  on table_1.user = table_2.user

-- Result
+------+----------------+-------------+
| user | transaction_id | year_active |
+------+----------------+-------------+
| simo | t12345         | 2021        |
| mari | t23456         | 2021        |
| john | t34567         | null        |
+------+----------------+-------------+
As you can see, Table_1 is the left table, and Table_2 is the right table. In the result, the year_active column is populated with data from the second table, having been joined with the first table using the user name as the join key.
In Table_1, the row with john doesn’t have a corresponding join key in Table_2. Because it’s a LEFT JOIN, the row is still kept in the output, but the column populated from the right table (Table_2) is set to null.
In Table_2, the row with anne doesn’t have a corresponding join key in Table_1. However, because it’s a LEFT JOIN, the unmatched data in the right table (Table_2) is ignored in the output. If you used RIGHT JOIN, you could have reversed the process and kept anne while discarding john.
LIMIT
Introduced in topic The First SQL Query
Add LIMIT followed by an integer (e.g. LIMIT 10) to return only that many rows in the query result. NOTE! This does not reduce query costs, as the query is still run against the full source table.
MAX
Introduced in topic Access A User Property With A Scalar Subquery
MAX returns the largest (“maximum”) non-null value for the input. As an aggregate function, it combines nicely with GROUP BY, returning either a valid value or a null, but never a null if there’s also a valid value available.
In the image above, MAX selects high as the income_level for user123, as that’s the largest non-null value that user had in the source table. However, for user234 it returns null, because user234 did not have a single non-null value for income_level in the source table.
Note that in the last row user123 doesn’t have income either. However, as MAX is used together with GROUP BY, it fetches the largest non-null value from all of user123‘s hits in the table and not just the last one.
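The pattern described above as a sketch, reusing the income_level column from the example (`<table>` is a placeholder):

```sql
-- Return the largest non-null income_level per user
select
  user_pseudo_id,
  max(income_level) as income_level
from <table>
group by user_pseudo_id
```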
NULLIF
Introduced in topic Group Users By Geographical Location
NULLIF is a simple (but powerful) utility function, which sets the value to null if the expression is a match. It’s a clean way to set values to null under certain circumstances, as BigQuery has lots of methods for handling null values and relatively few (or quite complex) ways to handle other primitives.
-- Returns null if user_id is 'admin'
select nullif(user_id, 'admin') as user
from table
ORDER BY
Introduced in topic Count Sessions Across A Static Date Range
ORDER BY is as simple as it sounds. It lets you arrange the results by the values of specific columns in either ascending or descending order.
To use ORDER BY, add it after a possible GROUP BY, then specify the column(s) by which you want to sort the results (in order of importance), and finally whether the results should be in ascending (ASC) or descending (DESC) order.
select
  user_pseudo_id,
  event_name,
  count(*) as number_of_events
from table
group by user_pseudo_id, event_name
order by number_of_events desc
PARSE_DATE
Introduced in topic Count Sessions By Part Of Week
Convert any string (such as event_date in the Google Analytics 4 export) into a date object, so that you can run date formatting and extracting functions against it.
PARSE_DATE takes two arguments: the format of the input string and the string itself. For example, the Google Analytics 4 event_date object is always YYYYMMDD (e.g. 20210521), so the date format would be %Y%m%d.
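For example (the table name is a placeholder, event_date is the real GA4 export column):

```sql
-- Turn the GA4 event_date string (e.g. "20210521") into a date object
select parse_date('%Y%m%d', event_date) as event_date_parsed
from <table>
```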
PIVOT
Introduced in topic Count Events Per User With A Pivot Table
While the operator itself is fairly simple to use (once you get the gist), it has some gotchas.
First, you can’t add PIVOT after a regular SELECT…FROM query. PIVOT is technically part of FROM in the main query, and thus it needs to come after a table or subquery result. That’s why in both the video and in the example above we have SELECT * FROM (subquery).
The IN keyword needs an enumerated list of values to build columns out of. So you can’t just say FOR event_name or FOR event_name IN * or something. You need to list all the possible values that you want to pivot with.
Keeping these in mind, PIVOT follows a very simple logic.
First, provide the source table/query with the columns you want to include in the output AND the columns you want to pivot with.
Then, provide the aggregation function in the PIVOT clause. This can be anything that aggregates, such as COUNT or SUM or MAX.
With FOR, specify the column whose values you want to turn into columns, and finally provide the list of the values you want to pivot with in the IN expression.
-- Main query that selects everything in the subquery
select * from (
  -- Subquery that selects user_pseudo_id and event_name from the source_table
  select user_pseudo_id, event_name
  from source_table
)
-- Pivot
pivot(
  -- Aggregation that counts all instances of each event name
  count(*)
  -- Specify that event_name is what you want to count and pivot with
  for event_name
  -- Specify the values of event_name you want to count and pivot with
  in ('page_view', 'session_start', 'user_engagement')
)
-- Only include users that had page views
where page_view != 0
-- Sort by page views descending
order by page_view desc
Remember that WHERE and ORDER BY can be used to reference columns (and aliases) in the source table. Since PIVOT is part of the FROM clause of the main query, it actually establishes what the source table is, and thus WHERE and ORDER BY work with the pivoted column names, too!
QUALIFY
Introduced in topic Custom Engaged Sessions
QUALIFY is a fairly new addition to BigQuery. It’s similar to WHERE and HAVING, in that it can be used to filter the query output.
WHERE works with plain, ungrouped data, and HAVING works with grouped data. And QUALIFY? It works with the results of analytic functions.
Thus, if you look at the image above, you can see how in the end we’re using QUALIFY to only include those rows that returned 1 for the ROW_NUMBER function. This way we exclude all duplicates, as they would have a ROW_NUMBER of 2 or more.
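A deduplication sketch along those lines (`<table>` is a placeholder; note that BigQuery requires QUALIFY to be accompanied by a WHERE, GROUP BY, or HAVING clause, hence the `where true`):

```sql
-- Keep only the first row per user and transaction_id
select *
from <table>
where true
qualify row_number() over (
  partition by user_pseudo_id, transaction_id
  order by event_timestamp
) = 1
```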
REGEXP_EXTRACT
Introduced in topic Count Sessions Across Event And Intraday Tables
This function extracts the given regular expression pattern from the input string. If a capturing group is used (a pattern within parentheses), then the capturing group match alone is returned.
If there is no match, it returns NULL.
-- Returns "example.com" due to the capturing group in the regex select regexp_extract('email@example.com', '@(.*)')
ROW_NUMBER
Introduced in topic Custom Engaged Sessions
ROW_NUMBER is an analytic function, which returns, unsurprisingly, the row number of the current row when compared to all the rows in the partition.
In the screenshot above, you can see how user123 has two transactions with the same transaction ID: t12345.
When a partition is created for user_pseudo_id and transaction_id, it groups all data in the source table by these two dimensions. Then, the ROW_NUMBER function checks the position of the current row processed in the query against all the rows in the partition.
All other transaction IDs simply return 1, as they are the first (and only) row in the partition. But t12345 returns 1 for the first row and 2 for the second row with this transaction ID.
This is an easy way to deduplicate the results. Only include those rows that are the first in the partition.
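A sketch of that deduplication approach using a subquery (`<table>` is a placeholder):

```sql
-- Number each row within its user/transaction partition,
-- then keep only the first row of each partition
select * except (rn)
from (
  select
    user_pseudo_id,
    transaction_id,
    event_timestamp,
    row_number() over (
      partition by user_pseudo_id, transaction_id
      order by event_timestamp
    ) as rn
  from <table>
)
where rn = 1
```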
SAFE_DIVIDE
Introduced in topic Add Refund, Tax, And Shipping Data
SAFE_DIVIDE is a function that lets you do division without pesky errors that are thrown if you try to divide by zero, for example.
Upon an otherwise erroneous division, it returns a null value instead and lets the query proceed.
The function takes two arguments: the dividend (what is being divided) and the divisor (what is it being divided by).
-- This would throw a "division by zero" error when running the query
select 5/0 as division

-- This would return null and not throw an error
select safe_divide(5, 0) as division
Scalar subquery
Introduced in topic Access A User Property With A Scalar Subquery
Scalar subqueries can be run within the SELECT list. Instead of SELECTing a single column, the scalar subquery SELECTs whatever is returned by the subquery within the parentheses.
-- Select the "age" column as well as the average of all "age" values in the table select age, (select avg(age) from user_stats) as avg_age from table
SELECT
Introduced in topic The First SQL Query
A required statement in all SQL queries. It specifies the columns that are returned by the query. Typically SELECT chooses which columns to return from the source table, but it can also run transformations and aggregations against this information.
SUM
Introduced in topic Calculate Ecommerce Metrics By Date
SUM is an aggregation function that is probably even easier to understand than COUNT.
It returns, logically, the total sum of the values in the given column.
There are some nuances, such as how it only returns the sum of non-null values and how there might be some quirks if using floating point numbers, but the general gist is just as simple as you’d think.
SUM is also useful in analytic functions, where it can be used to calculate a cumulative sum.
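As a sketch of the cumulative case (the revenue column is illustrative, `<table>` is a placeholder; the inner SUM aggregates per date, and the outer SUM runs as an analytic function over the grouped results):

```sql
-- Daily revenue plus a running (cumulative) total over time
select
  event_date,
  sum(revenue) as daily_revenue,
  sum(sum(revenue)) over (order by event_date) as cumulative_revenue
from <table>
group by event_date
```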
_TABLE_SUFFIX
Introduced in topic Count Sessions Across A Static Date Range
If you replace any part of the partitioned table name with a wildcard, the _TABLE_SUFFIX will contain the part that was replaced.
For example, if the table partition name is 20210521, and you replace it with a wildcard like 202105*, the _TABLE_SUFFIX column will contain 21.
The _TABLE_SUFFIX column will not be available if no part of the partition is replaced with the wildcard.
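A sketch of the example above (project and dataset names are placeholders):

```sql
-- The wildcard replaces the day, so _TABLE_SUFFIX contains e.g. "21"
select
  _table_suffix as day,
  count(*) as events
from `project.analytics_123456789.events_202105*`
group by day
order by day
```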
TIMESTAMP_MICROS
Introduced in topic Count Sessions By Part Of Day
This function takes an integer as the input, interpreted as a Unix timestamp in microseconds. Well, that's exactly what the event_timestamp column is!
The end result is a TIMESTAMP object, containing both the date and the time (d'oh) when the timestamp occurred.
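For example (`<table>` is a placeholder):

```sql
-- Convert the microsecond event_timestamp into a human-readable timestamp value
select timestamp_micros(event_timestamp) as event_time
from <table>
```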
UNION
Introduced in topic Custom Engaged Sessions
UNION combines the results of multiple input queries into a single output. It’s most often used to combine multiple source tables of data into one big chunk.
UNION requires that all the input queries return the same number of columns with compatible data types. The column names of the output are taken from the first query.
UNION takes two possible keywords. UNION ALL combines everything regardless of duplicates. UNION DISTINCT ignores rows that are absolutely identical across the joined sets.
-- source_table_1                 source_table_2
+------+----------------+         +------+----------------+
| user | transaction_id |         | user | transaction_id |
+------+----------------+         +------+----------------+
| simo | t12345         |         | simo | t34567         |
| mari | t23456         |         | mari | t23456         |
+------+----------------+         +------+----------------+

-- Query
select * from source_table_1
union distinct
select * from source_table_2

-- Result
+------+----------------+
| user | transaction_id |
+------+----------------+
| simo | t12345         |
| mari | t23456         |
| simo | t34567         |
+------+----------------+
UNNEST
Introduced in topic Unnest All User Properties
UNNEST takes a single argument: the array or repeated field that you want to unpack.
It returns a table with all the items in the array on its own row.
Take a look at the example image above. The unnested structure is aliased to a table named unnested_dev. In the RESULT table, the data that used to be embedded in a single row of the source table is now spread out so that each value is on its own row.
The user_pseudo_id column is combined with the unnested_dev data thanks to the CROSS JOIN, which combines each row of the unnested table with each row of the original source table.
-- Source table (devices is a repeated field)
+-----+----------------+---------+
| ROW | user_pseudo_id | devices |
+-----+----------------+---------+
| 1   | user123        | iphone  |
|     |                | android |
+-----+----------------+---------+

-- After UNNESTing devices
+-----+---------+
| ROW | devices |
+-----+---------+
| 1   | iphone  |
| 2   | android |
+-----+---------+
WHERE
Introduced in topic Count Sessions Of A Single Day
The WHERE clause lets you filter the results after the columns have been selected. The clause requires a boolean expression (returns true or false), and only rows that return a true result are returned in the query result.
The WHERE clause must always follow the FROM clause, and it can only reference columns that are present in the FROM table (so it can’t reference columns generated in SELECT, for example).
-- Select all columns for rows where the event_name is "session_start"
select *
from <table>
where event_name = 'session_start'
WITH…AS
Introduced in topic Count Engaged Sessions Using session_engaged
The WITH clause is primarily designed to improve readability of the query. You can achieve everything WITH…AS does with regular subqueries. But having rows and rows of complicated nesting logic in your main query can make it really difficult to understand what the query does.
With WITH…AS, you create a named subquery that can be referenced in a FROM clause anywhere where FROM can be used.
-- my_query is the named subquery, and it's referenced in the main query with FROM
with my_query as (
  select *
  from my_source_table
  order by kpi desc
)
select kpi_id
from my_query
limit 1