This post contains a brief note on a marathon Big Query SQL immersion session.

SQL is one of the most underrated skills of a data engineer or data scientist. Even though many modern data munging verbs mirror SQL terminology, most data crunching is done in R, Python, Spark, or Hive.
With the advent of cloud services, there is no need to spend time setting up a Spark or Hadoop cluster. With the click of a button (AWS Glue/GCP DataProc), the cluster is up and running, waiting for a number cruncher to fire away queries. There is nothing wrong with sticking to this route to get the job done, and with so many powerful data manipulation libraries in the open source world, the need to write raw SQL queries keeps diminishing. However, there are situations where you have no choice but to resort to SQL. Take Big Query, for example. It promises to democratize business analytics by providing a powerful columnar storage database and exposing it through a standard SQL query engine. One can rely on the age-old SQL lingo and get query results within seconds; all the mighty distributed processing architecture is abstracted away from the user. A recent trend in the open source community is wrapper libraries for Big Query: you write the data munging code in your favorite programming language, and the library converts it into a SQL query and submits it to Big Query. For example, the bigrquery package in R completely abstracts the SQL syntax, which is convenient for R programmers who want to stay in their favorite environment. However, when you hit an issue, it is often the case that the SQL conversion is buggy. There are other reasons to have a solid working knowledge of SQL as well: for one, Big Query's dominance in the data warehousing space is increasing, and you might simply want to write native SQL queries and get the job done.
I am reasonably comfortable with SQL, as I have been using it forever. However, I felt that I should revisit the syntax and spend a few hours with it, focusing on understanding the language rather than on getting a specific number crunching task done as part of some model building work.
Learnings from SQL immersion
- Understand the basic structure of a SQL statement
- Subqueries and when to use them
- JOIN queries and when to use them
- Correlated subqueries: the inner query refers to the outer query
- The UNNEST function
- Create custom index columns based on certain columns
- WINDOW functionality
- Use of functions such as OVER, LAG, ROW_NUMBER
- Using date functions
- Creating a repeated value field
- One can write an ORDER BY clause that refers to the position of the column to order by
- One can refer to a select statement by an alias and use that alias in subsequent queries
- Usage of the HAVING clause: it is mainly used for filtering on an aggregate that you are building as part of your query
- Everything is a table: you can select from a select statement, and you can write a subquery and select from it
- Use a WITH statement to name a subquery so that it can be treated as a table
- Use of the ARRAY_LENGTH function to count the values in an array
- Use of the ARRAY_AGG function to aggregate values into an array
- Big Query UI editor shortcuts: surprised to find that there is no keyboard shortcut to save a query
- CASE statement usage
- COALESCE statement usage: it can be used to populate a dummy value for all null values of a field
- Use of the LIKE operator to check for string matches
- Besides the ON condition of a JOIN, you can add a WHERE clause; it is better to put the filter condition with the JOIN so that the resulting table stays a manageable size
- Understanding the power of creating small tables via subqueries is key to mastering SQL
- GENERATE_ARRAY creates an array of values in a cell
- UNNEST can be used to flatten a repeated field
- One can write a CASE statement and feed it into an aggregate function to create conditional summaries
- One can create moving averages using correlated subqueries
- Correlated subqueries can be used to generate summary reports
- The UNNEST function is Big Query specific
- WITH OFFSET AS index gives the array index as a column
- You can create lags via correlated subqueries by relying on the row index
- One can PARTITION BY multiple columns
- One can use UNNEST with a CROSS JOIN to flatten a repeated field
- There are many built-in functions in Big Query that support regex operations
- UDFs create a function from a SQL expression or JavaScript code; these can be used in query expressions
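Several of the bullets above (OVER, LAG, ROW_NUMBER, PARTITION BY) can be practiced outside Big Query as well. Here is a minimal runnable sketch using Python's sqlite3 (SQLite 3.25+ ships window functions); the table and data are made up for illustration:

```python
import sqlite3

# Toy orders table: two hypothetical customers with a few orders each.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (customer TEXT, amount REAL, ts TEXT);
INSERT INTO orders VALUES
  ('a', 10, '2020-01-01'), ('a', 20, '2020-02-01'), ('a', 15, '2020-03-01'),
  ('b', 5,  '2020-01-15'), ('b', 8,  '2020-02-15');
""")

# ROW_NUMBER ranks each customer's orders by time;
# LAG pulls the previous order's amount within the same partition.
rows = con.execute("""
SELECT
  customer,
  amount,
  ROW_NUMBER() OVER (PARTITION BY customer ORDER BY ts) AS order_number,
  LAG(amount)  OVER (PARTITION BY customer ORDER BY ts) AS prev_amount
FROM orders
ORDER BY customer, order_number
""").fetchall()

for r in rows:
    print(r)
```

The first row of each partition has a NULL `prev_amount`, which is exactly the property the correlated-subquery lag tricks later in this post have to work around.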
Deliberate Practice (55 queries)
Basic Queries in BQ
Order by column position
SELECT
id,
title,
quarter,
quarter_views
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
title = "Big Query - SQL Immersion"
AND quarter > '2019-01-01'
ORDER BY
3 DESC
Alias to refer to a table
SELECT
p.staff_id,
rental_id
FROM
`jrjames83-1171.sampledata.payments` p
LIMIT
1000
Simple JOIN
SELECT
p.staff_id,
p.rental_id,
r.inventory_id
FROM
`jrjames83-1171.sampledata.payments` p
JOIN
`jrjames83-1171.sampledata.rental` r
ON
p.rental_id = r.rental_id
WHERE
p.amount > 0
AND p.staff_id !=1
LIMIT
10
Select from Select
SELECT
*
FROM (
SELECT
p.staff_id,
p.rental_id,
r.inventory_id
FROM
`jrjames83-1171.sampledata.payments` p
JOIN
`jrjames83-1171.sampledata.rental` r
ON
p.rental_id = r.rental_id
WHERE
p.amount > 0
AND p.staff_id NOT IN(1,
3,
5)
LIMIT
10)
On the BQ UI, one can highlight a portion of a bigger query and run just the selection
More elements in a select query
SELECT
DISTINCT t.inventory_id,
i.film_id
FROM (
SELECT
p.staff_id,
p.rental_id,
r.inventory_id
FROM
`jrjames83-1171.sampledata.payments` p
JOIN
`jrjames83-1171.sampledata.rental` r
ON
p.rental_id = r.rental_id
WHERE
p.amount > 0
AND p.staff_id NOT IN(1,
3,
5)
LIMIT
10) t
JOIN
`jrjames83-1171.sampledata.inventory` i
ON
t.inventory_id = i.inventory_id
Use of DISTINCT
SELECT
COUNT(DISTINCT(id)),
COUNT(id),
tag
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
tag!='undefined'
GROUP BY
tag
ORDER BY
1 DESC
Use of HAVING
SELECT
ARRAY_AGG(DISTINCT(tag)) AS ct,
id,
title
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
tag !='undefined'
GROUP BY
id,
title
HAVING
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) > 2
ORDER BY
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) desc
LIMIT
10
Use of TRIM
SELECT
ARRAY_AGG(DISTINCT(tag)) AS ct,
id,
title
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
tag !='undefined' and TRIM(lower(tag)) LIKE "%python%"
GROUP BY
id,
title
HAVING
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) > 2
ORDER BY
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) desc
LIMIT
10
HAVING rewritten with a subquery
SELECT
q1.id,
q1.title,
tags
FROM (
SELECT
ARRAY_AGG(DISTINCT(tag)) AS tags,
id,
title,
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
tag !='undefined'
AND TRIM(LOWER(tag)) LIKE "%python%"
GROUP BY
id,
title ) q1
WHERE
ARRAY_LENGTH(q1.tags) > 2
ORDER BY
ARRAY_LENGTH(q1.tags) DESC
LIMIT
10
Organize subqueries as tables
WITH
q1 AS (
SELECT
ARRAY_AGG(DISTINCT(tag)) AS tags,
id,
title,
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
tag !='undefined'
AND TRIM(LOWER(tag)) LIKE "%python%"
GROUP BY
id,
title )
SELECT
q1.id,
q1.title,
tags
FROM q1
WHERE
ARRAY_LENGTH(q1.tags) > 2
ORDER BY
ARRAY_LENGTH(q1.tags) DESC
LIMIT
10
Give me all the views from questions whose title contains python
SELECT
SUM(q2.total_views)
FROM (
SELECT
DISTINCT(id),
total_views
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
id IN(
SELECT
DISTINCT(id) AS id
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
REGEXP_CONTAINS(title, "python"))) q2
For each tag, get me all the questions that are associated with the tag
SELECT
tag,
ARRAY_AGG(DISTINCT(title))
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag
LIMIT
10
SELECT
tag,
ARRAY_AGG(DISTINCT(title)) AS titles
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag
ORDER BY
ARRAY_LENGTH(titles) DESC
LIMIT
100
SELECT
tag,
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag
ORDER BY
ct DESC
LIMIT
100
SELECT
tag,
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag
LIMIT
10
SELECT
SUM(p1.ct)/COUNT(p1.tag)
FROM (
SELECT
tag,
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag)p1
WITH
stats AS (
SELECT
tag,
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag)
SELECT
SUM(ct)/COUNT(tag),
MIN(ct),
MAX(ct)
FROM
stats
WITH
stats AS (
SELECT
tag,
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag)
SELECT
*
FROM
stats
WHERE
ct > (
SELECT
AVG(ct)
FROM
stats)
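This pattern (build a CTE of per-group counts once, then compare each group against an aggregate computed from the same CTE) is portable across engines. A minimal runnable sketch with Python's sqlite3 and made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE top_questions (tag TEXT, title TEXT);
INSERT INTO top_questions VALUES
  ('python', 'q1'), ('python', 'q2'), ('python', 'q3'),
  ('sql', 'q4'), ('sql', 'q5'),
  ('r', 'q6');
""")

# The WITH block builds per-tag counts once; the outer query then
# filters tags whose count exceeds the average computed from the
# very same CTE -- the "everything is a table" idea in action.
rows = con.execute("""
WITH stats AS (
  SELECT tag, COUNT(DISTINCT title) AS ct
  FROM top_questions
  GROUP BY tag
)
SELECT tag, ct
FROM stats
WHERE ct > (SELECT AVG(ct) FROM stats)
""").fetchall()

print(rows)  # only tags with an above-average question count survive
```

Here the counts are 3, 2, and 1, so the average is 2 and only the python tag passes the filter.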
THE CASE STATEMENT
How to use a CASE statement
WITH
base_table AS (
SELECT
DISTINCT(title),
CASE
WHEN title LIKE "%python%" THEN 'python'
WHEN title LIKE "%sql%" THEN 'sql'
WHEN title LIKE "%javascript%" THEN 'javascript'
WHEN title LIKE "%java%" THEN 'java'
END
AS
LANGUAGE
FROM
`jrjames83-1171.sampledata.top_questions`)
SELECT
COUNT(*),
LANGUAGE
FROM
base_table
GROUP BY
2
ORDER BY
2 DESC
CASE statement revised
WITH
base_table AS (
SELECT
DISTINCT(id),
CASE
WHEN (title LIKE "%python%" OR tag LIKE "%python%") THEN 'python'
WHEN (title LIKE "%sql%"
OR tag LIKE "%sql%") THEN 'sql'
WHEN (title LIKE "%javascript%" OR tag LIKE "%javascript%" ) THEN 'javascript'
WHEN (title LIKE "%java%"
OR tag LIKE "%java%") THEN 'java'
ELSE
NULL
END
AS
LANGUAGE
FROM
`jrjames83-1171.sampledata.top_questions`)
SELECT
COUNT(*),
LANGUAGE
FROM
base_table
GROUP BY
2
ORDER BY
1 DESC
Combining data elements and writing a case statement
WITH
base_table AS (
SELECT
id,
title,
ARRAY_TO_STRING(ARRAY_AGG(DISTINCT tag), " ") AS all_tags
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
id,
title)
SELECT
*
FROM
base_table
JOINS
INNER JOIN
WITH
tab1 AS (
SELECT
num1
FROM
UNNEST(GENERATE_ARRAY(10,20)) AS num1),
tab2 AS (
SELECT
num2
FROM
UNNEST(GENERATE_ARRAY(8,12)) AS num2)
SELECT
num1,
num2
FROM
tab1
JOIN
tab2
ON
tab1.num1=tab2.num2
LEFT/RIGHT JOIN
WITH
tab1 AS (
SELECT
num1
FROM
UNNEST(GENERATE_ARRAY(10,20)) AS num1),
tab2 AS (
SELECT
num2
FROM
UNNEST(GENERATE_ARRAY(8,12)) AS num2)
SELECT
num1,
num2
FROM
tab1
LEFT JOIN
tab2
ON
tab1.num1=tab2.num2
WITH
tab1 AS (
SELECT
num1
FROM
UNNEST(GENERATE_ARRAY(10,20)) AS num1),
tab2 AS (
SELECT
num2
FROM
UNNEST(GENERATE_ARRAY(8,12)) AS num2)
SELECT
num1,
num2
FROM
tab1
RIGHT JOIN
tab2
ON
tab1.num1=tab2.num2
FULL OUTER JOIN
WITH
tab1 AS (
SELECT
num1
FROM
UNNEST(GENERATE_ARRAY(10,20)) AS num1),
tab2 AS (
SELECT
num2
FROM
UNNEST(GENERATE_ARRAY(8,12)) AS num2)
SELECT
num1,
num2
FROM
tab1
FULL OUTER JOIN
tab2
ON
tab1.num1=tab2.num2
City with most orders
WITH
master AS (
SELECT
p1.order_id AS x,
p2.customer_city AS y
FROM
`jrjames83-1171.sampledata.orders` AS p1
JOIN
`jrjames83-1171.sampledata.customers` AS p2
ON
p1.customer_id = p2.customer_id)
SELECT
COUNT(x),
y
FROM
master
GROUP BY
y
ORDER BY
1 DESC
Max orders by customer
WITH
master AS (
SELECT
p1.order_id AS x,
p2.customer_unique_id AS y
FROM
`jrjames83-1171.sampledata.orders` AS p1
JOIN
`jrjames83-1171.sampledata.customers` AS p2
ON
p1.customer_id = p2.customer_id)
SELECT
COUNT(x),
y
FROM
master
GROUP BY
y
ORDER BY
1 DESC
Use of ROW_NUMBER and PARTITION
SELECT
p1.order_id AS x1,
p2.customer_unique_id AS x2,
p1.order_purchase_timestamp AS y,
ROW_NUMBER() OVER (PARTITION BY p2.customer_unique_id ORDER BY p1.order_purchase_timestamp)
FROM
`jrjames83-1171.sampledata.orders` AS p1
JOIN
`jrjames83-1171.sampledata.customers` AS p2
ON
p1.customer_id = p2.customer_id
ORDER BY
1,
2
Customers acquired every month
WITH
base_table AS(
SELECT
p2.customer_unique_id AS x1,
DATE_TRUNC(p1.order_purchase_timestamp,month) AS y,
ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC(p1.order_purchase_timestamp,month)
ORDER BY
p1.order_purchase_timestamp) AS rn
FROM
`jrjames83-1171.sampledata.orders` AS p1
JOIN
`jrjames83-1171.sampledata.customers` AS p2
ON
p1.customer_id = p2.customer_id)
SELECT
MAX(rn),
y
FROM
base_table
GROUP BY
y
ORDER BY
2
Everything is a table and you can use them in clever ways
WITH
customer_order_ranks AS (
SELECT
c.customer_unique_id,
o.order_purchase_timestamp,
ROW_NUMBER() OVER(PARTITION BY c.customer_unique_id) AS customer_order_number
FROM
`jrjames83-1171.sampledata.customers` AS c
JOIN
`jrjames83-1171.sampledata.orders` AS o
ON
c.customer_id = o.customer_id
ORDER BY
1,
2),
exclusions AS (
SELECT
customer_unique_id,
MAX(customer_order_number)
FROM
customer_order_ranks
GROUP BY
1
HAVING
MAX(customer_order_number) =1 )
SELECT
customer_unique_id,
customer_order_number
FROM
customer_order_ranks
WHERE
customer_unique_id NOT IN (
SELECT
customer_unique_id
FROM
exclusions)
Average days between orders
WITH
customer_order_ranks AS (
SELECT
c.customer_unique_id,
o.order_purchase_timestamp,
o.order_id,
ROW_NUMBER() OVER(PARTITION BY c.customer_unique_id ORDER BY o.order_purchase_timestamp) AS customer_order_number
FROM
`jrjames83-1171.sampledata.customers` AS c
JOIN
`jrjames83-1171.sampledata.orders` AS o
ON
c.customer_id = o.customer_id
ORDER BY
1,
2),
exclusions AS (
SELECT
customer_unique_id,
MAX(customer_order_number)
FROM
customer_order_ranks
GROUP BY
1
HAVING
MAX(customer_order_number) =1 ),
inclusions AS(
SELECT
customer_unique_id,
customer_order_number,
order_purchase_timestamp
FROM
customer_order_ranks
WHERE
customer_unique_id NOT IN (
SELECT
customer_unique_id
FROM
exclusions)),
reldata AS(
SELECT
customer_unique_id,
customer_order_number,
order_purchase_timestamp,
LAG(order_purchase_timestamp) OVER(PARTITION BY customer_unique_id ORDER BY customer_order_number ) AS prev_purchase
FROM
inclusions
ORDER BY
customer_unique_id)
SELECT
AVG(DATE_DIFF(order_purchase_timestamp,prev_purchase, day))
FROM
reldata
WHERE
prev_purchase IS NOT NULL
Using CASE
WITH
took_offer AS (
SELECT
1 AS customer_id,
34.99 AS spend
UNION ALL
SELECT
2,
21.99
UNION ALL
SELECT
3,
179.0
UNION ALL
SELECT
4,
0.99
UNION ALL
SELECT
5,
1299.9 ),
loyalty AS (
SELECT
3 AS customer_id_1,
1 AS status
UNION ALL
SELECT
4,
1
UNION ALL
SELECT
8,
2
UNION ALL
SELECT
12,
1
UNION ALL
SELECT
10,
2 ),
results AS (
SELECT
*
FROM
took_offer AS t
FULL OUTER JOIN
loyalty AS j
ON
t.customer_id = j.customer_id_1 )
SELECT
customer_id,
spend,
customer_id_1,
status,
CASE
WHEN customer_id IS NOT NULL AND customer_id_1 IS NOT NULL THEN "both"
WHEN customer_id IS NULL
AND customer_id_1 IS NOT NULL THEN "loyalty only"
WHEN customer_id IS NOT NULL AND customer_id_1 IS NULL THEN "offer only"
ELSE
"na"
END
FROM
results
Revenue trends per hour - case study
SELECT
EXTRACT(hour
FROM
o.order_purchase_timestamp ) AS hr,
ROUND(SUM(p.payment_value),2) AS sales,
SUM(SUM(p.payment_value)) OVER () AS total_sales
FROM
`jrjames83-1171.sampledata.orders` o
LEFT JOIN
`jrjames83-1171.sampledata.order_payments` p
ON
o.order_id = p.order_id
GROUP BY
1
ORDER BY
1
Revenue trends per time slots
WITH
base_table AS(
SELECT
EXTRACT(hour
FROM
o.order_purchase_timestamp ) AS hr,
ROUND(SUM(p.payment_value),2) AS sales,
SUM(SUM(p.payment_value)) OVER () AS total_sales
FROM
`jrjames83-1171.sampledata.orders` o
LEFT JOIN
`jrjames83-1171.sampledata.order_payments` p
ON
o.order_id = p.order_id
GROUP BY
1
ORDER BY
1 )
SELECT
SUM(sales),
(CASE
WHEN hr BETWEEN 0 AND 5 THEN "morning"
WHEN hr BETWEEN 6
AND 11 THEN "afternoon"
WHEN hr BETWEEN 12 AND 16 THEN "evening"
WHEN hr BETWEEN 17
AND 24 THEN "late evening"
ELSE
"na"
END
) AS timeslot
FROM
base_table
WHERE
base_table.hr IS NOT NULL
GROUP BY
timeslot
Use case statements to create conditional summaries
WITH
base_table AS(
SELECT
EXTRACT(hour
FROM
o.order_purchase_timestamp ) AS hr,
ROUND(SUM(p.payment_value),2) AS sales,
SUM(SUM(p.payment_value)) OVER () AS total_sales
FROM
`jrjames83-1171.sampledata.orders` o
LEFT JOIN
`jrjames83-1171.sampledata.order_payments` p
ON
o.order_id = p.order_id
GROUP BY
1
ORDER BY
1 )
SELECT
SUM(CASE
WHEN hr BETWEEN 0 AND 5 THEN sales
ELSE
0
END
),
SUM(CASE
WHEN hr BETWEEN 7 AND 12 THEN sales
ELSE
0
END
)
FROM
base_table
- GENERATE_ARRAY with UNNEST and OFFSET is useful for creating a small table to experiment with
- A correlated subquery sample
WITH
base_table AS (
SELECT
index,
dollars
FROM
UNNEST(GENERATE_ARRAY(1,5)) AS dollars
WITH
OFFSET
AS index)
SELECT
bt1.*,
(
SELECT
SUM(bt2.dollars)
FROM
base_table bt2
WHERE
bt2.index <= bt1.index ) AS cumsum
FROM
base_table bt1
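The same correlated-subquery cumulative sum can be reproduced outside Big Query. A minimal runnable sketch with Python's sqlite3 (the base_table is built by hand since SQLite has no GENERATE_ARRAY, and the column is renamed idx because index is a keyword there):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE base_table (idx INTEGER, dollars INTEGER);
INSERT INTO base_table VALUES (0, 1), (1, 2), (2, 3), (3, 4), (4, 5);
""")

# For each outer row, the inner query re-scans the table and sums every
# row whose idx is <= the outer row's idx: a running total.
rows = con.execute("""
SELECT
  bt1.idx,
  bt1.dollars,
  (SELECT SUM(bt2.dollars)
   FROM base_table bt2
   WHERE bt2.idx <= bt1.idx) AS cumsum
FROM base_table bt1
ORDER BY bt1.idx
""").fetchall()

print(rows)  # cumsum column: 1, 3, 6, 10, 15
```

Note the quadratic cost of this approach: every row triggers a fresh scan, which is why the window-function version is preferable on large tables.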
Moving Averages
SELECT
*,
AVG(dollars) OVER (ORDER BY index ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS three_day_ma
FROM
base_table
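The window-frame form of the moving average runs essentially unchanged in other engines. A runnable sqlite3 sketch with the same hand-built base_table (column renamed idx, as SQLite reserves index):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE base_table (idx INTEGER, dollars INTEGER);
INSERT INTO base_table VALUES (0, 1), (1, 2), (2, 3), (3, 4), (4, 5);
""")

# ROWS BETWEEN 2 PRECEDING AND CURRENT ROW averages the current row
# and up to the two rows before it: a 3-point moving average.
rows = con.execute("""
SELECT
  idx,
  dollars,
  AVG(dollars) OVER (
    ORDER BY idx
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS three_point_ma
FROM base_table
""").fetchall()

print(rows)  # moving average: 1.0, 1.5, 2.0, 3.0, 4.0
```

The first two rows average over a shorter frame because fewer than two preceding rows exist, which is worth remembering when interpreting the head of any moving-average series.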
Checking for a specific condition
WITH
base_table AS (
SELECT
index,
dollars
FROM
UNNEST(GENERATE_ARRAY(1,5)) AS dollars
WITH
OFFSET
AS index),
results AS (
SELECT
dollars,
SUM(dollars) OVER (ORDER BY index ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prev
FROM
base_table )
SELECT
COUNT(*)
FROM
results
WHERE
dollars - prev <0
AND prev IS NOT NULL
Another solution
WITH
base_table AS (
SELECT
index,
dollars
FROM
UNNEST(GENERATE_ARRAY(1,5)) AS dollars
WITH
OFFSET
AS index)
SELECT
DISTINCT check
FROM (
SELECT
(dollars > LAG(dollars) OVER (ORDER BY index) ) AS check
FROM
base_table)
WHERE
check IS NOT NULL
WITH
base_table AS (
SELECT
index,
dollars
FROM
UNNEST(GENERATE_ARRAY(1,5)) AS dollars
WITH
OFFSET
AS index)
SELECT
bt1.dollars,
bt1.index,
(
SELECT
MAX(bt2.dollars)
FROM
base_table bt2
WHERE
bt1.index > bt2.index)
FROM
base_table bt1
DVD Rental case study - Revenue from the first order
WITH
first_amount AS (
SELECT
customer_id,
amount
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_date) AS index
FROM
`jrjames83-1171.sampledata.payments`)
WHERE
index = 1 ),
summary_stat AS (
SELECT
customer_id,
SUM(amount) AS total_revenue,
MIN(payment_date)
FROM
`jrjames83-1171.sampledata.payments`
GROUP BY
customer_id)
SELECT
*
FROM
summary_stat a
JOIN
first_amount b
ON
b.customer_id = a.customer_id
DVD Rental case study - Revenue from the first order - Act 2
WITH
first_amount AS (
SELECT
customer_id,
amount AS first_amt
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_date) AS index
FROM
`jrjames83-1171.sampledata.payments`)
WHERE
index = 1 ),
summary_stat AS (
SELECT
customer_id,
SUM(amount) AS total_revenue,
MIN(payment_date) AS m_pay_date,
FROM
`jrjames83-1171.sampledata.payments`
GROUP BY
customer_id)
SELECT
a.customer_id,
total_revenue,
first_amt
FROM
summary_stat a
JOIN
first_amount b
ON
b.customer_id = a.customer_id
ORDER BY
1
LTVP
WITH
first_amount AS (
SELECT
customer_id,
amount AS first_amt
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_date) AS index
FROM
`jrjames83-1171.sampledata.payments`)
WHERE
index = 1 ),
summary_stat AS (
SELECT
customer_id,
SUM(amount) AS total_revenue,
MIN(payment_date) AS m_pay_date
FROM
`jrjames83-1171.sampledata.payments`
GROUP BY
customer_id),
summary_so_far AS(
SELECT
a.customer_id,
m_pay_date,
total_revenue,
first_amt,
first_amt/total_revenue*100 AS first_as_pct_tot_rev
FROM
summary_stat a
JOIN
first_amount b
ON
b.customer_id = a.customer_id
ORDER BY
1)
SELECT
sf.*,
(
SELECT
sum (p2.amount)
FROM
`jrjames83-1171.sampledata.payments` p2
WHERE
p2.customer_id =sf.customer_id
AND DATE(p2.payment_date) BETWEEN DATE(sf.m_pay_date)
AND DATE(sf.m_pay_date)+90 ) AS customer_tv_first_90days
FROM
summary_so_far sf
LTVP using standard functions
WITH
first_amount AS (
SELECT
customer_id,
amount AS first_amt
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_date) AS index
FROM
`jrjames83-1171.sampledata.payments`)
WHERE
index = 1 ),
summary_stat AS (
SELECT
customer_id,
SUM(amount) AS total_revenue,
MIN(payment_date) AS m_pay_date
FROM
`jrjames83-1171.sampledata.payments`
GROUP BY
customer_id),
summary_so_far AS(
SELECT
a.customer_id,
m_pay_date,
total_revenue,
first_amt,
first_amt/total_revenue*100 AS first_as_pct_tot_rev
FROM
summary_stat a
JOIN
first_amount b
ON
b.customer_id = a.customer_id
ORDER BY
1)
SELECT
sf.*,
(
SELECT
sum (p2.amount)
FROM
`jrjames83-1171.sampledata.payments` p2
WHERE
p2.customer_id =sf.customer_id
AND DATE(p2.payment_date) BETWEEN DATE(sf.m_pay_date)
AND DATE_ADD(DATE(sf.m_pay_date), INTERVAL 30 day) ) AS customer_tv_first_30days
FROM
summary_so_far sf
Moving Average Case Study
Attempt 1
WITH
bt AS (
SELECT
Date,
close,
AVG(close) OVER(ORDER BY Date ROWS BETWEEN 50 PRECEDING AND CURRENT ROW ) AS ma_50,
AVG(close) OVER(ORDER BY Date ROWS BETWEEN 200 PRECEDING AND CURRENT ROW ) AS ma_200
FROM
`jrjames83-1171.sampledata.stock_prices`
ORDER BY
Date)
SELECT
*,
(CASE
WHEN ma_50 > ma_200 THEN FALSE
ELSE
TRUE
END
) AS status
FROM
bt
Attempt 2
WITH
bt AS (
SELECT
*,
ROW_NUMBER() OVER(ORDER BY Date) AS index
FROM
`jrjames83-1171.sampledata.stock_prices`
ORDER BY
Date),
bt1 AS (
SELECT
Date,
close,
AVG(close) OVER(ORDER BY index RANGE BETWEEN 49 PRECEDING AND CURRENT ROW ) AS ma_50,
AVG(close) OVER(ORDER BY index RANGE BETWEEN 199 PRECEDING AND CURRENT ROW ) AS ma_200
FROM
bt
ORDER BY
Date)
SELECT
*,
(CASE
WHEN ma_50 > ma_200 THEN FALSE
ELSE
TRUE
END
) AS status
FROM
bt1
Attempt 3
WITH
bt AS (
SELECT
*,
ROW_NUMBER() OVER(ORDER BY Date) AS index
FROM
`jrjames83-1171.sampledata.stock_prices`
ORDER BY
Date),
bt1 AS (
SELECT
Date,
close,
index,
AVG(close) OVER(ORDER BY index RANGE BETWEEN 49 PRECEDING AND CURRENT ROW ) AS ma_50,
AVG(close) OVER(ORDER BY index RANGE BETWEEN 199 PRECEDING AND CURRENT ROW ) AS ma_200
FROM
bt
ORDER BY
Date),
signal AS (
SELECT
*,
IF
(ma_50 > ma_200,
"buy",
"sell") AS status
FROM
bt1),
signal_stat AS (
SELECT
*,
LAG(status) OVER(ORDER BY index) AS lagged
FROM
signal
ORDER BY
index)
SELECT
*,
IF
(lagged!=status,
"y",
"n") AS crossover
FROM
signal_stat
More Complex JOINS and Partitioning
WITH
bt AS (
SELECT
customer_id,
rental_id,
SUM(amount) AS rev
FROM
`jrjames83-1171.sampledata.payments`
GROUP BY
customer_id,
rental_id
ORDER BY
customer_id,
rental_id,
rev DESC),
bt2 AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY rev DESC ) AS top
FROM
bt
ORDER BY
customer_id,
top),
bt3 AS (
SELECT
bt2.*,
rent.inventory_id
FROM
bt2
JOIN
`jrjames83-1171.sampledata.rental` AS rent
ON
bt2.customer_id = rent.customer_id
WHERE
bt2.top IN (1,
2)),
bt4 AS (
SELECT
bt3.*,
inv.*
FROM
bt3
LEFT JOIN
`jrjames83-1171.sampledata.inventory`AS inv
ON
bt3.inventory_id = inv.inventory_id),
bt5 AS (
SELECT
bt4.*,
film.*
FROM
bt4
LEFT JOIN
`jrjames83-1171.sampledata.film`AS film
ON
bt4.film_id = film.film_id)
SELECT
customer_id,
rev,
rating
FROM
bt5
ORDER BY
customer_id
Stackoverflow Text analysis
Most common word for each tag
Approach 1
WITH
tab1 AS (
SELECT
tag,
SPLIT(title," ") AS tokens
FROM
`jrjames83-1171.sampledata.top_questions`),
tab2 AS (
SELECT
tag,
ftoken
FROM
tab1
CROSS JOIN
UNNEST(tab1.tokens) AS ftoken),
tab3 AS (
SELECT
tag,
ftoken,
COUNT(*) AS ct
FROM
tab2
GROUP BY
tag,
ftoken),
tab4 AS (
SELECT
tab3.*,
ROW_NUMBER() OVER(PARTITION BY tag ORDER BY ct DESC) AS rank
FROM
tab3)
SELECT
*
FROM
tab4
WHERE
rank = 1
LIMIT
10
Approach 2
WITH
tab1 AS (
SELECT
tag,
SPLIT(title," ") AS tokens
FROM
`jrjames83-1171.sampledata.top_questions`),
tab2 AS (
SELECT
DISTINCT tag,
word
FROM
tab1,
UNNEST(tokens) word ),
tab3 AS (
SELECT
tag,
word,
COUNT(*) AS ct
FROM
tab2
GROUP BY
tag,
word),
tab4 AS (
SELECT
tab3.*,
ROW_NUMBER() OVER(PARTITION BY tag ORDER BY ct DESC) AS rank
FROM
tab3)
SELECT
*
FROM
tab4
WHERE
rank = 1
LIMIT
10
Regex Operations
Advent of Code 2020 - Day 2 (Part A)
Attempt 1
WITH
bt AS (
SELECT
REGEXP_EXTRACT(occurrence_range, "(\\d*)-[\\d]*") AS sidx,
REGEXP_EXTRACT(occurrence_range, "[\\d]*-(\\d*)") AS eidx,
*
FROM
`jrjames83-1171.sampledata.aoc2017day2`),
bt1 AS (
SELECT
sidx,
eidx,
character,
datafield,
COUNT(occs) AS occs,
FROM
bt,
UNNEST(REGEXP_EXTRACT_ALL(bt.datafield, character)) occs
GROUP BY
1,
2,
3,
4)
SELECT
sidx,
eidx,
character,
datafield,
(CASE
WHEN (CAST(sidx AS int64) <= occs) AND (occs <= CAST(eidx AS int64)) THEN TRUE
ELSE
FALSE
END
) AS status
FROM
bt1
Attempt 2
WITH
bt AS (
SELECT
REGEXP_EXTRACT(occurrence_range, r"(\d+)-[\d]+") AS sidx,
REGEXP_EXTRACT(occurrence_range, r"[\d]+-(\d+)") AS eidx,
*
FROM
`jrjames83-1171.sampledata.aoc2017day2`),
bt1 AS (
SELECT
sidx,
eidx,
character,
datafield,
COUNT(occs) AS occs,
FROM
bt,
UNNEST(REGEXP_EXTRACT_ALL(bt.datafield, character)) occs
GROUP BY
1,
2,
3,
4)
SELECT
sidx,
eidx,
character,
datafield,
(CASE
WHEN (CAST(sidx AS int64) <= occs) AND (occs <= CAST(eidx AS int64)) THEN TRUE
ELSE
FALSE
END
) AS status
FROM
bt1
Attempt 3
WITH
bt AS (
SELECT
*,
REGEXP_EXTRACT(occurrence_range, r"(\d+)-[\d]+") AS sidx,
REGEXP_EXTRACT(occurrence_range, r"[\d]+-(\d+)") AS eidx,
ARRAY_LENGTH(REGEXP_EXTRACT_ALL(datafield, character)) AS occs
FROM
`jrjames83-1171.sampledata.aoc2017day2`),
bt1 AS (
SELECT
sidx,
eidx,
character,
datafield,
occs BETWEEN (CAST(sidx AS int64))
AND (CAST(eidx AS int64)) AS status
FROM
bt)
SELECT
*
FROM
bt1
WHERE
status = TRUE
Advent of Code 2020 - Day 2 (Part B)
WITH
bt AS (
SELECT
*,
REGEXP_EXTRACT(occurrence_range, r"(\d+)-[\d]+") AS sidx,
REGEXP_EXTRACT(occurrence_range, r"[\d]+-(\d+)") AS eidx,
ARRAY_LENGTH(REGEXP_EXTRACT_ALL(datafield, character)) AS occs
FROM
`jrjames83-1171.sampledata.aoc2017day2`),
bt1 AS (
SELECT
sidx,
eidx,
character,
datafield,
SUBSTR(datafield,CAST(sidx AS int64),1) AS leftw,
SUBSTR(datafield,CAST(eidx AS int64),1) AS rightw
FROM
bt),
bt2 AS (
SELECT
bt1.*,
(CASE
WHEN character = leftw AND character <> rightw THEN TRUE
WHEN character = rightw
AND character <> leftw THEN TRUE
ELSE
FALSE
END
) AS status
FROM
bt1)
SELECT
COUNT(*)
FROM
bt2
WHERE
status = TRUE
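The Part B rule (the character must appear at exactly one of the two 1-indexed positions) is an exclusive-or, which is also easy to check in a few lines of Python; the sample lines below are made up in the puzzle's format, not taken from the dataset above:

```python
import re

# Hypothetical lines in the puzzle's "range char: password" shape.
lines = [
    "1-3 a: abcde",     # 'a' at position 1 only -> valid
    "1-3 b: cdefg",     # 'b' at neither position -> invalid
    "2-9 c: ccccccccc", # 'c' at both positions -> invalid
]

def is_valid(line: str) -> bool:
    m = re.match(r"(\d+)-(\d+) (\w): (\w+)", line)
    i, j, ch, pw = int(m.group(1)), int(m.group(2)), m.group(3), m.group(4)
    # XOR: exactly one of the two 1-indexed positions holds the character,
    # matching the CASE statement in the SQL solution above.
    return (pw[i - 1] == ch) != (pw[j - 1] == ch)

print(sum(is_valid(l) for l in lines))  # number of valid lines
```

The two `WHEN ... THEN TRUE` branches of the SQL CASE collapse into a single `!=` comparison here, which is the usual way to express exclusive-or on booleans.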
Takeaway
On a periodic basis, it is worth revisiting a concept, subject, or idea that you think you are extremely familiar with. You will be surprised by how much you do not know.