This post contains a brief note on a marathon Big Query SQL immersion session.

SQL is one of the most underrated skills of a data engineer or data scientist. Even though many modern data munging verbs mirror SQL terminology, most data crunching is done in R, Python, Spark, or Hive.
With the advent of cloud services, there is no need to spend time setting up a Spark or Hadoop cluster. With the click of a button (AWS Glue/GCP DataProc), the cluster is up and running, waiting for a number cruncher to fire away queries. There is nothing wrong with sticking to this route to get the job done, and with so many powerful data manipulation libraries in the open source world, the need to write raw SQL queries keeps diminishing. However, there are situations where you have no choice but to resort to SQL. Take Big Query, for example. It promises to democratize business analytics by providing a powerful columnar storage database and exposing it through a standard SQL query engine. One can rely on the age-old SQL lingo and get query results within seconds; all the mighty distributed processing architecture is abstracted away from the user. A recent trend in the open source community is wrapper libraries for Big Query: you write the data munging code in your favorite programming language, and the library converts it into a SQL query and submits it to Big Query. For example, the bigrquery package in R completely abstracts the SQL syntax, which is convenient for R programmers who want to stay in their favorite environment. However, when you hit an issue, it is often the case that the SQL conversion is buggy. There are other reasons to have a solid working knowledge of SQL as well: for one, Big Query's dominance in the data warehousing space is increasing, and you might simply want to write native SQL queries and get the job done.
I am reasonably comfortable with SQL, as I have been using it forever. However, I felt that I should revisit the syntax and spend a few hours with it, focusing on understanding the language rather than on getting a specific number crunching task done as part of some model building work.
Learnings from SQL immersion
- Understand the basic structure of a SQL statement
- Subqueries and when to use them
- JOIN queries and when to use them
- Correlated subqueries: the inner query refers to the outer query
- The UNNEST function
- Create custom index columns based on certain columns
- WINDOW functionality
- Use of functions such as OVER, LAG, ROW_NUMBER
- Using date functions
- Creating a repeated value field
- One can write an ORDER BY clause that refers to the position of the column to order by
- One can refer to a select statement by an alias and use that alias in subsequent queries
- Usage of the HAVING clause: it is mainly used for filtering on an aggregate that you are building as part of your query
- Everything is a table: you can select from a select statement, and you can write a subquery and select from it
- Use a WITH statement to name a subquery so that it can be treated as a table
- Use of the ARRAY_LENGTH function to count the values in an array
- Use of the ARRAY_AGG function to aggregate values into an array
- Big Query UI editor shortcuts: surprised to find that there is no keyboard shortcut to save a query
- CASE statement usage
- COALESCE statement usage: it can be used to populate a dummy value for all null values of a field
- Use of the LIKE operator to check for string matches
- Besides the ON condition of a JOIN, you can add a WHERE clause; it is better to put the filter condition with the JOIN so that the resulting table stays a manageable size
- Understanding the power of creating small tables via subqueries is key to mastering SQL
- GENERATE_ARRAY creates an array of values in a cell
- UNNEST can be used to flatten a repeated field
- One can write a CASE statement and feed it into an aggregate function to create conditional summaries
- One can create moving averages using correlated subqueries
- Correlated subqueries can be used to generate summary reports
- The UNNEST function is Big Query specific
- WITH OFFSET AS index gives the array index as a column
- You can create lags via correlated subqueries by relying on the row index
- One can PARTITION BY multiple columns
- One can use UNNEST with a CROSS JOIN to flatten a repeated field
- There are many built-in functions in Big Query that support regex operations
- UDFs create a function from a SQL expression or JavaScript code; these can be used in query expressions
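Several of the bullets above (OVER, LAG, ROW_NUMBER, PARTITION BY) can be practiced outside Big Query as well. Here is a minimal runnable sketch using Python's sqlite3 (SQLite 3.25+ ships window functions); the table and data are made up for illustration:

```python
import sqlite3

# Toy orders table: two hypothetical customers with a few orders each.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE orders (customer TEXT, amount REAL, ts TEXT);
INSERT INTO orders VALUES
  ('a', 10, '2020-01-01'), ('a', 20, '2020-02-01'), ('a', 15, '2020-03-01'),
  ('b', 5,  '2020-01-15'), ('b', 8,  '2020-02-15');
""")

# ROW_NUMBER ranks each customer's orders by time;
# LAG pulls the previous order's amount within the same partition.
rows = con.execute("""
SELECT
  customer,
  amount,
  ROW_NUMBER() OVER (PARTITION BY customer ORDER BY ts) AS order_number,
  LAG(amount)  OVER (PARTITION BY customer ORDER BY ts) AS prev_amount
FROM orders
ORDER BY customer, order_number
""").fetchall()

for r in rows:
    print(r)
```

The first row of each partition has a NULL `prev_amount`, which is exactly the property the correlated-subquery lag tricks later in this post have to work around.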
Deliberate Practice (55 queries)
Basic Queries in BQ
Order by column position
SELECT
id,
title,
quarter,
quarter_views
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
title = "Big Query - SQL Immersion"
AND quarter > '2019-01-01'
ORDER BY
3 DESC
Alias to refer to a table
SELECT
p.staff_id,
rental_id
FROM
`jrjames83-1171.sampledata.payments` p
LIMIT
1000
Simple JOIN
SELECT
p.staff_id,
p.rental_id,
r.inventory_id
FROM
`jrjames83-1171.sampledata.payments` p
JOIN
`jrjames83-1171.sampledata.rental` r
ON
p.rental_id = r.rental_id
WHERE
p.amount > 0
AND p.staff_id !=1
LIMIT
10
Select from Select
SELECT
*
FROM (
SELECT
p.staff_id,
p.rental_id,
r.inventory_id
FROM
`jrjames83-1171.sampledata.payments` p
JOIN
`jrjames83-1171.sampledata.rental` r
ON
p.rental_id = r.rental_id
WHERE
p.amount > 0
AND p.staff_id NOT IN(1,
3,
5)
LIMIT
10)
On the BQ UI, one can highlight a portion of a bigger query and run just the selection
More elements in a select query
SELECT
DISTINCT t.inventory_id,
i.film_id
FROM (
SELECT
p.staff_id,
p.rental_id,
r.inventory_id
FROM
`jrjames83-1171.sampledata.payments` p
JOIN
`jrjames83-1171.sampledata.rental` r
ON
p.rental_id = r.rental_id
WHERE
p.amount > 0
AND p.staff_id NOT IN(1,
3,
5)
LIMIT
10) t
JOIN
`jrjames83-1171.sampledata.inventory` i
ON
t.inventory_id = i.inventory_id
Use of DISTINCT
SELECT
COUNT(DISTINCT(id)),
COUNT(id),
tag
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
tag!='undefined'
GROUP BY
tag
ORDER BY
1 DESC
Use of HAVING
SELECT
ARRAY_AGG(DISTINCT(tag)) AS ct,
id,
title
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
tag !='undefined'
GROUP BY
id,
title
HAVING
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) > 2
ORDER BY
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) desc
LIMIT
10
Use of TRIM
SELECT
ARRAY_AGG(DISTINCT(tag)) AS ct,
id,
title
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
tag !='undefined' and TRIM(lower(tag)) LIKE "%python%"
GROUP BY
id,
title
HAVING
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) > 2
ORDER BY
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(tag))) desc
LIMIT
10
HAVING rewritten with a subquery
SELECT
q1.id,
q1.title,
tags
FROM (
SELECT
ARRAY_AGG(DISTINCT(tag)) AS tags,
id,
title,
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
tag !='undefined'
AND TRIM(LOWER(tag)) LIKE "%python%"
GROUP BY
id,
title ) q1
WHERE
ARRAY_LENGTH(q1.tags) > 2
ORDER BY
ARRAY_LENGTH(q1.tags) DESC
LIMIT
10
Organize subqueries as tables
WITH
q1 AS (
SELECT
ARRAY_AGG(DISTINCT(tag)) AS tags,
id,
title,
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
tag !='undefined'
AND TRIM(LOWER(tag)) LIKE "%python%"
GROUP BY
id,
title )
SELECT
q1.id,
q1.title,
tags
FROM q1
WHERE
ARRAY_LENGTH(q1.tags) > 2
ORDER BY
ARRAY_LENGTH(q1.tags) DESC
LIMIT
10
Give me all the views from questions whose title contains python
SELECT
SUM(q2.total_views)
FROM (
SELECT
DISTINCT(id),
total_views
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
id IN(
SELECT
DISTINCT(id) AS id
FROM
`jrjames83-1171.sampledata.top_questions`
WHERE
REGEXP_CONTAINS(title, "python"))) q2
For each tag, get me all the questions that are associated with the tag
SELECT
tag,
ARRAY_AGG(DISTINCT(title))
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag
LIMIT
10
SELECT
tag,
ARRAY_AGG(DISTINCT(title)) AS titles
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag
ORDER BY
ARRAY_LENGTH(titles) DESC
LIMIT
100
SELECT
tag,
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag
ORDER BY
ct DESC
LIMIT
100
SELECT
tag,
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag
LIMIT
10
SELECT
SUM(p1.ct)/COUNT(p1.tag)
FROM (
SELECT
tag,
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag)p1
WITH
stats AS (
SELECT
tag,
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag)
SELECT
SUM(ct)/COUNT(tag),
MIN(ct),
MAX(ct)
FROM
stats
WITH
stats AS (
SELECT
tag,
ARRAY_LENGTH(ARRAY_AGG(DISTINCT(title))) AS ct
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
tag)
SELECT
*
FROM
stats
WHERE
ct > (
SELECT
AVG(ct)
FROM
stats)
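This pattern (build a CTE of per-group counts once, then compare each group against an aggregate computed from the same CTE) is portable across engines. A minimal runnable sketch with Python's sqlite3 and made-up data:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE top_questions (tag TEXT, title TEXT);
INSERT INTO top_questions VALUES
  ('python', 'q1'), ('python', 'q2'), ('python', 'q3'),
  ('sql', 'q4'), ('sql', 'q5'),
  ('r', 'q6');
""")

# The WITH block builds per-tag counts once; the outer query then
# filters tags whose count exceeds the average computed from the
# very same CTE -- the "everything is a table" idea in action.
rows = con.execute("""
WITH stats AS (
  SELECT tag, COUNT(DISTINCT title) AS ct
  FROM top_questions
  GROUP BY tag
)
SELECT tag, ct
FROM stats
WHERE ct > (SELECT AVG(ct) FROM stats)
""").fetchall()

print(rows)  # only tags with an above-average question count survive
```

Here the counts are 3, 2, and 1, so the average is 2 and only the python tag passes the filter.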
THE CASE STATEMENT
How to use a CASE statement
WITH
base_table AS (
SELECT
DISTINCT(title),
CASE
WHEN title LIKE "%python%" THEN 'python'
WHEN title LIKE "%sql%" THEN 'sql'
WHEN title LIKE "%javascript%" THEN 'javascript'
WHEN title LIKE "%java%" THEN 'java'
END
AS
LANGUAGE
FROM
`jrjames83-1171.sampledata.top_questions`)
SELECT
COUNT(*),
LANGUAGE
FROM
base_table
GROUP BY
2
ORDER BY
2 DESC
CASE statement revised
WITH
base_table AS (
SELECT
DISTINCT(id),
CASE
WHEN (title LIKE "%python%" OR tag LIKE "%python%") THEN 'python'
WHEN (title LIKE "%sql%"
OR tag LIKE "%sql%") THEN 'sql'
WHEN (title LIKE "%javascript%" OR tag LIKE "%javascript%" ) THEN 'javascript'
WHEN (title LIKE "%java%"
OR tag LIKE "%java%") THEN 'java'
ELSE
NULL
END
AS
LANGUAGE
FROM
`jrjames83-1171.sampledata.top_questions`)
SELECT
COUNT(*),
LANGUAGE
FROM
base_table
GROUP BY
2
ORDER BY
1 DESC
Combining data elements and writing a case statement
WITH
base_table AS (
SELECT
id,
title,
ARRAY_TO_STRING(ARRAY_AGG(DISTINCT tag), " ") AS all_tags
FROM
`jrjames83-1171.sampledata.top_questions`
GROUP BY
id,
title)
SELECT
*
FROM
base_table
JOINS
INNER JOIN
WITH
tab1 AS (
SELECT
num1
FROM
UNNEST(GENERATE_ARRAY(10,20)) AS num1),
tab2 AS (
SELECT
num2
FROM
UNNEST(GENERATE_ARRAY(8,12)) AS num2)
SELECT
num1,
num2
FROM
tab1
JOIN
tab2
ON
tab1.num1=tab2.num2
LEFT/RIGHT JOIN
WITH
tab1 AS (
SELECT
num1
FROM
UNNEST(GENERATE_ARRAY(10,20)) AS num1),
tab2 AS (
SELECT
num2
FROM
UNNEST(GENERATE_ARRAY(8,12)) AS num2)
SELECT
num1,
num2
FROM
tab1
LEFT JOIN
tab2
ON
tab1.num1=tab2.num2
WITH
tab1 AS (
SELECT
num1
FROM
UNNEST(GENERATE_ARRAY(10,20)) AS num1),
tab2 AS (
SELECT
num2
FROM
UNNEST(GENERATE_ARRAY(8,12)) AS num2)
SELECT
num1,
num2
FROM
tab1
RIGHT JOIN
tab2
ON
tab1.num1=tab2.num2
FULL OUTER JOIN
WITH
tab1 AS (
SELECT
num1
FROM
UNNEST(GENERATE_ARRAY(10,20)) AS num1),
tab2 AS (
SELECT
num2
FROM
UNNEST(GENERATE_ARRAY(8,12)) AS num2)
SELECT
num1,
num2
FROM
tab1
FULL OUTER JOIN
tab2
ON
tab1.num1=tab2.num2
City with most orders
WITH
master AS (
SELECT
p1.order_id AS x,
p2.customer_city AS y
FROM
`jrjames83-1171.sampledata.orders` AS p1
JOIN
`jrjames83-1171.sampledata.customers` AS p2
ON
p1.customer_id = p2.customer_id)
SELECT
COUNT(x),
y
FROM
master
GROUP BY
y
ORDER BY
1 DESC
Max orders by customer
WITH
master AS (
SELECT
p1.order_id AS x,
p2.customer_unique_id AS y
FROM
`jrjames83-1171.sampledata.orders` AS p1
JOIN
`jrjames83-1171.sampledata.customers` AS p2
ON
p1.customer_id = p2.customer_id)
SELECT
COUNT(x),
y
FROM
master
GROUP BY
y
ORDER BY
1 DESC
Use of ROW_NUMBER and PARTITION
SELECT
p1.order_id AS x1,
p2.customer_unique_id AS x2,
p1.order_purchase_timestamp AS y,
ROW_NUMBER() OVER (PARTITION BY p2.customer_unique_id ORDER BY p1.order_purchase_timestamp)
FROM
`jrjames83-1171.sampledata.orders` AS p1
JOIN
`jrjames83-1171.sampledata.customers` AS p2
ON
p1.customer_id = p2.customer_id
ORDER BY
1,
2
Customers acquired every month
WITH
base_table AS(
SELECT
p2.customer_unique_id AS x1,
DATE_TRUNC(p1.order_purchase_timestamp,month) AS y,
ROW_NUMBER() OVER (PARTITION BY DATE_TRUNC(p1.order_purchase_timestamp,month)
ORDER BY
p1.order_purchase_timestamp) AS rn
FROM
`jrjames83-1171.sampledata.orders` AS p1
JOIN
`jrjames83-1171.sampledata.customers` AS p2
ON
p1.customer_id = p2.customer_id)
SELECT
MAX(rn),
y
FROM
base_table
GROUP BY
y
ORDER BY
2
Everything is a table and you can use them in clever ways
WITH
customer_order_ranks AS (
SELECT
c.customer_unique_id,
o.order_purchase_timestamp,
ROW_NUMBER() OVER(PARTITION BY c.customer_unique_id) AS customer_order_number
FROM
`jrjames83-1171.sampledata.customers` AS c
JOIN
`jrjames83-1171.sampledata.orders` AS o
ON
c.customer_id = o.customer_id
ORDER BY
1,
2),
exclusions AS (
SELECT
customer_unique_id,
MAX(customer_order_number)
FROM
customer_order_ranks
GROUP BY
1
HAVING
MAX(customer_order_number) =1 )
SELECT
customer_unique_id,
customer_order_number
FROM
customer_order_ranks
WHERE
customer_unique_id NOT IN (
SELECT
customer_unique_id
FROM
exclusions)
Average days between orders
WITH
customer_order_ranks AS (
SELECT
c.customer_unique_id,
o.order_purchase_timestamp,
o.order_id,
ROW_NUMBER() OVER(PARTITION BY c.customer_unique_id ORDER BY o.order_purchase_timestamp) AS customer_order_number
FROM
`jrjames83-1171.sampledata.customers` AS c
JOIN
`jrjames83-1171.sampledata.orders` AS o
ON
c.customer_id = o.customer_id
ORDER BY
1,
2),
exclusions AS (
SELECT
customer_unique_id,
MAX(customer_order_number)
FROM
customer_order_ranks
GROUP BY
1
HAVING
MAX(customer_order_number) =1 ),
inclusions AS(
SELECT
customer_unique_id,
customer_order_number,
order_purchase_timestamp
FROM
customer_order_ranks
WHERE
customer_unique_id NOT IN (
SELECT
customer_unique_id
FROM
exclusions)),
reldata AS(
SELECT
customer_unique_id,
customer_order_number,
order_purchase_timestamp,
LAG(order_purchase_timestamp) OVER(PARTITION BY customer_unique_id ORDER BY customer_order_number ) AS prev_purchase
FROM
inclusions
ORDER BY
customer_unique_id)
SELECT
AVG(DATE_DIFF(order_purchase_timestamp,prev_purchase, day))
FROM
reldata
WHERE
prev_purchase IS NOT NULL
Using CASE
WITH
took_offer AS (
SELECT
1 AS customer_id,
34.99 AS spend
UNION ALL
SELECT
2,
21.99
UNION ALL
SELECT
3,
179.0
UNION ALL
SELECT
4,
0.99
UNION ALL
SELECT
5,
1299.9 ),
loyalty AS (
SELECT
3 AS customer_id_1,
1 AS status
UNION ALL
SELECT
4,
1
UNION ALL
SELECT
8,
2
UNION ALL
SELECT
12,
1
UNION ALL
SELECT
10,
2 ),
results AS (
SELECT
*
FROM
took_offer AS t
FULL OUTER JOIN
loyalty AS j
ON
t.customer_id = j.customer_id_1 )
SELECT
customer_id,
spend,
customer_id_1,
status,
CASE
WHEN customer_id IS NOT NULL AND customer_id_1 IS NOT NULL THEN "both"
WHEN customer_id IS NULL
AND customer_id_1 IS NOT NULL THEN "loyalty only"
WHEN customer_id IS NOT NULL AND customer_id_1 IS NULL THEN "offer only"
ELSE
"na"
END
FROM
results
Revenue trends per hour - case study
SELECT
EXTRACT(hour
FROM
o.order_purchase_timestamp ) AS hr,
ROUND(SUM(p.payment_value),2) AS sales,
SUM(SUM(p.payment_value)) OVER () AS total_sales
FROM
`jrjames83-1171.sampledata.orders` o
LEFT JOIN
`jrjames83-1171.sampledata.order_payments` p
ON
o.order_id = p.order_id
GROUP BY
1
ORDER BY
1
Revenue trends per time slots
WITH
base_table AS(
SELECT
EXTRACT(hour
FROM
o.order_purchase_timestamp ) AS hr,
ROUND(SUM(p.payment_value),2) AS sales,
SUM(SUM(p.payment_value)) OVER () AS total_sales
FROM
`jrjames83-1171.sampledata.orders` o
LEFT JOIN
`jrjames83-1171.sampledata.order_payments` p
ON
o.order_id = p.order_id
GROUP BY
1
ORDER BY
1 )
SELECT
SUM(sales),
(CASE
WHEN hr BETWEEN 0 AND 5 THEN "morning"
WHEN hr BETWEEN 6
AND 11 THEN "afternoon"
WHEN hr BETWEEN 12 AND 16 THEN "evening"
WHEN hr BETWEEN 17
AND 24 THEN "late evening"
ELSE
"na"
END
) AS timeslot
FROM
base_table
WHERE
base_table.hr IS NOT NULL
GROUP BY
timeslot
Use case statements to create conditional summaries
WITH
base_table AS(
SELECT
EXTRACT(hour
FROM
o.order_purchase_timestamp ) AS hr,
ROUND(SUM(p.payment_value),2) AS sales,
SUM(SUM(p.payment_value)) OVER () AS total_sales
FROM
`jrjames83-1171.sampledata.orders` o
LEFT JOIN
`jrjames83-1171.sampledata.order_payments` p
ON
o.order_id = p.order_id
GROUP BY
1
ORDER BY
1 )
SELECT
SUM(CASE
WHEN hr BETWEEN 0 AND 5 THEN sales
ELSE
0
END
),
SUM(CASE
WHEN hr BETWEEN 7 AND 12 THEN sales
ELSE
0
END
)
FROM
base_table
- GENERATE_ARRAY with UNNEST and OFFSET is useful for creating a small table to experiment with
- A correlated subquery sample
WITH
base_table AS (
SELECT
index,
dollars
FROM
UNNEST(GENERATE_ARRAY(1,5)) AS dollars
WITH
OFFSET
AS index)
SELECT
bt1.*,
(
SELECT
SUM(bt2.dollars)
FROM
base_table bt2
WHERE
bt2.index <= bt1.index ) AS cumsum
FROM
base_table bt1
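The same correlated-subquery cumulative sum can be reproduced outside Big Query. A minimal runnable sketch with Python's sqlite3 (the base_table is built by hand since SQLite has no GENERATE_ARRAY, and the column is renamed idx because index is a keyword there):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE base_table (idx INTEGER, dollars INTEGER);
INSERT INTO base_table VALUES (0, 1), (1, 2), (2, 3), (3, 4), (4, 5);
""")

# For each outer row, the inner query re-scans the table and sums every
# row whose idx is <= the outer row's idx: a running total.
rows = con.execute("""
SELECT
  bt1.idx,
  bt1.dollars,
  (SELECT SUM(bt2.dollars)
   FROM base_table bt2
   WHERE bt2.idx <= bt1.idx) AS cumsum
FROM base_table bt1
ORDER BY bt1.idx
""").fetchall()

print(rows)  # cumsum column: 1, 3, 6, 10, 15
```

Note the quadratic cost of this approach: every row triggers a fresh scan, which is why the window-function version is preferable on large tables.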
Moving Averages
SELECT
*,
AVG(dollars) OVER (ORDER BY index ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS three_day_ma
FROM
base_table
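The window-frame form of the moving average runs essentially unchanged in other engines. A runnable sqlite3 sketch with the same hand-built base_table (column renamed idx, as SQLite reserves index):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE base_table (idx INTEGER, dollars INTEGER);
INSERT INTO base_table VALUES (0, 1), (1, 2), (2, 3), (3, 4), (4, 5);
""")

# ROWS BETWEEN 2 PRECEDING AND CURRENT ROW averages the current row
# and up to the two rows before it: a 3-point moving average.
rows = con.execute("""
SELECT
  idx,
  dollars,
  AVG(dollars) OVER (
    ORDER BY idx
    ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
  ) AS three_point_ma
FROM base_table
""").fetchall()

print(rows)  # moving average: 1.0, 1.5, 2.0, 3.0, 4.0
```

The first two rows average over a shorter frame because fewer than two preceding rows exist, which is worth remembering when interpreting the head of any moving-average series.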
Checking for a specific condition
WITH
base_table AS (
SELECT
index,
dollars
FROM
UNNEST(GENERATE_ARRAY(1,5)) AS dollars
WITH
OFFSET
AS index),
results AS (
SELECT
dollars,
SUM(dollars) OVER (ORDER BY index ROWS BETWEEN 1 PRECEDING AND 1 PRECEDING) AS prev
FROM
base_table )
SELECT
COUNT(*)
FROM
results
WHERE
dollars - prev <0
AND prev IS NOT NULL
Another solution
WITH
base_table AS (
SELECT
index,
dollars
FROM
UNNEST(GENERATE_ARRAY(1,5)) AS dollars
WITH
OFFSET
AS index)
SELECT
DISTINCT check
FROM (
SELECT
(dollars > LAG(dollars) OVER (ORDER BY index) ) AS check
FROM
base_table)
WHERE
check IS NOT NULL
WITH
base_table AS (
SELECT
index,
dollars
FROM
UNNEST(GENERATE_ARRAY(1,5)) AS dollars
WITH
OFFSET
AS index)
SELECT
bt1.dollars,
bt1.index,
(
SELECT
MAX(bt2.dollars)
FROM
base_table bt2
WHERE
bt1.index > bt2.index)
FROM
base_table bt1
DVD Rental case study - Revenue from the first order
WITH
first_amount AS (
SELECT
customer_id,
amount
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_date) AS index
FROM
`jrjames83-1171.sampledata.payments`)
WHERE
index = 1 ),
summary_stat AS (
SELECT
customer_id,
SUM(amount) AS total_revenue,
MIN(payment_date)
FROM
`jrjames83-1171.sampledata.payments`
GROUP BY
customer_id)
SELECT
*
FROM
summary_stat a
JOIN
first_amount b
ON
b.customer_id = a.customer_id
DVD Rental case study - Revenue from the first order - Act 2
WITH
first_amount AS (
SELECT
customer_id,
amount AS first_amt
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_date) AS index
FROM
`jrjames83-1171.sampledata.payments`)
WHERE
index = 1 ),
summary_stat AS (
SELECT
customer_id,
SUM(amount) AS total_revenue,
MIN(payment_date) AS m_pay_date,
FROM
`jrjames83-1171.sampledata.payments`
GROUP BY
customer_id)
SELECT
a.customer_id,
total_revenue,
first_amt
FROM
summary_stat a
JOIN
first_amount b
ON
b.customer_id = a.customer_id
ORDER BY
1
LTVP
WITH
first_amount AS (
SELECT
customer_id,
amount AS first_amt
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_date) AS index
FROM
`jrjames83-1171.sampledata.payments`)
WHERE
index = 1 ),
summary_stat AS (
SELECT
customer_id,
SUM(amount) AS total_revenue,
MIN(payment_date) AS m_pay_date
FROM
`jrjames83-1171.sampledata.payments`
GROUP BY
customer_id),
summary_so_far AS(
SELECT
a.customer_id,
m_pay_date,
total_revenue,
first_amt,
first_amt/total_revenue*100 AS first_as_pct_tot_rev
FROM
summary_stat a
JOIN
first_amount b
ON
b.customer_id = a.customer_id
ORDER BY
1)
SELECT
sf.*,
(
SELECT
sum (p2.amount)
FROM
`jrjames83-1171.sampledata.payments` p2
WHERE
p2.customer_id =sf.customer_id
AND DATE(p2.payment_date) BETWEEN DATE(sf.m_pay_date)
AND DATE(sf.m_pay_date)+90 ) AS customer_tv_first_90days
FROM
summary_so_far sf
LTVP using standard functions
WITH
first_amount AS (
SELECT
customer_id,
amount AS first_amt
FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY customer_id ORDER BY payment_date) AS index
FROM
`jrjames83-1171.sampledata.payments`)
WHERE
index = 1 ),
summary_stat AS (
SELECT
customer_id,
SUM(amount) AS total_revenue,
MIN(payment_date) AS m_pay_date
FROM
`jrjames83-1171.sampledata.payments`
GROUP BY
customer_id),
summary_so_far AS(
SELECT
a.customer_id,
m_pay_date,
total_revenue,
first_amt,
first_amt/total_revenue*100 AS first_as_pct_tot_rev
FROM
summary_stat a
JOIN
first_amount b
ON
b.customer_id = a.customer_id
ORDER BY
1)
SELECT
sf.*,
(
SELECT
sum (p2.amount)
FROM
`jrjames83-1171.sampledata.payments` p2
WHERE
p2.customer_id =sf.customer_id
AND DATE(p2.payment_date) BETWEEN DATE(sf.m_pay_date)
AND DATE_ADD(DATE(sf.m_pay_date), INTERVAL 30 day) ) AS customer_tv_first_30days
FROM
summary_so_far sf
Moving Average Case Study
Attempt 1
WITH
bt AS (
SELECT
Date,
close,
AVG(close) OVER(ORDER BY Date ROWS BETWEEN 50 PRECEDING AND CURRENT ROW ) AS ma_50,
AVG(close) OVER(ORDER BY Date ROWS BETWEEN 200 PRECEDING AND CURRENT ROW ) AS ma_200
FROM
`jrjames83-1171.sampledata.stock_prices`
ORDER BY
Date)
SELECT
*,
(CASE
WHEN ma_50 > ma_200 THEN FALSE
ELSE
TRUE
END
) AS status
FROM
bt
Attempt 2
WITH
bt AS (
SELECT
*,
ROW_NUMBER() OVER(ORDER BY Date) AS index
FROM
`jrjames83-1171.sampledata.stock_prices`
ORDER BY
Date),
bt1 AS (
SELECT
Date,
close,
AVG(close) OVER(ORDER BY index RANGE BETWEEN 49 PRECEDING AND CURRENT ROW ) AS ma_50,
AVG(close) OVER(ORDER BY index RANGE BETWEEN 199 PRECEDING AND CURRENT ROW ) AS ma_200
FROM
bt
ORDER BY
Date)
SELECT
*,
(CASE
WHEN ma_50 > ma_200 THEN FALSE
ELSE
TRUE
END
) AS status
FROM
bt1
Attempt 3
WITH
bt AS (
SELECT
*,
ROW_NUMBER() OVER(ORDER BY Date) AS index
FROM
`jrjames83-1171.sampledata.stock_prices`
ORDER BY
Date),
bt1 AS (
SELECT
Date,
close,
index,
AVG(close) OVER(ORDER BY index RANGE BETWEEN 49 PRECEDING AND CURRENT ROW ) AS ma_50,
AVG(close) OVER(ORDER BY index RANGE BETWEEN 199 PRECEDING AND CURRENT ROW ) AS ma_200
FROM
bt
ORDER BY
Date),
signal AS (
SELECT
*,
IF
(ma_50 > ma_200,
"buy",
"sell") AS status
FROM
bt1),
signal_stat AS (
SELECT
*,
LAG(status) OVER(ORDER BY index) AS lagged
FROM
signal
ORDER BY
index)
SELECT
*,
IF
(lagged!=status,
"y",
"n") AS crossover
FROM
signal_stat
More Complex JOINS and Partitioning
WITH
bt AS (
SELECT
customer_id,
rental_id,
SUM(amount) AS rev
FROM
`jrjames83-1171.sampledata.payments`
GROUP BY
customer_id,
rental_id
ORDER BY
customer_id,
rental_id,
rev DESC),
bt2 AS (
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY rev DESC ) AS top
FROM
bt
ORDER BY
customer_id,
top),
bt3 AS (
SELECT
bt2.*,
rent.inventory_id
FROM
bt2
JOIN
`jrjames83-1171.sampledata.rental` AS rent
ON
bt2.customer_id = rent.customer_id
WHERE
bt2.top IN (1,
2)),
bt4 AS (
SELECT
bt3.*,
inv.*
FROM
bt3
LEFT JOIN
`jrjames83-1171.sampledata.inventory`AS inv
ON
bt3.inventory_id = inv.inventory_id),
bt5 AS (
SELECT
bt4.*,
film.*
FROM
bt4
LEFT JOIN
`jrjames83-1171.sampledata.film`AS film
ON
bt4.film_id = film.film_id)
SELECT
customer_id,
rev,
rating
FROM
bt5
ORDER BY
customer_id
Stackoverflow Text analysis
Most common word for each tag
Approach 1
WITH
tab1 AS (
SELECT
tag,
SPLIT(title," ") AS tokens
FROM
`jrjames83-1171.sampledata.top_questions`),
tab2 AS (
SELECT
tag,
ftoken
FROM
tab1
CROSS JOIN
UNNEST(tab1.tokens) AS ftoken),
tab3 AS (
SELECT
tag,
ftoken,
COUNT(*) AS ct
FROM
tab2
GROUP BY
tag,
ftoken),
tab4 AS (
SELECT
tab3.*,
ROW_NUMBER() OVER(PARTITION BY tag ORDER BY ct DESC) AS rank
FROM
tab3)
SELECT
*
FROM
tab4
WHERE
rank = 1
LIMIT
10
Approach 2
WITH
tab1 AS (
SELECT
tag,
SPLIT(title," ") AS tokens
FROM
`jrjames83-1171.sampledata.top_questions`),
tab2 AS (
SELECT
DISTINCT tag,
word
FROM
tab1,
UNNEST(tokens) word ),
tab3 AS (
SELECT
tag,
word,
COUNT(*) AS ct
FROM
tab2
GROUP BY
tag,
word),
tab4 AS (
SELECT
tab3.*,
ROW_NUMBER() OVER(PARTITION BY tag ORDER BY ct DESC) AS rank
FROM
tab3)
SELECT
*
FROM
tab4
WHERE
rank = 1
LIMIT
10
Regex Operations
Advent of Code 2020 - Day 2 (Part A)
Attempt 1
WITH
bt AS (
SELECT
REGEXP_EXTRACT(occurrence_range, "(\\d*)-[\\d]*") AS sidx,
REGEXP_EXTRACT(occurrence_range, "[\\d]*-(\\d*)") AS eidx,
*
FROM
`jrjames83-1171.sampledata.aoc2017day2`),
bt1 AS (
SELECT
sidx,
eidx,
character,
datafield,
COUNT(occs) AS occs,
FROM
bt,
UNNEST(REGEXP_EXTRACT_ALL(bt.datafield, character)) occs
GROUP BY
1,
2,
3,
4)
SELECT
sidx,
eidx,
character,
datafield,
(CASE
WHEN (CAST(sidx AS int64) <= occs) AND (occs <= CAST(eidx AS int64)) THEN TRUE
ELSE
FALSE
END
) AS status
FROM
bt1
Attempt 2
WITH
bt AS (
SELECT
REGEXP_EXTRACT(occurrence_range, r"(\d+)-[\d]+") AS sidx,
REGEXP_EXTRACT(occurrence_range, r"[\d]+-(\d+)") AS eidx,
*
FROM
`jrjames83-1171.sampledata.aoc2017day2`),
bt1 AS (
SELECT
sidx,
eidx,
character,
datafield,
COUNT(occs) AS occs,
FROM
bt,
UNNEST(REGEXP_EXTRACT_ALL(bt.datafield, character)) occs
GROUP BY
1,
2,
3,
4)
SELECT
sidx,
eidx,
character,
datafield,
(CASE
WHEN (CAST(sidx AS int64) <= occs) AND (occs <= CAST(eidx AS int64)) THEN TRUE
ELSE
FALSE
END
) AS status
FROM
bt1
Attempt 3
WITH
bt AS (
SELECT
*,
REGEXP_EXTRACT(occurrence_range, r"(\d+)-[\d]+") AS sidx,
REGEXP_EXTRACT(occurrence_range, r"[\d]+-(\d+)") AS eidx,
ARRAY_LENGTH(REGEXP_EXTRACT_ALL(datafield, character)) AS occs
FROM
`jrjames83-1171.sampledata.aoc2017day2`),
bt1 AS (
SELECT
sidx,
eidx,
character,
datafield,
occs BETWEEN (CAST(sidx AS int64))
AND (CAST(eidx AS int64)) AS status
FROM
bt)
SELECT
*
FROM
bt1
WHERE
status = TRUE
Advent of Code 2020 - Day 2 (Part B)
WITH
bt AS (
SELECT
*,
REGEXP_EXTRACT(occurrence_range, r"(\d+)-[\d]+") AS sidx,
REGEXP_EXTRACT(occurrence_range, r"[\d]+-(\d+)") AS eidx,
ARRAY_LENGTH(REGEXP_EXTRACT_ALL(datafield, character)) AS occs
FROM
`jrjames83-1171.sampledata.aoc2017day2`),
bt1 AS (
SELECT
sidx,
eidx,
character,
datafield,
SUBSTR(datafield,CAST(sidx AS int64),1) AS leftw,
SUBSTR(datafield,CAST(eidx AS int64),1) AS rightw
FROM
bt),
bt2 AS (
SELECT
bt1.*,
(CASE
WHEN character = leftw AND character <> rightw THEN TRUE
WHEN character = rightw
AND character <> leftw THEN TRUE
ELSE
FALSE
END
) AS status
FROM
bt1)
SELECT
COUNT(*)
FROM
bt2
WHERE
status = TRUE
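The Part B rule (the character must appear at exactly one of the two 1-indexed positions) is an exclusive-or, which is also easy to check in a few lines of Python; the sample lines below are made up in the puzzle's format, not taken from the dataset above:

```python
import re

# Hypothetical lines in the puzzle's "range char: password" shape.
lines = [
    "1-3 a: abcde",     # 'a' at position 1 only -> valid
    "1-3 b: cdefg",     # 'b' at neither position -> invalid
    "2-9 c: ccccccccc", # 'c' at both positions -> invalid
]

def is_valid(line: str) -> bool:
    m = re.match(r"(\d+)-(\d+) (\w): (\w+)", line)
    i, j, ch, pw = int(m.group(1)), int(m.group(2)), m.group(3), m.group(4)
    # XOR: exactly one of the two 1-indexed positions holds the character,
    # matching the CASE statement in the SQL solution above.
    return (pw[i - 1] == ch) != (pw[j - 1] == ch)

print(sum(is_valid(l) for l in lines))  # number of valid lines
```

The two `WHEN ... THEN TRUE` branches of the SQL CASE collapse into a single `!=` comparison here, which is the usual way to express exclusive-or on booleans.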
Takeaway
On a periodic basis, it is worth revisiting a concept, subject, or idea that you think you are extremely familiar with. You will be surprised by how much you do not know.