In data science & data engineering, it is often required to extracting exclusive data set from a large-scale table in a business date comparing with a history data, e.g. daily new product incremental, weekly new users, ..... Here two ways are discussed using an example of how to calculate daily new products.
Method 1: use LEFT JOIN operator, e.g.
-- daily_biz_tb is table to store daily product with schema like
-- CREATE TABLE IF NOT EXISTS daily_biz_tb (product_id STRING, title STRING) PARTITION BY (ds STRING)
DROP TABLE IF EXISTS daily_inc_tb;
CREATE TABLE daily_inc_tb LIFECYCLE 7 AS
SELECT t1.* FROM (
SELECT product_id, title
FROM daily_biz_tb
WHERE ds='${bizdate}'
)t1
LEFT JOIN (
SELECT product_id, title
FROM daily_biz_tb
WHERE ds='${yesterday}'
)t2
ON t1.product_id=t2_product_id
WHERE t2.product_id IS NULL
;
The logic is that if product_id in today but not in yesterday, after left join, the column t2.product_id will be NULL in today.
Method 2: left anti join. The LEFT ANTI JOIN will be much elegant. It will exclude all item based on key product_id.
-- daily_biz_tb is table to store daily product with schema like
-- CREATE TABLE IF NOT EXISTS daily_biz_tb (product_id STRING, title STRING) PARTITION BY (ds STRING)
DROP TABLE IF EXISTS daily_inc_tb;
CREATE TABLE daily_inc_tb LIFECYCLE 7 AS
SELECT t1.* FROM (
SELECT product_id, title
FROM daily_biz_tb
WHERE ds='${bizdate}'
)t1
LEFT ANTI JOIN (
SELECT product_id, title
FROM daily_biz_tb
WHERE ds='${yesterday}'
)t2
ON t1.product_id=t2_product_id
;
Which one is recommended? Of course, left anti join. It is not only elegant but also performance better when from a big table excluding data in a small table. e.g. 1T table excluding 1K data in small table. In practice, method 1 sometimes has performance issue while method 2 fast.