Functional Data Engineering

Functional Data Engineering - A Blueprint for Adopting Functional Principles in Data Pipelines

Key Principles of Functional Data Engineering
1. Reproducibility
2. Re-Computability

Key Challenges
1. Late Arriving Data
2. Data Deletion

Data Deletion
1. Reprocessing
2. Deletion Audit Log

Choose your Confidence Window of Correctness

The Modern Data Cloud = LakeHouse & Warehouse
State of the Data 2023
- Separation of storage and compute
- Unlimited scale data repository
- ACID transaction and mutation support

Schema Classification
- Warehouse
- LakeHouse

DateTime Partition Table Design

    CREATE TABLE dw.user (
        user_id BIGINT,
        user_name STRING,
        created_at DATE
    )
    PARTITIONED BY (ds STRING);  -- ds = date timestamp of the snapshot

    s3://dw/user/2022-12-20/<all users data at the time of snapshot>
    s3://dw/user/2022-12-21/<all users data at the time of snapshot>

Entity Modeling
1. Incremental Snapshot
2. Full Snapshot

Entity Modeling

    CREATE OR REPLACE VIEW dw.user_latest AS
    SELECT user_id, user_name, created_at, ds
    FROM dw.user
    WHERE ds = <current DateTime partition>;

Event Modeling
[Diagram: Hour T1, Hour T2, and Hour T3 data processed by the Hour T1, Hour T2, and Hour T3 pipelines, shown for a Tumbling Window and a Sliding Window]

Apply Window Functions
[Diagram: Hour T1 data, the window time, and the point at which the Hour T1 pipeline starts]

Apply Watermark

Adopt Reconciliation
[Diagram: Hour T1, Hour T2, and Hour T3 pipelines, and a Reconciliation pipeline]
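
The event-modeling ideas above (hourly tumbling windows, a watermark for late-arriving data, and a reconciliation pipeline) can be sketched in a few lines of Python. This is only an illustrative sketch, not the method from the slides: the function names, the event fields `event_time` and `arrived_at`, and the 15-minute watermark are all assumptions made for the example. Each hourly pipeline is a pure function of its input, so re-running it yields the same result, and the reconciliation pass recomputes every window from all data available at that point.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=1)        # hourly tumbling window, as in the slides
WATERMARK = timedelta(minutes=15)  # hypothetical allowed lateness (assumption)


def window_start(event_time):
    """Truncate an event time to the start of its hourly tumbling window."""
    return event_time.replace(minute=0, second=0, microsecond=0)


def run_hourly_pipeline(events, hour):
    """Count the events of one window that arrived before the watermark cutoff.

    Pure function: the same events and hour always give the same count.
    """
    cutoff = hour + WINDOW + WATERMARK
    return sum(
        1
        for e in events
        if window_start(e["event_time"]) == hour and e["arrived_at"] <= cutoff
    )


def run_reconciliation(events, hours):
    """Recompute every window from all events, including late arrivals."""
    return {
        h: sum(1 for e in events if window_start(e["event_time"]) == h)
        for h in hours
    }


# Made-up sample data: three events belong to hour T1, one to hour T2.
T1 = datetime(2022, 12, 20, 1)
T2 = datetime(2022, 12, 20, 2)
events = [
    {"event_time": datetime(2022, 12, 20, 1, 10), "arrived_at": datetime(2022, 12, 20, 1, 11)},
    {"event_time": datetime(2022, 12, 20, 1, 50), "arrived_at": datetime(2022, 12, 20, 2, 5)},
    {"event_time": datetime(2022, 12, 20, 1, 55), "arrived_at": datetime(2022, 12, 20, 3, 30)},
    {"event_time": datetime(2022, 12, 20, 2, 20), "arrived_at": datetime(2022, 12, 20, 2, 21)},
]

hourly = {h: run_hourly_pipeline(events, h) for h in (T1, T2)}
reconciled = run_reconciliation(events, (T1, T2))
```

In this sample, the event that arrives at 03:30 misses the T1 cutoff of 02:15, so the hourly run for T1 counts two events while the reconciliation run counts three. The gap between the two runs is one way to think about the "confidence window of correctness": a longer watermark delays results but leaves less for reconciliation to fix.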