Developing a data pipeline on cloud
For analytics / ML projects, specifically on Google Cloud Platform

Every data analytics or ML project needs a good data pipeline as a foundation. But the data is not always ready to use at the beginning, and we also need to think about automation and scalability as the project goes on.

Why do we need a data pipeline? 🛠
Purposes of a data pipeline:
● Ingest data from multiple sources
● Transform or clean the data to ensure data quality
● Automate the process

What are the options on cloud? ☁

Choosing Storage Options
● BigQuery (for analytics workloads)
● Cloud Bigtable
● Cloud Storage

Choosing Compute Options
● Google Compute Engine (GCE): IaaS (Infrastructure as a Service), virtual machines
● Google Kubernetes Engine (GKE): CaaS (Container as a Service), managed Kubernetes cluster
● Google Cloud Run: serverless containers
● Google App Engine (GAE): PaaS (Platform as a Service), serverless applications
● Google Cloud Functions (GCF): FaaS (Function as a Service), serverless function platform

Choosing Data Processing Options
Data processing:
● Cloud Dataproc: Spark or Hadoop
● Cloud Dataflow: unified pipeline with Apache Beam
● Cloud Functions or Cloud Run: serverless options
Workflow and scheduler:
● Cloud Scheduler
● Cloud Composer
● Cloud Workflows (optional)

Option 1: Low-cost & serverless option
Processing data (light workload): Cloud Functions or Cloud Run
Workflow and scheduler: Cloud Scheduler triggers the job via a REST API call (or via Pub/Sub for Cloud Functions)
✓ Serverless: easy & fast
✓ Low-cost solution
✓ Suitable for light workloads
(A minimal code sketch of this option appears at the end of this section.)

Option 2: Big data solution
Processing data (big data workload): Cloud Dataproc for Spark or Hadoop; Cloud Dataflow for a unified data pipeline with Apache Beam
Workflow and scheduler: Cloud Scheduler triggers the job via a REST API call
✓ Big data frameworks: Spark, Apache Beam, Flink
✓ Scalability and reliability
✓ Open-source solutions
(A minimal Apache Beam sketch of this option also appears at the end of this section.)

Option 3: Cloud Composer (Airflow)
Cloud Composer is a managed service built on Kubernetes Engine + Cloud SQL.
✓ Easier maintenance
✓ Scalability and reliability
✓ Suitable for a large number of jobs that require workers

Why Apache Airflow?
● Popular open-source project for ETL and data pipeline orchestration.
● All code is in Python: easy to learn and use.
● Can also be run locally for development environments.

Apache Airflow basic components
● Sensor: waits on an event, e.g. poking for a file
● Operator: runs an action, e.g. PythonOperator
● Hook: interface to an external service or system

Reference Architecture: Batch Ingestion
Files on an SFTP server are copied to Cloud Storage and loaded into BigQuery for the analytics workload (BI dashboards), orchestrated by Cloud Composer using SFTPSensor, SFTPToGCSOperator, and GCSToBigQueryOperator.

DAG overview
● SFTPSensor: check if a file is available
● SFTPToGCSOperator: upload that file to GCS
● GCSToBigQueryOperator: load the file from GCS into BigQuery

Simple data pipeline using Airflow (1): Import and initialize DAG
Import the necessary components, then initialize the DAG with its name and schedule, as in the sketch below.
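The original slide shows the DAG definition as a screenshot. Here is a minimal sketch of what that code could look like, assuming Airflow 2.x with the Google and SFTP provider packages installed; the connection IDs, file paths, bucket, and table names are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.sftp.sensors.sftp import SFTPSensor
from airflow.providers.google.cloud.transfers.sftp_to_gcs import SFTPToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

# Initialize the DAG with its name and schedule (placeholder values)
with DAG(
    dag_id="sftp_to_bigquery_batch_ingestion",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # 1. Check if a file is available on the SFTP server
    wait_for_file = SFTPSensor(
        task_id="wait_for_file",
        sftp_conn_id="sftp_default",
        path="/upload/sales_data.csv",
        poke_interval=60,
    )

    # 2. Upload that file to GCS
    upload_to_gcs = SFTPToGCSOperator(
        task_id="upload_to_gcs",
        sftp_conn_id="sftp_default",
        source_path="/upload/sales_data.csv",
        destination_bucket="my-ingestion-bucket",
        destination_path="landing/sales_data.csv",
    )

    # 3. Load the file from GCS into BigQuery
    load_to_bq = GCSToBigQueryOperator(
        task_id="load_to_bq",
        bucket="my-ingestion-bucket",
        source_objects=["landing/sales_data.csv"],
        destination_project_dataset_table="my-project.analytics.sales_data",
        source_format="CSV",
        skip_leading_rows=1,
        write_disposition="WRITE_APPEND",
    )

    wait_for_file >> upload_to_gcs >> load_to_bq
```

The three tasks mirror the DAG overview above: the sensor waits for the file, the first operator copies it to Cloud Storage, and the second loads it into BigQuery.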
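The DAG above uses a Sensor and two Operators. The third basic component, the Hook, is typically used inside a PythonOperator. A hypothetical sketch of a follow-up data-quality task, assuming the Google provider package and the same placeholder table:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook

def count_rows(**context):
    # Hook: interface to an external service (here, BigQuery)
    hook = BigQueryHook(gcp_conn_id="google_cloud_default", use_legacy_sql=False)
    rows = hook.get_records("SELECT COUNT(*) FROM `my-project.analytics.sales_data`")
    print(f"Row count after load: {rows[0][0]}")

with DAG(
    dag_id="data_quality_check",       # hypothetical follow-up DAG
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,            # triggered manually or by another DAG
) as dag:
    # Operator: runs an action; PythonOperator calls the function above
    check_row_count = PythonOperator(
        task_id="check_row_count",
        python_callable=count_rows,
    )
```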
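For comparison, here is a minimal sketch of Option 1 (low-cost & serverless): an HTTP-triggered Cloud Function that Cloud Scheduler could call on a schedule. It assumes the google-cloud-bigquery client library is available in the function's runtime; the bucket, object path, and table ID are placeholders:

```python
from google.cloud import bigquery

def ingest_csv(request):
    """HTTP-triggered Cloud Function: load a CSV file from GCS into BigQuery."""
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    )
    load_job = client.load_table_from_uri(
        "gs://my-ingestion-bucket/landing/sales_data.csv",  # placeholder source file
        "my-project.analytics.sales_data",                  # placeholder destination table
        job_config=job_config,
    )
    load_job.result()  # wait for the load job to finish
    return "ok"
```

A single Cloud Scheduler job calling the function's URL is enough to run this on a schedule, which is why this option suits light workloads.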
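And a minimal sketch of Option 2 using Apache Beam, runnable on Cloud Dataflow: it reads a CSV file from Cloud Storage and writes rows to BigQuery. The project, region, bucket, table, and schema are placeholders, and the same pipeline can be tested locally by switching to the DirectRunner:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_row(line):
    # Placeholder transform: split a CSV line into a dict matching the BigQuery schema
    order_id, amount = line.split(",")
    return {"order_id": order_id, "amount": float(amount)}

options = PipelineOptions(
    runner="DataflowRunner",          # use "DirectRunner" for local testing
    project="my-project",             # placeholder project ID
    region="us-central1",
    temp_location="gs://my-ingestion-bucket/tmp",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Read from GCS" >> beam.io.ReadFromText(
            "gs://my-ingestion-bucket/landing/sales_data.csv", skip_header_lines=1)
        | "Parse rows" >> beam.Map(parse_row)
        | "Write to BigQuery" >> beam.io.WriteToBigQuery(
            "my-project:analytics.sales_data",
            schema="order_id:STRING,amount:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )
```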