Kubernetes for Data Engineers

What is Kubernetes?
- Open source.
- Container orchestrator.
- Runs everywhere.
- Focus on applications, not machines.

Why Kubernetes?
- Workload portability.
- Legacy compatible.
- Modular.
- Declarative, not imperative.

Kubernetes for stateless applications

Deployment and ReplicaSet
- Self-healing.
- Autoscaling.
- Rollouts and rollbacks.
- De facto standard.

Applications that Data Engineers care about
- Stateful.
- Databases.
- Data processing frameworks.
- Machine learning frameworks.

Running stateful applications
- YARN: MapReduce, Hive, Spark, etc.
- Rest of workloads: bespoke deployments.
- Siloed clusters and underutilization.
- No standard and management pain.

Kubernetes can help
- All workloads.
- Standardized tooling.
- Borg for the rest of the world.

Running stateful applications on Kubernetes

StatefulSet
- Stable, unique network identifiers.
- Stable, persistent storage.
- Ordered, graceful deployment and scaling.
- Ordered, graceful termination.
- Ordered, automated rolling updates.
- Built-in, no need to reinvent.

Operators
- Extensions.
- Encode domain-specific operational knowledge.
- Control loops: observe, rectify, repeat.

Lots of Operators
- etcd.
- Prometheus.
- Kafka.
- Postgres.
- Elasticsearch.
- Redis.
- and so on...

Native integration
- Spark on Kubernetes.
- JupyterHub.
- (In progress) Airflow on Kubernetes.

ML workloads
- Kubeflow project.
- Operators for TensorFlow, PyTorch, Caffe2, MXNet...
- Lots of activity.

GPUs in Kubernetes
- Support for NVIDIA GPUs.
- Support for scheduling any device (GPUs, FPGAs, InfiniBand, etc.)

Recap
- Stateless -> Deployment and ReplicaSet
- Simple stateful -> StatefulSet
- Distributed databases -> Operators
- Spark/Airflow -> Native integration
- ML -> Kubeflow
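
Appendix: a control-loop sketch

The "observe, rectify, repeat" control loop listed under Operators, combined with the ordered scaling and termination listed under StatefulSet, can be illustrated with a small sketch. This is a toy model only, not how Kubernetes or any real Operator is implemented: the function reconcile, the replica name prefix "db", and the in-memory set of running ordinals are all hypothetical stand-ins, and no Kubernetes API is called.

```python
from typing import List, Set


def reconcile(name: str, desired_replicas: int, running: Set[int]) -> List[str]:
    """Run one observe/rectify pass and return the actions taken.

    `running` holds the ordinals of replicas that currently exist
    (hypothetical in-memory state, not a real cluster). It is updated
    in place so the next pass observes the corrected state.
    """
    actions = []

    # Rectify upward: create any missing replica, lowest ordinal first,
    # reusing the same stable name (e.g. db-1 comes back as db-1).
    for ordinal in range(desired_replicas):
        if ordinal not in running:
            running.add(ordinal)
            actions.append(f"create {name}-{ordinal}")

    # Rectify downward: remove surplus replicas, highest ordinal first.
    for ordinal in sorted(running, reverse=True):
        if ordinal >= desired_replicas:
            running.discard(ordinal)
            actions.append(f"delete {name}-{ordinal}")

    return actions


if __name__ == "__main__":
    state: Set[int] = set()
    print(reconcile("db", 3, state))  # initial rollout
    state.discard(1)                  # simulate losing one replica
    print(reconcile("db", 3, state))  # self-healing pass
    print(reconcile("db", 1, state))  # scale down
```

The caller only declares the desired replica count; the loop works out which actions close the gap between desired and observed state, which is the "declarative, not imperative" idea from the earlier slide. A real controller would repeat this pass continuously and would wait for each replica to become ready before moving on.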