Building a Modern Data Platform in the Cloud

Organizations that successfully generate business value from their data will outperform their peers. An Aberdeen survey found that organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth: 24% for the leaders versus 15% for the followers.

To Become a Leader, Data is Your Differentiator

Data variety and data volumes are increasing rapidly, and the same data is consumed by multiple consumers and applications. A modern platform must ingest, discover, catalog, understand, and curate data, and then find insights with purpose-built engines: the right tool for the job. AWS offers services across each stage of the pipeline:

• Collect: Amazon Kinesis Firehose, Amazon Kinesis Streams, AWS Direct Connect, Amazon Snowball, AWS Database Migration Service
• Store: Amazon S3, Amazon Glacier, Amazon RDS, Amazon Aurora, Amazon DynamoDB, Amazon Elasticsearch, AWS Glue
• Analyze: Amazon Kinesis Analytics, Amazon EMR, Amazon Redshift, Amazon Athena, Amazon CloudSearch, Amazon QuickSight, Amazon SageMaker

Traditionally, Analytics Looked Like This

Operational systems (OLTP, ERP, CRM, LOB applications) fed a data warehouse that served business intelligence:

• Relational data
• TBs–PBs scale
• Schema defined prior to data load
• Operational reporting and ad hoc queries
• Large initial CAPEX plus $10K–$50K/TB/year

A data lake extends this model. "A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale." Compared with a traditional warehouse, a data lake lets you:

• Collect and analyze structured, semi-structured, and unstructured data
• Define schema on read, rather than before ingestion as in data warehouses
• Decouple ingestion and storage from analysis, scaling to exabytes
• Store data once and analyze it with many tools
• Keep data in open formats

A data lake reference architecture on AWS is built in layers:

• Ingest: AWS Direct Connect, Amazon Kinesis Streams, Amazon Kinesis Firehose, AWS IoT, AWS Lambda
• Storage, catalog & search: Amazon S3, AWS Glue, Amazon DynamoDB, Amazon Elasticsearch
• Analytics & processing: Amazon Athena, Amazon QuickSight, Amazon Redshift Spectrum
• API/UI: Amazon API Gateway, Amazon Cognito
• Security & auditing: AWS IAM, AWS KMS, AWS CloudTrail, Amazon Macie

On the streaming side of ingestion, the Amazon Kinesis family covers four use cases:

• Amazon Kinesis Video Streams: capture, process, and store video streams for analytics
• Amazon Kinesis Data Streams: build custom applications that analyze data streams
• Amazon Kinesis Data Firehose: load data streams into AWS data stores
• Amazon Kinesis Data Analytics: analyze data streams with SQL

In a typical Kinesis Data Firehose delivery stream, record producers (such as the Kinesis Agent) put records into the stream, an AWS Lambda function transforms and enriches the raw records in flight, and the transformed records are delivered to destinations such as Amazon S3 (buffered files), Amazon Redshift (table loads), or Amazon Elasticsearch Service (domain loads). The raw source records can also be backed up to Amazon S3.

Storing data in open-source, Apache-standard columnar formats such as Parquet and ORC optimizes both the performance and the cost of analytical queries.

Storing is Not Enough, Data Needs to Be Discoverable

"Dark data are the information assets organizations collect, process, and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing)." — Gartner IT Glossary, 2018 (https://www.gartner.com/it-glossary/dark-data)

Dark data accumulates from CRM and ERP systems, data warehouses, mainframe data, web and social data, log files, and machine data, much of it semi-structured or unstructured. The cost shows up in how analysts spend their time: roughly 80% goes to collecting data sets, cleaning and organizing data, and building training sets, rather than mining data for patterns or refining algorithms.

AWS Glue addresses this with a Data Catalog and ETL job authoring:

• Data Catalog: discovers data automatically and extracts its schema
• ETL job authoring: auto-generates customizable ETL code in Python and Spark, and schedules and runs ETL jobs periodically
• Serverless model
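The sketches that follow walk this pipeline end to end in Python. First, ingestion: a minimal record producer for the Kinesis Data Firehose delivery-stream flow described above, assuming boto3 credentials are configured and that a delivery stream with the hypothetical name "clickstream-delivery" in us-east-1 already exists.

    import json
    import boto3

    firehose = boto3.client("firehose", region_name="us-east-1")

    def put_click_event(user_id, page):
        """Send one JSON record to the Kinesis Data Firehose delivery stream."""
        record = {"user_id": user_id, "page": page}
        firehose.put_record(
            DeliveryStreamName="clickstream-delivery",  # hypothetical stream name
            Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
        )

    put_click_event("u-123", "/products")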
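Next, the "transformations & enrichment" Lambda step in that flow. Firehose invokes the function with base64-encoded records and expects each record back with its recordId, a result status, and re-encoded data; the "processed_at" field added here is purely illustrative.

    import base64
    import json
    from datetime import datetime, timezone

    def lambda_handler(event, context):
        output = []
        for record in event["records"]:
            # Decode the raw record produced upstream.
            payload = json.loads(base64.b64decode(record["data"]))
            # Enrich it with a processing timestamp (illustrative).
            payload["processed_at"] = datetime.now(timezone.utc).isoformat()
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",  # or "Dropped" / "ProcessingFailed"
                "data": base64.b64encode(
                    (json.dumps(payload) + "\n").encode("utf-8")
                ).decode("utf-8"),
            })
        return {"records": output}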
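Once raw data has landed, it can be rewritten into one of the open columnar formats mentioned above. A small sketch of converting a raw CSV object into partitioned Parquet; the bucket, prefixes, and the "event_date" column are assumptions, and it requires pandas, pyarrow, and s3fs.

    import pandas as pd

    # Read a raw CSV object from the landing prefix (hypothetical path).
    df = pd.read_csv("s3://my-data-lake/raw/clicks/2024-01-01.csv")

    # Write it back as snappy-compressed Parquet, partitioned by event date,
    # so analytical engines scan less data per query.
    df.to_parquet(
        "s3://my-data-lake/curated/clicks/",
        engine="pyarrow",
        compression="snappy",
        partition_cols=["event_date"],
    )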
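To make the curated data discoverable, a Glue crawler can register it in the Data Catalog. A minimal sketch, assuming an IAM role with Glue and S3 permissions already exists; the crawler name, role ARN, database, path, and schedule are all hypothetical.

    import boto3

    glue = boto3.client("glue", region_name="us-east-1")

    glue.create_crawler(
        Name="clicks-crawler",                                  # hypothetical
        Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical
        DatabaseName="datalake_curated",
        Targets={"S3Targets": [{"Path": "s3://my-data-lake/curated/clicks/"}]},
        Schedule="cron(0 2 * * ? *)",  # re-crawl nightly to pick up new partitions
    )

    glue.start_crawler(Name="clicks-crawler")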
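Glue's generated ETL scripts take roughly the shape below. This is an illustrative sketch, not generated output: it only runs inside a Glue job (the awsglue library is provided by the Glue runtime), and the database, table, column names, and output path are assumptions.

    import sys
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read the cataloged table created by the crawler above.
    clicks = glueContext.create_dynamic_frame.from_catalog(
        database="datalake_curated", table_name="clicks"
    )

    # Keep and rename only the columns needed downstream.
    mapped = ApplyMapping.apply(
        frame=clicks,
        mappings=[
            ("user_id", "string", "user_id", "string"),
            ("page", "string", "page", "string"),
            ("event_date", "string", "event_date", "string"),
        ],
    )

    # Write the result back to S3 as Parquet.
    glueContext.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://my-data-lake/analytics/clicks/"},
        format="parquet",
    )

    job.commit()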
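Finally, once the data is cataloged, Athena can query it in place, as in the analytics & processing layer above. A sketch reusing the hypothetical database, table, and columns from the earlier examples; the results bucket is also an assumption.

    import boto3

    athena = boto3.client("athena", region_name="us-east-1")

    response = athena.start_query_execution(
        QueryString="""
            SELECT page, COUNT(*) AS views
            FROM clicks
            WHERE event_date = '2024-01-01'
            GROUP BY page
            ORDER BY views DESC
            LIMIT 10
        """,
        QueryExecutionContext={"Database": "datalake_curated"},
        ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
    )

    print("Query execution id:", response["QueryExecutionId"])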