Real-time Analytics on

Agenda
• Why Real-time Data Streaming and Analytics?
• How to Build?
• Where to Store Streaming Data?
• How to Ingest Streaming Data?
• How to Process Streaming Data?
• Delivering Streaming Data
• Dive into Stream Processing Frameworks
• Transform, Aggregate, Join Streaming Data
• Key Takeaways

Data
“The world’s most valuable resource is no longer oil, but data.” *
* Copyright: David Parkins, The Economist, 2017

Why Real-time Data Streaming and Analytics?

Data Loses Value Over Time *
[Figure: value of data to decision-making plotted against time (real time, seconds, minutes, hours, days, months). The value bands run from preventive/predictive through actionable and reactive to historical; time-critical decisions sit at the real-time end, traditional “batch” business intelligence at the far end.]
* Source: Mike Gualtieri, Forrester, Perishable insights
To create value, derive insights in real time.

Batch vs. Real-time
Batch | Difference | Real-time
Arbitrary or periodic | Continuity | Constant
Store → Process (Hadoop MapReduce, Hive, Pig, Spark) | Method of analysis | Process → Store (Spark Streaming, Flink, Apache Storm)
Small - Huge (KB~TB) | Data size per unit | Small (B~KB)
Low - High (minutes to hours) | Query latency | Low (milliseconds to minutes)
Low - High (hourly/daily/monthly) | Request rate | Very High - High (in seconds, minutes)
High - Very high | Durability | Low - High
¢~$ (Amazon S3, Glacier) | Cost/GB | $$~$ (Redis, Memcached)

From Batch to Real-time: Lambda Architecture
[Figure: Lambda architecture on a streaming pipeline. Components shown: Data Source, Streaming Data, Stream Ingestion, Stream Storage, Raw Data Storage, Batch Layer (Batch Process, Batch View), Speed Layer (Stream Process, Stream Delivery, Real-time View), Service Layer (Query & Merge Results), Consumer.]

Lambda Architecture
[Figure: streaming data feeds both the Batch Layer (Raw Data, Batch Process, Batch View) and the Speed Layer (Stream Process, Real-time View); the Serving Layer answers queries against the Batch View and the Real-time View.]

Key Components of Real-time Analytics
Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
• Data Source: Devices and/or applications that produce real-time data at high velocity
• Stream Ingestion: Data from tens of thousands of data
sources can be written to a single stream.
• Stream Storage: Data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time.
• Stream Process: Records are read in the order they are produced, enabling real-time analytics or streaming ETL.
• Data Sink: Data lake (most common), database (least common).

Where to Store Streaming Data?

Stream Storage
• Amazon Kinesis Data Streams
• Amazon Managed Streaming for Kafka
[Figure: producers write records with a partition key (PK); a hash function assigns each record to shard/partition-1, shard/partition-2, or shard/partition-3. Each shard/partition holds its records in order from oldest to newest data, and each consumer in a consumer group tracks its next consumer offset.]

Why Stream Storage?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Parallel consumption
• Streaming MapReduce

What about SQS?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client ordering (standard)
• FIFO queue preserves client ordering
• No streaming MapReduce
• No parallel consumption
• Amazon SNS can publish to multiple SNS subscribers (queues or Lambda functions)
[Figure: producers send messages 1 to 4 through an Amazon SQS queue to consumers. A standard queue may deliver them out of order; a FIFO queue delivers them in the order sent.]
[Figure: a publisher sends to an Amazon SNS topic, whose subscribers are an AWS Lambda function and an Amazon SQS queue.]

Amazon Kinesis Data Streams vs. Amazon Managed Streaming for Kafka

Amazon Managed Streaming for Kafka
• Operational Considerations
• Number of clusters?
• Number of brokers per cluster?
• Number of topics per broker?
• Number of partitions per topic?
• The number of partitions can only be increased, not decreased
• Integrates with a few AWS services, such as Kinesis Data Analytics for Apache Flink

Amazon Kinesis Data Streams
• Operational Considerations
• Number of Kinesis Data Streams?
• Number of shards per stream?
• The number of shards can be increased or decreased
• Full integration with AWS services such as Lambda and Kinesis Data Analytics

Metrics to Monitor: MSK (Kafka)
• RequestQueue: length, wait time
• ResponseQueue: length, wait time
• Network: packet drops?
• Produce/consume rate imbalance
• Who is the leader?
• Disk full?
• Too many topics?

Metric | Level | Description
ActiveControllerCount | DEFAULT | Only one controller per cluster should be active at any given time.
OfflinePartitionsCount | DEFAULT | Total number of partitions that are offline in the cluster.
GlobalPartitionCount | DEFAULT | Total number of partitions across all brokers in the cluster.
GlobalTopicCount | DEFAULT | Total number of topics across all brokers in the cluster.
KafkaAppLogsDiskUsed | DEFAULT | The percentage of disk space used for application logs.
KafkaDataLogsDiskUsed | DEFAULT | The percentage of disk space used for data logs.
RootDiskUsed | DEFAULT | The percentage of the root disk used by the broker.
PartitionCount | PER_BROKER | The number of partitions for the broker.
LeaderCount | PER_BROKER | The number of leader replicas.
UnderMinIsrPartitionCount | PER_BROKER | The number of under minIsr partitions for the broker.
UnderReplicatedPartitions | PER_BROKER | The number of under-replicated partitions for the broker.
FetchConsumerTotalTimeMsMean | PER_BROKER | The mean total time in milliseconds that consumers spend on fetching data from the broker.
ProduceTotalTimeMsMean | PER_BROKER | The mean produce time in milliseconds.

How about monitoring Kinesis Data Streams?
[Figure: a consumer application calls GetRecords() to read data from a shard.]
How long does a record stay in a shard?
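The question above can be made concrete with a small calculation: the age of the most recent record a consumer has read, compared against how long the stream keeps data. The following is only a sketch in plain Python, not the Kinesis API. `Record`, `iterator_age_ms`, `at_risk_of_data_loss`, and `warn_ratio` are hypothetical names introduced for illustration, and the 24-hour retention period is an assumption; in practice the figure would come from the stream's monitoring metrics rather than be computed by hand.

```python
from dataclasses import dataclass
from typing import List

DAY_MS = 24 * 60 * 60 * 1000  # assumed retention period: 24 hours


@dataclass(frozen=True)
class Record:
    """Hypothetical stand-in for a record read from a shard."""
    sequence_number: int
    arrival_ms: int  # epoch milliseconds when the record was written


def iterator_age_ms(batch: List[Record], now_ms: int) -> int:
    """Return how old the last record of a batch is, in milliseconds.

    An empty batch is treated as "caught up" and reports an age of 0.
    """
    if not batch:
        return 0
    return max(0, now_ms - batch[-1].arrival_ms)


def at_risk_of_data_loss(age_ms: int, retention_ms: int = DAY_MS,
                         warn_ratio: float = 0.5) -> bool:
    """Flag a consumer whose lag has used up warn_ratio of the retention period."""
    return age_ms >= retention_ms * warn_ratio


if __name__ == "__main__":
    now = 1_700_000_000_000
    batch = [Record(1, now - 90_000), Record(2, now - 60_000)]
    age = iterator_age_ms(batch, now)
    print(age, at_risk_of_data_loss(age))
```

An age that keeps growing suggests the consumer reads more slowly than producers write. Once the age reaches the retention period, unread records expire before they are processed.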
Metrics to Monitor: Kinesis Data Streams
Metric | Description
GetRecords.IteratorAgeMilliseconds | Age of the last record in all GetRecords calls
ReadProvisionedThroughputExceeded | Number of GetRecords calls throttled
WriteProvisionedThroughputExceeded | Number of PutRecord(s) calls throttled
PutRecord.Success, PutRecords.Success | Number of successful PutRecord(s) operations
GetRecords.Success | Number of successful GetRecords operations

How to Ingest Streaming Data?