Choose Right Stream Storage

Choose Right Stream Storage: Kinesis Data Streams vs MSK Key Components of Real- time Analytics From Batch to Real - time: Lambda Architecture Data Source Stream Storage Speed Layer Batch Layer Batch Process Batch View Real - time View Consumer Query & Merge Results Service Layer Stream Ingestion Raw Data Storage Streaming Data Stream Delivery Stream Process Lambda Architecture Streaming Data Batch View Stream Process Real-time View Query Query Batch View Real-time View Raw Data Batch Process Batch Layer Serving Layer Speed Layer Key Components of Real-time Analytics Data Source Stream Storage Stream Process Stream Ingestion Data Sink Devices and/or applications that produce real-time data at high velocity Data from tens of thousands of data sources can be written to a single stream Data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time Records are read in the order they are produced, enabling real-time analytics or streaming ETL Data lake (most common) Database (least common) Stream Storage Data Source Stream Storage Stream Process Stream Ingestion Data Sink Amazon Kinesis Data Streams Amazon Managed Streaming for Kafka Anatomy of Amazon Kinesis Data Streams and MSK Key Features of Kinesis Data Streams and MSK • Distributed Queue • Stream Storage #Queue #Distributed #Storage Consumer oldest data newest data 5 4 3 2 1 0 3 2 1 0 2 #Queue: FIFO, Scale-Up vs Scale-Out 5 4 4 3 2 1 0 5 Producers Hash Function Consumer PK PK PK PK oldest data newest data Producers shard/partition-1 shard/partition-2 3 2 1 0 5 4 3 2 1 0 4 3 2 1 0 2 shard/partition-3 #Distributed: Scale-Out Consumer 0 Consumer 4 0 Consumer Group 4 3 2 1 0 Hash Function Consumer Consumer Consumer Consumer Group PK PK PK PK = next consumer offset oldest data newest data Producers shard/partition-1 shard/partition-2 5 4 3 2 1 0 3 2 1 0 4 3 2 1 0 4 2 0 shard/partition-3 #Storage: Stream Buffer 2 1 0 4 3 2 1 0 0 Hash Function Consumer Consumer Consumer Consumer Group PK PK PK PK = next consumer offset oldest data newest data Amazon Kinesis Data Streams Amazon Managed Streaming for Kafka Producers shard/partition-1 shard/partition-2 5 4 3 2 1 0 3 2 1 0 4 3 2 1 0 4 2 0 shard/partition-3 Anatomy of Benefits of Stream Storage • Decouple producers & consumers • Persistent buffer • Collect multiple streams • Preserve client ordering • Parallel consumption • Streaming MapReduce Comparing Amazon Kinesis Data Streams to MSK Topic Amazon Kinesis Data Streams Amazon Managed Streaming for Kafka Comparing Kinesis Data Streams to MSK Amazon Kinesis Data Streams Amazon Managed Streaming for Kafka • Operational Perspective • Number of clusters? • Number of brokers per cluster? • Number of topics per broker? • Number of partitions per topic? • Cluster provisioning model • Only increase number of partitions; can’t decrease • Integration with a few of AWS Services such as Kinesis Data Analytics for Apache Flink • Operational Perspective • Number of Kinesis Data Streams? • Number of shards per stream? • Throughput provisioning model • Increase/Decrease number of shards • Fully Integration with AWS Services such as Lambda function, Kinesis Data Analytics, etc Monitoring Metrics RequestQueue - Length - WaitTime ResponseQueue - Length - WaitTime Network - Packet Drop? Produce/Consume Rate Unbalance Who is Leader? Disk Full? Too many topics? Metrics to Monitor: MSK (Kafka) Metrics to Monitor: MSK (Kafka) Metric Level Description ActiveControllerCount DEFAULT Only one controller per cluster should be active at any given time. OfflinePartitionsCount DEFAULT Total number of partitions that are offline in the cluster. GlobalPartitionCount DEFAULT Total number of partitions across all brokers in the cluster. GlobalTopicCount DEFAULT Total number of topics across all brokers in the cluster. KafkaAppLogsDiskUsed DEFAULT The percentage of disk space used for application logs. KafkaDataLogsDiskUsed DEFAULT The percentage of disk space used for data logs. RootDiskUsed DEFAULT The percentage of the root disk used by the broker. PartitionCount PER_BROKER The number of partitions for the broker. LeaderCount PER_BROKER The number of leader replicas. UnderMinIsrPartitionCount PER_BROKER The number of under minIsr partitions for the broker. UnderReplicatedPartitions PER_BROKER The number of under-replicated partitions for the broker. FetchConsumerTotalTimeMsMean PER_BROKER The mean total time in milliseconds that consumers spend on fetching data from the broker. ProduceTotalTimeMsMean PER_BROKER The mean produce time in milliseconds. How about monitoring Kinesis Data Streams? How long time does a record stay in a shard? 5 transactions per second, per shard With only one consumer application, records can be retrieved every 200 ms up to 1MB or 1,000 records per seconds, per shard for writes • 10MB per second, per shard • up to 10,000 records per call Consumer Application GetRecords() Data