The System Architecture of Real-time Reporting in Laravel

Outline
  01 Aggregation Report in MySQL
  02 Aggregation Improvements in MySQL
  03 Row vs Column Oriented Databases
  04 Integration with ClickHouse
  05 Deduplication in ClickHouse
  06 Optimizations in ClickHouse

Reports Are Everywhere
  - Daily revenue for product orders
  - Analysis of top-selling products
  - Rankings of the most-clicked articles or videos
  - Total cost of business operations
  - Statements for the finance department
  - Summaries of registered users
  - Statistics of unique visitors
  - etc.

Basic Elements of a Report
  - Data source: database, files, services
  - Condition filters: resource id, time range, numbers, text, data type
  - Aggregations: sum, average, count, max, min
  - Transformers: concat, mask, number format
  - Groups
  - Orders

Report Performance in MySQL
  How long does this query take when the table holds 1,000, 10k, 1m, 10m, or 100m rows?

    SELECT product_id, COUNT(id) AS count
    FROM orders  -- table name assumed; the slide omits the FROM clause
    WHERE status != 'canceled'
    GROUP BY product_id
    ORDER BY count DESC
    LIMIT 10

    id  user_id  product_id  amount  discount  shipping_fee  total  status    created_at
    1   35       2           1000    50        0             950    shipped   2022-11-11 09:24:13
    2   78       5           860     20        60            900    paid      2022-11-11 10:12:04
    3   93       11          780     0         60            840    canceled  2022-11-12 01:39:17

Aggregation Improvements in MySQL

Add Appropriate Indexes
  - Indexes only help with filtering rows, not with aggregation
  - Too many indexes are not a good idea

Partitions and Sharding
  - Suitable when the combination of conditions does not change
  - Complicated, with limited usage (joins, transactions, etc.)

Vertical Scaling of the Database
  - A single server has a hard limit
  - Cost is always a concern

Pre-aggregated Results
  - MySQL doesn't support materialized views
  - Aggregations for different combinations of conditions are difficult to maintain
  - Aggregation results must be refreshed whenever the source data changes

    id  amount  discount  shipping_fee  total  status    period
    1   2       50        0             950    shipped   2022-11-11 09:00
    2   5       20        60            900    paid      2022-11-11 10:00
    3   11      0         60            840    canceled  2022-11-11 11:00

Filters in Our Imagination
  Status (select), Order Date (date), Search

Filters for Users' Needs
  Name (text), Gender (select), City (select), Status (select), Shipping Date (date), Order Date (date), Accounting Date (date), Agent (select), Updated Date (date), Search

There's not much you can do with MySQL if...
  - The number of rows to aggregate is large (over 10 million)
  - Filtering conditions are flexible
  - Data rows are mutable
  - Near real-time results are required

Row vs Column Oriented Databases

Row Oriented Databases
  - Data associated with the same record is kept next to each other in memory
  - Optimized for reading and writing a single row of data
  - Common in OLTP databases such as MySQL and Postgres

    id  name    country  age
    1   Taylor  USA      35
    2   Evan    China    35

Column Oriented Databases
  - Data associated with the same column is kept next to each other in memory
  - Optimized for reads that support data analysis
  - Common in data warehouses such as Cassandra, HBase, BigQuery, and ClickHouse
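The difference between the two layouts can be sketched in a few lines of Python. This is only a toy model under simplifying assumptions, not the real storage format of MySQL or ClickHouse; the names `rows`, `columns`, `average_age_row_store`, and `average_age_column_store` are made up for the illustration, and "fields touched" is a rough stand-in for the amount of data read.

```python
# Toy illustration of row-oriented vs column-oriented layouts.
# The sample records mirror the id/name/country/age example; this is a
# simplified model, not the actual storage format of any database.

# Row-oriented: each record's fields are stored together.
rows = [
    (1, "Taylor", "USA", 35),
    (2, "Evan", "China", 35),
]

# Column-oriented: each column's values are stored together.
columns = {
    "id": [1, 2],
    "name": ["Taylor", "Evan"],
    "country": ["USA", "China"],
    "age": [35, 35],
}


def average_age_row_store(records):
    """Walk every full record, even though only the age field is needed."""
    total = 0
    fields_touched = 0
    for record in records:
        fields_touched += len(record)  # the whole row is read
        total += record[3]             # age is the fourth field
    return total / len(records), fields_touched


def average_age_column_store(cols):
    """Read only the age column and ignore the others."""
    ages = cols["age"]
    return sum(ages) / len(ages), len(ages)


row_avg, row_fields = average_age_row_store(rows)
col_avg, col_fields = average_age_column_store(columns)
print(row_avg, row_fields)  # average and number of fields read, row layout
print(col_avg, col_fields)  # average and number of fields read, column layout
```

Both functions return the same average, but the row layout reads every field of every record while the column layout reads only the values of the one column being aggregated. That gap grows with the number of columns in the table.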
    id       1       2
    name     Taylor  Evan
    country  USA     China
    age      35      35

Pros of Columnar Databases
  - Queries that use only a few columns out of many
  - Aggregating queries against large volumes of data
  - Column-wise data compression

Cons of Columnar Databases
  - Poor at data mutations such as updates or deletes
  - Not optimized for querying row-specific data
  - Lack of unique constraints

Aggregation in Row Oriented Databases
  (https://clickhouse.com/docs/en/faq/general/columnar-database)

Aggregation in Column Oriented Databases
  (https://clickhouse.com/docs/en/faq/general/columnar-database)

Benchmark
  (https://benchmark.clickhouse.com)

Introduction of ClickHouse
  - An open-source (Apache 2.0), relational OLAP database, available since 2016
  - Originally developed as an internal project at Yandex (a Russian search engine)
  - The name ClickHouse is a combination of "Clickstream" and "Data wareHouse"
  - Known for extremely high query speed and performance
  - Uses a SQL dialect largely compatible with ANSI SQL, which makes it easy to work with
  - Very active development on GitHub
  - Used in production by many companies, such as Uber, Slack, Tesla, Tencent, TikTok, and Cloudflare

Why is ClickHouse so fast?
  - Column-oriented storage
  - Data compression with specialized codecs that make data even more compact
  - Vectorized query execution, which allows the use of SIMD CPU instructions
  - ClickHouse can leverage all available CPU cores and disks to execute even a single query

Downsides of ClickHouse
  - Limited updates and deletes
  - Mutations in ClickHouse are heavy operations and are not designed for frequent use
  - Limited transactional (ACID) support at the moment (an INSERT into one partition of one table in the MergeTree family)
  - ClickHouse Keeper is not yet as stable as the official announcement suggests

Table Engines
  - MergeTree: ReplacingMergeTree, SummingMergeTree, AggregatingMergeTree, CollapsingMergeTree, etc.
  - Log: Log, TinyLog, and StripeLog
  - Integration engines: MySQL, PostgreSQL, ODBC, JDBC, MongoDB, Kafka, etc.
  - Special engines: File, MaterializedView, URL, Buffer, Memory, etc.
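To make two of the MergeTree variants listed above more concrete, here is a rough Python simulation of what ReplacingMergeTree and SummingMergeTree do to rows that share a sorting key once their data parts are merged. It is only a mental model: real merges run in the background at times ClickHouse chooses, so unmerged duplicates can still be visible to queries. The function names, the `version` field, and the sample rows are invented for this sketch.

```python
# Toy model of the merge behaviour of two MergeTree-family engines.
# Function names and sample rows are hypothetical; ClickHouse itself
# performs these merges in the background on sorted data parts.


def replacing_merge(rows, key, version):
    """ReplacingMergeTree-like: keep one row per key, the highest version."""
    latest = {}
    for row in rows:
        k = row[key]
        # ">=" lets a later-inserted row win when versions tie.
        if k not in latest or row[version] >= latest[k][version]:
            latest[k] = row
    return sorted(latest.values(), key=lambda r: r[key])


def summing_merge(rows, key, sum_columns):
    """SummingMergeTree-like: collapse rows per key by summing columns."""
    totals = {}
    for row in rows:
        k = row[key]
        if k not in totals:
            totals[k] = {key: k, **{c: 0 for c in sum_columns}}
        for c in sum_columns:
            totals[k][c] += row[c]
    return sorted(totals.values(), key=lambda r: r[key])


# Two versions of order 1 and one version of order 2.
orders = [
    {"id": 1, "status": "paid", "total": 950, "version": 1},
    {"id": 1, "status": "shipped", "total": 950, "version": 2},
    {"id": 2, "status": "paid", "total": 900, "version": 1},
]
deduplicated = replacing_merge(orders, key="id", version="version")

# Per-product sales rows that should collapse into running totals.
sales = [
    {"product_id": 2, "count": 1, "total": 950},
    {"product_id": 2, "count": 1, "total": 900},
    {"product_id": 5, "count": 1, "total": 840},
]
summed = summing_merge(sales, key="product_id", sum_columns=["count", "total"])

print(deduplicated)
print(summed)
```

The replacing variant suits mutable source rows, where a newer version of a record supersedes the older one. The summing variant suits pre-aggregated counters keyed by the grouping columns.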