Amazon Athena
Interactive query service to analyze data in Amazon S3 using standard SQL

1. Introduction to Amazon Athena

Amazon Athena
Serverless Distributed (SQL) Query Engine for Amazon S3
[Diagram labels: Distributed SQL Query Engine, Amazon S3]

Why Amazon Athena?
• Decouple storage from compute
• Serverless – No infrastructure or resources to manage
• Pay only for data scanned
• Schema on read – Same data, many views
• Secure – IAM/SAMLv2 for authentication; encryption at rest & in transit
• Standards-compliant, open storage file formats
• Built on powerful, community-supported OSS solutions

Simple Pricing
• DDL operations – FREE
• SQL operations – FREE
• Query concurrency – FREE
• Data scanned – $5 / TB
• Standard S3 rates for storage, requests, and data transfers apply

Runs standard SQL
• Uses Presto with ANSI SQL support
• Works with standard data formats
  • CSV
  • Apache Weblogs
  • JSON
  • Parquet
  • ORC
• Handles complex queries
  • Large Joins
  • Window functions
  • Arrays

Fast Performance for Large Data Sets
• Fast, ad-hoc queries
• Executes queries in parallel
• No provisioning extra resources for complex queries
• Scales automatically
• Amazon Athena Federated Query

How to handle Large Data Sets?
[Diagram labels: Storage, Query Engine, Scale-up, Scale-out]

Decouple storage from compute
[Diagram labels: Storage, Storage Interface Layer, Compute]

Distributed Data Processing System
• Scale-out Compute: Distributed Processing Framework, e.g. Hadoop MapReduce, Apache Spark, Flink, Apache Hive, Presto, Amazon Athena, ...
• Storage Interface Layer
• Scale-out Storage: Distributed File System (DFS), e.g. HDFS, Amazon S3, ...

2. Athena Design Patterns

1: Ad-hoc use-case
2: SaaS use-case
3: ETL and query use-case
4: Data science exploration and feature engineering
[Diagram labels across the four patterns: S3, Athena, AWS Glue Data Catalog, Query data, Hot data, Warm & cold data, Application request, AWS service logs, Application logs, Data sourced from external vendors, Update table partition, CTAS and INSERT INTO to ETL, Raw Data, Transformed data, SageMaker]

Comparison of SQL Processing engines

                Amazon EMR/Glue             Amazon EMR               Amazon Athena                  Amazon Redshift
Data Structure  Semi                        Semi                     Semi                           Full
Languages       API/SQL                     SQL                      SQL                            SQL
Data Store      S3 (Glue), S3/HDFS (Spark)  S3/HDFS                  S3                             Local
Use case        Transformation              SQL Queries for S3/HDFS  Serverless SQL Queries for S3  Fully Featured SQL Database
Performance

3. Athena in Action

Create External Tables
• Use Apache Hive DDL to create table
  • Run DDL statements using the Athena console
  • Via a JDBC driver, using SQL workbench
  • Using the Athena create table wizard
• Create “external” table in DDL
  • Creates a view of the data
  • Deleting table doesn’t delete the data in S3
• Schema-on-read
  • Projects your schema onto data at query execution time
  • No need for data loading or transformation

Navigate to “Saved Queries” to get DDL

Create External Table using DDL
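As a rough illustration of the DDL approach described above, the following Python sketch assembles a Hive-style CREATE EXTERNAL TABLE statement as a string. The helper name build_external_table_ddl, the table sampledb.web_logs, its columns, and the bucket s3://example-bucket/web-logs/ are all hypothetical and do not come from the slides. The sketch also assumes the data is stored as comma-delimited text files.

```python
from typing import Sequence, Tuple


def build_external_table_ddl(
    table: str,
    columns: Sequence[Tuple[str, str]],
    location: str,
    delimiter: str = ",",
) -> str:
    """Return a Hive-style CREATE EXTERNAL TABLE statement for delimited text.

    Hypothetical helper for illustration only: it builds the statement text
    and does not connect to Athena or validate identifiers.
    """
    if not columns:
        raise ValueError("at least one column is required")
    if not location.startswith("s3://") or not location.endswith("/"):
        raise ValueError("location must be an S3 folder URI ending in '/'")

    # One "`name` type" entry per column, comma-separated.
    column_lines = ",\n".join(
        f"  `{name}` {hive_type}" for name, hive_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_lines}\n"
        ")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{location}'"
    )


# Assumed example values: table, columns, and bucket are placeholders.
ddl = build_external_table_ddl(
    table="sampledb.web_logs",
    columns=[
        ("request_time", "string"),
        ("status", "int"),
        ("bytes_sent", "bigint"),
    ],
    location="s3://example-bucket/web-logs/",
)
print(ddl)
```

The printed statement could then be run in the Athena console or through a JDBC client. Because the table is external, the statement only registers a schema over the files under LOCATION. Dropping the table later leaves the data in S3 untouched.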