linked

A CLI for Amazon Athena ⇝

  Wynn Netherland • 2017-03-23

I've been enjoying Amazon Athena to analyze our application event data.

In case you haven't played with it yet, Athena allows you to query data in S3 using SQL. My only complaint so far is having to use the web interface to manage schemas and run queries. Since Amazon offers a JDBC driver for Athena, I decided to build my first JRuby app - a command line interface for Athena catalogs.

You can run queries:

❯ cat queries/count-by-port.sql
SELECT COUNT(*) AS count, elb_name
FROM sampledb.elb_logs
GROUP BY elb_name
ORDER BY count DESC
LIMIT 10;

❯ athena query queries/count-by-port.sql
COUNT  | ELB_NAME
-------|-------------
151901 | elb_demo_006
151886 | elb_demo_009
151753 | elb_demo_001
151284 | elb_demo_002
151062 | elb_demo_004
150503 | elb_demo_008
149934 | elb_demo_005
149122 | elb_demo_007
148761 | elb_demo_003

manage schemas,

❯ athena table show sampledb.elb_logs
CREATE EXTERNAL TABLE `sampledb.elb_logs`(
  `request_timestamp` string COMMENT '',
  `elb_name` string COMMENT '',
  `request_ip` string COMMENT '',
  `request_port` int COMMENT '',
  `backend_ip` string COMMENT '',
  `backend_port` int COMMENT '',
  `request_processing_time` double COMMENT '',
  `backend_processing_time` double COMMENT '',
  `client_response_time` double COMMENT '',
  `elb_response_code` string COMMENT '',
  `backend_response_code` string COMMENT '',
  `received_bytes` bigint COMMENT '',
  `sent_bytes` bigint COMMENT '',
  `request_verb` string COMMENT '',
  `url` string COMMENT '',
  `protocol` string COMMENT '',
  `user_agent` string COMMENT '',
  `ssl_cipher` string COMMENT '',
  `ssl_protocol` string COMMENT '')
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  'input.regex'='([^ ]*) ([^ ]*) ([^ ]*):([0-9]*) ([^ ]*):([0-9]*) ([.0-9]*) ([.0-9]*) ([.0-9]*) (-|[0-9]*) (-|[0-9]*) ([-0-9]*) ([-0-9]*) \\\"([^ ]*) ([^ ]*) (- |[^ ]*)\\\" (\"[^\"]*\") ([A-Z0-9-]+) ([A-Za-z0-9.-]*)$')
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://athena-examples-us-east-1/elb/plaintext'
TBLPROPERTIES (
  'transient_lastDdlTime'='1480278335')

list and rebuild partitions, and more.

Setup instructions and full usage are in the GitHub repository.

Linked from: https://github.com/pengwynn/athena-cli

Wynn Netherland
Wynn Netherland

Engineering Director at Adobe Creative Cloud, team builder, DFW GraphQL meetup organizer, platform nerd, author, and Jesus follower.