Last month, I described how Gluent Cloud Sync can be used to enhance an organization’s analytic capabilities by copying data to cloud storage, such as Amazon S3, and enabling the use of a variety of cloud and serverless technologies to gain further insights. I used Cloud Sync to copy the data to the cloud, then built an Amazon Quicksight analysis against the Athena tables. That did the trick, allowing me to use advanced analytic functions in my query.
The challenge now is that Quicksight must execute the same analytic functions through Athena each time I run the report. I decided it would be better to run an ETL job and access the results directly from a single table. I’ll have my data ready in a few minutes, and the report will perform better because the analytic functions have already been applied during the ETL. The data is stored in Amazon S3, so my tool of choice is the newly released cloud data integration tool AWS Glue.
AWS Glue is a cloud-based extract, transform, and load (ETL) service, recently released for public use. It’s also serverless, meaning you don’t need to provision or manage any server resources in order to kick off an ETL job. AWS Glue consists of several notable components: the AWS Glue Catalog, Crawlers, and Jobs.
- AWS Glue Catalog
The Catalog is a central metadata repository which stores table definitions and metadata and makes them accessible to Glue, Athena, Redshift, and EMR. To SQL engines it looks like a Hive Metastore with enhanced functionality. The metadata can be populated through manual Hive DDL scripts (CREATE TABLE…) or automatically, using the AWS Glue Crawlers.
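For the manual route, the DDL is standard Hive syntax. Here is a minimal sketch; the table name, columns, and S3 location are invented for illustration and are not from this article:

```python
# Hive-style DDL that could populate the Glue Catalog manually.
# All names and the S3 path below are illustrative assumptions.
ddl = """
CREATE EXTERNAL TABLE customers (
  cust_id         BIGINT,
  cust_first_name STRING,
  cust_last_name  STRING
)
STORED AS PARQUET
LOCATION 's3://my-bucket/sh/customers/'
"""

# Inside a Spark session wired to the Glue Catalog (or in the Athena
# console) this statement would be executed, e.g.:
# spark.sql(ddl)
```

Once registered, the table is immediately visible to Athena, Redshift Spectrum, and Glue jobs alike, which is the point of the shared catalog.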
- AWS Glue Crawlers
One of the more interesting, and powerful, features of Glue is the Crawlers. These processes automatically build your data catalog by discovering table structures in various data sources. Crawlers can extract table definition metadata from files in Amazon S3 or from any JDBC-connected database. There’s even more: Crawlers can keep the catalog up to date with any schema changes – automatically. Schedule the Crawler to run and your table definitions will always reflect the most recent schema updates.
- AWS Glue Jobs
The AWS Glue Jobs are where the ETL action takes place. Job scripts are written in pySpark and must be compatible with Apache Spark 2.1 (as of this writing). When using the built-in templates for the Source, Target, and Transform steps of the ETL process, Glue generates a nice diagram to go along with the code. One thing to note: if you deviate too much from the standard template and fail to provide the appropriate comments, the diagram won’t generate. Glue has a variety of Transform templates that may be added to the ETL script. They are written against the AWS Glue libraries to help simplify some of the more complex transformations you might use in your data integration flow.
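As a rough sketch, a generated job script starts by wiring up the Glue and Spark contexts before the Source step reads a catalog table. The database and table names below are illustrative, and the awsglue imports only resolve inside the Glue job environment:

```python
def build_job(argv):
    """Sketch of the boilerplate a Glue-generated pySpark script begins with.
    This runs only inside AWS Glue, where the awsglue library is available;
    the database and table names are examples, not from the article."""
    from awsglue.utils import getResolvedOptions
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from pyspark.context import SparkContext

    # Glue passes the job name (and any custom parameters) on the command line
    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source step: read a Glue Catalog table as a DynamicFrame
    source = glue_context.create_dynamic_frame.from_catalog(
        database="sh", table_name="customers")
    return job, source
```

From here the generated script chains Transform steps and a Target step, then calls job.commit() at the end.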
AWS Glue looks pretty great, so let’s try it out with the query from my previous post.
First, the data that is stored in HDFS, our data lake, must be copied over to S3. I walked through this process using Gluent Cloud Sync in the previous post, so I’ll skip some of the details here. Gluent Cloud Sync allows any Hadoop table to be synchronized with another Hadoop environment or into cloud storage, such as Amazon S3 or Google Cloud Storage. With a simple command, Cloud Sync copies both the data and table structure to the target. In this case, we copy the 3 tables required for the Customer Sales by Demographics report to the cloud. The data is stored in Amazon S3, and the table definitions and other metadata are copied out to the AWS Glue Catalog.
Once the data is in S3, we can either manually create our tables in Athena or allow a Glue Crawler to create them for us. The Crawler is quite simple to set up and will keep our schema in sync should any changes occur, so let’s try it out. After giving the Crawler a name, choose the Data Store type (S3 or JDBC) and the S3 path we want to crawl for metadata. A few more clicks through the wizard to choose the IAM role, schedule, and target database for the table definitions, and we’re done.
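The same wizard steps can also be scripted against the Glue API. A sketch using boto3’s Glue client follows; the crawler name, IAM role, database, schedule, and bucket path are all illustrative assumptions:

```python
# Sketch of creating a Glue Crawler programmatically via boto3.
# Every name and path below is an invented example, not from the article.
crawler_config = {
    "Name": "cust-demo-crawler",
    "Role": "AWSGlueServiceRole-Default",   # IAM role chosen in the wizard
    "DatabaseName": "sh",                   # target database for table definitions
    "Targets": {"S3Targets": [{"Path": "s3://my-bucket/sh/"}]},  # S3 path to crawl
    "Schedule": "cron(0 2 * * ? *)",        # optional: re-crawl nightly to stay in sync
}

# With AWS credentials configured, this would be submitted as:
# import boto3
# boto3.client("glue").create_crawler(**crawler_config)
```

The Schedule entry is what keeps the catalog tracking schema changes automatically, matching the behavior described above.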
Now that we have the tables identified in the Glue Catalog, we can create our Job to perform the transformations and load data into a target S3 bucket. Glue does provide a wizard to get your Job started, but I’ll go ahead and cut to the more interesting bits – the code and mapping diagram.
Using the built-in templates, I can add the 3 source tables from the Glue Catalog: customers, supplementary_demographics, and sales, each as a DynamicFrame, Glue’s schema-flexible wrapper around a Spark DataFrame. From there, the built-in Transforms are used to aggregate, filter, and join the data as needed to produce the final result. The SelectFields Transform lets me choose which columns to keep in the frame. Join, as you might expect, joins two frames together using a join key.
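Conceptually, those two Transforms (invoked in Glue as Join.apply and SelectFields.apply) behave like this plain-Python sketch; the sample rows and column values are invented for illustration:

```python
# Pure-Python sketch of what Glue's Join and SelectFields transforms do.
# The rows and values below are made up for illustration.
customers = [{"cust_id": 1, "cust_last_name": "Smith", "cust_first_name": "Pat"}]
sales     = [{"cust_id": 1, "amount_sold": 100.0, "channel_id": 3}]

def join(left, right, keys):
    """Like Join.apply: pair up rows whose join-key values match."""
    return [{**l, **r} for l in left for r in right
            if all(l[k] == r[k] for k in keys)]

def select_fields(rows, fields):
    """Like SelectFields.apply: keep only the listed columns."""
    return [{k: r[k] for k in fields} for r in rows]

cust_sales = select_fields(join(customers, sales, ["cust_id"]),
                           ["cust_last_name", "cust_first_name", "amount_sold"])
```

In the real job these operations run distributed over DynamicFrames, but the row-level semantics are the same.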
cust_demo_sales_t = cust_demo_sales.toDF()
cust_demo_sales_t.createOrReplaceTempView("cust_demo_sales_tbl")
cust_demo_sales_sql = spark.sql("SELECT concat(cust_last_name, ', ', cust_first_name) cust_name, \
    education, occupation, amount_sold, \
    avg(amount_sold) OVER (partition by education) avg_by_education, \
    avg(amount_sold) OVER (partition by occupation) avg_by_occupation, \
    avg(amount_sold) OVER () avg_total FROM cust_demo_sales_tbl")
cust_demo_sales_final = DynamicFrame.fromDF(cust_demo_sales_sql, glueContext, "cust_demo_sales_final")
Finally, to perform the analytic functions necessary for the report, I created a temporary view to perform the more complex transformations using Spark SQL. When adding custom code like this, you must remember to create your own comments if you’d like the data flow diagram to generate. The final script used to load the customer demographics data to a new S3 bucket can be found here (https://github.com/Michaelrainey/aws-glue/blob/master/load-customer-demographics.py).
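To make the window functions concrete, here is what the three AVG() OVER (…) clauses compute, sketched in plain Python over a few sample rows (the names and amounts are invented for illustration):

```python
# Pure-Python sketch of the three AVG() OVER (...) windows in the Spark SQL
# above. Sample rows and amounts are made up for illustration.
rows = [
    {"cust_name": "Smith, Pat", "education": "Masters",   "occupation": "Prof.", "amount_sold": 100.0},
    {"cust_name": "Jones, Sam", "education": "Masters",   "occupation": "Sales", "amount_sold": 300.0},
    {"cust_name": "Lee, Alex",  "education": "Bachelors", "occupation": "Sales", "amount_sold": 200.0},
]

def avg_over(rows, key=None):
    """AVG(amount_sold) OVER (PARTITION BY key); key=None means the whole set."""
    groups = {}
    for r in rows:
        groups.setdefault(r[key] if key else None, []).append(r["amount_sold"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

avg_by_education = avg_over(rows, "education")    # e.g. {'Masters': 200.0, 'Bachelors': 200.0}
avg_by_occupation = avg_over(rows, "occupation")  # e.g. {'Prof.': 100.0, 'Sales': 250.0}
avg_total = avg_over(rows)[None]                  # overall average: 200.0
```

Unlike a GROUP BY, a window function attaches these averages to every detail row, which is exactly why the report can show each customer alongside the education, occupation, and overall averages.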
The AWS Glue ETL process runs in just a few minutes and the data returns in Athena much faster than it did with the previous query. Now I can update my Quicksight analysis to use the newly generated table.
The result: a much quicker time to return the data to the report and, the ultimate goal, a happier end user.
[bin]$ ./cloud_present -t glue-target.sales_cust_demo
One final thing to note: with Gluent Cloud Sync, the AWS Glue generated table built on top of data stored in S3 can be presented back to our Hadoop data lake, with no data transfer or ETL required. This means you can seamlessly access your cloud storage data from a Hadoop table, just as if the data lived locally in HDFS. Gluent can enable the hybrid cloud data environment, all without any data movement. Now, with the AWS Glue Catalog, any table that has been found by a Crawler and has metadata stored in the catalog can be presented to Hadoop and, thanks to Gluent, be queried by relational databases too. Glue and Gluent are similar in name and complement each other very well.