Data engineers, data scientists, and analysts are often limited by the technologies available in their organization when completing data integration and analytics tasks. With many tools, specific features or functionality may not be readily available or easily implemented. For example, an analyst may be interested in advanced analytic queries, typically seen in today’s top relational databases, but they may not have a SQL on Hadoop technology capable of this functionality.
If you recall my previous blog post, the challenge was to use advanced SQL functionality within a reporting tool against modern data sources like Hadoop. Tableau could connect directly to Impala, but you lost the advanced analytic functions that are built-in to Oracle’s query engine. In this case, I was able to use Gluent Present to virtualize the Impala table within Oracle and query it just as if it existed within the RDBMS schema. All of my analytic functions were available to use, with the bulk of the computation work pushed down to Hadoop. Now I have a different scenario. All of my tables are located in Impala, with underlying data stored in HDFS, and am back to a similar situation with a lack of advanced query functionality. I could present all of the tables to a relational database, as we did in the previous scenario. But that approach doesn’t fit my end user’s needs. They plan to use cloud-based tools against cloud storage, keeping the infrastructure flexible in order to tackle their ever-changing workloads.
Amazon Simple Storage Service (S3) is a cloud object store that can handle data in almost any format. If you stay within the AWS stack, it’s quite easy to access the data in S3 using a number of Amazon’s cloud native applications. One such tool, Amazon Athena, is a serverless interactive query service built for analyzing data in the cloud. As the product page states, “Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3”. Athena’s metastore is Hive based, with a Presto query engine providing advanced SQL functionality. On top of Athena is an analytics application named Amazon QuickSight. This is where our analysts can create visualizations (graphical reports, dashboards, or stories), which can then be shared with decision makers and others throughout the organization. But first, we must get the data into S3 and configure the system to continuously sync changes.
The recently released Gluent Cloud Sync is the perfect solution for “syncing” your on-premises data lake to the cloud. The backup capability will copy your Impala tables, data and structure, to cloud storage services, in this case Amazon S3. The goal is to enable our analysts and engineers to use the enterprise data lake, but without the need for managing variable workloads on the production Hadoop cluster. Let’s get started!
The Cloud Sync product is a simple installation and configuration on the Hadoop side. Once you’ve defined your source and targets in the config file, it’s an easy command to backup your Impala tables to S3. I have a specific dataset in mind, so I’m just going to sync the tables I need for this task.
I’ll sync the remaining two tables and we’re done! If I want to keep the data in-sync with our production data lake, I can schedule these commands in any external scheduler. Cloud Sync will ensure the latest data is copied over and doesn’t duplicate the existing records.
./cloud_sync --backup -t int12102_sh.customers -x ./cloud_sync --backup -t int12102_sh.supplementary_demographics -x
Now that I have the data and table metadata in S3, I can create the tables in Athena and begin to perform my analysis. Cloud Sync creates the DDL for the original table, which points to a HDFS location, and another which points to the S3 data.
If you try to select from the table straight away, you’ll find that it returns no records. The DDL has pointed to the correct file location for the external table, but we still need to rebuild the partitions.
Using the slick MSCK REPAIR TABLE command in Hive (the underlying metastore for Athena), we can reload partition metadata all at once.
MSCK REPAIR TABLE int12102_sh.sales;
We have data! Perform the same process for the other 2 tables and we can begin to write our query.
As a developer, whether building a report or ETL, I always like to create the SQL in a query editor prior to building the final object. In this case, we can build the query in Athena and then create the final QuickSight report once we have the data returning the expected results. I’m going to keep it simple in this example, but you can imagine how this becomes an important step in the process the more complex and analytical these queries become.
The advanced SQL I’m going to use is a window function to get the average amount sold to each customer based on occupation and education. This type of analytical function will fail to execute when run in some of the SQL on Hadoop query engines, but not in Athena.
Now that I’m satisfied with the results, I can move on to QuickSight to build my visualizations in order to gain further insight from the data. I’ll create a new analysis and a new dataset. I have the option to select tables from the Athena metastore or direct from S3. Since I’ve already defined a structure around the S3 data in Athena, I’ll choose it as my source. Once I add the dataset, I select Custom SQL, add the query I wrote in Athena, and begin creating my analysis.
With QuickSight, we can now visually analyze our sales and customer demographics data, potentially leading to new discoveries that might shape our marketing strategy. For example, it looks as though customers in high school (10th/11th grades) are below average when it comes to purchasing our products. If we have a major targeted marketing campaign for that age group, it may not be working. This discovery could lead to a shift in type of campaign or even direct the company to spend its marketing dollars elsewhere. As more sales are made, Gluent Cloud Sync will keep the data in S3 up to date, ensuring the analytics are always run against the latest information.
For data professionals, open access to enterprise data is key to successfully doing their job. Not just ability to query the data, but the ability to use the tools and technologies which are best suited for the task at hand. Cloud features such as Amazon Athena and Quicksight, serverless applications for data analytics, are cost effective and offer the flexibility necessary for modern data analysis. With Gluent Cloud Sync, data can be liberated from the confines of Hadoop and made available in the cloud for use by these technologies. Once the data is backed up to the cloud storage, the worries about overloading the production server with massive machine learning or analytical workloads are long forgotten.