Gluent Offload Engine (GOE) has been a core component of our main software product, Gluent Data Platform, from the beginning. It is now offered as a standalone product, providing organizations with an easy way to offload enterprise data to modern distributed storage platforms. We have some great demo videos and high-level information about the offload process on the website. But now we’d like to dive a bit deeper into how Gluent Offload Engine actually works.
The entire offload process happens in five phases: Setup, Transport, Hybrid Objects, Optimizations, and Validation. Each phase has numerous steps within it, which are executed depending on various factors (partitioning strategy, offload input parameters, etc.). We won’t walk through every possible scenario here, but rather will highlight some of the core features that make GOE run. Note that in the offload process, chunks of data are first copied to a staging area on HDFS and stored in the Avro format. The data is then copied to the final database and stored in either the Parquet or ORC columnar format, depending on the Hadoop distribution in use. This two-step load process was implemented for various reasons, as you’ll find out. Now, let’s dive into the process!
As the name of the phase suggests, we’re setting up the objects and metadata required for the offload. Gluent must ensure data type compatibility between the RDBMS and Hadoop to maintain data integrity and avoid precision loss. First, an analysis of the source table structure is performed and a data sampling query is used to analyze the data types. Within the offload log file, you may find this type of output regarding the data type analysis step:
Detected precision=2, scale=0 for "PROMO_ID" after sampling column data
Offloading "PROMO_ID" as "BIGINT" due to sampled specification (2,0)
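The idea behind this sampling step can be sketched as below. Note that the function name and thresholds here are illustrative assumptions, not Gluent’s actual implementation; the point is simply that a sampled precision/scale maps to a Hadoop column type that cannot lose precision:

```python
# Illustrative sketch: choose a Hadoop column type from a sampled
# NUMBER precision/scale. The thresholds are assumptions for
# illustration, not Gluent's actual rules.
def hadoop_type_for_number(precision, scale):
    if scale == 0:
        # Whole numbers: BIGINT safely holds up to 18 decimal digits.
        if precision <= 18:
            return "BIGINT"
        return "DECIMAL(%d,0)" % precision
    # Fractional values keep exact precision via DECIMAL.
    return "DECIMAL(%d,%d)" % (precision, scale)

# PROMO_ID sampled as precision=2, scale=0 -> BIGINT, as in the log above.
print(hadoop_type_for_number(2, 0))   # BIGINT
print(hadoop_type_for_number(18, 4))  # DECIMAL(18,4)
```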
Once the data types have been determined, the Avro schema is created and copied to HDFS for use by the load table. The Hadoop load table and final table are then created: the load table uses the Avro schema from the previous step, and the final table uses the appropriate columnar compressed file format for the Hadoop distribution in use (Parquet or ORC). You’ll notice that Gluent creates the Avro staging schema with many data types set to STRING. When there is no data type equivalence across RDBMS, Java, and Avro (e.g. TIMESTAMP, DECIMAL), Gluent uses STRING to avoid loss of precision during data transport. These string fields are converted to the correct final data types in a later stage. This two-step approach ensures data integrity between Oracle and Hadoop.
Create load table
Mon Nov 13 18:48:37 2017
Hadoop sql: SET SYNC_DDL=true
Hadoop sql: CREATE TABLE `sh_load`.`promotions` (
  `PROMO_ID` STRING,
  `PROMO_NAME` STRING,
  `PROMO_SUBCATEGORY` STRING,
  `PROMO_SUBCATEGORY_ID` STRING,
  `PROMO_CATEGORY` STRING,
  `PROMO_CATEGORY_ID` STRING,
  `PROMO_COST` STRING,
  `PROMO_BEGIN_DATE` STRING,
  `PROMO_END_DATE` STRING,
  `PROMO_TOTAL` STRING,
  `PROMO_TOTAL_ID` STRING
) STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/gluent/offload/sh_load.db/sh.promotions.avsc')
Step time: 0:00:06 Done
Create Hadoop tables
Mon Nov 13 18:48:25 2017
Hadoop sql: SET SYNC_DDL=true
Hadoop sql: CREATE TABLE IF NOT EXISTS `sh`.`promotions` (
  `PROMO_ID` BIGINT,
  `PROMO_NAME` STRING,
  `PROMO_SUBCATEGORY` STRING,
  `PROMO_SUBCATEGORY_ID` BIGINT,
  `PROMO_CATEGORY` STRING,
  `PROMO_CATEGORY_ID` BIGINT,
  `PROMO_COST` DECIMAL(18,4),
  `PROMO_BEGIN_DATE` TIMESTAMP,
  `PROMO_END_DATE` TIMESTAMP,
  `PROMO_TOTAL` STRING,
  `PROMO_TOTAL_ID` BIGINT
) PARTITIONED BY (`offload_bucket_id` SMALLINT)
STORED AS PARQUETFILE
Step time: 0:00:06 Done
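The Avro staging schema referenced by `avro.schema.url` above declares every field as a string, in line with the all-STRING load table. A minimal sketch of what such a schema might look like is below (the field list is abridged and the exact layout is an illustrative assumption, not Gluent’s generated file):

```python
import json

# Minimal sketch of an all-STRING Avro staging schema, mirroring the
# load table DDL above. Field list abridged; layout is illustrative.
schema = {
    "type": "record",
    "name": "promotions",
    "namespace": "sh",
    "fields": [
        {"name": name, "type": ["null", "string"]}
        for name in ("PROMO_ID", "PROMO_NAME", "PROMO_COST",
                     "PROMO_BEGIN_DATE", "PROMO_END_DATE")
    ],
}
print(json.dumps(schema, indent=2))
```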
There are a few additional steps completed along the way to finish out the setup phase. At this point, we have identified the data to be offloaded and prepared the Hadoop databases and tables necessary to store the data.
The Transport phase involves physically copying the data from the Oracle database schema to Hadoop. To begin, chunks of data are exported from Oracle and imported into the staging table in Hadoop. A Hadoop INSERT ... SELECT operation then copies the data to its final destination, casting it to the final data types identified in the previous phase. One important note: Gluent Offload Engine does not drop any data from the source RDBMS. If desired, this must be done with your own script(s), and only after the related offload process has completed successfully and Gluent’s built-in validation has verified the data to be correct.
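The cast-on-insert step can be pictured as follows. This is a sketch only: the column/type pairs mirror the final table DDL shown earlier, but the generated SQL text is an assumption about what such a statement could look like, not Gluent’s actual output:

```python
# Illustrative sketch: build an INSERT ... SELECT that moves staged
# STRING data into the final table, casting to the types chosen during
# the Setup phase. Column/type pairs mirror the DDL shown earlier;
# the SQL text itself is an assumption.
final_types = {
    "PROMO_ID": "BIGINT",
    "PROMO_COST": "DECIMAL(18,4)",
    "PROMO_BEGIN_DATE": "TIMESTAMP",
}

select_list = ", ".join(
    "CAST(`%s` AS %s)" % (col, typ) for col, typ in final_types.items()
)
sql = ("INSERT INTO `sh`.`promotions` SELECT %s FROM `sh_load`.`promotions`"
       % select_list)
print(sql)
```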
Gluent uses two different methods for copying the data across to Hadoop, depending on the size of the table, its partitioning, and the specific data types in use. For large or partitioned tables, a Sqoop command-line import is used to create Avro data file(s) in Hadoop. Smaller, non-partitioned tables, as determined by a “small table threshold” parameter, are exported using a proprietary Gluent process called Query Import. This process SELECTs from the RDBMS table and writes the results directly to a staging Avro data file in Hadoop, resulting in lower latency for smaller data sets by skipping the overhead of submitting a YARN job and starting a Java Virtual Machine, as Sqoop requires. If for some reason Sqoop is the preferred method for all data offloads, you can always set a parameter to “force sqoop”.
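The selection logic just described can be sketched roughly as below. The parameter names, default threshold value, and function name are assumptions for illustration, not Gluent’s actual settings:

```python
# Illustrative sketch of transport method selection. The parameter
# names and default threshold are assumptions, not Gluent's settings.
def transport_method(table_bytes, is_partitioned,
                     small_table_threshold=20 * 1024 * 1024,
                     force_sqoop=False):
    if force_sqoop:
        # Operator override: always use Sqoop.
        return "sqoop"
    if is_partitioned or table_bytes > small_table_threshold:
        # Large or partitioned tables: parallel Sqoop import via YARN.
        return "sqoop"
    # Small tables: direct Query Import, avoiding YARN/JVM startup cost.
    return "query_import"

print(transport_method(5 * 1024 * 1024, False))  # query_import
print(transport_method(5 * 1024 * 1024, True))   # sqoop
```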
Once the data has reached the load table in Avro format, it is copied over to the final data table. This two-step process helps in several ways:
- If any data transport issues occur, the initial step of staging the Avro data can easily be rerun. The final load into the columnar format is self-contained within Hadoop and atomic, so you won’t end up with duplicates when rerunning jobs after network issues or other failures.
- Gluent Offload Engine allows tables to be repartitioned in Hadoop, possibly with a different partitioning strategy than was used in the RDBMS. Separating the offload into a staging phase and a final load phase allows data to be repartitioned “on the fly” during the offload.
- Data integrity is ensured between the RDBMS and Hadoop. The initial load into the Avro STRING format guarantees the data reaches Hadoop without loss, regardless of the final destination data types.
Once the data is loaded successfully, Gluent Offload Engine issues a command to compute statistics in Hadoop, ensuring good query performance. Finally, the load table is removed and we’re on to the next phase.
This phase is specific to the Gluent Offload Engine running as a part of the overall Gluent Data Platform, since these objects are required for access to the offloaded data from the database of origin. Gluent Offload Engine, licensed as a standalone product, will ignore the Hybrid Objects phase.
Quite a lot happens here, but we’ll save the details for another post that is focused on Gluent Data Platform Access and Present capabilities. To sum up this phase, hybrid objects provide virtual access to source RDBMS + offloaded Hadoop data and allow existing database application queries to continue to run against the original table structures, without code changes. These objects are created in this step. Examples below:
|Database|Object Type|Object Name|Comments|
|---|---|---|---|
|Oracle|Table|SH_H.PROMOTIONS_EXT|Hybrid external table|
This is yet another phase focused on the fully licensed Gluent Data Platform and the ability to query hybrid tables. The purpose of this phase is to create the hybrid objects that enable various performance optimizations, such as Advanced Aggregation Pushdown, which pushes aggregation computations down to Hadoop in addition to scanning, filtering, column projection, etc.
As with the previous phase, a lot is happening here, which we won’t go into detail on within this post. It’s good to note that for Gluent Data Platform, the Hybrid Objects and Optimizations phases are where some of Gluent’s patented methods are implemented.
The final phase is the all-important data validation, which ensures the data has been successfully offloaded to Hadoop. There are two methods of validation; the choice of which to use is controlled by an input parameter.
- Minus – simply selects a COUNT(*) from the offloaded table minus a COUNT(*) from the hybrid table.
- Aggregate – compares several statistics (MIN, MAX, COUNT, etc.) on top of high-cardinality columns. Grouping is performed at the partition level (if partitions are detected) for greater granularity.
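The two validation styles above can be sketched as follows. The function names and the shape of the statistics are illustrative assumptions; the real checks run as SQL against the offloaded and hybrid tables:

```python
# Illustrative sketch of the two validation styles described above.
# Function names and data shapes are assumptions for illustration.
def minus_validation(rdbms_count, hadoop_count):
    # "Minus": difference between row counts; 0 means the counts match.
    return rdbms_count - hadoop_count

def aggregate_validation(rdbms_stats, hadoop_stats):
    # "Aggregate": compare MIN/MAX/COUNT per column (or per partition,
    # when partitions are detected, for greater granularity).
    return {
        col: rdbms_stats[col] == hadoop_stats.get(col)
        for col in rdbms_stats
    }

stats = {"PROMO_ID": {"min": 1, "max": 503, "count": 503}}
print(minus_validation(503, 503))            # 0 -> counts match
print(aggregate_validation(stats, stats))    # {'PROMO_ID': True}
```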
Gluent software is delivered with additional tools to enable a more in-depth data validation, should your offload process require it.
If no differences are detected, then validation is successful and our offload is complete!
Gluent Offload Engine can easily sync your enterprise relational data to a modern, distributed data storage backend, and keep the data in-sync with change data capture and incremental update processing. This post is an initial look at just a few of the steps that occur during this complex data movement process, made simple to the end user by the excellent software engineers at Gluent. Keep an eye out for more “deep dive” posts on other Gluent product capabilities. If you have any specific questions, feel free to reach out to us at firstname.lastname@example.org and we’ll be happy to answer them.