Gluent Data Platform was built to provide data sharing and transparent data virtualization across the enterprise, ensuring all enterprise data is just a query away. The term “data virtualization” has many different meanings (as we see on Wikipedia), depending on the goal of the virtualization software. In the case of Gluent, we consider our platform to truly virtualize data. With Gluent, you can transparently plug the power of a modern distributed computation engine into your existing application without any upfront development investment. No application code changes or ETL projects required. Because of the common “No ETL” data sharing description, Gluent’s data virtualization technology is often compared with data federation software. But as you can see in the image, they’re quite different.
Both data federation and data virtualization achieve the goal of enterprise data access, eliminating data silos without any ETL or data replication required. However, transparent data virtualization allows applications to continue using the same data source connection and queries that they always have, regardless of where the data physically resides. With Gluent Data Platform, you can virtualize an entire database, schema, or table. You can even offload and virtualize a portion of a table, with Gluent handling the execution of hybrid queries on the backend. The key is that no application must be rewritten or redirected, they just continue to work without the extra development costs associated with data federation.
How exactly does Gluent implement transparent data virtualization?
In our previous post, we described the Gluent Offload Engine (GOE) phases in detail, highlighting those specific to the standalone GOE product: Setup, Transport, and Validation. The other two phases, the creation of Hybrid Objects and Optimizations, are delivered with the full Gluent Data Platform, specifically to enable Gluent’s data virtualization capability. Let’s go through each phase, briefly describing those already discussed in the previous post and digging into more detail on the two that we skipped over last time.
Setup of all objects occurs in this initial phase, ensuring they’re ready for offload. This includes creating the intermediate staging table and final table on the Hadoop side and confirming the column data types created will be compatible with the source table data.
This phase is where data is physically transferred between the source RDBMS and the target, using either Sqoop or a Gluent-developed process to copy and load the data. Automatic table partitioning, automatic data compression, and other features make the offload process much more than just moving data from point A to point B.
Enabling transparent data virtualization with Gluent Data Platform begins with the creation of the hybrid database objects. In this phase, hybrid objects that provide access to Oracle + Hadoop data and allow application queries to continue to run against the original tables, without code changes, are created. The set of hybrid objects generated are listed in the table below.
|Database||Object Type||Object Name||Comments|
|Oracle||Table||SH_H.PROMOTIONS_EXT||Hybrid external table|
First, a hybrid schema is created as a duplicate of the original “source” schema. In the example shown in the table above, the original schema is named “SH” and the hybrid schema is created as “SH_H”. This schema is granted SELECT on the offloaded table(s), with GRANT OPTION enabled to allow users who are granted access to the hybrid view to also be granted access to the original source table.
In the next few steps, we create the hybrid external table and the hybrid view, dropping them first if they already exist. Dropping these objects helps the process pickup any DDL changes from the original source table. The external table provides the Oracle database user “virtual” access to the offloaded data in Hadoop. The hybrid view sits on top of the external table and, if the table data was partially offloaded, the original source table. These two hybrid objects are key to ensuring the entire dataset can be accessed by the database application user just as it was prior to the offload.
In order to deliver query performance when selecting from the hybrid external table (via the hybrid view), the original source table statistics are copied to the external table. Finally, any views in the original source schema that are dependent upon the offloaded table are recreated in the hybrid schema. With the hybrid view in place of the original source table, each dependent view is now ready to be accessed from within the hybrid schema just as it was before.
Now the hybrid objects have been created, virtualizing access to the offloaded data without any application changes required.
The next phase in the Gluent Data Platform offload process has to do with query performance optimization. There are many types of optimizations performed, but here we’ll focus on enhancing aggregate queries. Gluent creates additional hybrid objects for handling the aggregations, but only the main hybrid view, created in the previous step, is queried by the end user or application. All other hybrid objects generated are meant to stay behind the scenes.
The list of hybrid objects created for our example table are listed below.
|Database||Object Type||Object Name||Comments|
|Oracle||View||SH_H.PROMOTIONS_AGG||Aggregate hybrid view|
|Oracle||Table||SH_H.PROMOTIONS_AGG_EXT||Aggregate hybrid external table|
|Oracle||Rewrite Equivalence||SH_H.PROMOTIONS_AGG||Aggregate rewrite rule|
|Oracle||View||SH_H.PROMOTIONS_CNT_AGG||Aggregate hybrid view|
|Oracle||Table||SH_H.PROMOTIONS_CNT_AGG_EXT||Aggregate hybrid external table|
|Oracle||Rewrite Equivalence||SH_H.PROMOTIONS_CNT_AGG||Aggregate rewrite rule|
When a query with an aggregate function is called using the hybrid view, Gluent’s rewrite rules come into play. Gluent will modify the query, replacing the standard hybrid view with the appropriate aggregate hybrid view using Oracle’s Advanced Query Rewrite Using Equivalences functionality. Essentially, Gluent will rewrite the query and indicate to the Oracle processor that the new query is the best choice for performance.
There are two sets of external table/view/rewrite equivalence objects. The first, using the _AGG suffix, is meant to handle all standard aggregation functions (SUM, MAX, MIN, etc). The second, with _CNT_AGG suffix, is specific to the COUNT(*) queries. The count star rewrite could use the generic hybrid aggregation view, but we found an even better approach to rewriting this type of query for performance and therefore created the additional view and external table for the count star aggregates.
The query rewrites are used by Gluent’s patented Smart Connector. The Smart Connector orchestrates reading data from Hadoop and passing it back to Oracle, utilizing a number of optimizations such as Advanced Aggregation Pushdown, Predicate Pushdown, Advanced Join Filter Pulldown, Projection Pushdown, and Column Dropout. Sharing the details on the additional Gluent optimizations will be saved for another post.
The final phase performs data validation to ensure the offload completed without any issues. If no differences are detected, then validation is successful and our offload is complete!
Enabling true data virtualization has been a cornerstone of Gluent software from the beginning. The ability to query data offloaded to modern data storage and compute systems, transparently, and without any application changes is the key. And Gluent Data Platform delivers! If you’re interested in learning more about how Gluent Data Platform can help virtualize your enterprise data, give us a shout at email@example.com.