Gluent Data Platform FAQ
Q: Is Gluent Data Platform a data federation engine?
No, Gluent Data Platform is not a federated solution. In simple terms, a federated solution provides a SQL engine that can connect to various back-end databases and pull data together from those silos. The federated approach has a couple of downsides. The first is that application SQL must be rewritten to conform to the new SQL engine provided by the federation layer. The second is performance: joining multiple large data sets stored in different database silos is a challenge that is very difficult to overcome. Gluent, on the other hand, is a transparent data virtualization solution. While the problem Gluent addresses is very similar to that of federated offerings (eliminating or bridging the silos), the approach to solving it is very different.
Q: What is transparent data virtualization?
Transparent data virtualization enables enterprise data sharing, providing the ability to virtually access data that is locked away in silos throughout the organization. This means that access to enterprise data becomes simple. No longer are numerous data pipelines or replication jobs required to copy the data from silo to silo, just to query various datasets locally. While many data virtualization products on the market today require a federation engine to translate between variations of datastores, and force code changes to existing applications, at Gluent we believe that data virtualization should be transparent. Transparent to the application and transparent to the end user, regardless of where the underlying data lives.
Transparent data virtualization has many additional benefits, such as:
- Cost reduction in storage, compute, and data movement processes
- Enhanced capabilities, such as the ability to perform machine learning or advanced analytics across previously siloed datasets
- Transparent access to virtualized data allows applications to continue operating with zero code changes
- Modernization of data storage and compute capabilities without re-platforming existing applications
Q: How much of the query processing does Gluent push down to the centralized data lake (e.g. aggregation, filters, joins, etc.)?
Gluent’s transparent data virtualization software uses a component called Smart Connector to determine which portion of a query should be executed on Hadoop and which should run in the RDBMS. Smart Connector can push the following down to Hadoop:
- predicates (i.e. the filters in the WHERE clause)
- projections (i.e. to retrieve only the columns required by the query)
- aggregations (using our Advanced Aggregation Pushdown feature)
- join filters (using our Join Filter Pulldown feature)
- joins (using our Join Pushdown feature)
- data type conversion and formatting (for optimal performance)
Combinations of the above are also supported.
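As an illustration (the table and column names here are hypothetical, not taken from any Gluent documentation), consider a query against an offloaded sales table:

```sql
-- Hypothetical query against an offloaded SALES table.
-- Smart Connector could push the WHERE clause (predicate),
-- the column list (projection), and the SUM/GROUP BY
-- (aggregation) down to the Hadoop engine, so only the
-- small aggregated result set is returned to the RDBMS.
SELECT prod_id,
       SUM(amount_sold) AS total_sold          -- projection + aggregation
FROM   sales
WHERE  time_id >= DATE '2015-01-01'            -- predicate pushdown
GROUP  BY prod_id;
```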
Q: What type of SQL on Hadoop engines does Gluent use for data access? Are they existing open-source SQL engines or proprietary?
Gluent only uses open-source SQL engines. Currently certified production engines are Hive (with or without LLAP) and Impala, with Spark on the roadmap for a future release.
Q: Is standard ANSI SQL fully supported?
Gluent supports all SQL supported by Oracle, including proprietary extensions such as PL/SQL.
Q: How does existing code (ex. PL/SQL) that is run against a relational database table continue to work when executed in a hybrid environment, in which a portion of the data is now stored in Hadoop and virtually accessible?
Gluent is a transparent data virtualization tool. Gluent keeps the original SQL engine in play, eliminating the need to rewrite any application code and allowing transparent access to the data stored in Hadoop. This is why PL/SQL continues to work in Oracle deployments: the applications continue to connect to Oracle (usually with a much smaller footprint), which processes the incoming SQL (and proprietary extensions such as PL/SQL), while the heavy lifting is pushed down to the Hadoop platform. Keeping the large data sets in Hadoop also minimizes the data movement otherwise required before they can be joined.
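A minimal sketch of the point above, using hypothetical object names: existing PL/SQL runs unchanged because Oracle still parses and executes it, even when the underlying table data lives in Hadoop.

```sql
-- Hypothetical existing PL/SQL block: no changes are needed
-- after SALES is offloaded. Oracle still runs the PL/SQL,
-- while the scan of SALES is pushed down to Hadoop.
BEGIN
  FOR r IN (SELECT prod_id, SUM(amount_sold) AS total
            FROM   sales
            WHERE  time_id < DATE '2016-01-01'
            GROUP  BY prod_id)
  LOOP
    DBMS_OUTPUT.PUT_LINE(r.prod_id || ': ' || r.total);
  END LOOP;
END;
```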
Q: If there are joins across virtualized tables, where is the query (and join) executed?
Gluent creates a hybrid system: parts of the query are performed on the Hadoop side and parts on the RDBMS side. The general goal is to push as much processing to the Hadoop side as possible; this is where a large amount of Gluent’s intellectual property resides. A join can be pushed down to the Hadoop backend entirely if the required tables are present in Hadoop. We often sync (duplicate) tables, even the smaller ones, so that all of the data resides in Hadoop and join pushdown is enabled. Gluent also handles hybrid joins, where some (typically large) tables are in Hadoop and some smaller tables are in the RDBMS. To avoid moving large datasets around and to make hybrid joins performant, we have built a patented adaptive Join Filter Pulldown (JFPD) technology. For large “fact-to-fact” table joins it is recommended to have all data available in Hadoop (Gluent Offload Engine can automatically sync the data, so this is not a problem!). Many optimizations exist in the Gluent configuration, so contact email@example.com for a more detailed conversation about this topic.
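A sketch of a hybrid join, with hypothetical table names: a large fact table offloaded to Hadoop joined to a small dimension table still in the RDBMS.

```sql
-- SALES is offloaded to Hadoop; CUSTOMERS remains in the RDBMS.
-- Conceptually, Join Filter Pulldown lets the filter derived
-- from the CUSTOMERS predicate (the matching cust_id values)
-- be applied to the SALES scan in Hadoop, so only the relevant
-- fact rows are returned to the RDBMS for the final join.
SELECT c.cust_name,
       SUM(s.amount_sold) AS total_sold
FROM   sales s
       JOIN customers c ON c.cust_id = s.cust_id
WHERE  c.country = 'Finland'
GROUP  BY c.cust_name;
```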
Q: What are some use cases for transparent data virtualization?
Any industry can benefit from transparent data virtualization. Often large enterprises are seen as the best candidates for data virtualization software due to the large number of data silos and difficulty in sharing data throughout the organization. But really any company that wants to share data across disparate systems will benefit from data virtualization.
A few standard use cases for transparent data virtualization are:
- Enterprise data sharing
- Accelerating data migration to the cloud
- Eliminating enterprise data sprawl
- Enterprise data warehouse offload (“active archive” of historical data)
- Access to IoT and other big data sources using Gluent Present
- Sharing machine learning results with existing RDBMS applications
For further details, have a look at the Gluent case studies published on our website.
Q: What is Gluent Offload Engine?
Gluent Offload Engine is software that synchronizes tables from enterprise relational databases to modern data storage platforms like Hadoop, both on-premises and in the cloud. The offloaded data is stored in open data formats, with no need for a proprietary database engine for data access. Gluent Offload Engine will sync the data to Hadoop, store it in a compressed, columnar format, and create the metadata for the table structure and partitions. Once offloaded, data instantly becomes available in open data formats, can be accessed by native Hadoop tools for data scientists and others, and is easily shared throughout the enterprise.
Q: How is Gluent Offload Engine different from open source tools such as Sqoop?
While Sqoop is a well-known, de facto bulk data movement tool for Hadoop, it has a number of limitations that Gluent Offload Engine is able to overcome, for example:
- Different capabilities of datatypes in commercial RDBMS vs Hadoop
- Different capabilities of range partitioning methods in RDBMS vs Hadoop
Gluent Offload Engine will translate each RDBMS datatype to the appropriate Hadoop datatype to ensure there is no data loss or corruption.
What Sqoop doesn’t provide out of the box (without extra scripting & coding):
- Atomic offloading (no data loss nor duplicates even in case of Hadoop node or network failure)
- Quick offloading of small tables (with Sqoop, even small table offloads launch a big MapReduce job)
- Sync, merge, and compact incremental changes on HDFS for one “current” view of data
- Load microsecond-precision timestamps without precision loss (Sqoop loads only milliseconds by default)
Gluent Offload Engine has built-in data validation and HDFS update support to ensure the RDBMS data is fully synchronized with Hadoop and represents the current state of the source data.
Q: What Hadoop storage formats are supported?
When syncing data from an RDBMS to Hadoop, we support Parquet and ORC. We have built technology that allows updates/changes to existing offloaded data (normally HDFS/Parquet is immutable). When presenting existing Hadoop data to the RDBMS (without offloading from the database first), we support all data formats that your underlying Hadoop distribution supports. Any tables/files that can be queried via Hive or Impala (for example, Parquet, Avro, CSV, JSON) can be presented to your RDBMS for transparent query.
Q: Where is the software installed?
The software is installed on the relational database server. Optionally, if you would like to run offloads from the Hadoop cluster, or for specific security implementations, Gluent Offload Engine components can also be installed on a Hadoop Edge Node.
Q: Does Gluent require specific hardware resources for the Hadoop cluster (for example, SSD storage, number of cores etc.)?
There are no specific hardware requirements.
Q: What change data capture methods are used to identify and capture source data changes?
Gluent Offload Engine periodically queries tables and/or partitions for changes and syncs them to Hadoop. It can also poll Gluent-maintained log tables (populated by triggers on source tables, or when offloaded, Hadoop-only data is updated from within the RDBMS) and sync the logged changes to Hadoop.
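To illustrate the trigger-based logging pattern described above (this is a generic sketch with hypothetical names, not Gluent’s actual internal schema), a trigger records each change so a periodic sync job can replay it:

```sql
-- Hypothetical change-log table for a source ORDERS table.
CREATE TABLE orders_change_log (
  change_time  TIMESTAMP,
  change_type  VARCHAR2(6),   -- INSERT / UPDATE / DELETE
  order_id     NUMBER
);

-- Trigger that records every row-level change.
CREATE OR REPLACE TRIGGER orders_cdc
AFTER INSERT OR UPDATE OR DELETE ON orders
FOR EACH ROW
BEGIN
  INSERT INTO orders_change_log
  VALUES (SYSTIMESTAMP,
          CASE WHEN INSERTING THEN 'INSERT'
               WHEN UPDATING  THEN 'UPDATE'
               ELSE 'DELETE' END,
          COALESCE(:NEW.order_id, :OLD.order_id));
END;
```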
Q: How does Gluent Offload Engine keep my data secure?
Gluent Offload Engine supports encryption of data at rest and data in motion. Role-based access control is also fully supported.
Q: Does Gluent Offload Engine provide a scheduling tool?
No, but Gluent Offload Engine can be run via any external scheduling tool that supports generic command line calls.
Q: Do I need to create any tables or metadata on Hadoop manually?
Not at all. Gluent Offload Engine will both copy data to HDFS in a compressed, columnar format and create the table metadata in Hadoop (Impala or Hive). While creating the Hadoop-based table, Gluent Offload Engine will determine which data type best represents each source RDBMS data type, ensuring no data is compromised or lost during the offload.
Q: How do I optimize the offload of data?
Gluent Offload Engine has several available options for optimizing the speed of ingestion, depending on the specific use case, data types involved, and other information. These optimizations can be found and followed in the Gluent documentation. For more detailed optimization support, ask about Gluent consulting services for Gluent Offload Engine implementation.