Data virtualization combines data from multiple sources to create a single, comprehensive view. Data federation, considered the traditional form of data virtualization, lets data consumers create a virtual database: middleware collects remote data and presents it to users through a single access point.
While this more conventional form of data virtualization allows users to access siloed data sources without using the costly and cumbersome extract, transform, and load (ETL) processes, it still presents some challenges that can be tough to overcome, making it virtually obsolete in the modern age.
What Is Traditional Data Virtualization?
Data federation technology aggregates data from disparate sources into a virtual database so that the data can be analyzed for business intelligence, reporting, and analytics. Rather than containing the data itself, this virtual database contains only information about where the data is located.
Data federation creates a layer of abstraction above the physical implementation of data, which allows it to be accessed without physically moving it. Virtualized views of integrated data allow distributed queries to be executed against multiple data sources, including relational databases, enterprise applications, data warehouses, documents, and XML.
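To make the idea concrete, here is a minimal sketch of a federation layer, not a real product's design. Two in-memory SQLite databases stand in for separate silos, and the names `CATALOG`, `fetch`, and `total_billed_per_customer` are hypothetical. The catalog stores only where each table lives, and the cross-source join happens in the middleware.

```python
import sqlite3

# Two independent "silos", each an in-memory SQLite database
# (stand-ins for real sources; the schemas are invented for illustration).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)",
                [(1, "Acme"), (2, "Globex")])

billing = sqlite3.connect(":memory:")
billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
billing.executemany("INSERT INTO invoices VALUES (?, ?)",
                    [(1, 120.0), (1, 80.0), (2, 50.0)])

# The "virtual database" holds no rows, only locations:
# virtual table name -> (connection, physical table name).
CATALOG = {
    "customers": (crm, "customers"),
    "invoices": (billing, "invoices"),
}


def fetch(virtual_table):
    """Pull all rows of a virtual table from whichever source holds it."""
    conn, physical = CATALOG[virtual_table]
    return conn.execute(f"SELECT * FROM {physical}").fetchall()


def total_billed_per_customer():
    """Join across sources in the middleware; neither silo sees both tables."""
    names = dict(fetch("customers"))  # id -> name
    totals = {}
    for customer_id, amount in fetch("invoices"):
        name = names[customer_id]
        totals[name] = totals.get(name, 0.0) + amount
    return totals


print(total_billed_per_customer())
```

Note that even in this toy version, every row from both sources has to travel to the middleware before the join can run.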
Why Traditional Data Virtualization Doesn’t Work
Data Mapping Exercise is Required
Performance Goes Down
Cost Goes Up
Code Rewrites are Required
1. Data Mapping Exercise is Required
Before the data can be used, it must be presented uniformly to consumers. Traditional data virtualization therefore requires an understanding of every source the data can come from. Only with that understanding can the data be mapped into the federated database and used.
The mapping process allows the federated database to retrieve data from various sources by associating the different sources with the corresponding types of data in the federated database. It creates a bridge between the two systems that ensures data is presented in an accurate and usable format.
As you can tell, this process becomes more and more complex and intensive the more data and sources there are. Integrating large amounts of data becomes a monumental, time-consuming challenge that demands an abundance of resources.
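A small sketch of what such a mapping can look like, assuming two hypothetical sources (`crm` and `erp`) that describe the same customer with different column names and formats. All names and schemas here are invented for illustration. Every additional source needs its own entry like these, which is where the effort grows.

```python
from datetime import datetime

# Hypothetical rows: each silo names and formats the same facts differently.
crm_row = {"cust_id": "17", "full_name": "Ada Lovelace", "signup": "2021-03-04"}
erp_row = {"CustomerNo": 17, "Name": "LOVELACE, ADA", "Created": "04/03/2021"}


def erp_name(value):
    """Convert 'LAST, FIRST' into 'First Last'."""
    last, first = [part.strip().title() for part in value.split(",")]
    return f"{first} {last}"


# One mapping per source: federated column -> (source column, converter).
MAPPINGS = {
    "crm": {
        "customer_id": ("cust_id", int),
        "name": ("full_name", str),
        "created": ("signup",
                    lambda s: datetime.strptime(s, "%Y-%m-%d").date()),
    },
    "erp": {
        "customer_id": ("CustomerNo", int),
        "name": ("Name", erp_name),
        "created": ("Created",
                    lambda s: datetime.strptime(s, "%d/%m/%Y").date()),
    },
}


def to_federated(source, row):
    """Translate one source row into the uniform federated schema."""
    return {
        target: convert(row[column])
        for target, (column, convert) in MAPPINGS[source].items()
    }


# Both sources now yield the same uniform record.
print(to_federated("crm", crm_row))
print(to_federated("erp", erp_row))
```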
2. Performance Goes Down
When integrating data from various sources, some may be slower than others. Even one slow source drags down the response time of every query that touches it and lengthens the whole process. Even when all silos are running well, combining data from separate large data sets requires moving data between systems, which hurts performance. In these cases, a caching layer is used to reduce bottlenecks.
This caching layer creates an overlay database on top of the existing data source and attempts to minimize the impact of pulling the data together from the various silos.
Unfortunately, caches can quickly become stale, especially when the source is a frequently updated production database or when the data sets are too large to fit in the cache.
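The staleness problem can be shown with a minimal time-to-live cache. This is a sketch under stated assumptions: `CachedSource` is a hypothetical class, the "production database" is a plain dictionary, and the clock is faked so the example runs instantly.

```python
class CachedSource:
    """Overlay cache in front of a source; entries are reused for ttl seconds."""

    def __init__(self, read_from_source, ttl, clock):
        self._read = read_from_source
        self._ttl = ttl
        self._clock = clock
        self._entries = {}  # key -> (value, time it was fetched)

    def get(self, key):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]  # served from cache, possibly stale
        value = self._read(key)  # cache miss or expired entry: go to the source
        self._entries[key] = (value, now)
        return value


# A dictionary stands in for a frequently updated production database.
production = {"stock:widget": 10}
now = [0.0]  # fake clock, in seconds
cache = CachedSource(production.__getitem__, ttl=60, clock=lambda: now[0])

first = cache.get("stock:widget")  # miss: reads 10 from the source

production["stock:widget"] = 3     # the source changes...
now[0] = 30.0
stale = cache.get("stock:widget")  # ...but the cache still answers 10

now[0] = 61.0
fresh = cache.get("stock:widget")  # entry expired: re-reads and gets 3

print(first, stale, fresh)
```

A shorter TTL narrows the stale window but sends more reads back to the slow sources, which is the bottleneck the cache was meant to remove.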
3. Cost Goes Up
Traditional data virtualization doesn’t reduce hard costs. Since data is left in the original silos, all the existing costs remain. In fact, total cost goes up, because the data virtualization software itself must be paid for.
4. Code Rewrites are Required
Traditional data virtualization software introduces a new SQL engine to the architecture and requires all applications to use it. The data access layer (the SQL) of every application must therefore be rewritten. Once the time-consuming code rewrite is complete, the federated engine translates this new SQL, you guessed it, back into the SQL of the original engine to retrieve the data. Hardly an efficient process.
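The round trip can be illustrated with a toy translator. This is only a sketch, not how any particular federation product works. It assumes a hypothetical federated dialect that uses `LIMIT n`, a source database that expects `FETCH FIRST n ROWS ONLY`, and an invented table mapping in `PHYSICAL_NAMES`.

```python
import re

# Hypothetical catalog: virtual table name -> physical table in the source.
PHYSICAL_NAMES = {"orders": "SALES.ORDERS"}


def translate_to_source_sql(federated_sql):
    """Turn the rewritten application SQL back into the source's own SQL."""
    sql = federated_sql
    # Point virtual table names at their physical locations.
    for virtual, physical in PHYSICAL_NAMES.items():
        sql = re.sub(rf"\bFROM\s+{virtual}\b", f"FROM {physical}",
                     sql, flags=re.IGNORECASE)
    # Convert a trailing LIMIT clause to the source's row-limiting syntax.
    sql = re.sub(r"\bLIMIT\s+(\d+)\s*$", r"FETCH FIRST \1 ROWS ONLY",
                 sql, flags=re.IGNORECASE)
    return sql


# The query as the application had to be rewritten for the federated engine.
rewritten_app_query = "SELECT id, total FROM orders ORDER BY total DESC LIMIT 5"

# What the federated engine ends up sending to the original database.
source_query = translate_to_source_sql(rewritten_app_query)
print(source_query)
```

The application team pays for the rewrite, and the engine then spends work at query time undoing it.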
The Solution: Transparent Data Virtualization
Transparent data virtualization differs from the outdated, traditional data virtualization technologies described above because it requires no code changes to existing applications and is completely transparent to end users, whether they are applications, analytical reporting systems, or custom SQL queries.
Our reliable, effective data virtualization solution, Gluent Data Platform, is transparent from the ground up and uses distributed storage and computation backends like Apache Hadoop, Cloudera Data Platform, Google BigQuery, Snowflake or Azure Synapse to store and access data. Data is stored in a centralized location, allowing query processing to be pushed to a single scalable computational platform. With Gluent Data Platform, business users and applications can access enterprise data without code rewrites.
Contact us to find out more about transparent data virtualization and Gluent Data Platform.