Do Not Warehouse Your Data Warehouse Yet

How organizations incorporate the Hadoop ecosystem into their big data analytics projects depends on architectures, use cases and economics. With established data warehouse vendors and newcomers alike incorporating Structured Query Language (SQL) interfaces to bridge enterprises to Hadoop, do not warehouse your data warehouse yet.

Hadoop is an open-source platform for developing and deploying distributed, data-intensive applications that can accommodate the ever-increasing volume, velocity and variety of data commonly referred to as big data. The platform is managed by the Apache Software Foundation and freely distributed under an open-source license.

Hadoop is valuable for three primary reasons: scalability, cost efficiency and flexibility. At its core are the Hadoop Distributed File System (HDFS), which serves as the storage layer, and the MapReduce software framework, which is the compute layer. Many other application projects have been developed to expand functionality and make Hadoop easier for enterprises to work with. We'll take a more in-depth look at Hadoop in an upcoming post.
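
To make the storage and compute split concrete, here is a minimal word-count sketch against Hadoop's Java MapReduce API; the HDFS input and output paths are hypothetical placeholders. The map phase runs next to the data on HDFS, and the reduce phase aggregates the intermediate results across the cluster.

```java
// Minimal word-count job illustrating the MapReduce compute layer over HDFS.
// The input and output paths are placeholders; a real deployment adds cluster configuration.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/data/raw/logs"));        // hypothetical HDFS input
    FileOutputFormat.setOutputPath(job, new Path("/data/out/wordcount")); // hypothetical HDFS output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```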

The biggest difference between an enterprise data warehouse (EDW) and Hadoop is that the latter does not require a predefined schema. This means that, unlike EDWs, which require data to be structured before it can be loaded, Hadoop can ingest data in raw form and recall it rapidly for analysis.
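
As a small illustration of that schema-on-read model, the sketch below drops raw files into HDFS with no structure declared up front; the NameNode address and both paths are assumptions made for the example.

```java
// Raw data lands in HDFS as-is; structure is imposed later, at query time.
// The fs.defaultFS address and both paths are placeholders for illustration.
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RawIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; in practice this comes from core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    // Copy raw, unmodeled log files straight into the cluster.
    fs.copyFromLocalFile(new Path("/var/log/clickstream/2014-03-01.log"),
                         new Path("/data/raw/clickstream/"));

    // No schema was declared; a query engine such as Hive can apply one on read.
    fs.close();
  }
}
```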

The move to a distributed architecture

Owing in part to the evolution of Hadoop, the centralized data store of the traditional EDW is gradually giving way to a more distributed architecture. The shift leverages the scale, cost efficiency and flexibility Hadoop brings to handling unstructured data. The infrastructure economics are compelling: measured on a cost-per-terabyte basis, comparable workloads can be deployed on a cluster of commodity servers running Hadoop at approximately one-tenth the cost of branded storage. At a minimum, this makes Hadoop ideal for archiving, allowing enterprises to offload infrequently used data from expensive first-tier storage to secondary and tertiary tiers.

This evolution to a distributed, modular architecture represents a strategic shift that has been forced on EDW vendors.  Data warehouses were not built to handle the complexity of big data workloads.  The agility that incorporating Hadoop provides allows enterprise IT to shift focus from the burden of managing workloads to helping business users derive more value from their data.

Major EDW vendors, including Teradata, Oracle and IBM, have introduced appliances to connect their database and analytics software to data stored in Hadoop. They are also partnering with the leading Hadoop distribution vendors, Cloudera and Hortonworks, to facilitate application deployment.

SQL connectors are key to Hadoop integration and adoption

The new Hadoop appliances are designed to operate alongside the EDW.  Importantly, each vendor offers some type of SQL-based query language on top of Hadoop’s distributed file system to make the data stored in Hadoop clusters more accessible to business users.  Their objective is to encourage analysis of all data – whether structured or multi-structured – with the ease and familiarity of SQL.
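
As one concrete instance of the pattern (each vendor's connector differs in the details), an application can run ordinary SQL against data in a Hadoop cluster through the Apache Hive JDBC driver; the host name, credentials, table and columns below are hypothetical. The query executes inside the cluster, and only the small result set comes back to the caller.

```java
// Querying data stored in Hadoop with plain SQL through HiveServer2's JDBC interface.
// Connection URL, credentials and the clickstream table are illustrative placeholders.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HadoopSqlQuery {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hadoop-edge-node:10000/default", "analyst", "");
         Statement stmt = conn.createStatement();
         // Familiar SQL, even though the underlying data lives in HDFS.
         ResultSet rs = stmt.executeQuery(
             "SELECT page, COUNT(*) AS visits " +
             "FROM raw_clickstream GROUP BY page ORDER BY visits DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("page") + "\t" + rs.getLong("visits"));
      }
    }
  }
}
```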

For traditional EDWs, database administrators, SQL developers and extract, transform and load (ETL) experts are fairly common. With IT skills in big data technologies, and the Hadoop ecosystem in particular, in short supply, this architecture serves as a good stepping stone that makes querying and building business applications in Hadoop easier. It also enables enterprises to fully depreciate their EDW assets while gracefully migrating toward the scale and cost benefits of Hadoop for big data analytics projects.

The EDW vendors are facilitating this by building analytical function accelerators into their appliances to speed up specific capabilities. The SQL connectors also link the Hadoop appliances to familiar business intelligence platforms. With a modular architecture, unstructured data stored in Hadoop can be processed there and then shipped to the EDW for analysis.
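
A minimal sketch of that hand-off, assuming Hive on the Hadoop side and a JDBC path into the warehouse: the connection URLs, driver, credentials and table names are all illustrative, and the vendors' bulk connectors are the realistic route for production volumes.

```java
// Sketch: read a result set produced in the Hadoop cluster (via Hive) and load it into an EDW table.
// Both JDBC URLs, the drivers, credentials and table/column names are assumptions for illustration.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class HadoopToEdw {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    Class.forName("com.teradata.jdbc.TeraDriver"); // one example EDW driver; vendor-specific

    try (Connection hive = DriverManager.getConnection(
             "jdbc:hive2://hadoop-edge-node:10000/default", "etl_user", "");
         Connection edw = DriverManager.getConnection(
             "jdbc:teradata://edw-host/DATABASE=analytics", "etl_user", "secret")) {

      edw.setAutoCommit(false);

      try (Statement hiveStmt = hive.createStatement();
           // The heavy aggregation happens in the Hadoop cluster, not the warehouse.
           ResultSet rs = hiveStmt.executeQuery(
               "SELECT page, COUNT(*) AS visits FROM raw_clickstream GROUP BY page");
           PreparedStatement insert = edw.prepareStatement(
               "INSERT INTO page_visits (page, visits) VALUES (?, ?)")) {

        while (rs.next()) {
          insert.setString(1, rs.getString("page"));
          insert.setLong(2, rs.getLong("visits"));
          insert.addBatch();
        }
        insert.executeBatch();
        edw.commit();
      }
    }
  }
}
```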

Traditional EDW vendors have more work to do.  They must rework the relational databases that remain critical to analyzing business operations.  This includes leveraging in-memory technology to make the database more elastic and flexible for analyzing big data.  By offloading non-analytical functions such as transforming, cleansing and preparing data onto Hadoop clusters, organizations can utilize the data warehouse to do what it does best: high-performance processing and analytics on tier-one data.

With the general availability of Hadoop 2.0 and its YARN resource manager, which opens the platform to processing engines beyond MapReduce, more of the analysis can be done without moving the data out of Hadoop. Instead of loading everything into the EDW and storing it there, enterprises can pre-screen data flowing into Hadoop clusters to determine what should move to the EDW and what should remain in Hadoop, either for archiving or for native analysis. Ultimately, what business users want is a safe, well-managed and less complex environment in which they can solve business problems.