A unified Hadoop is essential for data management. However, for Hadoop to remain unified, rather than splinter into divergent provider implementations (e.g., Cloudera and Hortonworks implementing different modules of Hadoop), distributors need to coalesce around the BigTop reference model. BigTop is an Apache project (bigtop.apache.org) that integrates core Hadoop with the ZooKeeper, HBase, Hive, Pig, Mahout, Oozie, Sqoop, Flume and Whirr modules and packages them for Fedora, CentOS, Red Hat Enterprise Linux, SUSE Linux and Ubuntu.
While BigTop is somewhat Cloudera-centric today, its success will be assured only if all distributors support its development and incorporate the code base into their distributions. This matters because Hadoop is still an immature technology with many components that are neither robust nor well integrated.
Much of the innovation in big data management and analytics is occurring in the open source community, and much of it centers on NoSQL databases and the Hadoop framework. As demand for analysis of semi-structured and unstructured data continues to grow, and as organizations migrate further toward cloud platforms, NoSQL systems are becoming the database of choice. NoSQL has evolved to support the large-scale, real-time data access that big analytics requires.
Hadoop can augment existing data management and storage capabilities because it can collect and analyze data on low-cost, easily scaled commodity hardware. As the enterprise market for Hadoop matures and more companies deploy commodity clusters close to their data sources for the most demanding analytics projects, expect widespread adoption of the open source framework. And because Hadoop scales linearly, enterprises gain storage and processing power with each node they add.
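As a rough back-of-the-envelope illustration of that linear scaling, the sketch below multiplies assumed per-node figures by cluster size; the per-node disk and throughput numbers are purely illustrative, not benchmarks, and only the HDFS replication factor of 3 reflects a real default.

    # Illustrative arithmetic only: per-node figures are assumptions, not measurements.
    NODE_RAW_TB = 12            # raw disk per commodity node (assumed)
    HDFS_REPLICATION = 3        # default HDFS replication factor
    NODE_SCAN_MB_S = 400        # aggregate scan throughput per node (assumed)

    def cluster_profile(nodes):
        """Usable storage and scan throughput grow in direct proportion to node count."""
        usable_tb = nodes * NODE_RAW_TB / HDFS_REPLICATION
        scan_gb_s = nodes * NODE_SCAN_MB_S / 1000.0
        return usable_tb, scan_gb_s

    for n in (10, 50, 100):
        tb, gbs = cluster_profile(n)
        print("%4d nodes -> ~%6.1f TB usable, ~%5.1f GB/s scan" % (n, tb, gbs))

The point is simply that capacity is added in uniform increments of commodity hardware rather than by replacing a central server with a larger one.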
Connecting to Legacy Systems Remains Crucial
In my view, Hadoop is complementary to, not a replacement for, data warehouse architectures: it handles data flows and analytics that are beyond conventional database capabilities. For example, Hadoop is not suitable for highly complex transactional workloads in which many steps must be carried out as a single atomic unit. Nor is it optimal for structured data sets that require minimal latency, such as a Web site served by a MySQL database at speeds Hadoop cannot match.
Moreover, a paucity of expertise in the R language and other components of the Hadoop ecosystem is keeping all but the most sophisticated Web-based enterprises from broader deployment. While training in these technologies is already ramping up rapidly, we expect deployments to remain complementary to existing infrastructure so that organizations maximize the return on all of their data assets.
Because it processes data in batches, Hadoop is most effective for high data volumes that can be stored in Hadoop clusters and queried at length with MapReduce jobs. Use cases include building indexes, finding patterns, creating recommendation engines and performing sentiment analysis. In this sense, Hadoop is an extract, transform and load (ETL) technology that complements existing systems as a means of processing Web logs and social, spatial, text and machine-generated data. It will not replace legacy ETL architectures that stream data directly into relational databases.
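To make that ETL-style, batch-oriented role concrete, here is a minimal sketch of the kind of job Hadoop handles well, written as a Hadoop Streaming mapper and reducer in Python that counts hits per URL in Web-server logs. The whitespace-delimited log layout (request path in the seventh field) and the file name are assumptions for illustration, not a description of any particular deployment.

    #!/usr/bin/env python
    # loghits.py -- illustrative Hadoop Streaming job (mapper and reducer in one file).
    # Map phase:    emits "request_path<TAB>1" for each Web-server log line on stdin.
    # Reduce phase: sums the counts per path; Streaming delivers map output sorted
    #               by key, so identical keys arrive on consecutive lines.
    # The log layout (path in field index 6) is an assumption for this sketch.
    import sys

    def run_mapper():
        for line in sys.stdin:
            fields = line.split()
            if len(fields) > 6:                  # skip malformed lines
                print("%s\t1" % fields[6])       # request path -> count of 1

    def run_reducer():
        current, total = None, 0
        for line in sys.stdin:
            key, value = line.rstrip("\n").split("\t", 1)
            if key != current:
                if current is not None:
                    print("%s\t%d" % (current, total))
                current, total = key, 0
            total += int(value)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        run_reducer() if "reduce" in sys.argv[1:] else run_mapper()

Outside a cluster the same pair can be exercised with ordinary pipes (cat access.log | python loghits.py | sort | python loghits.py reduce); on a cluster it would be submitted through the Hadoop Streaming jar with its -mapper and -reducer options. The shape of the work is the point: a full scan and batch aggregation over large volumes, not a low-latency lookup.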
Enterprise IT teams will face an expanding selection of open source analytics offerings from a growing range of vendors. Startups and incumbents are rapidly building, or partnering to deliver, out-of-the-box integration with legacy systems and infrastructure, backed by strong service, support, consulting and training. Consequently, established data management vendors will continue to incorporate open source, particularly Hadoop, into their product strategies.