During the last ten years, the volume and diversity of digital information have grown at unprecedented rates. The amount of information doubles every 18 months, and unstructured information volumes grow six times faster than structured ones.
Big data is the current trend. Big data has been defined as data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.
Hadoop was created in 2005 by Doug Cutting and Mike Cafarella to address the big data issue. Doug Cutting named it after his son's toy elephant. It was originally developed for the Nutch search engine project. Nutch is an effort to build an open source web search engine, based on Lucene and Java for the search and index component.
As of 2013, Hadoop adoption is widespread. For example, more than half of the Fortune 50 use Hadoop, and a number of companies offer commercial implementations or support for it. Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing.
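The idea of breaking a workload into blocks, processing each block independently, and merging the partial results can be sketched in a few lines of Python. This is only an illustration of the principle, not Hadoop's API: the function names (`split_into_blocks`, `map_block`, `merge_counts`, `word_count`) are made up for this example, and a local thread pool stands in for the cluster nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_blocks(records, block_size):
    """Cut the input into fixed-size blocks, the unit of distribution."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_block(block):
    """Map step: count words inside one block, independently of all others."""
    counts = Counter()
    for line in block:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


def word_count(records, block_size=2, workers=4):
    blocks = split_into_blocks(records, block_size)
    # Assumption: threads play the role of cluster nodes in this sketch.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_block, blocks))
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    lines = ["big data is big", "hadoop processes big data", "data blocks"]
    print(word_count(lines))
```

Because each block is processed without reference to the others, adding more workers (or, in a real cluster, more machines) lets more blocks be handled at the same time.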
Ventana Research, a benchmark research and advisory services firm, published the results of its groundbreaking survey on enterprise adoption of Hadoop to manage big data. According to this survey:
- More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs.
- More than twice as many Hadoop users as users of other platforms report being able to create new products and services and to enjoy cost savings; over 82% benefit from faster analysis and better utilization of computing resources.
- 87% of Hadoop users are performing or planning new types of analysis with large-scale data.
- 94% of Hadoop users perform analytics on large volumes of data that were not possible before; 88% analyze data in greater detail; and 82% can now retain more of their data.
- 63% of organizations use Hadoop in particular to work with unstructured data such as logs and event data.
- More than two-thirds of Hadoop users perform advanced analysis such as data mining or algorithm development and testing.
Today, Hadoop is being used as a:
- Staging layer: the most common use of Hadoop in enterprise environments is as “Hadoop ETL” — pre-processing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse.
- Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.
- Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
Most existing vendors in the data warehousing space have announced integrations between their products and Hadoop/MapReduce.
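The staging-layer use described above (pre-processing, filtering, and transforming semi-structured data before loading it into a warehouse) can be illustrated with a small Python sketch. Everything specific here is an assumption for the example: the log format, the field names, and the helper names `parse_line` and `stage` are hypothetical, and a real Hadoop job would run this kind of logic across many blocks in parallel.

```python
import re
from typing import Iterable, Iterator, Optional

# Hypothetical log format assumed for this sketch:
#   "<timestamp> <LEVEL> <user_id> <free-text message>"
LOG_PATTERN = re.compile(r"^(\S+)\s+(INFO|WARN|ERROR)\s+(\S+)\s+(.*)$")


def parse_line(line: str) -> Optional[dict]:
    """Transform one raw log line into a structured row, or None if malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    timestamp, level, user_id, message = match.groups()
    return {
        "timestamp": timestamp,
        "level": level,
        "user_id": user_id,
        "message": message,
    }


def stage(lines: Iterable[str], keep_levels=("WARN", "ERROR")) -> Iterator[dict]:
    """Filter step: drop malformed lines and levels the warehouse does not need."""
    for line in lines:
        row = parse_line(line)
        if row is not None and row["level"] in keep_levels:
            yield row


if __name__ == "__main__":
    raw = [
        "2013-05-01T10:00:00 INFO u1 login ok",
        "2013-05-01T10:00:05 ERROR u2 payment failed",
        "not a log line",
    ]
    for row in stage(raw):
        print(row)
```

The rows produced this way have a fixed set of fields, so they can be loaded into a warehouse table, while the raw lines stay in cheaper storage.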
Hadoop is particularly useful when:
- Complex information processing is needed.
- Unstructured data needs to be turned into structured data.
- Queries can’t be reasonably expressed using SQL.
- Algorithms are heavily recursive.
- Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing.
- Machine learning is involved.
- Data sets are too large to fit into database RAM or disks, or require too many cores (tens of TB up to PB).
- The value of the data does not justify the expense of constant real-time availability, as with archives or special-interest information, which can be moved to Hadoop and remain available at lower cost.
- Results are not needed in real time.
- Fault tolerance is critical.
- Significant custom coding would be required to handle job scheduling.
Do Hadoop and Big Data Solve All Our Data Problems?
Hadoop provides a new, complementary approach to traditional data warehousing that helps address some of the most difficult challenges of enterprise data warehouses. Of course, it’s not a panacea, but by making it easier to gather and analyze data, it may help move the spotlight away from the technology and towards the more important limitations on today’s business intelligence efforts: information culture and the limited ability of many people to actually use information to make the right decisions.