Galaxy Consulting Blog: Hadoop

Sunday, June 12, 2016

Hadoop Adoption

True to its iconic logo, Hadoop is still very much the elephant in the room. Many organizations have heard of it, yet relatively few can say they have a firm grasp on what the technology can do for their business, and even fewer have actually implemented it successfully.

Forrester Research predicted that Hadoop will become a cornerstone of the business technology agenda at most organizations.

Scalability, affordability, and flexibility make Hadoop uniquely suited to change the big data scene. An open-source software framework, Hadoop allows for the processing of big data sets across clusters on commodity hardware either on-premises or in the cloud.

At roughly one-thirtieth the cost of traditional data storage and processing, Hadoop makes it realistic and cost effective to analyze all data instead of just a data sample. Its open-source architecture enables data scientists and developers to build on top of it to form customized connectors or integrations.

Typically, data analysis requires some level of data preparation, such as data cleansing and eliminating errors, outside of traditional data warehouses. Once the data is prepared, it is transferred to a high-performance analytics tool, such as a Teradata data warehouse. With data stored in Hadoop, however, users can see "instant ROI" by moving the data workloads off of Teradata and running analytics right where the data resides.

Another use of Hadoop is live archiving. Instead of backing up data and storing it in a data recovery system, such as Iron Mountain, users can store everything in Hadoop and easily pull it up whenever necessary.

The greatest power of Hadoop lies in its ability to house and process data that couldn't be analyzed in the past due to its volume and unstructured form. Hadoop can, for example, parse emails and other unstructured feedback to reveal insight that was previously inaccessible.

The sheer volume of data that businesses can store on Hadoop changes the level of analytics and insight that users can expect. Because it allows users to analyze all data and not just a segment or sample, the results can better anticipate customer engagement. Hadoop is moving beyond model analytics that describe certain patterns and now delivers full-data-set analytics that can predict future behavior.

There are a few challenges.

Hadoop's ability to process massive amounts of data, for example, is both a blessing and a curse. Because it's designed to handle large data loads relatively quickly, the system runs in batch mode, meaning it processes massive amounts of data at once, rather than looking at smaller segments in real time. As a result, the system often forces users to choose between quantity and quality. At this point in Hadoop's life cycle, the focus is more on enormous data size than high-performance analytics.

Because of the large size of the data sets fed into Hadoop, the number-crunching doesn't take place in real time. This is problematic because the longer the gap between loading the data and making a decision based on it, the less useful that data becomes.
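
To make the difference between batch mode and real-time processing concrete, here is a minimal sketch in plain Python. It does not use any Hadoop API; the function `batch_total`, the class `RunningTotal`, and the sample records are hypothetical names and values chosen purely for illustration.

```python
def batch_total(all_records):
    """Batch mode: scan the complete data set in one pass and return one answer."""
    return sum(amount for _, amount in all_records)


class RunningTotal:
    """Incremental mode: update the answer as each record arrives."""

    def __init__(self):
        self.total = 0

    def add(self, record):
        _, amount = record
        self.total += amount
        return self.total


# Hypothetical (timestamp, amount) records, made up for this illustration.
records = [("09:00", 120), ("09:05", 80), ("09:10", 200)]

# Batch: nothing is known until the whole job has finished.
batch_answer = batch_total(records)

# Incremental: an up-to-date figure is available after every record.
running = RunningTotal()
running_answers = [running.add(r) for r in records]
```

Both approaches arrive at the same final figure. The batch job only delivers it once the full data set has been scanned, which is the lag described above.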

The biggest problem of all is that Hadoop's seeming boundlessness instills a proclivity for data exploration in those who use it. Relying on Hadoop to deliver all the answers without asking the right questions is inefficient.

As companies begin to recognize Hadoop's potential, demand is increasing, and vendors are actively developing solutions that promise to painlessly transfer data onto Hadoop, improve its processing performance, and operationalize data to make it more business-ready.

Big data integration vendor Talend, for example, offers solutions that help organizations transition their data onto Hadoop in high volume. The company works with more than 800 connectors that link up to other data systems and warehouses to "pull data out, put it into Hadoop, and transform it into a shape that you can run analytics on."

While solutions such as those offered by Talend make the Hadoop migration more manageable for companies, vendors such as MapR tackle the batch-processing lag. MapR developed a solution that enhances the Hadoop data platform to make it behave like enterprise storage. It enables Hadoop to be accessed as easily as network-attached storage is accessed through the network file system; this means faster data management and system administration without having to move any data.

Veteran data solution vendors such as Oracle are innovating as well, developing platforms that make Hadoop easier to use and to incorporate into existing data infrastructures. Oracle's latest updates revolved around allowing users to store and analyze structured and unstructured data together and giving users a set of tools to visualize data and find data patterns or problems.

RapidMiner's approach to Hadoop has been to simplify it, eliminate the need for end users to code, and do for Hadoop analytics what WordPress did for Web site building. Once usable insights are collected, RapidMiner can connect the data platform to a marketing automation system or other digital experience management system to deploy campaigns or make changes based on data predictions.

Moving forward, analysts predict that leveraging Hadoop's potential will become a more attainable goal for companies. Because it's open-source, the possibilities are vast. Hadoop's ability to connect openly to other systems and solutions will increase adoption in the coming months and years.

Tuesday, July 22, 2014

Hadoop and Big Data

During the last ten years, the volume and diversity of digital information have grown at unprecedented rates. The amount of information is doubling every 18 months, and unstructured information volumes grow six times faster than structured ones.

Big data is the current trend. Big data has been defined as data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

Hadoop was created in 2005 by Doug Cutting and Mike Cafarella to address the big data issue. Doug Cutting named it after his son's toy elephant. It was originally developed for the Nutch search engine project. Nutch is an effort to build an open source web search engine based on Lucene and Java for the search and index component.

As of 2013, Hadoop adoption is widespread: more than half of the Fortune 50 use Hadoop, and a number of companies offer commercial implementations or support for it. Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing.
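
The idea of breaking a workload into blocks and processing them in parallel can be sketched in a few lines of plain Python. This is only an illustration of the map and reduce pattern, not Hadoop's actual API: the function names (`split_into_blocks`, `map_block`, `reduce_counts`, `word_count`) are hypothetical, and a local thread pool stands in for the cluster nodes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Split a list of records into fixed-size blocks (the last may be shorter)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def map_block(block):
    """Map step: count words within a single block, independently of the others."""
    counts = defaultdict(int)
    for line in block:
        for word in line.lower().split():
            counts[word] += 1
    return dict(counts)


def reduce_counts(partials):
    """Reduce step: merge the per-block counts into one result."""
    total = defaultdict(int)
    for partial in partials:
        for word, n in partial.items():
            total[word] += n
    return dict(total)


def word_count(records, block_size=2, workers=4):
    blocks = split_into_blocks(records, block_size)
    # Threads stand in for cluster nodes here; a real cluster would run
    # each block on a separate machine.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_block, blocks))
    return reduce_counts(partials)


# Made-up input lines for illustration.
lines = [
    "hadoop stores data in blocks",
    "blocks are spread across nodes",
    "nodes process blocks in parallel",
]
result = word_count(lines)
```

Because each block is counted without reference to any other block, adding more workers (or, on a real cluster, more machines) lets more blocks be processed at the same time.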

Ventana Research, a benchmark research and advisory services firm, published the results of its groundbreaking survey on enterprise adoption of Hadoop to manage big data. According to this survey:

More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs.
More than twice as many Hadoop users as users of other platforms report being able to create new products and services and enjoy cost savings; over 82% benefit from faster analysis and better utilization of computing resources.
87% of Hadoop users are performing or planning new types of analysis with large scale data.
94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data.
63% of organizations use Hadoop in particular to work with unstructured data such as logs and event data.
More than two-thirds of Hadoop users perform advanced analysis such as data mining or algorithm development and testing.

Today, Hadoop is being used as a:

Staging layer: the most common use of Hadoop in enterprise environments is as “Hadoop ETL” — pre-processing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse.
Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.
Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.

Most existing vendors in the data warehousing space have announced integrations between their products and Hadoop/MapReduce.
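
As a rough illustration of the staging-layer use, the sketch below filters raw log lines and turns them into structured rows that could then be loaded into a warehouse. It is plain Python rather than a Hadoop job, and everything specific in it is an assumption made for the example: the log format, the `LOG_PATTERN` expression, the `extract` and `transform` helpers, and the sample lines.

```python
import re

# Hypothetical log format, assumed for illustration only:
# "<date> <time> <LEVEL> user=<id> action=<name>"
LOG_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) user=(?P<user>\w+) action=(?P<action>\w+)$"
)


def extract(raw_lines):
    """Parse raw lines into dictionaries, silently skipping lines that do not match."""
    for line in raw_lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            yield match.groupdict()


def transform(rows):
    """Drop DEBUG noise and keep only the columns the warehouse table needs."""
    for row in rows:
        if row["level"] == "DEBUG":
            continue
        yield (row["date"], row["user"], row["action"])


# Made-up sample input, including one corrupted line and one DEBUG line.
raw = [
    "2014-07-22 10:15:00 INFO user=alice action=login",
    "corrupted line ###",
    "2014-07-22 10:15:02 DEBUG user=alice action=heartbeat",
    "2014-07-22 10:16:30 INFO user=bob action=purchase",
]
warehouse_rows = list(transform(extract(raw)))
```

In this sketch the corrupted line and the DEBUG line are discarded, and only the two clean rows remain for loading.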

Hadoop is particularly useful when:

Complex information processing is needed.
Unstructured data needs to be turned into structured data.
Queries can’t be reasonably expressed using SQL.
Heavily recursive algorithms are involved.
Complex but parallelizable algorithms are needed, such as geo-spatial analysis or genome sequencing.
Machine learning is involved.
Data sets are too large to fit into database RAM or disks, or require too many cores (tens of terabytes up to petabytes).
The data's value does not justify the expense of constant real-time availability, as with archives or special interest information, which can be moved to Hadoop and remain available at lower cost.
Results are not needed in real time.
Fault tolerance is critical.
Significant custom coding would be required to handle job scheduling.

Do Hadoop and Big Data Solve All Our Data Problems?

Hadoop provides a new, complementary approach to traditional data warehousing that helps deliver on some of the most difficult challenges of enterprise data warehouses. Of course, it’s not a panacea, but by making it easier to gather and analyze data, it may help move the spotlight away from the technology towards the more important limitations on today’s business intelligence efforts: information culture and the limited ability of many people to actually use information to make the right decisions.
