
Thursday, February 11, 2021

Mastering Fractured Data

Data complexity in companies can be a big obstacle to achieving efficient operations and excellent customer service.

Companies are broken down into various departments. They have hundreds, thousands, or even hundreds of thousands of employees performing various tasks. Adding to the complexity, customer information is stored in so many different applications that wide gaps exist among data sources. Bridging those gaps so every employee in the organization has a consistent view of data is possible and necessary.

Various applications collect customer information in different ways. For example, CRM solutions focus on process management and not on data management.

Consequently, customer data is entered into numerous autonomous systems that were not designed to talk to one another. Client data is housed one way in a sales application, another way in an inventory system, and yet another way in contact center systems.

Other organizational factors further splinter the data, which can vary depending on the products in which a customer is interested, where the product resides, and who (the company or a partner) delivers it.

In addition, information is entered in various ways, including manually, either by the customer or an employee, or via voice recognition. And applications store the information in unique ways. One system might limit the field for customers’ last names to 16 characters while another could allow for 64 characters.

The challenge is further exacerbated by software design and vendors’ focus. CRM vendors concentrate on adding application features and do not spend as much time on data quality.

Customers can input their personal information in 10 different ways. Most applications do not check for duplication when new customer information is entered.
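
To illustrate, here is a minimal sketch of the kind of duplicate check that many entry forms skip. It builds a simple match key from a couple of identifying fields; the field names and normalization rules are assumptions for the example, and real MDM tools use far richer matching (phonetic, address, and fuzzy scoring).

def normalize(record):
    """Build a simple match key from fields that identify a customer."""
    email = record.get("email", "").strip().lower()
    last = record.get("last_name", "").strip().lower()
    zip_code = record.get("zip", "").strip()[:5]
    return email or f"{last}|{zip_code}"

def is_duplicate(new_record, existing_records):
    """Return True when an existing record produces the same match key."""
    new_key = normalize(new_record)
    return any(normalize(r) == new_key for r in existing_records)

existing = [{"email": "Jane.Doe@Example.com", "last_name": "Doe", "zip": "94105"}]
candidate = {"email": " jane.doe@example.com", "last_name": "DOE", "zip": "94105-1234"}
print(is_duplicate(candidate, existing))  # True: same customer entered with different formatting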

Human error creates additional problems. Employees are often quite busy, move frequently and quickly from one task to the next, and, consequently, sometimes do not follow best practices fully.

Data becomes fractured, and different versions of the truth appear. The data features a tremendous amount of duplication, inconsistency, and inefficiency.

The inconsistencies exist because fixing such problems is a monumental task, one that requires companies to tackle both technical and organizational issues. Master data management (MDM) solutions, which have been sold for decades, are designed to address the technical issues. They are built to clean up the various inconsistencies, a process dubbed data cleansing.

The work sounds straightforward, but it is time-consuming and excruciatingly complex. The company has to audit all of its applications and determine what is stored where and how it is formatted. In many cases, companies work with terabytes and petabytes of information. Usually, they find many more sources than initially anticipated because cloud and other recent changes enable departments to set up their own data lakes.

Cleansing Process

Cleansing starts with mundane tasks, like identifying and fixing typos. The MDM solution might also identify where necessary information is missing.

To start the process, companies need to normalize fields and field values and develop standard naming conventions. The data clean-up process can be streamlined in a few ways. If a company chooses only one vendor to supply all of its applications, the chances of data having a more consistent format increase. Typically, vendors use the same formats for all of their solutions. In some cases, they include add-on modules to help customers harmonize their data.
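
As a rough illustration of field normalization, the snippet below standardizes a few customer fields with pandas. The column names, the phone and state rules, and the source values are all invented for the example; a real clean-up would be driven by the naming conventions the company defines.

import pandas as pd

# Hypothetical extract from two applications with different conventions.
customers = pd.DataFrame({
    "last_name": ["O'BRIEN", " o'brien", "Smith   "],
    "phone":     ["(415) 555-0100", "415.555.0100", "4155550100"],
    "state":     ["California", "CA", "Calif."],
})

# Standardize case and whitespace.
customers["last_name"] = customers["last_name"].str.strip().str.title()

# Keep digits only so phone numbers compare consistently.
customers["phone"] = customers["phone"].str.replace(r"\D", "", regex=True)

# Map free-text state values to a standard code.
state_map = {"California": "CA", "Calif.": "CA", "CA": "CA"}
customers["state"] = customers["state"].str.strip().map(state_map)

print(customers)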

But that is not typically the case. Most companies purchase software from different suppliers, and data cleaning has largely been done in an ad hoc fashion, with companies harmonizing information application by application. Recognizing the need for better integration, suppliers sometimes include MDM links to popular systems, like Salesforce Sales Cloud, Microsoft Dynamics, and Marketo.

Artificial intelligence and machine learning are emerging to help companies grapple with such issues, but the work is still in the very early stages of development.

Still other challenges stem from internal company policies—or a lack thereof—and corporate politics. Businesses need to step back from their traditional departmental views of data and create an enterprise-wide architecture. They must understand data hierarchies and dependencies; develop a data governance policy; ensure that all departments understand and follow that policy; and assign data stewards to promote it.

The relationship between company departments and IT has sometimes been strained. The latter’s objectives to keep infrastructure costs low and to put central policies in place to create data consistency often conflict with the company departments' drivers. And while departments have taken more control over the data, they often lack the technical skills to manage it on their own.

It is a good idea to start with a small area and then expand to other areas.

Clean, organized data makes a company's operations much more effective and enables it to optimize customer service. Companies can take steps to improve their data quality.

Please contact us for more information or for a free consultation.

Wednesday, November 30, 2016

Three Values of Big Data

Big Data is everywhere. But to harness its potential, organizations should understand the challenges that come with collecting and analyzing Big Data. 

The three values that are important in managing big data are volume, velocity, and variety. These three factors serve as guidance for Big Data management, highlighting what businesses should look for in solutions.

But even as organizations have started to get a handle on these three V’s, two other V’s, veracity and value, are just as important, if not more so.

Volume is the ability to ingest, process, and store very large data sets. The definition of "very large" varies by business and depends on the particular circumstances of the business problem, as well as the data volumes that business has handled previously.

Volume can also be defined as the number of rows, or the number of events that are happening in the real world that are getting captured in some way, a row at a time. Accordingly, the more rows that you have, the bigger the data set is going to be.

Bigger Volumes, Higher Velocities

In today’s digital age, having huge volumes of data is hardly rare. The proliferation of mobile devices ensures that companies can gather more data on consumers than ever before, and the rise of the Internet of Things will only increase this plethora of data. Moreover, businesses will have even more information on customers as they begin to use one-on-one messaging channels to interact directly with them.

The sheer volume of data available to us is greater than ever before. In fact, in many ways, nearly every human action can be quantified and logged in a bank of data that’s growing at an incredibly fast rate. All of this data can be turned into actionable insights that drive business decisions and can help transform every customer interaction, create operational efficiency, and more.

This increase in data volume is paired with a simultaneous increase in speed: both the amount of data and the rate at which it arrives keep growing. These increases have forced IT staff to spend more time figuring out how to process and analyze that data.

Velocity is the key V of the three V’s. For example, a customer will visit a company’s site or use its mobile application but only for a short amount of time. The business may have just seconds to gather customer information and deliver a relevant response based on that information, usually just one message or offer.

This quick turnaround time requires you to process all of that real-time behavioral data as fast as possible. If you only learn the day after that a customer was on your Web site, you can no longer reach them in the moment. One aspect of a successful customer journey is being able to send the right message at the right time to the right customer. Timeliness and relevancy are the foundation of delivering personalized customer experiences in real time.
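
A toy sketch of what acting on velocity can look like: events are consumed as they arrive, and a response is produced only while the event is still fresh. The event fields, the offer rule, and the one-second budget are assumptions for illustration; a production system would use a streaming platform such as Kafka or Kinesis.

import time
from collections import deque

# Hypothetical stream of behavioral events.
events = deque([
    {"customer_id": 42, "page": "/pricing", "ts": time.time()},
    {"customer_id": 7,  "page": "/home",    "ts": time.time()},
])

OFFER_PAGES = {"/pricing", "/checkout"}
RESPONSE_BUDGET_SECONDS = 1.0  # the window in which a message is still relevant

def decide_offer(event):
    """Return an offer only while the event is still fresh and relevant."""
    age = time.time() - event["ts"]
    if age > RESPONSE_BUDGET_SECONDS:
        return None  # too late: the visitor has likely moved on
    if event["page"] in OFFER_PAGES:
        return f"send discount offer to customer {event['customer_id']}"
    return None

while events:
    action = decide_offer(events.popleft())
    if action:
        print(action)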

A Variety of Formats

Data sets come in a variety of formats, and the number of data types continues to grow. Radio-frequency identification (the use of electromagnetic fields to gather information from tags attached to objects), smart metering (devices that monitor information on energy consumption for billing purposes), and the ubiquity of mobile devices with geo-location capabilities are only a few examples of diverse sources of consumer information.

All of these technologies have their own methods of capturing and publishing data, which adds to the complexity of the information environment.

But overcoming these data complexities could be well worth it. Having a large variety of data is crucial for creating a holistic customer view. Access to data such as a customer’s purchasing history, personal preferences based on social media postings, exercising habits, caloric intake, and time spent in the car can help companies understand that customer on a deeper level, and thus build experiences that are tailored to that customer.

But this diversity of data sources can be a blessing and a curse. A blessing because businesses have an increasingly large range of channels from which to pull customer information, but a curse because it can be difficult to filter through that information to find the most valuable content.

Variety is sometimes overstated in discussions of Big Data. Audio and video are examples of data formats that can be particularly difficult to analyze. Usually, companies try to come up with an intermediate representation of that data and then apply old or new algorithms to that representation to extract signals, whatever the definition of signal is for the business problem they are trying to solve.

Volume, velocity, and variety are undoubtedly important to managing customer information. Companies should keep in mind other important aspects of big data if they want to make the most of it.

Data tools such as Apache Hadoop and Apache Spark have enabled new methods of data processing that were previously out of reach for most organizations. While the growing volume of data, the time needed to process it, and the sheer number of input sources pose challenges for businesses, all three can largely be addressed through technology.
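
As a hedged example of the kind of processing these tools make routine, the PySpark sketch below aggregates a large file of customer interactions; the file path and column names are placeholders, and on a real cluster the same code would scale out across many nodes.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spin up a local session; in production this would run on a cluster.
spark = SparkSession.builder.appName("interaction-counts").getOrCreate()

# Placeholder path and columns: a large CSV of customer interactions.
interactions = spark.read.csv("interactions.csv", header=True, inferSchema=True)

# Count interactions per channel per day, a job that scales out across nodes.
daily_counts = (
    interactions
    .groupBy("channel", F.to_date("timestamp").alias("day"))
    .agg(F.count("*").alias("interactions"))
    .orderBy("day", "channel")
)

daily_counts.show()
spark.stop()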

New V's Emerge

Investment in Big Data has begun to stabilize and enter a maturity phase over the past year. It will take time for infrastructure and architectures to mature, and best practices should be developed and refined against these architectures.

Organizations should consider how to use Big Data to bring about specific outcomes, in other words, organizations should examine the challenges of Big Data from a business perspective as opposed to a technical one. A framework that incorporates the business-oriented characteristics of veracity and value can help enterprises harness Big Data to achieve specific goals.

Not all data is the same, but organizations may not be paying enough attention to changes within individual data sets. Contextualizing the structure of the data stream is essential. This includes determining whether it is regular and dependable or subject to change from record to record, or even with each individual transaction. Organizations need to determine how the nature and context of data content in all its forms, text, audio, or video, can be interpreted in a way that makes it useful for analytics.

This is where the veracity, or trustworthiness, of data comes in. Determining trustworthiness is particularly important for third-party data, which should pass through a set of edits and validation rules before it is used.

Veracity entails verifying that data is suitable for its intended purpose, and usable within a given analytic model. Organizations should use several measurements to determine the trustworthiness and usefulness of a given data set. Establishing the degree of confidence in data is crucial so that analytic outputs based on that data can be a stimulus for business change.

Important metrics for evaluating and cleaning up data records are listed below; a short sketch of computing the first two follows the list:
  • completeness measurements, or the percentage of instances of recorded data versus all available data within a business ecosystem or market (or the percentage of missing fields within a data record);
  • uniqueness measurements, or the percentage of alternate or duplicate data records;
  • accessibility measurements, or the number of business processes and personnel that can benefit from access to specific data, or that can actually access that data;
  • relevancy measurements, or the number of business processes that utilize or could benefit from specific data;
  • scarcity measurements, or the probability that other organizations, including competitors and partners, have access to the same data (the scarcer the data, the more impact it has).
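
As a small, hedged illustration, the snippet below computes the first two of these measurements, completeness and uniqueness, for a hypothetical customer table with pandas; the column names and the duplicate key are assumptions.

import pandas as pd

# Hypothetical customer records pulled from several source systems.
records = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "email":       ["a@x.com", None, "b@x.com", "b@x.com", None],
    "phone":       ["4155550100", "4155550101", None, None, "4155550104"],
})

# Completeness: share of populated fields per column.
completeness = records.notna().mean().round(2)

# Uniqueness: share of records that are not duplicates on the chosen key.
uniqueness = 1 - records.duplicated(subset=["customer_id"]).mean()

print("Completeness by column:\n", completeness)
print("Uniqueness on customer_id:", round(uniqueness, 2))
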
Value is Paramount

While veracity can’t be overlooked, value is the most important factor. The first three V’s really concern architecture, infrastructure, and the representation of data, things that are important to IT organizations and far less interesting to business stakeholders.

The business stakeholders really don’t care about the first three; they only care about the value they can extract from the data. Executives often expect the analytical teams at their organizations to hide the first three V’s (volume, velocity, and variety) and deliver only the last V - the value that is fundamental to the success of the business.

The concept of value is essential for organizations to succeed in monetizing their data assets. Value is a property that helps identify the purpose, scenario, or business outcomes that analytic solutions seek to address. It helps to confirm what questions are to be answered and what actions will be taken as a result, and defines what benefits are anticipated from collecting and analyzing the data.

Value is a motivating force when it comes to developing new and innovative ideas that can be tested by exploring data in different ways.

The ability to pull valuable information from Big Data and use that information to build a holistic view of the customer is absolutely critical. It’s no longer just an option to develop one-to-one relationships with customers; it’s a requirement. And to build that relationship, companies have to leverage all the customer information they can to personalize every interaction with them.

By using such information to lead customers on a personal journey, companies can help ensure that customers will stay with them long term, and even become brand advocates. Value is derived from making the data actionable. Organizations can have all the information about a customer, but it’s what they can do with it that drives value for the business.

The Three V’s model of volume, velocity, and variety is useful for organizations that are just beginning to take control of their data, and certainly should not be forgotten by enterprises that have advanced further in their management of customer information.

The first three V’s are equally important. In the digital age, companies have accumulated more data than ever before, are pulling data from a variety of sources, and are handling data that arrives at an ever faster rate. A combination of these three factors can help organizations create relevant, personal, one-on-one customer interactions.

Deriving value is the ultimate business goal for any enterprise. The standard Three V’s model does not satisfactorily identify any data properties from a business usage perspective. Even though Big Data, and data in general, provides organizations with a lot of capabilities, the challenge for businesses is to make sure that they adapt how they think about the business processes, how they report on them, and how they define key performance indicators.

Organizations should try to get to the value. They need to turn data into value by figuring out how to use it to optimize business processes. In the end, the Three V’s model for Big Data is a useful starting point, but the ultimate goal, the one organizations must not lose sight of, is driving value.

Galaxy Consulting has 17 years of experience in big data management. We are at the forefront of driving value from big data.

Wednesday, July 27, 2016

Navigating Big Data

Big Data is an ever-evolving term used to describe the vast amount of unstructured data being generated. Published reports have indicated that 90% of the world’s data was created during the past two years alone.

Whether it’s coming from social media sites such as Twitter, Instagram, or Facebook, or from countless other Web sites, mobile devices, laptops, or desktops, data is being generated at an astonishing rate. Making use of Big Data has gone from a desire to a necessity. The business demands require its use.

Big Data can serve organizations in many ways. Ironically, though, with such a wealth of information at a company's disposal, the possibilities border on the limitless, and that can be a problem. Data is not going to automatically bend to a company's will. On the contrary, it has the potential to stir up organizations from within if not used correctly. If a company doesn't set some ground rules and figure out how to choose the appropriate data to work with, as well as how to make it align with the organization's goals, it's unlikely to get anything worthwhile out of it.

There are three layers of Big Data analytics, two of which lead to insights. The first of these, and the most basic, is descriptive analytics, which simply summarize the state of a situation. They can be presented in the form of dashboards, and they tell a person what's going on, but they don't predict what will happen as a result. Predictive analytics forecast what will likely happen, while prescriptive analytics guide users to action. Predictive and prescriptive analytics are the two layers that provide insights.
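
To make the distinction concrete, here is a small, hedged sketch in Python: the descriptive step summarizes what has already happened, while the predictive step fits a simple model to estimate what is likely to happen next. The data is synthetic and the feature names are invented for illustration.

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic interaction history: visits, support calls, and whether the customer churned.
data = pd.DataFrame({
    "visits":        [12, 3, 8, 1, 15, 2, 9, 4],
    "support_calls": [0, 4, 1, 5, 0, 6, 1, 3],
    "churned":       [0, 1, 0, 1, 0, 1, 0, 1],
})

# Descriptive analytics: summarize the current state.
print(data.describe())
print("Observed churn rate:", data["churned"].mean())

# Predictive analytics: estimate the likelihood of churn for a new customer.
model = LogisticRegression().fit(data[["visits", "support_calls"]], data["churned"])
new_customer = pd.DataFrame({"visits": [2], "support_calls": [5]})
print("Predicted churn probability:", model.predict_proba(new_customer)[0][1].round(2))

# Prescriptive analytics would go one step further, recommending an action
# (for example, a retention offer) whenever the predicted probability crosses a threshold.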

Presenting the analytics on a clean, readable user interface is vital but sometimes is ignored. Users get frustrated when they see content that they can't decipher. A canned dashboard does not work for users. They need to know what action they have to take. Users demand a sophisticated alert engine that will tell them very contextually what actions to take.

Using such analytics, ZestFinance was able to glean this insight: applicants who failed to properly use uppercase and lowercase letters while filling out loan applications were more likely to default later on. Knowing this helped the company improve on traditional underwriting methods, pushing it to incorporate updated models that took this correlation into consideration. As a result, the company was able to reduce the loan default rate by 40% and increase market share by 25%.

Unfortunately, insights have a shelf life. They must be interpretable, relevant, and novel. Once an insight has been incorporated into a strategy, it's no longer an insight, and the benefits it generates will cease to make a noticeable difference over time.

Getting the Right Data

To get the right data leading to truly beneficial insights, a company must employ a sophisticated plan for its collection. Having a business case around the usage of data is the first important step. A company should figure out what goals it would like to meet, how and why data is crucial to reaching them, and how this effort can help increase revenue and decrease costs.

Data relevance is the key, and what is important to a company is determined by the problems it is trying to solve. There is useful data and data that is not useful; it is important to distinguish between the two and weed out the latter. Collecting more than what is useful and needed is impractical.

Often data accumulates before a set of goals has been outlined by stakeholders. It is collected irrespective of any specific problem, question, or purpose. Data warehouses and processing tools such as Hadoop, NoSQL, InfoGrid, Impala, and Storm make it especially easy for companies to quickly amass large amounts of data. Companies are also at liberty to add third-party data sources, from companies such as Dun & Bradstreet, to enrich the profiles they already have. Unfortunately, most of the data is inevitably irrelevant. The key is to find data that pertains to the problem.

Big Data is nothing if not available, and it takes minimal effort to collect it. But it will not be of use to anyone if it’s not molded to meet the particular demands of those using it. Some people are under the impression that they are going to get a lot of information simply from having data. But businesses don’t really need Big Data - information and insight are what they need. While a vast amount of data might be floating around in the physical and digital universes, the information it contains may be considerably less substantial.

While it might seem advisable to collect as much information as possible, some of that information just might not be relevant. Relevant insights, on the other hand, allow companies to act on information and create beneficial changes.

It is a good idea to set parameters for data collection by identifying the right sources early on. It could be a combination of internal and external data sources. Determine some metrics that you monitor on an ongoing basis. Having the key performance indicators (KPIs) in place will help companies identify the right data sources, the types of data sources that can help solve their problems.

Technology plays a key role in harnessing Big Data. Companies should figure out what kinds of technology make sense for them; the choice should be based on the company's requirements.

Data collection is an ongoing process that can be adjusted over time. As the business needs change, newer data sources are integrated, and newer business groups or lines of businesses are brought in as stakeholders, the dynamics and qualities of data collection will change. So this needs to be treated not as a one-time initiative, but as an ongoing program in which you continually enrich and enhance your data quality.

Companies should continually monitor the success of their data usage and implementation to ensure they're getting what they need out of it. There should be a constant feedback stream so that a company knows where it stands in relation to certain key metrics it has outlined.

Risks

Companies must always be aware of the risks involved in using data. Companies shouldn't use prescriptive analytics when there is significant room for error. It takes good judgment, of course, to determine when the payoffs outweigh the potential risks. Unfortunately, it's not always possible to get a prescriptive read on a situation. There are certain limitations. For one thing, collecting hard data from the future is impossible.

People and Processes

Big Data adoption often becomes a change management issue and companies often steer clear of it. When a company implements something that's more data-driven, there's a lot of resistance to it.

Like most initiatives that propose technology as a central asset, Big Data adoption can create conflicts among the various departments of an organization. People struggle to accept data, but they also aren’t willing to give it up. To avoid such clashes, companies should make it clear from the outset which department owns the data. Putting the owner in charge of the data and having that person or department outline the business rules, and how they should be applied to customers, helps to overcome this issue.

These are two good tips to follow: Give credit where credit is due and don't dehumanize the job. Don’t attribute the success to the data, but to the person who does something with the data. Remember that change can't just come from the top down. Big Data adoption requires more than executive support. It needs buy-in from everyone.

Sunday, June 26, 2016

Better Business Operations with Better Data

Businesses today understand that data is an important enterprise asset, relied on by employees to deliver on their customers' needs and to make business decisions, among many other uses.

Yet too few organizations realize that addressing data quality is necessary to improve customer satisfaction. A recent Forrester survey shows that fewer than 20% of companies see data management as a factor in improving customer relationships. This is a very troubling number.

Not paying attention to data quality can have a big impact both on companies and the customers they serve. Following are just two examples:

Garbage in/garbage out erodes customer satisfaction. Customer service agents need to have the right data about their customers, their purchases, and prior service history presented to them at the right point in the service cycle to deliver answers. When their tool sets pull data from low-quality data sources, decision quality suffers, leading to significant rework and customer frustration.

Lack of trust in data has a negative impact on employee productivity. Employees begin to question the validity of underlying data when data inconsistencies and quality issues are left unchecked. This means employees will often ask a customer to validate product, service, and customer data during an interaction, which makes the interaction less personal, increases call times, and instills in the customer a lack of trust in the company.

The bottom line: high-quality customer data is required to support every point in the customer journey and ultimately deliver the best possible customer experience to increase loyalty and revenue. So how can organizations most effectively manage their data quality?

While content management systems (CMS) can play a role in this process, they can't solve the data-quality issue by themselves. A common challenge in organizations in their content management initiatives is the inability to obtain a complete trusted view of the content. To get started on the data-quality journey, consider this five-step process:

1. Don't view poor data quality as a disease. Instead, it is often a symptom of broken processes. Using data-quality solutions to fix data without addressing changes in a CMS will yield limited results. CMS users will find a work-around and create other data-quality issues. Balance new data-quality services with user experience testing to stem any business processes that are causing data-quality issues.

2. Be specific about bad data's impact on business effectiveness. Business stakeholders have plenty of data-quality frustrations. Often, they will describe poor data as "missing," "inaccurate," or "duplicate" data. Step beyond these adjectives to find out why these data-quality issues affect business processes and engagement with customers. These stories provide the foundation for business cases, highlight what data to focus on, and show how to prioritize data-quality efforts.

3. Scope the data-quality problem. Many data-quality programs begin with a broad profiling of data conditions. Get ahead of bottom-up approaches that are disconnected from CMS processes. Assess data conditions in the context of business processes to determine the size of the issue in terms of bad data and its impact at each decision point or step in a business process (a rough profiling sketch follows step 5 below). This links data closely to business-process efficiency and effectiveness, often measured through key performance indicators in operations and at executive levels.

4. Pick the business process to support. For every business process supported by CMS, different data and customer views can be created and used. Use the scoping analysis to educate CMS stakeholders on business processes most affected and the dependencies between processes on commonly used data. Include business executives in the discussion as a way to get commitment and a decision on where to start.

5. Define recognizable success by improving data quality. Data-quality efforts are a key component of data governance that should be treated as a sustainable program, not a technology project. The goal is always to achieve better business outcomes. Identify qualitative and quantitative factors that demonstrate business success and operational success. Take a snapshot of today's CMS and data-quality conditions and continuously monitor and assess them over time. This will validate efforts as effective and create a platform to expand data-quality programs and maintain ongoing support from business stakeholders and executives.
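
As a rough sketch of the profiling described in step 3, the snippet below measures how often records fail simple validation rules at each step of a hypothetical order process; the fields, rules, and process steps are all assumptions, and a real assessment would use the organization's own business rules.

import pandas as pd

# Hypothetical records tagged with the business-process step that consumed them.
orders = pd.DataFrame({
    "process_step":   ["quote", "quote", "fulfillment", "fulfillment", "billing", "billing"],
    "customer_email": ["a@x.com", None, "b@x.com", "c@x", "d@x.com", None],
    "shipping_zip":   ["94105", "94105", None, "10001", "60601", "6060"],
})

# Simple validation rules (assumptions for the example).
bad_email = orders["customer_email"].isna() | ~orders["customer_email"].str.contains(r"@.+\.", na=False)
bad_zip = orders["shipping_zip"].isna() | (orders["shipping_zip"].str.len() != 5)

orders["has_bad_data"] = bad_email | bad_zip

# Size of the issue at each step of the process.
print(orders.groupby("process_step")["has_bad_data"].mean().round(2))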

Galaxy Consulting has over 16 years of experience helping organizations make the best use of their data and improve it. Please contact us today for a free consultation!

Sunday, June 12, 2016

Hadoop Adoption

True to its iconic logo, Hadoop is still very much the elephant in the room. Many organizations have heard of it, yet relatively few can say they have a firm grasp on what the technology can do for their business, and even fewer have actually implemented it successfully.

Forrester Research predicted that Hadoop will become a cornerstone of the business technology agenda at most organizations.

Scalability, affordability, and flexibility make Hadoop uniquely suited to change the big data scene. An open-source software framework, Hadoop allows for the processing of big data sets across clusters of commodity hardware, either on-premises or in the cloud.

At roughly one-thirtieth the cost of traditional data storage and processing, Hadoop makes it realistic and cost effective to analyze all data instead of just a data sample. Its open-source architecture enables data scientists and developers to build on top of it to form customized connectors or integrations.

Typically, data analysis requires some level of data preparation, such as data cleansing and eliminating errors, outside of traditional data warehouses. Once the data is prepared, it is transferred to a high-performance analytics tool, such as a Teradata data warehouse. With data stored in Hadoop, however, users can see "instant ROI" by moving the data workloads off of Teradata and running analytics right where the data resides.

Another use of Hadoop is live archiving. Instead of backing up data and storing it in a data recovery system, such as Iron Mountain, users can store everything in Hadoop and easily pull it up whenever necessary.

The greatest power of Hadoop lies in its ability to house and process data that couldn't be analyzed in the past due to its volume and unstructured form. Hadoop can parse emails and other unstructured feedback to reveal similar insight.

The sheer volume of data that businesses can store on Hadoop changes the level of analytics and insight that users can expect. Because it allows users to analyze all data and not just a segment or sample, the results can better anticipate customer engagement. Hadoop is surpassing model analytics that can describe certain patterns and is now delivering full data set analytics that can predict future behavior.

There are, however, a few challenges.

Hadoop's ability to process massive amounts of data, for example, is both a blessing and a curse. Because it's designed to handle large data loads relatively quickly, the system runs in batch mode, meaning it processes massive amounts of data at once, rather than looking at smaller segments in real time. As a result, the system often forces users to choose between quantity and quality. At this point in Hadoop's life cycle, the focus is more on enormous data size than high-performance analytics.

Because of the large size of the data sets fed into Hadoop, the number-crunching doesn't take place in real time. This is problematic: as the gap grows between when data is captured and when a decision must be made based on it, the effectiveness of that data decreases.

The biggest problem of all is that Hadoop's seeming boundlessness instills a proclivity for data exploration in those who use it. Relying on Hadoop to deliver all the answers without asking the right questions is inefficient.

As companies begin to recognize Hadoop's potential, demand is increasing, and vendors are actively developing solutions that promise to painlessly transfer data onto Hadoop, improve its processing performance, and operationalize data to make it more business-ready.

Big data integration vendor Talend, for example, offers solutions that help organizations transition their data onto Hadoop in high volume. The company works with more than 800 connectors that link up to other data systems and warehouses to "pull data out, put it into Hadoop, and transform it into a shape that you can run analytics on."

While solutions such as those offered by Talend make the Hadoop migration more manageable for companies, vendors such as MapR tackle the batch-processing lag. MapR developed a solution that enhances the Hadoop data platform to make it behave like enterprise storage. It enables Hadoop to be accessed as easily as network-attached storage is accessed through the network file system; this means faster data management and system administration without having to move any data.

Veteran data solution vendors such as Oracle are innovating as well, developing platforms that make Hadoop easier to use and to incorporate into existing data infrastructures. Its latest updates revolved around allowing users to store and analyze structured and unstructured data together and giving users a set of tools to visualize data and find data patterns or problems.

RapidMiner's approach to Hadoop has been to simplify it, eliminate the need for end users to code, and do for Hadoop analytics what WordPress did for Web site building. Once usable insights are collected, RapidMiner can connect the data platform to a marketing automation system or other digital experience management system to deploy campaigns or make changes based on data predictions.

Moving forward, analysts predict that leveraging Hadoop's potential will become a more attainable goal for companies. Because it's open-source, the possibilities are vast. Hadoop's ability to connect openly to other systems and solutions will increase adoption in the coming months and years.

Saturday, April 23, 2016

Analytics for Big Data

Companies are just now beginning to harness the power of big data for the purposes of information security and fraud prevention.

Only 50% of companies currently use some form of analytics for fraud prevention, forensics, and network traffic analysis.

Less than 20% of companies use big data analytics to identify information, predict hardware failures, ensure data integrity, or check data classification, despite the fact that by doing so, companies are able to improve their balance of risk versus reward and be in a better position to predict potential risks and incidents.

Banks, insurance, and other financial institutions use big data analytics to support their core businesses. Large volumes of transactions are analyzed to detect fraudulent transactions and money laundering. These, in turn, are built into profiles that further enhance the analysis. Some insurance companies, for example, share and analyze insurance claims data to detect patterns that can point to the same fraudulent activities against multiple companies. Healthcare is another area in which data analysis can be used for information security.
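
A toy sketch of the idea: flag transactions whose amounts are unusual for a given account. Real systems combine many such signals with profiles built over time; the data, the z-score approach, and the 1.5 threshold here are assumptions for illustration only.

import pandas as pd

# Hypothetical transaction history.
txns = pd.DataFrame({
    "account": ["A", "A", "A", "A", "A", "A", "B", "B", "B"],
    "amount":  [25.0, 40.0, 31.0, 28.0, 36.0, 2900.0, 500.0, 520.0, 480.0],
})

# Score each transaction by how far it sits from the account's typical amount.
stats = txns.groupby("account")["amount"].agg(["mean", "std"])
txns = txns.join(stats, on="account")
txns["z_score"] = (txns["amount"] - txns["mean"]) / txns["std"]

# Flag transactions well above the account's usual range (assumed threshold).
print(txns[txns["z_score"] > 1.5][["account", "amount", "z_score"]])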

Big data can arise from internal and external sources, spanning social media, blogs, video, GPS logs, mobile devices, email, voice, and network data. It's estimated that 90% of the data in the world today has been created in the past two years, and some 2.5 million terabytes of data are created every day.

Although many companies already use data warehousing, visualization, and other forms of analytics to tap into this high-volume data, using that data to prevent future attacks or breaches remains relatively uncharted territory. This is changing and will continue to do so as security increasingly moves from being a technical to a business issue.

To balance the business benefits of big data analytics with the cost of storage, organizations need to regularly review the data they are collecting, determine why and for how long they need it, and where and how they should store it.

The Human Element of the Big Data Equation

Because data volumes grow considerably every day, deciphering all the information requires both technology and people-driven processes. People often find patterns that a computer can pass over. Some other steps organizations can take to analyze big data for information security purposes include the following:
  • Identify the business issue;
  • construct a hypothesis to be tested;
  • select the relevant data sources and provide subject matter expertise about them;
  • determine the analyses to be performed;
  • interpret the results.
Most companies struggle to find value from their customer analytics efforts. Poor data management, integration, and quality are the biggest inhibitors to making better use of customer analytics: 54% of surveyed companies have difficulty managing and integrating data from many varied sources, while 50% are concerned about consistent data quality.

Companies also struggle with assembling the right type of analytics professionals, communicating the results of the analysis to relevant colleagues, performing real-time analytics and making insights available during customer interactions, protecting data and addressing privacy concerns, and keeping pace with the velocity of data generation.

While key drivers of adoption include increasing customer satisfaction, retention, and loyalty, analytics use skews largely toward acquisition of new customers. 90% of surveyed companies use analytics for this purpose.

Other factors driving the use of analytics include reacting to competitive pressures, reducing marketing budgets, and addressing regulatory issues.

The use of predictive analytics is a growing trend, with 40% of organizations using it, while 70% have been using descriptive analytics and business intelligence reporting for more than 10 years.

Organizations that have already mastered basic analytics methodologies and gained efficiencies in aggregate analysis are now looking to adopt advanced ways to do real-time, future-looking analysis.

Additionally, companies would like to start using social data as a viable source of customer analytics. This was cited as a long-term goal by 17% of the companies in the survey.

Organizations should look beyond social media for unstructured data. While many marketers have embraced social media as an effective way to engage customers, from an analytics standpoint, they have only scratched the surface in how other data sources, such as call center data and voice-of-the-customer data, can feed traditional customer analytics processes.

Analytics can also be used to improve customer engagement, yet customer engagement ranks at the bottom of the list of metrics companies track. This is a missed opportunity for customer analytics practitioners to gain deeper insight into how individual customers interact with content, offers, and messaging across various touchpoints.

Customer analytics practitioners do a number of things right, including focusing on the right types of analytics and methodologies to achieve a basic understanding of who their customers are, their propensity to buy, how to target them effectively, and how best to experiment with content, features, and offers. But despite this, companies should develop a holistic customer analytics solution framework.

Although individual customer analytics techniques answer specific business questions, they fail to deliver efficiency in generating insights at an aggregate level.

Organizations should look outside their own four walls and connect with partners who are knowledgeable in analytics technology, analytical services, and data mining to explore the next steps for customer analytics. It is not just about buying an analytics tool, it is also about employing the professional services to make sense of the data.

Galaxy Consulting has 16 years of experience in the area of analytics. Contact us today for a free consultation and let's get started!

Monday, December 7, 2015

Data Lake

A data lake is a large storage repository and processing engine. Data lakes focus on storing disparate data and ignore how or why data is used, governed, defined and secured.

Benefits

The data lake concept hopes to solve information silos. Rather than having dozens of independently managed collections of data, you can combine these sources in the unmanaged data lake. The consolidation theoretically results in increased information use and sharing, while cutting costs through server and license reduction.

Data lakes can help resolve the nagging problem of accessibility and data integration. Using big data infrastructures, enterprises are starting to pull together increasing data volumes for analytics or simply to store for undetermined future use. Enterprises that must use enormous volumes and myriad varieties of data to respond to regulatory and competitive pressures are adopting data lakes. Data lakes are an emerging and powerful approach to the challenges of data integration as enterprises increase their exposure to mobile and cloud-based applications, the sensor-driven Internet of Things, and other aspects.

Currently the only viable example of a data lake is Apache Hadoop. Many companies also use cloud storage services such as Amazon S3, along with other open source tools such as Docker, as a data lake. Academic interest in the concept of data lakes is also growing gradually.

Previous approaches to broad-based data integration have forced all users into a common predetermined schema, or data model. Unlike this monolithic view of a single enterprise-wide data model, the data lake relaxes standardization and defers modeling, resulting in a nearly unlimited potential for operational insight and data discovery. As data volumes, data variety, and metadata richness grow, so does the benefit.

The data lake helps companies collaboratively create models or views of the data and then manage incremental improvements to the metadata. Data scientists and business analysts use the newest lineage-tracking tools, such as Revelytix Loom or Apache Falcon, to follow each other’s purpose-built data schemas. The lineage-tracking metadata is also placed in the Hadoop Distributed File System (HDFS), which stores pieces of files across a distributed cluster of servers in the cloud, where the metadata is accessible and can be collaboratively refined. Analytics drawn from the data lake become increasingly valuable as the metadata describing different views of the data accumulates.

Every industry has a potential data lake use case. A data lake can be a way to gain more visibility or to put an end to data silos. Many companies see data lakes as an opportunity to capture a 360-degree view of their customers or to analyze social media trends.

Some companies have built big data sandboxes for analysis by data scientists. Such sandboxes are somewhat similar to data lakes, albeit narrower in scope and purpose.

Relational data warehouses and their big price tags have long dominated complex analytics, reporting, and operations. However, their slow-changing data models and rigid field-to-field integration mappings are too brittle to support big data volume and variety. The vast majority of these systems also leave business users dependent on IT for even the smallest enhancements, due mostly to inelastic design, unmanageable system complexity, and low system tolerance for human error. The data lake approach helps to solve these problems.

Approach

Step number one in a data lake project is to pull all data together into one repository while giving minimal attention to creating schemas that define integration points between disparate data sets. This approach facilitates access, but the work required to turn that data into actionable insights is a substantial challenge. While integrating the data takes place at the Hadoop layer, contextualizing the metadata takes place at schema creation time.

Integrating data involves fewer steps because data lakes don’t enforce a rigid metadata schema as do relational data warehouses. Instead, data lakes support a concept known as late binding, or schema on read, in which users build custom schema into their queries. Data is bound to a dynamic schema created upon query execution. The late-binding principle shifts the data modeling from centralized data warehousing teams and database administrators, who are often remote from data sources, to localized teams of business analysts and data scientists, who can help create flexible, domain-specific context. For those accustomed to SQL, this shift opens a whole new world.
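
A hedged PySpark sketch of late binding: the raw JSON stays in the lake exactly as it arrived, and a schema is bound only when the query runs. The path and field names are placeholders, and the details of a real deployment (file formats, catalogs, security) would differ.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Raw JSON events sit in the lake as-is; no schema was enforced when they were written.
# Placeholder path: a directory of JSON files on HDFS or object storage.
RAW_PATH = "hdfs:///lake/raw/web_events/"

# One analyst binds a schema at query time, keeping only the fields this analysis needs.
clickstream_schema = StructType([
    StructField("customer_id", StringType()),
    StructField("page", StringType()),
    StructField("duration_seconds", DoubleType()),
])
events = spark.read.schema(clickstream_schema).json(RAW_PATH)

# A different team can bind a different schema to the same raw files tomorrow.
events.groupBy("page").avg("duration_seconds").show()
spark.stop()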

In this approach, the more that is known about the metadata, the easier the data is to query. Pre-tagged data, such as Extensible Markup Language (XML), JavaScript Object Notation (JSON), or Resource Description Framework (RDF), offers a starting point and is highly useful in implementations with limited data variety. In most cases, however, pre-tagged data is a small portion of incoming data formats.

Lessons Learned

Some data lake initiatives have not succeeded, producing instead more silos or empty sandboxes. Given the risk, everyone is proceeding cautiously. There are companies who create big data graveyards, dumping everything into them and hoping to do something with it down the road.

Companies can avoid creating big data graveyards by developing and executing a solid strategic plan that applies the right technology and methods to the problem. Hadoop and the NoSQL (Not only SQL) category of databases have potential, especially when they can enable a single enterprise-wide repository and provide access to data previously trapped in silos. The main challenge is not creating a data lake, but taking advantage of the opportunities it presents. A means of creating, enriching, and managing semantic metadata incrementally is essential.

Data Flow in the Data Lake

The data lake loads extracts, irrespective of their format, into a big data store. Metadata is decoupled from its underlying data and stored independently. This enables flexibility for multiple end-user perspectives and maturing semantics.

How a Data Lake Matures

Sourcing new data into the lake can occur gradually and will not impact existing models. The lake starts with raw data, and it matures as more data flows in, as users and machines build up metadata, and as user adoption broadens. Ambiguous and competing terms eventually converge into a shared understanding (that is, semantics) within and across business domains. Data maturity results as a natural outgrowth of the ongoing user interaction and feedback at the metadata management layer, interaction that continually refines the lake and enhances discovery.

With the data lake, users can take what is relevant and leave the rest. Individual business domains can mature independently and gradually. Perfect data classification is not required. Users throughout the enterprise can see across all disciplines, not limited by organizational silos or rigid schema.

Data Lake Maturity

The data lake foundation includes a big data repository, metadata management, and an application framework to capture and contextualize end-user feedback. The increasing value of analytics is then directly correlated with increased user adoption across the enterprise.

Risks

Data lakes also carry risks. The most important is the inability to determine data quality or to trace the lineage of findings by other analysts or users who have previously found value in the same data in the lake. By definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch.

Another risk is security and access control. Data can be placed into the data lake with no oversight of the contents. Many data lakes are being used for data whose privacy and regulatory requirements are likely to represent risk exposure. The security capabilities of central data lake technologies are still in their early stages.

Finally, performance aspects should not be overlooked. Tools and data interfaces simply cannot perform at the same level against a general-purpose store as they can against optimized and purpose-built infrastructure.

Careful planning and organization of a data lake strategy are required to make such a project a success.

Saturday, October 24, 2015

Humanizing Big Data with Alteryx

In my last post, I described Teradata Unified Data Architecture™ product for big data. In today's post, I will describe Teradata partner Alteryx which provides innovative technology that can help you to get the maximum business value from your analytics using the Teradata Unified Data Architecture.™

Companies can extract the highest value from big data by combining all relevant data sources in their analysis. Alteryx makes it easy to create workflows that combine and blend data from relevant sources, bringing new and ad hoc sources of data into the Teradata Unified Data Architecture™ for rapid analysis. Analysts can collect data within this environment using connectors and SQL-H interfaces for optimal processing.

Create Business Analytics in an Easy-to-Use Workflow Environment

Using the design canvas and step-by-step, workflow-based environment of Alteryx, you can create analytics and analytic applications. With a single click, you can put those applications and answers to critical business questions in the hands of those who need them most. And when business conditions and underlying data change, Alteryx helps you iterate your analytic applications quickly and easily, without waiting for an IT organization or expensive statistical specialists.

Base Your Decisions on the Foresight of Accessible Predictive Analytics

Alteryx helps you make critical business decisions based on forward-looking, predictive analytics rather than past performance or simple guesswork. By embedding predictive analytics tools based on the R open source statistical language or any of the in-database analytic capabilities, Alteryx makes powerful statistical techniques accessible to everyone in your organization through a simple drag-and-drop interface.

Understand Where and Why Things Happen: Location Matters

Whether you are building a hyper-local marketing and merchandizing strategy or trying to understand the value of social media investments, location matters. Traditionally, this type of insight has been in the hands of a few geo specialists focused on mapping and trade areas. With Alteryx, you can put location specific intelligence in the hands of every decision maker.

With the rise of location-enabled devices such as smart phones and tablets, consumer and business interactions increasingly include a location data-point. This makes spatial analysis more critical than ever before. Alteryx provides powerful geospatial and location intelligence tools as part of any analytic workflow. You can visualize where events are taking place and make location-specific decisions.

Alteryx can push custom spatial queries into the Teradata Database to leverage its processing power and eliminate data movement. You can enrich your spatial data within the Teradata system using any or all of these functions provided by Alteryx:
  • geocoding of data;
  • drive-time analytics;
  • trade area creation;
  • spatial and demographic analysis;
  • spatial and predictive analysis;
  • mapping.
Alteryx simplifies the previously complex tasks of predictive and spatial analytics, so every employee in your organization can make critical business decisions based on real, verifiable facts.

Deliver the Right Data for the Right Question to the Right Person

To answer today’s complex business questions, you need to access your sources of insight in a single environment. That is why Alteryx allows you to bring together data from virtually any data source, whether structured, unstructured, or cloud data, into an analytic application. Using Alteryx, you can extend the reach of business insight by publishing applications that let your business users run in-database analytics and get fast answers to their pressing business questions.

Teradata and Alteryx: Powerful Insights for Business Users

To exploit the opportunities of all their data, organizations need flexible data architectures as well as sophisticated analytic tools. Analysts need to rapidly gather, make sense of, and derive insights from all the relevant data to make faster, more accurate strategic decisions. But given the variety of potential data sources, it is difficult for any single tool to be most effective at capturing, storing, and exploring data. Using the Teradata Unified Data Architecture™ with Alteryx enables you to explore data from multiple sources and to deploy the insights derived from that data.

You can create sophisticated analytics, taking advantage of new, multi-structured data sources to deliver the most ROI. The combined solution:
  • integrates and addresses both structured and emerging multi-structured data, leveraging the Teradata Integrated Data Warehouse, Teradata Aster Discovery platform, and Hadoop to optimal advantage;
  • creates both in-database and cross-platform analytics quickly without requiring specialized SQL, MapReduce or R programming skills;
  • lets you combine the capabilities of the Alteryx environment with routines developed in other analytical tools within a single analytical workflow;
  • easily deploys analytics to the appropriate users beyond the analyst community.
The combined solution of Alteryx and the Teradata Unified Data Architecture™ provides an IT-friendly environment that supports the need to analyze data found inside and outside the data warehouse. Analysts and business users can leverage powerful engines to create and execute integrated applications. This kind of analysis is only possible with an environment that can bring together routines created by separate tools and running on different platforms.

Enhancing the Teradata Unified Data Architecture™ with the speed and agility of Alteryx creates a powerful environment for traditional and self-service analytics using integrated data and massively parallel processing platforms. It delivers:
  • a complete solution for the full life-cycle of strategic and big data analytics, from transforming, enriching and loading data to designing analytic workflows and putting easy-to-use analytic applications in the hands of business users;
  • improved ability to manage and extract value from structured and multi-structured data;
  • ability for business analysts to create data labs and perform predictive and spatial analytics on the Teradata data warehouse and Teradata Aster discovery platforms;
  • faster analytical processing within applications using in-database analytics in Teradata and SQL MapReduce functions in Teradata Aster.
The Alteryx solution helps customers with the Teradata Unified Data Architecture™ achieve these benefits by providing the following:
  • robust set of analytical functions;
  • access to a rich catalog of horizontal and industry-specific analytic applications in the Alteryx Analytics Gallery;
  • syndicated household, demographic, firmographic, map and Census data to enrich existing sources;
  • native data integration and in-database analytical support for Teradata data warehouse and Teradata Aster capabilities;
  • ability to leverage Teradata SQL-H™ for accessing Hadoop data from Aster or Teradata Database platforms.
Use Case: Predicting and Preventing Customer Churn

Problem

A global communication service provider is interested in preventing customer churn by identifying at-risk customers and providing special offers that reduce the likelihood of churn in a profitable way. To do this, it needs predictive analytics.

Solution

Teradata and Alteryx deliver an end-to-end analytic workflow, from data consumption and analysis to application deployment. Alteryx integrates and loads call detail records from diverse sources, along with customer data from the Teradata warehouse, into the Aster database to create a complete, rich data set for iterative analysis. You can run iterative discovery analysis to determine the key indicators behind customer churn and loyalty. These key indicators are captured as repeatable applications that enrich the data warehouse with churn and loyalty scores. In addition, the discovery analysis is captured and deployed to business users as a parameterized application for further iterative analysis.

Key Solution Components
  • Aster Discovery platform for deep analytics and segmentation;
  • Teradata data warehouse to operate and deploy insights and enriched data across the enterprise;
  • Alteryx for the user workflow engine to orchestrate data blending and analytics.
Benefits
  • ability to identify key customers that are likely churn candidates;
  • ability to determine problem spots on the network (cell sites, network elements) that are driving churn;
  • discovery of other key reasons for churn (performance, competitive offers);
  • knowledge of which offers have prevented churn for similar customers in the past;
  • ability to identify which offers will work and to evaluate the least-cost offer that prevents churn;
  • ability to make retention offers that keep customers from leaving;
  • deeper understanding of customer behavior.

Monday, October 12, 2015

Teradata - Analytics for Big Data

Successful companies know that analytics is the key to winning customer loyalty, optimizing business processes and beating their competitors. 

By integrating data from multiple parts of the organization to enable cross-functional analysis and a 360-degree view of the customer, businesses can make the best possible decisions. With more data and more sophisticated analytics, you can realize even greater business value.

Today businesses can tap new sources of data for business analytics, including web, social, audio/video, text, sensor data and machine-generated data. But with these new opportunities come new challenges.

For example, structured data (from databases) fits easily into a relational database model with SQL-based analytics. Other semi-structured or unstructured data may require non-SQL analytics, which are difficult for business users and analysts who require SQL access and iterative analytics.

Another challenge is identifying the nuggets of valuable data from among and between multiple data sources. Analysts need to run iterations of analysis quickly against differing data sets, using familiar tools and languages. Data discovery can be especially challenging if data is stored on multiple systems employing different technologies.

Finally, there is the challenge of simply handling all the data. New data sources often generate data at extremely high frequencies and volumes. Organizations need to capture, refine and store the data long enough to determine which data to keep, all at an affordable price.

To exploit the competitive opportunities buried in data from diverse sources, you need a strong analytic foundation capable of handling large volumes of data efficiently. Specifically, you need to address the following three capabilities:

Data Warehousing - integrated and shared data environments for managing the business and delivering strategic and operational analytics to the extended organization.

Data Discovery - discovery analytics to rapidly explore and unlock insights from big data using a variety of analytic techniques accessible to mainstream business analysts.

Data Staging - a platform for loading, storing and refining data in preparation for analytics.

The Teradata Unified Data Architecture™ includes a Teradata data warehouse platform and the Teradata Aster discovery platform for analytics, as well as open-source Apache Hadoop for data management and storage as needed.

Data Warehousing

The Teradata Active Enterprise Data Warehouse is the foundation of the integrated data warehouse solution; for smaller data warehouses or application-specific data marts, the Teradata Data Warehouse Appliance is a good fit.

Data Discovery

For data discovery, the Teradata platform uses patented SQL-MapReduce® on the Aster Big Analytics Appliance, providing pre-packaged analytics and applications for data-driven discovery. Mainstream business users can easily access this insight using familiar SQL-based interfaces and leading business intelligence (BI) tools. If you are performing discovery on structured data, a partitioned data lab in the data warehouse is the recommended solution.
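As a plain-Python illustration of the kind of question such discovery analytics answer (this is not the Aster SQL-MapReduce API, and the event data is invented), the sketch below asks: which three-event sequences most often precede a churn event?

# Plain-Python illustration of a "path analysis" style discovery question:
# what three-event sequences most often precede a churn event?
from collections import Counter
from itertools import groupby

# (customer_id, timestamp, event) tuples; values are made up for the sketch.
events = [
    ("A", 1, "dropped_call"), ("A", 2, "support_call"), ("A", 3, "bill_dispute"),
    ("A", 4, "churn"),
    ("B", 1, "login"), ("B", 2, "dropped_call"), ("B", 3, "support_call"),
    ("B", 4, "bill_dispute"), ("B", 5, "churn"),
    ("C", 1, "login"), ("C", 2, "payment"),
]

paths = Counter()
for customer, rows in groupby(sorted(events), key=lambda r: r[0]):
    sequence = [event for _, _, event in sorted(rows, key=lambda r: r[1])]
    if "churn" in sequence:
        i = sequence.index("churn")
        if i >= 3:
            paths[tuple(sequence[i - 3:i])] += 1

for path, count in paths.most_common():
    print(count, " -> ".join(path))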

Data Staging

Hadoop is an effective, low-cost technology for loading, storing and refining data within the unified architecture. However, Hadoop is not designed as an analytic platform.

The Teradata Data Warehouse Appliance and the Teradata Extreme Data Appliance offer cost-effective storage and analytics for structured data. The Teradata Unified Data Architecture™ integrates these components into a cohesive, integrated data platform that delivers the following capabilities:
  • unified management of both structured and unstructured data at optimal cost;
  • powerful analytics spanning SQL and MapReduce analytics;
  • seamless integration with the existing data warehouse environment and user skillset.
The Teradata Unified Data Architecture™ handles all types of data and diverse analytics for both business and technical users while providing an engineered, integrated and fully supported solution.

Monday, February 9, 2015

Using Big Data Efficiently in 2015

Will 2015 be the year that your enterprise is finally able to harness all of the customer data it has compiled over the years? Will there be ways to organize and use this information to impact the bottom line? Indeed, this data has become a form of capital for enterprises. So what will change in 2015?

Big Data Brands to Watch

Here are the areas to watch: secure storage and backup with encryption, reliable data management, and data visualization (DV) are the key ingredients of next-generation big data software.

As far as vendors are concerned, there are several players in the space, including Twitter-owned Lucky Sort, Tableau, Advanced Visual Systems, JasperSoft, Pentaho, Infogram, Tibco, 1010 Data, Salesforce, IBM, SAP, Hewlett-Packard, SAS, Oracle, Dell and Cisco Systems. These are a mix of independents and majors, but all have solid reputations in the industry. Choosing among them depends on numerous factors such as budget, IT systems already in place, preference, requirements, etc.

The Coming Influx of Big Data

Big data must be useful, and many professionals within all sorts of organizations are actively seeking out ways to use the data they have collected, rather than simply accumulating it.

Is your organization prepared for the influx of new users and devices that will flood the Internet and electronic communications, encapsulating customer interactions more than ever before? Many enterprises could be unprepared for the massive wave of data coming as billions of devices join the Internet. More devices, not just smartphones and computers, will be connected, bringing more data into organizations' servers. 

Gartner reportedly estimated the Internet of Things, or IoT, market at 26 billion devices by 2020 and Cisco thinks it will add $14 trillion in economic value by 2020. These devices include everything from household and office electronics and appliances to industrial manufacturing equipment. IoT will increase big data exponentially. It will hit pretty much every industry in a big way, but planning and preparing for the road ahead can ensure at least some adaptability for 2015 and beyond.

How to Deal With Data

Assess your organization's needs thoroughly, including a checklist of the IT systems in place and the needs or opportunities that exist there. IT management will likely find a multitude of ways to incorporate new systems or upgrades through the right software options. Try to find robust, dynamic systems that are tailored to the way information is used, or may be used, within and outside the organization. Also, explore ways to improve customer relationships through the targeting and taxonomy of their data. Big data will be a more useful asset in 2015.

After you have assessed needs and compiled a checklist of problem areas, implement changes so that you can make the most of your information. Above all, make sure that the data you have collected from customers, suppliers, personnel and others is accessible, useful and organized.

For example, good search software that can access thousands of records and display results based on varying factors is a great way to handle the problem of search. Such software is sometimes already part of your organization's CMS or other data-handling software and may simply not be fully utilized to make search more useful or easier.
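As a toy sketch of what search software does under the hood (not any vendor's product), the example below builds an inverted index that maps terms to record IDs and answers a query by intersecting the postings for each term.

# Minimal inverted-index sketch: map each term to the records that contain it,
# then answer a query by intersecting the term postings. Records are made up.
from collections import defaultdict

records = {
    1: "customer complaint about billing statement",
    2: "billing system upgrade scheduled for March",
    3: "customer loyalty program enrollment form",
}

index = defaultdict(set)
for record_id, text in records.items():
    for term in text.lower().split():
        index[term].add(record_id)

def search(query):
    """Return the IDs of records containing every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

print(search("customer billing"))   # {1}
print(search("billing"))            # {1, 2}

Real enterprise search products typically add ranking, stemming, security trimming and connectors to many repositories on top of this basic structure.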

Using Enterprise Search

Enterprise search applications vary by vendor, but you may recognize a few of the names immediately from the larger tech firms such as IBM, HP and Microsoft. There are also open-source options. Other vendors include Oracle, LucidWorks, Lexmark Perceptive Software, Sinequa and others. SharePoint, Microsoft's collaboration platform, is probably one of the most popular options. Google and IBM are also top companies in search technology, and many systems support multiple languages.

HP is a great example of useful enterprise search. Its flagship system is appropriately named Autonomy. Autonomy can crawl and index millions of records, including various types of content such as documents, audio, video and social media. Employees and customers have come to expect great search within their companies, as expectations for technology have climbed with the surge in everyday search use (such as Google searches on the web).

There are some important facets to search applications that should be noted. The HP Autonomy system, for instance, is capable of searching based on concept and context. This is becoming much more important in the era of big data. Searching through such large volumes of data requires some scrutiny to access the right information assets. Enterprise search applications can help with this obstacle.

Start with Little Data

It has been suggested that to deal with big data, you must first deal with little data: metadata, of course. Metadata are bits of information that offer insight into content, helping to optimize search. Essentially, it is information about information. Metadata can provide the context and concept information we referenced earlier in search applications.

Working with metadata can help with the overall process of keeping data organized and easy to access. The smaller pieces of information come together to become big information sets. Your team must start there to adequately solve this information overload problem.

Start by analyzing the exact needs or perceived functionality of the information. Taxonomy and terminology can be critical. Defining terms and putting them into contextual and conceptual order will help to provide a road map to access and utilize all of your team’s critical business data. This way, your data will actually become more valuable, too. Your information assets need to be managed in order to fully take advantage of them.
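As a minimal sketch of that idea, assuming a purely hypothetical controlled vocabulary, the example below tags documents with approved category terms so the resulting metadata stays consistent across systems.

# Minimal sketch of tagging content with a controlled vocabulary so that
# metadata stays consistent. The vocabulary and documents are hypothetical.
CONTROLLED_VOCABULARY = {
    "invoice": "Billing",
    "billing": "Billing",
    "contract": "Legal",
    "warranty": "Legal",
    "complaint": "Customer Service",
    "refund": "Customer Service",
}

def tag_document(text):
    """Return the approved category tags found in the text."""
    words = text.lower().split()
    return {CONTROLLED_VOCABULARY[w] for w in words if w in CONTROLLED_VOCABULARY}

documents = {
    "doc-001": "Customer complaint regarding a duplicate invoice",
    "doc-002": "Warranty terms attached to the service contract",
}

metadata = {doc_id: sorted(tag_document(text)) for doc_id, text in documents.items()}
print(metadata)
# {'doc-001': ['Billing', 'Customer Service'], 'doc-002': ['Legal']}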

Some Big Data Tips

Here are some general tips to help with organizing and using your data assets:

  • Perform usability testing of your organization’s tools for data management.
  • Develop a compliance and governance model for handling information.
  • Develop a master data management (MDM) plan to reinforce and promote compliance.
  • Assess taxonomy and develop a controlled vocabulary to keep data structured.
  • Compress files (such as PDF documents) when necessary to save on storage costs (a small example follows this list).
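As a small illustration of the last tip (the file name is hypothetical, and formats such as PDF are already compressed internally, so they gain less), the sketch below gzips a text file and compares sizes.

# Minimal sketch of the compression tip: gzip a file and compare sizes.
import gzip
import os
import shutil

source = "quarterly_report.txt"          # hypothetical file
with open(source, "w") as f:
    f.write("customer record\n" * 10_000)

with open(source, "rb") as f_in, gzip.open(source + ".gz", "wb") as f_out:
    shutil.copyfileobj(f_in, f_out)

print(os.path.getsize(source), "bytes before,",
      os.path.getsize(source + ".gz"), "bytes after")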

Tuesday, July 22, 2014

Hadoop and Big Data

During the last ten years, the volume and diversity of digital information have grown at unprecedented rates. The amount of information is doubling every 18 months, and unstructured information volumes grow six times faster than structured ones.

Big data is today's trend. Big data has been defined as data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process the data within a tolerable elapsed time.

Hadoop was created in 2005 by Doug Cutting and Mike Cafarella to address the big data issue. Doug Cutting named it after his son's toy elephant. It was originally developed for the Nutch search engine project. Nutch is an effort to build an open-source web search engine based on Lucene and Java for the search and index component.

As of 2013, Hadoop adoption is widespread. A number of companies offer commercial implementations or support for Hadoop. For example, more than half of the Fortune 50 use Hadoop. Hadoop is designed to process terabytes and even petabytes of unstructured and structured data. It breaks large workloads into smaller data blocks that are distributed across a cluster of commodity hardware for faster processing.

Ventana Research, a benchmark research and advisory services firm, published the results of its groundbreaking survey on enterprise adoption of Hadoop to manage big data. According to this survey:
  • More than one-half (54%) of organizations surveyed are using or considering Hadoop for large-scale data processing needs.
  • More than twice as many Hadoop users report being able to create new products and services and to enjoy cost savings compared with those using other platforms; over 82% benefit from faster analysis and better utilization of computing resources.
  • 87% of Hadoop users are performing or planning new types of analysis with large scale data.
  • 94% of Hadoop users perform analytics on large volumes of data not possible before; 88% analyze data in greater detail; while 82% can now retain more of their data.
  • 63% of organizations use Hadoop in particular to work with unstructured data such as logs and event data.
  • More than two-thirds of Hadoop users perform advanced analysis such as data mining or algorithm development and testing.
Today, Hadoop is being used as a:
  • Staging layer: the most common use of Hadoop in enterprise environments is as “Hadoop ETL” — pre-processing, filtering, and transforming vast quantities of semi-structured and unstructured data for loading into a data warehouse (a minimal sketch follows below).
  • Event analytics layer: large-scale log processing of event data: call records, behavioral analysis, social network analysis, clickstream data, etc.
  • Content analytics layer: next-best action, customer experience optimization, social media analytics. MapReduce provides the abstraction layer for integrating content analytics with more traditional forms of advanced analysis.
Most existing vendors in the data warehousing space have announced integrations between their products and Hadoop/MapReduce.
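To make the “Hadoop ETL” staging pattern above concrete, here is a minimal Hadoop-Streaming-style sketch in Python. The log layout is invented; with Hadoop Streaming the same mapper and reducer logic would read from stdin and write to stdout across the cluster, while here it runs in-process on a few sample lines.

# Minimal Hadoop-Streaming-style sketch of the "Hadoop ETL" staging pattern:
# the mapper filters and normalizes raw log lines, the reducer aggregates per key.
import sys
from itertools import groupby

def mapper(lines):
    """Emit (customer_id, bytes_used) for valid log lines, skipping bad records."""
    for line in lines:
        parts = line.strip().split(",")
        if len(parts) != 3:
            continue                      # refine: drop malformed records
        customer_id, event, bytes_used = parts
        if event == "data_session":
            yield customer_id, int(bytes_used)

def reducer(pairs):
    """Sum bytes per customer; Hadoop would deliver pairs sorted by key."""
    for customer_id, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield customer_id, sum(b for _, b in group)

raw_logs = [
    "A,data_session,1200",
    "A,voice_call,300",
    "B,data_session,800",
    "A,data_session,400",
    "corrupted line",
]

for customer_id, total in reducer(mapper(raw_logs)):
    sys.stdout.write(f"{customer_id}\t{total}\n")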
Hadoop is particularly useful when:
  • Complex information processing is needed.
  • Unstructured data needs to be turned into structured data.
  • Queries can’t be reasonably expressed using SQL.
  • Algorithms are heavily recursive.
  • Algorithms are complex but parallelizable, such as geo-spatial analysis or genome sequencing.
  • Machine learning is involved.
  • Data sets are too large to fit into database RAM or disks, or require too many cores (tens of TB up to PB).
  • Data value does not justify the expense of constant real-time availability; such data (archives, special-interest information) can be moved to Hadoop and remain available at lower cost.
  • Results are not needed in real time.
  • Fault tolerance is critical.
  • Significant custom coding would be required to handle job scheduling.
Do Hadoop and Big Data Solve All Our Data Problems?

Hadoop provides a new, complementary approach to traditional data warehousing that helps deliver on some of the most difficult challenges of enterprise data warehouses. Of course, it’s not a panacea, but by making it easier to gather and analyze data, it may help move the spotlight away from the technology towards the more important limitations on today’s business intelligence efforts: information culture and the limited ability of many people to actually use information to make the right decisions.