Big data. Predictive analytics. Machine learning. We've all heard these technology buzzwords tossed around, but what do they really mean?
To think clearly about the increasing role of data in business, organizations must navigate a new world of tools and terminology.
Here is a roundup of some of the more common concepts used to discuss big data, and some clarifications to aid our understanding of the transformations underway.
Big data is part of a family of tech buzzwords. It refers to vast digital output, generated by human and machine activity, or to its use in business when processed and made accessible. The first documented use of big data as a term is in a 1997 paper by NASA scientists.
While no universal or standard definition exists, key aspects of big data are known as the "Three Vs" -- its volume, velocity and variety.
Quantifying big data volume requires numbers so enormous as to boggle the mind. "As of 2012... it is estimated that Walmart collects more than 2.5 petabytes of data every hour from its customer transactions," say Andrew McAfee and Erik Brynjolfsson in the Harvard Business Review. "A petabyte is one quadrillion bytes, or the equivalent of about 20 million filing cabinets' worth of text."
Numbers like these add perspective to the Wikipedia definition of big data: "a term for data sets so large or complex that traditional data processing applications are inadequate to deal with them."
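To make the scale concrete, here is a minimal arithmetic sketch that uses only the figures quoted above (2.5 petabytes per hour, one quadrillion bytes per petabyte, about 20 million filing cabinets per petabyte). The function name `scale` and the choice of derived units are illustrative assumptions, not part of the cited sources.

```python
# Figures taken from the McAfee and Brynjolfsson quote above.
PETABYTE = 10**15               # "one quadrillion bytes"
CABINETS_PER_PB = 20_000_000    # "about 20 million filing cabinets' worth of text"


def scale(pb_per_hour: float) -> dict:
    """Convert an hourly petabyte rate into a few easier-to-picture units.

    Hypothetical helper for illustration only.
    """
    bytes_per_hour = pb_per_hour * PETABYTE
    return {
        "bytes_per_second": bytes_per_hour / 3600,
        "petabytes_per_day": pb_per_hour * 24,
        "filing_cabinets_per_hour": pb_per_hour * CABINETS_PER_PB,
    }


walmart = scale(2.5)
for name, value in walmart.items():
    print(f"{name}: {value:,.0f}")
```

At the quoted rate, the hourly figure works out to roughly 694 gigabytes every second, or 60 petabytes per day.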
Velocity refers to the speed of data creation. How much data is flowing, and how fast does it flow? Facebook users upload "more than 900 million photos every day," says Data Center Frontier. Keeping up with real-time demands to quickly process high volumes of data is a characteristic challenge of big data.
Variety refers to the format of the data structure. Structured data fits easily into a relational database or spreadsheet. Extracting insights is relatively easy for existing tools such as search queries, algorithms or simple operations.
Unstructured data describes most of the new data to be tapped. Its parts don't map neatly to fields in a spreadsheet or database, and they vary in format. Human-generated email messages all differ in content and attachments. Photos, videos, sound recordings, and much sensor or machine output are unstructured and varied.
In brief, big data is a collection of data offering advantageous insights. But it is so vast, fast and varied, it requires new and advanced methods to extract value.
Small data emerged on the heels of the term big data. It refers to "everything Big Data is not," explains Fred Shilmover of Datanami news. It’s the information that most small and medium businesses are generating and collecting themselves.
Small data differs from big data in that it may be created through human data entry. For example, it exists in enterprise resource planning (ERP) systems or in a customer relationship management (CRM) database. A Google Analytics table describing web traffic to a small business website is small data.
Small data is often used to describe the comparatively small datasets collected by individual smart devices, such as wearables, among the growing Internet of Things (IoT).
Its smaller volume should in no way diminish the business value of small data. In fact, the benefits to business are too often overlooked. "According to Forrester Research, most companies are analyzing a mere 12% of their existing data," says Shilmover. "That leaves a whopping 88% of data that businesses are flat out ignoring. Can you imagine the potential of actually leveraging that existing data to derive data-driven business insights?" Low-hanging business advantages may be realized just by "picking up the dollars that are effectively lying on the floor" in the form of small data.
Fast data is like a fire hose to a water supply. Volume isn't measured in warehouse dimensions of terabytes or petabytes. "We're measuring volume in terms of time: the number of megabytes per second, gigabytes per hour, or terabytes per day," writes John Hugg in InfoWorld.
Examples include data pulled from clickstreams, financial tickers, and sensors. The rise of IoT is a driving factor in the rise of fast data.
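Measuring volume in terms of time comes down to a rate calculation. The sketch below assumes a hypothetical batch of event sizes observed during a 10-second window and expresses the same flow in the three units Hugg mentions. The sample numbers and the function name `flow_rates` are made up for illustration.

```python
def flow_rates(total_bytes: int, window_seconds: float) -> dict:
    """Express one observed data flow in several time-based units.

    Uses decimal units (1 MB = 10**6 bytes). Hypothetical helper.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    bytes_per_second = total_bytes / window_seconds
    return {
        "megabytes_per_second": bytes_per_second / 10**6,
        "gigabytes_per_hour": bytes_per_second * 3600 / 10**9,
        "terabytes_per_day": bytes_per_second * 86_400 / 10**12,
    }


# Assumed sample: sizes (in bytes) of sensor messages seen in a 10-second window.
event_sizes = [4_000_000, 6_000_000, 10_000_000]
rates = flow_rates(sum(event_sizes), window_seconds=10)
print(rates)
```

The same stream looks modest per second but adds up quickly per day, which is why fast data is usually sized by its rate rather than by its stored volume.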
The value is often in its freshness. Leveraging fast data means having the ability to act on the information in real time, in the context of what is happening. Big data's velocity affords three key advantages, says Randy Bean in The Wall Street Journal. Fast data can power applications that are:
- Personal - tailored to individual user preferences, rather than aggregates or averages
- Contextual - 'smart' about what the user has already done and seen, and where he or she is located
- Interactive - quickly responsive to real-time actions
"Dirty data refers to data that contains erroneous information," explains Techopedia. Dirty data can arise from a number of causes, such as:
- Acquisition of outdated data
- Accidental duplications
- Intentional falsification
It is not usually practical to completely remove inaccuracies, errors or quality issues such as incomplete or duplicate data from a source, so dirty data presents a key management challenge.
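Two of the causes listed above, outdated records and accidental duplicates, can at least be flagged automatically. The following is a minimal sketch under stated assumptions: the record layout (`email`, `updated`), the one-year freshness cutoff, and the function name `clean` are all hypothetical. Intentional falsification generally cannot be caught by rules this simple.

```python
from datetime import date


def clean(records, today, max_age_days=365):
    """Split records into kept and rejected, with a reason for each rejection.

    Duplicates are detected by a normalized email address; records older
    than max_age_days are treated as outdated. Both rules are assumptions.
    """
    seen = set()
    kept, rejected = [], []
    for rec in records:
        key = rec["email"].strip().lower()
        if key in seen:
            rejected.append((rec, "duplicate"))
        elif (today - rec["updated"]).days > max_age_days:
            rejected.append((rec, "outdated"))
        else:
            seen.add(key)
            kept.append(rec)
    return kept, rejected


# Made-up sample records for illustration.
records = [
    {"email": "ann@example.com", "updated": date(2024, 5, 1)},
    {"email": "Ann@Example.com ", "updated": date(2024, 5, 2)},
    {"email": "bob@example.com", "updated": date(2019, 1, 1)},
]
kept, rejected = clean(records, today=date(2024, 6, 1))
print(len(kept), [reason for _, reason in rejected])
```

Keeping the rejected records with a reason, rather than deleting them silently, lets a person review borderline cases.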
Though the terms are sometimes used interchangeably, big data and data analytics are different concepts.
Data analytics refers to the process of transforming raw information into something a business can use.
Three main types of analytics, according to Dataconomy, are:
Descriptive analytics summarizes big numbers into condensed pieces of information. Its value lies in showing what happened. Social analytics, such as views, clicks, likes and shares, are examples of descriptive analytics.
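As a rough illustration of condensing raw numbers into a summary of what happened, the sketch below totals and averages a few social metrics. The post data is invented, and the `describe` helper and the click-through-rate field are assumptions added for the example.

```python
from statistics import mean

# Made-up engagement counts for three posts.
posts = [
    {"views": 1200, "clicks": 90, "likes": 40, "shares": 5},
    {"views": 800, "clicks": 40, "likes": 25, "shares": 2},
    {"views": 2000, "clicks": 150, "likes": 95, "shares": 20},
]


def describe(posts):
    """Summarize each metric as a total, a mean and a maximum."""
    summary = {}
    for metric in posts[0]:
        values = [post[metric] for post in posts]
        summary[metric] = {
            "total": sum(values),
            "mean": mean(values),
            "max": max(values),
        }
    # Derived ratio: share of views that led to a click.
    summary["click_through_rate"] = (
        summary["clicks"]["total"] / summary["views"]["total"]
    )
    return summary


summary = describe(posts)
print(summary["views"], summary["click_through_rate"])
```

Nothing here predicts or recommends anything; it only reports on past activity.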
Predictive analytics uses historical data to suggest what is most likely to happen in the future. It is not foolproof by any means. But with data mining, statistical analysis, algorithms and other approaches, systems can recommend or predict the likelihood of an event.
Amazon.com, for example, patented a system for shipping products not yet ordered, but sent to a general geographic area in anticipation of the actual order. The delivery address is supplied en route. The goal is to save time and money while delivering on two-day shipping for its customers.
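One of the simplest ways to let historical data suggest a future value is to fit a straight-line trend and extend it. The sketch below does this with ordinary least squares in plain Python. The monthly order counts are invented, and this is a generic textbook technique, not the method used in Amazon's patent.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up history: orders per month for months 1 through 6.
months = [1, 2, 3, 4, 5, 6]
orders = [100, 110, 125, 130, 145, 150]

slope, intercept = fit_line(months, orders)
forecast = slope * 7 + intercept  # projected orders for month 7
print(f"trend: {slope:.2f} orders/month, month 7 forecast: {forecast:.1f}")
```

A straight line is only a reasonable guess when the past pattern continues, which is one reason such predictions are never foolproof.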
Prescriptive analytics combines predictive information with hypothetical courses of action. It uses simulation and optimization algorithms to help target desired outcomes. The ultimate goal of prescriptive analytics is to point out ways of changing a company's future for the better. It answers the question "What should we do?" says analytics service provider Halo.
Artificial intelligence (AI) is a branch of computer science. A textbook definition, says Kris Hammond in ComputerWorld, "is to enable the development of computers that are able to do things normally done by people -- in particular, things associated with people acting intelligently."
AI technologies help businesses tackle the huge job of analyzing big data faster and in greater amounts than people could ever do.
AI systems vary by how strongly they simulate human reasoning. They also vary according to specificity of design and intended use. For example, an AI system that can turn your past behavior into recommendations for you will differ from a system that "can learn to recognize images from examples," explains Hammond.
Facial recognition is one popular application of artificial intelligence.
Machine learning is one approach to building artificial intelligence systems. In machine learning, a computer doesn't need to be "taught" to analyze specific data using only proven models. Instead, model building is automated. Each new iteration builds on the last.
The basic goal of machine learning is to create a framework in which a computer can teach itself how to solve a complex problem.
Google is on the front lines of machine learning for business, as are other search-driven services. "Pinterest uses machine learning to show you more interesting content. Yelp uses machine learning to sort through user-uploaded photos. NextDoor uses machine learning to sort through content on their message boards. Disqus uses machine learning to weed out spammy comments," reports Lukas Biewald in TechCrunch.
User-generated content (UGC) will likely join other forms of web content to be combed and ranked by machine learning. On average, most UGC is "awful," Biewald claims. "It's actually way worse than you think.... But by identifying the best and worst UGC, machine-learning models can filter out the bad and bubble up the good" without direct human help.
The Future of Decision Making
Many businesses claim to be "data driven," while others specify that they are "data informed." Is there a difference?
Data Driven vs. Data Informed
A truly data-driven activity is one where decisions about strategy are based on insights from data, rather than being primarily driven by intuition. It may also be called evidence-based decision making (per Wikipedia).
A data-informed approach is about finding the balance between objective facts and subjective viewpoints and intuition.
One way companies succeed in realizing more value from data is by creating a data-driven culture. While it can take time to make this shift, companies can start simply, by considering the saying, "you can't manage what you don't measure."
The rising importance of data brings with it the need to find wisdom in the numbers. The goal is business advantage, and it demands clear thinking about the new tools and terminology used to manage data and apply it in decision-making.