Big Data Reference Model

A project that approaches Big Data as a purely technical challenge will not deliver results. It is about more than just massive Hadoop clusters and number-crunching. In order to deliver value, a Big Data project has to enable change and adaptation. This requires that there are known problems to be solved. Yet, identifying the problem can be the hardest part. It's often the case that you have to collect some information to even discover what problem to solve. Deciding how to solve that problem creates a need for more information and analysis. This is an empirical discovery loop similar to that found in any research project or Six Sigma initiative.

Diagram of the Big Data Reference Model

Handling the data itself is a technical challenge, and it can be a big one. Still, those other aspects of empirical learning, human decision-making and problem identification must also be addressed.

This model depicts a schematic form of the different aspects to consider, including both human and technological components. It helps discussions by guiding the team to consider the problem space first: the business context, needs, and decision cycles. Then it moves into the solution space: collection, analysis, visualization, and feedback mechanisms. Using this model, projects and project teams are encouraged to pause before implementation to make sure they understand the problem to be solved. They may find that the problem has yet to be identified.

The model is divided into three areas with mutual dependencies and feedback loops. The following sections will discuss each of these areas.

Investigation

This is where efforts begin, especially with the Problem Identification step. "Problem Identification" is always in terms of the business and its situation. A particularly well-formed problem might be "We must raise conversion rates by 0.5% with our top customer segment in consumer electronics this year." Notice that this statement has several dimensions: "what?" (conversion rate), "how much?" (0.5%), "where?" (top customer segment, consumer electronics), and "when?" (this year).
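The dimensions of a well-formed problem statement can be captured in a small structure. This sketch is purely illustrative (the class and field names are not part of the model); it just makes the "what / how much / where / when" checklist concrete:

```python
from dataclasses import dataclass

@dataclass
class ProblemStatement:
    """One problem statement, broken into the four dimensions above."""
    what: str      # the metric to move, e.g. "conversion rate"
    how_much: str  # the target delta, e.g. "+0.5%"
    where: str     # the scope, e.g. "top customer segment, consumer electronics"
    when: str      # the time frame, e.g. "this year"

    def is_well_formed(self) -> bool:
        # A statement is well formed only when every dimension is filled in.
        return all([self.what, self.how_much, self.where, self.when])

example = ProblemStatement(
    what="conversion rate",
    how_much="+0.5%",
    where="top customer segment, consumer electronics",
    when="this year",
)
```

A statement missing any dimension fails the check, which is exactly the signal that problem identification has stalled and needs more information.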

Most problems will not be so well defined. It is very common to see problem identification stall on the need for more information. In our example, "top customer segment" might not actually be known. In that case, one problem identification exercise spawns a subsidiary problem identification, which could then spawn another in a kind of fractal process.

Once a problem is identified, there is an accompanying hypothesis. In the case of our well-formed example, the explicit hypothesis might be that we can effect change in the conversion rate via pricing and promotions, presentation changes, or adjustments to the product assortment. In the subordinate problem, there is an implicit hypothesis that says "our customers can be segmented in meaningful ways, and some segments are more desirable than others."

Once a set of hypotheses is defined, three things must happen. First, testing the hypotheses almost always requires additional information. That may take the form of assigning customers to randomized trial groups, then attaching their group identification to all their downstream activity. Or, it may be a need to tap into a different data source that hasn't been used before (internal or external). Virtually all of these information needs will require some changes to existing applications, so identifying the needs before leaping into implementation is important.

In any of these cases, it is tempting to think that we could build a complete panopticon: a universal data warehouse with everything in the company. This is an expensive endeavor, and not a historically successful path. Whether structured or unstructured, any data store is suited to answer some questions but not others. No matter how much you invest in building the panopticon, there will be dimensions you don't think to support. It is better to skip the massive up-front time and expense, focusing instead on making it very fast and easy to add new data sources or new elements to existing sources.

The second thing that happens after describing the hypotheses is creating a way to effect change. For example, if the hypothesis says "we can increase our conversion rate as needed by broadening our product assortment," then there must be some way to add the additional products into the system. This is likely to be well-supported, since it resembles an ordinary business function!

Most often, testing the hypothesis will require some code or data changes in the applications. (Even our expanded assortment hypothesis might require adding an attribute to the applications' databases, to track products and orders that are part of the expanded assortment.) These changes will include ways to split or group entities, apply parameters to the system, or apply differential parameters to test groups. We'll refer to all of these operations as "parameterization" in the general sense: adjusting the behavior of the systems in the space of all possible behaviors.
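One common shape of parameterization is applying differential parameters to test groups: each group sees slightly different system behavior, with the control group keeping the defaults. A minimal sketch, assuming illustrative group names and parameters (none of these are prescribed by the model):

```python
# Differential parameterization: each test group sees different behavior.
# Group names and parameter keys here are illustrative assumptions.
DEFAULTS = {"discount_pct": 0.0, "expanded_assortment": False}

GROUP_OVERRIDES = {
    "control": {},                            # control group keeps defaults
    "test_a":  {"discount_pct": 5.0},         # test a pricing change
    "test_b":  {"expanded_assortment": True}, # test a broader assortment
}

def parameters_for(group: str) -> dict:
    """Merge a group's overrides onto the system defaults."""
    params = dict(DEFAULTS)
    params.update(GROUP_OVERRIDES.get(group, {}))
    return params
```

Unknown groups fall back to the defaults, so a misconfigured assignment degrades to control behavior rather than failing.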

At this stage, we expect that parameterization is based on ad-hoc analysis, which is the third thing to follow the hypotheses. Because this is still within the empirical discovery loop, we don't want to invest in fully automated machine learning and feedback. That will follow once we validate a hypothesis and want to integrate it into our routine operation.

Ad-hoc analysis refers to human-based data exploration. This can be as simple as spreadsheets with line graphs or as sophisticated as a room full of quants running K programs at light speed. The key aspect is that most of the tools are interactive. Questions are expressed as code, but that code is usually just "one shot" and is not meant for production operations. The main goal of ad-hoc analysis is to discover and validate, not to optimize. Ad-hoc analysis operates on the same kind of large-scale data store that the production tools will operate on. The goal of this activity is to validate the hypotheses and help discover the next round of problems to solve.

Ad-hoc analysis continues even after one set of solutions is operating in production. The analysts should always be working a step ahead of the production environment, looking for the next set of problems or opportunities.

The final activity in Investigation usually happens much later. After one set of hypotheses is validated, the solution should be implemented and operationalized. At that point, the humans' attention will shift to the next set of problems, but they should also keep an eye on the results of all the previous efforts. This is necessary because the competitive environment shifts constantly and your big solution today may be neutralized by your competitors tomorrow. In addition, the initial experiments may show only transient effects due to novelty or even the Hawthorne effect. So humans need good information, presented in human-digestible ways, to make good decisions. These decisions will feed back into the applications in a way similar to the Needs Identification activity. The main difference is that routine decisions probably use knobs and levers that already exist rather than creating new ones.

Now let's turn our attention to the implementation area.

Implementation

Implementation deals with the technology: applications, data stores, and integration among them. The goal of implementation is to harvest and collect raw data for downstream use.

Implementation covers all sources of data. That includes applications producing business-level metrics and events, but it also includes system and network metrics. (For example, there's a surprising amount of knowledge to be gleaned from web logs.) Implementation increasingly includes external data sources such as advertising and social networks.

The implementation area also covers ongoing changes to applications in order to introduce new parameters or ways to alter the parameters. For example, if Needs Identification says that arriving customers must be split into a control group and a test group, then something in the application or infrastructure needs to support parameters like: "number of groups", "percentage of arrivals in each group", "stickiness of group assignment", and so on.

It is also usually necessary to propagate such parameters along to downstream systems, which may require other incremental changes to the applications or databases.
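One way to support the parameters named above ("number of groups", "percentage of arrivals in each group", "stickiness of group assignment") is deterministic hashing: hashing the customer id means the same customer always lands in the same group, with no assignment table to store or propagate. A sketch under those assumptions (function names and the salt are illustrative):

```python
import hashlib

def assign_group(customer_id: str, weights: list[float], salt: str = "exp-001") -> int:
    """Deterministically ("stickily") assign a customer to a group.

    `weights` gives the fraction of arrivals in each group, so
    len(weights) is the number of groups. Changing the salt
    reshuffles assignments for a new experiment.
    """
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode()).hexdigest()
    point = int(digest, 16) / 16**64  # uniform in [0, 1)
    cumulative = 0.0
    for group, weight in enumerate(weights):
        cumulative += weight
        if point < cumulative:
            return group
    return len(weights) - 1  # guard against floating-point rounding

def tag_event(event: dict, weights: list[float]) -> dict:
    """Propagate the group id onto a downstream event record."""
    return {**event, "group": assign_group(event["customer_id"], weights)}
```

Because the assignment is a pure function of the id, downstream systems can recompute it rather than requiring every upstream system to pass it along, though attaching it to events (as `tag_event` does) keeps the raw data self-describing.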

Raw data is where everything lands. The raw data activity includes any extraction tools needed in order to induct data from the sources. "Raw data" may actually include a processing stage or two if incoming sources need to be joined or split.

A word of caution about data cleansing: it is tempting to canonicalize data here, removing unmatched or otherwise "not sane" records. That is usually a bad idea. Records deleted from raw data become "dark matter" that cannot be seen, reported on, or even compared to the amount of "visible matter" later. It is better to handle this with projections.
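The projection approach can be sketched as follows: keep every record in raw storage and apply validity as a read-time view. The validity rule here (a record must carry a customer id and a non-negative amount) is purely an illustrative assumption:

```python
def is_sane(record: dict) -> bool:
    # Illustrative validity rule; the real one depends on the data source.
    return bool(record.get("customer_id")) and record.get("amount", -1) >= 0

def project_sane(raw: list[dict]) -> list[dict]:
    """The cleaned view that downstream analysis consumes."""
    return [r for r in raw if is_sane(r)]

def dark_matter_ratio(raw: list[dict]) -> float:
    """Because nothing was deleted, the rejected fraction stays measurable."""
    if not raw:
        return 0.0
    return 1 - len(project_sane(raw)) / len(raw)
```

The payoff is the last function: with deletion, the ratio of dark to visible matter is unknowable; with a projection, it is one query away.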

Technical teams have a tendency to jump into the implementation area first. That's not necessarily wrong, but it may be incomplete. Starting with implementation can be useful when it helps discover and identify problems. In that case, the process would begin with collection and raw data, then follow the path through ad-hoc analysis and into problem identification.

Production

Production deals with making routine operations out of useful information.

Once a hypothesis is validated, we would like to integrate the new thing we've learned into everyday activity. This is a job for automation and feedback. Like the other areas, "Production" covers several distinct activities.

"Automated Information Extraction" refers to processing the raw data to find interesting information contained within it. This step usually involves a wide variety of tools and languages. Such a polyglot environment can seem chaotic, and it can be difficult to support, but the goal is not to produce a cohesive software environment. Rather, it is to express the fundamental algorithms in the most expressive and performant way possible, given limited time for implementation and a limited lifespan of the algorithm itself. Extraction can include aggregating, grouping, joining, or splitting data. These programs should be small, loosely coupled, and, in contrast to the ad-hoc analysis code, well-engineered. They must be resilient to messy input, changing data volumes, and partial system failures.
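A minimal sketch of one such extraction program, under the assumption that the raw events carry `day`, `sku`, and `qty` fields (all names illustrative). The point is the shape: small, single-purpose, and resilient to messy input, which is skipped and counted rather than being fatal:

```python
from collections import defaultdict

def aggregate_daily_totals(records) -> dict:
    """Condense raw events into (day, sku) -> total units sold.

    Records missing fields or carrying non-numeric quantities are
    skipped and tallied, not allowed to crash the job.
    """
    totals = defaultdict(int)
    skipped = 0
    for rec in records:
        try:
            day, sku = rec["day"], rec["sku"]
            qty = int(rec["qty"])
        except (KeyError, TypeError, ValueError):
            skipped += 1
            continue
        totals[(day, sku)] += qty
    return {"totals": dict(totals), "skipped": skipped}
```

Reporting the skipped count alongside the totals keeps the "dark matter" visible, consistent with the raw-data caution above.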

The output of automated information extraction is more data, in a condensed format. This digested data could be used directly by applications, but more frequently goes through an augmentation step first.

Augmentation refers to processing that adds new information to the data. This is where we find the statistical and machine learning algorithms. The specific kinds of processing to apply here were defined by the earlier ad-hoc analysis and validation. It would be rare, however, to see exactly the same code running in production operations as on the analyst's workstation. The augmentation code will be a new implementation of the same algorithm.
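In code, augmentation is simply a transformation that emits the input record plus a derived field. The thresholds below are a stand-in for whatever rule the ad-hoc analysis actually validated; the shape (record in, enriched record out) is the point:

```python
def augment_with_segment(record: dict) -> dict:
    """Add a derived 'segment' field to a digested customer record.

    The thresholds here are illustrative stand-ins for the validated
    model; in production this would be a fresh, engineered
    implementation of that algorithm, not the analyst's one-shot script.
    """
    spend = record.get("total_spend", 0.0)
    orders = record.get("order_count", 0)
    if spend > 1000 and orders > 10:
        segment = "top"
    elif orders > 0:
        segment = "active"
    else:
        segment = "dormant"
    return {**record, "segment": segment}
```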

As we discussed in the "Investigation" section, even once a feedback loop is automated in production, humans still need to see the outcomes. Therefore, presentation and visualization is part of ongoing operations. Presenting quantitative information is a specialization of design activities, so this will normally be done by people with command of both graphical design and statistics. They must also understand enough about the psychology of numbers to avoid triggering universal human cognitive biases and failures. Bad visualization provokes bad decisions, just as much as bad data does.

Finally, once the raw data has been digested and augmented, we reach "Optimization and Parameterization". In this activity, we take the validated hypothesis and implement it in software. This software has the power to change the parameters in applications in order to reach the optimum value of some tradeoff. For example, an optimization loop may continually adjust discounting to optimize the tradeoff between revenue and margin. Or, in a different example, it may optimize the tradeoff between new customer registrations and cost per acquisition.
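One step of such a loop can be sketched as a simple hill climb on a tradeoff objective. Everything here is an illustrative assumption: the blended objective, the probe step, and the idea that `measure(discount)` returns observed (revenue, margin) for a candidate discount:

```python
def objective(revenue: float, margin: float, alpha: float = 0.5) -> float:
    """The tradeoff being optimized: a weighted blend, not a single metric."""
    return alpha * revenue + (1 - alpha) * margin

def next_discount(current: float, measure, step: float = 0.5) -> float:
    """One hill-climbing step: probe a slightly higher and lower
    discount and move toward whichever scores better on the tradeoff.
    `measure(discount)` returns observed (revenue, margin)."""
    up, down = current + step, max(0.0, current - step)
    score_up = objective(*measure(up))
    score_down = objective(*measure(down))
    return up if score_up >= score_down else down
```

Run each cycle of the data loop, this nudges the discount toward the optimum of the blended objective rather than blindly maximizing either revenue or margin alone.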

Feedback from optimization back to the applications closes the decision loop and enables automated learning within the system, without requiring humans in the loop. As such, it can go as fast as the total data latency around the lower loop in the model. Batch processes will always have higher latency (so slower optimization) than real-time or streaming processes.

A few words of caution are in order about optimization and parameterization. This activity requires the highest level of organizational maturity about its data and processes. Not every company reaches this activity, or does it for every problem/hypothesis. This is the way to achieve the fastest response time to external events, which means that both great benefits and great harm can be achieved with astonishing speed. Automated optimization has been blamed for stock market crashes (the "Flash Crash") and patently ridiculous pricing events (the $23M textbook). These programs need to be designed very carefully, with governors and automatic limits baked in. They should optimize a trade-off equation rather than just maximizing some metric, or you risk damaging the health of your whole business in mechanistic pursuit of whichever metric has the fastest agent.
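A governor can be as simple as a clamp on both the absolute range and the rate of change of any automatically adjusted parameter. A minimal sketch (the function name and bounds are illustrative):

```python
def governed(proposed: float, current: float,
             lo: float, hi: float, max_step: float) -> float:
    """Clamp an automated parameter change: never leave [lo, hi], and
    never move more than max_step per cycle. Limits like these keep a
    runaway optimizer inside a survivable envelope."""
    step = max(-max_step, min(max_step, proposed - current))
    return max(lo, min(hi, current + step))
```

Wrapping every write from the optimization loop in a governor like this means that even a badly wrong model can only do bounded damage per cycle, buying time for humans to notice.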


Using this model to guide Big Data projects helps focus efforts on the most productive areas. It puts the first things first, by starting with problem and needs identification. The model incorporates both human and technological activities so we create a fully integrated set of nested decision loops at different time scales.