Real data, real problems: Pre-processing for Big Data analytics

In data-driven marketing, we talk a lot about clustering. Using specific parameters and algorithms, data scientists work to group customers into unique clusters that share certain attributes. Once these models are created, they can be tested, refined, and applied to marketing and product development.

But just like real life, real-life data has real problems. Before data scientists can begin cluster analysis, they need to know where their data came from and whether it can be used at all. This is why we spend so much time and effort gathering the right data and pre-processing it for use. Pre-processing is critical to data science because quality decisions come from quality data, which is the point of the computer scientist’s ‘garbage in, garbage out’ maxim. Here are some of the biggest problems with real-life data to watch for.

Data integration

Data scientists often have to integrate data from different sources in order to conduct Big Data analytics. Data coming from multiple places within your enterprise, as well as external data, needs to share a common format.

Think of how differing data sources can introduce inconsistencies that look trivial but have serious consequences, such as:

  • Variations in the number of characters allocated for customer names.
  • Whether ZIP codes use the same five-digit format.
  • Calling the same attribute (e.g., year of birth) by different names.
  • Using different currency units (e.g., in dollars versus yen).
  • Using different scales (e.g., sales in dollars versus sales in millions of dollars).
  • Using derived attributes (e.g., one database uses aggregated annual salary, while the other includes only monthly salary).
  • Redundant or conflicting customer information (e.g., a customer with customer ID 150 has three children in one database, but four children in another).
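Several of the inconsistencies above can be handled with a small mapping step before the sources are combined. The following is a minimal sketch, not a definitive implementation: the field names (`cust_id`, `yob`, `zip_code`, `sales_musd`), the sample records, and the helpers `normalize_zip` and `harmonize_b` are all hypothetical and exist only for illustration. It covers attribute renaming, ZIP code formatting, and scale conversion; currency conversion would also need an exchange rate, which is left out here.

```python
# Hypothetical records from two sources; names and values are invented.
source_a = [
    {"customer_id": 150, "birth_year": 1980, "zip": "02139", "sales_usd": 1200.0},
]
source_b = [
    # Same customer, but different attribute names, a ZIP code that lost
    # its leading zero, and sales stored in millions of dollars.
    {"cust_id": 150, "yob": 1980, "zip_code": "2139", "sales_musd": 0.0012},
]

# Map source B's attribute names onto the shared schema used by source A.
RENAME_B = {"cust_id": "customer_id", "yob": "birth_year", "zip_code": "zip"}


def normalize_zip(value):
    """Return a five-digit ZIP string, restoring any lost leading zeros."""
    base = str(value).strip().split("-")[0]  # drop a ZIP+4 suffix if present
    digits = "".join(ch for ch in base if ch.isdigit())
    return digits.zfill(5)


def harmonize_b(record):
    """Convert one source B record to the shared schema and units."""
    out = {RENAME_B.get(key, key): value for key, value in record.items()}
    out["zip"] = normalize_zip(out["zip"])
    # Convert millions of dollars to dollars so both sources share one scale.
    out["sales_usd"] = round(out.pop("sales_musd") * 1_000_000, 2)
    return out


harmonized_b = [harmonize_b(record) for record in source_b]
print(harmonized_b[0] == source_a[0])
```

Keeping the renaming table and the unit conversion in one place per source makes it easier to audit what was changed when a new source is added.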

Compromised values

Real-life data isn’t perfect; values are often missing or inconsistent. For example, for a given customer, you may not know the number of children or the number of cars in her household. Or the same customer may appear in two different databases with different addresses or different numbers of children. It’s critical to reconcile these differences, especially if the missing or noisy values are crucial for your data modeling effort.
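Before reconciling anything, it helps to list which values are missing and which disagree. Here is a minimal sketch of that check, assuming each database can be read as a dictionary per customer. The `compare_records` helper, the `crm` and `billing` records, and their values are hypothetical. The sketch only flags problems; deciding which value to trust remains a modeling decision.

```python
def compare_records(rec_a, rec_b, fields):
    """Classify each field as missing in both records or conflicting."""
    missing, conflicts = [], []
    for field in fields:
        a, b = rec_a.get(field), rec_b.get(field)
        if a is None and b is None:
            # Neither source knows this value.
            missing.append(field)
        elif a is not None and b is not None and a != b:
            # Both sources have a value, but the values disagree.
            conflicts.append(field)
    return missing, conflicts


# Hypothetical records for the same customer in two databases.
crm = {"customer_id": 150, "children": 3, "cars": None, "address": "12 Elm St"}
billing = {"customer_id": 150, "children": 4, "cars": None, "address": "98 Oak Ave"}

missing, conflicts = compare_records(crm, billing, ["children", "cars", "address"])
print("missing:", missing)
print("conflicting:", conflicts)
```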


Data aggregation

Data can be collected together, or aggregated, at different levels. For example, in transactional data, every individual item in your shopping cart could be recorded as a separate line item. But what if you want to do analysis at the total order level, or at the customer level? In that case, you’d need to take all of the orders for each customer, aggregate them up to the customer level, and then attach that total to the customer’s information to do your analysis.
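The roll-up described above can be sketched in a few lines. This is an illustrative example under stated assumptions: the `line_items` and `customers` data, their field names, and the `aggregate` helper are hypothetical, and the aggregated number is assumed to be total spend.

```python
from collections import defaultdict

# Hypothetical transactional data: one row per item in a shopping cart.
line_items = [
    {"customer_id": 150, "order_id": "A1", "amount": 20.0},
    {"customer_id": 150, "order_id": "A1", "amount": 5.5},
    {"customer_id": 150, "order_id": "A2", "amount": 10.0},
    {"customer_id": 151, "order_id": "B1", "amount": 7.25},
]


def aggregate(items, key):
    """Sum line-item amounts grouped by the given key."""
    totals = defaultdict(float)
    for item in items:
        totals[item[key]] += item["amount"]
    return dict(totals)


# The same line items rolled up to two different levels.
order_totals = aggregate(line_items, "order_id")
customer_totals = aggregate(line_items, "customer_id")

# Attach the customer-level total to each customer's information.
customers = {150: {"name": "Customer 150"}, 151: {"name": "Customer 151"}}
for customer_id, total in customer_totals.items():
    customers[customer_id]["total_spend"] = total

print(order_totals)
print(customers)
```

Grouping by `order_id` or by `customer_id` uses the same helper, which reflects the point that the level of aggregation is a choice made at analysis time rather than a property of the raw data.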