The 411 on Data Cleaning, Modeling, and Governance for Marketers

You can access a wealth of marketing-related data — from web analytics and customer journey behavior to competitor analysis and product usage.

However, if the data isn’t clean, you can’t truly tap into its value. Or worse, you could steer your marketing in the wrong direction and see diminishing returns.

James Hunt, principal consultant at Vivanti, says data cleaning and modeling are essential to extract value and gain knowledge and wisdom from the information. In his presentation at the Marketing Analytics & Data Science Conference, he details why they’re necessary, the basics of data cleaning, and the role of governance and observability.

What is data modeling?

Data models turn data into something useful, and you need to understand data modeling so you can understand the best cleaning options. James explains that data modeling involves three parts — additive, context, and domain.

Additive means you let the machines figure out how to standardize the data. You don’t manually “fix” the data, such as lowercasing the sporadic all-cap names on a spreadsheet. That would actually be data destruction because, as James says, “As humans, we’re really bad at doing the same thing twice.”

Context organizes the data to tell a story. You don’t add new information; you relate the data you already have. For example, the context of a sales transaction could include the marketing emails the buyer saw, the social media content the buyer engaged with, and the other products they viewed.

Domain is the set of all possible data values for a given element. It can be qualitative and quantitative. James points to these five common domain types:

Identity — a unique value that distinctly and discretely pinpoints somebody, such as an email address, Social Security number, or customer ID
Nominative — a supplemental identity not strong enough to stand on its own, such as a person’s full name or a product name
Categorical — a grouping across arbitrary boundaries, such as customer type or industry; often used for cohort subdivision
Monetary — a currency amount that can be compared, ordered, aggregated, and disaggregated, such as order total or unit price
Temporal — a point or span of dates and times, such as sign-up date, last purchase date, or loyalty period

With this foundational understanding of modeling, you’re ready to learn about cleaning the data.

What types of data cleaning exist?

James details the three types of data cleaning — mechanical, explicit mappings, and patterns and rules:

With mechanical cleaning, the data is cleaned up without changing the meaning of the information, such as normalizing the case for names and removing unnecessary spaces. “These are all things that I can do all by myself as a data engineer that nobody gets mad (about),” James says. “Nobody says, ‘Well, you took the spaces out of their first name, so it’s a different person.’”
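A minimal Python sketch of mechanical cleaning, under stated assumptions: the function name clean_name and the sample names are hypothetical, and str.title() is a simplification that mishandles names such as “McDonald.” The cleaned value is stored next to the raw one, in keeping with the additive idea above.

```python
import re


def clean_name(raw: str) -> str:
    """Mechanically tidy a name without changing who it refers to."""
    # Collapse runs of whitespace into single spaces and trim the ends.
    collapsed = re.sub(r"\s+", " ", raw).strip()
    # Normalize case. title() is a simplification for this sketch.
    return collapsed.title()


# Hypothetical sample data: the same person entered three ways.
raw_names = ["  JAMES   HUNT ", "james hunt", "James Hunt"]

# Additive: keep the raw value and add the cleaned one beside it.
cleaned = [{"raw": name, "clean": clean_name(name)} for name in raw_names]

for row in cleaned:
    print(repr(row["raw"]), "->", repr(row["clean"]))
```

Because the machine applies the same function to every row, the result is repeatable, and the untouched raw column lets you revisit the rule later.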

Explicit mapping uses an activity called “cardinality reduction” to decrease the number of unique values associated with an attribute. It simplifies the dataset by grouping values while retaining the relevant information. These datasets are more manageable and can improve model performance.

For example, James says, perhaps a customer status field started with two values — active and inactive. Over time, the field expanded to include suspended, on-hold, and prospective options. An explicit mapping cleaning might move the “suspended” customer status into the “active” value.
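Here is one way an explicit mapping could be sketched in Python. Only the “suspended” to “active” mapping comes from James’ example; where “on-hold” and “prospective” land, and the names STATUS_MAP and map_status, are assumptions made for illustration.

```python
from collections import Counter

# Explicit mapping from every known raw status to a reduced set of values.
STATUS_MAP = {
    "active": "active",
    "inactive": "inactive",
    "suspended": "active",       # from the article's example
    "on-hold": "inactive",       # assumption for illustration
    "prospective": "inactive",   # assumption for illustration
}


def map_status(raw: str) -> str:
    """Return the reduced status, or fail loudly on an unmapped value."""
    key = raw.strip().lower()
    if key not in STATUS_MAP:
        # Refuse to guess: an unknown value needs a human decision.
        raise ValueError(f"No mapping defined for status {raw!r}")
    return STATUS_MAP[key]


# Hypothetical sample data with five distinct raw values.
statuses = ["active", "Suspended", "on-hold", "prospective", "inactive", "active"]
mapped = [map_status(s) for s in statuses]
counts = Counter(mapped)
print(counts)
```

The field drops from five unique values to two, which is the cardinality reduction described above.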

Cleaning with patterns and rules identifies and corrects inconsistencies, inaccuracies, or errors in the data based on identifiable structures (i.e., patterns) and constraints (i.e., rules).

Standard patterns encompass data like email addresses, date strings, and phone numbers. Deviations from that structure indicate data that needs to be cleaned.

Rules refer to logical conditions or constraints. So, for example, if the monetary data for an insurance policy exceeds its maximum value, the entry needs to be cleaned.
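As a rough sketch, a pattern check and a rule check might look like this in Python. The email pattern is deliberately loose, and the maximum policy value, field names, and sample records are all hypothetical.

```python
import re

# Pattern: a loose structural check for an email address (not a full validator).
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Rule: a hypothetical ceiling on the monetary value of a policy.
MAX_POLICY_VALUE = 1_000_000


def find_issues(record: dict) -> list[str]:
    """List the pattern and rule violations found in one record."""
    issues = []
    if not EMAIL_PATTERN.match(record["email"]):
        issues.append("email does not match the expected pattern")
    if record["policy_value"] > MAX_POLICY_VALUE:
        issues.append("policy_value exceeds the maximum")
    return issues


# Hypothetical sample data: one clean record, one with two problems.
records = [
    {"email": "ana@example.com", "policy_value": 250_000},
    {"email": "ana.example.com", "policy_value": 5_000_000},
]

for record in records:
    print(record["email"], find_issues(record))
```

The function only reports violations. What to do with a flagged record is a separate decision.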

James says you also can set rules and patterns to map the customer journey. Let’s say a brand doesn’t care how many times a person opens and clicks its email. Instead, it cares about identifying who is susceptible to purchasing from an email marketing campaign. It could set up rules to clean the data for that goal.

For example, all emails sent would be labeled “E”, and all clicks would be labeled “C”, while an order would be recognized as “O.” Those rules collapse the data so it’s most helpful for the brand and its marketing goals.
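The collapsing rules above could be sketched in Python as follows. The event names, customer IDs, and the specific pattern (an email, then one or more clicks, then an order) are assumptions chosen for illustration, not James’ actual rules.

```python
import re

# Rules: only the events the brand cares about get a code.
# Anything else (such as opens) is dropped.
EVENT_CODES = {"email_sent": "E", "email_click": "C", "order": "O"}

# Assumed pattern: an email, then one or more clicks, then an order.
EMAIL_DRIVEN_ORDER = re.compile(r"EC+O")


def collapse_journey(events: list[str]) -> str:
    """Collapse a list of raw events into a compact string of codes."""
    return "".join(EVENT_CODES[e] for e in events if e in EVENT_CODES)


# Hypothetical journeys keyed by customer ID.
journeys = {
    "cust_1": ["email_sent", "email_open", "email_click", "order"],
    "cust_2": ["email_sent", "email_sent", "email_open"],
    "cust_3": ["email_sent", "email_click", "email_click"],
}

collapsed = {cid: collapse_journey(events) for cid, events in journeys.items()}
email_buyers = [cid for cid, code in collapsed.items() if EMAIL_DRIVEN_ORDER.search(code)]

print(collapsed)
print(email_buyers)
```

Once each journey is a short string, finding customers who bought after an email becomes a simple pattern search.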

What is governance’s role in data cleaning?

“Anytime you are cleaning data, you are making a decision. You are deciding what is relevant; you are deciding what is important. You’re deciding what to keep and what to surface,” James says.

You must document those data-cleaning decisions in an internal repository, such as a spreadsheet, or use a version control system like the open-source Git.

Each decision should answer these four questions:

What decision was made?
When was it made? This point-in-time reference helps with historical analysis.
Who made the decision?
Why was this decision made? The reason helps inform future actions. For example, if the decision was made because of a government update, reversing it probably isn’t possible. But if the decision was made because the data team thought it was a better way to do it, reversing course may remain a viable option, James says.

Let’s go back to the example of collapsing the customer status fields so the “suspended” status was grouped into “active” customers. Here’s how that decision might be recorded:

“Customers with ‘suspended’ status are still considered active as of Oct. 22, 2024. The decision was made by James Hunt because a mapping analysis showed customer behaviors can best be assessed by active or inactive status.”
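If you keep decisions in a version-controlled file rather than a spreadsheet, one possible shape is a small structured record with the four answers as fields. This Python sketch is an assumption about format, not something James prescribes; the class name CleaningDecision and the JSON layout are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class CleaningDecision:
    """One data-cleaning decision, answering what, when, who, and why."""
    what: str
    when: date
    who: str
    why: str


decision = CleaningDecision(
    what="Customers with 'suspended' status are considered active.",
    when=date(2024, 10, 22),
    who="James Hunt",
    why=(
        "A mapping analysis showed customer behaviors can best be "
        "assessed by active or inactive status."
    ),
)

# Serialize to one JSON line that could be appended to a tracked log file.
record = asdict(decision)
record["when"] = decision.when.isoformat()
line = json.dumps(record)
print(line)
```

A plain-text line like this is easy to diff and review in Git, so the history of decisions stays visible.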

Humans are essential to the governance process, James says. Computer-generated algorithms can suggest data-cleaning steps, but a human should be in the loop to review the suggestions and approve or reject them.

What is observability?

Even after you set up rules and patterns to ensure clean data, some data will run afoul of those parameters. Instead of letting this data through or cleaning it up automatically, you should embrace observability: surfacing the metadata of your data cleaning so people can see what was flagged and why. James says it is 10 times more important than governance.

Surfacing the metadata of your data cleaning might look like this example from one of James’ clients. The data-cleaning rules set a lower limit on policy sizes to catch bad data. It worked well for about six months until a policy entered the system with a size below the limit set in the rules.

James flagged this record and then asked the client, “Do you want us to adjust the limit?” The client said yes, and the lower limit data rule was updated.

“We caught that through the observability loop by saying, ‘This is what we expect the data to look like.’ It didn’t look like that when we were cleaning it. We weren’t comfortable making that decision (without client input). And that’s what observability is going to get you,” James says.

Having the right observability practices can save you hours, days, weeks, months, and a whole lot of embarrassment, he notes.
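The loop in the client example can be sketched in Python as a step that flags unexpected records for review instead of dropping or fixing them. The limit values, field names, and policy IDs below are hypothetical.

```python
# Hypothetical lower limit on policy size, set when the rules were written.
MIN_POLICY_SIZE = 10_000


def partition(policies: list[dict], lower_limit: int) -> tuple[list[dict], list[dict]]:
    """Split policies into accepted records and records flagged for review."""
    accepted, flagged = [], []
    for policy in policies:
        if policy["size"] < lower_limit:
            # Do not drop or "fix" the record; surface it with a reason.
            flagged.append({**policy, "reason": f"size below {lower_limit}"})
        else:
            accepted.append(policy)
    return accepted, flagged


# Hypothetical sample data: the second policy breaks the rule.
policies = [{"id": "P-1", "size": 50_000}, {"id": "P-2", "size": 5_000}]

accepted, flagged = partition(policies, MIN_POLICY_SIZE)
print("flagged for review:", flagged)

# After a human confirms the small policy is legitimate, the rule is updated
# and the same data passes cleanly.
accepted_after, flagged_after = partition(policies, 5_000)
print("flagged after rule update:", flagged_after)
```

The flagged list is the metadata worth surfacing: it shows which records broke expectations and why, and it leaves the decision with a person.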

Are you ready to pursue data cleaning?

Now that you’ve learned about data modeling, cleaning, governance, and observability, you are ready to apply them to your marketing if you have:

Datasets where the integrity of data is not pristine or perfect
Datasets with a high number of unique values (i.e., for which cardinality reduction can help processing and analysis)

Where would you find that data? It could come from a multitude of sources, such as:

CRM platforms
Customer contact records
Customer questionnaires and feedback forms
Survey responses
Web analytics
Customer behaviors
Product or platform information
Competitor analyses

Start with the ones that would most benefit from one or more of the three types of data cleaning, proper governance, and observability. Then, you can decide whether to engage with data teams in your organization to assist.



Cover image by Joseph Kalinowski/Content Marketing Institute
