Data Quality

Why should I read this White Paper?

In a world driven by data, we often underestimate the importance of the quality of our data. Do we really know that the data we base our decisions on is accurate? How often do we stop trusting and therefore stop using a great system because the data in the system is bad? In this White Paper, we hope to highlight the importance of Data Quality and give some hints and tips on how you can tackle your Data Quality challenges and implement some processes to ensure that you maintain Data Quality.

What is Data Quality?

When we talk about Data Quality in this White Paper, what we really mean is “Quality of Information” because data is in itself of little value. With that said, all information that we make decisions on is based on data. As the need to derive meaningful information from large sets of data becomes a higher priority, Integrity and Quality of the Data is what gives users the assurance that the information they see is trustworthy. In this White Paper, when we refer to Data Quality, we’re therefore talking about the quality of the data that transforms into information that is used to make decisions.

In terms of data quality there are Known Knowns, Unknown Knowns and Unknown Unknowns:

“There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know.”

Donald Rumsfeld (February 2002)

We’ll explain this later in the White Paper, but essentially it’s categorising your data quality in a way that will help you quantify the quality of your data and therefore your decisions.

For example, we’ve seen in a number of companies where the Sales and CRM systems use completely different coding structures. In the Sales system “Joe Bloggs inc” has the code “abc123” but in the CRM system “Joe Bloggs ltd” (note the subtle difference) has the code “67345”. This is a Data Quality issue. These are “Known knowns”. We know that we need to maintain a mapping between the two different systems, otherwise when we try and combine the data to produce the information that we use to make decisions, what is the quality of our decisions?

What is a Data Warehouse?

A Data Warehouse is the whole ecosystem that processes data from being raw data to information upon which you can make decisions. This often includes the industry standard Extract, Transfer and Load processes to get the data into the Data Warehouse and turn it into information.

Most Data Warehouses render information to the users in (hopefully) a nice graphical way (graphs, charts, tables, dashboards, etc.). In reality what the system has done is to take raw data and transform it into information that is then used to make decisions.

In order to achieve this the system must often combine data from different sources, for example:

Sales system
CRM system
Local (to the data warehouse), such as special aggregation dimensions (e.g. aggregate the various African countries into Nigeria, South Africa and “rest of Africa”).
Telephony systems

The Data Warehouse needs to provide a way of combining the above four sets of data, but without addressing these Data Quality issues, the information that is generated from the Data Warehouse (that should be used to make decisions) will be of little value.

What can go wrong with your data?

We have identified 7 of the most common things that can go wrong with your data, which highlight the need to give attention to Data Quality:

Incomplete (missing) data, for example:
1. All of UK’s forecast sales volume is missing from the feed
2. CRM data for a country missing
Dirty data, for example:
1. Extra two zeros in a measure (instead of being 123.45 it’s 12345.00)
2. Misspelt dimensions
3. Corrupt data (as in bad format etc)
4. Incorrect Data (the colour of a garment in stock data being ‘Silk’)
Inconsistent dimensions, for example:
1. GB instead of UK
2. USA instead of US
3. “01/12/2016” instead of “12/01/2016”
Cloned data, for example:
1. Exact same sales data as was provided last month
Late data
Orphaned data, for example:
1. Sales rows for “Joe Bloggs inc” are included in the sales table but that company doesn’t exist in the companies table
Changed formats (this could be treated as incomplete data)
1. Used to provide .csv and then provided as .txt
2. In .xls format but with different column layout
3. Different date formats, was the original date the 01/07/2016 or 6th January?

What should you do if something goes wrong?

It depends on the severity, basically one of three options:

Stop everything – don’t display any information at all.
Display only historic information (last month’s data).
Show current information “warts and all” but warn the users.

It goes without saying that users should be warned or informed!

Data Quality Dashboard/Portal

In the below diagrams we show how we can control the quality of data (and therefore information) that is rendered to the user. That is all well and good, but how would the user know what data is available? It is not sufficient to simply say “If it’s there it’s good!”. A good data warehouse will have a dashboard and/or a portal dedicated entirely to data quality.

For example, any user should be able to see:

What period is the sales data up to?
Does the closing period for sales data match the CRM data?
How many orphans are there?
What changes to highest/lowest/average/median are there in the data?
How many sales rows are there?
How many new customer accounts were created?
Are there any data sets that are identical to the previous month?
What feeds failed entirely?

There are so many Data Quality KPIs that can be measured and displayed.

Below is an example where using a chart approach a single outlier can mask other problems. USA sales shoot up from 500 to 2000, which means that the remainder of the chart is “flat”. Only two issues are easily visible:

The USA figure of 2000
Germany missing April data

Using almost the same data (the May USA sales has been changed from 2000 to 555) the chart displays the issues that were previously masked.

In the display below we throw away the nice (but useless) chart and show a set of “rules”. Ten data issues, of which three are serious, are clearly displayed. The rules in this example are based on the change in total sales value from one month to the next.

Pass/Fail Criteria

A number of pass/fail criteria can be employed, and these will differ from one gate to the next (i.e. from Staging to Foundation vs Foundation to Publish). The second gate allows us to check the quality of the linkages when combining different data sources. It may be necessary for someone, probably an administrator, to perform some mapping between systems.

Some examples of pass/fail criteria are:

Number of data rows different by x%.
More than x orphans
Data totally missing for a given period
Data identical to previous period
Highest/Lowest/Average/Median value changed by x%
No of sales rows not able to link to CRM increased by x% or x number
Data differs from plan/budget by more than x%

1. Stop Everything

Why should I “stop everything”?

Surely the data that users see is already bad?

It depends on how your data warehouse was built. If everything is thrown into a single database using a single set of tables then you have a problem. Users will lose confidence. Below is what we thinks is a sensible data warehouse design:

In the above example, data is loaded into a staging database where it is assessed for quality before being copied into the foundation database where it is processed (transformed) and combined with other data. It can then be quality-checked again before being made available to subscribers/consumers (loaded and published).

So there are in fact two “gates” to stop bad data getting through. In the image below we show how the two “gates” are employed to handle data quality.

These gates are controlled by the Data Manager. The data manager is essential because:

One has to set the pass/fail criteria using a tool
Exceptions can be made to let data through when it has failed the quality test
A mechanism is required to tell data owners when there is a problem (one should implement a formal communication plan)

2. Display only historic information.

Sometimes it’s appropriate to look at historic information if the quality of the current information is not suitable for making decisions. You would need to ensure that you have appropriate warnings on the data so that users know that the data is historic and not current.

3. Show current information “warts and all” but warn the users.

Depending on the nature of the information that you’re reporting, sometimes it’s appropriate to show all the data regardless of the quality. This can be risky due to what we mentioned in the first section of this White Paper, that sometimes when users see information that is clearly wrong, they automatically doubt the validity of the entire system or Data Warehouse and this can be detrimental to your project or process.

Normally we would recommend that any data that doesn’t meet your pass/fail criteria is not shown to all users and only kept for a select few users with the purpose of identifying the issue and resolving it.

Root Cause Analysis

In order to determine ways of preventing data quality issues it is necessary to understand the main reasons for the introduction of bad data into the staging area. There are two main reasons:

Lack of automation.
1. It’s unbelievable how someone can follow the exact same process for eleven months and on the 12th month to change it.
2. People go on leave and don’t provide good instructions to their temporary replacements
3. People move jobs and/or roles and don’t hand over effectively
Lack of data ownership – data that is not owned tends to a state of chaos
Changes to source systems that “break” downstream use of data

Depending on the systems involved, little can be done about the third cause of bad data. If the source systems don’t implement change management and effective communication plans that are fit for purpose, when they make changes they’ll probably break downstream systems. The first two causes can easily be addressed using a combination of effective solution design, project management and change management.

It is easy to figure out how to prevent the causes of bad data:

Implement the project using effective change management
Ensure data is owned
1. Decide which people/roles own each data feed
2. Ensure the processed and published data has clear ownership
3. Communicate the ownership to the people/roles
4. Implement changes to performance bonuses to take into account data ownership
Implement effective communication plans and document them
Automate feeds when possible
Implement a solution design that alerts (using the documented communication plan) unexpected changes in source data

How do I fix my Data?

Here are some pointers to help you to fix the issues that you uncover in your Data Quality Dashboard. In this section we’ll be giving you some examples of how we’ve quantified quality issues and then how you can address them.

We believe that the following steps are really important to help you address and fix any data problems:

Identify how you can quantify the issues (e.g. how many products in our ERP are not named the same as they are named in our data warehouse, therefore when we try and run a sales report to look at the sales performance of our products, the report won’t pull in all the products).
Assess the impact of your Data Quality issues. For example:
- How many products are not linked?
- How many Towns have a postcode that is not correct?
- How many customers don’t have a customer number?
- How many customers don’t have country linked to them?
Create some exception reports to start to look at the poor data. This is really important to help you to identify where the issues are in your data.
Start to fix the issues.
Re-run exception reports to make sure the fixes have been done.

Communication Plans

Communication plans should contain a list of all entities to be notified automatically in the case of data issues as well as manual communication that may be required to rectify a problem. A common mistake we see is where the relevant contact people for systems that supply data are not documented. As people move on and off the system knowledge gets lost, including who to contact if problems occur. Communication plans should be checked regularly. In fact someone should clearly own the communication plan.

The communication plan is part of the master data management strategy that requires that all aspects of your data has clear owner that’s responsible for maintaining the quality of the information and if there’s a problem, it’s their responsibility to rectify the problem.

White PaperNigel IvyOctober 28, 2017