Getting Data Quality Right for Analytics
Why is it important and how could it affect your business decisions ?
We’ll going to be covering the following topics:
Data Quality Enemies
5 Common Problems Found In Data
Real World Stories
Extremes Of Data
Missing Country Data
Topology Of A Data Warehouse
Data Quality Dashboards in Yellowfin
There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don’t know. But there are also unknown unknowns. There are things we don’t know we don’t know. Donald Rumsfeld
In a world that is generating more data than ever before, we have to quantify our Data Quality to the best of our ability and make sure that we know the quality of our known knowns and known unknowns. There will always be known unknowns, but if we want to make the best decisions possible on the data we have, let’s make sure that we know the quality of those decisions.
Data Quality Enemies
Over the years of working with data, we’ve identified 4 primary enemies of Data Quality. If these are overlooked, data will tend to a state of chaos, use of the data will diminish and the quality of our decisions will be poor.
Here are the 4 primary enemies of Data Quality that we’ve identified:
Ownership – Lack of Governance (ownership and/or accountability). Clear ownership AND accountability must be defined.
Dashboards, KPIs & Tools – Lack of tools (processes, dashboards and KPIs) to effectively Measure, Monitor, Manage and Report Data Quality.
Data Itself – Data can be it’s own enemy, if left “unattended” it tends towards a state of chaos.
Overlooked – Often Data Quality is not given enough consideration and/or budget. Data Quality should be a sub-project within the overall project.
5 Common Problems Found in Data
The 5 primary enemies that we identified in the previous paragraph often result in the following problems occurring in data:
Incomplete / Missing Data. This is where a data source or data feed is incomplete or missing. The result in this problem is that we’re making decisions on incomplete or missing data, therefore we don’t know all the knowns. This is a simple problem, but the results can be very dangerous for our decisions.
Inconsistent Dimensions. This problem causes our data linkages or joins to be incomplete. For example, if some of your sales information for the United Kingdom is coded as “GB” and some coded as “UK”, you’re possibly going to have some of your data not included in your reporting or dashboards.
Dirty Data. This is self explanatory. Examples we’ve seen of this is where the decimal place is in the wrong place so inflates figures significantly and skews reporting (e.g. 10000.00 instead of 1000.00).
Cloned Data. This causes double counting, which obviously skews your reporting and dashboards.
Orphaned Data. This is often caused by the problem 2 listed above (Inconsistent Dimensions). This is when linkages or joins break.
First Story – Outliers
One of the most interesting stories we’ve heard recently around Data Quality is the USA average salary argument. This is a bit of a wild card story as the Data Quality wasn’t necessarily bad, but the dashboard and the reports that came out of the data didn’t make known the extreme outliers that were in the data. We would stress that it’s really important for you to report and quantify the outliers in your data as this will affect the results that are reported.
What was communicated?
In the recent US presidential elections, one of the big talking points was the average wage for individuals in the US.
When you looked at the mean, wages had significantly increased, which at face value was a great success and everyone was happy.
What was missed?
Unfortunately, because the mean was used for the analysis for this information, the results were skewed from the reality.
A few outliers (e.g. very high earners such as Bill Gates, Warren Buffet, etc..) caused the results to be much higher than they were.
When you look at the median, the results are shocking and the trend of average salary goes down.
Here are the results:
The dark green line shows the results based on the mean. This clearly shows significant increase in average wage.
The light green line shows the results based on the median. This eliminates the outliers that skew the figures and gives a much more realistic and shocking result of the reality.
Second Story – Missing Data
When data is missing from a feed, it can cause major issues, affecting many areas of a business. Making decisions when you don’t have a full picture of where your business is at, it could significantly affect your business.
Here’s an example of missing data:
The red table above shows that Belgium is missing. If you were just to look at the chart above, you might miss that Belgium is missing because the red line is fairly flat from one month to the next, so it doesn’t really highlight that something is wrong. This highlights the importance of how you need to visualise your Data Quality in the right way.
Third Story – Broken Linkages
In Data Warehouses, there are many number of joins and linkages between different data sets. Often these linkages can be areas where Data Quality can be a challenge and where we need to give special attention.
Here’s a diagram showing a Data Quality error that was made by one of our customers that caused them some problems:
The blue diagram shows a local product called ABC that was coded as 123 in the ERP system in Austria. Product ABC was then linked to a Global product code A1 in the sales data warehouse. The purpose of this was to see global product performance in this organisation and is a very common requirement in global businesses. Product ABC was discontinued in Austria in 2012 and was no longer sold.
The red & green diagram show that a new product was launched in Austria, which was called XYZ. Product XYZ was incorrectly coded in the ERP system as 123, which then meant that it used the old linkage to the global product code A1, instead of linking to global product code A2.
You might think this isn’t a big deal, but what happened is that product XYZ was continually sold in Austria to a number of customers. Austria was one of the biggest markets for product XYZ. The global product line management team did a review of their portfolio with a view of simplifying their portfolio and therefore making their business more efficient. They used the global data warehouse reporting on sales to make the strategic decisions about which products to delete, which products are performing well, etc.. and they saw that global product code A2 was not being sold much if at all (because it wasn’t linked correctly to product XYZ). So a decision was made to stop manufacturing global product A2, which resulted in some very disgruntled customers in Austria who couldn’t source their products because manufacturing was stopped. This angst was caused by a very simple error in the linkages or products, which could have been avoided easily.
Topology of a Data Warehouse
Over the years of building data warehouses and working on data projects, we’ve found that the follow data warehouse topology works really well to implement gates and processes to monitor your Data Quality, stop flow of data when the quality is poor and report issues so that you can know the knowns and know the unknowns.
The data comes in from sources into a staging area where it we start to process the data into a format that is suitable for reporting. The data then flows to the foundation database, which is where algorithms, calculations and analytics are performed. Then finally the data flows into the publish database, which is where the reporting happens. You can see that there are gates between the staging and foundation databases and then between the foundation and publish database. These gates are used for running checks of the data before letting it pass to the next database. Some rules are defined on checks that are done on the data before it passes to the next database, so that some automated stops can happen.
Two main things to remember when considering Data Quality
Data Governance needs to be in place.
Owners of data and Consumers of data are BUSY, so clear accountability for who’s responsible for the data needs to be really clear.
Dashboards and tools are essential for successfully maintaining Data Quality to:
Report on the status of Data Quality.
Notify key people when there is a problem.