Data Quality: The Unsung Hero

Pooja Mahajan
4 min read · Oct 9, 2020

Many MOOCs teach everything from basic machine learning techniques to deep learning algorithms using toy datasets that are clean and small to work with, but in reality, when working with real data, 50–60% of the time is spent making it fit for any analysis.

The quality of data, in terms of coverage, completeness, and correctness, plays a crucial role in the success of data science projects by helping businesses derive the right insights!

With this intuition in mind, I thought of writing down the data quality checks that can be performed, based on my experience. The popular ones are finding and correcting missing values and outliers, but let's understand what more can be done from a data quality perspective, as DATA is the only oil for Data Science. So let's start!

1. Detecting redundant records

  • Checking for duplicates at the record level and at the primary key level.
  • Checking whether identical records exist if we ignore the primary key.

My Experience: In one of the use cases I have worked on, this check helped in finding duplicates in transaction data where the same transactions were being recorded under multiple business units. Technically they were not duplicates, but logically they were, and they were treated as such.

  • Checking for the presence of dummy or test records.

For example, transactional records with a zero or very low amount; a minimal check for these is included in the sketch below.
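Here is a minimal pandas sketch of these checks, assuming a hypothetical transactions table with a txn_id primary key, a business_unit column, and an amount column (all names and the low-amount threshold are illustrative, not a real schema):

```python
import pandas as pd

# Illustrative transaction data; column names are assumptions for this sketch
df = pd.DataFrame({
    "txn_id": [101, 102, 103, 103, 104],
    "business_unit": ["BU1", "BU2", "BU1", "BU2", "BU1"],
    "amount": [250.0, 99.9, 75.0, 75.0, 0.0],
})

# Exact duplicates at the record level
record_dupes = df[df.duplicated(keep=False)]

# Duplicates at the primary-key level
pk_dupes = df[df.duplicated(subset=["txn_id"], keep=False)]

# "Logical" duplicates: the same transaction booked under multiple business
# units, i.e. identical once we ignore the key and the unit
value_cols = [c for c in df.columns if c not in ("txn_id", "business_unit")]
logical_dupes = df[df.duplicated(subset=value_cols, keep=False)]

# Possible dummy/test records: zero or very low amounts (threshold is a guess)
dummy_like = df[df["amount"] <= 1.0]

print(pk_dupes, logical_dupes, dummy_like, sep="\n\n")
```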

2. Exploring Primary Key

  • The data type of the primary key, especially when the key is composite in nature.
  • Is there an assumption that the primary key of a record will never change?

My Experience: In one of the use cases, changes in the way the primary key was captured messed up the workflow. My primary key was numeric, and a year later, for one of the datasets, a letter was appended as a prefix to the key, which broke joins and other downstream processes.
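A short sketch of the kind of key checks that would have flagged this early; the customer_id column and the prefix pattern are assumptions for illustration:

```python
import pandas as pd

# Illustrative data: keys stored as strings, one with an unexpected prefix
df = pd.DataFrame({"customer_id": ["1001", "1002", "A1003", "1002"],
                   "spend": [120.5, 80.0, 45.0, 60.0]})

# Uniqueness: a primary key should identify exactly one row
print("key is unique:", df["customer_id"].is_unique)   # False (1002 repeats)

# Type drift: keys that no longer parse as numeric suggest the capture
# format changed, e.g. an alphabetic prefix was introduced
non_numeric = df[pd.to_numeric(df["customer_id"], errors="coerce").isna()]
print(non_numeric)                                      # the 'A1003' row

# For a composite key, check uniqueness of the column combination instead
# (column names here are hypothetical):
# composite_dupes = df.duplicated(subset=["order_id", "line_no"]).any()
```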

3. Probing Numeric data

  • Checking with the domain team whether a missing value implies 0.
  • Whether negative values are allowed.

For example, in retail data, return transactions are sometimes captured with negative amounts, which is fine, whereas such values in columns like age are something that needs to be taken care of.

  • Whether there are expected minimum and maximum values for each column.

My Experience: While working on a retail use case, we had data with a maximum Age value of '2006' and around 0.5% of customers with age >= 130, and the use case was to design marketing campaigns where attributes like age are used to define customer segments. These kinds of values call into question the integrity of the other normal-appearing data too.
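One lightweight way to make such expectations explicit is a small bounds table checked on every data refresh; the columns and limits below are illustrative, not from the original project:

```python
import pandas as pd

df = pd.DataFrame({"age": [34, 2006, 131, 45],
                   "amount": [120.0, -35.0, 60.0, 0.0]})

# Bounds agreed with the domain team (illustrative; None = no bound agreed yet)
bounds = {"age": (0, 120), "amount": (None, None)}

def out_of_range(series: pd.Series, low, high) -> pd.Series:
    """Boolean mask of values outside the agreed [low, high] range."""
    mask = pd.Series(False, index=series.index)
    if low is not None:
        mask |= series < low
    if high is not None:
        mask |= series > high
    return mask

for col, (low, high) in bounds.items():
    bad = df[out_of_range(df[col], low, high)]
    if not bad.empty:
        print(f"{col}: {len(bad)} out-of-range values\n{bad}\n")

# Negative amounts may be legitimate returns, so flag them rather than drop them
print(df[df["amount"] < 0])
```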

  • Expected distribution of columns and how it changes over time, if data for multiple years is present (a simple year-over-year summary is sketched at the end of this section).
  • Whether a variable that seems numeric is actually categorical.
  • For financial data, whether the value is the current value or a point-in-time value.

My Experience: While working with financial data spanning several years, we would fix a conversion rate at a point in time and convert all base values using it, in order to have a fair comparison between values.

  • In the case of financial data, the consistency between country and currency needs to be checked too.
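For the distribution-over-time point above, a simple year-wise summary is often enough to spot drift; the year and amount columns here are assumptions for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "year":   [2018, 2018, 2018, 2019, 2019, 2019, 2020, 2020, 2020],
    "amount": [100.0, 120.0, 95.0, 110.0, 105.0, 400.0, 115.0, 130.0, 125.0],
})

# Year-over-year summary of the distribution; a sudden jump in the mean or
# the 95th percentile is worth a conversation with the domain team
summary = df.groupby("year")["amount"].agg(
    count="count",
    mean="mean",
    median="median",
    p95=lambda s: s.quantile(0.95),
)
print(summary)
```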

4. Checking Categorical data

  • Tracking the count of unique levels for each column during the development process.
  • Correcting similar and redundant levels, e.g. 'Mumbai' vs 'mumbai'.
  • Tracking new levels being added across a timespan.
  • Tracking changes in the way levels are captured.

For example, if the values of the 'Gender' column were earlier Male, Female, and Others, and the levels are now M, F, and O.
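A rough sketch of these level checks, assuming a hypothetical city column and a saved set of previously seen levels:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Mumbai", "mumbai", "Delhi", " Pune", "delhi"]})

# Normalise case and stray whitespace before counting levels,
# so 'Mumbai' and 'mumbai' collapse into one level
df["city_clean"] = df["city"].str.strip().str.lower()
print(df["city_clean"].value_counts())          # mumbai: 2, delhi: 2, pune: 1

# Compare against the levels seen in the previous run (illustrative baseline)
previous_levels = {"mumbai", "delhi"}
new_levels = set(df["city_clean"].unique()) - previous_levels
print("new levels to investigate:", new_levels)  # {'pune'}
```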

5. Exploring DateTime columns

  • Consistent format of date and time; if it is not consistent, what is the business format that should be followed?
  • Checking whether any future dates are present.
  • Understanding the special behavior of some metrics in case of leap years, holidays, crises, etc. [checking for outliers in case of aggregations].
  • If instead of one column we have month and year in the data [monthly-level data], asking the domain team whether it reflects the start of the month [previous month's calculation] or an end-of-month entry.

My Experience: In a retail use case, we found wrong timestamps in the transactional data; there were instances where a person had made two transactions at places around 10 km apart with a time difference of only 2–3 minutes.
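A minimal sketch of the format and future-date checks, assuming an illustrative txn_time column stored as text:

```python
import pandas as pd

df = pd.DataFrame({"txn_time": ["2020-01-15 10:30", "15/01/2020 10:32",
                                "2031-06-01 09:00", "2020-02-30 11:00"]})

# Parse against the single agreed business format; rows that fail either use
# another format or are simply invalid dates (e.g. 30 February)
parsed = pd.to_datetime(df["txn_time"], format="%Y-%m-%d %H:%M", errors="coerce")
print(df[parsed.isna()])                 # the dd/mm/yyyy row and the 30 Feb row

# Flag dates in the future relative to the run date
print(df[parsed > pd.Timestamp.now()])   # the 2031 transaction
```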

6. Other checks

  • If personal information like age, contact number, etc. is present in the data, knowing how frequently it is updated can be beneficial.
  • Customers with the same contact information can be flagged.
  • Different combinations like Country-City, City-Retail shop location, Country-Currency, etc. can be checked.
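Sketching the last two checks with pandas; the reference country-currency mapping and the contact columns are illustrative assumptions:

```python
import pandas as pd

txns = pd.DataFrame({
    "country":  ["India", "India", "US", "UK"],
    "currency": ["INR", "USD", "USD", "EUR"],
})

# Agreed reference mapping (illustrative); extend it as the domain team confirms
valid_pairs = {("India", "INR"), ("US", "USD"), ("UK", "GBP")}
pairs = txns[["country", "currency"]].apply(tuple, axis=1)
print(txns[~pairs.isin(valid_pairs)])    # India/USD and UK/EUR need a look

# Customers sharing the same contact number
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "phone": ["99990", "99990", "88880"]})
print(customers[customers.duplicated(subset=["phone"], keep=False)])
```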

These are a few checks that I was able to write down based on my experience of working with real datasets. The importance of data quality lies in enabling actionable insights that help businesses grow in the right direction.
