Overcoming Data Mining Challenges:
Dirty Data
Copyright (c) 2011 Rexer Analytics, All rights reserved
Overcoming Data Mining Challenges:
Dirty Data

In each of the four annual data miner surveys, "Dirty Data" has been identified as a
key challenge by more data miners than any other challenge.  While completing the
4th Annual Survey (2010), 85 of the 735 data miners shared their experiences in
overcoming this challenge.  Below is the full text of the "best practices" they

shared.  A summary of data miners' experiences overcoming the top three data
challenges is also available.

Challenge:  Dirty Data

  • All projects begin with low-level data reports showing counts of records,
    verification of keys (uniqueness, widows/orphans), and distributions of
    field contents.  These reports are echoed back to the client's data
    content experts.

  • In terms of dirty data, we use a combination of two methods:  informed
    intuition and data profiling.  Informed intuition required our human
    analysts to really get to know their data.  Data Profiling entails checking
    to see if the data falls into pre-defined norms.  If it is outside the norms
    we go through a data validation step to ensure that the data is in fact

  • Don't forget to look at a missing data plot to easily identify systematic
    pattern of missing data (MD). Multiple imputation of MD is much better
    than not to calculate MD and suffer from "amputation" of your data set.
    Alternatively flag MD as new category and model it actively. MD is
    information!    Use random forest (RF) as feature selection. I used to
    incorporate often too many variables which models just noise and is
    complex. With RF before modelling, I end up with only 5-10 variables
    and brilliant models.

  • A quick K-means clustering on a data set reveals the worst as they    
    often end up as single observation clusters.

  • Use anomaly detector (GritBot) to flag records to put before subject
    matter experts.  They usually then formulate a rule that is more
    comprehensive than what Gritbot postulated that I can use to further
    clean the data.

  • We calculate descriptive statistics about the data and visualize before
    starting the modeling process. Discussions with the business owners of
    the data have helped to better understand the quality.  We try to
    understand the complexity of the data by looking at multivariate
    combinations of data values.

  • Training decision trees for each variable given the remainders allows to
    A) replace NULL values, and  B) check deviating values (expert).

  • We created artificial multidimensional definition of outliers and of virtual
    clusters using fuzzy sets and tried to trap dirty data.  Examination of
    trapped data provided clues to write programs for cleaning specific
    types of "dirtiness".

  • Being able to visualize data quickly lets us communicate the presence
    of "dirty" data to clients.    Portrait's Uplift modeling includes reports that
    let our analysts know if control groups are *really* as random as clients
    tell us they are (we often discover biases).

  • Working with the different business units generally dirty data does not
    mean that it is useless.  By working through the problem you tend to
    walk away understanding the data set better than if it were accepted as
    clean...just because the data is clean and secure does not mean that
    you fully understand all of the variables and the original intent on why
    the data was collected in the first place.

  • Visualize data over time and histograms (value distribution over time
    and per class).  Find time-dependent anomalies or distribution
    anomalies.  Outlier detection and analysis.  Missing value

  • I follow traditional data check processes, using descriptive statistics
    (mean, median, standard deviation) as well as accompanying visual
    methods (bar charts and histograms) to check distributions.  I also use
    some coding practices in SAS to check for missing values.  
    Additionally, I also use q-q plots to check for the assumption of
    normality on continuous variables.

  • Detect and remove outliers.  Remove non-responders who fail to
    provide answers for key parameters.  Remove those that provided the
    same answers for all questions.

  • Missing or incorrect data has been an issue at our organization.  Some
    areas of success at dealing with missing was to incorporate variables
    that have correlations to ones that contain missing data and using this
    with missing data patterns to perform multiple imputation techniques at
    arriving at a better imputation estimate which does not contain bias like
    imputing a mean or median does.

  • We design fault-tolerant pattern discovery algorithms.  We claim that
    this is one useful directions of research for dirty data mining.      

  • I have good experience with using Control Charts in Minitab to verify the
    quality of the data.              

  • The age old Pareto rule seems to work. 20% of the data when cleaned
    helps 80% of the organisation/enterprise.  It is a tall order to go after
    every bit and byte of information that lies inside and outside the
    organisation.  Identification of the key source that is heavily used across
    the organisation and then employing DQ measures in-line on that
    system helps.  While the other systems can be handled in an off-line
    mode and brought in sync.

  • We have found a lot of useful RapidMiner operators implementing
    various cleaning techniques  (outliers remotion, good samples filters,
    attribute filters, missing value imputation, etc. etc...).   

  • As marketers, we believe in using variables that reflect customer reality
    - not exactly what operational accounting systems provide us directly.
    Two examples:  1) Delays in order processing shift acquisition numbers
    out of one day into next resulting in wild swings not based in any reality.
    We'll typically manually correct these data points in acquisition time
    series before modeling against marketing spend.  2) We are a
    subscription based business.  Subscriber tenure typically an important
    predictor in many models.  Our subscription processing system will
    reset the start of an individual's subscription after a credit card glitch.  
    We now use an "effective subscription start date" calculated with some
    fairly harry SQL that skips over short gaps seen in the raw data.

  • I learned from the book data preparation of Dorian Pyle a lot.  When I
    started at my company we made all missing values into zero.  After
    changing this we have better models.

  • The Automated Data processing Node within SPSS Modeler has
    helped this effort immensely, as I get such good visibility to the quality
    of the data, and there are recommendations I can follow to cleanse

  • We're just beginning to address dirty data, though it has been a major
    problem for a long time. We're starting with extensive process
    documentation and understanding, including field definitions of data
    that is already stored but misunderstood.  Then we'll address the table
    structure and finally get SQL training.  We're hardly at the 30,000-foot
    level.  We're mere inches above the ground and struggling not to go
    under altogether.  

  • A workbench approach provides a good "lightweight" solution to data
    cleaning and also allows the process to interact with business
    knowledge.  Dedicated ETL & data cleaning tools are often too
    expensive in central IT resource and also not suitable for business
    analysts who have the knowledge to do this effectively.

  • I use distributional analysis a lot then "when necessary" value

  • Use more visualisation and make owners responsible for "signing off"
    prior to usage.

  • Working with account managers on "known" cases, use it as baseline to
    generate regression model and cluster specs.  Apply those models to
    new dirty data and quarantine sections of dataset outside tolerance
    parameters.  Go through quarantine points manually and feed them
    back to core sample when applicable.

  • Client data is often not good as there are no checks and balances at
    their end. A good way to solve this problem is to send them a snapshot
    of the data through various reporting tools.  This way, the client knows
    how bad the data is and where the problem is occurring.

  • Locating records which breach some rule e.g. students who have
    apparently graduated without apparently having completed all the
    required modules!  

  • Using eObjects Datacleaner to profile data and find anomalies, then
    build the rules using Kettle validation steps to approve/reject new data.  

  • Data out of range:  use automatic range checking, outlier detection in

  • Using Automatic Data Preparation (ADP) from ODM.

  • Filtering, smoothing.

  • Data cleansing tools, with business rules.

  • If training data must be manually labeled in any way, it must be shuffled
    first (even if there is only one human labeler).

  • Social network analysis tools can often overcome the dirty data problem.

  • We use SAS Data Quality Solution to 'clean' data.

  • Action: Understood all possible data limitations.  Solution: Adjusted
    data appropriately; Assumed default values.   For instance, a long
    postal address field was split into several data base fields.

  • Adopt data quality tools to standardize and correct data.

  • Effective ETL scrubbing routines with QC feedback looping.

  • Implement data assessment and cleansing as an initial phase in our

  • Standard stats.  CART to flag data inconsistency.

  • Evaluate data thoroughly before using.  Write routines to clean data as
    required prior to use.  Attempt to incorporate changes to place controls
    at source, to improve data quality for future projects.  

  • Look for ambiguous business rules.  

  • Audit critical data and removal of doubt-full data.

  • Enforcing the adoption of standard extract files across analytic
    functions.  This ensures that the data being used represents our best

  • Redesigned data acquisition applications, redesigning input masks.  
    Statisticians using basic, histograms, means, modes, standard
    deviation.  Evaluating the possible use of filters to replace missing

  • The data that we deal with contains many parameters and    
    observations, these data were cleansed using several statistical and
    Logical technique.  Still we face the same problem of identifying bad
    responses in data.  Example: A respondent might give a score of 1 to
    all the response..this might be due to his unwillingness to respond...
    Now how to filter this data is a challenge, So we had developed an
    algorithm to find such rare response in the data.

  • Master Data Management processes reduce bad data inbound to
    analytics.  We are also using SQL server for a short term solution to fix
    bad master data pre-analysis.  

  • Explain to customers what kind of data you need and, if possible, ask
    them to prepare a sample.  Clearly divide projects into phases such   
    that if data is not good enough, it is not a problem to re-plan the project
    or even cancel it.  These scenarios should be clearly explained to

  • Business user validation of data set.    

  • Filtering and recoding tools of STATISTICA.

  • Remove files that are obviously too small.  Find values that are
    obviously out of definition range, e.g. percentage.  Try to visualize the
    data and find exceptions.

  • Using data cleaning facilities in SDM.  

  • We use binning to handle dirty data.  If that's not sufficient, we cleanup
    using business rules or try to regenerate the data.  If data for re-
    generation is not available, we consider dropping the variable from
    analysis or getting it fixed at source.

  • Aggregate raw data into intelligent analytical data sets, using multiple
    source systems as data source, removing duplications in data.

  • Identify source of dirty data.  Work with Data Warehouse to see if there
    is an issue with the source data.  If all efforts to source it correctly fail,
    then assumptions have to be made using trends and probability

  • Reviewed all data and set strict limits on data precision.  Eliminated all
    data points that were biologically unreasonable, or that demonstrated
    difficulty in analysis.

  • We just use the basics: descriptive stats, plots, model diagnostics.  

  • Applying defaults or leaving out dirty information but still ensuring you
    have a good set of information to train the models.

  • Missing data is imputed.  

  • Standardized data logs accompanying data files describing
    experimental parameters in lab measurements.

  • Pre-processing of data to accommodate/interpret in the presence of
    noise. Example is in-house software developed to segment email
    messages into functional parts to allow text analytics components to
    focus on the signal (e.g., ignoring quoted reply content, email
    signatures, advertising, legal disclaimers, etc.).  

  • Standardizing methods before screening. Outlier removal with various

  • Deep descriptive analysis.

  • Pre-screening, but very time consuming and misses do occur regularly.

  • Standardizing some data cleansing routines has been helpful.

  • Develop and deploy analytic methods that are robust under uncertainty.

  • Problems with data collection methods -- calibrate the equipment.

  • Standards, standards, standards, and SAS to manipulate the bad data.

  • Our own lookup table assigning gender by name.   Fuzzy matching
    algorithm to match multiple IDs of the same person.

  • Just requires lots of careful planning.

  • SAP data needs to be exported which causes formatting errors that
    must be corrected prior to modeling.

  • Several quality checks are often needed.

  • Use record linkage technologies to dedupe what appeared to many
    records, but was really duplicated data.

  • No need for outlier-cleaning with the TIM software.  TIM can handle

  • Create templates for data domains that contain typically needed
    derived fields so these can be given to the DB team to place on the

  • Manual correction, removing outliers.  

  • Employee_Age was to be calculated based on Date_of_Birth, but when
    that field was empty, the Appl used Start_Date which made many
    employees under the age of 8.  

  • 80% of our time is typically spent cleaning and validating data for
    quality and suitability for analytics.  We haven't seen one single project
    that didn't have some form of a problem around bad data values.   

  • Perhaps not that much dirty as hard to align and coordinate, and
    sometimes understand the data collection idiosyncrasies.

  • Standardization through use of MES.

  • Checking, Checking, Checking!

  • In house code to clean up data.

  • De-dup.

  • Collect new data.