Overcoming Data Mining Challenges:
Dirty Data
Copyright (c) 2011 Rexer Analytics, All rights reserved
|



Overcoming Data Mining Challenges:
Dirty Data
In each of the four annual data miner surveys, "Dirty Data" has been identified as a
key challenge by more data miners than any other challenge. While completing the
4th Annual Survey (2010), 85 of the 735 data miners shared their experiences in
overcoming this challenge. Below is the full text of the "best practices" they
shared. A summary of data miners' experiences overcoming the top three data
mining challenges is also available.
Challenge: Dirty Data
- All projects begin with low-level data reports showing counts of records,
verification of keys (uniqueness, widows/orphans), and distributions of
field contents. These reports are echoed back to the client's data
content experts.
- In terms of dirty data, we use a combination of two methods: informed
intuition and data profiling. Informed intuition required our human
analysts to really get to know their data. Data Profiling entails checking
to see if the data falls into pre-defined norms. If it is outside the norms
we go through a data validation step to ensure that the data is in fact
correct.
- Don't forget to look at a missing data plot to easily identify systematic
pattern of missing data (MD). Multiple imputation of MD is much better
than not to calculate MD and suffer from "amputation" of your data set.
Alternatively flag MD as new category and model it actively. MD is
information! Use random forest (RF) as feature selection. I used to
incorporate often too many variables which models just noise and is
complex. With RF before modelling, I end up with only 5-10 variables
and brilliant models.
- A quick K-means clustering on a data set reveals the worst as they
often end up as single observation clusters.
- Use anomaly detector (GritBot) to flag records to put before subject
matter experts. They usually then formulate a rule that is more
comprehensive than what Gritbot postulated that I can use to further
clean the data.
- We calculate descriptive statistics about the data and visualize before
starting the modeling process. Discussions with the business owners of
the data have helped to better understand the quality. We try to
understand the complexity of the data by looking at multivariate
combinations of data values.
- Training decision trees for each variable given the remainders allows to
A) replace NULL values, and B) check deviating values (expert).
- We created artificial multidimensional definition of outliers and of virtual
clusters using fuzzy sets and tried to trap dirty data. Examination of
trapped data provided clues to write programs for cleaning specific
types of "dirtiness".
- Being able to visualize data quickly lets us communicate the presence
of "dirty" data to clients. Portrait's Uplift modeling includes reports that
let our analysts know if control groups are *really* as random as clients
tell us they are (we often discover biases).
- Working with the different business units generally dirty data does not
mean that it is useless. By working through the problem you tend to
walk away understanding the data set better than if it were accepted as
clean...just because the data is clean and secure does not mean that
you fully understand all of the variables and the original intent on why
the data was collected in the first place.
- Visualize data over time and histograms (value distribution over time
and per class). Find time-dependent anomalies or distribution
anomalies. Outlier detection and analysis. Missing value
analysis.
- I follow traditional data check processes, using descriptive statistics
(mean, median, standard deviation) as well as accompanying visual
methods (bar charts and histograms) to check distributions. I also use
some coding practices in SAS to check for missing values.
Additionally, I also use q-q plots to check for the assumption of
normality on continuous variables.
- Detect and remove outliers. Remove non-responders who fail to
provide answers for key parameters. Remove those that provided the
same answers for all questions.
- Missing or incorrect data has been an issue at our organization. Some
areas of success at dealing with missing was to incorporate variables
that have correlations to ones that contain missing data and using this
with missing data patterns to perform multiple imputation techniques at
arriving at a better imputation estimate which does not contain bias like
imputing a mean or median does.
- We design fault-tolerant pattern discovery algorithms. We claim that
this is one useful directions of research for dirty data mining.
- I have good experience with using Control Charts in Minitab to verify the
quality of the data.
- The age old Pareto rule seems to work. 20% of the data when cleaned
helps 80% of the organisation/enterprise. It is a tall order to go after
every bit and byte of information that lies inside and outside the
organisation. Identification of the key source that is heavily used across
the organisation and then employing DQ measures in-line on that
system helps. While the other systems can be handled in an off-line
mode and brought in sync.
- We have found a lot of useful RapidMiner operators implementing
various cleaning techniques (outliers remotion, good samples filters,
attribute filters, missing value imputation, etc. etc...).
- As marketers, we believe in using variables that reflect customer reality
- not exactly what operational accounting systems provide us directly.
Two examples: 1) Delays in order processing shift acquisition numbers
out of one day into next resulting in wild swings not based in any reality.
We'll typically manually correct these data points in acquisition time
series before modeling against marketing spend. 2) We are a
subscription based business. Subscriber tenure typically an important
predictor in many models. Our subscription processing system will
reset the start of an individual's subscription after a credit card glitch.
We now use an "effective subscription start date" calculated with some
fairly harry SQL that skips over short gaps seen in the raw data.
- I learned from the book data preparation of Dorian Pyle a lot. When I
started at my company we made all missing values into zero. After
changing this we have better models.
- The Automated Data processing Node within SPSS Modeler has
helped this effort immensely, as I get such good visibility to the quality
of the data, and there are recommendations I can follow to cleanse
it.
- We're just beginning to address dirty data, though it has been a major
problem for a long time. We're starting with extensive process
documentation and understanding, including field definitions of data
that is already stored but misunderstood. Then we'll address the table
structure and finally get SQL training. We're hardly at the 30,000-foot
level. We're mere inches above the ground and struggling not to go
under altogether.
- A workbench approach provides a good "lightweight" solution to data
cleaning and also allows the process to interact with business
knowledge. Dedicated ETL & data cleaning tools are often too
expensive in central IT resource and also not suitable for business
analysts who have the knowledge to do this effectively.
- I use distributional analysis a lot then "when necessary" value
imputing.
- Use more visualisation and make owners responsible for "signing off"
prior to usage.
- Working with account managers on "known" cases, use it as baseline to
generate regression model and cluster specs. Apply those models to
new dirty data and quarantine sections of dataset outside tolerance
parameters. Go through quarantine points manually and feed them
back to core sample when applicable.
- Client data is often not good as there are no checks and balances at
their end. A good way to solve this problem is to send them a snapshot
of the data through various reporting tools. This way, the client knows
how bad the data is and where the problem is occurring.
- Locating records which breach some rule e.g. students who have
apparently graduated without apparently having completed all the
required modules!
- Using eObjects Datacleaner to profile data and find anomalies, then
build the rules using Kettle validation steps to approve/reject new data.
- Data out of range: use automatic range checking, outlier detection in
Statistica
- Using Automatic Data Preparation (ADP) from ODM.
- Data cleansing tools, with business rules.
- If training data must be manually labeled in any way, it must be shuffled
first (even if there is only one human labeler).
- Social network analysis tools can often overcome the dirty data problem.
- We use SAS Data Quality Solution to 'clean' data.
- Action: Understood all possible data limitations. Solution: Adjusted
data appropriately; Assumed default values. For instance, a long
postal address field was split into several data base fields.
- Adopt data quality tools to standardize and correct data.
- Effective ETL scrubbing routines with QC feedback looping.
- Implement data assessment and cleansing as an initial phase in our
projects.
- Standard stats. CART to flag data inconsistency.
- Evaluate data thoroughly before using. Write routines to clean data as
required prior to use. Attempt to incorporate changes to place controls
at source, to improve data quality for future projects.
- Look for ambiguous business rules.
- Audit critical data and removal of doubt-full data.
- Enforcing the adoption of standard extract files across analytic
functions. This ensures that the data being used represents our best
source.
- Redesigned data acquisition applications, redesigning input masks.
Statisticians using basic, histograms, means, modes, standard
deviation. Evaluating the possible use of filters to replace missing
attributes.
- The data that we deal with contains many parameters and
observations, these data were cleansed using several statistical and
Logical technique. Still we face the same problem of identifying bad
responses in data. Example: A respondent might give a score of 1 to
all the response..this might be due to his unwillingness to respond...
Now how to filter this data is a challenge, So we had developed an
algorithm to find such rare response in the data.
- Master Data Management processes reduce bad data inbound to
analytics. We are also using SQL server for a short term solution to fix
bad master data pre-analysis.
- Explain to customers what kind of data you need and, if possible, ask
them to prepare a sample. Clearly divide projects into phases such
that if data is not good enough, it is not a problem to re-plan the project
or even cancel it. These scenarios should be clearly explained to
customers.
- Business user validation of data set.
- Filtering and recoding tools of STATISTICA.
- Remove files that are obviously too small. Find values that are
obviously out of definition range, e.g. percentage. Try to visualize the
data and find exceptions.
- Using data cleaning facilities in SDM.
- We use binning to handle dirty data. If that's not sufficient, we cleanup
using business rules or try to regenerate the data. If data for re-
generation is not available, we consider dropping the variable from
analysis or getting it fixed at source.
- Aggregate raw data into intelligent analytical data sets, using multiple
source systems as data source, removing duplications in data.
- Identify source of dirty data. Work with Data Warehouse to see if there
is an issue with the source data. If all efforts to source it correctly fail,
then assumptions have to be made using trends and probability
scores.
- Reviewed all data and set strict limits on data precision. Eliminated all
data points that were biologically unreasonable, or that demonstrated
difficulty in analysis.
- We just use the basics: descriptive stats, plots, model diagnostics.
- Applying defaults or leaving out dirty information but still ensuring you
have a good set of information to train the models.
- Standardized data logs accompanying data files describing
experimental parameters in lab measurements.
- Pre-processing of data to accommodate/interpret in the presence of
noise. Example is in-house software developed to segment email
messages into functional parts to allow text analytics components to
focus on the signal (e.g., ignoring quoted reply content, email
signatures, advertising, legal disclaimers, etc.).
- Standardizing methods before screening. Outlier removal with various
methods.
- Deep descriptive analysis.
- Pre-screening, but very time consuming and misses do occur regularly.
- Standardizing some data cleansing routines has been helpful.
- Develop and deploy analytic methods that are robust under uncertainty.
- Problems with data collection methods -- calibrate the equipment.
- Standards, standards, standards, and SAS to manipulate the bad data.
- Our own lookup table assigning gender by name. Fuzzy matching
algorithm to match multiple IDs of the same person.
- Just requires lots of careful planning.
- SAP data needs to be exported which causes formatting errors that
must be corrected prior to modeling.
- Several quality checks are often needed.
- Use record linkage technologies to dedupe what appeared to many
records, but was really duplicated data.
- No need for outlier-cleaning with the TIM software. TIM can handle
them.
- Create templates for data domains that contain typically needed
derived fields so these can be given to the DB team to place on the
back-end.
- Manual correction, removing outliers.
- Employee_Age was to be calculated based on Date_of_Birth, but when
that field was empty, the Appl used Start_Date which made many
employees under the age of 8.
- 80% of our time is typically spent cleaning and validating data for
quality and suitability for analytics. We haven't seen one single project
that didn't have some form of a problem around bad data values.
- Perhaps not that much dirty as hard to align and coordinate, and
sometimes understand the data collection idiosyncrasies.
- Standardization through use of MES.
- Checking, Checking, Checking!
- In house code to clean up data.