Overcoming Data Mining Challenges
Copyright (c) 2011 Rexer Analytics, All rights reserved



In the four annual data miner surveys, these key challenges have been identified
by data miners more than any others:
- Dirty Data
- Explaining Data Mining to Others
- Unavailability of Data / Difficult Access to Data
In the 4th Annual Survey (2010) data miners also shared their experiences in
overcoming these challenges. Below are selected examples of the "best practices"
they shared. Complete lists of their experiences overcoming each of these data
mining challenges are also available by following the links.
Challenge: Dirty Data
Eighty-five of the 735 data miners participating in the 4th Annual Survey
described their experiences in overcoming this challenge. Key themes were
the use of descriptive statistics, data visualization, business rules, and
consultation with data content experts (business users). Many diverse and
detailed suggestions were shared (all 85 responses can be seen here).
Selected data miner survey responders' experiences overcoming this
challenge:
- All projects begin with low-level data reports showing counts of records,
verification of keys (uniqueness, widows/orphans), and distributions of
field contents. These reports are echoed back to the client's data
content experts.
- In terms of dirty data, we use a combination of two methods: informed
intuition and data profiling. Informed intuition requires our human
analysts to really get to know their data. Data profiling entails checking
whether the data falls within pre-defined norms. If it is outside the norms,
we go through a data validation step to ensure that the data is in fact
correct.
- Don't forget to look at a missing data plot to easily identify systematic
patterns of missing data (MD). Multiple imputation of MD is much better
than leaving MD untreated and suffering an "amputation" of your data set.
Alternatively, flag MD as a new category and model it actively. MD is
information! Use random forest (RF) for feature selection. I often used to
include too many variables, which modeled just noise and made the models
complex. With RF before modelling, I end up with only 5-10 variables
and brilliant models.
- A quick K-means clustering on a data set reveals the worst records, as
they often end up as single-observation clusters.
- Use an anomaly detector (GritBot) to flag records to put before subject
matter experts. They usually then formulate a rule, more comprehensive
than what GritBot postulated, that I can use to further clean the data.
- We calculate descriptive statistics about the data and visualize it before
starting the modeling process. Discussions with the business owners of
the data have helped us better understand its quality. We try to
understand the complexity of the data by looking at multivariate
combinations of data values.
- Training a decision tree for each variable, given the remaining variables,
allows us to A) replace NULL values, and B) check deviating values (expert).
- We created an artificial multidimensional definition of outliers and of
virtual clusters using fuzzy sets, and tried to trap dirty data. Examination
of the trapped data provided clues for writing programs to clean specific
types of "dirtiness".
- Being able to visualize data quickly lets us communicate the presence
of "dirty" data to clients. Portrait's Uplift modeling includes reports that
let our analysts know if control groups are *really* as random as clients
tell us they are (we often discover biases).
- Working with the different business units. Generally, dirty data does not
mean useless data. By working through the problem you tend to walk away
understanding the data set better than if it were accepted as clean. Just
because the data is clean and secure does not mean that you fully
understand all of the variables and the original intent behind why the
data was collected in the first place.
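
As a rough illustration of the low-level data reports and data profiling
that several responders describe, the following Python sketch counts
records, checks key uniqueness, lists orphans and widows, and summarizes
field contents. The function name `profile`, the field names, and the sample
records are hypothetical and are not taken from the survey responses. The
sketch also assumes that "orphans" means records pointing at a missing
parent and "widows" means parents with no related records.

```python
from collections import Counter

MISSING = (None, "")  # values treated as missing in this sketch (an assumption)


def profile(rows, key, foreign_key, parent_keys):
    """Low-level report: record count, key checks, and field distributions."""
    key_counts = Counter(row.get(key) for row in rows)
    parents = set(parent_keys)
    referenced = {row.get(foreign_key) for row in rows}
    report = {
        "record_count": len(rows),
        # Key values that occur more than once violate uniqueness.
        "duplicate_keys": sorted(k for k, n in key_counts.items() if n > 1),
        # Orphans: records whose foreign key matches no parent record.
        "orphans": [row.get(key) for row in rows
                    if row.get(foreign_key) not in parents],
        # Widows: parent records that no record refers to.
        "widows": sorted(parents - referenced),
        "fields": {},
    }
    for name in sorted({name for row in rows for name in row}):
        values = [row.get(name) for row in rows]
        present = [v for v in values if v not in MISSING]
        report["fields"][name] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "top_values": Counter(present).most_common(3),
        }
    return report


# Hypothetical sample data, invented for illustration only.
orders = [
    {"order_id": 1, "customer_id": 10, "amount": 25.0},
    {"order_id": 2, "customer_id": 11, "amount": None},
    {"order_id": 2, "customer_id": 10, "amount": 40.0},
    {"order_id": 3, "customer_id": 99, "amount": 25.0},
]
report = profile(orders, key="order_id", foreign_key="customer_id",
                 parent_keys=[10, 11, 12])
for section, content in report.items():
    print(section, content)
```

A report like this is small enough to hand to the data content experts, who
can then say whether a duplicate key or an orphaned record is a real error
or an expected quirk of the source system.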
Challenge: Explaining Data Mining to Others
Sixty-five of the 735 data miners participating in the 4th Annual Survey
described their experiences in overcoming this challenge. Key themes were
the use of graphics, very simple examples and analogies, and focusing on
the business impact of the data mining initiative. Many diverse and detailed
suggestions were shared (all 65 responses can be seen here).
Selected data miner survey responders' experiences overcoming this
challenge:
- Leveraging "Competing on Analytics" and case studies from other
organizations helps build the power of the possible. Taking small
impactful projects internally and then promoting those projects
throughout the organization helps adoption. Finally, serving the data up
in a meaningful application - BI tool - shows our stakeholders what data
mining is capable of delivering.
- Initiate Knowledge Sharing Sessions about DM basics and purposes.
- Graphical representations are very helpful (e.g., gain or lift charts).
- The problem is in getting enough time to lay out the problem and
show the solution. Most upper managers want short presentations
but don't have the background to just get the results. They often don't
buy into the solutions because they don't want to see the background.
Thus we try to work with their more ambitious direct reports, who are
more willing to see the whole presentation and, if they buy into it, will
defend the solution with their immediate superiors.
- Focus on dollars, overall benefit of model application to the Balance
Sheet and P&L.
- Measuring results compared to control groups is the best way to
convince people about data mining results.
- I've brought product managers (clients) to my desk and had them work
with me on which analyses were important to them. That way I was able
to manipulate the data on the fly based on their expertise to analyze
different aspects that were interesting to them.
- Explaining results and their business impact with visual and graphical
presentations, along with historical trends and variance analysis, helps
explain business trends in the data to business users in a logical way.
- One View of the Truth philosophy. Definitions of variables are
consistent across business functions.
- Visualize and explain models and model spaces. Explain and interpret
results. Show and explain evaluation and significance of results.
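
For readers unfamiliar with the gain and lift charts mentioned above, the
numbers behind them can be computed in a few lines. The sketch below is a
minimal illustration in Python, assuming a model that outputs one score per
record and a binary outcome (1 = responded, 0 = did not). The function name
`lift_table` and the sample scores and outcomes are hypothetical.

```python
def lift_table(scores, outcomes, bins=5):
    """Cumulative gain and lift per bin, with the best-scored records first."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_positives = sum(outcome for _, outcome in ranked)
    if total_positives == 0:
        raise ValueError("lift is undefined when there are no positive outcomes")
    table = []
    for b in range(1, bins + 1):
        cutoff = round(total * b / bins)  # records targeted up to this bin
        captured = sum(outcome for _, outcome in ranked[:cutoff])
        gain = captured / total_positives  # share of all positives captured
        lift = gain / (cutoff / total)     # gain relative to random targeting
        table.append({"bin": b, "targeted": cutoff, "gain": gain, "lift": lift})
    return table


# Hypothetical model scores and observed outcomes, invented for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

table = lift_table(scores, outcomes, bins=5)
for row in table:
    print(row["bin"], row["targeted"], round(row["gain"], 2), round(row["lift"], 2))
```

In this made-up example the first bin has a lift of 2.5: targeting the
top-scored 20% of records captures 2.5 times as many responders as targeting
a random 20%. A statement of that form is usually easier for a business
audience to absorb than model statistics.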
Challenge: Unavailability of Data / Difficult Access to Data
Forty-six of the 735 data miners participating in the 4th Annual Survey
described their experiences in overcoming this challenge. Key themes were
devoting resources to improving data availability and methods of overcoming
organizational barriers. Many diverse and detailed suggestions were shared
(all 46 responses can be seen here).
Selected data miner survey responders' experiences overcoming this
challenge:
- We continue to work on better, easier, faster ways to access data. In
fact, we have employed a data gathering specialist who works full-time
on data gathering efforts, thus reducing the stress on our statisticians.
- I usually would confer with the appropriate content experts in order to
devise a reasonable heuristic to deal with unavailable data, or impute
variables. Difficult-to-access data typically means we don't have a good
plan for what needs to be collected. I talk with the product managers
and propose data needs for their business problems. If we can match
the business issues with the needs, data access and availability are
usually resolved.
- Our best practice is to design and implement a dedicated database
model for data mining purposes - an Analytical Data Set - that should
be populated automatically at defined intervals.
- A lot of traveling to the business unit site to work with the direct
'customer' and local IT. We generally put best practices into place after
cleaning what little data we can find. Going forward we generally
develop a project plan around better, more robust data collection.
- When the data is unavailable, we provide our "second best" model from
whatever data does exist, and a long disclaimer. This doesn't really
improve the current solution, but it's proven very effective in getting
people to (i) continue to call us and (ii) do a better job of getting the data
we've asked for.
- Access to data is addressed at the business team level, since lack of
data means that their goals cannot be met with our support.
- A big problem we have is that the groups (government agencies) that
generate the data we need do not trust others with that data. They fear
either being embarrassed by possible oversights they have made, or that
disclosure of this data will lead to an increased workload for them in
dealing with requests and queries that result from users of the data not
understanding its context. An approach for sharing this data that we
have seen work is for the organizations that hold the data to form an
interest group that controls access to the data and drives the uses of
the data.