Overcoming Data Mining Challenges
Copyright (c) 2011 Rexer Analytics, All rights reserved

In the four annual data miner surveys, these key challenges have been identified
by data miners more than any others:

  • Dirty Data
  • Explaining Data Mining to Others
  • Unavailability of Data / Difficult Access to Data

In the 4th Annual Survey (2010) data miners also shared their experiences in
overcoming these challenges.  Below are selected examples of the "best practices"
they shared.  Complete lists of their experiences overcoming each of these data
mining challenges are also available by following the links.

Challenge:  Dirty Data

      Eighty-five of the 735 data miners participating in the 4th Annual Survey
      described their experiences in overcoming this challenge.  Key themes were
      the use of descriptive statistics, data visualization, business rules, and
      consultation with data content experts (business users).  Many diverse and
      detailed suggestions were shared (all 85 responses can be seen here).

      Selected data miner survey responders' experiences overcoming this
      challenge:

  • All projects begin with low-level data reports showing counts of records,
    verification of keys (uniqueness, widows/orphans), and distributions of
    field contents.  These reports are echoed back to the client's data
    content experts.

  • In terms of dirty data, we use a combination of two methods:  informed
    intuition and data profiling.  Informed intuition requires our human
    analysts to really get to know their data.  Data profiling entails checking
    to see if the data falls within pre-defined norms.  If it is outside the norms,
    we go through a data validation step to ensure that the data is in fact

  • Don't forget to look at a missing data plot to easily identify systematic
    patterns of missing data (MD).  Multiple imputation of MD is much better
    than not imputing MD and suffering "amputation" of your data set.
    Alternatively, flag MD as a new category and model it actively.  MD is
    information!  Use random forest (RF) for feature selection.  I often used
    to incorporate too many variables, which models just noise and adds
    complexity.  With RF before modelling, I end up with only 5-10 variables
    and brilliant models.

  • A quick K-means clustering on a data set reveals the worst records, as
    they often end up as single-observation clusters.

  • Use an anomaly detector (GritBot) to flag records to put before subject
    matter experts.  They usually then formulate a rule, more
    comprehensive than what GritBot postulated, that I can use to further
    clean the data.

  • We calculate descriptive statistics about the data and visualize before
    starting the modeling process. Discussions with the business owners of
    the data have helped to better understand the quality.  We try to
    understand the complexity of the data by looking at multivariate
    combinations of data values.

  • Training decision trees for each variable given the remaining variables
    allows us to A) replace NULL values, and B) check deviating values (expert).

  • We created an artificial multidimensional definition of outliers and of
    virtual clusters using fuzzy sets and tried to trap dirty data.
    Examination of the trapped data provided clues for writing programs to
    clean specific types of "dirtiness".

  • Being able to visualize data quickly lets us communicate the presence
    of "dirty" data to clients.  Portrait's Uplift modeling includes reports that
    let our analysts know if control groups are *really* as random as clients
    tell us they are (we often discover biases).

  • Working with the different business units, we find that dirty data
    generally is not useless data.  By working through the problem you tend to
    walk away understanding the data set better than if it were accepted as
    clean...just because the data is clean and secure does not mean that
    you fully understand all of the variables and the original intent behind
    collecting the data in the first place.
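
      Several of the responses above describe starting with low-level data
      reports (record counts, key verification, field distributions) and
      flagging missing values.  The following is a minimal sketch of such a
      report in Python, using only the standard library.  The function names,
      the table layout (lists of dicts), and the sample records are all
      illustrative assumptions and do not come from the survey; "widows" are
      interpreted here as parent records with no child records, which is one
      possible reading of the term.

```python
from collections import Counter


def profile_table(rows, key_field):
    """Summarize a table given as a list of dicts (hypothetical layout)."""
    key_counts = Counter(row[key_field] for row in rows)
    fields = sorted({field for row in rows for field in row})
    return {
        "record_count": len(rows),
        # Keys that occur more than once violate uniqueness.
        "duplicate_keys": sorted(k for k, n in key_counts.items() if n > 1),
        # Treat None and the empty string as missing values.
        "missing_counts": {
            field: sum(1 for row in rows if row.get(field) in (None, ""))
            for field in fields
        },
    }


def find_orphans(child_rows, foreign_key, parent_rows, parent_key):
    """Child rows whose foreign key matches no parent row."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]


def find_widows(parent_rows, parent_key, child_rows, foreign_key):
    """Parent rows that no child row refers to (assumed meaning of 'widow')."""
    referenced = {row[foreign_key] for row in child_rows}
    return [row for row in parent_rows if row[parent_key] not in referenced]


def value_distribution(rows, field):
    """Frequency of each value in one field, missing values included."""
    return Counter(row.get(field) for row in rows)


# Made-up sample data, purely for illustration.
customers = [
    {"customer_id": 1, "region": "East"},
    {"customer_id": 2, "region": "West"},
    {"customer_id": 2, "region": ""},    # duplicate key, missing region
    {"customer_id": 3, "region": None},  # missing region
]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 9},  # refers to a customer that does not exist
]

report = profile_table(customers, "customer_id")
orphans = find_orphans(orders, "customer_id", customers, "customer_id")
widows = find_widows(customers, "customer_id", orders, "customer_id")
regions = value_distribution(customers, "region")

print(report)
print("orphan orders:", orphans)
print("customers without orders:", widows)
print("region distribution:", dict(regions))
```

      A report like this is small enough to echo back to the data content
      experts, who can then say which duplicates, orphans, or missing values
      are genuine errors and which reflect a business rule.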

Challenge:  Explaining Data Mining to Others

      Sixty-five of the 735 data miners participating in the 4th Annual Survey
      described their experiences in overcoming this challenge.  Key themes were
      the use of graphics, very simple examples and analogies, and focusing on
      the business impact of the data mining initiative.  Many diverse and detailed
      suggestions were shared (all 65 responses can be seen here).

      Selected data miner survey responders' experiences overcoming this
      challenge:

  • Leveraging "Competing on Analytics" and case studies from other
    organizations helps build the power of the possible.  Taking small
    impactful projects internally and then promoting those projects
    throughout the organization helps adoption.  Finally, serving the data up
    in a meaningful application - BI tool - shows our stakeholders what data
    mining is capable of delivering.

  • Initiate Knowledge Sharing Sessions about DM basics and purposes.

  • Graphical representations are very helpful (e.g., gain or lift charts).

  • The problem is in getting enough time to lay out the problem and
    show the solution.  Most upper management wants short
    presentations but doesn't have the background to just get the results.
    They often don't buy into the solutions because they don't want to see
    the background.  Thus we try to work with their more ambitious direct
    reports, who are more willing to see the whole presentation and, if they
    buy into it, will defend the solution to their immediate superiors.

  • Focus on dollars, overall benefit of model application to the Balance
    Sheet and P&L.

  • Measuring results compared to control groups is the best to convince
    people about data mining results.

  • I've brought product managers (clients) to my desk and had them work
    with me on which analyses were important to them.  That way I was able
    to manipulate the data on the fly based on their expertise to analyze
    different aspects that were interesting to them.

  • Explaining results and their business impact with visual and graphical
    presentations, along with historical trends and variance analysis,
    helps explain business trends in the data to business users.

  • One View of the Truth philosophy.  Definitions of variables are
    consistent across business functions.

  • Visualize and explain models and model spaces.  Explain and interpret
    results.  Show and explain evaluation and significance of results.
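
      Two of the suggestions above, gain or lift charts and comparison
      against control groups, rest on simple arithmetic that can be shown to
      a business audience.  The following is a minimal sketch in Python,
      using only the standard library.  The function names, the binning
      scheme, and the sample scores and response counts are illustrative
      assumptions and do not come from the survey.

```python
def cumulative_gains(scores, outcomes, n_bins=10):
    """Return (fraction_contacted, gain, lift) for each score-ranked bin.

    scores   -- model scores, higher meaning more likely to respond
    outcomes -- 1 for a responder, 0 otherwise, in the same order as scores
    """
    if len(scores) != len(outcomes):
        raise ValueError("scores and outcomes must have the same length")
    # Rank records from highest to lowest score.
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_responders = sum(outcome for _, outcome in ranked)
    if total == 0 or total_responders == 0:
        raise ValueError("need at least one record and one responder")

    table = []
    for b in range(1, n_bins + 1):
        cutoff = (b * total) // n_bins  # records contacted through this bin
        if cutoff == 0:
            continue  # fewer records than bins; nothing contacted yet
        captured = sum(outcome for _, outcome in ranked[:cutoff])
        fraction_contacted = cutoff / total
        gain = captured / total_responders  # share of all responders captured
        # Lift is gain relative to contacting the same share at random.
        lift = (captured * total) / (total_responders * cutoff)
        table.append((fraction_contacted, gain, lift))
    return table


def control_group_uplift(treated_responders, treated_size,
                         control_responders, control_size):
    """Difference in response rate between treated and control groups."""
    return treated_responders / treated_size - control_responders / control_size


# Made-up scores and outcomes, purely for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]

gains_table = cumulative_gains(scores, outcomes, n_bins=5)
for fraction_contacted, gain, lift in gains_table:
    print(f"contacted {fraction_contacted:.0%}: "
          f"captured {gain:.0%} of responders, lift {lift:.2f}")

# Made-up campaign counts, purely for illustration.
uplift = control_group_uplift(120, 1000, 80, 1000)
print(f"uplift over control: {uplift:.1%}")
```

      Plotting gain against the fraction contacted gives the gains chart;
      the lift column states how many times better the model does than
      random selection at each depth.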

Challenge:  Unavailability of Data / Difficult Access to Data

      Forty-six of the 735 data miners participating in the 4th Annual Survey
      described their experiences in overcoming this challenge.  Key themes were
      devoting resources to improving data availability and methods of overcoming
      organizational barriers.  Many diverse and detailed suggestions were shared
      (all 46 responses can be seen here).

      Selected data miner survey responders' experiences overcoming this
      challenge:

  • We continue to work on better, easier, faster ways to access data.  In
    fact, we have employed a data gathering specialist who works full-time
    on data gathering efforts, thus reducing the stress on our statisticians.

  • I usually would confer with the appropriate content experts in order to
    devise a reasonable heuristic to deal with unavailable data, or impute
    variables.  Difficult-to-access data typically means we don't have a good
    plan for what needs to be collected.  I talk with the product managers
    and propose data needs for their business problems.  If we can match
    the business issues with the needs, data access and availability is
    usually resolved.

  • Our best practice is to design and implement a dedicated database
    model for data mining purposes - an Analytical Data Set - that should
    be populated automatically in the defined period.

  • A lot of traveling to the business unit site to work with the direct
    'customer' and local IT...generally put best practices into place after
    cleaning what little data we can find.  Going forward we generally
    develop a project plan around better, more robust data collection.

  • When the data is unavailable, we provide our "second best" model from
    whatever data does exist, and a long disclaimer.  This doesn't really
    improve the current solution, but it's proven very effective in getting
    people to (i) continue to call us and (ii) do a better job of getting the data
    we've asked for.

  • Access to data is addressed at the business team level, since lack of
    data means that their goals cannot be met with our support.

  • A big problem we have is that the groups (government agencies) that
    generate the data we need do not trust others with that data.  They
    fear either being embarrassed by possible oversights they have
    made, or that disclosure of this data will lead to an increased workload
    for them in dealing with requests and queries that result from users
    of the data not understanding its context.  An
    approach for sharing this data that we have seen work is for the
    organizations that hold the data to form an interest group that controls
    access to the data and drives the uses of the data.