Overcoming Data Mining Challenges:
Difficult Access to Data
Copyright (c) 2011 Rexer Analytics, All rights reserved

In each of the four annual data miner surveys, "Unavailability of Data / Difficult
Access to Data" has been identified as one of the most frequent challenges data
miners face.  While completing the 4th Annual Survey (2010), 46 of the 735 data
miners shared their experiences in overcoming this challenge.  Below is the full text
of the "best practices" they shared.  A summary of data miners' experiences
overcoming the top three data mining challenges is also available.

Challenge:  Unavailability of Data / Difficult Access to Data

  • We continue to work on better, easier, faster ways to access data.  In
    fact, we have employed a data gathering specialist who works full-time
    on data gathering efforts, thus reducing the stress on our statisticians.

  • I usually would confer with the appropriate content experts in order to
    devise a reasonable heuristic to deal with unavailable data, or impute
    variables.  Difficult-to-access data typically means we don't have a good
    plan for what needs to be collected.  I talk with the product managers
    and propose data needs for their business problems.  If we can match
    the business issues with the needs, data access and availability is
    usually resolved.

  • Our best practice is to design and implement a dedicated database
    model for data mining purposes - an Analytical Data Set - that should
    be populated automatically in the defined period.   

  • A lot of traveling to the business unit site to work with the direct
    'customer' and local IT...generally put best practices into place after
    cleaning what little data we can find.  Going forward we generally
    develop a project plan around better, more robust data collection.

  • When the data is unavailable, we provide our "second best" model from
    whatever data does exist, and a long disclaimer.  This doesn't really
    improve the current solution, but it's proven very effective in getting
    people to (i) continue to call us and (ii) do a better job of getting the data
    we've asked for.

  • Access to data is addressed at the business team level, since lack of data
    means that their goals cannot be met with our support.

  • A big problem we have is that the groups (government agencies) that
    generate the data we need do not trust others with that data because
    they fear either being embarrassed by possible oversights they have
    made, or that disclosure of this data will lead to an increased workload
    for them dealing with requests and queries about the data that result
    from users of the data not understanding the context of the data. An
    approach for sharing this data that we have seen work is for the
    organizations that hold the data to form an interest group that controls
    access to the data and drives the uses of the data.

  • Anonymizing all data except for 2 persons with clearance.    

  • If data are unavailable, look to Census data providers for data that can
    be appended based on zip code/postal code.  There is a rich source of
    data available that often meets or exceeds privacy standards.

  • Get rid of IT guys, they have other things to do!  Get your raw (!) data
    only one time from IT and then use ETL options within your mining
    tool.  This is surprisingly efficient.  My Statistica ETL handles 100
    million records on a 2 CPU workstation.

  • Give motivation to the members of the project, from the beginning (on
    expected ROIs, etc.).  Then the data become more available!

  • One recent eureka moment: Customer geographical location is
    important for marketing spend optimization - especially for TV spend.  
    Our web analytics vendor guesses visitor location from their IP address.
    For subscribers, our order processing system gets billing address.  The
    knowledge gap is for our pre-subscriber "registrants" for whom we only
    ask an email address.  We can get the registrant location by working
    backwards in our web visit logs to pick up the IP based geography.     

  • Planning and intelligently implementing ETL or ELT extract to acquire
    data from various source systems.  

  • There are still some people who have key data in a localized
    spreadsheet or database and it is often hard to get them to part with it,  
    or to get that data to merge in with other data if they will allow you
    access.  It just takes diplomacy sometimes.  The managers need to get
    behind creating a more public place to store this data.

  • Show value with a small data set to help get more.

  • Make unavailability of data known to those best able to obtain it.  
    Simple knowledge of data mining methods is insufficient to support
    most projects.  Successful project completion requires that the analyst
    be an expert in data processing methods, encompassing knowledge of
    multiple databases, operating systems/platforms, and tools that can be
    used to clean data.  In short, the analyst must be an expert in IT
    technologies and methods.  

  • Keep talking to those who have custody of data until they loosen their

  • We have created dummy data and are able to highlight the importance
    of collecting data.  Clients get a feel for the data and its importance.
    We have done it in a few domains.

  • Being backed by a person that does the lobbying.

  • Explain the consequences of missing data, like lack of significance of
    results and risk of misleading models and wrong decisions based upon

  • Decentralised control with governance and stewardship helped
    significantly.  Reliance on IT was reduced.

  • Explain to customers what kind of data you need and, if possible, ask
    them to prepare a sample.  Clearly divide projects into phases such that
    if data is not good enough, it is not a problem to re-plan the project or
    even cancel it.  These scenarios should be clearly explained to

  • Action: Talked to upper corporate management and got their support.
    Received data from a side database.  Even though the data is delayed by
    one day, it served well for DM purposes.

  • Data access is very often an issue - usually not insurmountable, but
    essentially caused by the fact that the data being mined was usually
    collected for other purposes, and the data architectures were not
    designed with data mining in mind.   

  • This is a difficult one; we're engaging our IT dept to better understand
    what we do with data mining and why we need better, faster access to
    the data, and not just a sample of it but really all of it on our
    customers, so that they can deliver the right set of aggregated,
    cleansed, and correctly customer-mastered data sets for mining.

  • Usually we're stuck with a lag time without data, but sometimes a
    meta-analysis of other published work can provide insight.  Also, our
    organization and other medical organizations are expanding into analysis
    of 'non-traditional' data feeds like consumer credit bureau information,
    which can be related to a population's health needs.

  • Just getting ahead of the curve with clients - making sure we have the
    "data discussion" prior to investing much time on a project.   

  • Bootstrapping.

  • Buying in data from external source upon acceptance of its source by

  • We investigate the availability and quality of data during the process of
    making a concrete proposal. We include data quality and data
    validation checks in our projects.  

  • Writing specialized nodes with R     

  • External Data providers.     

  • Creating derived fields is something that we often do and I personally
    look at this activity as an integral component of any data mining study.

  • Assess data by 'availability vs. volatility' and start working with data that
    is readily available.

  • Format is often the culprit: the tool cannot access the database.

  • Requesting IT to bring the fields in.

  • SQL Server bridges the gap for us.    

  • Structuring a PKM system    

  • Take advantage of data open access, e.g., wiki, datasets in paper for

  • We have not found a cure, per se.  However, we have learned to quote
    all project schedules in terms of time elapsed from receipt of complete
    and accurate data.    

  • With the client very keen on handing off the project and expecting data
    miners to solve all problems in one shot, often the client is unaware of
    how bad their data is.  It could be due to a number of reasons:  1) Lack
    of good understanding of the business at the data source  2) Data entry
    error  3) Data loading error due to numeric formats / # of characters.

  • Sometimes data exists in Excel spreadsheets, so it has to be
    accessed and merged with Oracle data.  This process is
    time-consuming and prone to errors.

  • Centralized data repository.  

  • Conduct a new study.  

  • This has not been fully solved. In the UK NHS (National Health
    Service) there are many ethics committees which do not permit access
    to data even when no possible demonstrable harm to patients has been
    identified. Given that for most organizations their data are in effect their
    organisational memory, this effectively denies future patients any
    possible benefit of learning from the organisation's experience.  

  • Using a highly available database architecture like Oracle RAC would
    prevent unavailability of data except if the data are not loaded in the