Overcoming Data Mining Challenges:
Difficult Access to Data
Copyright (c) 2011 Rexer Analytics. All rights reserved.
In each of the four annual data miner surveys, "Unavailability of Data / Difficult
Access to Data" has been identified as one of the most frequent challenges data
miners face. While completing the 4th Annual Survey (2010), 46 of the 735 data
miners shared their experiences in overcoming this challenge. Below is the full text
of the "best practices" they shared. A summary of data miners' experiences
overcoming the top three data mining challenges is also available.
Challenge: Unavailability of Data / Difficult Access
- We continue to work on better, easier, faster ways to access data. In
fact, we have employed a data gathering specialist who works full-time
on data gathering efforts, thus reducing the stress on our statisticians.
- I usually confer with the appropriate content experts in order to devise a
reasonable heuristic to deal with unavailable data, or to impute variables.
Difficult-to-access data typically means we don't have a good plan for what
needs to be collected. I talk with the product managers and propose data needs
for their business problems. If we can match the business issues with the
needs, data access and availability become much less of a problem.
- Our best practice is to design and implement a dedicated database model for
data mining purposes - an Analytical Data Set - that is populated
automatically at defined intervals.
- A lot of traveling to the business unit site to work with the direct
'customer' and local IT... We generally put best practices into place after
cleaning what little data we can find. Going forward, we develop a project
plan around better, more robust data collection.
- When the data is unavailable, we provide our "second best" model from
whatever data does exist, and a long disclaimer. This doesn't really
improve the current solution, but it's proven very effective in getting
people to (i) continue to call us and (ii) do a better job of getting the data
we've asked for.
- Access to data is addressed at the business-team level, since a lack of data
means that their goals cannot be met with our support.
- A big problem we have is that the groups (government agencies) that generate
the data we need do not trust others with that data. They fear either being
embarrassed by possible oversights they have made, or that disclosure will
increase their workload in dealing with requests and queries from users who
do not understand the data's context. An approach for sharing this data that
we have seen work is for the organizations that hold the data to form an
interest group that controls access to the data and drives its uses.
- Anonymizing all data except for 2 persons with clearance.
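One way to implement that kind of anonymization is a keyed hash of identifiers.
A minimal Python sketch, assuming the key is held only by the cleared staff
(the key, function name, and example identifier are all illustrative):

    import hashlib
    import hmac

    # Illustrative secret; in practice held only by the cleared persons.
    SECRET_KEY = b"known-only-to-cleared-staff"

    def pseudonymize(identifier: str) -> str:
        # A keyed HMAC-SHA256 hash keeps records linkable across tables
        # for mining while hiding the raw identifier from anyone
        # without the key.
        return hmac.new(SECRET_KEY, identifier.encode("utf-8"),
                        hashlib.sha256).hexdigest()

    print(pseudonymize("jane.doe@example.com"))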
- If data are unavailable, look to Census data providers for data that can
be appended based on zip code/postal code. There is a rich source of
data available that often meets or exceeds privacy standards.
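As a sketch of that Census append, assuming a customer extract and a
Census-style lookup table keyed by postal code (file and column names are
hypothetical), a left join in pandas is usually sufficient:

    import pandas as pd

    # Hypothetical files: any customer extract with a postal-code column
    # and a Census-style lookup keyed the same way would work.
    customers = pd.read_csv("customers.csv", dtype={"zip": str})
    census = pd.read_csv("census_by_zip.csv", dtype={"zip": str})

    # A left join keeps every customer row and appends the Census
    # attributes (median income, household size, etc.) where available.
    enriched = customers.merge(census, on="zip", how="left")

    # Unmatched postal codes surface as NaN for downstream imputation.
    print(enriched["median_income"].isna().mean())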
- Get rid of IT guys, they have other things to do! Get your raw (!) data only
one time from IT and then use the ETL options within your mining tool. This
is surprisingly efficient. My Statistica ETL handles 100 million records on a
2-CPU workstation.
- Give motivation to the members of the project from the beginning (on
expected ROI, etc.). Then the data becomes more available!
- One recent eureka moment: Customer geographical location is
important for marketing spend optimization - especially for TV spend.
Our web analytics vendor guesses visitor location from their IP address.
For subscribers, our order processing system gets the billing address. The
knowledge gap is for our pre-subscriber "registrants," for whom we only ask
for an email address. We can get the registrant location by working backwards
in our web visit logs to pick up the IP-based geography.
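A sketch of that backward walk through the logs, assuming three hypothetical
extracts (a registrant table, web visit logs, and the vendor's IP-to-region
lookup; all file and column names are assumptions):

    import pandas as pd

    # Hypothetical extracts; column names are assumptions.
    registrants = pd.read_csv("registrants.csv")  # email, visit_id
    weblogs = pd.read_csv("weblogs.csv")          # visit_id, ip
    ip_geo = pd.read_csv("ip_geo.csv")            # ip, region

    # Walk backwards from the registrant to the visit that captured the
    # email address, then to the geography guessed from that visit's IP.
    located = (registrants
               .merge(weblogs, on="visit_id", how="left")
               .merge(ip_geo, on="ip", how="left"))

    # Registrants whose visits carried no mappable IP remain unlocated.
    print(located["region"].isna().mean())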
- Planning and intelligently implementing ETL or ELT extracts to acquire data
from various source systems.
- There are still some people who have key data in a localized
spreadsheet or database and it is often hard to get them to part with it,
or to get that data to merge in with other data if they will allow you
access. It just takes diplomacy sometimes. The managers need to get
behind creating a more public place to store this data.
- Show value with a small set to help get more.
- Make unavailability of data known to those best able to obtain it.
Simple knowledge of data mining methods is insufficient to support
most projects. Successful project completion requires that the analyst
be an expert in data processing methods, encompassing knowledge of
multiple databases, operating systems/platforms, and tools that can be
used to clean data. In short, the analyst must be an expert in IT
technologies and methods.
- Keep talking to those who have custody of the data until they loosen their
grip.
- We have created dummy data and are able to highlight the importance of
collecting data. Clients get a feel for the data and its importance. We have
done this in a few domains.
- Being backed by a person who does the lobbying.
- Explain the consequences of missing data, like lack of significance of
results and the risk of misleading models and wrong decisions based upon
them.
- Decentralised control with governance and stewardship helped
significantly. Reliance on IT was reduced.
- Explain to customers what kind of data you need and, if possible, ask them
to prepare a sample. Clearly divide projects into phases such that if the
data is not good enough, it is not a problem to re-plan the project or even
cancel it. These scenarios should be clearly explained to the customer up
front.
- Action: Talked to upper corporate management and got their support. Received
data from a side database. Even with a one-day delay, the data served well
for DM purposes.
- Data access is very often an issue - usually not insurmountable, but
essentially caused by the fact that the data being mined was usually
collected for other purposes, and the data architectures were not
designed with data mining in mind.
- This is a difficult one; we're engaging our IT dept to better understand
what we do with data mining and why we need better, faster access to the
data (not just a sample of it, but all of it on our customers) so that they
can deliver the right set of aggregated, cleansed, and correctly
customer-mastered data sets for mining.
- Usually we're stuck with a lag time without data, but sometimes a
meta-analysis of other published work can provide insight. Also, ours and
other medical organizations are expanding into analysis of 'non-traditional'
data feeds, like consumer credit bureau information, which can be related to
a population's health needs.
- Just getting ahead of the curve with clients - making sure we have the
"data discussion" prior to investing much time on a project.
- Buying in data from an external source upon acceptance of its source by the
client.
- We investigate the availability and quality of data during the process of
making a concrete proposal. We include data quality and data
validation checks in our projects.
- Writing specialized nodes with R.
- Creating derived fields is something that we often do and I personally
look at this activity as an integral component of any data mining study.
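For illustration, a few typical derived fields in pandas; the transaction
file, column names, and snapshot date are all assumptions:

    import numpy as np
    import pandas as pd

    # Hypothetical transaction-level extract.
    df = pd.read_csv("transactions.csv", parse_dates=["order_date"])

    # Recency relative to an assumed snapshot date.
    snapshot = pd.Timestamp("2010-12-31")
    df["days_since_order"] = (snapshot - df["order_date"]).dt.days

    # Log-scaled spend to tame skew, plus a customer-level aggregate
    # folded back onto each row to build a ratio feature.
    df["log_amount"] = np.log1p(df["amount"])
    df["customer_total"] = df.groupby("customer_id")["amount"].transform("sum")
    df["share_of_customer_spend"] = df["amount"] / df["customer_total"]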
- Assess data by 'availability vs. volatility' and start working with data that
is readily available.
- Format is often the culprit: the mining tool cannot access the database.
- Requesting IT to bring the fields in.
- SQL Server bridges the gap for us.
- Take advantage of open data access, e.g., wikis and datasets published in
papers.
- We have not found a cure, per se. However, we have learned to quote
all project schedules in terms of time elapsed from receipt of complete
and accurate data.
- With clients very keen on handing off the project and expecting data miners
to solve all problems in one shot, the client is often unaware of how bad
their data is. This can be due to a number of reasons: 1) lack of a good
understanding of the business at the data source, 2) data entry errors,
3) data loading errors due to numeric formats or the number of characters.
- Sometimes data exists in Excel spreadsheets, so it has to be accessed and
merged with Oracle data. This process is time-consuming and prone to errors.
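A sketch of that Excel-to-Oracle merge in pandas; the connection string,
query, and key column are assumptions, and the validate flag is there
precisely because the process is error-prone:

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical Oracle connection and query.
    engine = create_engine(
        "oracle+cx_oracle://user:pwd@dbhost:1521/?service_name=prod")
    oracle_df = pd.read_sql(
        "SELECT cust_id, segment, balance FROM customers", engine)

    # Spreadsheet keys often load as numbers; force text on both sides.
    excel_df = pd.read_excel("local_accounts.xlsx", dtype={"cust_id": str})
    oracle_df["cust_id"] = oracle_df["cust_id"].astype(str)

    # validate="one_to_one" raises immediately if either source carries
    # duplicate keys, catching silent merge errors early.
    merged = excel_df.merge(oracle_df, on="cust_id", validate="one_to_one")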
- Centralized data repository.
- This has not been fully solved. In the UK NHS (National Health Service)
there are many ethics committees which do not permit access to data even
when no possible demonstrable harm to patients has been identified. Given
that for most organisations their data are in effect their organisational
memory, this effectively denies future patients any possible benefit of
learning from the organisation's experience.
- Using a highly available database architecture like Oracle RAC would
prevent unavailability of data, except if the data are not loaded into the
database in the first place.