Overcoming Data Mining Challenges:
Difficult Access to Data
In each of the four annual data miner surveys, "Unavailability of Data / Difficult Access to Data" has been identified as one of the most frequent challenges data miners face. While completing the 4th Annual Survey (2010), 46 of the 735 data miners shared their experiences in overcoming this challenge. Below is the full text of the "best practices" they shared. A summary of data miners' experiences overcoming the top three data mining challenges is also available.
Challenge: Unavailability of Data / Difficult Access to Data
- We continue to work on better, easier, faster ways to access data. In fact, we have employed a data gathering specialist who works full-time on data gathering efforts, thus reducing the stress on our statisticians.
- I usually would confer with the appropriate content experts in order to devise a reasonable heuristic to deal with unavailable data, or impute variables. Difficult-to-access data typically means we don't have a good plan for what needs to be collected. I talk with the product managers and propose data needs for their business problems. If we can match the business issues with the needs, data access and availability is usually resolved.
- Our best practice is to design and implement a dedicated database model for data mining purposes - an Analytical Data Set - that should be populated automatically in the defined period.
- A lot of traveling to the business unit site to work with the direct 'customer' and local IT...generally put best practices into place after cleaning what little data we can find. Going forward we generally develop a project plan around better, more robust data collection.
- When the data is unavailable, we provide our "second best" model from whatever data does exist, and a long disclaimer. This doesn't really improve the current solution, but it's proven very effective in getting people to (i) continue to call us and (ii) do a better job of getting the data we've asked for.
- Access of data is addressed at business teams level, since lack of data means that their goals cannot be met with our support.
- A big problem we have is that the groups (government agencies) that generate the data we need do not trust others with that data because they fear either being embarrassed by possible oversights they have made, or that disclosure of this data will lead to an increased workload for them dealing with requests and queries about the data that results from users of the data not understanding the context of the data. An approach for sharing this data that we have seen work is for the organizations that hold the data to form an interest group that controls access to the data and drives the uses of the data.
- Anonymizing all data except for 2 persons with clearance.
- If data are unavailable, look to Census data providers for data that can be appended based on zip code/postal code. There is a rich source of data available that often meets or exceeds privacy standards.
- Get rid of IT guys, they have other things to do! Get your raw (!) data only one time from IT and then use ETL options within your mining tool. This is surprisingly efficient. My Statistica ETL handles 100 million records on a 2 CPU workstation.
- Give motivation to the members of the project, from the beginning (on expected ROIs, etc.). Then the data become more available!
- One recent eureka moment: Customer geographical location is important for marketing spend optimization - especially for TV spend. Our web analytics vendor guesses visitor location from their IP address. For subscribers, our order processing system gets billing address. The knowledge gap is for our pre-subscriber "registrants" for whom we only ask an email address. We can get the registrant location by working backwards in our web visit logs to pick up the IP based geography.
- Planning and intelligently implementing ETL or ELT extract to acquire data from various source systems.
- There are still some people who have key data in a localized spreadsheet or database and it is often hard to get them to part with it, or to get that data to merge in with other data if they will allow you access. It just takes diplomacy sometimes. The managers need to get behind creating a more public place to store this data.
- Show value with small set to help get more.
- Make unavailability of data known to those best able to obtain it. Simple knowledge of data mining methods is insufficient to support most projects. Successful project completion requires that the analyst be an expert in data processing methods, encompassing knowledge of multiple databases, operating systems/platforms, and tools that can be used to clean data. In short, the analyst must be an expert in IT technologies and methods.
- Keep talking to those who have custody of data until they loosen their grip.
- We have created dummy data and are able to highlight the importance of collecting data. Clients get a feel for the data and its importance. We have done it in a few domains.
- Being backed by a person that does the lobbying.
- Explain the consequences of missing data, like lack of significance of results and risk of misleading models and wrong decisions based upon them.
- Decentralised control with governance and stewardship helped significantly. Reliance on IT was reduced.
- Explain to customers what kind of data you need and, if possible, ask them to prepare a sample. Clearly divide projects into phases such that if data is not good enough, it is not a problem to re-plan the project or even cancel it. These scenarios should be clearly explained to customers.
- Action: Talked to upper corporate management and got their support. Received data from a side database. Even though the data arrives with a one-day delay, it served well for DM purposes.
- Data access is very often an issue - usually not insurmountable, but essentially caused by the fact that the data being mined was usually collected for other purposes, and the data architectures were not designed with data mining in mind.
- This is a difficult one; we're engaging our IT dept to better understand what we do with data mining and why we need better, faster access to the data, and not just a sample of it but really all of it on our customers, so that they can deliver the right set of aggregated, cleansed, and correctly customer-mastered data sets for mining.
- Usually we're stuck with a lag time without data, but sometimes a meta-analysis of other published work can provide insight. Also, our organization and other medical organizations are expanding their analysis of 'non-traditional' data feeds like consumer credit bureau information, which can be related to a population's health needs.
- Just getting ahead of the curve with clients - making sure we have the "data discussion" prior to investing much time on a project.
- Buying in data from external source upon acceptance of its source by clients.
- We investigate the availability and quality of data during the process of making a concrete proposal. We include data quality and data validation checks in our projects.
- Writing specialized nodes with R
- External Data providers.
- Creating derived fields is something that we often do and I personally look at this activity as an integral component of any data mining study.
- Assess data by 'availability vs. volatility' and start working with data that is readily available.
- Format is often the culprit: the tool cannot access the database.
- Requesting IT to bring the fields in.
- SQL Server bridges the gap for us.
- Structuring a PKM system
- Take advantage of data open access, e.g., wiki, datasets in paper for download.
- We have not found a cure, per se. However, we have learned to quote all project schedules in terms of time elapsed from receipt of complete and accurate data.
- With the client very keen on handing off the project and expecting data miners to solve all problems in one shot, the client is often unaware of how bad their data is. It could be due to a number of reasons: 1) Lack of good understanding of the business at the data source 2) Data entry error 3) Data loading error due to numeric formats / # of characters.
- Sometimes data exists in Excel spreadsheets, so that has to be accessed and merged with Oracle data. This process is time-consuming and prone to errors.
- Centralized data repository.
- Conduct a new study.
- This has not been fully solved. In the UK NHS (National Health Service) there are many ethics committees which do not permit access to data even when no possible demonstrable harm to patients has been identified. Given that for most organizations their data are in effect their organisational memory, this effectively denies future patients any possible benefit of learning from the organisation's experience.
- Using a high-availability database architecture like Oracle RAC would prevent unavailability of data, unless the data are not loaded in the database.
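
One respondent above describes recovering registrant location by working backwards through web visit logs to pick up the IP-based geography. A minimal sketch of that idea in Python follows. It assumes that each registration can be tied to a visitor ID (for example, a cookie) that also appears in the visit log; the survey response does not say how the link is made, and all field names and sample rows here are hypothetical.

```python
from datetime import datetime

# Hypothetical visit log rows: (visitor_id, visit time, geography guessed from IP).
visits = [
    ("v1", datetime(2010, 3, 1, 9, 0), "Boston, MA"),
    ("v1", datetime(2010, 3, 5, 18, 30), "Boston, MA"),
    ("v2", datetime(2010, 3, 2, 12, 0), "Austin, TX"),
    ("v2", datetime(2010, 3, 9, 8, 15), "Denver, CO"),
]

# Hypothetical registrations: email -> (visitor_id, registration time).
# Assumption: a visitor ID links the registrant to earlier visits.
registrations = {
    "a@example.com": ("v1", datetime(2010, 3, 5, 18, 45)),
    "b@example.com": ("v2", datetime(2010, 3, 3, 10, 0)),
    "c@example.com": ("v3", datetime(2010, 3, 4, 10, 0)),
}


def registrant_locations(visits, registrations):
    """Map each email to the geography of the latest visit at or before
    registration, or None when no such visit is in the log."""
    # Group visits by visitor so each registrant is looked up once.
    by_visitor = {}
    for visitor_id, visited_at, geo in visits:
        by_visitor.setdefault(visitor_id, []).append((visited_at, geo))

    result = {}
    for email, (visitor_id, registered_at) in registrations.items():
        prior = [
            (visited_at, geo)
            for visited_at, geo in by_visitor.get(visitor_id, [])
            if visited_at <= registered_at
        ]
        # Work backwards: take the most recent visit before registering.
        result[email] = max(prior, key=lambda row: row[0])[1] if prior else None
    return result


locations = registrant_locations(visits, registrations)
for email, geo in locations.items():
    print(email, geo)
```

Registrants with no matching visit come back as None rather than a guessed location, so the remaining gap stays visible instead of being silently filled.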
© 2017 Rexer Analytics. All Rights Reserved.
30 Vine Street
Winchester, MA 01890