In the age of machine learning, the more data, the better. Pooling and running models on medical data leads to new insights about genetics, disease, and treatment; combining laboratory results helps scientists more quickly discover new materials for next-generation batteries or superconductors. But many obstacles — technical, legal, and ethical — prevent the open sharing of data between organizations or research groups.
Data Stations, a new project from University of Chicago researchers, will attempt to eliminate these hurdles by reversing the flow of big data science. Instead of the arduous process of tracking down the right dataset, receiving permission to use it, and then downloading and working with it on their own computational resources, researchers can simply query a Data Station, where all of these steps are automated out of the view of users. Data providers control how their data can be used or combined, protecting sensitive information and intellectual property.
“The Data Station is a radically new approach that is needed to change how people and organizations think about, access, and use data,” said Ian Foster, Arthur Holly Compton Distinguished Service Professor in the UChicago Department of Computer Science, Distinguished Fellow and Senior Scientist at Argonne National Laboratory, and principal investigator of the project. “This platform will ease access to sensitive data, assist with data discovery and integration, and facilitate data governance and compliance across fields of inquiry.”
The project was funded by a $1 million grant from the National Science Foundation as part of its Convergence Accelerator program, which boosts collaborations between academia and industry. Other investigators on the project include Michael J. Franklin and Raul Castro Fernandez of UChicago Computer Science and Sendhil Mullainathan of the Booth School of Business.
A Neutral Zone for Data
Imagine a large-scale medical study that wants to test the effectiveness of prospective COVID-19 treatments in patients of different ages, comorbidity statuses, and racial backgrounds. A single hospital system may have access to data from thousands of patients, but a fully powered study might require hundreds of thousands or millions of individuals, necessitating the combination of data from multiple sources.
Yet health care organizations are understandably hesitant about sharing data. Legal restrictions on the use of private medical data without consent, concerns about the misuse or leaking of sensitive information on health status or race, and even competitiveness can throw up insurmountable barriers to pooling data for research.
Data Stations solve this problem by offering a “neutral zone” where data is shared but sealed, so that users cannot see, access, or download the original datasets, viewing only a broad catalog of what data is available. Users query the collected data with “data-unaware task capsules” (for example, asking about the effectiveness of treatments across patient demographics), and the Data Station automatically does the rest: finding the right data, combining it or using it to train necessary AI models, and providing the user with their answer without disclosing the underlying raw data.
Data providers can also finely control what parts of their dataset can be used and for which tasks, either by pre-setting what is and isn’t allowed or by manually reviewing requests that would draw upon their data. Data Stations automatically track what is done with each dataset, so that providers can see how their contributions are used and users can properly cite their sources.
“The Data Stations architecture opens up many different opportunities that allow people to publish data with more control of not just which data can be used, but how the data is being used,” said Raul Castro Fernandez, assistant professor of computer science at UChicago and co-PI on the project. “This is a fundamental piece that we are going to need if we want to really share data to get the value out of it without falling into the many pitfalls that exist in the way.”
To smoothly achieve these goals, the Data Station platform will be based on a foundation of software created by UChicago researchers, including Aurum for data discovery, Globus for authentication, and DLHub for the sharing of machine learning models.
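To make the sealed-query idea in this section concrete, here is a minimal Python sketch. It is not the Data Station software and does not use Aurum, Globus, or DLHub; the class `ToyDataStation`, its methods, the hospital names, and the sample records are all hypothetical, chosen only to illustrate three behaviors described above: users see a catalog rather than rows, providers pre-approve which tasks may touch their data, and every use is logged.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List, Set, Tuple

Record = Dict[str, float]
Task = Callable[[List[Record]], float]


@dataclass
class Dataset:
    provider: str
    records: List[Record]
    allowed_tasks: Set[str]  # task names the provider has pre-approved


@dataclass
class ToyDataStation:
    """Hypothetical in-memory stand-in, not the real platform."""
    _datasets: Dict[str, Dataset] = field(default_factory=dict)
    # Each entry is (user, task name, dataset name).
    audit_log: List[Tuple[str, str, str]] = field(default_factory=list)

    def register(self, name: str, dataset: Dataset) -> None:
        self._datasets[name] = dataset

    def catalog(self) -> Dict[str, List[str]]:
        # Users see dataset names and column names only, never the rows.
        return {
            name: sorted(ds.records[0]) if ds.records else []
            for name, ds in self._datasets.items()
        }

    def run(self, user: str, task_name: str, task: Task) -> float:
        # Pool only the datasets whose provider allows this task,
        # and record which datasets contributed to the answer.
        pooled: List[Record] = []
        for name, ds in self._datasets.items():
            if task_name in ds.allowed_tasks:
                pooled.extend(ds.records)
                self.audit_log.append((user, task_name, name))
        if not pooled:
            raise PermissionError(f"no dataset permits task {task_name!r}")
        return task(pooled)


station = ToyDataStation()
station.register("hospital_a", Dataset(
    "Hospital A",
    [{"age": 70.0, "recovered": 1.0}, {"age": 50.0, "recovered": 0.0}],
    {"recovery_rate"},
))
station.register("hospital_b", Dataset(
    "Hospital B",
    [{"age": 30.0, "recovered": 1.0}],
    {"recovery_rate", "mean_age"},
))

# Both hospitals allow this task, so their records are pooled.
rate = station.run(
    "alice", "recovery_rate",
    lambda rows: mean(r["recovered"] for r in rows),
)

# Only hospital_b allows this task, so hospital_a's rows are excluded.
avg_age = station.run(
    "alice", "mean_age",
    lambda rows: mean(r["age"] for r in rows),
)
```

Note that this sketch hands the raw rows to the task function, so nothing stops a task from returning them. A real system would have to run tasks in isolation and vet their outputs; the sketch leaves that out entirely.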
Building a Data Marketplace
In addition to facilitating data sharing and discovery, Data Stations also enable the emerging concept of the data marketplace, where data generators from social media users up to large research organizations are compensated for the value of their data contributions. For example, the ability of Data Stations to track what data is used for various tasks allows for proper distribution of authorship and financial rewards should a study result in a patent or commercial opportunity.
Furthermore, Data Stations will also provide incentives — monetary or otherwise — for the generation and sharing of new datasets. If a user submits a query that cannot be answered with the data currently available, they might post a bounty for the necessary missing data to be collected or provided. Or if a query cannot be completed through automated processes alone, incentives could be provided for dataset providers or independent specialists to perform the manual tasks needed to produce the answer.
“Data Stations introduce incentive mechanisms to motivate data contributors and data users to treat data as a valuable asset and help solve data problems when the technical solution is insufficient,” said Franklin, Liew Family Chair of Computer Science at UChicago and co-PI on the project. “Because data and computation are centralized at the station, we know the supply and the demand, and we can incentivize humans to share data and concentrate their effort where it matters most: assisting with discovery and integration tasks.”
Initial partners on the Data Station project include Nightingale, a non-profit based at Chicago Booth's Center for Applied AI that works with multiple U.S. health systems on sharing medical data for research; the Network for Computational Nanotechnology Nanomanufacturing Node (nanoMFG), centered at the University of Illinois at Urbana-Champaign; global financial services firm Morningstar; and manufacturer 3M. These collaborating organizations will provide use cases and prototype early versions of Data Station in their own operations.
A second Convergence Accelerator grant was awarded to a team including Nick Feamster of UChicago CS and Lior Strahilevitz of UChicago Law to study “AI-Enabled, Privacy-Preserving Information Sharing for Securing Network Infrastructure.” Read more about the NSF Convergence Accelerator 2020 projects here.