Strategic Planning Workgroup on Big Data

Workgroup Proposal

Introduction

Big Data is a technical term describing the rapid growth and availability of complex data (both structured and unstructured) whose management exceeds the capacity of traditional data processing resources. It is generally characterized by the “Three Vs”: high volume, high velocity, and high variety (occasionally with additional Vs: high veracity and high value). Collection of Big Data is scaling at an unprecedented rate. Data now stream from daily life: from credit cards and televisions, computers and social media, bio- and motion-sensors, GPS, and other data-capturing devices, including those embedded in smart phones and ‘wearable technology’ such as smart watches. Ecological Momentary Assessments are regularly combined with these and other sensor and GPS data, and Geolocation Momentary Assessment (GMA) has been successfully implemented as a rapidly expanding tool in substance abuse research. Data collected from rapidly developing technologies in various biomedical and behavioral fields, including next-generation sequencing, epigenetics, genomics, epidemiology, neuroimaging, and organizational and services research, are also growing fast, giving researchers the unprecedented ability to capitalize on vast amounts of data. These technological advances, however, all face the common challenge of turning complex information into usable and manageable data.

New tools and technologies to capture, process, share, and store Big Data are both goals and challenges, as data are being produced at a rate that outpaces the development of storage technologies. Accessing and using Big Data across a plethora of format types presents a clear challenge, and the sheer volume of Big Data makes data transport a resource-intensive feat even with the fastest communications networks. The complexity of Big Data imposes enormous computational and resource challenges. Keeping Big Data secure and private is likewise challenging in an era of system intrusions and soft espionage. Combining data from various sources and formats requires implementing existing data standards to the degree possible, which can be achieved through common data elements, shared ontologies, and data dictionaries. Curation and analysis of data, analysis tools (including machine learning and artificial intelligence techniques), and data visualization all pose challenges that need to be addressed. NIDA should not only be cognizant of, but also be prepared for, these challenges and take advantage of the resulting opportunities over the next 5-10 years. In this proposal, we consider the opportunities and challenges of Big Data as they pertain to NIDA. The appendix contains a partial list of relevant resources.

Priority 1: Big Data Sharing

Data sharing is an essential and complex component of leveraging Big Data. Harnessing the large quantities of data generated worldwide has numerous methodological, ethical, and economic advantages, but realizing these benefits requires the neuroscience community to adopt a culture of data sharing and open access.

Challenges and Opportunities

Key issues for sharing Big Data include providing data according to the “FAIR” principles (Findable, Accessible, Interoperable, and Reusable; https://force11.org/groups/force11-rda-fairsharing-working-group/), as well as distribution and aggregation mechanisms, storage, data security, and the privacy of research subjects. To maximize research investment and value, it is critical to capitalize on the potential of Big Data using the FAIR principles:

  • Findable: To be useful, Big Data must be easily and efficiently searchable. There are novel models for enhancing the discoverability of data in the public and private sectors, of which NIDA should take advantage. The NIH Big Data to Knowledge (BD2K) project, which is developing a pilot solution (the Data Discovery Index, DDI), may present NIDA with a great opportunity in this regard. Once data sets are found, it should be possible to interrogate them for various scientific applications (e.g., specific sets of genes, chromatin modifications, brain regions, etc.).
  • Accessible: Management of multiple credentials may deter users, but this can be resolved through consolidation of credentials across data sources. NIDA should take advantage of activities in the extramural community that aim to improve the consent process, including the possible development of a “universal consent”. Security and privacy are extremely important issues; the extant NIH security regulations appear to suffice to secure the data, although constant vigilance is required. Regarding privacy, NIDA should consider solutions proven successful for various data sources (e.g., “defacing” structural imaging data before open-access release). Accessibility also encompasses persistent storage, authenticated access, and usability, all of which depend on rich metadata and related infrastructure.
  • Interoperable: A key challenge to interoperability is the identification of common data elements (CDEs; see Glossary), required at the level of the data element itself as well as for the associated clinical data. Effective development and identification of CDEs requires expertise in the relevant research domains. Unfortunately, the current lack of consensus on optimal data elements for various domains of cognitive, behavioral, and psychiatric function presents a significant challenge: it fosters continued duplication of effort in the creation of assessment instruments and nomenclatures, and it prevents data from different studies from being aggregated without tedious data-mapping efforts. To this end, securing the support of the user community and defining its role in the process is paramount to success. NIDA needs to be at the forefront of efforts to harmonize CDEs. A great resource for NIDA is the extant NIH CDE repositories and other similar resources (see Appendix Section 1).
  • Reusable: Reusability generally follows from meeting the first three FAIR requirements, plus assurance that the data are sufficiently well described, carry the necessary richness of metadata and provenance, and are standardized enough to be used in future research, preferably with minimal human effort.
    • As a first step in this effort, NIDA is developing the NIDA “Addictome” Portal, which will provide data coordination, visualization, and analysis tools that the scientific community can use to mine and visualize multi-scale data sets in a user-friendly 4D framework. This portal moves NIDA toward creating a Big Data resource generated through investigator-initiated studies, enabling data mining and the identification of emergent opportunities across seemingly disparate data sets. The Addictome Portal is a very promising platform that can increase efficiency and provide a process by which NIDA can adopt Big Data science as an integrated component of its research portfolio.
    • Another challenge in sharing Big Data is transfer time and cost. Single-site storage and analysis platforms, such as the NIH Commons, which utilizes compliant cloud services, may be one approach to mitigating this issue. Cloud computing minimizes data transfer cost and time through colocation of data and analysis software and through pre-packaged computing environments that facilitate use by researchers. Complementing this solution are mechanisms that limit the need for data transfer altogether (e.g., federated solutions and the sharing of intermediate/processed forms of the data; see the sketch after this list). Each of these models has its relative advantages, and each merits exploration.
    • Additionally, there is a need to create a culture that values data sharing through incentives and impactful attribution, as has occurred in the genetics community. NIDA should support practices and approaches that encourage placing equal value on citations of an investigator’s shared data and citations of that investigator’s research articles. For example, (1) pilot programs such as the “Commons Credits” provided through BD2K should be evaluated for efficacy, (2) NIDA should support extant plans for data citation and data sharing at NIH and elsewhere, and (3) NIDA should encourage the education of NIH study sections and university promotion committees as to the considerable value of data depositions and citations.
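
To make the federated approach above concrete, the following minimal Python sketch illustrates a toy mean-aggregation protocol with hypothetical site names (not an existing NIDA or NIH system): each site computes summary statistics locally, and only those aggregates, never row-level data, cross institutional boundaries.

```python
# Minimal sketch of a federated analysis: each site computes a local
# summary and shares only aggregates, so row-level data never move.
# Site names and values are hypothetical illustrations.

from typing import Dict, List, Tuple

def local_summary(values: List[float]) -> Tuple[int, float]:
    """Computed at each site: sample size and sum only."""
    return len(values), sum(values)

def federated_mean(site_data: Dict[str, List[float]]) -> float:
    """Aggregator combines per-site summaries into a global mean."""
    total_n, total_sum = 0, 0.0
    for values in site_data.values():
        n, s = local_summary(values)   # runs locally at each site
        total_n += n                   # only (n, sum) cross the wire
        total_sum += s
    return total_sum / total_n

if __name__ == "__main__":
    sites = {
        "site_a": [2.1, 3.4, 2.8],   # hypothetical measurements
        "site_b": [3.0, 2.6],
        "site_c": [2.9, 3.1, 3.3, 2.7],
    }
    print(f"Federated mean: {federated_mean(sites):.3f}")
```

The same pattern generalizes to any analysis whose sufficient statistics can be computed locally, which is the design rationale behind federated solutions.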

Priority 2: Big Data Capture & Formats

Big Data capture requires research workflows and complementary technologies that allow data types and formats to be recorded by investigators in a form that enables subsequent re-analysis and integration with other data. The following considerations regarding data capture and formatting are key: capture tools (hardware and software); data type, format, and descriptive metadata; intended data use; and the ability to adapt to new analysis opportunities unforeseen at the time of data capture.

Challenges and Opportunities

The use of common data formats and elements, ontologies, data dictionaries, and public application programming interfaces (APIs) is extremely important, especially as available datasets expand and increase in complexity.

  • NIDA should leverage the existing interoperability resources at NIH and in extramural communities, and engage in developing its own formats, CDEs, or ontologies only where suitable ones do not already exist. As the goal of Big Data is collaboration, NIDA is encouraged to participate in and contribute to trans-NIH efforts, including the Trans-NIH Biomedical Informatics Coordinating Committee (BMIC) CDE Working Group and the BD2K Standards Coordinating Center (SCC).
  • Data formats and capture methods need thoughtful consideration because they evolve rapidly. Ideally, all data should be formatted and annotated so that researchers can use them effectively and efficiently with minimal effort. To facilitate natural language processing, machine learning, and other techniques for streamlined, automated processing, data producers must employ standardized data and metadata annotation, provided alongside the dataset (see the first sketch after this list).
  • NIDA should facilitate and promote the use of open formats (vs. proprietary or closed formats) and the sharing and exchange of data (see Appendix Section 2: The NeuroData Without Borders Initiative). NIDA should also provide researchers with resources and technologies to discover and use these formats and APIs. It is also imperative to provide resources for training investigators to incorporate effective data capture into their research workflows.
  • NIDA must also consider Big Data emerging from technological developments such as electronic health record (EHR) systems and social media. Interoperable EHRs will become an important means of capturing and standardizing clinical research data, which can be merged with, e.g., genome sequence data. EHR data from the nation’s healthcare systems provide opportunities to ascertain demographic, co-morbid, and complex phenotypes, including substance use disorders. These systems enable Big Data capture involving large numbers of patients whose records can be aggregated through interoperable EHRs and Big Data science methods. This synthesis would enable NIDA researchers to correlate genotypic data with clinical phenotypic data (see the second sketch after this list) and accelerate Big Data analyses to inform Precision Medicine. It would also better integrate drug abuse efforts within primary care settings, where most screening and initial intervention is needed. NIDA is also encouraged to support advanced artificial intelligence solutions for analyzing clinical and non-clinical semi-structured or unstructured data relevant to NIDA’s mission. Finally, social media is emerging as a promising data source for epidemiological research, and the methods and ethics of capturing and utilizing these data are quickly evolving.
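
As an illustration of the standardized metadata annotation discussed above, the following minimal sketch writes and reads a machine-readable JSON “sidecar” alongside a dataset. The field names and identifiers are hypothetical, loosely echoing conventions such as BIDS sidecar files rather than any mandated NIDA schema.

```python
# Minimal sketch of a machine-readable metadata "sidecar" shipped with a
# dataset. Field names and identifiers are hypothetical, loosely
# following BIDS-style JSON sidecar conventions.
import json

metadata = {
    "Name": "example_ema_study",          # hypothetical dataset name
    "DataType": "ecological_momentary_assessment",
    "Format": "csv",
    "CommonDataElements": ["PhenX:030601", "PhenX:030701"],  # illustrative IDs
    "Provenance": {
        "CollectedBy": "Example University",
        "Instrument": "smartphone_ema_app_v2",
        "ProcessingPipeline": "ema-clean v1.3",
    },
    "License": "CC-BY-4.0",
}

# Write the sidecar next to the data file so tools can discover it.
with open("example_ema_study.json", "w") as f:
    json.dump(metadata, f, indent=2)

# Any downstream tool can now parse the annotation automatically.
with open("example_ema_study.json") as f:
    parsed = json.load(f)
print(parsed["CommonDataElements"])
```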
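
As a second illustration, the sketch below shows, with entirely hypothetical participant IDs, fields, and variant calls, how EHR-derived phenotypes might be linked to genotype data via a shared de-identified identifier to produce analysis-ready records.

```python
# Minimal sketch of linking EHR-derived phenotypes with genotype calls
# by a shared (de-identified) participant ID. All IDs, fields, and
# values are hypothetical illustrations.
ehr_phenotypes = {
    "P001": {"diagnosis": "opioid use disorder", "age": 34},
    "P002": {"diagnosis": "none", "age": 29},
}

genotypes = {
    "P001": {"OPRM1_rs1799971": "AG"},   # illustrative variant call
    "P002": {"OPRM1_rs1799971": "AA"},
}

# Join the two sources on participant ID to build analysis-ready records.
linked = {
    pid: {**ehr_phenotypes[pid], **genotypes.get(pid, {})}
    for pid in ehr_phenotypes
}

for pid, record in linked.items():
    print(pid, record)
```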

Priority 3: Data Curation, Storage, Analytics and Visualization

Data curation, storage, analytics, and visualization resources are critical for quality control and for maximizing data use and reuse. These topics relate to the user experience and often involve graphical user interfaces (GUIs). The quality of the user’s experience in retrieving and using data from repositories predicts the success of any data sharing initiative. Consultation with library scientists who have expertise in data curation is recommended.

Challenges and Opportunities

  • Curation: Data curation, including monitoring data quality and the introduction of error, is essential to ensuring the quality of the data, and thus the quality of the user experience in sharing and using data (a minimal automated quality-check sketch follows this list). Curation failure is one of the largest threats to the sustainability of current data sharing initiatives. To counteract the threat of “garbage in, garbage out”, it is essential to use mechanisms that support the sustainability of databases in general; these include subscription fees, pay-per-use, “freemium” models (where users pay to upgrade to an extended service), and government support. Curation requires additional personnel; however, the aforementioned funding mechanisms may help to underwrite its cost.
  • Storage: Large storage capacity is needed to accommodate the growth of primary, pre-processed, and post-processed data as well as analyzed results and related software. Different models, such as federated networks, centralized archives, and cloud-based or yet-to-emerge dynamic storage systems, should also be considered. Storage solutions from other domains may be instructive for NIDA. Understanding usage patterns, raw capacity, bandwidth needs, and cyclical demands (and the extent to which different types of data are utilized) will inform how to store data and for how long. The format and accessibility of data may be modified to reduce storage costs as data progress from highly utilized to obsolete. The private sector has been dealing with these issues longer and at greater scale, and its solutions should be considered.
  • Understanding data usage can inform the types and formats of data to store and maintain, and for how long. To this end, tracking data usage in big systems can help anticipate users’ current and future analytical needs. However, such needs will always be changing: as technologies change, data collection methods should remain fluid to permit the capture of any relevant data type. Long-term active management and curation of any dataset will be essential to keeping it relevant and useful.
  • It is important to note that the generation of consensus metrics and guidelines for data quality is a prerequisite for data curation. Several fields (e.g., neuroimaging) have yet to converge on such a consensus, limiting the prospects for effective data curation at present. Future investment in the determination of consensus metrics and guidelines is essential.
  • Analytics: To maximize value, analytics that cross scientific disciplines, data types, and levels of analysis are paramount. For example,
    • Future researchers may find value in the combined analysis of imaging, genetic, and behavioral data from the same individual. Overlaying sequencing data across species is another valuable analysis.
    • There is a need for more than the creation of a data library: tools are needed that allow access to diverse but related datasets, from different researchers, for the purposes of alternative and entirely new kinds of analyses.
    • Methods need to be developed to understand and model complex, high-dimensional data, such as those that will emerge from large, complex studies like the PATH and ABCD studies. Such datasets will require constant curation, attention, and annotation.
    • To mitigate some of the difficulties in Big Data transfer, the data storage resource can also provide computational functionality via standardized virtual machines with common analysis tools and pipelines. Computational tools that can operate in a distributed fashion, including on users’ devices, while maintaining data privacy could be especially useful.
  • Visualization: More advanced techniques and tools would improve our ability to visualize data rapidly. Data analysis and visualization techniques are currently being developed in disparate fields, but there are many opportunities for analytic advances in one field to be applied to others. To encourage development in this area, NIDA should:
    • Encourage the development and use of sophisticated machine learning tools and techniques, such as hierarchical aggregation, for viewing large and complex data (a minimal aggregation sketch also follows this list).
    • Engage in collaboration with experts who may not be part of the NIH community (e.g., video game developers, data visualization experts, and behavioral and social science researchers).
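
To illustrate the kind of automated screening that curation entails, the following minimal sketch (hypothetical field names and acceptance ranges, not an actual repository’s rules) rejects records with missing or out-of-range values before they enter a shared resource.

```python
# Minimal sketch of automated curation checks: screen incoming records
# for missing or out-of-range values before deposit. Field names and
# acceptance ranges are hypothetical.
from typing import Dict, List

RULES = {
    "age": (12, 90),             # acceptable inclusive range, illustrative
    "sessions_completed": (0, 52),
}

def validate(record: Dict[str, float]) -> List[str]:
    """Return a list of curation errors; an empty list means the record passes."""
    errors = []
    for field, (lo, hi) in RULES.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not lo <= record[field] <= hi:
            errors.append(f"{field}={record[field]} outside [{lo}, {hi}]")
    return errors

if __name__ == "__main__":
    incoming = [
        {"age": 24, "sessions_completed": 10},   # passes
        {"age": 150, "sessions_completed": 3},   # out-of-range age
        {"sessions_completed": 5},               # missing age
    ]
    for rec in incoming:
        problems = validate(rec)
        print("OK" if not problems else f"REJECT: {problems}")
```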
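
To make hierarchical aggregation concrete, the sketch below (illustrative data and bin counts, standard library only) reduces a million points to a handful of bin summaries at coarse resolution, refining only where a viewer zooms in; this is the basic trick that lets large, complex data be visualized interactively.

```python
# Minimal sketch of hierarchical aggregation for visualization: bin a
# large 1-D series at progressively finer resolution so a viewer can
# render coarse summaries first and refine only where the user zooms.
# Data and bin counts are illustrative.
import random

def aggregate(values, n_bins):
    """Reduce values to n_bins (count, mean) summaries."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0
    bins = [[] for _ in range(n_bins)]
    for v in values:
        i = min(int((v - lo) / width), n_bins - 1)
        bins[i].append(v)
    return [(len(b), sum(b) / len(b)) if b else (0, None) for b in bins]

if __name__ == "__main__":
    data = [random.gauss(0, 1) for _ in range(1_000_000)]
    # Coarse level for the overview, finer level when the user zooms in.
    for level, n_bins in enumerate([8, 64], start=1):
        summary = aggregate(data, n_bins)
        print(f"level {level}: {n_bins} bins, first bin = {summary[0]}")
```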

Summary and Recommendations

To maximize the potential of Big Data, scientists and users from diverse areas need to be able to find data easily and to use them in new ways. Integrating data across the continuum from basic research to health care, including Big Data science, is critical to advancing the Institute of Medicine's vision of a "Learning Health Care System" and that of the President’s Precision Medicine Initiative. The Big Data Working Group endorses Big Data science as an area of high priority for NIDA to pursue, and believes that the NIDA Addictome provides a model process for Big Data implementation. The Big Data Working Group reiterates the following recommendations:

  • NIDA should adhere to widely used practices and existing technologies and resources, such as common data elements, formats, data dictionaries, and ontologies, and create new ones only when none exist or when existing ones insufficiently address the requirements of substance abuse research.
  • NIDA should actively track the options and solutions that the NIH BD2K project plans to make available, such as the NIH Commons and the Data Discovery Index (DDI).
  • NIDA should take advantage of activities in the extramural and scientific research communities that strongly impact the conduct of research, such as practices and approaches toward a “universal consent”, or efforts that encourage placing high value on citations of an investigator’s shared data as well as citations of research articles.
  • NIDA needs to be at the forefront of efforts to develop platforms that allow data from diverse sources to be easily integrated, and that yield data that are replicable, validated, standardized, and repurposable for future research. The Addictome is a critical model in addressing these needs.

Appendix: Resources

Section 1: Data Sharing

Some of the available resources are:

  • NIH Commons and commercial cloud services
  • NIH Big Data to Knowledge Initiative (BD2K)
  • Neuroscience Information Framework (NIF), bioCADDIE/Data Discovery Index
  • National Addiction and HIV Data Archive Program (NAHDAP)
  • 1000 Functional Connectomes Project/International Neuroimaging Data-sharing Initiative (INDI)
  • OpenFMRI.org
  • Connectome Coordinating Facility
  • Preprocessed Connectomes Project (PCP)
  • The Collaborative Informatics and Neuroimaging Suite (COINS)
  • National Database for Autism Research (NDAR)
  • Neuroimaging Informatics Tools and Resources Clearinghouse (NITRC)
  • Observational Health Data Sciences and Informatics (OHDSI)
  • White House Report to the President on Big Data and Privacy, 2014
  • Culture: Research Data Alliance, Force11
  • Nature Scientific Data
  • GigaScience

Section 2: Data Capture and File Formats

Some of the available resources are:

  • CDEs, Standards: NCBO, BioSharing repository of standards and formats, MIAME, Minimum Information about an Electrophysiology Experiment, INCF Neuroimaging Data Sharing Task Force, INCF Task Force on Requirements for Storing Electrophysiology Data, Neuroscience Information Framework Ontologies - Standardized (NIFSTD), NeuroLex, caDSR, NIAID ImmPort, NIH Toolbox, Clinical Data Interchange Standards Consortium (CDISC), Submission Data Standards Team, Clinical Data Acquisition Standards Harmonization (CDASH), Minimum Information for Biological and Biomedical Investigations (MIBBI), the Ontology for Biomedical Investigations (OBI), NEMO, BFO, the OBO Foundry (http://www.obofoundry.org/), Bioontology.org, MINI – Minimum Information for a Neuroscience Investigation, NINDS epilepsy, spinal cord injury, and TBI CDEs, NeuroNames, terminology services (CTS2, LexEVS, OntoQuest), ISO/IEC 11179 metadata standard.
  • NIH Common Data Element (CDE) Resource Portal (https://cde.nlm.nih.gov/home)
  • NIDA CTN Common Data Elements (https://cde.nida.nih.gov/ )
  • PhenX Toolkit (https://www.phenxtoolkit.org/index.php)
  • W3C HCLS Dataset Description (http://www.w3.org/2001/sw/hcls/notes/hcls-dataset/)
  • W3C Datacube vocabulary (http://www.w3.org/TR/vocab-data-cube/)
  • Data Standardization Efforts
  • NeuroData without Borders (https://www.nwb.org/)
  • NIfTI

Section 3: Data Curation, Storage, Analytics, and Visualization

Glossary

  • Common Data Elements (CDEs) are standardized terms for collecting data across clinical research projects and resources. A CDE consists of a precisely defined question and an enumerated set of possible values for responses (answers), and it is intended for use in multiple clinical studies or resources, such as data repositories and patient registries (a minimal sketch of this structure follows this entry). CDEs consisting of individual question/answer pairs can be combined into more complex questionnaires, survey instruments, or case report forms. CDEs offer substantive benefits to the biomedical research enterprise in terms of interoperability and data integration, repurposing, and sharing. More widespread use of CDEs can accelerate the start-up of new research projects by providing a set of established data elements from which investigators can select. CDEs can improve the quality of data collection by fostering the use of data collection instruments that have been validated or vetted by expert groups. They can also improve Big Data science by facilitating the comparison of results across research studies and by enabling the aggregation and analysis of data from multiple studies to provide new insight and/or greater statistical power.

    CDEs are only as useful as the intended user community perceives them to be, and only to the extent that they are flexible enough to accommodate the diversity of research and its rate of change. Therefore, securing the support of the user community and defining its role in the development process is paramount to the successful development and application of CDEs. Furthermore, effective development and identification of CDEs requires expertise in many different domains. Expertise in the relevant research domains is necessary to identify measures of primary interest and to assess their validity and viability in both research and practical settings. Such expertise can come from researchers, clinicians, and other health professionals, all of whom bring unique perspectives. CDEs can also be determined through comparison of existing databases and data sets to identify fields that are invariably common to multiple platforms. Expertise in bioinformatics is necessary to develop or select data elements that are consistent with existing data standards, including those used in clinical care settings and electronic health records (EHRs), to define data elements in specific measurable terms, and to express data elements in ways that are both syntactically and semantically interoperable. Representatives of the patient community (e.g., patient advocates) can also bring valuable expertise and perspective, e.g., in identifying the measures of greatest interest to patients and in considering practical issues of data collection and administration.

    When identifying data elements for inclusion in CDEs, it is preferable to select data elements that have been tested to establish their validity, reliability, sensitivity, and specificity to the condition of interest. Efforts should be made to validate data elements across the populations of interest, taking into consideration characteristics such as genetic information, race/ethnicity, socioeconomic status, or geographic areas that may be involved in a study.
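
To make the question/answer-pair structure concrete, the following minimal sketch represents a CDE in code and validates responses against its enumerated values. The element shown is hypothetical, not drawn from any actual CDE repository.

```python
# Minimal sketch of a Common Data Element: a precisely defined question
# plus an enumerated set of permissible responses. The element shown is
# hypothetical, not taken from any actual CDE repository.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class CommonDataElement:
    element_id: str
    question: str
    permissible_values: Tuple[str, ...]

    def is_valid(self, response: str) -> bool:
        """A response is valid only if it is one of the enumerated values."""
        return response in self.permissible_values

# Hypothetical element; real CDEs live in repositories such as the
# NIH CDE Resource Portal or the NIDA CTN CDEs.
tobacco_use = CommonDataElement(
    element_id="EX-0001",
    question="In the past 30 days, have you used any tobacco product?",
    permissible_values=("Yes", "No", "Prefer not to answer"),
)

print(tobacco_use.is_valid("Yes"))      # True
print(tobacco_use.is_valid("Maybe"))    # False: not an enumerated value
```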

Staff

NIDA co-chairs: Roger Little, PhD and Massoud Vahabzadeh, PhD
External Scientific Subject Matter Experts: Christopher Chute, MD, DrPH; Maryann Martone, PhD; Michael Milham, MD, PhD; Michael Neale, PhD; Eric Nestler, MD, PhD; Arthur Toga, PhD
NIDA staff: Ericka Boone, PhD; Philip Bourne, PhD; Maureen Boyle, PhD; Udi Ghitza, PhD; Steve Gust, PhD; Vani Pariyadath, PhD; Tom Radman, PhD; Joni Rutter, PhD; Tisha Wiley, PhD