context-dependent platform Commons: 4/5

Data Commons

Also known as: Data Ecosystem, Data Mesh, Research Data Commons

1. Overview

A Data Commons is a shared, community-governed infrastructure that co-locates data with cloud computing infrastructure and commonly used software services, tools, and applications for managing, analyzing, and sharing data to create an interoperable resource for a research community. It functions as a collaborative ecosystem where data, tools, and services are brought together to accelerate research, innovation, and the creation of public value. The core idea is to move beyond fragmented, siloed data repositories towards a more integrated and accessible data environment. By providing a unified platform, a Data Commons enables a community of users to collectively benefit from shared data resources, fostering a culture of open science and data-driven collaboration. This approach is particularly crucial in an era of big data, where the scale and complexity of datasets often exceed the capacity of individual researchers or organizations to manage and analyze them effectively. The ability to bring together diverse datasets from multiple sources into a single, harmonized environment is a key feature of a data commons, enabling researchers to ask new questions and to gain new insights that would not be possible with individual datasets alone.

The significance of the Data Commons pattern lies in its potential to democratize access to data and computational resources, thereby leveling the playing field for research and innovation. In many fields, access to large, high-quality datasets is a key determinant of success. However, such data is often locked away in proprietary databases or is difficult to access and use due to technical or legal barriers. A Data Commons addresses this challenge by creating a shared space where data can be made available to a wider community of users, under a clear governance framework that ensures data quality, security, and privacy. This not only accelerates the pace of scientific discovery but also enables new forms of collaboration and knowledge production. Furthermore, by providing access to powerful analytical tools and computational resources, a Data Commons empowers researchers to ask new questions and explore complex datasets in ways that were previously not possible. The network effects created by a data commons are also a significant benefit, as the value of the commons increases with the number of users and the amount of data it contains.

The historical origins of the Data Commons pattern can be traced back to the open-source software movement and the broader open science movement, which have long advocated for the principles of collaboration, transparency, and knowledge sharing. The concept of a “commons” itself has a long history, referring to a shared resource that is managed and sustained by a community of users. In the context of data, the idea of a commons began to gain traction in the early 2000s, as the proliferation of digital data and the rise of the internet created new opportunities for data sharing and collaboration. Early examples of data commons can be found in fields such as genomics and astronomy, where the need to share and analyze large datasets has been a driving force for innovation. The development of cloud computing has been a key enabler of the Data Commons pattern, providing the scalable infrastructure needed to store, manage, and analyze massive datasets. Today, data commons are being established in a wide range of domains, from healthcare and environmental science to urban planning and social science, reflecting the growing recognition of the value of shared data resources. The work of Elinor Ostrom on governing the commons has also been influential in shaping the development of data commons, providing a set of design principles for creating sustainable and effective commons-based institutions.

2. Core Principles

Shared Governance. A Data Commons is governed by its community of users, who collectively define the rules and policies for data access, use, and sharing. This ensures that the commons is managed in a way that reflects the interests and values of the community it serves. Governance models can range from informal, community-based approaches to more formal, institutional structures. A key consideration is the balance between centralized control and decentralized autonomy. Some data commons are managed by a single organization, while others are governed by a consortium of stakeholders. The choice of governance model will depend on the specific context and goals of the commons.
Open Access. Data within a commons should be as open as possible, while respecting privacy, security, and ethical considerations. This principle is often operationalized through the use of open data licenses and clear data access policies.
Interoperability. A Data Commons should be designed to be interoperable with other data resources and platforms. This is achieved through the use of common data standards, formats, and APIs, which enable data to be easily shared and combined across different systems.
Sustainability. A Data Commons requires a long-term sustainability plan to ensure its continued operation and development. This includes securing funding, building a strong user community, and establishing a clear governance structure.
Community Engagement. A successful Data Commons is built on a strong and engaged community of users. This requires ongoing efforts to foster collaboration, provide training and support, and solicit feedback from the community.
Data Quality and Curation. High-quality, well-curated data is essential for a useful Data Commons. This requires processes for data validation, cleaning, and annotation, as well as clear documentation of data sources and provenance.
Security and Privacy. A Data Commons must have robust security measures in place to protect sensitive data from unauthorized access or misuse. This includes implementing access controls, encryption, and other security best practices.

3. Key Practices

Establish a Clear Governance Framework. This includes defining the roles and responsibilities of different stakeholders, establishing a decision-making process, and creating policies for data contribution, access, and use.
Develop a Core Set of Services. A Data Commons should provide a range of services to support its users, including data storage, data discovery, data analysis, and data visualization tools.
Implement a Modular and Open Architecture. A modular architecture allows for flexibility and scalability, while an open architecture promotes interoperability and collaboration. This can be achieved through the use of microservices, APIs, and open-source software.
Foster a Collaborative Community. This can be done through a variety of activities, such as workshops, hackathons, and online forums. It is also important to provide training and support to help users get the most out of the commons.
Ensure Data Quality and Provenance. This includes implementing data validation and cleaning pipelines, as well as tracking the provenance of all data in the commons. This helps to ensure that the data is accurate, reliable, and trustworthy.
Prioritize Security and Compliance. This involves implementing robust security measures, such as access controls and encryption, as well as ensuring compliance with all relevant legal and ethical regulations.
Develop a Sustainability Plan. This should include a strategy for securing long-term funding, as well as a plan for ongoing maintenance and development of the commons.

4. Application Context

Best Used For:

Large-scale research collaborations that require the sharing and analysis of large, complex datasets.
Building a data ecosystem for a specific scientific domain or research community.
Democratizing access to data and computational resources for a wider community of users.
Facilitating the development of new data-driven applications and services.

Not Suitable For:

Small-scale projects with limited data sharing needs.
Situations where data is highly sensitive and cannot be shared, even in a secure environment.
Projects with no long-term sustainability plan.

Scale:

A Data Commons can operate at various scales, from a small, domain-specific commons to a large, international data ecosystem. The scalability of a Data Commons is largely dependent on its underlying infrastructure and governance model. Cloud-based infrastructure provides the elastic scalability needed to support large and growing datasets, while a modular architecture allows for the addition of new services and capabilities over time. The governance model must also be scalable, with clear processes for decision-making and community engagement that can accommodate a growing number of users and stakeholders.

Domains:

Genomics and Healthcare: The NCI Genomic Data Commons (GDC) and the All of Us Research Program are leading examples of data commons in the biomedical domain.
Environmental Science: The National Ecological Observatory Network (NEON) provides open access to ecological data from across the United States.
Astronomy: The Sloan Digital Sky Survey (SDSS) has made massive astronomical datasets available to the public, leading to numerous scientific discoveries.
Social Science: The Inter-university Consortium for Political and Social Research (ICPSR) is a long-standing data archive that provides access to a vast collection of social science data.

5. Implementation

Implementing a Data Commons is a complex undertaking that requires careful planning and execution. A key architectural decision is whether to adopt a standards-driven approach or a ‘narrow middle’ architecture. A standards-driven architecture, where a comprehensive set of standards is defined upfront, is suitable for mature fields. However, for rapidly evolving fields like data commons, a ‘narrow middle’ architecture is often more appropriate. This approach, based on the end-to-end design principle of the internet, standardizes only a minimal set of core services, allowing for greater innovation and flexibility at the ‘ends’ of the system – the data ingestion and analysis layers. The core services in a narrow middle architecture typically include: 1) a service for permanent digital IDs, 2) a metadata service, 3) authentication and authorization services, and 4) data model services. This minimalist approach allows for a diversity of tools and applications to be built on top of the commons, fostering a vibrant ecosystem of innovation.

The technical implementation of a Data Commons involves building a software platform that provides a range of services for managing, analyzing, and sharing data. This typically includes a data storage layer, a data processing layer, and a data access layer. The data storage layer is responsible for storing and managing the data in the commons, while the data processing layer provides tools for cleaning, transforming, and analyzing the data. The data access layer provides a variety of ways for users to access the data, including a web portal, APIs, and other data access tools. The platform should be built on a modular and open architecture to allow for flexibility and scalability. The use of open-source software and common data standards is also recommended to promote interoperability and collaboration.

Building a successful Data Commons is not just a technical challenge; it also requires a significant investment in community building and user engagement. This includes providing training and support to help users get the most out of the commons, as well as fostering a collaborative community through workshops, hackathons, and other events. It is also important to solicit regular feedback from the community to ensure that the commons is meeting their needs and to identify areas for improvement. Finally, a long-term sustainability plan is essential to ensure the continued operation and development of the commons. This should include a strategy for securing funding, as well as a plan for ongoing maintenance and development of the platform.

6. Evidence & Impact

The impact of data commons can be seen in the accelerated pace of scientific discovery in a wide range of fields. For example, the NCI Genomic Data Commons (GDC) has made a massive amount of cancer genomics data available to researchers around the world, leading to new insights into the genetic basis of cancer and the development of new cancer treatments. The GDC provides a unified repository for cancer genomics data, along with a suite of tools for data analysis and visualization. This has enabled researchers to conduct large-scale, cross-tumor analyses that were previously not possible. Similarly, the Sloan Digital Sky Survey (SDSS) has revolutionized the field of astronomy by providing open access to a vast collection of astronomical data, enabling countless new discoveries about the structure and evolution of the universe. The SDSS has been instrumental in creating a more collaborative and data-driven culture in astronomy.

Beyond these well-known examples, there are many other data commons that are having a significant impact in their respective fields. The Civil Justice Data Commons, for example, is a joint project between Georgetown University Law Center and the Massive Data Institute at the McCourt School of Public Policy that aims to create a secure, robust repository for civil legal data. This will enable researchers and the public to better understand the civil legal system in the United States. Another example is the Open Commons Consortium, which operates a number of data commons for scientific research, including the BloodPAC Data Commons for cancer research and the Matsu project for analyzing satellite imagery. These and other data commons are demonstrating the power of this model to accelerate research, foster innovation, and create public value.

Beyond their impact on scientific research, data commons are also having a broader societal impact. For example, the OpenStreetMap project is a global data commons for geospatial data that is used by a wide range of organizations, from humanitarian aid agencies to local governments. The project has created a highly detailed and accurate map of the world that is freely available to anyone to use and improve. This has had a transformative impact on a wide range of applications, from disaster response and urban planning to navigation and local search. The success of projects like OpenStreetMap highlights the potential of data commons to create public value and to empower citizens to participate in the creation and management of shared data resources.

7. Cognitive Era Considerations

The advent of the cognitive era, characterized by the widespread adoption of artificial intelligence (AI) and machine learning (ML), is profoundly reshaping the landscape of data commons. These technologies are not only consumers of the vast datasets curated within data commons but are also becoming integral to their operation and evolution. AI/ML models, particularly deep learning models, are data-hungry, and data commons provide the large-scale, high-quality, and diverse datasets required for their training and validation. The Therapeutics Data Commons, for instance, is a prime example of a data commons specifically designed to support the development of AI/ML models for drug discovery and development. It provides a collection of curated datasets and a suite of tools for benchmarking and evaluating new AI/ML models, thereby accelerating the pace of innovation in this critical area.

Furthermore, AI and ML are being integrated into the core infrastructure of data commons to enhance their functionality and usability. For example, AI-powered search and discovery tools can help users to more easily find relevant datasets within a large and complex data commons. ML algorithms can also be used to automate the process of data cleaning, curation, and annotation, thereby improving the quality and consistency of the data. Google’s Data Commons, for example, leverages AI to make public data more accessible and useful. Its DataGemma models are designed to ground large language models (LLMs) in real-world data from the Data Commons knowledge graph, thereby improving the accuracy and reliability of their outputs. This integration of AI into the fabric of data commons is creating a virtuous cycle, where better data leads to better AI models, which in turn leads to better data commons.

However, the increasing use of AI and ML in data commons also raises significant ethical and governance challenges. One of the most pressing concerns is the potential for bias in AI/ML algorithms. If the data used to train these algorithms is biased, the resulting models will also be biased, which can lead to unfair or discriminatory outcomes. This is a particularly acute problem in areas such as healthcare and criminal justice, where biased algorithms can have serious real-world consequences. Addressing this challenge requires a multi-faceted approach, including the development of methods for detecting and mitigating bias in AI/ML models, as well as the adoption of governance frameworks that promote fairness, accountability, and transparency. The ‘tragedy of the AI data commons’ is a term that has been coined to describe the legal and economic challenges of creating and sustaining the large-scale datasets needed for AI innovation, while also protecting privacy and intellectual property rights. The rapid decline of the AI data commons due to the increasing legal and ethical constraints on data collection and use is a major concern for the future of AI.

Another key challenge is the need to ensure the privacy and security of sensitive data in an era of increasingly powerful AI/ML models. As these models become more sophisticated, they are also becoming more adept at re-identifying individuals from anonymized data. This requires the development of new privacy-preserving technologies and governance mechanisms that can protect sensitive data from unauthorized access and use. Federated learning, for example, is a promising approach that allows AI/ML models to be trained on decentralized data without the need to centralize the data in a single location. This can help to address privacy concerns by keeping sensitive data under the control of the data owner. As the cognitive era continues to unfold, it will be essential for data commons to embrace these new technologies and to develop robust governance frameworks that can ensure the responsible and ethical use of AI and ML.

8. Commons Alignment Assessment

Shared Resource Potential: High - A Data Commons is, by its very nature, a shared resource. It is designed to be used and sustained by a community of users, who collectively benefit from the shared data and infrastructure.
Democratic Governance: High - A well-designed Data Commons has a democratic governance structure that empowers its community of users to participate in decision-making and to shape the future of the commons.
Equitable Access: High - A key goal of a Data Commons is to provide equitable access to data and computational resources, thereby leveling the playing field for research and innovation.
Sustainability: Medium - The long-term sustainability of a Data Commons can be a challenge, as it requires ongoing funding and a strong user community. However, with a clear sustainability plan and a committed community, it is possible to create a sustainable Data Commons.
Community Benefit: High - A Data Commons is designed to benefit a community of users, whether that community is a scientific discipline, a geographic region, or the public at large. By providing access to shared data resources, a Data Commons can accelerate research, foster innovation, and create public value.