FAIR in principle—but failing in scientific practice
FAIR² goes beyond FAIR—turning principles into practice so data can be cited, reused, trusted, and sustained for science and AI.
The FAIR principles—Findable, Accessible, Interoperable, Reusable—were introduced to make scientific data easier to discover and use. They’ve been embraced by funders, journals, and repositories around the world. Today, many datasets carry a FAIR label. Many researchers are asked to comply.
But too often, FAIR stops short.
A dataset might be deposited, described, and technically discoverable. But when you open the file, you find ambiguous variable names, undocumented preprocessing steps, unclear assumptions, and no link to the methods that produced the data. The metadata is present—but it doesn’t answer the questions that matter.
The result is data that’s FAIR in name, but not in practice.
Professor Barend Mons, senior author of the original FAIR principles paper and a leading advocate for their global implementation, has warned of this problem. He calls it “pseudo-FAIR”—datasets that check superficial boxes but lack the structure, clarity, and machine-actionable metadata needed to support real reuse. Without that depth, FAIR becomes a formality—one that fails researchers, data stewards, and AI systems alike.
To make data genuinely usable, we need to go beyond FAIR.
In my own work building and leading data platforms—ranging from biologically detailed brain models to national and international infrastructures for neuroscience, brain injury, and mental health—I’ve seen firsthand how hard it is to bridge the gap between data availability and data usability. Whether integrating complex clinical datasets for precision mental health care or aligning multi-modal research data for international AI initiatives, the pattern is consistent: datasets are shared, but without the structure or context that reuse demands. FAIR as a set of principles is widely embraced. But FAIR in practice—the kind that makes data truly reusable and trustworthy—remains rare.
That’s why we built FAIR² (learn more about FAIR² Data Management and apply to join the pilot): to turn the principles into tools, structure, and systems that actually support the way science is done.
The Gap Between Principles and Practice
FAIR sets a powerful direction: to make data reusable. But implementation often falters—not because researchers don’t care, but because the tools, incentives, and expectations don’t support meaningful sharing.
Consider a dataset of emotional response ratings from a psychology study. The accompanying metadata says: “Valence and arousal ratings for 60 images, subject-level scores.” But critical context is missing: What scale was used? Were outliers removed? Were trials averaged? Was the task self-paced or timed? Were stimuli randomized?
I’ve seen this pattern too often. The data exists. It’s even shared. It’s been called FAIR. But it can’t be interpreted, trusted, or reused without guesswork—and guesswork breaks reproducibility.
This is where most data sharing efforts break down: not at the level of access, but at the level of interpretation, integration, and reuse.
Beyond Findable
Publishing a dataset and assigning a DOI is a good start—but real findability requires more than presence.
Too often, datasets are uploaded with basic descriptions, internal filenames, or project-specific keywords. Even if terms like “reaction time” or “Stroop task” are included, if they’re not structured in machine-readable form, aligned with shared vocabularies, or indexed in searchable registries, they won’t show up in relevant queries.
Example: A researcher uploads a dataset called Stroop_final_data_v3.xlsx, with the title “Stroop Task Behavioral Results.” The metadata mentions “cognitive control” and “reaction time.” But the metadata isn’t structured using schema.org or a domain-specific ontology, isn’t registered in a discoverable index, and isn’t linked to the protocol used. A search for “executive function datasets” or “response time in cognitive tasks” doesn’t surface the dataset—even though it’s technically online and FAIR-tagged.
Real findability requires:
- Structured metadata using schema.org or equivalent
- Persistent identifiers for datasets, variables, and workflows
- Registration in cross-repository and semantic search indexes
- Use of community-recognized vocabularies and ontologies
Being online isn’t enough. A dataset isn’t findable until it’s discoverable by the right people, tools, and systems—in context.
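To make that concrete, here is a minimal sketch of structured metadata for the Stroop example, expressed as schema.org/Dataset JSON-LD written from Python. The DOI, descriptions, and variable entries are hypothetical placeholders, not a prescribed FAIR² layout.

```python
import json

# A minimal schema.org/Dataset sketch for the Stroop example above.
# All identifiers, URLs, and descriptions are hypothetical placeholders.
metadata = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Stroop Task Behavioral Results",
    "description": "Trial-level reaction times and accuracy for a "
                   "color-word Stroop task; 48 adult participants.",
    "identifier": "https://doi.org/10.xxxx/example",  # hypothetical DOI
    "keywords": ["cognitive control", "executive function", "reaction time"],
    "measurementTechnique": "Stroop color-word interference task",
    "variableMeasured": [{
        "@type": "PropertyValue",
        "name": "reaction_time",
        "description": "Time from stimulus onset to keypress",
        "unitText": "milliseconds",
    }],
    "license": "https://creativecommons.org/licenses/by/4.0/",
}

# Embedding this JSON-LD in the dataset's landing page is what lets
# dataset search engines index it by concept, not just by filename.
with open("dataset_metadata.jsonld", "w") as f:
    json.dump(metadata, f, indent=2)
```

With metadata like this in place, a search for “response time in cognitive tasks” has something to match against beyond Stroop_final_data_v3.xlsx.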
Beyond Accessible
Too often, accessibility is interpreted narrowly: is the file downloadable? But real accessibility is about more than access to bytes—it’s about access to meaning.
A user might download a dataset, unzip the folder, and find a handful of spreadsheets with names like data1.csv, subset_clean.csv, and final_export_v2.csv. There’s no data dictionary, no guide, no context. Columns are labeled X1, X2, Z_final, and there’s no indication of how these files relate to each other or to any published results.
Technically, the dataset is “open” and accessible. But to a new user—or an AI assistant—it’s effectively unusable.
Real accessibility requires:
- Clear file naming, organization, and data dictionaries
- Interactive portals that allow users to filter and preview data
- Jupyter notebooks or analysis guides for orientation
- AI-enabled documentation that answers, “What does this measure?”
- Plain-language overviews and optional multimedia walkthroughs
Accessibility must be about orientation, not just access.
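One inexpensive step in that direction is a machine-readable data dictionary shipped next to the files. Here is a minimal sketch in Python, reusing the valence/arousal example from earlier; the layout is an illustration, not a prescribed format.

```python
import csv

# A sketch of a machine-readable data dictionary shipped alongside the
# CSVs. Column names, ranges, and descriptions are illustrative only.
data_dictionary = [
    {"column": "subject_id", "description": "Anonymized participant ID",
     "type": "string", "units": "", "values": "S001–S048"},
    {"column": "valence", "description": "Self-reported pleasantness of the image",
     "type": "integer", "units": "Likert points", "values": "1–9 (1 = unpleasant)"},
    {"column": "arousal", "description": "Self-reported arousal for the image",
     "type": "integer", "units": "Likert points", "values": "1–9 (1 = calm)"},
]

with open("data_dictionary.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=data_dictionary[0].keys())
    writer.writeheader()
    writer.writerows(data_dictionary)
```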
Beyond Interoperable
Interoperability is often reduced to format: “It’s in CSV, so it’s interoperable.” But true interoperability is about semantics, not syntax.
To be interoperable, data must carry consistent meaning across tools, platforms, and disciplines. It must align with shared concepts, use controlled vocabularies, and support machine parsing without guesswork.
Example: Two research teams both publish data on blood pressure. One uses a column labeled BP_sys, the other just BP. One is measured in mmHg at rest using a standardized cuff, the other’s method isn’t specified. Both datasets are in CSV, but they’re not meaningfully interoperable. You’d need to dig into the paper—if it exists—or email the authors to reconcile them.
Real interoperability requires:
- Use of domain-specific ontologies
- Metadata formats that carry meaning (e.g., JSON-LD, Croissant)
- Explicit documentation of units, measurement protocols, and context
- Support for integration across platforms and pipelines
If data can’t be used without custom glue code—or if machines can’t interpret what variables mean—it isn’t interoperable.
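Here is a sketch of the field-level annotation that would let the two blood-pressure columns be reconciled without emailing anyone. The LOINC and UCUM codes illustrate the approach; verify any real codes against the source terminologies before publishing.

```python
# Field-level semantic annotations for the blood-pressure example.
# Concept and unit codes (LOINC, UCUM) are shown for illustration.
annotations = {
    "team_a/BP_sys": {
        "concept": "http://loinc.org/8480-6",  # systolic blood pressure
        "unit": "mm[Hg]",                      # UCUM code for mmHg
        "protocol": "seated, at rest, standardized cuff",
    },
    "team_b/BP": {
        "concept": None,  # unspecified: systolic? diastolic? mean arterial?
        "unit": None,
        "protocol": None,
    },
}

def integrable(a: dict, b: dict) -> bool:
    """Columns can be merged without guesswork only when both declare
    the same concept and the same (or a convertible) unit."""
    return (a["concept"] is not None
            and a["concept"] == b["concept"]
            and a["unit"] == b["unit"])

print(integrable(annotations["team_a/BP_sys"], annotations["team_b/BP"]))  # False
```

The point isn’t the specific codes; it’s that the comparison becomes a mechanical check instead of an email thread.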
Beyond Reusable
Reusability is the ultimate test of FAIR—and the one most commonly failed.
A reusable dataset must be understandable to someone outside the original team. It must explain where the data came from, how it was processed, what assumptions were made, and how it connects to the analytical pipeline.
Example: A dataset of structural MRI data includes columns labeled gmv_lh, wm_mask, and motion_qc. There’s no documentation of how these were generated. Which version of the software was used? Was motion correction applied before segmentation? What pipeline was followed? The paper references “standard procedures” but doesn’t include code or preprocessing parameters.
Even when shared with good intentions, datasets without context are effectively frozen in time—useful only to those who already know the details.
Real reusability requires:
- Rich metadata at the field level
- Explicit links between data and analysis workflows
- Code notebooks and pipelines used in preprocessing
- Descriptions of inclusion/exclusion criteria and cohort construction
- Provenance metadata using standards like PROV-O
- Version tracking and changelogs
Without context, reuse becomes reconstruction. With it, data becomes infrastructure for cumulative discovery.
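As a small illustration of the provenance piece, here is a sketch using the rdflib library to express PROV-O relations between a raw dataset, a preprocessing run, and the cleaned result. All IRIs are hypothetical placeholders.

```python
from rdflib import Graph, Namespace, RDF

# A minimal PROV-O sketch: a cleaned dataset, the preprocessing activity
# that generated it, and the researcher responsible. All IRIs below are
# hypothetical placeholders.
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("https://example.org/study/")

g = Graph()
g.bind("prov", PROV)
g.bind("ex", EX)

g.add((EX.raw_ratings, RDF.type, PROV.Entity))
g.add((EX.cleaned_ratings, RDF.type, PROV.Entity))
g.add((EX.preprocessing_run, RDF.type, PROV.Activity))
g.add((EX.jane_doe, RDF.type, PROV.Agent))

# The cleaned data was generated by a specific, inspectable activity...
g.add((EX.cleaned_ratings, PROV.wasGeneratedBy, EX.preprocessing_run))
# ...that consumed the raw data and a versioned cleaning script...
g.add((EX.preprocessing_run, PROV.used, EX.raw_ratings))
g.add((EX.preprocessing_run, PROV.used, EX.cleaning_script_v1_2))
# ...and is attributable to a named agent.
g.add((EX.preprocessing_run, PROV.wasAssociatedWith, EX.jane_doe))
g.add((EX.cleaned_ratings, PROV.wasDerivedFrom, EX.raw_ratings))

print(g.serialize(format="turtle"))
```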
Infrastructure Should Help, Not Hinder
Researchers shouldn’t need to be metadata experts to share data well.
Good infrastructure should:
- Prompt researchers for definitions, assumptions, and units
- Connect data to instruments, protocols, and analytical workflows
- Accept plain-language input and generate structured metadata
- Flag missing context and offer suggestions
- Make datasets intelligible to both humans and machines
And it should work not just for the dataset creators, but for everyone who comes after: collaborators, educators, students, reviewers—and AI systems.
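To make the “flag missing context” item concrete, here is a minimal sketch of the kind of lint such infrastructure could run over a data dictionary. The required fields are an assumption for illustration, not a FAIR² checklist.

```python
# A minimal "flag missing context" lint over a data dictionary.
# The required keys are an illustrative assumption, not a FAIR² rule set.
REQUIRED = ("description", "type", "units")

def flag_missing_context(data_dictionary: dict) -> list:
    """Return a prompt for every column lacking a required piece of context."""
    prompts = []
    for column, entry in data_dictionary.items():
        for key in REQUIRED:
            if not entry.get(key):
                prompts.append(f"Column '{column}': please provide its {key}.")
    return prompts

columns = {
    "valence": {"description": "Self-reported pleasantness", "type": "integer",
                "units": "Likert points (1–9)"},
    "X2": {},  # undocumented, as in the example above
}
for prompt in flag_missing_context(columns):
    print(prompt)
```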
That’s what FAIR² was built to do.
From Principles to Practice: What FAIR² Makes Possible
FAIR² was built to go beyond compliance.
It’s not just a checklist—it’s a system for making data genuinely usable, citable, and reproducible in real scientific workflows.
In practice, this means two things:
- FAIR² provides a structured, reusable framework that helps researchers publish data that is not only technically FAIR, but also practically useful—by others, by themselves in the future, and by machines.
- FAIR² is a validatable, open specification, defining a machine-readable data model, metadata schema, and structured workflow that turns datasets into AI-ready, well-documented research assets. It comes with tools and infrastructure that ensure consistency, clarity, and trustworthiness.
Findable – with Meaning and Context
A dataset might be online and assigned a DOI—but that doesn’t make it findable in any meaningful sense. Too often, datasets are uploaded with minimal titles, vague summaries, and metadata that isn’t structured or indexed. They become buried, invisible to search engines, and disconnected from the scientific ecosystem that could benefit from them.
FAIR² addresses this by using structured metadata that connects datasets to shared vocabularies, methods, and domains. It embeds the dataset into the scientific knowledge graph—using schema.org, linking to ontologies, and registering the data in searchable indexes. It doesn’t just make data available; it makes it discoverable with intent.
- Uses schema.org/Dataset-compliant metadata
- Connects variables and concepts to domain ontologies
- Registers datasets in federated and semantic search systems
- Links data to instruments, protocols, and publications
With FAIR², datasets are no longer isolated or invisible. They show up in relevant searches—by concept, method, or domain—making them usable by researchers across fields and retrievable by intelligent systems. That visibility accelerates discovery and maximizes the value of data already collected.
Accessible – Not Just Downloadable, But Usable
Technical accessibility—like being able to download a file—is often mistaken for actual usability. But when users unzip a dataset and find ambiguous filenames, unexplained variables, and undocumented preprocessing, accessibility fails. The data may be “open,” but it’s effectively unusable to anyone outside the original team.
FAIR² makes datasets truly accessible by structuring them for orientation, not just access. It generates interactive portals that let users explore and filter the data, provides plain-language documentation and multimedia walkthroughs, and includes embedded guides for both human users and digital agents. It ensures that data doesn’t just open—it makes sense.
- Interactive portals for data exploration and filtering
- Reproducible Jupyter notebooks with explanatory code
- Plain-language documentation and optional audio walkthroughs
- Embedded, machine-readable metadata for agents and voice assistants
With FAIR², users don’t have to guess what the data means or how it was created. Researchers, students, and AI tools can all access not just the files—but the story behind them. This reduces onboarding time, prevents errors, and turns accessibility into actual opportunity for reuse.
Interoperable – Aligned with Shared Meaning
Putting data in a common format isn’t enough. A dataset in CSV format with undefined variables and unclear units isn’t meaningfully interoperable. Without shared semantics, integrating data across platforms, disciplines, or studies requires manual wrangling, risky assumptions, or simply isn’t possible.
FAIR² ensures semantic interoperability—so that data carries shared meaning, not just shared syntax. It uses the Croissant format (a JSON-LD specification), aligns fields with domain ontologies, explicitly defines units and measurement methods, and documents provenance in a machine-readable way. It lets tools and researchers interpret the data the same way.
- Uses Croissant (JSON-LD) for structured, semantic metadata
- Aligns with domain-specific ontologies and controlled vocabularies
- Captures measurement protocols, units, scales, and assumptions
- Supports automated integration with APIs, pipelines, and models
With FAIR², datasets can be integrated cleanly into cross-disciplinary workflows, compared with other studies, and interpreted reliably by both humans and machines. This eliminates a major barrier to collaborative science, large-scale analysis, and responsible automation.
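For a flavor of what this looks like on disk, here is a schematic record-set fragment in the spirit of Croissant’s JSON-LD model, written as a Python dict. Property names follow the general shape of the Croissant specification, but treat this as a sketch and consult the spec for the authoritative terms; the unit annotation in particular is an illustrative addition.

```python
import json

# A schematic, Croissant-flavored record set for the blood-pressure
# example. Treat property names as approximate; the "unit" annotation is
# an illustrative addition rather than a core Croissant property.
record_set = {
    "@type": "cr:RecordSet",
    "name": "bp_readings",
    "field": [{
        "@type": "cr:Field",
        "name": "systolic_bp",
        "description": "Systolic blood pressure, seated, at rest",
        "dataType": "sc:Float",
        "unit": "mm[Hg]",  # UCUM code, declared instead of implied
        "source": {
            "fileObject": {"@id": "bp.csv"},
            "extract": {"column": "BP_sys"},
        },
    }],
}

print(json.dumps(record_set, indent=2))
```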
Reusable – Ready for Trust, Transparency, and Recognition
Reusability is the ultimate test of FAIR—but often the first thing to fail. Datasets may be shared with good intentions, but without transparent processing steps, code, or clear assumptions, they remain locked in their original context—useful only to the original team. Worse, researchers often get no credit for sharing well, creating a disincentive to invest in quality.
FAIR² ensures that datasets are reusable by embedding the full analytical context: preprocessing, assumptions, methods, and analysis scripts. And it recognizes this effort as a scholarly contribution. Every FAIR² Data Package can be accompanied by a peer-reviewed, citable FAIR² Data Article that gives researchers formal credit for data that is trustworthy, documented, and ready for reuse.
- Captures assumptions, variable derivations, and transformation logic
- Links to protocols, analysis workflows, and code repositories
- Tracks dataset versions and changelogs
- Supports FAIRness audits and peer-reviewed FAIR² Data Articles
With FAIR², data becomes more than a file—it becomes a citable, trusted part of the scientific record. This encourages better documentation, supports reproducible discovery, and gives researchers the recognition they deserve for sharing data that others can actually build on.
Reproducible – Context-Built, Workflow-Linked, and Transparent
Reproducibility fails when the steps that generated a dataset are opaque or missing. If key preprocessing decisions, software versions, or cohort definitions aren’t documented, the dataset can’t be reliably interpreted—let alone reproduced. Even “shared” data remains locked in time.
FAIR² embeds reproducibility into the dataset itself by linking it to the protocols, software, and analysis workflows used to generate it. It supports versioned, executable notebooks and tracks provenance using W3C standards like PROV-O. This makes datasets auditable, interpretable, and extensible by others.
- Links data to instruments, code, methods, and processing steps
- Provides versioned Jupyter notebooks for end-to-end reproducibility
- Captures provenance using W3C PROV-O and other standards
- Documents assumptions, inclusion criteria, and transformation logic
With FAIR², datasets become living research objects—transparent, verifiable, and ready to support replication, validation, and extension. That’s essential for cumulative science and for building trust in published results.
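Part of this is mechanical and cheap to automate: recording the computational context of a processing run next to its outputs. Here is a sketch; the field names are illustrative rather than a FAIR² schema, and the git call assumes the pipeline lives in a git repository with the listed packages installed.

```python
import json
import platform
import subprocess
from importlib.metadata import version

# Capture the computational context of a processing run so the resulting
# dataset stays auditable. Field names are illustrative, not a FAIR² schema.
def capture_run_context(packages):
    try:
        commit = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        commit = "unknown"  # not a git checkout, or git unavailable
    return {
        "code_commit": commit,
        "python_version": platform.python_version(),
        "package_versions": {pkg: version(pkg) for pkg in packages},
    }

with open("run_context.json", "w") as f:
    json.dump(capture_run_context(["numpy", "pandas"]), f, indent=2)
```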
And Beyond FAIR – Designed for What’s Next
The FAIR principles were designed to improve data stewardship—but they weren’t built for the age of AI. Today, scientific data must support automated reasoning, transparent decision-making, and responsible reuse by intelligent systems. FAIR alone doesn’t get us there.
FAIR² extends the FAIR framework to meet this challenge. It is AI-ready by design, embedding semantic structure, provenance, and field-level documentation that allows data to be interpreted by machines. It also includes safeguards for Responsible AI—capturing intended use, ethical assumptions, and context of collection. And because it’s a validatable specification, it supports automated quality checks, conformance tests, and extensible modular design.
- Machine-readable metadata structured for intelligent systems
- Provenance, purpose, and assumptions captured for Responsible AI
- Validatable against an open specification with testable rules
- Modular and extensible for domain-specific applications
With FAIR², datasets don’t just meet today’s expectations—they’re ready for tomorrow’s workflows. They support human interpretation, machine reasoning, and responsible automation in complex, multidisciplinary research environments.
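To illustrate what “validatable against an open specification with testable rules” can mean in practice, here is a sketch using the jsonschema library against a toy rule set. The schema below is a stand-in invented for illustration; it is not the FAIR² specification.

```python
import jsonschema

# A toy, machine-testable metadata rule set. This is NOT the FAIR²
# specification; it only illustrates automated conformance checking.
TOY_SCHEMA = {
    "type": "object",
    "required": ["name", "description", "license", "variableMeasured"],
    "properties": {
        "name": {"type": "string", "minLength": 1},
        "description": {"type": "string", "minLength": 50},
        "license": {"type": "string"},
        "variableMeasured": {"type": "array", "minItems": 1},
    },
}

candidate = {
    "name": "Stroop Task Behavioral Results",
    "description": "Too short.",  # will fail the minLength rule
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "variableMeasured": [{"name": "reaction_time"}],
}

validator = jsonschema.Draft202012Validator(TOY_SCHEMA)
for error in validator.iter_errors(candidate):
    print(f"Conformance failure at {list(error.path)}: {error.message}")
```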
From principles to practice, FAIR² turns aspiration into implementation. It transforms datasets from static files into structured, interpretable, reusable resources—designed not just to meet FAIR requirements, but to support the future of science.
Because going beyond FAIR isn’t about checking more boxes. It’s about making research data understandable, reusable, and trustworthy—by humans, and by the systems that will extend their work.
Ready to put the FAIR principles into real practice with FAIR² Data Management?
Learn more about FAIR² Data Management and apply to join the pilot.