26 Mar 2026
The Urgency of Certified Public Data as a bulwark against information distortions in the era of Agentic AI
1. Introduction: The paradox of spurious abundance
We live in an era in which Generative Artificial Intelligence (GenAI) is celebrated as the key to decoding the complexity of human knowledge. Yet we face a dangerous paradox: while large language models (LLMs) become ever more persuasive communicators, the data they rely on to act in the real world, in so-called "agentic AI" configurations, remains fragmented, uncertified, and intrinsically fragile.
The underlying thesis of these reflections is that the current immature state of public open data, combined with the uncontrolled proliferation of Model Context Protocol (MCP) services that act as "bridges" to this uncertified data, is creating the perfect conditions for an epistemological disaster. If a truly certified public data regime (quality-by-design and certified-by-design) is not urgently established, AI will not only convey inconsistent facts, but will also autonomously construct incorrect “derived facts,” entrenching information distortions on a systemic scale.
2. The State of the Art: Public Open Data between lack of maturity and the illusion of reliability
The open data movement was born over a decade ago with the noble ambition of transparency and innovation. However, years later, the public sector has not yet reached a level of maturity sufficient to ensure the reliability required to autonomously feed AI systems.
The Five Plagues of Current Open Data
- Structural Inconsistency: Public datasets are often published in unstructured formats (PDFs, scanned images) or in proprietary schemas that change without versioning, making it impossible for an AI agent to establish logical continuity.
- Lack of Certified Provenance: Current public data rarely includes cryptographic metadata certifying its integrity and provenance. An AI system has no way to distinguish between authentic data and data tampered with during transit, except through heuristic, probabilistic mechanisms.
- Dirty Data: Data is often incomplete, duplicated, or affected by input errors (back-office human errors). While a human operator can contextualize a sporadic error, an AI processing millions of tokens assimilates them as absolute truth.
- Lack of Timeliness: Data publication follows administrative cycles, not informational relevance cycles. For an AI that must make real-time decisions (e.g., healthcare or mobility), official but outdated data is more harmful than unofficial but up-to-date data.
- Uncertain Licenses: Many public datasets are released under licenses that are restrictive or ambiguous with respect to AI training and inference. The resulting legal risk pushes service providers toward "unofficial" copies of the data whose terms of use appear clearer.
3. The Risk Vector: AI as an amplifier of inconsistencies
The problem isn't simply that AI reads dirty data. The problem is that AI has a unique ability to amplify inconsistency through two distinct mechanisms:
Induced Hallucination
When an LLM receives conflicting data from uncertified sources, it is incapable of human-like critical discernment. Its statistical objective leads it to "mediate" between the conflicting versions, generating a plausible but false synthesis. In the absence of certified data to serve as an immutable ground truth, the AI hallucinates not because of a technical defect, but because of the dirtiness of public data.
The Garbage In, Garbage Out (GIGO) Effect 2.0
In classical computing, the GIGO effect was contained: bad input produced a bad output, and the damage stopped there. In agentic AI, dirty data is not only returned to the user but also fed as input to further actions. An incorrect piece of data about the availability of a public service (e.g., "the bridge is open") is processed by an AI agent that reprograms a city's logistics. The error propagates through the information value chain, causing physical and economic damage before it is even detected.
4. The MCP Phenomenon: The proliferation of connectors on uncertified data
The Model Context Protocol (MCP) represents a crucial technological breakthrough: it allows LLMs to connect directly to external data sources (databases, repositories, APIs) without passing through human interfaces. While this solves the problem of training-data obsolescence (the cut-off date), it also introduces a new systemic risk.
We are witnessing the proliferation of entities (startups, independent developers, non-specialized public bodies) offering MCP services to connect "dirty" or incompletely certified public data.
The Official Provenance Deception
Many of these MCP services are based on the assumption that "if it comes from an official database, then it is reliable." This is a dangerous logical fallacy.
- Case example: An MCP connects to a region's official tender database. The database contains data linked to incorrect tax codes or misaligned amounts due to an administrative transcription error.
- Effect: The AI agent, using that MCP, will certify to the user that "the data has been verified in real time against the official database," lending an aura of authority to objectively incorrect data.
The Absence of Certified Connectors
Currently, there is no standard for "Certified Connectors" or "Certified MCP Servers." Any developer can publish an MCP server claiming to interface with the Public Administration (PA), without any mechanism ensuring that:
- The data leaving the PA has not been altered during transit (integrity).
- The data complies with the FAIR (Findable, Accessible, Interoperable, Reusable) principles at the semantic level.
- The data is actually current and not an unauthorized replica.
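The three requirements above can be sketched as the checks a hypothetical certified connector would run before handing data to a model. The envelope format, field names, and freshness window below are illustrative assumptions, not part of the MCP specification:

```python
import hashlib
import json
from datetime import datetime, timedelta, timezone

def verify_envelope(envelope: dict, max_age_hours: int = 24) -> list:
    """Return a list of integrity/freshness problems (empty list = passes)."""
    problems = []

    # 1. Integrity: the payload hash must match the declared digest.
    payload_bytes = json.dumps(envelope["payload"], sort_keys=True).encode()
    digest = hashlib.sha256(payload_bytes).hexdigest()
    if digest != envelope["sha256"]:
        problems.append("payload hash mismatch: data altered in transit")

    # 2. Freshness: reject stale or unauthorized replicas.
    issued = datetime.fromisoformat(envelope["issued_at"])
    if datetime.now(timezone.utc) - issued > timedelta(hours=max_age_hours):
        problems.append("data outside freshness window: possible stale replica")

    # 3. Semantic interoperability: a machine-readable schema must be declared
    #    (a minimal stand-in for FAIR-level compliance checks).
    if not envelope.get("schema_uri"):
        problems.append("no machine-readable schema declared")

    return problems

# A well-formed envelope (all fields hypothetical) passes all three checks.
payload = {"bridge_id": "PT-042", "status": "open"}
envelope = {
    "payload": payload,
    "sha256": hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()).hexdigest(),
    "issued_at": datetime.now(timezone.utc).isoformat(),
    "schema_uri": "https://example.org/schemas/bridge-status/v2",
}
print(verify_envelope(envelope))  # []
```

The point of the sketch is that each failure mode named in the list maps to a mechanical, pre-inference check; none of them requires the model to "judge" the data.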
5. Information Distortion: From incorrect data to derived facts
The greatest risk is represented by derived facts. AI doesn't simply report data; it contextualizes, aggregates, and interprets it. When AI operates on unaudited data, the derived facts it produces have an exponential distorting impact.
Consider the social services or healthcare sector:
- Dirty input: A public database shows that a certain benefit is "theoretically available" (but in reality, the fund is exhausted and the data hasn't been updated).
- AI process: An AI "consultant agent" reads the data, interprets it, and suggests a citizen rely on that benefit for an upcoming expense.
- Dirty output: The citizen suffers financial harm because the AI conveyed derived information (the advisability of taking action) based on unaudited, non-real-time public data.
In this scenario, liability is unclear. The public body will claim that the data was "published but not updated," the MCP provider will claim it only "connected" the data, and the AI provider will claim it only processed the data received. Citizens remain victims of a system in which the data's chain of custody is broken.
6. Proposal for a new paradigm: Public Data as critical infrastructure
To prevent AI from becoming a vehicle for inadvertent misinformation, it is necessary to move beyond the current Open Data paradigm (open but unsecured data) and toward the Certified Public Data (CPD) paradigm.
CPD Foundational Principles
- Certification at Source: All public data intended for consumption by AI systems must be cryptographically signed (blockchain hash or qualified digital signature) upon issuance. This allows the AI agent to verify its authenticity and integrity before processing it.
- Machine-First Semantics: Datasets must be designed not for the human eye, but for the machine. They must adhere to formal ontologies (e.g., ISA², schema.gov.it, DCAT-AP) with rigid, versioned schemas, eliminating the lexical ambiguity that currently causes misinterpretation by LLMs.
- Certified MCPs: Just as certificate-based trust exists for websites (HTTPS/TLS certificates), standards must exist for "Certified MCP Endpoints." A public entity should release not only the data, but also the official connector (the MCP server), digitally signed, ensuring that the access layer does not introduce errors.
- Mandatory Data Provenance: Every response provided by an AI agent based on public data must include a human-readable provenance, showing not only the source, but also the certification timestamp and the hash of the original data.
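The first and last of these principles can be illustrated together in a short sketch. HMAC-SHA256 stands in for a qualified digital signature purely for illustration (a real deployment would use asymmetric, eIDAS-qualified signatures or an anchored blockchain hash), and every field name below is a hypothetical assumption:

```python
import hashlib
import hmac
import json
from datetime import datetime, timezone

SECRET_KEY = b"issuing-authority-demo-key"  # hypothetical issuer key

def certify_at_source(record: dict, source: str) -> dict:
    """Attach hash, timestamp, and signature metadata at publication time."""
    canonical = json.dumps(record, sort_keys=True).encode()
    return {
        "data": record,
        "provenance": {
            "source": source,
            "sha256": hashlib.sha256(canonical).hexdigest(),
            "certified_at": datetime.now(timezone.utc).isoformat(),
            "signature": hmac.new(SECRET_KEY, canonical,
                                  hashlib.sha256).hexdigest(),
        },
    }

def verify_before_processing(certified: dict) -> bool:
    """The check an AI agent would run before trusting the record."""
    canonical = json.dumps(certified["data"], sort_keys=True).encode()
    expected = hmac.new(SECRET_KEY, canonical, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected,
                               certified["provenance"]["signature"])

certified = certify_at_source({"benefit": "X", "funds_available": False},
                              source="registry.example.gov")
print(verify_before_processing(certified))  # True
```

Note that the provenance block is itself the human-readable artifact the fourth principle demands: source, certification timestamp, and hash of the original data travel with every response.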
The Role of Legislators and Regulators
The proliferation of MCPs based on dirty data cannot be stopped by technical innovation alone; regulatory intervention is needed to define liability.
- AI Act and Public Data: The European AI Act (Regulation (EU) 2024/1689) classifies high-risk systems. It is urgent to establish that the use of uncertified public data in high-risk AI systems (e.g., critical infrastructure, law enforcement, healthcare) constitutes a violation of the data governance obligations of Article 10.
- Mandatory Certification for Public Administrations: Public administrations should be required to release, for datasets with high social impact, not only the "open" data but also a certified MCP endpoint and an AI software development kit (SDK), with the cost of this level of quality borne as part of the national digital infrastructure.
7. Conclusions: The choice between Informational order and chaos
We are at a crossroads. On the one hand, we can allow the market to proliferate with improvised connectors that tap into immature public data, creating an ecosystem of agentic AI that, while fast, will be inherently unreliable and potentially harmful. On the other, we can recognize that in the AI era, public data is no longer just an information asset, but critical infrastructure in its own right.
Just as we would not allow a bridge to be built without certified materials and structural testing, we cannot allow "information bridges" (MCPs) to the state and public services to be built on uncertified data. The urgency is acute because the adoption of AI agents is proceeding at an exponential rate, while the maturity of public data advances at a linear pace.
If we don't act now with a structured program to certify public data and regulate MCP services, in a few years we will find ourselves in a situation where 90% of automated interactions with the public sector will convey distorted derived facts, eroding trust not only in AI, but in the institutions themselves.
The quality of the next generation of artificial intelligence will be determined solely by the quality of the certified public data we are able to make available to those systems today.
References and Further Reading
- Regulation (EU) 2024/1689 (AI Act) – particularly the articles relating to high-risk systems and data governance.
- FAIR Principles (Findable, Accessible, Interoperable, Reusable) – Wilkinson et al., 2016.
- Model Context Protocol (MCP) – Anthropic Technical Specifications, 2024.
- Harmonization and Standardization of Shared Data Models – schema.gov.it
- DCAT-AP (Data Catalog Vocabulary Application Profile) – European standard for the interoperability of public data.