The Codex normalization standard.
Eight datasets, one contract. The Codex normalization standard is the shape every record takes — the part that makes any Codex dataset feel like part of a single corpus rather than eight loosely related dumps.
Seven principles.
Each one is a property buyers can rely on across every dataset.
Schema-locked
Every dataset publishes a versioned schema. Breaking changes bump the major version; buyers pin to a major and trust that releases within it remain compatible.
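As a sketch, pinning can be as simple as comparing major versions. The function name and the idea of a per-record version string are this example's assumptions, not part of the Codex spec:

```python
# Hypothetical sketch: a buyer pinned to major version 1 accepts any
# 1.x record and rejects anything from a different major.

def compatible(record_version: str, pinned_major: int) -> bool:
    """True when the record's schema shares the pinned major version.

    Within a major, changes are non-breaking, so a reader built
    against 1.2 can consume a 1.4 record.
    """
    major = int(record_version.split(".")[0])
    return major == pinned_major

print(compatible("1.4.2", 1))  # same major: safe to consume
print(compatible("2.0.0", 1))  # major bump: breaking change, reject
```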
Source-attributed
Every record points back to the upstream document, system, and the moment it was observed. Citations and audits are derivable from the row, not reconstructed from logs.
Bitemporally honest
Real-world time and system time are kept separate. The questions ‘what did the world look like on date X?’ and ‘what did Codex know on date Y?’ get different answers — both correct.
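Those two questions can be sketched as one bitemporal lookup over plain Python records. Every field name here (`valid_from`, `recorded_at`, and so on) is this example's assumption, not the published schema:

```python
from datetime import date

# Toy bitemporal history for one permit.
records = [
    # First observed on 2024-03-01: active since 2024-01-01.
    {"status": "active",  "valid_from": date(2024, 1, 1),
     "valid_to": None, "recorded_at": date(2024, 3, 1)},
    # Correction recorded on 2024-06-01: it actually expired 2024-05-01.
    {"status": "expired", "valid_from": date(2024, 5, 1),
     "valid_to": None, "recorded_at": date(2024, 6, 1)},
]

def as_of(world_date: date, known_date: date):
    """Status on world_date, as the system knew it on known_date."""
    # System time filter: only rows recorded by known_date are visible.
    visible = [r for r in records if r["recorded_at"] <= known_date]
    # Real-world time filter: rows in force on world_date.
    current = [r for r in visible
               if r["valid_from"] <= world_date
               and (r["valid_to"] is None or world_date < r["valid_to"])]
    # The most recently recorded row wins (later corrections supersede).
    return max(current, key=lambda r: r["recorded_at"])["status"] if current else None

print(as_of(date(2024, 5, 15), date(2024, 5, 1)))  # "active": expiry not yet recorded
print(as_of(date(2024, 5, 15), date(2024, 7, 1)))  # "expired": correction now visible
```

Both answers are correct for their respective "as known on" dates; neither overwrites the other.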
Spatially consistent
Every spatial record carries an H3 cell at resolution 8. The same key joins maritime, civic, real estate, and labor data without bespoke wrangling.
AI-optimized
Categories, entity extractions, sentiment, and risk scores are pre-computed at normalization time and stored as columns. Records ship ready for embedding stores and fine-tunes.
Versioned snapshots
Monthly, immutable releases. Buyers can pin to any month for reproducible research and model training.
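A minimal sketch of what pinning could look like in client code; the `codex/<month>/` path convention is invented for illustration and is not the product's actual layout:

```python
# Hypothetical snapshot resolver: a pin names a monthly release, and
# because releases are immutable, the same pin always yields the same data.
releases = ["2025-01", "2025-02", "2025-03"]

def resolve(pin: str) -> str:
    """Map a month pin to its snapshot path, failing loudly on unknown pins."""
    if pin not in releases:
        raise ValueError(f"no snapshot published for {pin}")
    return f"codex/{pin}/"

print(resolve("2025-02"))
```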
Joinable by construction
A small, documented set of shared keys lets any Codex dataset join cleanly to any other. The cross-dataset experience is not an integration project — it is a SQL JOIN.
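A toy version of that promise, run against SQLite in memory. The table and column names are illustrative; only the idea of a shared, documented key (here an H3 cell) comes from the standard:

```python
import sqlite3

# Two made-up Codex-style tables sharing the h3_cell key.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE civic (h3_cell TEXT, permit_count INTEGER);
    CREATE TABLE labor (h3_cell TEXT, job_postings INTEGER);
    INSERT INTO civic VALUES ('8828308281fffff', 12);
    INSERT INTO labor VALUES ('8828308281fffff', 87);
""")

# The cross-dataset experience: a plain SQL JOIN on the shared key.
row = con.execute("""
    SELECT c.h3_cell, c.permit_count, l.job_postings
    FROM civic c JOIN labor l USING (h3_cell)
""").fetchone()
print(row)  # ('8828308281fffff', 12, 87)
```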
What every record carries.
The minimum metadata envelope. Each dataset extends it with its own payload.
- Identity
Stable record and chunk identifiers, plus a fetchable URL pointing to the original source document.
- Schema & lineage
Per-dataset schema version and the version of the pipeline that produced the row, both semver.
- Time
Real-world event time, source publication time, system ingest time, system modification time, and effective-from / effective-to dates where the record carries legal force.
- Confidence & provenance
A confidence score and an ordered chain of the transformations that produced the row.
- Access tier
Each record carries the license tier it qualifies for, so buyers can enforce access control during retrieval.
Aligned to the standards your auditors already accept.
Every dataset publishes a per-field crosswalk to the binding external standard for its domain — DCAT-US v3.0 (federally mandated for U.S. open data starting 2026), W3C PROV-DM, schema.org, IHO S-100 + IMO A.600 for maritime, OSCRE IDM for real estate, GS1 GLN for points of interest, NAICS for industry classification.
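An illustrative excerpt of what such a per-field crosswalk might look like as data. The specific term mappings are plausible guesses drawn from the DCAT and PROV vocabularies, not the published crosswalk:

```python
# Hypothetical crosswalk excerpt: envelope fields mapped to external terms.
crosswalk = {
    "published_time": "dcterms:issued",          # DCAT-US datasets reuse Dublin Core terms
    "modified_time":  "dcterms:modified",
    "source_url":     "prov:hadPrimarySource",   # W3C PROV-O
    "provenance":     "prov:wasDerivedFrom",
}
print(crosswalk["source_url"])
```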
Browse the per-dataset crosswalks →
What lives in the commercial spec.
This page is the buyer-facing summary. The full versioned spec — field-by-field schemas, scoring methodology, provenance pipeline details, and the APRS minimum ingestion contract that governs every chunk — ships with every commercial license alongside the data.