Dataset Knowledge Graph

The Dataset Knowledge Graph enriches the Dataset Register with insights derived from each dataset's content. The Register stores what publishers submit; the Knowledge Graph publishes an empirical, VoID-modelled view of each dataset's shape – its RDF types, predicates, languages, outgoing links, and conformance to SCHEMA-AP-NDE.

It helps researchers, service platform builders and data engineers decide which heritage datasets fit their use case, and how to query them.

What a Summary tells you

For each dataset, the Summary lets you answer questions like:

Does this dataset contain what I need? – which RDF types are present and how many instances of each; which predicates are populated for which classes.
How big and how queryable is it? – total triples, distinct subjects and predicates; example resources to start exploring.
Which terminology sources does it link to? – outgoing linksets to AAT, GTAA, GeoNames, Wikidata and other vocabularies in the Network of Terms.
Which languages and datatypes does it cover? – language tags and XSD datatypes per property, broken down by class.
Does it conform to SCHEMA-AP-NDE? – a sampled SHACL validation of the dataset against the Schema.org Application Profile for NDE.
How can I access it? – which SPARQL endpoint or data dump currently responds, and at what size.

Inside a Dataset Summary

Each Summary attaches the following information to a void:Dataset – a mix of dataset-level statistical properties, VoID partitions, separate linkset resources, and DQV/PROV quality measurements:

| Aspect | Modelled as | What it tells you |
| --- | --- | --- |
| Size | `void:triples`, `void:distinctSubjects`, `void:properties`, `nde:objectsLiteral`, `nde:distinctObjectsURI` | Overall scale and the literal-vs-URI balance |
| Classes | `void:classPartition` | Which RDF types occur, with instance counts |
| Properties | `void:propertyPartition` | Which predicates occur; entities and distinct objects per predicate |
| Property density per class | nested `classPartition` / `propertyPartition` | Which properties are populated for each subject class – answers "which fields exist on `schema:Person` records?" |
| Datatypes per class and property | `void-ext:datatype`, `void-ext:datatypePartition` | Which XSD datatypes are used, broken down by class and property |
| Languages per class and property | `void-ext:languagePartition` | Language-tag coverage per class and property |
| Object classes per class and property | `void-ext:objectClassPartition` | How classes connect through predicates – e.g. "books link to persons via author 1350 times" |
| Outgoing linksets | `void:Linkset` | Cross-dataset and cross-vocabulary links – how the dataset fits into the wider network |
| Subject URI spaces | `void:uriSpace` + `void:entities` on a `void:subset` | The most common namespaces for subject resources |
| Subject URI resolution & persistent identifiers | `subject-uris-sampled` / `subject-uris-resolved` DQV measurements on the subset, plus – when the namespace is a recognised PID scheme – `dcterms:conformsTo <https://def.nde.nl/pid-scheme#ark>` (or `#handle`) and, for ARK, `dcterms:publisher`, plus a `subject-uris-persistent` boolean flag | For the namespace the dataset mints for its own resources (the most common one that is not a terminology source), whether a sample of those URIs resolves to a self-describing landing page. ARK and Handle persistent identifiers are detected from the namespace, with the ARK issuing organisation looked up via `arks.org`. `resolved > 0` means the dataset's own identifiers genuinely dereference; `resolved = 0` next to a declared PID scheme means it claims a persistent identifier whose links are broken. A `subject-uris-persistent` flag set to `false` marks a namespace on the disallow list of known non-durable vendor namespaces – it resolves today but is not a durable home for the identifiers. Each sampled URI that failed is enumerated on the sampling activity as a failed-sample qualified usage, carrying the exact URI and a typed `failure:reason` |
| Vocabularies | `void:vocabulary` | Schema.org, FOAF, Dublin Core, etc. – what the predicates draw from |
| Licenses | `dcterms:license` | License coverage at the resource level |
| Media | `void:subset` marked `<https://def.nde.nl/probe#detects> <https://def.nde.nl/probe#media>` + `void:entities` | Whether the dataset exposes any media – images, audio, video, 3D. The subset exists only when the dataset has media, so its presence is the has-media signal, and its `void:entities` is a double-count-safe lower bound on the number of media objects. The IIIF subset (below) nests under it, so a media-bearing dataset that offers no IIIF reads as “media, but no IIIF” rather than being indistinguishable from “no media” |
| IIIF Presentation manifests | `void:subset` + `dcterms:conformsTo <http://iiif.io/api/presentation/>` + `void:entities`, plus `manifests-sampled` / `manifests-validated` DQV measurements | Whether the dataset exposes IIIF Presentation API manifests, how many, and how many of a sample actually resolve. Detected from `schema:encodingFormat` literals matching the SCHEMA-AP-NDE IIIF profile pattern; v2 and v3 collapse into one version-less subset. The `dcterms:conformsTo` marker is declared; a sample of the manifest IRIs is then dereferenced and validated, so a dataset whose manifests genuinely resolve (`validated > 0`) is distinguishable from one that declares IIIF but serves broken manifests (`validated = 0`). Each sampled manifest that failed validation is enumerated on the validation activity as a failed-sample qualified usage, carrying the exact URL and a typed `failure:reason` |
| Failed samples | `prov:qualifiedUsage` → `prov:Usage` with `prov:entity` + `failure:reason` on the sampling/validation `prov:Activity` | For the subject-URI resolution and IIIF manifest checks, the identity of each failed sample, so a low ratio can be triaged down to the individual broken URI/URL and its reason. See Failed samples |
| Distributions | `void:sparqlEndpoint`, `void:dataDump`, plus HTTP-validated status | Which distributions currently work and at what size |
| Example resources | `void:exampleResource` | Concrete starting points for exploration |
| SCHEMA-AP-NDE conformance | `dqv:QualityMeasurement` + `prov:Activity` | Whether a sample of resources passes the SCHEMA-AP-NDE SHACL shapes. Three metrics are emitted: `schema-ap-nde-sample-conformance` (boolean), `quads-validated` (number of sampled triples), and `samples-per-class` (sample cap). Combine `quads-validated > 0` with `conformance = true` to mean "tested and passed"; `quads-validated = 0` means the profile didn't apply (e.g. the dataset uses Linked.Art or EDM). The full per-resource SHACL report is written to a file rather than the triple store. |

For the exact output each row produces, see the analysis CONSTRUCT queries that generate it; the sample queries below show live results.
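As a rough illustration of how the nested partitions in the table can be read, the sketch below lists the properties populated on `schema:Person` instances, per dataset. It assumes that the standard VoID terms `void:class` and `void:property` identify each partition, that nested property partitions carry a `void:entities` count, and that the graph uses the `https://schema.org/` namespace. None of these is confirmed on this page, so adjust the query to what the CONSTRUCT queries actually emit.

```sparql
PREFIX void:   <http://rdfs.org/ns/void#>
# Assumed namespace; the graph may use http://schema.org/ instead.
PREFIX schema: <https://schema.org/>

SELECT ?dataset ?property ?entities WHERE {
  # Class partition for schema:Person on each dataset
  ?dataset a void:Dataset ;
    void:classPartition ?classPartition .
  ?classPartition void:class schema:Person ;
    void:propertyPartition ?propertyPartition .
  # Nested property partition: one per predicate used on that class
  ?propertyPartition void:property ?property ;
    void:entities ?entities .   # assumed: count is exposed as void:entities
}
ORDER BY ?dataset DESC(?entities)
```

A property that is missing from the result for a given dataset is, under these assumptions, simply not populated on that dataset's `schema:Person` resources.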

Partition URIs

The partition resources inside a Summary – class partitions, property partitions, subsets, and so on – are identified by stable well-known URIs derived from the dataset URI:

`{dataset-uri}/.well-known/void#{partition-type}-{hash}`

The hash is an MD5 of the class or property URI, so each partition is uniquely and stably addressable across pipeline runs – a consumer can link to or dereference a specific partition. For example, the `schema:Person` class partition of dataset `https://example.org/dataset` is `https://example.org/dataset/.well-known/void#class-5f4d3c2b1a…`.
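Because SPARQL 1.1 has a built-in `MD5()` function, a partition URI can in principle be derived inside a query rather than looked up. The sketch below does this for the `schema:Person` class partition of the example dataset. It assumes the hash is the lowercase hexadecimal MD5 of the full class IRI string, that the partition-type token is `class` (taken from the example above), and that the class IRI uses the `https://schema.org/` namespace; the pipeline may differ on any of these.

```sparql
PREFIX void:   <http://rdfs.org/ns/void#>
# Assumed namespace; the graph may use http://schema.org/ instead.
PREFIX schema: <https://schema.org/>

SELECT ?partition ?entities WHERE {
  # Illustrative inputs: the example dataset URI and the class of interest
  VALUES (?dataset ?class) { (<https://example.org/dataset> schema:Person) }

  # {dataset-uri}/.well-known/void#{partition-type}-{hash}
  # MD5() returns the digest as a lowercase hex string.
  BIND(IRI(CONCAT(STR(?dataset), "/.well-known/void#class-", MD5(STR(?class)))) AS ?partition)

  # Look the derived partition up; stays unbound if the assumptions do not hold
  OPTIONAL { ?partition void:entities ?entities }
}
```

If `?entities` comes back unbound for a class that is known to occur in the dataset, the hash input or the partition-type token probably differs from what is assumed here.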

Sample queries

One example per analysis. Each link opens the query pre-loaded in the Knowledge Graph query UI – click Run to execute. The aggregate datastory demonstrates more advanced combinations.

Size

The overall size of each dataset: total triples, distinct subjects, and the literal-vs-URI object split.

```sparql
PREFIX void: <http://rdfs.org/ns/void#>
SELECT * WHERE {
  ?dataset a void:Dataset ;
    void:triples ?triples ;
    void:distinctSubjects ?distinctSubjects .
}
ORDER BY DESC(?triples)
```
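Following the same pattern as the Size query, class partitions can be aggregated across datasets to see which RDF types dominate the network. This is a minimal sketch, not the query the Knowledge Graph UI ships: it assumes each class partition names its class with the standard `void:class` term and carries its instance count as `void:entities`.

```sparql
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?class
       (SUM(?entities) AS ?instances)            # total instances over all datasets
       (COUNT(DISTINCT ?dataset) AS ?datasets)   # number of datasets using the class
WHERE {
  ?dataset a void:Dataset ;
    void:classPartition ?partition .
  ?partition void:class ?class ;       # assumed: class named via void:class
    void:entities ?entities .          # assumed: instance count via void:entities
}
GROUP BY ?class
ORDER BY DESC(?instances)
LIMIT 20
```

Sorting on `?datasets` instead of `?instances` favours classes that are widely adopted over classes that are large in a single dataset.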

