Ga naar hoofdinhoud

Dataset Knowledge Graph

The Dataset Knowledge Graph enriches the Dataset Register with insights derived from each dataset's content. The Register stores what publishers submit; the Knowledge Graph publishes an empirical, VoID-modelled view of each dataset's shape – its RDF types, predicates, languages, outgoing links, and conformance to SCHEMA-AP-NDE.

It helps researchers, service platform builders and data engineers decide which heritage datasets fit their use case, and how to query them.

What a Summary tells you

For each dataset, the Summary lets you answer questions like:

Inside a Dataset Summary

Each Summary attaches the following information to a void:Dataset – a mix of dataset-level statistical properties, VoID partitions, separate linkset resources, and DQV/PROV quality measurements:

AspectModelled asWhat it tells you
Sizevoid:triples, void:distinctSubjects, void:properties, nde:objectsLiteral, nde:distinctObjectsURIOverall scale and the literal-vs-URI balance
Classesvoid:classPartitionWhich RDF types occur, with instance counts
Propertiesvoid:propertyPartitionWhich predicates occur; entities and distinct objects per predicate
Property density per classnested classPartition / propertyPartitionWhich properties are populated for each subject class – answers "which fields exist on schema:Person records?"
Datatypes per class and propertyvoid-ext:datatype, void-ext:datatypePartitionWhich XSD datatypes are used, broken down by class and property
Languages per class and propertyvoid-ext:languagePartitionLanguage-tag coverage per class and property
Object classes per class and propertyvoid-ext:objectClassPartitionHow classes connect through predicates – e.g. "books link to persons via author 1350 times"
Outgoing linksetsvoid:LinksetCross-dataset and cross-vocabulary links – how the dataset fits into the wider network
Subject URI spacesvoid:uriSpace + void:entities on a void:subsetThe most common namespaces for subject resources
Subject URI resolution & persistent identifierssubject-uris-sampled / subject-uris-resolved DQV measurements on the subset, plus – when the namespace is a recognised PID scheme – dcterms:conformsTo <https://def.nde.nl/pid-scheme#ark> (or #handle) and, for ARK, dcterms:publisher, plus a subject-uris-persistent boolean flagFor the namespace the dataset mints for its own resources (the most common one that is not a terminology source), whether a sample of those URIs resolves to a self-describing landing page. ARK and Handle persistent identifiers are detected from the namespace, with the ARK issuing organisation looked up via arks.org. resolved > 0 means the dataset's own identifiers genuinely dereference; resolved = 0 next to a declared PID scheme means it claims a persistent identifier whose links are broken. A subject-uris-persistent flag set to false marks a namespace on the disallow list of known non-durable vendor namespaces – it resolves today but is not a durable home for the identifiers. Each sampled URI that failed is enumerated on the sampling activity as a failed-sample qualified usage, carrying the exact URI and a typed failure:reason
Vocabulariesvoid:vocabularySchema.org, FOAF, Dublin Core, etc. – what the predicates draw from
Licensesdcterms:licenseLicense coverage at the resource level
Mediavoid:subset marked <https://def.nde.nl/probe#detects> <https://def.nde.nl/probe#media> + void:entitiesWhether the dataset exposes any media – images, audio, video, 3D. The subset exists only when the dataset has media, so its presence is the has-media signal, and its void:entities is a double-count-safe lower bound on the number of media objects. The IIIF subset (below) nests under it, so a media-bearing dataset that offers no IIIF reads as “media, but no IIIF” rather than being indistinguishable from “no media”
IIIF Presentation manifestsvoid:subset + dcterms:conformsTo <http://iiif.io/api/presentation/> + void:entities, plus manifests-sampled / manifests-validated DQV measurementsWhether the dataset exposes IIIF Presentation API manifests, how many, and how many of a sample actually resolve. Detected from schema:encodingFormat literals matching the SCHEMA-AP-NDE IIIF profile pattern; v2 and v3 collapse into one version-less subset. The dcterms:conformsTo marker is declared; a sample of the manifest IRIs is then dereferenced and validated, so a dataset whose manifests genuinely resolve (validated > 0) is distinguishable from one that declares IIIF but serves broken manifests (validated = 0). Each sampled manifest that failed validation is enumerated on the validation activity as a failed-sample qualified usage, carrying the exact URL and a typed failure:reason
Failed samplesprov:qualifiedUsageprov:Usage with prov:entity + failure:reason on the sampling/validation prov:ActivityFor the subject-URI resolution and IIIF manifest checks, the identity of each failed sample, so a low ratio can be triaged down to the individual broken URI/URL and its reason. See Failed samples
Distributionsvoid:sparqlEndpoint, void:dataDump, plus HTTP-validated statusWhich distributions currently work and at what size
Example resourcesvoid:exampleResourceConcrete starting points for exploration
SCHEMA-AP-NDE conformancedqv:QualityMeasurement + prov:ActivityWhether a sample of resources passes the SCHEMA-AP-NDE SHACL shapes. Three metrics are emitted: schema-ap-nde-sample-conformance (boolean), quads-validated (number of sampled triples), and samples-per-class (sample cap). Combine quads-validated > 0 with conformance = true to mean "tested and passed"; quads-validated = 0 means the profile didn't apply (e.g. the dataset uses Linked.Art or EDM). The full per-resource SHACL report is written to a file rather than the triple store.

For the exact output each row produces, see the analysis CONSTRUCT queries that generate it; the sample queries below show live results.

Partition URIs

The partition resources inside a Summary – class partitions, property partitions, subsets, and so on – are identified by stable well-known URIs derived from the dataset URI:

{dataset-uri}/.well-known/void#{partition-type}-{hash}

The hash is an MD5 of the class or property URI, so each partition is uniquely and stably addressable across pipeline runs – a consumer can link to or dereference a specific partition. For example, the schema:Person class partition of dataset https://example.org/dataset is https://example.org/dataset/.well-known/void#class-5f4d3c2b1a….

Sample queries

One example per analysis. Each link opens the query pre‑loaded in the Knowledge Graph query UI — click Run to execute. The aggregate datastory demonstrates more advanced combinations.

Size

The overall size of each dataset: total triples, distinct subjects, and the literal-vs-URI object split.

PREFIX void: <http://rdfs.org/ns/void#>
SELECT * WHERE {
?dataset a void:Dataset ;
void:triples ?triples ;
void:distinctSubjects ?distinctSubjects .
}
ORDER BY DESC(?triples)

▶ Run in the query UI

Most common classes

Which RDF types appear most across the network, with instance counts summed across datasets and the number of datasets each class appears in.

PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?class (SUM(?count) AS ?instances) (COUNT(DISTINCT ?dataset) AS ?datasets) WHERE {
?dataset a void:Dataset ;
void:classPartition [
void:class ?class ;
void:entities ?count
] .
}
GROUP BY ?class
ORDER BY DESC(?instances)
LIMIT 20

▶ Run in the query UI

Most common properties

Which predicates appear most across the network, with the total number of entities that carry each predicate.

PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?property (SUM(?entities) AS ?totalEntities) WHERE {
?dataset a void:Dataset ;
void:propertyPartition [
void:property ?property ;
void:entities ?entities
] .
}
GROUP BY ?property
ORDER BY DESC(?totalEntities)
LIMIT 20

▶ Run in the query UI

Property density on schema:Person

For each dataset, which predicates are populated on schema:Person resources, and how many entities carry them.

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX schema: <https://schema.org/>
SELECT * WHERE {
?dataset void:classPartition [
void:class schema:Person ;
void:propertyPartition [
void:property ?property ;
void:entities ?entities ;
void:distinctObjects ?distinctObjects
]
]
}
ORDER BY DESC(?entities)
LIMIT 50

▶ Run in the query UI

Datatypes used for schema:Person/schema:name

Which XSD datatypes appear in schema:name values on schema:Person resources — useful for spotting unexpected datatype mixes.

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX void-ext: <http://ldf.fi/void-ext#>
PREFIX schema: <https://schema.org/>
SELECT ?datatype (SUM(?triples) AS ?count) WHERE {
?dataset void:classPartition [
void:class schema:Person ;
void:propertyPartition [
void:property schema:name ;
void-ext:datatypePartition [
void-ext:datatype ?datatype ;
void:triples ?triples
]
]
]
}
GROUP BY ?datatype
ORDER BY DESC(?count)

▶ Run in the query UI

Language coverage on schema:name

Which language tags appear on schema:name values of schema:CreativeWork resources, and how often.

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX void-ext: <http://ldf.fi/void-ext#>
PREFIX schema: <https://schema.org/>
SELECT ?language (SUM(?triples) AS ?count) WHERE {
?dataset void:classPartition [
void:class schema:CreativeWork ;
void:propertyPartition [
void:property schema:name ;
void-ext:languagePartition [
void-ext:language ?language ;
void:triples ?triples
]
]
]
}
GROUP BY ?language
ORDER BY DESC(?count)

▶ Run in the query UI

Object classes linked from schema:Book/schema:author

Which classes are the targets of schema:author on schema:Book resources, and how often each class is linked — shows how Book connects to other things in the data.

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX void-ext: <http://ldf.fi/void-ext#>
PREFIX schema: <https://schema.org/>
SELECT ?objectClass (SUM(?triples) AS ?count) WHERE {
?dataset void:classPartition [
void:class schema:Book ;
void:propertyPartition [
void:property schema:author ;
void-ext:objectClassPartition [
void:class ?objectClass ;
void:triples ?triples
]
]
]
}
GROUP BY ?objectClass
ORDER BY DESC(?count)

▶ Run in the query UI

Outgoing linksets to terminology sources

Every cross-dataset and cross-vocabulary linkset emitted by the pipeline, with the number of triples in each — shows how datasets connect to terminology sources and to one another.

PREFIX void: <http://rdfs.org/ns/void#>
SELECT * WHERE {
[] a void:Linkset ;
void:subjectsTarget ?dataset ;
void:objectsTarget ?terminologySource ;
void:triples ?triples .
}
ORDER BY DESC(?triples)
LIMIT 50

▶ Run in the query UI

Subject URI spaces

The most common URI namespaces used for subject resources across all datasets — shows where the network's identifiers live.

PREFIX void: <http://rdfs.org/ns/void#>
SELECT * WHERE {
?dataset void:subset [
void:uriSpace ?uriSpace ;
void:entities ?entities
] .
}
ORDER BY DESC(?entities)
LIMIT 50

▶ Run in the query UI

Datasets whose subject URIs resolve

For each dataset's own subject namespace – the most common one that is not a terminology source – a sample of URIs is dereferenced and checked to resolve to a self-describing landing page. subject-uris-resolved > 0 means the identifiers genuinely work; the namespace and the subject-uris-sampled denominator come along so you can read the ratio. A transient failure (timeout, network error, 429/5xx) on the multi-hop ARK/Handle resolver chain is retried and, if still failing, excluded from the denominator rather than scored as a non-resolution – so a single network blip during a crawl cannot report a healthy dataset as partially broken. URLs the dataset already exposes as IIIF manifests are excluded from this sample (a manifest serves JSON, not an HTML landing page); they are assessed by the IIIF criterion instead, so the same URL is never both a working manifest and a broken identifier.

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dqv: <http://www.w3.org/ns/dqv#>
PREFIX nde: <https://def.nde.nl/metric#>
SELECT ?dataset ?uriSpace ?resolved ?sampled WHERE {
?dataset void:subset ?ns .
?ns void:uriSpace ?uriSpace ;
dqv:hasQualityMeasurement
[ dqv:isMeasurementOf nde:subject-uris-resolved ; dqv:value ?resolved ] ,
[ dqv:isMeasurementOf nde:subject-uris-sampled ; dqv:value ?sampled ] .
FILTER(?resolved > 0)
}
ORDER BY DESC(?resolved)

▶ Run in the query UI

Datasets that mint a persistent identifier

Datasets whose own subject namespace is a recognised ARK or Handle scheme, with the scheme and – for ARK – the issuing organisation. Combine with subject-uris-resolved above to tell a dataset that declares a persistent identifier and whose links resolve from one that claims a PID but serves broken links.

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?dataset ?uriSpace ?scheme ?publisher WHERE {
?dataset void:subset ?ns .
?ns void:uriSpace ?uriSpace ;
dcterms:conformsTo ?scheme .
FILTER(STRSTARTS(STR(?scheme), "https://def.nde.nl/pid-scheme#"))
OPTIONAL { ?ns dcterms:publisher ?publisher }
}

▶ Run in the query UI

Most-referenced vocabularies

Which vocabularies (Schema.org, FOAF, Dublin Core, …) are referenced, and by how many datasets.

PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?vocabulary (COUNT(DISTINCT ?dataset) AS ?datasetCount) WHERE {
?dataset a void:Dataset ;
void:vocabulary ?vocabulary .
}
GROUP BY ?vocabulary
ORDER BY DESC(?datasetCount)

▶ Run in the query UI

License usage

License IRIs that appear in dataset subsets and how many datasets use each.

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?license (COUNT(DISTINCT ?dataset) AS ?datasetCount) WHERE {
?dataset a void:Dataset ;
void:subset [
dcterms:license ?license
] .
}
GROUP BY ?license
ORDER BY DESC(?datasetCount)

▶ Run in the query UI

Datasets exposing IIIF Presentation manifests

Datasets that publish IIIF Presentation API manifests under the SCHEMA-AP-NDE convention, with the number of distinct manifests detected per dataset.

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?dataset ?manifests WHERE {
?dataset a void:Dataset ;
void:subset [
dcterms:conformsTo <http://iiif.io/api/presentation/> ;
void:entities ?manifests
] .
}
ORDER BY DESC(?manifests)

▶ Run in the query UI

The dcterms:conformsTo marker above is declared. To find datasets whose manifests are validated working, query the manifests-validated measurement instead – a sample of the manifest IRIs is dereferenced each run, and this counts how many resolved to a valid IIIF Presentation Manifest. validated > 0 means working manifests; validated = 0 alongside a declared subset means the dataset claims IIIF but its sampled manifests all failed to resolve.

Datasets with validated IIIF manifests

Datasets whose declared IIIF manifests actually resolve, ordered by how many of the sampled manifests were validated.

PREFIX dqv: <http://www.w3.org/ns/dqv#>
PREFIX nde: <https://def.nde.nl/metric#>
SELECT ?dataset ?validated ?sampled WHERE {
?dataset dqv:hasQualityMeasurement
[ dqv:isMeasurementOf nde:manifests-validated ; dqv:value ?validated ] ,
[ dqv:isMeasurementOf nde:manifests-sampled ; dqv:value ?sampled ] .
FILTER(?validated > 0)
}
ORDER BY DESC(?validated)

▶ Run in the query UI

Failed samples

The subject-URI resolution and IIIF manifest-validation checks each sample a handful of resources and report an aggregate ratio (subject-uris-resolved / subject-uris-sampled and manifests-validated / manifests-sampled). A low ratio tells you that something broke, not which resource or why. To answer that, every sampled resource that failed is enumerated on the check's prov:Activity as a qualified usage:

_:activity a prov:Activity ;
prov:used <https://example.org/id/123> ; # the failed resource
prov:qualifiedUsage _:usage ;
prov:wasAssociatedWith <…software> .
_:usage a prov:Usage ;
prov:entity <https://example.org/id/123> ;
failure:reason <https://def.nde.nl/subject-resolution-failure#no-self-reference> .

Only failures are persisted – the presence of a failure:reason is the contract for “this sample failed”; the resolved/validated samples are covered by the count alone. The prov:Usage hangs off the activity (a usage reifies an activity-uses-entity relationship), but you still reach failures dataset-first through the measurement that the activity generated:

void:subsetdqv:hasQualityMeasurement → measurement → prov:wasGeneratedByprov:Activityprov:qualifiedUsageprov:Usageprov:entity / failure:reason.

The reason is a SKOS concept from the scheme matching the check: subject-resolution-failure (timeout, network-error, http-error, wrong-content-type, no-self-reference) for subject URIs, and manifest-validation-failure (timeout, network-error, http-error, invalid-json, binary-content, not-a-manifest, does-not-load) for IIIF manifests. The failure:reason predicate itself is defined in the failure module.

This query lists every failed subject URI per dataset with its reason:

PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dqv: <http://www.w3.org/ns/dqv#>
PREFIX prov: <http://www.w3.org/ns/prov#>
PREFIX nde: <https://def.nde.nl/metric#>
PREFIX failure: <https://def.nde.nl/failure#>
SELECT ?dataset ?failedUri ?reason WHERE {
?dataset void:subset ?subset .
?subset dqv:hasQualityMeasurement ?measurement .
?measurement dqv:isMeasurementOf nde:subject-uris-resolved ;
prov:wasGeneratedBy ?activity .
?activity prov:qualifiedUsage ?usage .
?usage prov:entity ?failedUri ;
failure:reason ?reason .
}

Swap nde:subject-uris-resolved for nde:manifests-validated to list failed IIIF manifests instead.

Datasets with working SPARQL endpoints

Datasets whose declared SPARQL endpoint passed the pipeline's smoke test.

PREFIX void: <http://rdfs.org/ns/void#>
SELECT * WHERE {
?dataset a void:Dataset ;
void:sparqlEndpoint ?endpoint .
}

▶ Run in the query UI

Example resources per dataset

A handful of void:exampleResource URIs per dataset — concrete starting points for exploration.

PREFIX void: <http://rdfs.org/ns/void#>
SELECT * WHERE {
?dataset void:exampleResource ?example .
}
LIMIT 50

▶ Run in the query UI

Datasets passing SCHEMA-AP-NDE

Datasets whose sampled resources passed SHACL validation against the Schema.org Application Profile for NDE. The ?n > 0 filter excludes datasets to which the profile doesn't apply (vacuous truth from an empty target set).

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dqv: <http://www.w3.org/ns/dqv#>
PREFIX nde: <https://def.nde.nl/metric#>
SELECT * WHERE {
?dataset dqv:hasQualityMeasurement
[ dqv:value true ;
dcterms:conformsTo <https://docs.nde.nl/schema-profile/> ] ,
[ dqv:isMeasurementOf nde:quads-validated ;
dqv:value ?n ] .
FILTER (?n > 0)
}

▶ Run in the query UI

The ?n > 0 filter excludes datasets that use a different data model and to which the profile doesn't apply at all (where SHACL returns vacuously true). To find datasets that tried the profile and failed, swap dqv:value true for dqv:value false.

Access

How summaries are produced

A periodic pipeline builds the summaries:

  1. Select valid dataset descriptions with at least one RDF distribution from the Dataset Register.
  2. Load the data – directly from the publisher’s SPARQL endpoint if available, otherwise by indexing the RDF dump in QLever.
  3. Analyse by running a set of SPARQL CONSTRUCT queries, one per partition type, with code-level post-processing where needed. Each analyser emits VoID triples. For IIIF, a sample of the detected manifest IRIs is also dereferenced and validated (via @lde/iiif-validator), recording how many resolve to valid Presentation Manifests. Likewise, the dataset's own subject namespace is sampled and dereferenced to measure whether its URIs – and any ARK or Handle persistent identifiers – resolve.
  4. Validate against SCHEMA-AP-NDE by sampling a configurable number of resources per sh:targetClass and running them through the profile's SHACL shapes. The detailed per-resource SHACL report is written to a file (not the triple store).
  5. Summarise quality measurements as DQV measurements and a PROV activity, and append them to the dataset's Summary.
  6. Write the results as one n-quads file per dataset (each Summary in a named graph keyed on the dataset IRI, its SHACL report in a derived graph). A separate, read-only QLever rebuilds its served index from these files after every run – on success and on partial failure alike – so the Knowledge Graph is a pure derived cache, fully rebuilt each run rather than mutated in place.
  7. Reconcile the cache with the register: a dataset that has since been removed from the register or whose registration source has gone (became unreachable) is no longer rewritten, so its file would linger as a stale “ghost”. After writing, the pipeline deletes every file whose dataset URI is no longer present-and-not-gone in the register. Datasets merely skipped this run (no RDF distribution, an unreachable endpoint, or a description that failed validation) are kept, and an empty result from the register prunes nothing – so a register outage can never empty the cache.

Datasets without a valid RDF distribution are skipped; invalid distributions emit a schema:error triple instead of a summary, so consumers can still see which distributions are unreachable.

The pipeline source, run instructions and contributor guidance live in the dataset-knowledge-graph repository.