Dataset Knowledge Graph
The Dataset Knowledge Graph enriches the Dataset Register with insights derived from each dataset's content. The Register stores what publishers submit; the Knowledge Graph publishes an empirical, VoID-modelled view of each dataset's shape – its RDF types, predicates, languages, outgoing links, and conformance to SCHEMA-AP-NDE.
It helps researchers, service platform builders and data engineers decide which heritage datasets fit their use case, and how to query them.
What a Summary tells you
For each dataset, the Summary lets you answer questions like:
- Does this dataset contain what I need? – which RDF types are present and how many instances of each; which predicates are populated for which classes.
- How big and how queryable is it? – total triples, distinct subjects and predicates; example resources to start exploring.
- Which terminology sources does it link to? – outgoing linksets to AAT, GTAA, GeoNames, Wikidata and other vocabularies in the Network of Terms.
- Which languages and datatypes does it cover? – language tags and XSD datatypes per property, broken down by class.
- Does it conform to SCHEMA-AP-NDE? – a sampled SHACL validation of the dataset against the Schema.org Application Profile for NDE.
- How can I access it? – which SPARQL endpoint or data dump currently responds, and at what size.
Inside a Dataset Summary
Each Summary attaches the following information to a void:Dataset – a mix of dataset-level statistical properties, VoID partitions, separate linkset resources, and DQV/PROV quality measurements:
| Aspect | Modelled as | What it tells you |
|---|---|---|
| Size | void:triples, void:distinctSubjects, void:properties, nde:objectsLiteral, nde:distinctObjectsURI | Overall scale and the literal-vs-URI balance |
| Classes | void:classPartition | Which RDF types occur, with instance counts |
| Properties | void:propertyPartition | Which predicates occur; entities and distinct objects per predicate |
| Property density per class | nested classPartition / propertyPartition | Which properties are populated for each subject class – answers "which fields exist on schema:Person records?" |
| Datatypes per class and property | void-ext:datatype, void-ext:datatypePartition | Which XSD datatypes are used, broken down by class and property |
| Languages per class and property | void-ext:languagePartition | Language-tag coverage per class and property |
| Object classes per class and property | void-ext:objectClassPartition | How classes connect through predicates – e.g. "books link to persons via author 1350 times" |
| Outgoing linksets | void:Linkset | Cross-dataset and cross-vocabulary links – how the dataset fits into the wider network |
| Subject URI spaces | void:uriSpace | The most common namespaces for subject resources |
| Vocabularies | void:vocabulary | Schema.org, FOAF, Dublin Core, etc. – what the predicates draw from |
| Licenses | dcterms:license | License coverage at the resource level |
| IIIF Presentation manifests | void:subset + dcterms:conformsTo <http://iiif.io/api/presentation/> + void:entities, plus manifests-sampled / manifests-validated DQV measurements | Whether the dataset exposes IIIF Presentation API manifests, how many, and how many of a sample actually resolve. Detected from schema:encodingFormat literals matching the SCHEMA-AP-NDE IIIF profile pattern; v2 and v3 collapse into one version-less subset. The dcterms:conformsTo marker only reflects what the dataset declares; a sample of the manifest IRIs is then dereferenced and validated, so a dataset whose manifests genuinely resolve (validated > 0) can be told apart from one that declares IIIF but serves broken manifests (validated = 0). |
| Distributions | void:sparqlEndpoint, void:dataDump, plus HTTP-validated status | Which distributions currently work and at what size |
| Example resources | void:exampleResource | Concrete starting points for exploration |
| SCHEMA-AP-NDE conformance | dqv:QualityMeasurement + prov:Activity | Whether a sample of resources passes the SCHEMA-AP-NDE SHACL shapes. Three metrics are emitted: schema-ap-nde-sample-conformance (boolean), quads-validated (number of sampled triples), and samples-per-class (sample cap). Combine quads-validated > 0 with conformance = true to mean "tested and passed"; quads-validated = 0 means the profile didn't apply (e.g. the dataset uses Linked.Art or EDM). The full per-resource SHACL report is written to a file rather than the triple store. |
For the full RDF examples behind each row, see the dataset-knowledge-graph README – the canonical reference.
Sample queries
One example per analysis. Each link opens the query pre‑loaded in the Knowledge Graph triplestore UI — click Run to execute. The aggregate datastory demonstrates more advanced combinations.
Size
The overall size of each dataset: total triples and distinct subjects, largest dataset first.
PREFIX void: <http://rdfs.org/ns/void#>
SELECT * WHERE {
  ?dataset a void:Dataset ;
    void:triples ?triples ;
    void:distinctSubjects ?distinctSubjects .
}
ORDER BY DESC(?triples)
Most common classes
Which RDF types appear most across the network, with instance counts summed across datasets and the number of datasets each class appears in.
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?class (SUM(?count) AS ?instances) (COUNT(DISTINCT ?dataset) AS ?datasets) WHERE {
  ?dataset a void:Dataset ;
    void:classPartition [
      void:class ?class ;
      void:entities ?count
    ] .
}
GROUP BY ?class
ORDER BY DESC(?instances)
LIMIT 20
Most common properties
Which predicates appear most across the network, with the total number of entities that carry each predicate.
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?property (SUM(?entities) AS ?totalEntities) WHERE {
  ?dataset a void:Dataset ;
    void:propertyPartition [
      void:property ?property ;
      void:entities ?entities
    ] .
}
GROUP BY ?property
ORDER BY DESC(?totalEntities)
LIMIT 20
Property density on schema:Person
For each dataset, which predicates are populated on schema:Person resources, and how many entities carry them.
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX schema: <https://schema.org/>
SELECT * WHERE {
  ?dataset void:classPartition [
    void:class schema:Person ;
    void:propertyPartition [
      void:property ?property ;
      void:entities ?entities ;
      void:distinctObjects ?distinctObjects
    ]
  ]
}
ORDER BY DESC(?entities)
LIMIT 50
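Raw entity counts are hard to compare between datasets of different sizes. The sketch below expresses each count as a share of all schema:Person instances in the dataset. It assumes that the class partition carrying the nested property partitions also carries the class-level void:entities count used in the "Most common classes" query; if the pipeline models these on separate nodes, the pattern would need adjusting.

```sparql
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX schema: <https://schema.org/>

SELECT ?dataset ?property ?entities ?classEntities
       (?entities / ?classEntities AS ?share)
WHERE {
  ?dataset void:classPartition [
    void:class schema:Person ;
    # Assumption: the class-level instance count sits on the same partition node.
    void:entities ?classEntities ;
    void:propertyPartition [
      void:property ?property ;
      void:entities ?entities
    ]
  ] .
  # Guard against division by zero.
  FILTER(?classEntities > 0)
}
ORDER BY ?dataset DESC(?share)
LIMIT 50
```

A share close to 1 suggests the property is populated on nearly every Person record in that dataset; a low share suggests a sparsely used field.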
Datatypes used for schema:Person/schema:name
Which XSD datatypes appear in schema:name values on schema:Person resources — useful for spotting unexpected datatype mixes.
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX void-ext: <http://ldf.fi/void-ext#>
PREFIX schema: <https://schema.org/>
SELECT ?datatype (SUM(?triples) AS ?count) WHERE {
  ?dataset void:classPartition [
    void:class schema:Person ;
    void:propertyPartition [
      void:property schema:name ;
      void-ext:datatypePartition [
        void-ext:datatype ?datatype ;
        void:triples ?triples
      ]
    ]
  ]
}
GROUP BY ?datatype
ORDER BY DESC(?count)
Language coverage on schema:name
Which language tags appear on schema:name values of schema:CreativeWork resources, and how often.
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX void-ext: <http://ldf.fi/void-ext#>
PREFIX schema: <https://schema.org/>
SELECT ?language (SUM(?triples) AS ?count) WHERE {
  ?dataset void:classPartition [
    void:class schema:CreativeWork ;
    void:propertyPartition [
      void:property schema:name ;
      void-ext:languagePartition [
        void-ext:language ?language ;
        void:triples ?triples
      ]
    ]
  ]
}
GROUP BY ?language
ORDER BY DESC(?count)
Object classes linked from schema:Book/schema:author
Which classes are the targets of schema:author on schema:Book resources, and how often each class is linked — shows how Book connects to other things in the data.
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX void-ext: <http://ldf.fi/void-ext#>
PREFIX schema: <https://schema.org/>
SELECT ?objectClass (SUM(?triples) AS ?count) WHERE {
  ?dataset void:classPartition [
    void:class schema:Book ;
    void:propertyPartition [
      void:property schema:author ;
      void-ext:objectClassPartition [
        void:class ?objectClass ;
        void:triples ?triples
      ]
    ]
  ]
}
GROUP BY ?objectClass
ORDER BY DESC(?count)
Outgoing linksets to terminology sources
Every cross-dataset and cross-vocabulary linkset emitted by the pipeline, with the number of triples in each — shows how datasets connect to terminology sources and to one another.
PREFIX void: <http://rdfs.org/ns/void#>
SELECT * WHERE {
  [] a void:Linkset ;
    void:subjectsTarget ?dataset ;
    void:objectsTarget ?terminologySource ;
    void:triples ?triples .
}
ORDER BY DESC(?triples)
LIMIT 50
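To see which targets are linked to most widely, rather than listing individual linksets, the same pattern can be aggregated per target. This is a minimal sketch that reuses only the properties from the query above; the variable names ?datasets and ?links are illustrative.

```sparql
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?terminologySource
       (COUNT(DISTINCT ?dataset) AS ?datasets)
       (SUM(?triples) AS ?links)
WHERE {
  [] a void:Linkset ;
    void:subjectsTarget ?dataset ;
    void:objectsTarget ?terminologySource ;
    void:triples ?triples .
}
GROUP BY ?terminologySource
ORDER BY DESC(?datasets) DESC(?links)
```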
Subject URI spaces
The most common URI namespaces used for subject resources across all datasets — shows where the network's identifiers live.
PREFIX void: <http://rdfs.org/ns/void#>
SELECT * WHERE {
  ?dataset void:subset [
    void:uriSpace ?uriSpace ;
    void:entities ?entities
  ] .
}
ORDER BY DESC(?entities)
LIMIT 50
Most-referenced vocabularies
Which vocabularies (Schema.org, FOAF, Dublin Core, …) are referenced, and by how many datasets.
PREFIX void: <http://rdfs.org/ns/void#>
SELECT ?vocabulary (COUNT(DISTINCT ?dataset) AS ?datasetCount) WHERE {
  ?dataset a void:Dataset ;
    void:vocabulary ?vocabulary .
}
GROUP BY ?vocabulary
ORDER BY DESC(?datasetCount)
License usage
License IRIs that appear in dataset subsets and how many datasets use each.
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?license (COUNT(DISTINCT ?dataset) AS ?datasetCount) WHERE {
  ?dataset a void:Dataset ;
    void:subset [
      dcterms:license ?license
    ] .
}
GROUP BY ?license
ORDER BY DESC(?datasetCount)
Datasets exposing IIIF Presentation manifests
Datasets that publish IIIF Presentation API manifests under the SCHEMA-AP-NDE convention, with the number of distinct manifests detected per dataset.
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?dataset ?manifests WHERE {
  ?dataset a void:Dataset ;
    void:subset [
      dcterms:conformsTo <http://iiif.io/api/presentation/> ;
      void:entities ?manifests
    ] .
}
ORDER BY DESC(?manifests)
The dcterms:conformsTo marker above only reflects what the dataset declares. To find datasets whose manifests have been validated as working, query the manifests-validated measurement instead: a sample of the manifest IRIs is dereferenced on each run, and the measurement counts how many resolved to a valid IIIF Presentation Manifest. validated > 0 means working manifests; validated = 0 alongside a declared subset means the dataset claims IIIF but its sampled manifests all failed to resolve.
Datasets with validated IIIF manifests
Datasets whose declared IIIF manifests actually resolve, ordered by how many of the sampled manifests were validated.
PREFIX dqv: <http://www.w3.org/ns/dqv#>
PREFIX nde: <https://def.nde.nl/metric#>
SELECT ?dataset ?validated ?sampled WHERE {
  ?dataset dqv:hasQualityMeasurement
    [ dqv:isMeasurementOf nde:manifests-validated ; dqv:value ?validated ] ,
    [ dqv:isMeasurementOf nde:manifests-sampled ; dqv:value ?sampled ] .
  FILTER(?validated > 0)
}
ORDER BY DESC(?validated)
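The opposite case, a declared IIIF subset whose sampled manifests all failed, can be found by combining the two patterns. The following is a sketch under the assumption that the declared subset and the measurements hang off the same ?dataset resource, as the two queries above suggest.

```sparql
PREFIX void: <http://rdfs.org/ns/void#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dqv: <http://www.w3.org/ns/dqv#>
PREFIX nde: <https://def.nde.nl/metric#>

SELECT ?dataset ?declared ?sampled WHERE {
  # The dataset declares IIIF Presentation manifests ...
  ?dataset void:subset [
      dcterms:conformsTo <http://iiif.io/api/presentation/> ;
      void:entities ?declared
    ] ;
    # ... but none of the sampled manifests validated.
    dqv:hasQualityMeasurement
      [ dqv:isMeasurementOf nde:manifests-validated ; dqv:value ?validated ] ,
      [ dqv:isMeasurementOf nde:manifests-sampled ; dqv:value ?sampled ] .
  FILTER(?validated = 0)
}
ORDER BY DESC(?declared)
```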
Datasets with working SPARQL endpoints
Datasets whose declared SPARQL endpoint passed the pipeline's smoke test.
PREFIX void: <http://rdfs.org/ns/void#>
SELECT * WHERE {
  ?dataset a void:Dataset ;
    void:sparqlEndpoint ?endpoint .
}
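To answer the broader "How can I access it?" question, endpoints and data dumps can be listed side by side. This is a minimal sketch that assumes void:dataDump is attached directly to the void:Dataset in the same way as void:sparqlEndpoint; the Distributions row of the table lists both properties, but their exact placement is not confirmed here.

```sparql
PREFIX void: <http://rdfs.org/ns/void#>

SELECT ?dataset ?endpoint ?dump WHERE {
  ?dataset a void:Dataset .
  OPTIONAL { ?dataset void:sparqlEndpoint ?endpoint }
  # Assumption: data dumps are recorded on the dataset itself.
  OPTIONAL { ?dataset void:dataDump ?dump }
  # Keep only datasets with at least one recorded access route.
  FILTER(BOUND(?endpoint) || BOUND(?dump))
}
ORDER BY ?dataset
```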
Example resources per dataset
A handful of void:exampleResource URIs per dataset — concrete starting points for exploration.
PREFIX void: <http://rdfs.org/ns/void#>
SELECT * WHERE {
  ?dataset void:exampleResource ?example .
}
LIMIT 50
Datasets passing SCHEMA-AP-NDE
Datasets whose sampled resources passed SHACL validation against the Schema.org Application Profile for NDE.
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dqv: <http://www.w3.org/ns/dqv#>
PREFIX nde: <https://def.nde.nl/metric#>
SELECT * WHERE {
  ?dataset dqv:hasQualityMeasurement
    [ dqv:value true ;
      dcterms:conformsTo <https://docs.nde.nl/schema-profile/> ] ,
    [ dqv:isMeasurementOf nde:quads-validated ;
      dqv:value ?n ] .
  FILTER (?n > 0)
}
The ?n > 0 filter excludes datasets that use a different data model and to which the profile doesn't apply at all (where SHACL returns vacuously true). To find datasets that tried the profile and failed, swap dqv:value true for dqv:value false.
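The three possible outcomes (passed, failed, profile not applicable) can also be reported in a single result. The following is a sketch rather than a canonical query: the ?status labels are made up for illustration, and the xsd:boolean filter is a defensive assumption in case other measurements also carry dcterms:conformsTo.

```sparql
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dqv: <http://www.w3.org/ns/dqv#>
PREFIX nde: <https://def.nde.nl/metric#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?dataset ?n ?status WHERE {
  ?dataset dqv:hasQualityMeasurement
    [ dqv:value ?conforms ;
      dcterms:conformsTo <https://docs.nde.nl/schema-profile/> ] ,
    [ dqv:isMeasurementOf nde:quads-validated ;
      dqv:value ?n ] .
  # Only treat boolean values as the conformance result.
  FILTER(datatype(?conforms) = xsd:boolean)
  # quads-validated = 0 means nothing was tested, so "true" is vacuous.
  BIND(
    IF(?n = 0, "profile not applicable",
      IF(?conforms, "tested and passed", "tested and failed"))
    AS ?status
  )
}
ORDER BY ?status ?dataset
```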
Access
- Datastory for visual aggregate insights across all datasets: datastories.demo.netwerkdigitaalerfgoed.nl/dataset-knowledge-graph.
- SPARQL endpoint for direct queries: https://triplestore.netwerkdigitaalerfgoed.nl/repositories/dataset-knowledge-graph.
How summaries are produced
A periodic pipeline builds the summaries:
- Select valid dataset descriptions with at least one RDF distribution from the Dataset Register.
- Load the data – directly from the publisher’s SPARQL endpoint if available, otherwise by indexing the RDF dump in QLever.
- Analyse by running a set of SPARQL CONSTRUCT queries, one per partition type, with code-level post-processing where needed. Each analyser emits VoID triples. For IIIF, a sample of the detected manifest IRIs is also dereferenced and validated (via @lde/iiif-validator), recording how many resolve to valid Presentation Manifests.
- Validate against SCHEMA-AP-NDE by sampling a configurable number of resources per sh:targetClass and running them through the profile's SHACL shapes. The detailed per-resource SHACL report is written to a file (not the triple store).
- Summarise quality measurements as DQV measurements and a PROV activity, and append them to the dataset's Summary.
- Write the results to the Knowledge Graph triple store.
Datasets without a valid RDF distribution are skipped; invalid distributions emit a schema:error triple instead of a summary, so consumers can still see which distributions are unreachable.
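The schema:error triples mentioned above can be listed with a query along these lines. It is a minimal sketch: it assumes the error predicate uses the same https://schema.org/ namespace as the other sample queries, and it makes no assumption about which resource carries the triple, so the subject is left as a variable.

```sparql
PREFIX schema: <https://schema.org/>

SELECT ?subject ?error WHERE {
  # Assumption: errors are recorded with schema:error in the https namespace.
  ?subject schema:error ?error .
}
ORDER BY ?subject
LIMIT 50
```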
The pipeline source, run instructions and contributor guidance live in the dataset-knowledge-graph repository.