Requirements for Datasets

1. Introduction

This section is non-normative.

To enable datasets to be found and used, they must be described according to a well-documented, shared and machine-readable publication model.

This document describes such a model and its rules. When publishers make their dataset descriptions adhere to these rules, they enable consumers – both humans and machines – to use the datasets in new and better ways.

These requirements prescribe the metadata that publishers must provide for their datasets. This metadata tells consumers:

what the dataset is called and under what license it is published (§ 4.2 Dataset information);
which person/organization has published the dataset (§ 4.3 Creator/publisher information);
where the data can be found (§ 4.4 Distributions).

1.1. Audience

This document is mainly geared towards two groups of readers.

Digital heritage collection managers can follow the requirements in this document to make their published datasets findable and usable, for instance through Google Dataset Search and the NDE Dataset Register.

Suppliers of collection management systems can implement these requirements in their software to help collection managers using it to publish datasets in the correct format. These requirements are scoped to the online publication output of collection management systems; they do not prescribe how those systems should store data internally.

1.2. Context

While focused on digital heritage institutions in the Netherlands, this document is based on broader, international best practices for publishing datasets, including [DWBP-UCR], [DWBP] and [LD-BP].

These requirements incorporate a previous publication model, which provides more background on choices made here.

1.3. Code examples

RDF code examples are in the [SCHEMA-ORG] vocabulary, serialized as [JSON-LD].

While other vocabularies, such as [VOCAB-DCAT-3], can also be used, Schema.org has the advantage that search engines pick it up more readily. This improves findability, one of the main goals of publishing datasets on the web.

2. Definitions

Dataset

A collection of metadata records. These are made available through the dataset’s distributions.

Dataset description

Metadata about the dataset, including the dataset’s name and publisher. This description must be distinguished from the metadata records themselves.

For example: imagine a dataset of Van Gogh paintings called ‘Sunflowers’, which is published by the Van Gogh Museum under a specific license. The name, license and publisher are all part of the dataset description. The dataset description also tells us the URLs of distributions where we can download or query the data. Using these distributions, we can access the metadata records themselves, which may include descriptions of paintings, persons, places etc. These are not part of the dataset description.

Data catalog

A collection of dataset descriptions.

Distribution

A channel through which a dataset is made available, either for downloading (such as a CSV file download or RDF dump), or for querying (such as a SPARQL endpoint).

Web API

An API that is available over HTTP, for example an OAI-PMH, OpenAPI or SPARQL endpoint.

Machine-readability

TODO

Publisher

An individual or organization that provides one or more datasets.

Issue: Is ‘publisher’ a good translation of ‘bronhouder’? Examples still need to be added. In DERA terms this would probably be ‘heritage institution’.

Consumer

An organization, individual or service platform that uses one or more datasets that are provided by a publisher.

3. Conceptual model

The model consists of four resource types: organizations or persons publish datasets, which are available in distributions. Optionally, the datasets are grouped in data catalogs.

4. Requirements

4.1. Available in RDF

To give machines access to data, the data needs to be published in an RDF format. RDF formats include [JSON-LD], [N3] and [Turtle].

Publishers MUST make their dataset description available in RDF.

Both the Schema.org and DCAT vocabularies MAY be used; Schema.org is recommended.

Google recommends including the JSON-LD directly in the HTML source of web pages.

So, on your organization’s web page, for instance www.kb.nl, include:

<html>
  <head>
    <title>Koninklijke Bibliotheek</title>
    <script type="application/ld+json">
      {
        "@context": "https://schema.org/",
        "@type": "Organization",
        "@id": "https://www.kb.nl",
        "name": {
          "@value": "Koninklijke Bibliotheek",
          "@language": "nl"
        }
      }
    </script>
  </head>
  <body>
    Here continues the web page of the organization...
  </body>
</html>
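A consumer could read such an embedded description back out of the page as follows. This is a minimal sketch using only the Python standard library; the class and function names are illustrative and not part of these requirements.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed content of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content may arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return a list with one parsed JSON value per embedded JSON-LD script."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents
```

The sketch only parses the JSON; it does not expand or otherwise process the JSON-LD, for which a dedicated RDF library would be needed.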

4.1.1. Content-Type

Clients that retrieve dataset descriptions rely on the HTTP Content-Type header to determine how the response should be parsed.

Therefore, the URL at which a dataset description is published MUST be served with a Content-Type header whose media type matches the RDF serialization of the response body.

Recognized RDF media types include application/ld+json for [JSON-LD], text/turtle for [Turtle], application/n-triples for [N-TRIPLES] and application/rdf+xml for RDF/XML. When the dataset description is embedded in an HTML page (see § 4.1 Available in RDF), text/html is acceptable.
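The mapping from media type to serialization can be sketched as a small lookup. The function names are hypothetical, and the table holds only the media types listed above; a real client may recognize more.

```python
# Media types named in this section, mapped to their RDF serialization.
RDF_MEDIA_TYPES = {
    "application/ld+json": "JSON-LD",
    "text/turtle": "Turtle",
    "application/n-triples": "N-Triples",
    "application/rdf+xml": "RDF/XML",
}


def parse_media_type(content_type):
    """Return the lowercased media type without parameters such as charset."""
    return content_type.split(";", 1)[0].strip().lower()


def describe_content_type(content_type):
    """Return how a response should be parsed, or None if it is not recognized."""
    media_type = parse_media_type(content_type)
    if media_type in RDF_MEDIA_TYPES:
        return RDF_MEDIA_TYPES[media_type]
    if media_type == "text/html":
        # Acceptable only when the page embeds the dataset description.
        return "HTML with embedded JSON-LD"
    return None
```

For instance, describe_content_type("text/turtle; charset=utf-8") yields "Turtle", while a generic "application/json" response is not recognized.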

4.1.2. Durable identifiers

Consumers want to refer to datasets. They prefer to do so by linking to them.

Therefore, publishers MUST maintain a permanent and unique identifier for each dataset. Publishers MUST use HTTP IRIs as identifiers.

4.1.3. Resolvable dataset IRIs

Consumers want to look up a dataset by its IRI. When the dataset IRI resolves, consumers can retrieve both the machine-readable dataset description (via content negotiation) and a human-readable HTML page without needing a separate landing page URL.

Therefore, publishers SHOULD ensure that the dataset IRI resolves and serves both an RDF representation (the dataset description) and an HTML representation. The HTML representation MUST be consistent with and include at least all details (such as distributions) from the RDF representation.

When the dataset IRI resolves to an HTML page, a separate mainEntityOfPage is not needed (see § 4.2.5 More information).
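As an illustration of content negotiation from the consumer's side, the sketch below builds a request that prefers an RDF representation and falls back to HTML. The function name and the quality values in the Accept header are assumptions chosen for this example, not prescribed by these requirements.

```python
from urllib.request import Request


def build_description_request(dataset_iri):
    """Build (but do not send) a request that asks for RDF first, then HTML."""
    accept = "application/ld+json, text/turtle;q=0.9, text/html;q=0.5"
    return Request(dataset_iri, headers={"Accept": accept})
```

The returned object can be passed to urllib.request.urlopen to perform the actual lookup; the Content-Type of the response then tells the client which representation it received.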

4.1.4. Information remains available

Datasets will be used by all kinds of consumers and their systems. For stability, users must be able to trust that the datasets will remain available so they can be consulted in the future.

Therefore, publishers MUST ensure information remains available in the future.

4.2. Dataset information

Consumers want to consult information about the dataset to decide whether and how they want to use its data. This information answers user questions such as:

What is the name of the dataset?
What is the dataset about? What kind of data does the dataset contain?
How recent is the data? When was the dataset last published?
How can I use the data? Are there any restrictions? Under which license is the data published?
Where can I get the data? In what formats?

4.2.1. Basic information

Publishers MUST include basic information about the dataset, at the very minimum its HTTP [IRI] and name.

Basic dataset information:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "@id": "http://data.bibliotheken.nl/doc/dataset/rise-alba",
  "name": [
    {
      "@value": "Alba amicorum van de Koninklijke Bibliotheek",
      "@language": "nl"
    },
    {
      "@value": "Alba amicorum of the Dutch Royal Library",
      "@language": "en"
    }
  ]
}
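A publisher could check this minimum before publishing. The sketch below works on the parsed JSON-LD as a plain Python dictionary; the function names and problem messages are illustrative, and the check is far less thorough than full validation of the RDF.

```python
from urllib.parse import urlsplit


def is_http_iri(value):
    """True if value is a string with an http(s) scheme and a host."""
    if not isinstance(value, str):
        return False
    parts = urlsplit(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def missing_basic_information(dataset):
    """Return a list of problems with the dataset's @id and name."""
    problems = []
    if not is_http_iri(dataset.get("@id")):
        problems.append("@id must be an HTTP IRI")
    if not dataset.get("name"):
        problems.append("name is required")
    return problems
```

An empty list means the minimum from this section is present.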

4.2.2. License

License applicable to the dataset. As a convenience, this license is inherited by all distributions that do not specify their own license. Publishers MUST make known under which license the dataset may be used.

This license specifies the conditions under which the metadata records in the dataset may be accessed and reused. It does not cover the heritage objects (either physical or digital) that the metadata describes. Access conditions for these objects should be specified in their own descriptions within the dataset.

The DERA requires metadata to be published openly, so this value SHOULD be an open license that allows consumers to reuse the data, such as https://creativecommons.org/publicdomain/zero/1.0/ (CC0), https://creativecommons.org/licenses/by/4.0/ (CC BY 4.0), or https://creativecommons.org/licenses/by-sa/4.0/ (CC BY-SA 4.0). Adopting a non-open license will severely limit reuse and does not comply with the DERA principles.

The value MUST be the canonical URI of a license. For example, use https://creativecommons.org/publicdomain/zero/1.0/ instead of https://creativecommons.org/publicdomain/zero/1.0/deed.nl.

Specify a license for the dataset:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "license": "https://creativecommons.org/publicdomain/zero/1.0/"
}

DCAT-AP-NL 3.0 requires the use of Creative Commons licences and defines a value list for licences. The default SHOULD be CC0 (https://creativecommons.org/publicdomain/zero/1.0/). The following table lists the recommended canonical license IRIs. Creative Commons IRIs MUST use https:// and end with a trailing slash.

Canonical IRI
`https://creativecommons.org/publicdomain/zero/1.0/`
`https://creativecommons.org/publicdomain/mark/1.0/`
`https://creativecommons.org/licenses/by/4.0/`
`https://creativecommons.org/licenses/by-sa/4.0/`
`https://creativecommons.org/licenses/by-nc/4.0/`
`https://creativecommons.org/licenses/by-nc-sa/4.0/`
`https://creativecommons.org/licenses/by-nd/4.0/`
`https://creativecommons.org/licenses/by-nc-nd/4.0/`

In v2.0 of this specification, the license MUST be one of the IRIs listed above.
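A check against the list above could look like the following sketch. The normalization in suggest_canonical (upgrading http:// to https://, dropping a trailing deed or legalcode segment, and adding the trailing slash) is an assumption about common deviations, not a rule from this specification.

```python
import re

CANONICAL_LICENSES = {
    "https://creativecommons.org/publicdomain/zero/1.0/",
    "https://creativecommons.org/publicdomain/mark/1.0/",
    "https://creativecommons.org/licenses/by/4.0/",
    "https://creativecommons.org/licenses/by-sa/4.0/",
    "https://creativecommons.org/licenses/by-nc/4.0/",
    "https://creativecommons.org/licenses/by-nc-sa/4.0/",
    "https://creativecommons.org/licenses/by-nd/4.0/",
    "https://creativecommons.org/licenses/by-nc-nd/4.0/",
}


def is_canonical_license(iri):
    """True only for an exact match with one of the listed IRIs."""
    return iri in CANONICAL_LICENSES


def suggest_canonical(iri):
    """Return the canonical IRI a near-miss probably meant, or None."""
    candidate = re.sub(r"^http://", "https://", iri.strip())
    # Drop suffixes such as /deed.nl or /legalcode (assumed deviations).
    candidate = re.sub(r"/(deed|legalcode)(\.[A-Za-z_-]+)?/?$", "/", candidate)
    if not candidate.endswith("/"):
        candidate += "/"
    return candidate if candidate in CANONICAL_LICENSES else None
```

A publishing tool could use suggest_canonical to propose a correction, while still rejecting values for which is_canonical_license is false.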

4.2.3. Creation, publication and modification dates

Publishers SHOULD make known when the dataset description was originally created, published and when it was last updated. The dates MUST be valid according to [ISO8601].

Specify dataset description dates:

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "dateCreated": "2019-04-14",
  "datePublished": "2019-05-21",
  "dateModified": "2019-08-15"
}
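The date values can be checked with the Python standard library, as in the sketch below. It covers only calendar dates and date-times, which is a subset of [ISO8601]; the function name is illustrative.

```python
from datetime import date, datetime


def is_valid_iso8601(value):
    """True for values such as 2019-04-14 or 2019-08-15T08:05:00Z."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        pass
    # Older Python versions do not accept a trailing Z, so rewrite it.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False
```

A value such as "14-04-2019" is rejected, because it does not follow the year-month-day order.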

4.2.4. Versions

A dataset description may change over time. Consumers, such as researchers, may want to determine which information was valid at a certain moment.

Therefore, publishers SHOULD keep historical versions of the dataset description accessible to users, in addition to publishing the current version.

It is up to the publisher to determine when to publish new versions.

4.2.5. More information

If more information is available, publishers SHOULD add it.

More information about the dataset.

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "description": {
    "@value": "Alba amicorum van de Koninklijke Bibliotheek, een dataset gedefinieerd voor het Europeana Rise of Literacy project.",
    "@language": "nl"
  },
  "mainEntityOfPage": "https://www.kb.nl/bronnen-zoekwijzers/kb-collecties/moderne-handschriften-vanaf-ca-1550/alba-amicorum",
  "keywords": [
    "alba amicorum"
  ]
}

See § 4.6.1 Dataset attributes for an overview of attributes.

4.3. Creator/publisher information

Users want to know where the dataset came from (provenance). The dataset’s creator and/or publisher is either a person or an organization. Providing information about the person/organization behind the dataset answers user questions such as:

Which person/organization has published this dataset? How reliable and credible does that make the dataset?
How can I contact the person/organization for questions or feedback?

Therefore, publishers MUST publish basic information about the person/organization. At the least, the organization’s name and HTTP IRI must be provided. The organization’s name SHOULD be a language-tagged string.

An organization description:

{
  "@context": "https://schema.org/",
  "@type": "Organization",
  "@id": "https://www.kb.nl",
  "name": {
    "@value": "Koninklijke Bibliotheek",
    "@language": "nl"
  },
  "alternateName": {
    "@value": "KB",
    "@language": "nl"
  }
}

See § 4.6.2 Organization attributes for a full overview of organization attributes.

A person description:

{
  "@context": "https://schema.org/",
  "@type": "Person",
  "@id": "https://example.com",
  "name": {
      "@value": "Jan Jansen",
      "@language": "nl"
  }
}

4.3.1. ISIL identifier

Publishers SHOULD include the organization’s ISIL code.

An organization with an ISIL code:

{
  "@context": "https://schema.org/",
  "@type": "Organization",
  "@id": "https://www.kb.nl",
  "identifier": "NL-HaKB"
}

4.3.2. Contact information

Publishers SHOULD include contact information so consumers can reach them with questions or feedback. If contact information is provided, it MUST include a name and e-mail address. The e-mail address SHOULD be an organizational or departmental address rather than a personal one for GDPR compliance.

A publisher with contact information.

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "publisher": {
    "@type": "Organization",
    "@id": "https://www.kb.nl",
    "name": "Koninklijke Bibliotheek",
    "alternateName": "KB",
    "contactPoint": {
      "@type": "ContactPoint",
      "name": {
        "@value": "Datasets Department",
        "@language": "en"
      },
      "email": "datasets@kb.nl",
      "telephone": "+31 6 12345678"
    }
  }
}

4.3.3. Dataset publisher

The person/organization data is then included as the dataset’s publisher:

A dataset with an organization as publisher.

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "publisher": {
    "@type": "Organization",
    "@id": "https://www.kb.nl",
    "identifier": "NL-HaKB",
    "name": {
        "@value": "Koninklijke Bibliotheek",
        "@language": "nl"
    },
    "alternateName": {
        "@value": "KB",
        "@language": "nl"
    },
    "contactPoint": {
      "@type": "ContactPoint",
      "name": {
        "@value": "Datasets Department",
        "@language": "en"
      },
      "email": "datasets@kb.nl",
      "telephone": "+31 6 12345678"
    }
  }
}

A dataset with a person as publisher.

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "publisher": {
    "@type": "Person",
    "@id": "https://example.com",
    "name": {
      "@value": "Jan Jansen",
      "@language": "nl"
    }
  }
}

4.4. Distributions

Consumers that are interested in a dataset should be able to access the data in it. Distributions tell consumers where and how they can get the data.

Therefore, publishers SHOULD add at least one distribution. Each distribution MUST have at least a MIME format and the URL where the distribution can be directly accessed.

Examples of distributions are data dumps in one or more RDF serializations, such as JSON-LD and Turtle, CSV files, SPARQL endpoints, OAI-PMH endpoints or other web APIs. All distributions of a dataset MUST contain broadly the same data.

schema:contentUrl is the URL of the distribution itself: the data file, or the endpoint of the web API. If a documentation page describes or provides access to the distribution alongside it – for example a SPARQL query editor for the endpoint, or a download landing page – add it on the distribution with schema:documentation.

A minimal definition of a SPARQL endpoint distribution. In the Schema.org vocabulary, each type of distribution is called a DataDownload, even if it is a query endpoint.

{
  "@context": "https://schema.org/",
  "@type": "DataDownload",
  "contentUrl": "http://vocab.getty.edu/sparql",
  "usageInfo": "https://www.w3.org/TR/sparql11-protocol/"
}

The distributions are then included under the distribution attribute with the dataset.

A dataset with one SPARQL and two download distributions. The SPARQL endpoint separates the endpoint URL (contentUrl) from its query UI (documentation).

{
  "@context": "https://schema.org/",
  "@type": "Dataset",
  "distribution": [
    {
      "@type": "DataDownload",
      "contentUrl": "https://service.archief.nl/sparql",
      "documentation": "https://www.nationaalarchief.nl/onderzoeken/sparql",
      "usageInfo": "https://www.w3.org/TR/sparql11-protocol/"
    },
    {
      "@type": "DataDownload",
      "encodingFormat": "application/ld+json",
      "contentUrl": "http://data.bibliotheken.nl/id/dataset/rise-alba.json"
    },
    {
      "@type": "DataDownload",
      "encodingFormat": "text/csv+gzip",
      "contentUrl": "https://example.com/data.csv.gz"
    }
  ]
}
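Because distributions inherit the dataset's license when they do not specify their own (see § 4.2.2 License), a consumer may want to resolve the license that applies to each distribution. The sketch below does this on the parsed JSON-LD; the function names are illustrative.

```python
def as_list(value):
    """JSON-LD allows a single value where a list is expected."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def effective_licenses(dataset):
    """Map each distribution's contentUrl to the license that applies to it."""
    inherited = dataset.get("license")
    return {
        dist.get("contentUrl"): dist.get("license", inherited)
        for dist in as_list(dataset.get("distribution"))
    }
```

A distribution for which the result is None has no license at all, neither its own nor an inherited one.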

See § 4.6.3 Distribution attributes for a full overview.

In v2.0 of this specification, contentUrl MUST be an xsd:anyURI literal. IRIs will no longer be accepted.

4.4.1. Creation and modification dates

Publishers SHOULD make known when the distribution was originally created and when it was last updated, so consumers can efficiently stay up-to-date with the latest changes. Please note that these dates are different from the dataset description’s dates. Datetimes (2019-08-15T08:05:00Z) are preferred because they offer greater precision, but simple dates (2019-04-14) are also allowed.

Specify distribution dates:

{
  "@context": "https://schema.org/",
  "@type": "DataDownload",
  "dateCreated": "2019-04-14",
  "dateModified": "2019-08-15T08:05:00"
}

4.4.2. Usage information

Each distribution SHOULD include one or more schema:usageInfo IRIs that describe what the distribution provides.

For web APIs, this MUST be the IRI of the protocol specification (such as SPARQL or OAI-PMH). The Dataset Register uses this IRI to identify the distribution as an API.

For downloads as well as APIs, this MUST be the IRI of the application profile(s) that the data conforms to.

Recommended IRIs for typing distributions
Item	Type	Applies to	IRI
GraphQL	Protocol	API	https://spec.graphql.org/
Linked Art	Application profile	API, download	https://linked.art/model/
OAI-PMH	Protocol	API	http://www.openarchives.org/pmh/
OpenAPI REST	Protocol	API	https://spec.openapis.org/oas/v3.2.0.html
[SCHEMA-AP-NDE]	Application profile	API, download	https://docs.nde.nl/schema-profile/
SPARQL	Protocol	API	https://www.w3.org/TR/sparql11-protocol/
TPF	Protocol	API	https://linkeddatafragments.org/specification/triple-pattern-fragments/
WMS	Protocol	API	https://www.ogc.org/standards/wms/

A download distribution that conforms to [SCHEMA-AP-NDE]:

{
  "@context": "https://schema.org/",
  "@type": "DataDownload",
  "encodingFormat": "application/ld+json",
  "contentUrl": "https://example.com/data.jsonld",
  "usageInfo": "https://docs.nde.nl/schema-profile/"
}

A SPARQL distribution that returns RDF conforming to [SCHEMA-AP-NDE]:

{
  "@context": "https://schema.org/",
  "@type": "DataDownload",
  "contentUrl": "https://example.com/sparql",
  "usageInfo": [ "https://www.w3.org/TR/sparql11-protocol/", "https://docs.nde.nl/schema-profile/" ]
}
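To illustrate how a consumer might tell APIs from downloads using these IRIs, the sketch below looks up the protocol IRIs from the table above. It is only an illustration; how the Dataset Register itself implements this is not described here, and the function name is hypothetical.

```python
# Protocol IRIs from the table of recommended IRIs.
PROTOCOL_IRIS = {
    "https://spec.graphql.org/": "GraphQL",
    "http://www.openarchives.org/pmh/": "OAI-PMH",
    "https://spec.openapis.org/oas/v3.2.0.html": "OpenAPI REST",
    "https://www.w3.org/TR/sparql11-protocol/": "SPARQL",
    "https://linkeddatafragments.org/specification/triple-pattern-fragments/": "TPF",
    "https://www.ogc.org/standards/wms/": "WMS",
}


def api_protocols(distribution):
    """Return the protocols named in usageInfo; empty for a plain download."""
    usage = distribution.get("usageInfo", [])
    if isinstance(usage, str):
        usage = [usage]
    return [PROTOCOL_IRIS[iri] for iri in usage if iri in PROTOCOL_IRIS]
```

Application profile IRIs, such as the one for [SCHEMA-AP-NDE], are ignored by this lookup, so a download that conforms to a profile still yields an empty list.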

4.5. Data catalog

A data catalog provides consumers with a complete overview of available dataset descriptions, which improves discoverability.

Therefore, publishers SHOULD provide a catalog.

A catalog of available datasets:

{
  "@context": "https://schema.org/",
  "@type": "DataCatalog",
  "@id": "http://data.bibliotheken.nl/id/datacatalog",
  "name": {
    "@value": "Linked Data van de KB",
    "@language": "nl"
  },
  "description": {
    "@value": "Alle linked data zoals beschikbaar gesteld door de Koninklijke Bibliotheek.",
    "@language": "nl"
  },
  "publisher": {
    "@type": "Organization",
    "@id": "https://www.kb.nl/",
    "name": {
      "@value": "Koninklijke Bibliotheek",
      "@language": "nl"
    }
  },
  "dataset": [
    {
      "@type": "Dataset",
      "@id": "http://data.bibliotheken.nl/id/dataset/rise-alba",
      ...
    },
    {
      ...
    }
  ]
}

See § 4.6.4 DataCatalog attributes for a full overview of catalog attributes.

4.5.1. Pagination

Large data catalogs may be harder to process for clients.

Therefore, publishers SHOULD split large data catalogs into parts of at most 1,000 datasets, using the Hydra Core Vocabulary.

Each page MUST be a complete RDF document in itself. Related resources, such as the publishing organization, must be described on each page, even if that resource is the same on all pages.

A paginated catalog:

{
  "@context": [
    "https://schema.org/",
    {"hydra": "http://www.w3.org/ns/hydra/core#"}
  ],
  "@type": ["DataCatalog", "hydra:Collection"],
  "@id": "https://example.com/catalog",
  "name": {
    "@value": "Paginated catalog of datasets",
    "@language": "en"
  },
  "description": {
    "@value": "This catalog is paginated using the Hydra Core Vocabulary.",
    "@language": "en"
  },
  "publisher": {
    "@type": "Organization",
    "@id": "/publisher",
    "name": {
      "@value": "Example Publisher",
      "@language": "en"
    }
  },
  "hydra:view": {
    "@id": "/catalog?page=1",
    "@type": "hydra:PartialCollectionView",
    "hydra:first": {"@id": "/catalog?page=1"},
    "hydra:next": {"@id": "/catalog?page=2"},
    "hydra:last": {"@id": "/catalog?page=498"}
  },
  "dataset": [
    {
      "@type": "Dataset",
      "@id": "https://example.com/dataset/1",
      ...
    },
    {
      "@type": "Dataset",
      "@id": "https://example.com/dataset/2",
      ...
    },
    ...
  ]
}
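On the publishing side, splitting a catalog into pages could be sketched as follows. The function name, the ?page= query parameter and the shared argument are assumptions made for this example; shared holds the properties (such as name, description and publisher) that are repeated on every page so that each page is a complete document.

```python
import math

CONTEXT = ["https://schema.org/", {"hydra": "http://www.w3.org/ns/hydra/core#"}]


def paginate_catalog(catalog_iri, datasets, shared=None, page_size=1000):
    """Return one JSON-LD document per page of at most page_size datasets."""
    total_pages = max(1, math.ceil(len(datasets) / page_size))
    pages = []
    for number in range(1, total_pages + 1):
        start = (number - 1) * page_size
        view = {
            "@id": f"{catalog_iri}?page={number}",
            "@type": "hydra:PartialCollectionView",
            "hydra:first": {"@id": f"{catalog_iri}?page=1"},
            "hydra:last": {"@id": f"{catalog_iri}?page={total_pages}"},
        }
        if number > 1:
            view["hydra:previous"] = {"@id": f"{catalog_iri}?page={number - 1}"}
        if number < total_pages:
            view["hydra:next"] = {"@id": f"{catalog_iri}?page={number + 1}"}
        page = {
            "@context": CONTEXT,
            "@type": ["DataCatalog", "hydra:Collection"],
            "@id": catalog_iri,
        }
        # Repeat the catalog's own properties on every page.
        page.update(shared or {})
        page["hydra:view"] = view
        page["dataset"] = datasets[start:start + page_size]
        pages.append(page)
    return pages
```

A client can then walk the catalog by following hydra:next until a page no longer contains it.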

4.6. Overview of attributes

This is an overview of required and recommended attributes.

4.6.1. Dataset attributes

schema:Dataset properties
Property	Description	Cardinality	Usage
@id	The HTTP [IRI] of the Dataset.	1	Required
schema:name	A name given to the dataset. See § 4.2.1 Basic information.	1..n	Required (v2.0: must be rdf:langString)
schema:description	A description of the contents of the dataset. Preferably at least three sentences and at most one paragraph (2,000 characters). The discoverability of the dataset depends in part on the quality of the description. Consider different audiences – both domain experts and others – for whom the text should be understandable. See § 4.2.1 Basic information.	1..n	Required (v2.0: must be rdf:langString)
schema:publisher	An entity (organisation or person) responsible for making the Dataset available. See § 4.3 Creator/publisher information.	1	Required
schema:license	License applicable to the dataset. As a convenience, this license is inherited by all distributions that do not specify their own license. Publishers MUST make known under which license the dataset may be used. This license specifies the conditions under which the metadata records in the dataset may be accessed and reused. It does not cover the heritage objects (either physical or digital) that the metadata describes. Access conditions for these objects should be specified in their own descriptions within the dataset. The [DERA] requires metadata to be published openly, so this value SHOULD be an open license that allows consumers to reuse the data, such as `https://creativecommons.org/publicdomain/zero/1.0/` (CC0), `https://creativecommons.org/licenses/by/4.0/` (CC BY 4.0), or `https://creativecommons.org/licenses/by-sa/4.0/` (CC BY-SA 4.0). Adopting a non-open license will severely limit reuse and does not comply with the [DERA] principles. The value MUST be the canonical URI of a license. For example, use `https://creativecommons.org/publicdomain/zero/1.0/` instead of `https://creativecommons.org/publicdomain/zero/1.0/deed.nl`. See § 4.2.2 License.	1	Required (v2.0: must be IRI)
schema:distribution	Distributions through which the dataset can be retrieved, such as a download or via an API. See § 4.4 Distributions.	1..n	Recommended
schema:creator	An entity (organisation or person) responsible for producing the dataset. See § 4.3 Creator/publisher information.	1..n	Required
schema:datePublished	The date of formal issuance (such as publication) of the dataset.	1	Recommended (v2.0: becomes required)
schema:dateCreated	The date on which the dataset was created.	1	Recommended (v2.0: becomes required)
schema:dateModified	The most recent date on which the dataset was changed or modified.	1	Recommended (v2.0: becomes required)
schema:version	Version identifier of the dataset, such as a semantic version number or a date. See § 4.2.4 Versions.	0..1	Recommended
schema:inLanguage	The natural language of the textual values within the dataset – that is, of the metadata records themselves (titles, descriptions, and so on). Use a [BCP47] language code, such as `nl` or `en`.	0..n	Recommended (v2.0: must match a prescribed pattern)
schema:mainEntityOfPage	A web page that provides access to the Dataset, its Distributions and/or additional information. Not needed when the dataset URI itself resolves to an HTML page. The human-readable description MUST be consistent with and include at least all details (such as distributions) from the RDF dataset description. See § 4.1.3 Resolvable dataset IRIs.	0..n	Recommended
schema:isBasedOn	The URI of a dataset this dataset is based on (previously schema:isBasedOnUrl).	0..n	Recommended
schema:citation	A citation or reference for the dataset.	0..n	Recommended
schema:genre	`schema:genre` is deprecated. Use `schema:about` with a URI (e.g. from AAT or GTAA via the Network of Terms) for both subject and material type.	0	Discouraged (v2.0: no longer allowed)
schema:about	The subject matter of the dataset, covering both topical themes (such as ‘post-war reconstruction’ or ‘colonial history’) and material types (such as ‘photographs’ or ‘architectural drawings’). Use a URI from a controlled vocabulary (e.g. AAT, GTAA, or Brinkman via the Network of Terms).	1..n	Recommended
schema:keywords	Words or formalized phrases to describe the dataset. Keywords are free text; use `schema:about` for URIs describing the subject matter.	1..n	Recommended (v2.0: becomes required)
schema:spatialCoverage	The geographical area to which the data in the dataset pertains. Use a URI from a controlled vocabulary such as GeoNames via the Network of Terms.	1..n	Recommended (v2.0: must be IRI)
schema:temporalCoverage	The time period to which the dataset pertains. The value must be an ISO 8601 date or time interval (such as ‘2011’, ‘2011/2012’, ‘1889-06/07’ for shortened notation, ‘-0431/-0404’ for BCE dates, or ‘1440/..’ for an open-ended range), or an HTTP(S) URI.	1..n	Recommended (v2.0: becomes required)
schema:hasPart	Indicates a dataset that is part of this dataset and also available as a separate dataset.	0..n	Recommended
schema:includedInDataCatalog	The HTTP [IRI](s) of the data catalog(s) that the dataset belongs to.	0..n	Recommended (v2.0: must be IRI)
dct:accrualPeriodicity	How often the dataset is updated. Use a value from the EU frequency list, for example http://publications.europa.eu/resource/authority/frequency/WEEKLY .	0..1	Recommended
dct:accessRights	Information about who can access the dataset. Use a value from the EU Access Rights vocabulary. Defaults to `PUBLIC` if not provided.	0..1	Recommended

4.6.2. Organization attributes

schema:Organization properties
Property	Description	Cardinality	Usage
@id	The HTTP [IRI] of the Organization.	1	Required
schema:name	The full name of the publisher or creator.	1..n	Required (v2.0: must be rdf:langString)
schema:alternateName	Alternative names such as an abbreviation that the organization is known under.	0..n	Recommended (v2.0: must be rdf:langString)
schema:identifier	Identifier(s) of the organization, at least its ISIL code.	0..n	Recommended
schema:sameAs	Links to the organization in other databases.	0..n	Recommended
schema:contactPoint	Contact information where end users can get in touch with questions about the dataset. Preferably use the details of the department that manages the dataset or catalogue, rather than an individual employee. See § 4.3.2 Contact information.		Recommended

4.6.3. Distribution attributes

schema:DataDownload properties
Property	Description	Cardinality	Usage
schema:contentUrl	The URL that provides direct access to the distribution itself: the data file or the web API endpoint. Use `schema:documentation` if there is also a documentation page that describes or provides access to the distribution (such as a SPARQL UI or a download landing page).	1	Required
schema:encodingFormat	The media type of the downloadable file. Use a value from the [IANA-MEDIA-TYPES] list. The value should indicate the media type of the response of schema:contentUrl when no Accept header is included in the request. When the distribution is compressed, the compression format (such as `zip`, `gzip`) must be included (such as `text/turtle+gzip`).		Recommended (v2.0: max 1)
schema:description	Distribution description	1..n	Recommended
schema:datePublished	The date of formal issuance (such as publication) of the distribution.	0..1	Recommended (v2.0: becomes required)
schema:dateModified	The most recent date on which the distribution was changed or modified.	0..1	Recommended (v2.0: becomes required)
schema:inLanguage	Language or languages in which the distribution is available. Use a [BCP47] language code, such as `nl`.	0..n	Recommended (v2.0: must match a prescribed pattern)
schema:license	License applicable to the distribution. If the distribution has no license of its own, the license of the parent dataset is inherited. A license must always be available, either on the distribution or on the dataset.	0..1	Recommended (v2.0: must be IRI)
schema:contentSize	Distribution file size in bytes.	0..1	Recommended
schema:usageInfo	A link to the documentation: for downloads, the application profile, vocabulary or ontology; for web APIs (such as SPARQL or OAI-PMH), the protocol specification. See § 4.4.2 Usage information.	0..n	Recommended (v2.0: must be IRI)
schema:documentation	A documentation page that describes or provides access to this distribution, such as a SPARQL UI or a download landing page. `schema:contentUrl` points to the distribution itself (the data file or the API endpoint).	0..n	Recommended

4.6.4. DataCatalog attributes

schema:DataCatalog properties
Property	Description	Cardinality	Usage
@id	The HTTP [IRI] of the DataCatalog.	1	Recommended (v2.0: must be IRI)
schema:name	Name of the data catalog.	1..n	Required (v2.0: must be rdf:langString)
schema:description	A description of the contents of the dataset. Preferably at least three sentences and at most one paragraph (2,000 characters). The discoverability of the dataset depends in part on the quality of the description. Consider different audiences – both domain experts and others – for whom the text should be understandable.	1..n	Required (v2.0: must be rdf:langString)
schema:publisher	Publisher of the data catalog. See § 4.3 Creator/publisher information.	1	Required
schema:dataset	Dataset(s) in the data catalog	1..n	Required

4.6.5. Full example

A full dataset description that includes required and recommended attributes.

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@type": "Dataset",
      "@id": "http://data.bibliotheken.nl/id/dataset/rise-alba",
      "name": [
        {
          "@value": "Alba amicorum van de Koninklijke Bibliotheek",
          "@language": "nl"
        },
        {
          "@value": "Alba amicorum of the Dutch Royal Library",
          "@language": "en"
        }
      ],
      "description": {
        "@value": "Description only in English",
        "@language": "en"
      },
      "keywords": "alba amicorum",
      "about": { "@id": "http://vocab.getty.edu/aat/300026680" },
      "url": "https://www.kb.nl/bronnen-zoekwijzers/kb-collecties/moderne-handschriften-vanaf-ca-1550/alba-amicorum",
      "identifier": "http://data.bibliotheken.nl/id/dataset/rise-alba",
      "license": "https://creativecommons.org/publicdomain/zero/1.0/",
      "inLanguage": [
        "nl-NL",
        "en-GB"
      ],
      "spatialCoverage": {
        "@id": "https://sws.geonames.org/2750405/"
      },
      "temporalCoverage": "1939/1945",
      "publisher": {
        "@id": "https://example.com"
      },
      "creator": [
        {
          "@type": "Person",
          "@id": "https://example.com/creator1",
          "name": {
            "@value": "Creator 1",
            "@language": "en"
          }
        },
        {
          "@type": "Person",
          "@id": "https://example.com/creator2",
          "name": {
            "@value": "Creator 2",
            "@language": "en"
          }
        },
        {
          "@id": "https://example.com"
        }
      ],
      "dateModified": "2021-05-27T09:56:21.370767",
      "datePublished": "2021-05-28",
      "dateCreated": {
        "@type": "http://www.w3.org/2001/XMLSchema#date",
        "@value": "2021-05-27"
      },
      "distribution": [
        {
          "@type": "DataDownload",
          "encodingFormat": [
            "application/rdf+xml",
            "text/turtle"
          ],
          "contentUrl": "https://data.bibliotheken.nl/id/dataset/rise-alba.rdf",
          "contentSize": 12582912,
          "description": "Turtle dump"
        },
        {
          "@type": "DataDownload",
          "encodingFormat": "application/ld+json",
          "contentUrl": "https://data.bibliotheken.nl/id/dataset/rise-alba.jsonld",
          "usageInfo": "https://docs.nde.nl/schema-profile/",
          "description": "JSON-LD dump"
        },
        {
          "@type": "DataDownload",
          "contentUrl": "https://data.bibliotheken.nl/id/dataset/sparql",
          "documentation": "https://data.bibliotheken.nl/KB/Production/sparql",
          "usageInfo": "https://www.w3.org/TR/sparql11-protocol/",
          "description": "SPARQL endpoint"
        }
      ]
    },
    {
      "@type": "Organization",
      "@id": "https://example.com",
      "identifier": "test",
      "name": {
        "@value": "Koninklijke Bibliotheek",
        "@language": "nl"
      },
      "sameAs": "https://ror.org/02w4jbg70",
      "contactPoint": {
        "@type": "ContactPoint",
        "name": { "@value": "Datasets Manager", "@language": "en" },
        "email": "datasets@example.com"
      }
    }
  ]
}

5. Tools

This section is non-normative.

Developers can use the NDE Register API to validate datasets and catalogs against the requirements described in this document. The [SHACL] shape graph used to validate datasets and catalogs is available at /shacl.

Google’s Rich Results Test (previously Structured Data Testing Tool) can help with testing RDF metadata in general.

6. Changes

This section lists notable changes to this specification.

6.1. Version 1.11.0 (2026-04-23)

State temporal coverage format in description and validate dc:temporal (f71bde4).

6.2. Version 1.10.0 (2026-04-23)

Add schema:documentation to distribution for documentation page URL (8e30b05).
Plain-HTML sh:description, spec-side Bikeshed expansion, drop sh:name (28198fe).
Use HTML anchors for links in sh:description (49642a7).
Render maxCount 0 and ‘becomes forbidden’ in attribute tables (47af7ce).

6.3. Version 1.9.0 (2026-04-22)

Downgrade missing organisation identifier to sh:Info (6624d71).
Recommend schema:about instead of schema:keywords for URIs (fcf9055).
Consolidate publisher/creator shapes (b533f0a).

6.4. Version 1.8.0 (2026-04-20)

Require SPARQL protocol URI in usageInfo for SPARQL mediaType (0dfc6c6).
Align Distribution with DCAT-AP-NL 3.0 (title + description) (e901005).
Split publisher min/max-count constraints into separate shapes (2e060e4).

6.5. Version 1.7.0 (2026-04-20)

Map schema:temporalCoverage to dct:PeriodOfTime (a700556).
Accept ISO 8601 shortened-end intervals in temporalCoverage (61ffdf7).

6.6. Version 1.6.0 (2026-04-16)

Deprecate schema:genre; make schema:about canonical for dcat:theme (bdbd789).
Target sh:nodeKind sh:IRI for contentUrl in v2, not xsd:anyURI (1e86b9f).

6.7. Version 1.5.3 (2026-04-15)

Add Linked Art profile to distribution IRI table (3c01ad5).

6.8. Version 1.5.2 (2026-04-13)

Align publisher cardinality with DCAT-AP-NL 3.0 (6af64c7).
Mark ISO-8601 date pattern as v2.0 violation (62d22e1).
Mark schema:contactPoint as v2.0 violation (7aeab03).
Validate ISO-8601 dates on DCAT date properties (ee09aec).

6.9. Version 1.5.1 (2026-04-10)

Correct v2.0 annotation for IRI-constrained properties in spec (63e694a).
Use single sh:message on SPARQL constraint for Jena compatibility (5e67a50).
Add resolvable dataset URI requirement and update mainEntityOfPage (59f995d).

6.10. Version 1.5.0 (2026-04-02)

Recommend a set of canonical license URIs (ec38c3a).

6.11. Version 1.4.0 (2026-04-01)

Add Content-Type requirement for registration URLs (524dc42).
Add dct:accrualPeriodicity support for datasets (8eebd71).
Add maxCount 1 as future change for encodingFormat and mediaType (9364133).
Preserve user-provided dcat:theme from DCAT input (343b176).
Remove dct:title from Distribution (d957e44).
Validate includedInDataCatalog as HTTP IRI (08484c6).
Validate language tags on title, description, and name properties (d99f2fb).
Validate spatialCoverage and dct:spatial as HTTP IRI (bbc02b5).
Skip SHACL validation of inline DataCatalog references (37c9437).
Rewrite ‘Developer documentation’ section as ‘Usage information’ (ad5c22c).

6.12. Version 1.3.0 (2026-04-01)

Differentiate API and download distributions per DCAT-AP-NL (ba22143).
Normalize BCP 47 language codes to EU Language Authority URIs (cbf1932).
Add Person to conceptual model diagram (cd0b429).

6.13. Version 1.2.0 (2026-03-25)

Add DCAT-AP-NL 3.0 requirements (510a43c).

6.14. Version 1.1.0 (2026-03-25)

Express future requirement changes in SHACL and spec (3c3fbb7).