
Pipeline

Draft – under review

Part of the Stack documentation (overview). Not yet endorsed by NDE.

Why pipelines

Pipelines are how the Data Layer gets built. Every derived product – search indexes, knowledge graphs, term mappings, dataset analyses – is produced by a pipeline that reads from sources, transforms the data into a target shape, and materialises it into a sink. The Stack’s substrates (A, B, C) are inputs; pipelines turn them into outputs.

Different Data Layer products share most of the work: every pipeline runs the same stages – discover, select, read, transform, validate, materialise, serve – diverging only from Transform onward, where the target shape takes over. Pipelines factor these stages so each product is built by composing the same primitives differently. The patterns chapter names the operational mechanics individually; this chapter shows how they compose into a working pipeline.

The Stack standardises on standards-backed pipelines – SHACL for shape constraints, SPARQL for extraction, JSON-LD Framing for output reshaping – instead of bespoke Python scripts. Each pipeline is configured by declarative linked-data standards, not by custom transformation code. This is what makes pipelines portable across deployments, lets the same SHACL drive both the build and the API contract, and keeps the system inspectable: each stage’s input and output is RDF that any standards-compliant tool can read.

@lde/pipeline

@lde/pipeline is the Stack’s orchestration framework. Pipelines compose Executors (stages that read, transform, or validate RDF) and Writers (sinks that materialise output: files, SPARQL stores, search engines). Every Data Layer pipeline – the Search Pipeline, knowledge-graph building, Term Backlink Graph – is @lde/pipeline configured differently.
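The composition can be pictured with a minimal sketch. The `Executor` and `Writer` interfaces, the `runPipeline` helper, and the `Triple` type below are hypothetical names chosen for illustration; they are not the actual @lde/pipeline API, and RDF is reduced to plain triples to keep the sketch self-contained.

```typescript
// Hypothetical types for illustration only; not the actual @lde/pipeline API.
type Triple = { s: string; p: string; o: string };

// An executor reads, transforms, or validates RDF (here: a list of triples).
interface Executor {
  name: string;
  run(input: Triple[]): Triple[];
}

// A writer materialises output into a sink (file, SPARQL store, search engine).
interface Writer {
  write(output: Triple[]): void;
}

// Run the executors in order, then hand the result to the writer.
function runPipeline(source: Triple[], executors: Executor[], writer: Writer): Triple[] {
  const result = executors.reduce((data, executor) => executor.run(data), source);
  writer.write(result);
  return result;
}

// Example executor: keep only triples with the schema:name predicate.
const keepNames: Executor = {
  name: "keep-names",
  run: (input) => input.filter((t) => t.p === "schema:name"),
};

// Example writer: an in-memory sink standing in for a real store.
class MemoryWriter implements Writer {
  readonly written: Triple[] = [];
  write(output: Triple[]): void {
    this.written.push(...output);
  }
}

const memorySink = new MemoryWriter();
const pipelineOutput = runPipeline(
  [
    { s: "ex:d1", p: "schema:name", o: "Dataset one" },
    { s: "ex:d1", p: "schema:license", o: "ex:cc0" },
  ],
  [keepNames],
  memorySink,
);
```

Under these assumptions, a different product is the same `runPipeline` call with a different executor list and a different writer.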

This chapter names the axes along which a pipeline is configured and shows how the patterns slot into each axis. A builder picks one option per axis; the choices interact in known ways.

Stages

Every Data Layer pipeline runs the same spine of seven stages from source to served API. A pipeline is built over stages 1–6 and serves at stage 7. Each configuration axis attaches to one stage; the per-pipeline examples below expand the stages where they specialise.

| # | Stage | What it does | Configured by |
|---|---|---|---|
| 1 | Discover | Obtain the candidate source list | Discovery axis |
| 2 | Select | Filter to the sources worth ingesting | Selection axis |
| 3 | Read | Load source RDF – all selected sources, or only changed | Scope axis |
| 4 | Transform | Convert source RDF into the target shape | the specialisation point – no axis |
| 5 | Validate | SHACL-check the output and handle violations | Validation axis |
| 6 | Materialise & deploy | Write to the sink and make it live | Deploy axis; State axis pairs with it |
| 7 | Serve (query-time) | Expose the typed API | derived from the transformation contract – no axis |

Stages 1–3 are shared upstream; pipelines diverge from Transform onward, where the target shape determines validation, sink, and API. The Search Pipeline expands stages 4–7 into finer phases; the Knowledge Graph and EDM export pipelines expand them differently.
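As a rough illustration, the stage spine can be written down as data and split at Transform. The `Stage` type and its field names are assumptions made for this sketch, not part of any Stack API.

```typescript
// Illustrative only: the seven stages as data, with hypothetical field names.
type Stage = {
  n: number;
  name: string;
  axis: string | null; // null where the stage has no configuration axis
};

const STAGES: Stage[] = [
  { n: 1, name: "Discover", axis: "Discovery" },
  { n: 2, name: "Select", axis: "Selection" },
  { n: 3, name: "Read", axis: "Scope" },
  { n: 4, name: "Transform", axis: null },
  { n: 5, name: "Validate", axis: "Validation" },
  { n: 6, name: "Materialise & deploy", axis: "Deploy" }, // the State axis pairs with Deploy
  { n: 7, name: "Serve", axis: null },
];

const TRANSFORM = 4;

// Stages before Transform are shared upstream by every pipeline.
const sharedUpstream = STAGES.filter((s) => s.n < TRANSFORM).map((s) => s.name);

// From Transform onward, the target shape determines what each stage does.
const specialised = STAGES.filter((s) => s.n >= TRANSFORM).map((s) => s.name);

// Build time covers stages 1 to 6; stage 7 runs at query time.
const buildStages = STAGES.filter((s) => s.n <= 6).map((s) => s.name);
```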

Configuration axes

| Axis | Question | Options | Patterns |
|---|---|---|---|
| Discovery | Where does the source list come from? | DCAT-AP registry; static config | Discovery via DCAT-AP Registry |
| Selection | Which discovered sources to ingest? | Declared metadata only; Declared + VoID-derived | Augmented Dataset Selection |
| Scope | Which selected sources does each run process? | All selected sources; Only changed sources | Change-driven Rebuild |
| Validation | How is the transformation checked, and what happens to violations? | Reuse transformation SHACL (search); Separate target shapes (EDM); Sampled profile (DKG). Policy: fail / drop / report / sample | Non-conformant source data |
| Deploy | How does the result land in the sink? | Blue/green swap; In-place upsert + sweep | Blue/green Rebuild, In-place Rebuild |
| State | What carries between runs? | Nothing; Per-source transformation cache; Per-doc last_seen | Last-known-good Per-source Caching |
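One way to picture "one option per axis" is as a typed configuration object. The `PipelineConfig` type and its option identifiers below are hypothetical; they mirror the table's options and are not a real @lde/pipeline configuration format.

```typescript
// Hypothetical configuration shape; option names mirror the axes table, not a real API.
type PipelineConfig = {
  discovery: "dcat-ap-registry" | "static-config";
  selection: "declared-only" | "declared-plus-void";
  scope: "all-selected" | "only-changed";
  validation: {
    shapes: "reuse-transformation-shacl" | "separate-target-shapes" | "sampled-profile";
    policy?: "fail" | "drop" | "report" | "sample";
  };
  deploy: "blue-green" | "in-place";
  state: "none" | "per-source-cache" | "per-doc-last-seen";
};

// Example: full scope with a blue/green swap and no state carried between runs.
const exampleConfig: PipelineConfig = {
  discovery: "dcat-ap-registry",
  selection: "declared-only",
  scope: "all-selected",
  validation: { shapes: "reuse-transformation-shacl" },
  deploy: "blue-green",
  state: "none",
};

// The axes a builder has to decide on, read off the configuration object.
const configuredAxes = Object.keys(exampleConfig);
```

Encoding each axis as a union type means that picking an option outside the table becomes a compile-time error in this sketch.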

Source input may be the source itself (with the pipeline running its own change detection) or an upstream LDES feed from a Change Stream Producer. Both feed the Transform step the same way; the choice is per-deployment.

Composing the axes

The axes are mostly independent. The one meaningful interaction:

  • Scope × Deploy. Change-driven + Blue/green needs Last-known-good Caching so unchanged sources can reuse their previous transformation during a full alias swap. Change-driven + In-place needs per-doc last_seen state for the sweep. Full-scope + Blue/green is the simplest configuration: re-transform every source every run, no per-source state.
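The interaction above can be sketched as a small lookup from Scope and Deploy to the state a run needs. The type and function names are hypothetical, and the combination the text does not spell out (full scope with in-place deploy) is deliberately left undefined.

```typescript
// Hypothetical option names for the Scope, Deploy, and State axes.
type Scope = "all-selected" | "only-changed";
type Deploy = "blue-green" | "in-place";
type State = "none" | "per-source-cache" | "per-doc-last-seen";

// State required by a Scope × Deploy combination, as described in the text.
function requiredState(scope: Scope, deploy: Deploy): State | undefined {
  if (scope === "only-changed" && deploy === "blue-green") {
    // Unchanged sources reuse their previous transformation during the full swap.
    return "per-source-cache";
  }
  if (scope === "only-changed" && deploy === "in-place") {
    // The sweep needs to know when each document was last seen.
    return "per-doc-last-seen";
  }
  if (scope === "all-selected" && deploy === "blue-green") {
    // Simplest configuration: re-transform everything, carry nothing over.
    return "none";
  }
  // Full scope + in-place is not covered by the text, so this sketch does not guess.
  return undefined;
}
```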

Example: Search pipeline

Scope and State below are expressed in terms of the search pipeline’s named update modes: Mode 1 (full rebuild, the default), Mode 2 (per-source incremental, designed in but not yet active), and Mode 3 (per-resource, future).

| Axis | Choice |
|---|---|
| Discovery | DCAT-AP registry (Dataset Register) |
| Selection | Augmented (declared dcterms:conformsTo + VoID class partitions) |
| Scope | Full today (Mode 1); change-driven planned (Mode 2/3) |
| Validation | Reuse transformation SHACL – the SHACL-honest policy (validator rules = indexer decisions) |
| Deploy | Blue/green – Typesense alias swap |
| State | None today (Mode 1); per-doc source + last_seen designed in for Mode 2 |

Term Backlink Graph, for comparison:

| Axis | Choice |
|---|---|
| Discovery | DCAT-AP registry (Dataset Register) |
| Selection | Declared metadata only – any RDF source, no AP-conformance filter |
| Scope | Full re-transformation per run (until LDES adoption broadens) |
| Validation | None – data-model-agnostic walk, no shapes to conform to |
| Deploy | Blue/green – QLever directory-level swap |
| State | Per-source transformation cache via Last-known-good Caching |

The differing choices encode the structural difference between the two pipelines: the Search pipeline is AP-aware (it drives a SHACL transformation against SCHEMA-AP-NDE), while the Term Backlink Graph is data-model-agnostic (it walks any RDF for term URI references). The axis structure is the same.
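To make the comparison concrete, the two sets of axis choices can be held as plain records and diffed. The values below are abbreviated from the choices listed for the Search pipeline and the Term Backlink Graph (Scope is reduced to today's behaviour, a full run), and `differingAxes` is a hypothetical helper.

```typescript
type AxisChoices = Record<string, string>;

// Abbreviated axis choices; Scope reflects what each pipeline does today.
const searchPipeline: AxisChoices = {
  Discovery: "DCAT-AP registry",
  Selection: "Augmented (declared + VoID)",
  Scope: "Full",
  Validation: "Reuse transformation SHACL",
  Deploy: "Blue/green (Typesense alias swap)",
  State: "None",
};

const termBacklinkGraph: AxisChoices = {
  Discovery: "DCAT-AP registry",
  Selection: "Declared metadata only",
  Scope: "Full",
  Validation: "None",
  Deploy: "Blue/green (QLever directory-level swap)",
  State: "Per-source transformation cache",
};

// List the axes on which two pipelines make different choices.
function differingAxes(a: AxisChoices, b: AxisChoices): string[] {
  return Object.keys(a).filter((axis) => a[axis] !== b[axis]);
}

const changedAxes = differingAxes(searchPipeline, termBacklinkGraph);
```

With these abbreviated values the two pipelines agree on Discovery and Scope and differ on the remaining four axes, while the set of axes itself is identical.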

Example: EDM export (loda-pipeline)

loda-pipeline (PR #14) transforms heritage datasets into EDM for aggregation into Europeana. It is a downstream consumer of SCHEMA-AP-NDE-first: it maps the SCHEMA-AP-NDE pivot to EDM, so a single mapping covers every source instead of one mapping per source data model. EDM export runs today on legacy LD Workbench; the @lde/pipeline + QLever migration in that PR is what makes it a Stack-conformant pipeline and is not yet on main. The axis choices below describe that migrated pipeline.

| Axis | Choice |
|---|---|
| Discovery | DCAT-AP registry (Dataset Register), filtering out already-transformed EDM distributions |
| Selection | Datasets published in SCHEMA-AP-NDE – the pivot shape is the mapping’s input contract |
| Scope | Full re-transformation per run |
| Validation | Separate EDM shapes – report violations and continue |
| Deploy | EDM output for Europeana aggregation (file artifact, not a live sink swap) |
| State | None |

The transformation is SPARQL CONSTRUCT mapping SCHEMA-AP-NDE to EDM, not a reshaping of the framed document: each dataset’s SCHEMA-AP-NDE is imported into QLever and CONSTRUCTed into EDM, then SHACL-validated against EDM shapes. Per-dataset CONSTRUCTs are split into separate executors (e.g. edm:ProvidedCHO + aggregation + constants) joined by UNION rather than OPTIONAL, so multi-valued properties don’t multiply into cross-product duplicates. The axis structure is the same as the Search and KG pipelines, aimed at a third target shape.
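The reason for preferring UNION can be shown by counting solution rows. The sketch below does not run SPARQL; it only models how many rows a single pattern binding two multi-valued properties produces compared with a UNION of one branch per property. The property names (`title`, `creator`) and sample values are made up for illustration.

```typescript
type Row = Record<string, string>;

// One pattern binding both properties: every title pairs with every creator.
function joinedRows(titles: string[], creators: string[]): Row[] {
  const rows: Row[] = [];
  for (const title of titles) {
    for (const creator of creators) {
      rows.push({ title, creator });
    }
  }
  return rows;
}

// UNION of two branches: each branch binds one property on its own.
function unionRows(titles: string[], creators: string[]): Row[] {
  const rows: Row[] = [];
  for (const title of titles) rows.push({ title });
  for (const creator of creators) rows.push({ creator });
  return rows;
}

const sampleTitles = ["Title (nl)", "Title (en)", "Title (fr)"];
const sampleCreators = ["Creator A", "Creator B"];

// 3 titles × 2 creators in the joined pattern, 3 + 2 in the UNION.
const joinedCount = joinedRows(sampleTitles, sampleCreators).length;
const unionCount = unionRows(sampleTitles, sampleCreators).length;
```

The joined pattern grows multiplicatively with each additional multi-valued property, whereas the UNION grows additively. This matters most when the CONSTRUCT template contains blank nodes, because a fresh blank node is created for every solution row.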

When to extend the axis set

A new axis is warranted when a real choice opens up that the existing axes don’t capture. Candidates to keep an eye on:

  • Trigger axis (push: webhooks, LDN (Linked Data Notifications)) when network sources begin to offer push notifications. Today every pipeline is pull-based and scheduled, so the trigger is a constant rather than a choice; the Scope axis above covers the only real lever (process all selected sources vs. only changed ones). A separate Trigger axis lands the day push arrives. This source-change trigger is orthogonal to the config-change trigger that already exists – the schema fingerprint that fires a full Blue/green Rebuild when index-affecting configuration changes is deploy-driven, not source-driven, so it is not the Trigger axis discussed here.
  • Cross-pipeline event publishing (an upstream substrate-B pipeline that fans out to per-transformation consumers) – currently aspirational; would warrant a Distribution axis if built.

Until those land, the six axes above cover the configurations the Stack actually deploys today.