feat: rdflib reference SPARQL and SHACL engines (future, not 0.1.0) #74

Draft: Haigutus wants to merge 1 commit into main from feat/sparql-shacl-reference
Future feature — not part of the 0.1.0 release line. Clean branch off main; dev_shacl stays frozen as idea-reference. Draft for review.

First reference (correctness-first) implementations of SPARQL querying and SHACL validation over triplet data, both rdflib-based, fed by the datatype-aware N-Quads export. Faster engines (qlever for SPARQL; pandas/polars/duckdb for SHACL) plug into the same registry dispatchers later.

What's here

  • In-memory N-Quads: export_to_nquads(..., export_to_memory=True) returns a BytesIO (csv/cimxml convention); the rdflib load path touches no temp files.
  • Shared loader (_rdflib_loader.py): load_dataset() → rdflib.Dataset(default_union=True) from in-memory nquads; scoped_graph(ds, scope) restricts to INSTANCE_ID named graphs so you can validate/query a single profile or a dependent set while all data stays loaded for reference resolution.
  • triplets.sparql: registry dispatcher (rdflib reference; no oxigraph by design) + query(): SELECT→DataFrame, ASK→bool, CONSTRUCT/DESCRIBE→triplets.
  • triplets.validation: registry dispatcher (pyshacl reference) + validate() → canonical violations DataFrame [ID, KEY, VALUE, VIOLATION_TYPE, MESSAGE, SEVERITY, SOURCE_SHAPE], built in a single columnar pass over the report graph (the schema the future vectorized engines will produce natively).
  • Accessors: df.sparql.query / df.shacl.validate as separate root namespaces (pandas + polars); duckdb connections work via the loader.
  • Extras: sparql (rdflib), validation (pyshacl); both lazy-imported, so import triplets works without them (verified).

Highlights from testing

  • SPARQL COUNT(ACLineSegment) == type_tableview row count (cross-engine consistency).
  • SHACL with rdf_map → 0 violations (conforms); without → 97 sh:datatype violations, because untyped Conductor.length fails xsd:float. Demonstrates why schema-driven datatypes matter for validation.
  • Scope filter verified to restrict focus nodes by INSTANCE_ID.
  • Full suite: 193 passed, 44 skipped, 1 xfail. The real complex-CGMES-SHACL test is performance-marked (pyshacl on full shapes takes minutes) and skip-guarded on the external shapes path.
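
To make the typed-versus-untyped result concrete: in RDF 1.1 a literal written without a datatype is an xsd:string, so a shape requiring sh:datatype xsd:float rejects it even when the text looks numeric. The sketch below shows the difference at the serialization level only. The `RDF_MAP_SKETCH` dictionary and both function names are hypothetical; the actual rdf_map structure and export code are not shown in this PR.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical schema map: attribute key -> XSD datatype IRI.
RDF_MAP_SKETCH = {"Conductor.length": XSD + "float"}


def nquads_literal(value, datatype=None):
    """Serialise a value as an N-Quads literal; add ^^<datatype> only when one is known."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    if datatype is None:
        # Plain literal: implicitly xsd:string, so sh:datatype xsd:float would fail.
        return f'"{escaped}"'
    return f'"{escaped}"^^<{datatype}>'


def export_literal(key, value, rdf_map=None):
    """Look up the datatype for `key` in the optional map and serialise accordingly."""
    datatype = (rdf_map or {}).get(key)
    return nquads_literal(value, datatype)
```

Calling `export_literal("Conductor.length", "12.5")` yields a plain literal, while passing `RDF_MAP_SKETCH` yields one typed as xsd:float, which is the form a sh:datatype constraint accepts.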

Design follows TARGET_ARCHITECTURE (dev_shacl): registry dispatch, schema-optional, rdflib as the always-works reference, separate shacl/sparql accessors.
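
As a reading aid, the registry-dispatch idea can be sketched as follows. This is a minimal illustration under assumptions, not the dispatcher in this branch: `register_engine`, `_ENGINES`, the `engine=` keyword and the placeholder return value are all invented here.

```python
from typing import Callable, Dict

# Engine name -> callable implementing the query for that backend.
_ENGINES: Dict[str, Callable[..., object]] = {}


def register_engine(name: str):
    """Decorator that registers an engine implementation under `name`."""
    def decorator(func):
        _ENGINES[name] = func
        return func
    return decorator


def query(data, sparql: str, engine: str = "rdflib"):
    """Dispatch to the selected engine; the reference engine is the default."""
    try:
        impl = _ENGINES[engine]
    except KeyError:
        raise ValueError(
            f"Unknown SPARQL engine {engine!r}; available: {sorted(_ENGINES)}"
        ) from None
    return impl(data, sparql)


@register_engine("rdflib")
def _reference_engine(data, sparql):
    # Placeholder body: a real reference engine would load `data` and run the query.
    return ("rdflib", sparql)
```

A faster backend would register itself under another name with the same decorator, and callers would switch by passing a different `engine` value without touching the call site otherwise.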

First reference (correctness-first) implementations of SPARQL querying and
SHACL validation over triplet data, both built on rdflib and fed by the
N-Quads export. Future fast engines (qlever, pandas/polars/duckdb SHACL) plug
into the same registry dispatchers.

Not part of the 0.1.0 release line — feature branch off main; dev_shacl stays
frozen as idea-reference.

- export_to_nquads gains export_to_memory=True → BytesIO (csv/cimxml
  convention); no temp files in the load path.
- triplets/_rdflib_loader.py: load_dataset(data, rdf_map) → in-memory nquads →
  rdflib.Dataset(default_union=True); scoped_graph(ds, scope) restricts to
  INSTANCE_ID named graphs (validate one profile or a dependent set while all
  data stays loaded for reference resolution).
- triplets/sparql/: registry dispatcher (rdflib reference; no oxigraph) +
  query() — SELECT→DataFrame, ASK→bool, CONSTRUCT/DESCRIBE→triplets.
- triplets/validation/: registry dispatcher (pyshacl reference) + validate()
  → canonical violations DataFrame (single columnar pass over the report
  graph); shapes loaded from .ttl/.rdf by extension.
- df.sparql.query / df.shacl.validate root accessors (pandas + polars);
  duckdb connections supported via the loader (no .query collision).
- extras: sparql (rdflib), validation (pyshacl); both lazy-imported so
  `import triplets` works without them.
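
The lazy-import behaviour in the last item can be pictured with a small helper like the one below. It is a sketch only: `require` is a hypothetical name, and the branch may defer its imports differently (for example with imports placed inside the functions that need them).

```python
import importlib


def require(module_name: str, extra: str):
    """Import an optional dependency on first use, with a hint about the matching extra."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install the optional '{extra}' extra to enable it"
        ) from exc
```

Because the import happens inside the call rather than at module top level, importing the package itself never touches rdflib or pyshacl; only the first query or validation does.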

Tests (Svedala, skip-guarded; rdflib/pyshacl importorskip): SELECT count ==
type_tableview, ASK/CONSTRUCT, typed-literal round trip, INSTANCE_ID scope,
pandas/polars/duckdb parity; SHACL typed-conforms vs untyped datatype
violations, scope, violations schema; real CGMES shapes test marked
performance (opt-in). Full suite: 193 passed.
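
For context on the violations schema check, the sketch below shows one way a SHACL report could be flattened into the canonical columns. The mapping from SHACL result properties to column names is an assumption made for illustration and is not confirmed by this PR; the input is a plain iterable of (subject, predicate, object) strings rather than an rdflib graph, and the result is a dict of lists that a DataFrame constructor could consume.

```python
SH = "http://www.w3.org/ns/shacl#"

COLUMNS = ["ID", "KEY", "VALUE", "VIOLATION_TYPE", "MESSAGE", "SEVERITY", "SOURCE_SHAPE"]

# Assumed mapping from SHACL validation-result properties to canonical columns.
COLUMN_FOR_PREDICATE = {
    SH + "focusNode": "ID",
    SH + "resultPath": "KEY",
    SH + "value": "VALUE",
    SH + "sourceConstraintComponent": "VIOLATION_TYPE",
    SH + "resultMessage": "MESSAGE",
    SH + "resultSeverity": "SEVERITY",
    SH + "sourceShape": "SOURCE_SHAPE",
}


def violations_columns(report_triples):
    """Group result properties by result node in one pass, then emit one list per column."""
    rows = {}  # result node -> {column: value}, in first-seen order
    for subject, predicate, obj in report_triples:
        column = COLUMN_FOR_PREDICATE.get(predicate)
        if column is not None:
            rows.setdefault(subject, {})[column] = obj
    # Missing properties become None so every column has the same length.
    return {col: [row.get(col) for row in rows.values()] for col in COLUMNS}
```

Triples that are not result properties, such as sh:conforms on the report node, are ignored, so an empty or conforming report produces seven empty columns.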