feat: rdflib reference SPARQL and SHACL engines (future, not 0.1.0) #74
Draft
Haigutus wants to merge 1 commit into
First reference (correctness-first) implementations of SPARQL querying and SHACL validation over triplet data, both built on rdflib and fed by the N-Quads export. Future fast engines (qlever, pandas/polars/duckdb SHACL) plug into the same registry dispatchers.

Not part of the 0.1.0 release line: feature branch off main; dev_shacl stays frozen as idea-reference.

- export_to_nquads gains export_to_memory=True → BytesIO (csv/cimxml convention); no temp files in the load path.
- triplets/_rdflib_loader.py: load_dataset(data, rdf_map) → in-memory nquads → rdflib.Dataset(default_union=True); scoped_graph(ds, scope) restricts to INSTANCE_ID named graphs (validate one profile or a dependent set while all data stays loaded for reference resolution).
- triplets/sparql/: registry dispatcher (rdflib reference; no oxigraph) + query(): SELECT→DataFrame, ASK→bool, CONSTRUCT/DESCRIBE→triplets.
- triplets/validation/: registry dispatcher (pyshacl reference) + validate() → canonical violations DataFrame (single columnar pass over the report graph); shapes loaded from .ttl/.rdf by extension.
- df.sparql.query / df.shacl.validate root accessors (pandas + polars); duckdb connections supported via the loader (no .query collision).
- extras: sparql (rdflib), validation (pyshacl); both lazy-imported so `import triplets` works without them.

Tests (Svedala, skip-guarded; rdflib/pyshacl importorskip):

- SPARQL: SELECT count == type_tableview, ASK/CONSTRUCT, typed-literal round trip, INSTANCE_ID scope, pandas/polars/duckdb parity.
- SHACL: typed-conforms vs untyped datatype violations, scope, violations schema; real CGMES shapes test marked performance (opt-in).

Full suite: 193 passed.
Future feature, not part of the 0.1.0 release line. Clean branch off `main`; `dev_shacl` stays frozen as idea-reference. Draft for review.

First reference (correctness-first) implementations of SPARQL querying and SHACL validation over triplet data, both rdflib-based, fed by the datatype-aware N-Quads export. Faster engines (qlever for SPARQL; pandas/polars/duckdb for SHACL) plug into the same registry dispatchers later.
What's here
- `export_to_nquads(..., export_to_memory=True)` returns a BytesIO (csv/cimxml convention); the rdflib load path touches no temp files.
- `_rdflib_loader.py`: `load_dataset()` → `rdflib.Dataset(default_union=True)` from in-memory nquads; `scoped_graph(ds, scope)` restricts to INSTANCE_ID named graphs so you can validate/query a single profile or a dependent set while all data stays loaded for reference resolution.
- `triplets.sparql`: registry dispatcher (`rdflib` reference; no oxigraph by design) + `query()`: SELECT→DataFrame, ASK→bool, CONSTRUCT/DESCRIBE→triplets.
- `triplets.validation`: registry dispatcher (`pyshacl` reference) + `validate()` → canonical violations DataFrame `[ID, KEY, VALUE, VIOLATION_TYPE, MESSAGE, SEVERITY, SOURCE_SHAPE]`, built in a single columnar pass over the report graph (the schema the future vectorized engines will produce natively).
- `df.sparql.query` / `df.shacl.validate` as separate root namespaces (pandas + polars); duckdb connections work via the loader.
- Extras: `sparql` (rdflib), `validation` (pyshacl); both lazy-imported, so `import triplets` works without them (verified).

Highlights from testing

- `COUNT(ACLineSegment)` == `type_tableview` row count (cross-engine consistency).
- With `rdf_map` → 0 violations (conforms); without → 97 `sh:datatype` violations, because untyped `Conductor.length` fails `xsd:float`. Demonstrates why schema-driven datatypes matter for validation.
- Real CGMES shapes test is `performance`-marked (pyshacl on full shapes takes minutes) and skip-guarded on the external shapes path.

Design follows TARGET_ARCHITECTURE (dev_shacl): registry dispatch, schema-optional, rdflib as the always-works reference, separate shacl/sparql accessors.
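The typed-versus-untyped result in the testing highlights comes down to how a literal is written in N-Quads. The sketch below is a stdlib-only illustration, not the PR's exporter or pyshacl: `nquads_literal`, `literal_datatype`, `violates_datatype`, and `datatype_map` are hypothetical names, the value `12.5` is made up, and only the datatype IRI is compared (lexical validity and language-tagged literals are ignored).

```python
from typing import Dict, Optional

XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical stand-in for a schema-derived datatype mapping.
datatype_map: Dict[str, str] = {"Conductor.length": XSD + "float"}


def nquads_literal(value: str, datatype: Optional[str] = None) -> str:
    """Serialize a literal as an N-Quads term, typed only if a datatype is given."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"^^<{datatype}>' if datatype else f'"{escaped}"'


def literal_datatype(term: str) -> str:
    """Return the datatype IRI of an N-Quads literal term.

    In RDF 1.1 a literal with no explicit datatype is an xsd:string.
    """
    if "^^<" in term and term.endswith(">"):
        return term.rsplit("^^<", 1)[1][:-1]
    return XSD + "string"


def violates_datatype(term: str, expected: str) -> bool:
    """Simplified sh:datatype check: compare datatype IRIs only."""
    return literal_datatype(term) != expected


# Same value, exported with and without the schema-derived datatype.
typed = nquads_literal("12.5", datatype_map.get("Conductor.length"))
untyped = nquads_literal("12.5")

print(typed, violates_datatype(typed, XSD + "float"))
print(untyped, violates_datatype(untyped, XSD + "float"))
```

The untyped term is treated as an `xsd:string`, so a shape requiring `xsd:float` rejects it even though the lexical value looks numeric.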