Summary
Replace the (stubbed, never-implemented) CodeQL-based level-2 enrichment with Jelly (Aarhus University), and build taint analysis on top of it. Jelly is a pure-TypeScript static analyzer (call graph + flow-insensitive points-to + access paths) — BSD-3-Clause, zero native dependencies — so unlike CodeQL or Joern it can be embedded in-process inside the bun --compile cants binary, preserving the single-self-contained-binary goal.
This issue tracks three pieces of work: (1) full Jelly integration with a taint layer, (2) removing CodeQL from the codebase/docs, and (3) adding the missing LICENSE.
1. Full Jelly integration (level-2 enrichment + taint)
Why Jelly: pure JS/TS (Babel-based), no node-gyp/native addons, BSD-3-Clause (compatible with our Apache-2.0), direct TypeScript support, JSON call-graph output. Embeddable in cants — no JVM, no sidecar.
Known friction (from evaluation):
- CLI-only, no public library API → vendor src/ and hook the solver internals.
- Parses with Babel (type-stripping, not type-aware), whereas our level-1 uses the tsc resolver → two foundations to reconcile.
- Output is location-based; our call_graph is keyed by canonical callable signatures that byte-match symbol_table keys → needs an identity-mapping layer.
- Jelly gives call graph + points-to, not taint → we build the taint layer.
- Node >=22 engine / Bun compatibility unverified → must test under Bun.
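Tasks
- Spike Jelly on test/fixtures/sample-app: capture its cg.json schema, locate the constraint-graph data structures (FragmentState / AnalysisState subset edges + tokens), and confirm it runs under Bun.
- Vendor Jelly (src/) under e.g. src/semantic_analysis/jelly/; preserve its BSD-3-Clause notice.
- […] call_graph (provenance jelly), via mergeEdges.
- […] cants binary (in-process, no external runtime).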
Taint analysis layer (build on Jelly's value-flow graph)
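Taint = labeled reachability over Jelly's inclusion/points-to edges, seeded at sources, blocked at sanitizers, checked at sinks. Flow-insensitive + context-insensitive → cheap and over-approximate (sound-leaning, FP-prone), which is the desired default; precision knobs below.
- […] +, template literals, property read/write.
- Precision knobs: --taint-assume-passthrough (unmodeled call ⇒ taint passes through), collapse containers, field-insensitive mode, optional ignore-sanitizers.
- Output: taint_flows section in analysis.json (source/sink models, locations, path, sanitized flag); optionally tag carrier call_graph edges.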
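The reachability framing above can be sketched as a worklist traversal. This is a minimal illustration under stated assumptions, not Jelly's API: the FlowGraph shape, the string node ids, and the names propagateTaint and rebuildPath are hypothetical stand-ins for whatever the constraint graph turns out to expose after the spike.

```typescript
// Hypothetical graph shape: an edge a -> b means "a value at a may flow to b".
type NodeId = string;

interface FlowGraph {
  edges: Map<NodeId, NodeId[]>;
}

interface TaintHit {
  source: NodeId;
  sink: NodeId;
  path: NodeId[];
}

// Walk the recorded predecessors back from the sink to the source.
function rebuildPath(parent: Map<NodeId, NodeId | null>, end: NodeId): NodeId[] {
  const path: NodeId[] = [];
  for (let cur: NodeId | null = end; cur !== null; cur = parent.get(cur) ?? null) {
    path.push(cur);
  }
  return path.reverse();
}

function propagateTaint(
  graph: FlowGraph,
  sources: NodeId[],
  sinks: Set<NodeId>,
  sanitizers: Set<NodeId>,
): TaintHit[] {
  const hits: TaintHit[] = [];
  for (const source of sources) {
    // parent doubles as the visited set and as the path-reconstruction map.
    const parent = new Map<NodeId, NodeId | null>([[source, null]]);
    const worklist: NodeId[] = [source];
    while (worklist.length > 0) {
      const node = worklist.pop() as NodeId;
      if (sinks.has(node)) {
        hits.push({ source, sink: node, path: rebuildPath(parent, node) });
      }
      for (const next of graph.edges.get(node) ?? []) {
        // Taint never enters a sanitizer node, so nothing behind it is reached via this edge.
        if (parent.has(next) || sanitizers.has(next)) continue;
        parent.set(next, node);
        worklist.push(next);
      }
    }
  }
  return hits;
}
```

Treating sanitizers as nodes that taint never enters keeps this a plain reachability problem. Tracking several taint kinds at once would need the visited set keyed by (node, kind) rather than by node alone.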
Configurable taint models (sources / sinks / sanitizers as args)
Models are data the analyzer consumes, never hardcoded: ship sensible built-ins that the user can extend, override, or fully replace at invocation. The engine (fixpoint) is decoupled from where a model came from.
CLI surface (extends the commander setup in src/cli.ts):
    --taint                      enable taint (or implied by -a 2)
    --taint-config <path>        JSON model file ('-' = stdin)
    --source <spec>              inline source (repeatable)
    --sink <spec>                inline sink (repeatable)
    --sanitizer <spec>           inline sanitizer (repeatable)
    --taint-builtins <on|off>    load the bundled default pack (default on)
Precedence (later extends/overrides earlier): built-in pack → --taint-config file → inline --source/--sink/… flags. File = real, versionable specs; inline flags = quick one-offs.
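The precedence rule above amounts to a layered merge. The sketch below shows one way it could work, assuming that a later spec replaces an earlier one with the same id and that a disable entry removes the spec whose id equals it. Both rules, the simplified Spec and TaintModel shapes, and the function names are assumptions made for illustration.

```typescript
// Simplified, hypothetical model shapes: only the fields needed to show layering.
interface Spec {
  id?: string;
  kind?: string;
  match: Record<string, unknown>;
}

interface TaintModel {
  sources: Spec[];
  sinks: Spec[];
  sanitizers: Spec[];
  disable?: string[];
}

// A later spec replaces an earlier one with the same id; specs without an id only accumulate.
function mergeSpecs(base: Spec[], overlay: Spec[]): Spec[] {
  const byId = new Map<string, Spec>();
  const anonymous: Spec[] = [];
  for (const spec of [...base, ...overlay]) {
    if (spec.id === undefined) anonymous.push(spec);
    else byId.set(spec.id, spec);
  }
  return [...Array.from(byId.values()), ...anonymous];
}

// Layers are passed lowest precedence first: built-in pack, config file, inline flags.
function layerModels(layers: TaintModel[]): TaintModel {
  let merged: TaintModel = { sources: [], sinks: [], sanitizers: [], disable: [] };
  for (const layer of layers) {
    merged = {
      sources: mergeSpecs(merged.sources, layer.sources),
      sinks: mergeSpecs(merged.sinks, layer.sinks),
      sanitizers: mergeSpecs(merged.sanitizers, layer.sanitizers),
      disable: [...(merged.disable ?? []), ...(layer.disable ?? [])],
    };
  }
  // Apply disables last so any layer can switch off a spec from any other layer.
  const disabled = new Set(merged.disable ?? []);
  const keep = (spec: Spec): boolean => spec.id === undefined || !disabled.has(spec.id);
  return {
    sources: merged.sources.filter(keep),
    sinks: merged.sinks.filter(keep),
    sanitizers: merged.sanitizers.filter(keep),
    disable: merged.disable,
  };
}
```

With --taint-builtins off, the caller would simply omit the first layer, which keeps the fixpoint engine unaware of where any spec came from.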
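Spec schema:

    {
      "sources": [
        { "id": "express-query", "kind": "user-input", "match": { "accessPath": "express.request.query" } }
      ],
      "sinks": [
        { "id": "child-exec", "kind": "command-injection", "match": { "callee": "child_process.exec", "args": [0] } }
      ],
      "sanitizers": [
        { "match": { "callee": "encodeURIComponent" } }
      ],
      "passthrough": [
        { "match": { "callee": "JSON.parse" }, "from": [0], "to": "return" }
      ],
      "rules": [
        { "from": "user-input", "to": "command-injection", "id": "cmd-injection" }
      ],
      "disable": ["builtin:eval-sink"]
    }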
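A validator for model files could start from a structural check like the one below. It is a sketch only: the requirement that every spec's match carries at least one identity key (accessPath, callee, qualifiedName, or signature) is an assumption, as are the validateModel name and the error wording.

```typescript
const SECTIONS = ["sources", "sinks", "sanitizers", "passthrough"];
// Assumed rule: a match must name at least one of these identity keys.
const IDENTITY_KEYS = ["accessPath", "callee", "qualifiedName", "signature"];

// Returns a list of human-readable problems; an empty list means the model passed.
function validateModel(input: unknown): string[] {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return ["model must be a JSON object"];
  }
  const model = input as Record<string, unknown>;
  const errors: string[] = [];
  for (const section of SECTIONS) {
    const specs = model[section];
    if (specs === undefined) continue; // every section is optional
    if (!Array.isArray(specs)) {
      errors.push(`${section} must be an array`);
      continue;
    }
    specs.forEach((spec: unknown, index: number) => {
      const match =
        typeof spec === "object" && spec !== null
          ? (spec as Record<string, unknown>)["match"]
          : undefined;
      if (typeof match !== "object" || match === null) {
        errors.push(`${section}[${index}]: missing match object`);
        return;
      }
      const keys = match as Record<string, unknown>;
      if (!IDENTITY_KEYS.some((key) => typeof keys[key] === "string")) {
        errors.push(`${section}[${index}]: match needs one of ${IDENTITY_KEYS.join(", ")}`);
      }
    });
  }
  return errors;
}
```

Collecting every problem instead of throwing on the first one lets the CLI report all issues in a user-supplied file in a single run.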
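match keys: dual identity so users can target both library and first-party code:

| match key | targets | resolved via |
| --- | --- | --- |
| accessPath | library values/props (express.request.query) | Jelly access paths |
| callee (wildcards ok, *.query) | called functions | Jelly call graph → resolved target |
| qualifiedName / signature | first-party funcs the user marks | our symbol_table canonical keys |
| args / receiver | which param/this is the tainted slot | argument position at the call |

- […] match to Jelly constraint-graph nodes (access path → token; callee → call-arg nodes; signature → mapped node) to seed the fixpoint.
- […] --taint-config - (stdin) so the python-sdk can forward user-defined models without a temp file; report the matching source/sink/rule id per flow for explainability.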
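Wildcard callee patterns such as *.query could be matched by translating the pattern into an anchored regular expression. In this sketch, * matches any run of characters, dots included; whether a wildcard should instead stop at a dot is an open design choice. The function names are hypothetical.

```typescript
// Translate a callee pattern into an anchored RegExp.
// Assumption: "*" matches any run of characters, including dots.
function calleePatternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    // Escape regex metacharacters in the literal parts so "." only matches a dot.
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".*");
  return new RegExp(`^${escaped}$`);
}

function matchesCallee(pattern: string, callee: string): boolean {
  return calleePatternToRegExp(pattern).test(callee);
}
```

Anchoring at both ends avoids accidental substring matches, so *.query does not match a callee named req.queryAll.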
Suggested staging
- MVP — call-graph-only taint: intraprocedural def-use from ts-morph (already in-tree) + Jelly's call graph for inter-procedural arg→param/return→caller. Ships without Jelly-internals surgery.
- v2 — points-to-backed propagation over Jelly's constraint graph (alias-aware).
- v3 — models-as-data so sources/sinks are user-extensible without recompiling (Configurable taint models above).
Implementation guide & learning path
The order below keeps each step independently testable; the reading list is so the design rests on first principles rather than copied recipes.
Build order:
- Spike Jelly (task above) — get its call graph + constraint graph in hand on a real fixture.
- Identity-mapping layer (Jelly locations ↔ symbol_table signatures).
- MVP taint — call-graph-only propagation (no points-to); prove one source→sink flow end-to-end on a fixture.
- Swap the substrate to Jelly's points-to constraint graph (alias-aware).
- Configurable models: CLI flags + schema + validator (above).
- Precision knobs + path reconstruction + taint_flows output.
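For the identity-mapping step in the build order, one possible starting point is to resolve a Jelly location to the innermost symbol_table entry whose source range encloses it. The sketch below assumes line-based ranges and opaque signature strings. The JellyLocation and SymbolEntry shapes are hypothetical, since the real cg.json schema is a deliverable of the spike.

```typescript
// Hypothetical shapes; the real ones depend on cg.json and on symbol_table.
interface JellyLocation {
  file: string;
  line: number;
}

interface SymbolEntry {
  signature: string; // canonical symbol_table key, treated as opaque here
  file: string;
  startLine: number;
  endLine: number;
}

// Pick the smallest enclosing range so nested functions resolve to themselves,
// not to the function that contains them.
function resolveSignature(loc: JellyLocation, symbols: SymbolEntry[]): string | undefined {
  let best: SymbolEntry | undefined;
  for (const entry of symbols) {
    const encloses =
      entry.file === loc.file && entry.startLine <= loc.line && loc.line <= entry.endLine;
    if (!encloses) continue;
    const size = entry.endLine - entry.startLine;
    if (best === undefined || size < best.endLine - best.startLine) best = entry;
  }
  return best?.signature;
}
```

Matching on enclosing ranges rather than exact start positions is meant to tolerate small differences between where Babel and tsc place the start of a declaration. Line granularity cannot tell apart two callables on the same line, so a real implementation would likely need columns as well.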
Concepts worth reading first:
- Andersen-style (inclusion-based) points-to analysis — the substrate Jelly computes; taint rides the same subset edges. (Start here; it's the mental model for everything else.)
- Taint as graph reachability / IFDS — Reps–Horwitz–Sagiv, "Precise Interprocedural Dataflow Analysis via Graph Reachability" (POPL '95). The canonical framing even though our MVP is flow-insensitive.
- Access paths: how Jelly models library values (express.request.query) without analyzing the library; this is what your match.accessPath keys off.
- Flow- vs context-sensitivity tradeoffs — why flow-insensitive = cheap + over-approximate (more FPs, sound-leaning).
- Jelly's lineage: the JAM / TAPIR / ACG papers linked from its README, covering what its points-to actually models and its known unsoundness (dynamic eval, reflection, etc.).
- Reference model designs (for what to model, not how): CodeQL's JS/TS sources & sinks and Semgrep's taint mode.
Reading Jelly's source: begin at the solver/constraint state (FragmentState / AnalysisState) — its subset edges + tokens are the value-flow graph you'll propagate taint over.
2. Drop CodeQL mentions
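CodeQL was only ever a stub; retarget all references to Jelly.
- src/semantic_analysis/codeql/codeql.ts → jelly/ (rename dir + buildCodeqlCallGraph → buildJellyCallGraph).
- src/core.ts: import + the level-2 enrich call/comment (lines ~4, ~29, ~31).
- src/cli.ts: --analysis-level help text (2 = + CodeQL enrichment → Jelly).
- src/options/options.ts: level-2 doc comment.
- src/semantic_analysis/callGraph.ts and src/semantic_analysis/index.ts: comments.
- README.md: --help block + "Deeper analysis" example + level-2 prose (lines ~77, ~115, ~122).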
3. Add a LICENSE (and resolve the licensing thread)
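The LICENSE file is missing: the README links to ./LICENSE and both package.json and pyproject.toml declare Apache-2.0, but no file exists.
- […] LICENSE (matches the codeanalyzer-python / -java siblings).
- […] (NOTICE or third-party license file) for attribution.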
Context: this supersedes the original CodeQL level-2 plan. The single-binary goal is the driver — Jelly is the only credible engine that runs in pure JS and can live inside the cants binary (Joern/CodeQL are external JVM/proprietary tools).