Level-2: integrate Jelly (drop CodeQL), build taint analysis, add LICENSE #1

@rahlk

Summary

Replace the (stubbed, never-implemented) CodeQL-based level-2 enrichment with Jelly (Aarhus University), and build taint analysis on top of it. Jelly is a pure-TypeScript static analyzer (call graph + flow-insensitive points-to + access paths) — BSD-3-Clause, zero native dependencies — so unlike CodeQL or Joern it can be embedded in-process inside the bun --compile cants binary, preserving the single-self-contained-binary goal.

This issue tracks three pieces of work: (1) full Jelly integration with a taint layer, (2) removing CodeQL from the codebase/docs, and (3) adding the missing LICENSE.


1. Full Jelly integration (level-2 enrichment + taint)

Why Jelly: pure JS/TS (Babel-based), no node-gyp/native addons, BSD-3-Clause (compatible with our Apache-2.0), direct TypeScript support, JSON call-graph output. Embeddable in cants — no JVM, no sidecar.

Known friction (from evaluation):

  • CLI-only, no public library API → vendor src/ and hook the solver internals.
  • Parses with Babel (type-stripping, not type-aware), whereas our level-1 uses the tsc resolver → two foundations to reconcile.
  • Output is location-based; our call_graph is keyed by canonical callable signatures that byte-match symbol_table keys → needs an identity-mapping layer.
  • Jelly gives call graph + points-to, not taint → we build the taint layer.
  • Node >=22 engine / Bun compatibility unverified → must test under Bun.

Tasks

  • Spike: install Jelly, run it on test/fixtures/sample-app, capture its cg.json schema, locate the constraint-graph data structures (FragmentState/AnalysisState subset edges + tokens), and confirm it runs under Bun.
  • Vendor Jelly (src/) under e.g. src/semantic_analysis/jelly/; preserve its BSD-3-Clause notice.
  • Build the identity-mapping layer from Jelly's location-based nodes → our canonical signature keys (shared by both the call-graph merge and taint).
  • Merge Jelly's call graph into the existing call_graph (provenance jelly), via mergeEdges.
  • Verify end-to-end inside the compiled cants binary (in-process, no external runtime).
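A minimal sketch of the identity-mapping task above, assuming Jelly nodes carry a file plus start position and that symbol_table entries expose the same position for each declaration. All type names, field names, and the signature strings used with them are hypothetical; the real cg.json schema is what the spike is meant to capture.

```typescript
// Hypothetical shapes: neither Jelly's cg.json schema nor the symbol_table
// layout is pinned down yet, so every type here is an assumption.
interface SourceLoc { file: string; startLine: number; startCol: number }
interface SymbolEntry { signature: string; loc: SourceLoc }
interface LocEdge { from: SourceLoc; to: SourceLoc }
interface SigEdge { from: string; to: string; provenance: "jelly" }

const locKey = (l: SourceLoc): string => `${l.file}:${l.startLine}:${l.startCol}`;

/** Index canonical signatures by the start position of their declaration. */
function buildLocationIndex(symbols: SymbolEntry[]): Map<string, string> {
  const index = new Map<string, string>();
  for (const s of symbols) index.set(locKey(s.loc), s.signature);
  return index;
}

/** Translate location-keyed edges to signature-keyed ones; keep the misses. */
function mapEdges(
  index: Map<string, string>,
  edges: LocEdge[],
): { mapped: SigEdge[]; unmapped: LocEdge[] } {
  const mapped: SigEdge[] = [];
  const unmapped: LocEdge[] = [];
  for (const e of edges) {
    const from = index.get(locKey(e.from));
    const to = index.get(locKey(e.to));
    if (from !== undefined && to !== undefined) {
      mapped.push({ from, to, provenance: "jelly" });
    } else {
      unmapped.push(e); // surfaced so parser disagreements are visible, not silent
    }
  }
  return { mapped, unmapped };
}
```

Exact-position matching is itself an assumption: Babel and tsc may report different start positions for the same declaration, so the unmapped list is worth logging during the spike.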

Taint analysis layer (build on Jelly's value-flow graph)

Taint = labeled reachability over Jelly's inclusion/points-to edges, seeded at sources, blocked at sanitizers, checked at sinks. Flow-insensitive + context-insensitive → cheap and over-approximate (sound-leaning, FP-prone), which is the desired default; precision knobs below.
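The definition above can be sketched as a worklist fixpoint over a plain adjacency map. This is illustrative only: the graph shape, the integer node ids, and the simplification that a sink carries a bitmask of the source kinds it reacts to (rather than going through a separate rules table) are all assumptions, not Jelly's data structures.

```typescript
// Assumed graph shape: integer node ids, edges.get(n) lists value-flow successors.
type NodeId = number;

interface TaintProblem {
  edges: Map<NodeId, NodeId[]>;
  sources: Map<NodeId, number>; // node -> bitmask of taint kinds seeded there
  sanitizers: Set<NodeId>;      // taint never enters these nodes
  sinks: Map<NodeId, number>;   // node -> bitmask of kinds the sink reacts to
}

/** Monotone fixpoint: labels only ever gain bits, so the loop terminates. */
function solveTaint(p: TaintProblem): Map<NodeId, number> {
  const labels = new Map<NodeId, number>();
  const worklist: NodeId[] = [];
  for (const [n, mask] of p.sources) {
    if (p.sanitizers.has(n)) continue;
    labels.set(n, (labels.get(n) ?? 0) | mask);
    worklist.push(n);
  }
  while (worklist.length > 0) {
    const n = worklist.pop() as NodeId;
    const mask = labels.get(n) ?? 0;
    for (const succ of p.edges.get(n) ?? []) {
      if (p.sanitizers.has(succ)) continue; // blocked at sanitizers
      const old = labels.get(succ) ?? 0;
      const merged = old | mask;
      if (merged !== old) {
        labels.set(succ, merged);
        worklist.push(succ); // re-propagate only when a new bit arrived
      }
    }
  }
  return labels;
}

/** Report sinks reached by at least one kind they react to. */
function findHits(p: TaintProblem, labels: Map<NodeId, number>): Array<{ sink: NodeId; kinds: number }> {
  const hits: Array<{ sink: NodeId; kinds: number }> = [];
  for (const [sink, accepted] of p.sinks) {
    const kinds = (labels.get(sink) ?? 0) & accepted;
    if (kinds !== 0) hits.push({ sink, kinds });
  }
  return hits;
}
```

Because nothing here depends on statement order or calling context, the result is the cheap, over-approximate default described above.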

  • Models-as-data spec for sources / sinks / sanitizers / passthrough — full CLI + schema design in the Configurable taint models subsection below.
  • Built-in propagators: string +, template literals, property read/write.
  • Labeled monotone worklist fixpoint over the value-flow graph (bitset of taint kinds per node); source-kind ↔ sink-kind matching.
  • Path reconstruction (BFS over predecessor edges) for source→sink explanations.
  • Over-approximation knobs (flags): --taint-assume-passthrough (unmodeled call ⇒ taint passes through), collapse containers, field-insensitive mode, optional ignore-sanitizers.
  • New taint_flows section in analysis.json (source/sink models, locations, path, sanitized flag); optionally tag carrier call_graph edges.
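For the path-reconstruction item in the list above, one possible shape is a breadth-first search backwards from the sink over predecessor edges, stopping at the first source. The sketch below assumes the same integer-node adjacency map as a stand-in for the value-flow graph; the function name and signature are hypothetical.

```typescript
/**
 * Returns one shortest source-to-sink node path (in edge count), or undefined
 * when no source reaches the sink without passing through a sanitizer.
 */
function reconstructPath(
  edges: Map<number, number[]>,
  sources: Set<number>,
  sanitizers: Set<number>,
  sink: number,
): number[] | undefined {
  // Invert the forward edges once so the search can walk predecessors.
  const preds = new Map<number, number[]>();
  for (const [from, tos] of edges) {
    for (const to of tos) {
      const list = preds.get(to);
      if (list) list.push(from);
      else preds.set(to, [from]);
    }
  }
  const next = new Map<number, number>(); // node -> its successor on the way to the sink
  const seen = new Set<number>([sink]);
  const queue: number[] = [sink];
  for (let head = 0; head < queue.length; head++) {
    const n = queue[head] as number;
    if (sources.has(n)) {
      // Follow the recorded successors forward to emit source -> ... -> sink.
      const path = [n];
      let cur = n;
      while (cur !== sink) {
        cur = next.get(cur) as number;
        path.push(cur);
      }
      return path;
    }
    for (const p of preds.get(n) ?? []) {
      if (seen.has(p) || sanitizers.has(p)) continue;
      seen.add(p);
      next.set(p, n);
      queue.push(p);
    }
  }
  return undefined;
}
```

The returned node list would then be translated to source locations for the path field of a taint_flows entry.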

Configurable taint models (sources / sinks / sanitizers as args)

Models are data the analyzer consumes, never hardcoded: ship sensible built-ins that the user can extend, override, or fully replace at invocation. The engine (fixpoint) is decoupled from where a model came from.

CLI surface (extends the commander setup in src/cli.ts):

--taint                    enable taint (or implied by -a 2)
--taint-config <path>      JSON model file ('-' = stdin)
--source <spec>            inline source     (repeatable)
--sink <spec>              inline sink        (repeatable)
--sanitizer <spec>         inline sanitizer   (repeatable)
--taint-builtins <on|off>  load the bundled default pack (default on)

Precedence (later extends/overrides earlier): built-in pack → --taint-config file → inline --source/--sink/… flags. File = real, versionable specs; inline flags = quick one-offs.

Spec schema:

{
  "sources":     [{ "id": "express-query", "kind": "user-input",
                    "match": { "accessPath": "express.request.query" } }],
  "sinks":       [{ "id": "child-exec", "kind": "command-injection",
                    "match": { "callee": "child_process.exec", "args": [0] } }],
  "sanitizers":  [{ "match": { "callee": "encodeURIComponent" } }],
  "passthrough": [{ "match": { "callee": "JSON.parse" }, "from": [0], "to": "return" }],
  "rules":       [{ "from": "user-input", "to": "command-injection", "id": "cmd-injection" }],
  "disable":     ["builtin:eval-sink"]
}
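As a rough illustration of failing fast on this schema, here is a hand-rolled check covering part of it. It is a stand-in only: the plan is to validate against a JSON Schema, and the error wording, the helper names, and the choice of which fields are required are assumptions read off the example above.

```typescript
type Json = Record<string, unknown>;
const isObj = (v: unknown): v is Json =>
  typeof v === "object" && v !== null && !Array.isArray(v);

const MATCH_KEYS = ["accessPath", "callee", "qualifiedName", "signature"];

function checkMatch(m: unknown, where: string, errors: string[]): void {
  if (!isObj(m)) {
    errors.push(`${where}.match must be an object`);
    return;
  }
  if (!MATCH_KEYS.some((k) => typeof m[k] === "string")) {
    errors.push(`${where}.match needs one of: ${MATCH_KEYS.join(", ")}`);
  }
  const args = m["args"];
  const argsOk = Array.isArray(args) && args.every((a) => Number.isInteger(a) && (a as number) >= 0);
  if (args !== undefined && !argsOk) {
    errors.push(`${where}.match.args must be an array of non-negative integers`);
  }
}

/** Returns a list of problems; an empty list means the spec passed these checks. */
function validateSpec(raw: unknown): string[] {
  if (!isObj(raw)) return ["spec must be a JSON object"];
  const errors: string[] = [];
  for (const section of ["sources", "sinks", "sanitizers", "passthrough"]) {
    const list = raw[section];
    if (list === undefined) continue;
    if (!Array.isArray(list)) {
      errors.push(`${section} must be an array`);
      continue;
    }
    list.forEach((entry: unknown, i: number) => {
      const where = `${section}[${i}]`;
      if (!isObj(entry)) {
        errors.push(`${where} must be an object`);
        return;
      }
      if (section === "sources" || section === "sinks") {
        for (const f of ["id", "kind"]) {
          if (typeof entry[f] !== "string") errors.push(`${where}.${f} must be a string`);
        }
      }
      checkMatch(entry["match"], where, errors);
    });
  }
  const disable = raw["disable"];
  if (disable !== undefined && !(Array.isArray(disable) && disable.every((d) => typeof d === "string"))) {
    errors.push("disable must be an array of strings");
  }
  return errors;
}
```

The rules section and the passthrough from/to fields are left unchecked here to keep the sketch short.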

match keys — dual identity so users can target both library and first-party code:

| match key                      | targets                                      | resolved via                        |
|--------------------------------|----------------------------------------------|-------------------------------------|
| accessPath                     | library values/props (express.request.query) | Jelly access paths                  |
| callee (wildcards ok, *.query) | called functions                             | Jelly call graph → resolved target  |
| qualifiedName / signature      | first-party funcs the user marks             | our symbol_table canonical keys     |
| args / receiver                | which param/this is the tainted slot         | argument position at the call       |

  • Define + JSON-Schema-validate the spec (fail fast — it's user input).
  • Resolver: map each match to Jelly constraint-graph nodes (access path → token; callee → call-arg nodes; signature → mapped node) to seed the fixpoint.
  • Merge precedence + disable-by-id; ship a built-in default model pack.
  • --taint-config - (stdin) so the python-sdk can forward user-defined models without a temp file; report the matching source/sink/rule id per flow for explainability.
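The merge-precedence and disable-by-id items above might look roughly like this. The sketch assumes every model carries an id, that a later layer replaces an earlier model with the same id, and that disable entries match ids verbatim (so built-in ids would carry the builtin: prefix seen in the example spec). The Model and Layer types are hypothetical.

```typescript
interface Model { id: string; kind: string }
interface Layer { models: Model[]; disable?: string[] }

/**
 * Layers are passed in precedence order: built-in pack, then the
 * --taint-config file, then inline flags. Later layers win on id clashes.
 */
function mergeLayers(layers: Layer[]): Model[] {
  const byId = new Map<string, Model>();
  const disabled = new Set<string>();
  for (const layer of layers) {
    for (const m of layer.models) byId.set(m.id, m); // same id: later overrides earlier
    for (const id of layer.disable ?? []) disabled.add(id);
  }
  // Disabling is applied last so any layer can switch off any model.
  return Array.from(byId.values()).filter((m) => !disabled.has(m.id));
}
```

Applying disables after all layers are merged is a design choice in this sketch; an alternative would be to let a later layer re-enable a model by redefining it.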

Suggested staging

  1. MVP — call-graph-only taint: intraprocedural def-use from ts-morph (already in-tree) + Jelly's call graph for inter-procedural arg→param/return→caller. Ships without Jelly-internals surgery.
  2. v2 — points-to-backed propagation over Jelly's constraint graph (alias-aware).
  3. v3 — models-as-data so sources/sinks are user-extensible without recompiling (Configurable taint models above).
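For the MVP stage, the inter-procedural half (arg→param and return→caller) can be sketched as deriving value-flow edges from resolved call sites. The CallSite and FunctionSummary shapes and the node-id strings are hypothetical; in the plan above, the intraprocedural def-use edges would come from ts-morph and the callee resolution from Jelly's call graph.

```typescript
interface CallSite {
  callee: string;     // canonical signature of the resolved target
  argNodes: string[]; // value nodes for each actual argument, by position
  resultNode: string; // value node for the call expression's result
}
interface FunctionSummary {
  paramNodes: string[]; // value nodes for each formal parameter, by position
  returnNode: string;   // value node joining all returned values
}

/** Emit [from, to] value-flow edges linking callers and callees. */
function interproceduralEdges(
  calls: CallSite[],
  fns: Map<string, FunctionSummary>,
): Array<[string, string]> {
  const edges: Array<[string, string]> = [];
  for (const c of calls) {
    const f = fns.get(c.callee);
    if (f === undefined) continue; // unmodeled callee: no edges in this sketch
    const n = Math.min(c.argNodes.length, f.paramNodes.length);
    for (let i = 0; i < n; i++) {
      edges.push([c.argNodes[i] as string, f.paramNodes[i] as string]); // arg -> param
    }
    edges.push([f.returnNode, c.resultNode]); // return -> caller
  }
  return edges;
}
```

These edges, unioned with the def-use edges inside each function, form the graph that a reachability pass would then run over.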

Implementation guide & learning path

The order below keeps each step independently testable; the reading list is so the design rests on first principles rather than copied recipes.

Build order:

  1. Spike Jelly (task above) — get its call graph + constraint graph in hand on a real fixture.
  2. Identity-mapping layer (Jelly locations ↔ symbol_table signatures).
  3. MVP taint — call-graph-only propagation (no points-to); prove one source→sink flow end-to-end on a fixture.
  4. Swap the substrate to Jelly's points-to constraint graph (alias-aware).
  5. Configurable models: CLI flags + schema + validator (above).
  6. Precision knobs + path reconstruction + taint_flows output.

Concepts worth reading first:

  • Andersen-style (inclusion-based) points-to analysis — the substrate Jelly computes; taint rides the same subset edges. (Start here; it's the mental model for everything else.)
  • Taint as graph reachability / IFDS — Reps–Horwitz–Sagiv, "Precise Interprocedural Dataflow Analysis via Graph Reachability" (POPL '95). The canonical framing even though our MVP is flow-insensitive.
  • Access paths — how Jelly models library values (express.request.query) without analyzing the library; this is what your match.accessPath keys off.
  • Flow- vs context-sensitivity tradeoffs — why flow-insensitive = cheap + over-approximate (more FPs, sound-leaning).
  • Jelly's lineage — the JAM / TAPIR / ACG papers linked from its README: what its points-to actually models and its known unsoundness (dynamic eval, reflection, etc.).
  • Reference model designs (for what to model, not how): CodeQL's JS/TS sources & sinks and Semgrep's taint mode.

Reading Jelly's source: begin at the solver/constraint state (FragmentState / AnalysisState) — its subset edges + tokens are the value-flow graph you'll propagate taint over.


2. Drop CodeQL mentions

CodeQL was only ever a stub; retarget all references to Jelly.

  • src/semantic_analysis/codeql/codeql.ts → jelly/ (rename dir + buildCodeqlCallGraph → buildJellyCallGraph).
  • src/core.ts — import + the level-2 enrich call/comment (lines ~4, ~29, ~31).
  • src/cli.ts: --analysis-level help text (2 = + CodeQL enrichment → Jelly).
  • src/options/options.ts — level-2 doc comment.
  • src/semantic_analysis/callGraph.ts and src/semantic_analysis/index.ts — comments.
  • README.md: --help block + "Deeper analysis" example + level-2 prose (lines ~77, ~115, ~122).

3. Add a LICENSE (and resolve the licensing thread)

The LICENSE file is missing — the README links to ./LICENSE and both package.json and pyproject.toml declare Apache-2.0, but no file exists.

  • Add a verbatim Apache-2.0 LICENSE (matches the codeanalyzer-python/-java siblings).
  • No restrictive-license caveat needed. The caveat was prompted by CodeQL's proprietary terms; switching level-2 to Jelly (BSD-3-Clause) removes that concern entirely, since we'd be Apache-2.0 code invoking/embedding permissively licensed code.
  • When Jelly is vendored, include its BSD-3-Clause notice (e.g. a NOTICE or third-party license file) for attribution.

Context: this supersedes the original CodeQL level-2 plan. The single-binary goal is the driver — Jelly is the only credible engine that runs in pure JS and can live inside the cants binary (Joern/CodeQL are external JVM/proprietary tools).
