Level-2: integrate Jelly (drop CodeQL), build taint analysis, add LICENSE

## Summary

Replace the (stubbed, never-implemented) CodeQL-based level-2 enrichment with **[Jelly](https://github.com/cs-au-dk/jelly)** (Aarhus University), and build taint analysis on top of it. Jelly is a pure-TypeScript static analyzer (call graph + flow-insensitive points-to + access paths) — **BSD-3-Clause, zero native dependencies** — so unlike CodeQL or Joern it can be embedded **in-process** inside the `bun --compile` `cants` binary, preserving the single-self-contained-binary goal.

This issue tracks three pieces of work: (1) full Jelly integration with a taint layer, (2) removing CodeQL from the codebase/docs, and (3) adding the missing `LICENSE`.

---

## 1. Full Jelly integration (level-2 enrichment + taint)

**Why Jelly:** pure JS/TS (Babel-based), no node-gyp/native addons, BSD-3-Clause (compatible with our Apache-2.0), direct TypeScript support, JSON call-graph output. Embeddable in `cants` — no JVM, no sidecar.

**Known friction (from evaluation):**
- CLI-only, **no public library API** → vendor `src/` and hook the solver internals.
- Parses with **Babel (type-stripping, not type-aware)**, whereas our level-1 uses the **tsc resolver** → two foundations to reconcile.
- Output is **location-based**; our `call_graph` is keyed by **canonical callable signatures** that byte-match `symbol_table` keys → needs an **identity-mapping layer**.
- Jelly gives call graph + points-to, **not taint** → we build the taint layer.
- Node `>=22` engine / **Bun compatibility unverified** → must test under Bun.

### Tasks
- [ ] **Spike:** install Jelly, run it on `test/fixtures/sample-app`, capture its `cg.json` schema, locate the constraint-graph data structures (`FragmentState`/`AnalysisState` subset edges + tokens), and confirm it runs under Bun.
- [ ] Vendor Jelly (`src/`) under e.g. `src/semantic_analysis/jelly/`; preserve its BSD-3-Clause notice.
- [ ] Build the **identity-mapping layer** from Jelly's location-based nodes → our canonical signature keys (shared by both the call-graph merge and taint).
- [ ] Merge Jelly's call graph into the existing `call_graph` (provenance `jelly`), via `mergeEdges`.
- [ ] Verify end-to-end inside the compiled `cants` binary (in-process, no external runtime).
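The identity-mapping task above could start from something like the sketch below, which resolves a location-based node to the innermost enclosing callable. Every shape here (`SourceSpan`, `SymbolEntry`, `resolveSignature`, the signature strings) is hypothetical; neither Jelly's `cg.json` schema nor the `symbol_table` layout is pinned down in this issue. It also assumes positions are already normalized to one convention: Babel reports 1-based lines and 0-based columns, while the TypeScript compiler API reports 0-based line/character pairs, so a real implementation would need to convert before comparing.

```typescript
// Hypothetical shapes: Jelly's cg.json schema and our symbol_table layout
// are not specified in this issue.
interface SourceSpan {
  file: string;
  startLine: number; // assumed already normalized to a single convention
  startCol: number;
  endLine: number;
  endCol: number;
}

interface SymbolEntry {
  signature: string; // canonical callable signature (symbol_table key)
  span: SourceSpan;
}

// True when position (lineA, colA) is at or before (lineB, colB).
function atOrBefore(lineA: number, colA: number, lineB: number, colB: number): boolean {
  return lineA < lineB || (lineA === lineB && colA <= colB);
}

// True when `outer` fully contains `inner` in the same file.
function encloses(outer: SourceSpan, inner: SourceSpan): boolean {
  return (
    outer.file === inner.file &&
    atOrBefore(outer.startLine, outer.startCol, inner.startLine, inner.startCol) &&
    atOrBefore(inner.endLine, inner.endCol, outer.endLine, outer.endCol)
  );
}

// Spans enclosing the same location are nested, so the innermost enclosing
// callable is the one that starts last.
function resolveSignature(loc: SourceSpan, symbols: readonly SymbolEntry[]): string | undefined {
  let best: SymbolEntry | undefined;
  for (const sym of symbols) {
    if (!encloses(sym.span, loc)) continue;
    if (
      best === undefined ||
      atOrBefore(best.span.startLine, best.span.startCol, sym.span.startLine, sym.span.startCol)
    ) {
      best = sym;
    }
  }
  return best?.signature;
}
```

A linear scan is enough for a spike; an index keyed by file (or an interval tree) would be the obvious follow-up if this sits on a hot path.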

### Taint analysis layer (build on Jelly's value-flow graph)

Taint = **labeled reachability over Jelly's inclusion/points-to edges**: seeded at sources, blocked at sanitizers, checked at sinks. Because the analysis is flow-insensitive and context-insensitive, it is cheap and **over-approximate** with respect to the value-flow graph (it favors recall and will report false positives), which is the desired default; the knobs below adjust how far it over-approximates.

- [ ] **Models-as-data** spec for sources / sinks / sanitizers / passthrough — full CLI + schema design in the *Configurable taint models* subsection below.
- [ ] Built-in propagators: string `+`, template literals, property read/write.
- [ ] **Labeled monotone worklist fixpoint** over the value-flow graph (bitset of taint kinds per node); source-kind ↔ sink-kind matching.
- [ ] Path reconstruction (BFS over predecessor edges) for source→sink explanations.
- [ ] Over-approximation knobs (flags): `--taint-assume-passthrough` (unmodeled call ⇒ taint passes through), collapse containers, field-insensitive mode, optional ignore-sanitizers.
- [ ] New `taint_flows` section in `analysis.json` (source/sink models, locations, path, sanitized flag); optionally tag carrier `call_graph` edges.
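As a rough illustration of the labeled monotone worklist fixpoint listed above, the sketch below propagates a bitset of taint kinds along directed value-flow edges. The graph representation, the node ids, the kind constants, and the choice to model a sanitizer as a node that never receives taint are all assumptions made for the example, not Jelly's data structures.

```typescript
// Hypothetical taint kinds, one bit each.
const USER_INPUT = 1 << 0;
const FILE_READ = 1 << 1;

// edges.get(a) lists the nodes that a's values flow into (subset edge a -> b).
function propagateTaint(
  edges: ReadonlyMap<string, readonly string[]>,
  seeds: ReadonlyMap<string, number>,
  sanitizers: ReadonlySet<string>,
): Map<string, number> {
  const labels = new Map<string, number>();
  const worklist: string[] = [];

  for (const [node, kinds] of seeds) {
    labels.set(node, (labels.get(node) ?? 0) | kinds);
    worklist.push(node);
  }

  while (worklist.length > 0) {
    const node = worklist.pop() as string;
    const out = labels.get(node) ?? 0;
    for (const succ of edges.get(node) ?? []) {
      if (sanitizers.has(succ)) continue; // sanitizer nodes block propagation
      const before = labels.get(succ) ?? 0;
      const after = before | out;
      if (after !== before) {
        // Labels only ever gain bits, so the loop terminates.
        labels.set(succ, after);
        worklist.push(succ);
      }
    }
  }
  return labels;
}

// A sink fires when it carries at least one kind that it is sensitive to.
function sinkHit(labels: ReadonlyMap<string, number>, sink: string, acceptedKinds: number): boolean {
  return ((labels.get(sink) ?? 0) & acceptedKinds) !== 0;
}
```

Source-kind to sink-kind matching then reduces to a bitwise AND, as in `sinkHit(labels, "exec.arg0", USER_INPUT)` (the node id is made up).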

### Configurable taint models (sources / sinks / sanitizers as args)

Models are **data the analyzer consumes**, never hardcoded: ship sensible built-ins that the user can extend, override, or fully replace at invocation. The engine (fixpoint) is decoupled from where a model came from.

**CLI surface (extends the commander setup in `src/cli.ts`):**
```text
--taint                    enable taint (or implied by -a 2)
--taint-config <path>      JSON model file ('-' = stdin)
--source <spec>            inline source     (repeatable)
--sink <spec>              inline sink        (repeatable)
--sanitizer <spec>         inline sanitizer   (repeatable)
--taint-builtins <on|off>  load the bundled default pack (default on)
```
Precedence (later extends/overrides earlier): **built-in pack → `--taint-config` file → inline `--source`/`--sink`/… flags**. File = real, versionable specs; inline flags = quick one-offs.
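One possible reading of that precedence is sketched below: layers are applied lowest-precedence first, a later entry with the same `id` replaces an earlier one, and any `disable` id removes the entry after all layers are merged. The trimmed-down `TaintEntry`/`TaintLayer` shapes and the exact override semantics are assumptions for illustration, not a settled design.

```typescript
// Hypothetical, trimmed-down model shapes.
interface TaintEntry {
  id: string;
  kind: string;
  callee?: string;
}

interface TaintLayer {
  sources?: TaintEntry[];
  sinks?: TaintEntry[];
  disable?: string[];
}

interface MergedModels {
  sources: TaintEntry[];
  sinks: TaintEntry[];
}

// Layers are passed lowest precedence first:
// built-in pack, then the --taint-config file, then inline flags.
function mergeTaintLayers(layers: readonly TaintLayer[]): MergedModels {
  const sources = new Map<string, TaintEntry>();
  const sinks = new Map<string, TaintEntry>();
  const disabled = new Set<string>();

  for (const layer of layers) {
    // Map.set on an existing id replaces the value: later layers override.
    for (const entry of layer.sources ?? []) sources.set(entry.id, entry);
    for (const entry of layer.sinks ?? []) sinks.set(entry.id, entry);
    for (const id of layer.disable ?? []) disabled.add(id);
  }
  // Assumed semantics: a disable in any layer wins over every definition.
  for (const id of disabled) {
    sources.delete(id);
    sinks.delete(id);
  }
  return { sources: [...sources.values()], sinks: [...sinks.values()] };
}
```

Keeping the merge a pure function of an ordered list of layers means `--taint-builtins off` can be expressed by simply omitting the first layer.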

**Spec schema:**
```jsonc
{
  "sources":     [{ "id": "express-query", "kind": "user-input",
                    "match": { "accessPath": "express.request.query" } }],
  "sinks":       [{ "id": "child-exec", "kind": "command-injection",
                    "match": { "callee": "child_process.exec", "args": [0] } }],
  "sanitizers":  [{ "match": { "callee": "encodeURIComponent" } }],
  "passthrough": [{ "match": { "callee": "JSON.parse" }, "from": [0], "to": "return" }],
  "rules":       [{ "from": "user-input", "to": "command-injection", "id": "cmd-injection" }],
  "disable":     ["builtin:eval-sink"]
}
```

**`match` keys — dual identity so users can target both library and first-party code:**

| `match` key | targets | resolved via |
|-------------|---------|--------------|
| `accessPath` | library values/props (`express.request.query`) | Jelly access paths |
| `callee` (wildcards ok, `*.query`) | called functions | Jelly call graph → resolved target |
| `qualifiedName` / `signature` | **first-party** funcs the user marks | our `symbol_table` canonical keys |
| `args` / `receiver` | which param/`this` is the tainted slot | argument position at the call |
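The `callee` wildcard could be handled by compiling the pattern to an anchored regular expression, as in the sketch below. The semantics chosen here (`*` matches any non-empty run of characters, dots included, and everything else is literal) are an assumption; the issue does not define them, and `calleePatternToRegExp` / `matchesCallee` are illustrative names.

```typescript
// Compile a callee pattern such as "*.query" into an anchored RegExp.
// Assumed semantics: "*" = one or more characters; all other characters literal.
function calleePatternToRegExp(pattern: string): RegExp {
  const escaped = pattern
    .split("*")
    .map((part) => part.replace(/[.+?^${}()|[\]\\]/g, "\\$&"))
    .join(".+");
  return new RegExp(`^${escaped}$`);
}

// resolvedName is assumed to be the dotted name of the resolved call target.
function matchesCallee(pattern: string, resolvedName: string): boolean {
  return calleePatternToRegExp(pattern).test(resolvedName);
}
```

Under these assumptions `matchesCallee("*.query", "db.query")` holds while `matchesCallee("*.query", "query")` does not, because `*` must consume at least one character.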

- [ ] Define + JSON-Schema-**validate** the spec (fail fast — it's user input).
- [ ] Resolver: map each `match` to Jelly constraint-graph nodes (access path → token; callee → call-arg nodes; signature → mapped node) to seed the fixpoint.
- [ ] Merge precedence + disable-by-id; ship a built-in default model pack.
- [ ] `--taint-config -` (stdin) so the python-sdk can forward user-defined models without a temp file; report the matching source/sink/rule `id` per flow for explainability.
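To make the fail-fast idea concrete, here is a small hand-rolled structural check. It stands in for real JSON Schema validation rather than implementing it, and it only covers the list sections and `match` keys shown in the spec above; `validateTaintSpec` and the error wording are hypothetical.

```typescript
const MATCH_KEYS = new Set(["accessPath", "callee", "qualifiedName", "signature", "args", "receiver"]);
const LIST_SECTIONS = ["sources", "sinks", "sanitizers", "passthrough"] as const;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Returns a list of human-readable problems; an empty list means the spec
// passed this (deliberately shallow) check.
function validateTaintSpec(spec: unknown): string[] {
  if (!isRecord(spec)) return ["spec must be a JSON object"];
  const errors: string[] = [];

  for (const section of LIST_SECTIONS) {
    const list = spec[section];
    if (list === undefined) continue; // every section is optional
    if (!Array.isArray(list)) {
      errors.push(`${section} must be an array`);
      continue;
    }
    list.forEach((entry: unknown, index: number) => {
      const where = `${section}[${index}]`;
      if (!isRecord(entry)) {
        errors.push(`${where} must be an object`);
        return;
      }
      const match = entry["match"];
      if (!isRecord(match)) {
        errors.push(`${where}.match must be an object`);
        return;
      }
      const keys = Object.keys(match);
      if (keys.length === 0) errors.push(`${where}.match is empty`);
      for (const key of keys) {
        if (!MATCH_KEYS.has(key)) errors.push(`${where}.match has unknown key "${key}"`);
      }
    });
  }
  return errors;
}
```

Collecting every error before exiting, instead of stopping at the first, tends to be friendlier when the spec arrives over stdin from another tool.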

### Suggested staging
1. **MVP** — call-graph-only taint: intraprocedural def-use from ts-morph (already in-tree) + Jelly's call graph for inter-procedural arg→param/return→caller. Ships without Jelly-internals surgery.
2. **v2** — points-to-backed propagation over Jelly's constraint graph (alias-aware).
3. **v3** — models-as-data so sources/sinks are user-extensible without recompiling (*Configurable taint models* above).
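The MVP stage can be pictured as turning each resolved call edge into value-flow edges (argument to parameter, return value to call result), so that plain reachability works before any points-to data is involved. The sketch below shows only that conversion; `CallSite`, `CalleeInfo`, and the node-id strings are hypothetical and would in practice come from the ts-morph def-use pass and the mapped call graph.

```typescript
// Hypothetical inputs for the call-graph-only stage.
interface CallSite {
  callee: string;      // canonical signature of the resolved target
  argNodes: string[];  // value-flow node per argument, by position
  resultNode: string;  // node holding the call expression's value
}

interface CalleeInfo {
  paramNodes: string[]; // value-flow node per parameter, by position
  returnNode: string;   // node collecting the function's returned values
}

type FlowEdge = [from: string, to: string];

function interproceduralEdges(
  calls: readonly CallSite[],
  callees: ReadonlyMap<string, CalleeInfo>,
): FlowEdge[] {
  const edges: FlowEdge[] = [];
  for (const call of calls) {
    const target = callees.get(call.callee);
    if (target === undefined) continue; // unmodeled or external callee
    call.argNodes.forEach((arg, position) => {
      const param = target.paramNodes[position];
      // Extra arguments without a matching parameter are dropped here.
      if (param !== undefined) edges.push([arg, param]);
    });
    edges.push([target.returnNode, call.resultNode]);
  }
  return edges;
}
```

Skipping unresolved callees is the conservative-for-noise choice; a flag like `--taint-assume-passthrough` would instead add argument-to-result edges for them.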

### Implementation guide & learning path

The order below keeps each step independently testable; the reading list is so the design rests on first principles rather than copied recipes.

**Build order:**
1. Spike Jelly (task above) — get its call graph + constraint graph in hand on a real fixture.
2. Identity-mapping layer (Jelly locations ↔ `symbol_table` signatures).
3. **MVP taint** — call-graph-only propagation (no points-to); prove one source→sink flow end-to-end on a fixture.
4. Swap the substrate to Jelly's points-to constraint graph (alias-aware).
5. Configurable models: CLI flags + schema + validator (above).
6. Precision knobs + path reconstruction + `taint_flows` output.
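For the path-reconstruction part of step 6, a backward breadth-first search from the sink over predecessor edges, restricted to tainted nodes, yields a shortest source→sink explanation. The sketch below assumes a precomputed predecessor map and tainted set; `explainFlow` and the node ids are illustrative names only.

```typescript
// preds.get(n) lists the nodes whose values flow into n.
// Returns the nodes from a source to the sink, or undefined if no
// tainted path exists.
function explainFlow(
  preds: ReadonlyMap<string, readonly string[]>,
  tainted: ReadonlySet<string>,
  sources: ReadonlySet<string>,
  sink: string,
): string[] | undefined {
  if (!tainted.has(sink)) return undefined;

  // towardSink.get(n) is the next node after n on the path to the sink.
  const towardSink = new Map<string, string>();
  const visited = new Set<string>([sink]);
  const queue: string[] = [sink];

  for (let head = 0; head < queue.length; head++) {
    const node = queue[head] as string;
    if (sources.has(node)) {
      // Walk the recorded links forward to rebuild source -> ... -> sink.
      const path = [node];
      let current = node;
      while (current !== sink) {
        current = towardSink.get(current) as string;
        path.push(current);
      }
      return path;
    }
    for (const pred of preds.get(node) ?? []) {
      if (visited.has(pred) || !tainted.has(pred)) continue;
      visited.add(pred);
      towardSink.set(pred, node);
      queue.push(pred);
    }
  }
  return undefined;
}
```

Because the search is breadth-first, the first source reached gives a path with the fewest hops, which keeps the explanation stored in `taint_flows` short.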

**Concepts worth reading first:**
- **Andersen-style (inclusion-based) points-to analysis** — the substrate Jelly computes; taint rides the *same* subset edges. (Start here; it's the mental model for everything else.)
- **Taint as graph reachability / IFDS** — Reps–Horwitz–Sagiv, *"Precise Interprocedural Dataflow Analysis via Graph Reachability"* (POPL '95). The canonical framing even though our MVP is flow-insensitive.
- **Access paths** — how Jelly models library values (`express.request.query`) without analyzing the library; this is what your `match.accessPath` keys off.
- **Flow- vs context-sensitivity** tradeoffs — why flow-insensitive = cheap + over-approximate (more FPs, sound-leaning).
- **Jelly's lineage** — the JAM / TAPIR / ACG papers linked from its README: what its points-to actually models and its known unsoundness (dynamic `eval`, reflection, etc.).
- **Reference model designs** (for *what* to model, not how): CodeQL's JS/TS sources & sinks and Semgrep's taint mode.

**Reading Jelly's source:** begin at the solver/constraint state (`FragmentState` / `AnalysisState`) — its subset edges + tokens *are* the value-flow graph you'll propagate taint over.

---

## 2. Drop CodeQL mentions

CodeQL was only ever a stub; retarget all references to Jelly.

- [ ] `src/semantic_analysis/codeql/codeql.ts` → `jelly/` (rename dir + `buildCodeqlCallGraph` → `buildJellyCallGraph`).
- [ ] `src/core.ts` — import + the level-2 enrich call/comment (lines ~4, ~29, ~31).
- [ ] `src/cli.ts` — `--analysis-level` help text (`2 = + CodeQL enrichment` → Jelly).
- [ ] `src/options/options.ts` — level-2 doc comment.
- [ ] `src/semantic_analysis/callGraph.ts` and `src/semantic_analysis/index.ts` — comments.
- [ ] `README.md` — `--help` block + "Deeper analysis" example + level-2 prose (lines ~77, ~115, ~122).

---

## 3. Add a LICENSE (and resolve the licensing thread)

The `LICENSE` file is **missing** — the README links to `./LICENSE` and both `package.json` and `pyproject.toml` declare `Apache-2.0`, but no file exists.

- [ ] Add a verbatim **Apache-2.0 `LICENSE`** (matches the `codeanalyzer-python`/`-java` siblings).
- [ ] **No restrictive-license caveat needed.** The caveat was prompted by CodeQL's proprietary terms; switching level-2 to **Jelly (BSD-3-Clause)** removes that concern entirely, since we would be Apache-2.0 code invoking/embedding permissively licensed code.
- [ ] When Jelly is vendored, include its **BSD-3-Clause** notice (e.g. a `NOTICE` or third-party license file) for attribution.

---

_Context: this supersedes the original CodeQL level-2 plan. The single-binary goal is the driver: of the engines considered, Jelly is the only credible one that runs in pure JS and can live inside the `cants` binary (Joern and CodeQL are external tools, JVM-based and proprietary respectively)._



