Skip to content

Rework profiler - initial PR#9157

Open
abadams wants to merge 75 commits into
mainfrom
abadams/profile_recompute_without_counters
Open

Rework profiler - initial PR#9157
abadams wants to merge 75 commits into
mainfrom
abadams/profile_recompute_without_counters

Conversation

@abadams
Copy link
Copy Markdown
Member

@abadams abadams commented May 27, 2026

I have a rather large rewrite of the profiler in a branch. I'd like to start merging it but it's too big for one PR. The features are:

  • Funcs are now displayed in .stmt order with .stmt nesting (i.e. nested under what they're compute_at). Things like allocations placed somewhere other than the production (i.e. sliding window) get their own entries, as do copy-to-host/copy-to-device (instead of being reported in a separate block).
  • A system of cheap counters to report things like recompute ratios and hypothetical scheduling alternatives without incurring significant runtime cost
  • A set of performance warnings based on those counters - very useful for LLMs scheduling Halide code
  • Profiling of inlined Funcs (just counters - they can't actually be sampled)

This PR is just the first feature, and lays the groundwork for the rest. It was made by deleting those features from the full branch. Example output:

CPU:

--------------------------------------------------------------------------------------------------------
nl_means
 total time: 94.709183 ms  samples: 66  runs: 11  time per run: 8.609925 ms
 average threads used: 53.181820
 heap allocations: 10560  peak heap usage: 30M
  name                   │ time     percent │ active│ heap │ peak │ avg  │
                         │                  │threads│allocs│  mem │  mem │
  thread idle            │   0.47ms ( 5.5%) │  6.33 │      │      │      │
  malloc                 │   0.00ms ( 0.0%) │       │      │      │      │
  free                   │   0.00ms ( 0.0%) │       │      │      │      │
  non_local_means        │   0.00ms ( 0.0%) │ 57.86 │      │      │      │
  ├repeat_edge           │   0.08ms ( 1.0%) │  2.00 │  320 │   24M│  372K│
  ├d.in()                │   1.68ms (19.5%) │ 60.09 │  320 │ 3864K│   60K│
  │└d                    │   2.51ms (29.2%) │ 60.29 │      │   64 │      │
  ├blur_d_y.in()         │   1.69ms (19.7%) │ 62.31 │  320 │ 2208K│   34K│
  │└blur_d_y             │   0.68ms ( 7.9%) │ 63.59 │      │   64 │      │
  └non_local_means_sum   │   1.15ms (13.4%) │ 51.53 │      │  256 │      │
   └blur_d.in()          │   0.00ms ( 0.0%) │ 64.00 │      │   64 │      │
    └blur_d              │   0.30ms ( 3.5%) │ 64.00 │      │   64 │      │
--------------------------------------------------------------------------------------------------------

GPU:

--------------------------------------------------------------------------------------------------------
nl_means
 total time: 147.357864 ms  samples: 140  runs: 11  time per run: 13.396170 ms
 heap allocations: 0  peak heap usage: 0
  name                   │ time     percent │ active│ heap │ peak │ avg  │
                         │                  │threads│allocs│  mem │  mem │
  non_local_means_sum    │   5.20ms (38.8%) │  0.02 │      │      │      │
  ├input (copy to device)│   0.20ms ( 1.5%) │  1.00 │      │      │      │
  └blur_d                │   7.80ms (58.2%) │  0.00 │      │      │      │
   ├sum.0                │   (allocation)   │       │      │      │      │
   ├sum$1.1              │   (allocation)   │       │      │      │      │
   ├sum$2.2              │   (allocation)   │       │      │      │      │
   ├d$1                  │   0.00ms ( 0.0%) │       │      │      │      │
   │└sum                 │   0.00ms ( 0.0%) │       │      │      │      │
   ├blur_d_y             │   0.00ms ( 0.0%) │       │      │      │      │
   │└sum$1               │   0.00ms ( 0.0%) │       │      │      │      │
   └sum$2                │   0.00ms ( 0.0%) │       │      │      │      │
  non_local_means        │   0.18ms ( 1.3%) │  0.00 │      │      │      │
  └non_local_means (copy │   0.00ms ( 0.0%) │       │      │      │      │
--------------------------------------------------------------------------------------------------------


There's quite of lot of empty space to the right. It will be used for some new stats in the later PRs. In some respects this loses features from main (can't sort Funcs by time anymore), but this unlocks the features from the PRs to come.

abadams and others added 30 commits April 30, 2026 10:23
Adds per-instance counters (Realizations, points_required_at_realization /
_production / _root, points_computed) so the profile distinguishes
multiple appearances of the same Func and can diagnose recompute. Wires
the corresponding markers (declare_box_required_at_*, declare_inlined,
declare_stage) through ScheduleFunctions, BoundsInference, and Inline,
and consumes them in a new InjectCounters pass.

Adds GPU support: marker-billed counters track Funcs inside GPU kernels,
hoisting non-uniform contributions out via bounds_of_expr_in_scope (or
Let / Select / max as appropriate) and flagging the Func with
counters_approximated when it has to do so. Injects a halide_device_sync
at the end of every GPU kernel launch under -profile so kernel time
gets billed to the launching producer rather than the next blocking
host operation.

Splits per-Func time aggregation from counter aggregation: runs that
complete between two sampler ticks contribute to per-Func counters but
not to per-Func time, with a separate billed_runs field for the time
denominator.

Threads halide_copy_to_host/device synthetic instances into the timeline
view (parented to the producer they sit inside) instead of pulling them
into separate sections at the end, and counts each copy invocation via
a new halide_profiler_count_host_device_copy runtime helper.

Adds report rules:
  - mid-Func host<->device bouncing (forgotten device schedule on an
    update def)
  - counters_approximated note (per-Func) and unaccounted-runs
    summary (per-pipeline)
  - sliding-window-failure / RoundUp / GuardWithIf via the
    realization/production/root/computed counters

Adds test/generator/profiler_instances_{generator,aottest}.cpp covering
the per-instance machinery, recompute counters, the impure-condition
approximation path, and host/device copy synthetics (the GPU pieces are
gated on get_target().has_gpu_feature()).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds halide_profiler_func_kind (func / overhead / thread_idle / malloc /
free / copy_to_host / copy_to_device) and a buffer_func_id field on
halide_profiler_func_stats. The runtime gets the kind/buffer_func_id
arrays through halide_profiler_instance_start alongside the existing
names/parents/canonical_ids.

Replaces three sites that previously did the equivalent work by parsing
names or hardcoding bookkeeping slot indices:
  - Filtering empty bookkeeping rows from the table (used `i < 4` index
    checks)
  - Skipping bookkeeping and copy synthetics in the rules loop (used
    `idx < 4` plus strstr for the copy-name suffix)
  - The "stages computing on different devices" rule looking for both
    directions of copy of a given Func's buffer (used name-prefix match)

The expensive_free pipeline-level check now looks up the free slot by
kind rather than p->funcs[3]. JSON dump now emits kind, buffer_func_id,
and canonical_id. The dead suffix_cut argument on print_func_row /
emit_name (left over from the section-header refactor) is gone.

The IR-side *_id constants in Profiling.cpp stay (the bookkeeping slot
indices still matter for stack tracking and the set_current_func
emission) — they're no longer how the report identifies the slots.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
halide_profiler_instance_state already uses "instance" to mean
one in-flight pipeline invocation; in src/Profiling.cpp the same word was
also doing duty for "one row in the per-Func stats array" (one
appearance of a Func in the schedule, distinguishing repeated
inlining sites or separately-realized update defs). The two senses are
unrelated and the collision was confusing.

Renames the per-row sense to "entry" throughout Profiling.cpp and the
test files:

  IdInfo            -> EntryInfo
  id_info           -> entry_info
  id_for_instance   -> id_for_entry
  instance_map      -> entry_map
  instances_by_name -> entries_by_name
  approximated_instances -> approximated_entries
  resolve_instance_id    -> resolve_entry_id
  PreAllocateInstances   -> PreAllocateEntries
  get_func_instance_id   -> get_func_entry_id

Test helper instances_of -> entries_of, plus the per-row scenario
assertions and comments. The runtime API (halide_profiler_instance_*,
the `instance` variable referring to halide_profiler_instance_state)
keeps the "instance" word — it's where it belongs. The test pipeline's
filename and registered Generator name stay (those drive the AOT build
path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes a "Clearing func stats" debug-level-0 print left over from
earlier development.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e_box

declare_box_touched is a real annotation that bounds inference's
box_touched analysis follows: its first arg has to be a
Variable<Handle>(func.name()) so passes that substitute on names of
in-scope buffers transform it correctly. The new
declare_box_required_at_{realization,production,root} intrinsics are
profiler-only markers whose first arg is just a label for the report
and must not be confused with an in-scope reference (so it's a
StringImm).

The declare_box helper in ScheduleFunctions.cpp now picks the right
shape for the first arg based on the intrinsic. Restores the
extern-stage handling (correctness/extern_producer,
correctness/extern_output_expansion,
generator_aot_nested_externs_root/_inner) that regressed when the
refactor uniformly switched all declare_box_* intrinsics to StringImm.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d loops

The previous version widened *every* impure Call in a vectorized loop,
which broadcast halide_trace_helper / make_struct calls to vector
types and produced LLVM-level signature mismatches (the assertion in
CallInst::init complaining about a bad signature) — for instance, the
vectorize_inlined subtest of correctness/compute_with would assert
during codegen.

The widening was only ever needed for the profiler's counter markers,
which encode per-lane counter contributions in the intrinsic's
type.lanes() and so have to be widened to match the surrounding loop's
lane count even when their args don't reference any vectorized var.
Restrict the special case to those:
  - declare_inlined (bills InlinedCalls per lane)
  - declare_box_required_at_realization / _production / _root
    (bills points_required_at_* per lane via box_total)

inline_marker is gone after resolve_inline_markers, and declare_stage
is idempotent (no lane-count dependence), so neither needs the widening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
resolve_inline_markers walks the IR looking for inline_markers that
need to be replaced with declare_inlined intrinsics. The expectation
was that every chain of markers sits inside some Provide (the
production being billed), but extern stages can have markers in their
call args (e.g. nested_externs_root reads an Input scalar Func through
inline_marker as part of the args to halide_copy_to_device /
nested_externs_inner) — those have no surrounding Provide.

BuildInlineGraph already strips the inline_marker intrinsics during
its walk, so the rewritten Stmt is well-formed. The previous code
asserted on the missing Provide name; now we just return the rewritten
Stmt without emitting a declare_inlined. The inlined work still
happens; the profiler simply doesn't bill it to an entry, which is
the right call for extern-stage arg evaluation anyway.

Fixes generator_aot_nested_externs_root and _inner under
HL_TARGET=host-profile.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inline.cpp wraps every inlined call site in an inline_marker so the
profiler can later stamp down a declare_inlined for its surrounding
Provide. Extern stages don't have a surrounding Provide — their
ScheduleFunctions-emitted IR is a LetStmt whose value is the extern
call. Without an anchor, resolve_inline_markers asserted when
inline_markers appeared in extern call args (which happens whenever a
Func is inlined into an extern stage's scalar args).

Wrap each extern call's value in a new pure intrinsic
extern_stage_marker(name, value) under -profile. BuildInlineGraph
recognizes the wrapper and uses the extern stage's name as the
billing target for any inline_markers inside it, exactly as it would
for a Provide. resolve_inline_markers gains an Evaluate handler so a
top-level Evaluate of an inline_marker-bearing expression is treated
as its own subtree (mirroring the Provide and LetStmt cases). The
inline_markers may live in any combination of CSE-hoisted LetStmts
above the extern call or directly in its args; both paths now route
through process_inlining_subtree.

Test case in profiler_instances inlines a Func into an extern stage's
scalar arg and asserts that the inlined entry's parent in the report
is the extern stage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The BoundsInference Inliner had keep_inlined_calls=true under -profile,
wrapping each inlined call site in return_second(call, body) so
boxes_required could still see the original call when recording the box
required for the inlined Func. The wrapper served as a no-op at runtime
(return_second discards the first arg), but the wrapped Halide call
referenced a Func with no buffer or producer — once storage flattening
turned it into a load, codegen would fail with "Name not in Scope".

The leak path wasn't limited to extern bounds-query args: any time
boxes_required walked an expression produced by the Inliner, the args
of the original Halide calls could end up in the resulting Box's min/max
intervals (e.g. f(f(x)) with f inlined → box of outer f contains the
inlined-form of inner f(x), wrappers and all), and those intervals get
baked into the `.s0.x.min` / `.s0.x.max` LetStmts that flow into
runtime IR.

Switch the wrapper to inline_marker(call, body) — same dual-role
semantics (boxes_required sees the call; bounds-of-expr returns body),
but typed for our purpose. Strip every inline_marker unconditionally
in a single pass right after BoundsInference finishes; by then the
marker has done its job and the inlined Func has no codegen role.
Teach the bounds-of-expr rule in Bounds.cpp to treat inline_marker
like return_second (collapsed into the same is_intrinsic({...}) check
alongside if_then_else, which already shared the rule).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CSE-hoisted Lets inside a Provide can chain such that each let's RHS
references the previous let's variable. Walking such a chain, the
Variable visitor splices the referenced let's roots into the current
let's roots. With a vector container, multiple Variable uses of the
same prior let duplicate that let's entire roots vector each time, so
chains of "let tk = t_{k-1}*t_{k-1} + t_{k-1}" snowball as 3^N. This
exhausted RAM on a real lookup-table-heavy pipeline.

Switch the let_roots containers to std::set<int>, which is the right
shape anyway — the roots of a let are a set of inlining-graph node ids.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Under -profile every inlined Func gets a per-stage entry in the
BoundsInference stages list so its recompute counters can be tracked.
For a chain of N inlined Funcs with non-trivial bounds two distinct
quadratic costs appeared:

1) The construction of N declare_box_required_at_root intrinsics each
   carried a copy of the bounds chain in its args (no sharing across
   declarations). For a chain of length N each declaration's
   expressions had size O(N-k), so total IR text was O(N^2) — exposing
   any downstream pass (and HL_DEBUG_CODEGEN=2 printing) to O(N^2)
   work even though the underlying Expr DAG was O(N) via refcount.

2) The boxes_required walk ran once per stage in the inference loop.
   For inlined consumers in a chain each walk visited an O(N)-sized
   shared suffix of the inlined IR, accumulating to O(N^2) total
   construction work in BoundsInference itself. Notably the `boxes`
   map produced for inlined consumers was immediately discarded —
   there's a `continue` before the "expand to producers" loop, so
   the per-stage walk was pure waste.

Two changes:

- Stage::define_bounds for inlined Funcs no longer emits the full
  declare_box_required_at_root in place. Instead it stashes the box in
  a side map keyed by Func name and stamps a single-arg marker (the
  intrinsic with just the StringImm name) at the same scope. A
  post-pass RewriteDeferredRootBoxMarkers walks the resulting Stmt,
  finds runs of consecutive markers in each Block, joint-CSEs the
  corresponding box expressions, peels outer Lets into LetStmts that
  wrap the block of declarations, and rewrites each marker in place to
  a full declaration referencing the lifted let-bound subexpressions.
  Net effect: O(N) IR text instead of O(N^2), and the joint CSE itself
  runs on the (linear) shared DAG.

- In the main BoundsInference relationship-computing loop, move the
  `if (consumer.inlined) continue;` check above the per-stage
  boxes_required block. Inlined consumers' producer bounds are picked
  up transitively through the outermost (non-inlined) consumer's walk
  (inline_marker's args[0] carries the original Halide call), so the
  per-inlined-consumer walk is redundant. Skipping it removes the
  O(N) work-per-stage that summed to O(N^2) over the chain.

Together these bring a chain of N=512 inlined Funcs with select-based
bounds from 653 s under -profile to 14 s, and N=128 from 3.6 s to 0.9 s.
The remaining time is mostly LLVM codegen on legitimately N-sized IR
and isn't profile-specific.

Also filter profiler_instances.rungen out of GENERATOR_BUILD_RUNGEN_TESTS
since its aottest provides a test_extern_stage callback that rungen
doesn't link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two more places in BoundsInference were doing O(N) work per stage and
summing to O(N^2) across a chain of N inlined Funcs:

- The pure-inlining loop ran inliner.do_inlining() on every stage's
  exprs, even inlined stages whose exprs are never consulted (the
  relationship-computing loop already skips them via continue). Each
  call does mutate() + common_subexpression_elimination(), and the
  CSE alone is O(unique nodes) per call. Skip inlined stages here.

- Inliner::get_qualified_body, called recursively as the chain
  expands, ran CSE on the result at every level. The recursion shape
  means the level-k cache entry includes the fully-inlined chain from
  level k+1 downward; running CSE at each level walks the same shared
  sub-DAG over and over. Defer CSE to the public do_inlining entry
  point — once at the top of the recursion is sufficient. mutate()
  still recurses through visit(Call) into get_qualified_body for
  each inlined sub-call, so the chain still resolves.

Combined with the existing marker-deferral and the consumer.inlined
skip in the relationship loop, this brings the N=512 chain test from
653 s (pre-fixes) to 10 s under -profile. Some residual super-linear
cost remains in the joint-CSE-on-bundle marker rewrite — CSE's
use_map uses IRGraphDeepCompare which is O(structure size) per
comparison, so deep chains still pay there — but it's tolerable now
(~22% of compile time at N=512). That's a separate follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Describe what the code does without forensic notes about alternatives
or scaling reasoning. The latter belongs in commit messages and PR
descriptions, not the source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
In SkipStages::visit(Select), the two branches' per-Func .used / .loaded
predicates were combined as `(t_used && cond) || (f_used && !cond)`. When
both branches contributed the same Expr -- which is exactly what
happens when both branches read the same let-stashed FuncInfo from an
outer let -- make_or could not recognise the And nodes as equivalent
(they aren't same_as even when their operands are), so the predicate
roughly doubled in size at every nested Select. A long chain of CSE'd
lets where each let value contains a Select then drove the predicate
size to 2^N, well past the point where allocating the IR is feasible.

Combine the two branches with `select(cond, t, f)` instead, and add a
make_select helper that collapses `select(c, X, X) -> X` and the
constant-cond cases. When both branches contributed the same Expr,
make_select drops the condition immediately and the chain stays linear.

The new correctness test (many_inlined_selects.cpp) constructs a 500-
element CSE'd let chain whose values each carry a Param<bool>-gated
Select, then feeds the chain into a final Select. With the bug present
this test would not terminate -- skip_stages would crash allocating
~2^500 IR nodes long before any reasonable timeout fired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When an id is only touched on one branch of the Select, the previous
code passed an undefined Expr to a `combine` helper that then turned
`undefined` into const_false and built a `select(cond, X, false)` --
which is just `X && cond` dressed up as a select. Call make_and directly
in those cases and keep make_select for the both-branches case, where
the `select(c, X, X) -> X` collapse is the whole point. Also factor the
"merge into old" body into a small helper to remove the duplication.

No behaviour change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lowercase the Func name, drop the unnecessary top-level select and
output schedule, and make each chain entry depend on chain.back() so
nothing gets eliminated as dead. The test still reproduces the pre-fix
exponential blow-up (verified by reverting the fix: it times out at
30s on a 500-element chain).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bounds inference for inlined Funcs can produce root-box bounds whose
arithmetic doesn't fit in int32 -- e.g. an index expression of shape
(c1 - c2) * c3 over a wide interval, where simplify materialises a
signed_integer_overflow intrinsic for the offending product. Those
markers feed the recompute-ratio report and are nice-to-have only;
letting them reach codegen turns into a user_error and breaks the
whole compile for what is otherwise a profiling-only stat.

Add a small pre-pass in inject_profiling that walks the IR with a
Scope of "poisoned" let-binding names (a binding is poisoned if its
value transitively contains a signed_integer_overflow intrinsic or
a reference to another poisoned binding) and rewrites any
declare_box_required_at_root whose args touch the poison set to
make_zero. The subsequent simplify() in lower then drops the now-
dead lets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an inlined Func whose bounds-inferred interval doesn't fit in
int32 -- shape (uint16 - uint16) * c -- and feeds it through a buffer
index. Without the poison-drop pre-pass in inject_profiling the
generator user_errors during codegen; with it the test compiles and
the wide_scaled Func still shows up in the profiler report as an
inlined entry (we just lose the root-box count for table16, which is
the bug we worked around).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
abadams and others added 16 commits May 26, 2026 09:45
Deletion-only pass to shrink the initial PR. Strips:

- The inline_marker, declare_inlined, and extern_stage_marker
  intrinsics, their visitors in Bounds.cpp/VectorizeLoops.cpp, and
  Inline.cpp's leave_decl plumbing.
- The Inliner class's keep_inlined_calls flag and inline_marker
  emission/assertion in BoundsInference.cpp.
- ScheduleFunctions.cpp's Target::Profile flag to inline_function and
  the extern_stage_marker wrap around extern calls.
- Profiling.cpp's PreAllocateEntries handling of declare_inlined,
  the BuildInlineGraph class, resolve_inline_markers,
  process_inlining_subtree, DropPoisonedBoxRequired, and the
  InjectCounters declare_inlined branch.
- The InlinedCalls counter enum entry and its plumbing through to
  halide_profiler_update_counters.
- The inlined_calls field on halide_profiler_func_stats and all its
  consumers in profiler_common.cpp (the (inlined) row template, the
  recompute-ratio inlined fallback, JSON output, etc.).
- profiler_instances test cases that exercised inlined Funcs (multi
  -inlined, chain_a/b/c, cse_shared, diamond_*, forced_*, tab*,
  in_target_*, extern_inlined, input wrapper, Func::in wrappers,
  poisoned-root-box-dropped). Non-inlined cases (roundup, guard,
  unrolled with update, compute_with, tiled stencil, sliding window
  pass/fail, extern, inwards counter) are preserved.

A few stale comments still reference the removed concepts; will be
cleaned up before the PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The branch's profiler emits separate (allocation) entries with names
like "argmin.0" for hoist_storage sites. These match the test's
strncmp("argmin", 6) check but have stack_peak=0, so the assertion
fired on the synthetic entry instead of the real "argmin" Func.
Filter to kind==func before checking stack/heap stats.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Delete the heuristic warning rule machinery (allocs-in-parallel-loop,
poor-thread-utilization, high-recompute, etc.) along with the notes
column in the report and the auxiliary checks (too_few_samples,
too_many_anon_funcs, expensive_free).

With warnings gone, several counters had no remaining consumer in the
table or JSON output. Drop them from halide_profiler_func_stats and
the corresponding emission paths in InjectCounters /
halide_profiler_update_counters:
  - realizations, productions
  - points_required_at_realization, points_required_at_production
  - points_required_inwards, productions_if_inwards
  - scalar/vector_loads, gathers, bytes_loaded
  - scalar/vector_stores, scatters, bytes_stored
  - flags (and halide_profiler_func_flag_counters_approximated)

Removing the inwards counter also lets us delete:
  - declare_box_required_inwards (IR intrinsic)
  - the inwards_levels plumbing in ScheduleFunctions
  - the inner_productions inwards-decl handling in BoundsInference

Likewise declare_box_required_at_realization and
declare_box_required_at_production are no longer emitted or consumed.

halide_profiler_count_host_device_copy only existed to bump the now-
deleted realizations counter on copy synthetics, so drop it.

Counters that survive (table + recompute ratio + JSON):
  time, memory_*, stack_peak, active_threads_*, num_allocs,
  parallel_loops, parallel_tasks, points_required_at_root,
  points_computed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
All the inlining-related machinery here — keeping inlined stages in the
stages vector, the deferred_root_boxes detour, RewriteDeferredRootBoxMarkers,
the at_root tracking, the inwards-decl handler — only existed to support
profiling of inlined Funcs. With that feature removed in dede142, none
of it is needed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Drop the two "Joint CSE across RHSs" TODOs in InjectCounters::flush:
  irrelevant now that only four counters survive.
- Drop the "inlined Funcs ... multiple parents" TODO in profiler_common:
  obsolete since inlined-Func profiling was removed.
- Drop the sampling-token-ordering TODO; not worth restructuring for now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The profiler-display-name override added in 6942823 was never wired
through halide_ir.fbs, so round-tripping a Function through serialization
loses it. Add a profiler_display_name string to the Func table, the
matching ser/des calls, and the update_with_deserialization signature.

I don't have flatc locally to regenerate halide_ir.fbs.h or to compile
WITH_SERIALIZATION, so this is unverified at compile time on this
machine; the changes mirror the adjacent fields exactly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This stress test for inliner cached-body blow-up belongs with the
inliner-speedup work on a separate branch, not the profiler branch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
It was intentionally removed; no need to keep the dead line.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When a Func has a profiler_display_name override (e.g. the wrapper minted
by .in() or .clone_in()), the entry's display name differs from its
IR-level name. Two places were comparing the IR name against
EntryInfo::name (the display name) and missing:

- InjectCounters::visit(Store) decided whether the current producer was
  for this Func by comparing producer's entry name to the IR name. The
  mismatch caused it to fall through to id_for_name, minting a stray
  root-level entry for the wrapper.
- InjectCounters::entries_by_name was keyed on display name, so
  declare_box_required_at_root's IR-name lookup couldn't find the
  wrapper's entries — the wrapper's recompute denominator never got
  billed.

Store the IR name alongside the display name in EntryInfo and use it
for both comparisons.

Visible in interpolate: the second `down[0].clone_in(downx[1])` row
(all-zero stats at root) is gone, and the surviving clone row gets a
recompute ratio.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
InjectCounters and its emitted counters (parallel_loops, parallel_tasks,
points_required_at_root, points_computed) were carrying everything
downstream — the heuristic warnings, the recompute column, the
per-counter aggregation in the runtime, plus the
declare_box_required_at_root / declare_stage IR intrinsics and the
ScheduleFunctions/VectorizeLoops handling of them. None of those are
needed for the only remaining feature on this branch: showing Funcs in
IR order in the report.

Drop:
- The InjectCounters pass and halide_profiler_update_counters.
- parallel_loops / parallel_tasks / points_required_at_root /
  points_computed fields on halide_profiler_func_stats.
- native_vector_bytes on the pipeline stats (only existed for the
  narrow-store warning).
- The L/K/R columns and the "parallel loops/tasks" summary line.
- declare_box_required_at_root and declare_stage intrinsics and their
  emission sites.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Revert orthogonal whitespace/refactor changes in Bounds.cpp, IR.h,
  Inline.cpp, Schedule.cpp, StorageFolding.cpp.
- Revert the inject_profiling move (and added simplify) in Lower.cpp;
  with InjectCounters gone, the original position works.
- Drop unused includes from Profiling.cpp (Bounds.h, DeviceInterface.h,
  ExprUsesVar.h, FindCalls.h, cstdlib).
- Drop the unused Target arg from inject_profiling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A pass through Profiling.cpp, profiler_common.cpp, and HalideRuntime.h
to collapse multi-paragraph comments that mostly restated what the code
already said. Also dropped the now-unused handle_name helper (its
declare_box_required_at_root / declare_stage callers are gone) and
reworded the Entries header to "An unscheduled Func with an update def"
per review.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 83.15412% with 47 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@7e2ecf2). Learn more about missing BASE report.

Files with missing lines Patch % Lines
src/Profiling.cpp 81.77% 17 Missing and 24 partials ⚠️
src/JITModule.cpp 0.00% 6 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #9157   +/-   ##
=======================================
  Coverage        ?   69.38%           
=======================================
  Files           ?      254           
  Lines           ?    78334           
  Branches        ?    18734           
=======================================
  Hits            ?    54352           
  Misses          ?    18478           
  Partials        ?     5504           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

abadams and others added 7 commits May 28, 2026 11:10
Both performance/profiler.cpp and performance/memory_profiler.cpp used
to scrape the profile report's textual output. The reformatted table
layout broke their sscanf patterns. Rewrite both to install a
trace_pipeline hook on halide_trace_end_pipeline, look up
halide_profiler_get_state in the JIT shared runtime (on first call,
when the runtime is guaranteed to exist), and read per-Func counters
directly from halide_profiler_state::instances. The trace event fires
inside the pipeline body — before the register_destructor for
instance_end runs — so per-instance counters are populated but
JITCache::finish_profiling hasn't yet reset the global state.

Add JITSharedRuntime::find_symbol for the JIT-shared-runtime symbol
lookup; both tests use it, and so does test/correctness/custom_cuda_context.cpp
(updated to drop the open-coded loop).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Both tests now hook halide_trace_end_pipeline, read counters straight
from the running halide_profiler_instance_state (taking the head of
state->instances since only one pipeline runs at a time), and skip the
idempotency / cached-symbol dance that was working around problems
that no longer exist.

memory_profiler.cpp additionally:
- Packs the four captured fields into a Stats struct.
- Replaces the two check_error functions with one `check` that takes a
  [min, max] heap range; the function exits on mismatch so callers
  don't have to thread return codes.
- Wraps each test block in a `run_case(desc, lambda)` helper.
- Table-drives the four-way toggle test.

profiler.cpp adds a per-run suffix to Func/Pipeline names so the two
run_test invocations produce identical (Func::name())-derived strings
to compare against the profiler entries.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- profiler_instances_generator: the pre-generate assert was Profile-only;
  let ProfileByTimer through too, since everything past the assert works
  the same for both target features.
- user_context_generator: drop the assert that excluded Profile. The
  comment claimed "the profiler insists on calling malloc with nullptr
  user context", but the profiler runtime only uses bare libc malloc for
  its bookkeeping — it doesn't go through halide_malloc at all, so the
  test's custom_malloc never sees a profiler call.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
CI flagged them as unused after the earlier check pruning.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two padding issues that caused bus errors on 32-bit builds:

- kind was a uint8_t followed by an int (buffer_func_id), which inserts
  3 bytes of implicit padding. Use the enum type itself (an int) so the
  field is exactly 4 bytes with no implicit padding.
- With a 4-byte pointer at the start (name), the uint64_t counter region
  would land at a 4-aligned but not 8-aligned offset on 32-bit x86 (the
  i386 ABI allows uint64_t to be 4-aligned). Atomic 64-bit ops on
  cmpxchg8b require 8-byte alignment, so this would bus-error. Move the
  non-counter fields (parent, canonical_id, kind, buffer_func_id) to
  precede `time`, and mark `time` explicitly with HALIDE_ATTRIBUTE_ALIGN(8)
  so the compiler always inserts the needed padding regardless of target.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
find_or_create_pipeline took const uint64_t *func_names and indexed it
with an 8-byte stride. Halide's IR-side make_struct of the per-Func
Handle pointers codegens as [N x ptr], so on 32-bit each entry is only
4 bytes — the runtime would read every other entry as garbage past the
first half of the array. Main avoided this by using
Allocate(Handle(), {num_funcs}) + Stores (Halide's Handle type is fixed
at 64 bits regardless of target), but this branch uses make_struct.

Change the runtime signature to const char *const *func_names so the
stride matches the actual pointer width emitted by make_struct.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@mcourteaux
Copy link
Copy Markdown
Contributor

Given the column width of the name is short right now, I'd do a few things:

  • lengthen it? compute the required width before printing?
  • mark device to host copies with a prefix [D2H]
  • mark host to device copies with a prefix [H2D]

@abadams
Copy link
Copy Markdown
Member Author

abadams commented May 29, 2026

The follow-up PRs use up all that space, so I'd rather not mess with formatting right now, and instead tweak it once the whole thing is in-place.

abadams and others added 3 commits May 29, 2026 09:08
Allocation-group buffers (created by FuseGPUThreadLoops when it fuses
shared/heap allocations) have names that don't match any Func, so the
previous code skipped tracking them — the runtime would have aborted on
func_id == -1, and even if it didn't, the bytes would have been dropped
on the floor.

Mint an allocation-kind entry for those buffers and bill the
memory_allocate/memory_free calls to it. Render the row as the
participating Funcs (e.g. "f1$0.0,f2$0.1.buffer") by splitting on the
"allocgroup__" tag and joining with commas.

The rendering is a placeholder pending a follow-up PR that splits the
size across the participants via the counter machinery; the TODO in
id_for_entry sketches the plan.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants