Rework profiler - initial PR by abadams · Pull Request #9157 · halide/Halide

abadams · 2026-05-27T16:36:39Z

I have a rather large rewrite of the profiler in a branch. I'd like to start merging it but it's too big for one PR. The features are:

- Funcs are now displayed in .stmt order with .stmt nesting (i.e. nested under what they're compute_at). Things like allocations placed somewhere other than the production (i.e. sliding window) get their own entries, as do copy-to-host/copy-to-device (instead of being reported in a separate block).
- A system of cheap counters to report things like recompute ratios and hypothetical scheduling alternatives without incurring significant runtime cost
- A set of performance warnings based on those counters - very useful for LLMs scheduling Halide code
- Profiling of inlined Funcs (just counters - they can't actually be sampled)

This PR is just the first feature, and lays the groundwork for the rest. It was made by deleting those features from the full branch. Example output:

CPU:

--------------------------------------------------------------------------------------------------------
nl_means
 total time: 94.709183 ms  samples: 66  runs: 11  time per run: 8.609925 ms
 average threads used: 53.181820
 heap allocations: 10560  peak heap usage: 30M
  name                   │ time     percent │ active│ heap │ peak │ avg  │
                         │                  │threads│allocs│  mem │  mem │
  thread idle            │   0.47ms ( 5.5%) │  6.33 │      │      │      │
  malloc                 │   0.00ms ( 0.0%) │       │      │      │      │
  free                   │   0.00ms ( 0.0%) │       │      │      │      │
  non_local_means        │   0.00ms ( 0.0%) │ 57.86 │      │      │      │
  ├repeat_edge           │   0.08ms ( 1.0%) │  2.00 │  320 │   24M│  372K│
  ├d.in()                │   1.68ms (19.5%) │ 60.09 │  320 │ 3864K│   60K│
  │└d                    │   2.51ms (29.2%) │ 60.29 │      │   64 │      │
  ├blur_d_y.in()         │   1.69ms (19.7%) │ 62.31 │  320 │ 2208K│   34K│
  │└blur_d_y             │   0.68ms ( 7.9%) │ 63.59 │      │   64 │      │
  └non_local_means_sum   │   1.15ms (13.4%) │ 51.53 │      │  256 │      │
   └blur_d.in()          │   0.00ms ( 0.0%) │ 64.00 │      │   64 │      │
    └blur_d              │   0.30ms ( 3.5%) │ 64.00 │      │   64 │      │
--------------------------------------------------------------------------------------------------------

GPU:

--------------------------------------------------------------------------------------------------------
nl_means
 total time: 147.357864 ms  samples: 140  runs: 11  time per run: 13.396170 ms
 heap allocations: 0  peak heap usage: 0
  name                   │ time     percent │ active│ heap │ peak │ avg  │
                         │                  │threads│allocs│  mem │  mem │
  non_local_means_sum    │   5.20ms (38.8%) │  0.02 │      │      │      │
  ├input (copy to device)│   0.20ms ( 1.5%) │  1.00 │      │      │      │
  └blur_d                │   7.80ms (58.2%) │  0.00 │      │      │      │
   ├sum.0                │   (allocation)   │       │      │      │      │
   ├sum$1.1              │   (allocation)   │       │      │      │      │
   ├sum$2.2              │   (allocation)   │       │      │      │      │
   ├d$1                  │   0.00ms ( 0.0%) │       │      │      │      │
   │└sum                 │   0.00ms ( 0.0%) │       │      │      │      │
   ├blur_d_y             │   0.00ms ( 0.0%) │       │      │      │      │
   │└sum$1               │   0.00ms ( 0.0%) │       │      │      │      │
   └sum$2                │   0.00ms ( 0.0%) │       │      │      │      │
  non_local_means        │   0.18ms ( 1.3%) │  0.00 │      │      │      │
  └non_local_means (copy │   0.00ms ( 0.0%) │       │      │      │      │
--------------------------------------------------------------------------------------------------------

There's quite a lot of empty space to the right. It will be used for some new stats in the later PRs. In some respects this loses features from main (you can't sort Funcs by time anymore), but it unlocks the features in the PRs to come.

…pute

Adds per-instance counters (Realizations, points_required_at_realization / _production / _root, points_computed) so the profile distinguishes multiple appearances of the same Func and can diagnose recompute. Wires the corresponding markers (declare_box_required_at_*, declare_inlined, declare_stage) through ScheduleFunctions, BoundsInference, and Inline, and consumes them in a new InjectCounters pass.

Adds GPU support: marker-billed counters track Funcs inside GPU kernels, hoisting non-uniform contributions out via bounds_of_expr_in_scope (or Let / Select / max as appropriate) and flagging the Func with counters_approximated when it has to do so.

Injects a halide_device_sync at the end of every GPU kernel launch under -profile so kernel time gets billed to the launching producer rather than the next blocking host operation.

Splits per-Func time aggregation from counter aggregation: runs that complete between two sampler ticks contribute to per-Func counters but not to per-Func time, with a separate billed_runs field for the time denominator.

Threads halide_copy_to_host/device synthetic instances into the timeline view (parented to the producer they sit inside) instead of pulling them into separate sections at the end, and counts each copy invocation via a new halide_profiler_count_host_device_copy runtime helper.

Adds report rules:
- mid-Func host<->device bouncing (forgotten device schedule on an update def)
- counters_approximated note (per-Func) and unaccounted-runs summary (per-pipeline)
- sliding-window-failure / RoundUp / GuardWithIf via the realization/production/root/computed counters

Adds test/generator/profiler_instances_{generator,aottest}.cpp covering the per-instance machinery, recompute counters, the impure-condition approximation path, and host/device copy synthetics (the GPU pieces are gated on get_target().has_gpu_feature()).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds halide_profiler_func_kind (func / overhead / thread_idle / malloc / free / copy_to_host / copy_to_device) and a buffer_func_id field on halide_profiler_func_stats. The runtime gets the kind/buffer_func_id arrays through halide_profiler_instance_start alongside the existing names/parents/canonical_ids.

Replaces three sites that previously did the equivalent work by parsing names or hardcoding bookkeeping slot indices:
- Filtering empty bookkeeping rows from the table (used `i < 4` index checks)
- Skipping bookkeeping and copy synthetics in the rules loop (used `idx < 4` plus strstr for the copy-name suffix)
- The "stages computing on different devices" rule looking for both directions of copy of a given Func's buffer (used name-prefix match)

The expensive_free pipeline-level check now looks up the free slot by kind rather than p->funcs[3]. JSON dump now emits kind, buffer_func_id, and canonical_id. The dead suffix_cut argument on print_func_row / emit_name (left over from the section-header refactor) is gone.

The IR-side *_id constants in Profiling.cpp stay (the bookkeeping slot indices still matter for stack tracking and the set_current_func emission); they're no longer how the report identifies the slots.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
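The switch from slot-index checks to kind-based lookup can be sketched roughly as follows. This is a minimal illustration, not the runtime's code: `FuncKind`, `FuncStats`, `find_slot`, `is_real_func`, and `copies_both_ways` are hypothetical stand-ins, and the meaning given to `buffer_func_id` here (the index of the Func whose buffer a copy row moves) is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for halide_profiler_func_kind; enumerator
// spellings are illustrative only.
enum class FuncKind { Func, Overhead, ThreadIdle, Malloc, Free, CopyToHost, CopyToDevice };

// Hypothetical, trimmed-down stand-in for halide_profiler_func_stats.
struct FuncStats {
    std::string name;
    FuncKind kind;
    uint64_t time_ns;
    int buffer_func_id;  // assumed: for copy rows, the Func whose buffer is copied; else -1
};

// Find the first row of the given kind, or -1 if there is none.
// This replaces hardcoded indexing such as funcs[3] for the free slot.
int find_slot(const std::vector<FuncStats> &funcs, FuncKind kind) {
    for (std::size_t i = 0; i < funcs.size(); i++) {
        if (funcs[i].kind == kind) return static_cast<int>(i);
    }
    return -1;
}

// True for rows that per-Func rules should inspect; bookkeeping and
// copy rows are skipped without any `idx < 4` or strstr checks.
bool is_real_func(const FuncStats &f) {
    return f.kind == FuncKind::Func;
}

// True if both copy directions exist for the buffer of Func `func_id`.
bool copies_both_ways(const std::vector<FuncStats> &funcs, int func_id) {
    bool to_host = false, to_device = false;
    for (const FuncStats &f : funcs) {
        if (f.buffer_func_id != func_id) continue;
        to_host = to_host || f.kind == FuncKind::CopyToHost;
        to_device = to_device || f.kind == FuncKind::CopyToDevice;
    }
    return to_host && to_device;
}
```

Keying on an explicit kind means the report logic no longer depends on where the bookkeeping rows happen to sit in the array or on how copy rows are named.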

halide_profiler_instance_state already uses "instance" to mean one in-flight pipeline invocation; in src/Profiling.cpp the same word was also doing duty for "one row in the per-Func stats array" (one appearance of a Func in the schedule, distinguishing repeated inlining sites or separately-realized update defs). The two senses are unrelated and the collision was confusing.

Renames the per-row sense to "entry" throughout Profiling.cpp and the test files:
- IdInfo -> EntryInfo
- id_info -> entry_info
- id_for_instance -> id_for_entry
- instance_map -> entry_map
- instances_by_name -> entries_by_name
- approximated_instances -> approximated_entries
- resolve_instance_id -> resolve_entry_id
- PreAllocateInstances -> PreAllocateEntries
- get_func_instance_id -> get_func_entry_id
- Test helper instances_of -> entries_of, plus the per-row scenario assertions and comments.

The runtime API (halide_profiler_instance_*, the `instance` variable referring to halide_profiler_instance_state) keeps the "instance" word, which is where it belongs. The test pipeline's filename and registered Generator name stay (those drive the AOT build path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Removes a "Clearing func stats" debug-level-0 print left over from earlier development. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e_box declare_box_touched is a real annotation that bounds inference's box_touched analysis follows: its first arg has to be a Variable<Handle>(func.name()) so passes that substitute on names of in-scope buffers transform it correctly. The new declare_box_required_at_{realization,production,root} intrinsics are profiler-only markers whose first arg is just a label for the report and must not be confused with an in-scope reference (so it's a StringImm). The declare_box helper in ScheduleFunctions.cpp now picks the right shape for the first arg based on the intrinsic. Restores the extern-stage handling (correctness/extern_producer, correctness/extern_output_expansion, generator_aot_nested_externs_root/_inner) that regressed when the refactor uniformly switched all declare_box_* intrinsics to StringImm. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d loops

The previous version widened *every* impure Call in a vectorized loop, which broadcast halide_trace_helper / make_struct calls to vector types and produced LLVM-level signature mismatches (the assertion in CallInst::init complaining about a bad signature). For instance, the vectorize_inlined subtest of correctness/compute_with would assert during codegen.

The widening was only ever needed for the profiler's counter markers, which encode per-lane counter contributions in the intrinsic's type.lanes() and so have to be widened to match the surrounding loop's lane count even when their args don't reference any vectorized var. Restrict the special case to those:
- declare_inlined (bills InlinedCalls per lane)
- declare_box_required_at_realization / _production / _root (bills points_required_at_* per lane via box_total)

inline_marker is gone after resolve_inline_markers, and declare_stage is idempotent (no lane-count dependence), so neither needs the widening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

resolve_inline_markers walks the IR looking for inline_markers that need to be replaced with declare_inlined intrinsics. The expectation was that every chain of markers sits inside some Provide (the production being billed), but extern stages can have markers in their call args (e.g. nested_externs_root reads an Input scalar Func through inline_marker as part of the args to halide_copy_to_device / nested_externs_inner) — those have no surrounding Provide. BuildInlineGraph already strips the inline_marker intrinsics during its walk, so the rewritten Stmt is well-formed. The previous code asserted on the missing Provide name; now we just return the rewritten Stmt without emitting a declare_inlined. The inlined work still happens; the profiler simply doesn't bill it to an entry, which is the right call for extern-stage arg evaluation anyway. Fixes generator_aot_nested_externs_root and _inner under HL_TARGET=host-profile. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Inline.cpp wraps every inlined call site in an inline_marker so the profiler can later stamp down a declare_inlined for its surrounding Provide. Extern stages don't have a surrounding Provide — their ScheduleFunctions-emitted IR is a LetStmt whose value is the extern call. Without an anchor, resolve_inline_markers asserted when inline_markers appeared in extern call args (which happens whenever a Func is inlined into an extern stage's scalar args). Wrap each extern call's value in a new pure intrinsic extern_stage_marker(name, value) under -profile. BuildInlineGraph recognizes the wrapper and uses the extern stage's name as the billing target for any inline_markers inside it, exactly as it would for a Provide. resolve_inline_markers gains an Evaluate handler so a top-level Evaluate of an inline_marker-bearing expression is treated as its own subtree (mirroring the Provide and LetStmt cases). The inline_markers may live in any combination of CSE-hoisted LetStmts above the extern call or directly in its args; both paths now route through process_inlining_subtree. Test case in profiler_instances inlines a Func into an extern stage's scalar arg and asserts that the inlined entry's parent in the report is the extern stage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The BoundsInference Inliner had keep_inlined_calls=true under -profile, wrapping each inlined call site in return_second(call, body) so boxes_required could still see the original call when recording the box required for the inlined Func. The wrapper served as a no-op at runtime (return_second discards the first arg), but the wrapped Halide call referenced a Func with no buffer or producer — once storage flattening turned it into a load, codegen would fail with "Name not in Scope". The leak path wasn't limited to extern bounds-query args: any time boxes_required walked an expression produced by the Inliner, the args of the original Halide calls could end up in the resulting Box's min/max intervals (e.g. f(f(x)) with f inlined → box of outer f contains the inlined-form of inner f(x), wrappers and all), and those intervals get baked into the `.s0.x.min` / `.s0.x.max` LetStmts that flow into runtime IR. Switch the wrapper to inline_marker(call, body) — same dual-role semantics (boxes_required sees the call; bounds-of-expr returns body), but typed for our purpose. Strip every inline_marker unconditionally in a single pass right after BoundsInference finishes; by then the marker has done its job and the inlined Func has no codegen role. Teach the bounds-of-expr rule in Bounds.cpp to treat inline_marker like return_second (collapsed into the same is_intrinsic({...}) check alongside if_then_else, which already shared the rule). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pute

…/Halide into abadams/profile_recompute

CSE-hoisted Lets inside a Provide can chain such that each let's RHS references the previous let's variable. Walking such a chain, the Variable visitor splices the referenced let's roots into the current let's roots. With a vector container, multiple Variable uses of the same prior let duplicate that let's entire roots vector each time, so chains of "let tk = t_{k-1}*t_{k-1} + t_{k-1}" snowball as 3^N. This exhausted RAM on a real lookup-table-heavy pipeline. Switch the let_roots containers to std::set<int>, which is the right shape anyway — the roots of a let are a set of inlining-graph node ids. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
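The growth described above can be reproduced with a small model. This is a sketch under stated assumptions, not the code from Profiling.cpp: `roots_with_vector` and `roots_with_set` are hypothetical names, each let is assumed to reference the previous let exactly three times (as in `t_{k-1}*t_{k-1} + t_{k-1}`), and the first let is assumed to have a single root with id 0.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Vector container: each of the three Variable uses splices in the
// previous let's entire roots vector, so the size triples per link.
std::size_t roots_with_vector(int chain_length) {
    std::vector<int> roots = {0};
    for (int k = 1; k <= chain_length; k++) {
        std::vector<int> next;
        for (int use = 0; use < 3; use++) {
            next.insert(next.end(), roots.begin(), roots.end());
        }
        roots = std::move(next);
    }
    return roots.size();
}

// Set container: splicing the same roots three times is idempotent,
// so the size is bounded by the number of distinct node ids.
std::size_t roots_with_set(int chain_length) {
    std::set<int> roots = {0};
    for (int k = 1; k <= chain_length; k++) {
        std::set<int> next;
        for (int use = 0; use < 3; use++) {
            next.insert(roots.begin(), roots.end());
        }
        roots = std::move(next);
    }
    return roots.size();
}
```

With a chain of 10 lets, the vector version holds 3^10 = 59049 ids while the set version still holds one.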

Under -profile every inlined Func gets a per-stage entry in the BoundsInference stages list so its recompute counters can be tracked. For a chain of N inlined Funcs with non-trivial bounds two distinct quadratic costs appeared:

1) The construction of N declare_box_required_at_root intrinsics each carried a copy of the bounds chain in its args (no sharing across declarations). For a chain of length N each declaration's expressions had size O(N-k), so total IR text was O(N^2), exposing any downstream pass (and HL_DEBUG_CODEGEN=2 printing) to O(N^2) work even though the underlying Expr DAG was O(N) via refcount.
2) The boxes_required walk ran once per stage in the inference loop. For inlined consumers in a chain each walk visited an O(N)-sized shared suffix of the inlined IR, accumulating to O(N^2) total construction work in BoundsInference itself. Notably the `boxes` map produced for inlined consumers was immediately discarded: there's a `continue` before the "expand to producers" loop, so the per-stage walk was pure waste.

Two changes:
- Stage::define_bounds for inlined Funcs no longer emits the full declare_box_required_at_root in place. Instead it stashes the box in a side map keyed by Func name and stamps a single-arg marker (the intrinsic with just the StringImm name) at the same scope. A post-pass RewriteDeferredRootBoxMarkers walks the resulting Stmt, finds runs of consecutive markers in each Block, joint-CSEs the corresponding box expressions, peels outer Lets into LetStmts that wrap the block of declarations, and rewrites each marker in place to a full declaration referencing the lifted let-bound subexpressions. Net effect: O(N) IR text instead of O(N^2), and the joint CSE itself runs on the (linear) shared DAG.
- In the main BoundsInference relationship-computing loop, move the `if (consumer.inlined) continue;` check above the per-stage boxes_required block. Inlined consumers' producer bounds are picked up transitively through the outermost (non-inlined) consumer's walk (inline_marker's args[0] carries the original Halide call), so the per-inlined-consumer walk is redundant. Skipping it removes the O(N) work-per-stage that summed to O(N^2) over the chain.

Together these bring a chain of N=512 inlined Funcs with select-based bounds from 653 s under -profile to 14 s, and N=128 from 3.6 s to 0.9 s. The remaining time is mostly LLVM codegen on legitimately N-sized IR and isn't profile-specific.

Also filter profiler_instances.rungen out of GENERATOR_BUILD_RUNGEN_TESTS since its aottest provides a test_extern_stage callback that rungen doesn't link.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two more places in BoundsInference were doing O(N) work per stage and summing to O(N^2) across a chain of N inlined Funcs:
- The pure-inlining loop ran inliner.do_inlining() on every stage's exprs, even inlined stages whose exprs are never consulted (the relationship-computing loop already skips them via continue). Each call does mutate() + common_subexpression_elimination(), and the CSE alone is O(unique nodes) per call. Skip inlined stages here.
- Inliner::get_qualified_body, called recursively as the chain expands, ran CSE on the result at every level. The recursion shape means the level-k cache entry includes the fully-inlined chain from level k+1 downward; running CSE at each level walks the same shared sub-DAG over and over. Defer CSE to the public do_inlining entry point: once at the top of the recursion is sufficient. mutate() still recurses through visit(Call) into get_qualified_body for each inlined sub-call, so the chain still resolves.

Combined with the existing marker-deferral and the consumer.inlined skip in the relationship loop, this brings the N=512 chain test from 653 s (pre-fixes) to 10 s under -profile. Some residual super-linear cost remains in the joint-CSE-on-bundle marker rewrite (CSE's use_map uses IRGraphDeepCompare, which is O(structure size) per comparison, so deep chains still pay there), but it's tolerable now (~22% of compile time at N=512). That's a separate follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Describe what the code does without forensic notes about alternatives or scaling reasoning. The latter belongs in commit messages and PR descriptions, not the source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

In SkipStages::visit(Select), the two branches' per-Func .used / .loaded predicates were combined as `(t_used && cond) || (f_used && !cond)`. When both branches contributed the same Expr -- which is exactly what happens when both branches read the same let-stashed FuncInfo from an outer let -- make_or could not recognise the And nodes as equivalent (they aren't same_as even when their operands are), so the predicate roughly doubled in size at every nested Select. A long chain of CSE'd lets where each let value contains a Select then drove the predicate size to 2^N, well past the point where allocating the IR is feasible.

Combine the two branches with `select(cond, t, f)` instead, and add a make_select helper that collapses `select(c, X, X) -> X` and the constant-cond cases. When both branches contributed the same Expr, make_select drops the condition immediately and the chain stays linear.

The new correctness test (many_inlined_selects.cpp) constructs a 500-element CSE'd let chain whose values each carry a Param<bool>-gated Select, then feeds the chain into a final Select. With the bug present this test would not terminate -- skip_stages would crash allocating ~2^500 IR nodes long before any reasonable timeout fired.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
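The doubling and its fix can be illustrated with a toy expression tree. This is a sketch, not Halide's IR: `Node`, `make`, `tree_size`, `combine_with_or`, and `predicate_size` are hypothetical, `make_select` here only implements the `select(c, X, X) -> X` collapse (the constant-condition cases are omitted), and pointer equality stands in for same_as.

```cpp
#include <cstddef>
#include <memory>

// Toy boolean expression node (hypothetical; not Halide's Expr).
struct Node;
using Expr = std::shared_ptr<const Node>;
struct Node {
    enum Kind { Var, And, Or, Not, Select } kind;
    Expr a, b, c;  // operands; unused ones are null
};

Expr make(Node::Kind kind, Expr a = nullptr, Expr b = nullptr, Expr c = nullptr) {
    return std::make_shared<Node>(Node{kind, a, b, c});
}

// Node count when walked as a tree (shared subtrees counted every time).
std::size_t tree_size(const Expr &e) {
    if (!e) return 0;
    return 1 + tree_size(e->a) + tree_size(e->b) + tree_size(e->c);
}

// Old combination: (t && cond) || (f && !cond).
Expr combine_with_or(const Expr &cond, const Expr &t, const Expr &f) {
    return make(Node::Or,
                make(Node::And, t, cond),
                make(Node::And, f, make(Node::Not, cond)));
}

// New combination: select(cond, t, f), collapsing select(c, X, X) -> X.
Expr make_select(const Expr &cond, const Expr &t, const Expr &f) {
    if (t == f) return t;  // pointer identity, analogous to same_as
    return make(Node::Select, cond, t, f);
}

// Pass the same predicate through both branches of `depth` nested Selects.
std::size_t predicate_size(int depth, bool use_select) {
    Expr pred = make(Node::Var);
    for (int i = 0; i < depth; i++) {
        Expr cond = make(Node::Var);
        pred = use_select ? make_select(cond, pred, pred)
                          : combine_with_or(cond, pred, pred);
    }
    return tree_size(pred);
}
```

In this model each level of the old form costs `2s + 6` nodes for a predicate of size `s`, so depth 10 already reaches 7162 nodes, while the select form stays at a single node.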

When an id is only touched on one branch of the Select, the previous code passed an undefined Expr to a `combine` helper that then turned `undefined` into const_false and built a `select(cond, X, false)` -- which is just `X && cond` dressed up as a select. Call make_and directly in those cases and keep make_select for the both-branches case, where the `select(c, X, X) -> X` collapse is the whole point. Also factor the "merge into old" body into a small helper to remove the duplication. No behaviour change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lowercase the Func name, drop the unnecessary top-level select and output schedule, and make each chain entry depend on chain.back() so nothing gets eliminated as dead. The test still reproduces the pre-fix exponential blow-up (verified by reverting the fix: it times out at 30s on a 500-element chain). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…le_recompute

Bounds inference for inlined Funcs can produce root-box bounds whose arithmetic doesn't fit in int32 -- e.g. an index expression of shape (c1 - c2) * c3 over a wide interval, where simplify materialises a signed_integer_overflow intrinsic for the offending product. Those markers feed the recompute-ratio report and are nice-to-have only; letting them reach codegen turns into a user_error and breaks the whole compile for what is otherwise a profiling-only stat.

Add a small pre-pass in inject_profiling that walks the IR with a Scope of "poisoned" let-binding names (a binding is poisoned if its value transitively contains a signed_integer_overflow intrinsic or a reference to another poisoned binding) and rewrites any declare_box_required_at_root whose args touch the poison set to make_zero. The subsequent simplify() in lower then drops the now-dead lets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
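The poison propagation can be sketched on a flattened list of let bindings. This is a simplified model rather than the actual pre-pass: `LetBinding`, `find_poisoned`, and `touches_poison` are hypothetical names, bindings are assumed to be listed outermost first, and each binding records only the names its value references plus whether it directly contains an overflow intrinsic.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical summary of one let binding.
struct LetBinding {
    std::string name;
    std::vector<std::string> uses;  // names referenced by the value
    bool has_overflow;              // value contains signed_integer_overflow
};

// A binding is poisoned if its value contains an overflow or references
// an already-poisoned binding. Bindings are visited in definition order,
// so one pass captures transitive poisoning.
std::set<std::string> find_poisoned(const std::vector<LetBinding> &lets) {
    std::set<std::string> poisoned;
    for (const LetBinding &let : lets) {
        bool bad = let.has_overflow;
        for (const std::string &u : let.uses) {
            bad = bad || poisoned.count(u) > 0;
        }
        if (bad) poisoned.insert(let.name);
    }
    return poisoned;
}

// True if a marker whose args reference `arg_uses` must be zeroed out.
bool touches_poison(const std::vector<std::string> &arg_uses,
                    const std::set<std::string> &poisoned) {
    for (const std::string &u : arg_uses) {
        if (poisoned.count(u) > 0) return true;
    }
    return false;
}
```

A marker for which `touches_poison` returns true corresponds to the declaration that the pre-pass rewrites to zero; all other markers are left alone.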

Adds an inlined Func whose bounds-inferred interval doesn't fit in int32 -- shape (uint16 - uint16) * c -- and feeds it through a buffer index. Without the poison-drop pre-pass in inject_profiling the generator user_errors during codegen; with it the test compiles and the wide_scaled Func still shows up in the profiler report as an inlined entry (we just lose the root-box count for table16, which is the bug we worked around). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pute

…write_level' into abadams/profile_recompute

Deletion-only pass to shrink the initial PR. Strips:
- The inline_marker, declare_inlined, and extern_stage_marker intrinsics, their visitors in Bounds.cpp/VectorizeLoops.cpp, and Inline.cpp's leave_decl plumbing.
- The Inliner class's keep_inlined_calls flag and inline_marker emission/assertion in BoundsInference.cpp.
- ScheduleFunctions.cpp's Target::Profile flag to inline_function and the extern_stage_marker wrap around extern calls.
- Profiling.cpp's PreAllocateEntries handling of declare_inlined, the BuildInlineGraph class, resolve_inline_markers, process_inlining_subtree, DropPoisonedBoxRequired, and the InjectCounters declare_inlined branch.
- The InlinedCalls counter enum entry and its plumbing through to halide_profiler_update_counters.
- The inlined_calls field on halide_profiler_func_stats and all its consumers in profiler_common.cpp (the (inlined) row template, the recompute-ratio inlined fallback, JSON output, etc.).
- profiler_instances test cases that exercised inlined Funcs (multi-inlined, chain_a/b/c, cse_shared, diamond_*, forced_*, tab*, in_target_*, extern_inlined, input wrapper, Func::in wrappers, poisoned-root-box-dropped). Non-inlined cases (roundup, guard, unrolled with update, compute_with, tiled stencil, sliding window pass/fail, extern, inwards counter) are preserved.

A few stale comments still reference the removed concepts; will be cleaned up before the PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The branch's profiler emits separate (allocation) entries with names like "argmin.0" for hoist_storage sites. These match the test's strncmp("argmin", 6) check but have stack_peak=0, so the assertion fired on the synthetic entry instead of the real "argmin" Func. Filter to kind==func before checking stack/heap stats. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Delete the heuristic warning rule machinery (allocs-in-parallel-loop, poor-thread-utilization, high-recompute, etc.) along with the notes column in the report and the auxiliary checks (too_few_samples, too_many_anon_funcs, expensive_free).

With warnings gone, several counters had no remaining consumer in the table or JSON output. Drop them from halide_profiler_func_stats and the corresponding emission paths in InjectCounters / halide_profiler_update_counters:
- realizations, productions
- points_required_at_realization, points_required_at_production
- points_required_inwards, productions_if_inwards
- scalar/vector_loads, gathers, bytes_loaded
- scalar/vector_stores, scatters, bytes_stored
- flags (and halide_profiler_func_flag_counters_approximated)

Removing the inwards counter also lets us delete:
- declare_box_required_inwards (IR intrinsic)
- the inwards_levels plumbing in ScheduleFunctions
- the inner_productions inwards-decl handling in BoundsInference

Likewise declare_box_required_at_realization and declare_box_required_at_production are no longer emitted or consumed.

halide_profiler_count_host_device_copy only existed to bump the now-deleted realizations counter on copy synthetics, so drop it.

Counters that survive (table + recompute ratio + JSON): time, memory_*, stack_peak, active_threads_*, num_allocs, parallel_loops, parallel_tasks, points_required_at_root, points_computed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

All the inlining-related machinery here — keeping inlined stages in the stages vector, the deferred_root_boxes detour, RewriteDeferredRootBoxMarkers, the at_root tracking, the inwards-decl handler — only existed to support profiling of inlined Funcs. With that feature removed in dede142, none of it is needed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Drop the two "Joint CSE across RHSs" TODOs in InjectCounters::flush: irrelevant now that only four counters survive.
- Drop the "inlined Funcs ... multiple parents" TODO in profiler_common: obsolete since inlined-Func profiling was removed.
- Drop the sampling-token-ordering TODO; not worth restructuring for now.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

The profiler-display-name override added in 6942823 was never wired through halide_ir.fbs, so round-tripping a Function through serialization loses it. Add a profiler_display_name string to the Func table, the matching ser/des calls, and the update_with_deserialization signature. I don't have flatc locally to regenerate halide_ir.fbs.h or to compile WITH_SERIALIZATION, so this is unverified at compile time on this machine; the changes mirror the adjacent fields exactly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

This stress test for inliner cached-body blow-up belongs with the inliner-speedup work on a separate branch, not the profiler branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

It was intentionally removed; no need to keep the dead line. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

When a Func has a profiler_display_name override (e.g. the wrapper minted by .in() or .clone_in()), the entry's display name differs from its IR-level name. Two places were comparing the IR name against EntryInfo::name (the display name) and missing:
- InjectCounters::visit(Store) decided whether the current producer was for this Func by comparing producer's entry name to the IR name. The mismatch caused it to fall through to id_for_name, minting a stray root-level entry for the wrapper.
- InjectCounters::entries_by_name was keyed on display name, so declare_box_required_at_root's IR-name lookup couldn't find the wrapper's entries, and the wrapper's recompute denominator never got billed.

Store the IR name alongside the display name in EntryInfo and use it for both comparisons. Visible in interpolate: the second `down[0].clone_in(downx[1])` row (all-zero stats at root) is gone, and the surviving clone row gets a recompute ratio.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
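The fix amounts to carrying both names on each entry and keying lookups on the IR name. A minimal sketch, assuming a simplified `EntryInfo` with only two fields; `index_by_ir_name` and `producer_matches` are hypothetical helpers, and the example names used with them are made up for illustration:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for EntryInfo: the name shown in the report and
// the name the IR uses for the same Func may differ for wrappers.
struct EntryInfo {
    std::string display_name;
    std::string ir_name;
};

// Index entry ids by IR name, so IR-side lookups find wrapper entries.
std::map<std::string, std::vector<int>> index_by_ir_name(const std::vector<EntryInfo> &entries) {
    std::map<std::string, std::vector<int>> index;
    for (std::size_t i = 0; i < entries.size(); i++) {
        index[entries[i].ir_name].push_back(static_cast<int>(i));
    }
    return index;
}

// Does the current producer's entry belong to the Func with this IR name?
// Comparing against display_name here would miss wrappers.
bool producer_matches(const std::vector<EntryInfo> &entries, int producer_id,
                      const std::string &ir_name) {
    return entries[producer_id].ir_name == ir_name;
}
```

The display name is then used only when printing rows, so a renamed wrapper can no longer cause a second entry to be created for the same Func.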

InjectCounters and its emitted counters (parallel_loops, parallel_tasks, points_required_at_root, points_computed) were carrying everything downstream: the heuristic warnings, the recompute column, the per-counter aggregation in the runtime, plus the declare_box_required_at_root / declare_stage IR intrinsics and the ScheduleFunctions/VectorizeLoops handling of them. None of those are needed for the only remaining feature on this branch: showing Funcs in IR order in the report.

Drop:
- The InjectCounters pass and halide_profiler_update_counters.
- parallel_loops / parallel_tasks / points_required_at_root / points_computed fields on halide_profiler_func_stats.
- native_vector_bytes on the pipeline stats (only existed for the narrow-store warning).
- The L/K/R columns and the "parallel loops/tasks" summary line.
- declare_box_required_at_root and declare_stage intrinsics and their emission sites.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Revert orthogonal whitespace/refactor changes in Bounds.cpp, IR.h, Inline.cpp, Schedule.cpp, StorageFolding.cpp.
- Revert the inject_profiling move (and added simplify) in Lower.cpp; with InjectCounters gone, the original position works.
- Drop unused includes from Profiling.cpp (Bounds.h, DeviceInterface.h, ExprUsesVar.h, FindCalls.h, cstdlib).
- Drop the unused Target arg from inject_profiling.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…pute_without_counters

A pass through Profiling.cpp, profiler_common.cpp, and HalideRuntime.h to collapse multi-paragraph comments that mostly restated what the code already said. Also dropped the now-unused handle_name helper (its declare_box_required_at_root / declare_stage callers are gone) and reworded the Entries header to "An unscheduled Func with an update def" per review. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

codecov · 2026-05-27T17:51:00Z

Codecov Report

❌ Patch coverage is 83.15412% with 47 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@7e2ecf2). Learn more about missing BASE report.

Files with missing lines	Patch %	Lines
src/Profiling.cpp	81.77%	17 Missing and 24 partials ⚠️
src/JITModule.cpp	0.00%	6 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #9157   +/-   ##
=======================================
  Coverage        ?   69.38%           
=======================================
  Files           ?      254           
  Lines           ?    78334           
  Branches        ?    18734           
=======================================
  Hits            ?    54352           
  Misses          ?    18478           
  Partials        ?     5504
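The percentages in this report can be recomputed from the raw counts as a sanity check. The sketch below is only illustrative: `coverage_percent` and the constant names are hypothetical, and the 279-line patch total is inferred from 47 missing lines at 83.15412% rather than stated anywhere in the report.

```cpp
#include <cstdint>

// Coverage as a percentage of covered lines over total lines.
// Illustrative helper only; not part of Codecov or Halide.
constexpr double coverage_percent(int64_t hits, int64_t lines) {
    return 100.0 * static_cast<double>(hits) / static_cast<double>(lines);
}

// Project totals from the diff table: hits + misses + partials == lines.
constexpr int64_t kHits = 54352;
constexpr int64_t kMisses = 18478;
constexpr int64_t kPartials = 5504;
constexpr int64_t kLines = kHits + kMisses + kPartials;

// Patch totals: 17 missing + 24 partial lines in src/Profiling.cpp,
// plus 6 missing lines in src/JITModule.cpp.
constexpr int64_t kPatchMissing = 17 + 24 + 6;

// Assumption: inferred from 47 / (1 - 0.8315412), not stated in the report.
constexpr int64_t kPatchLines = 279;
```

With these inputs, `coverage_percent(kHits, kLines)` lands between 69.38 and 69.39, and the patch figure lands between 83.1541 and 83.1542, which agrees with the numbers quoted by the bot.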


Both performance/profiler.cpp and performance/memory_profiler.cpp used to scrape the profile report's textual output. The reformatted table layout broke their sscanf patterns. Rewrite both to install a trace_pipeline hook on halide_trace_end_pipeline, look up halide_profiler_get_state in the JIT shared runtime (on first call, when the runtime is guaranteed to exist), and read per-Func counters directly from halide_profiler_state::instances. The trace event fires inside the pipeline body — before the register_destructor for instance_end runs — so per-instance counters are populated but JITCache::finish_profiling hasn't yet reset the global state. Add JITSharedRuntime::find_symbol for the JIT-shared-runtime symbol lookup; both tests use it, and so does test/correctness/custom_cuda_context.cpp (updated to drop the open-coded loop). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Both tests now hook halide_trace_end_pipeline, read counters straight from the running halide_profiler_instance_state (taking the head of state->instances since only one pipeline runs at a time), and skip the idempotency / cached-symbol dance that was working around problems that no longer exist.

memory_profiler.cpp additionally:
- Packs the four captured fields into a Stats struct.
- Replaces the two check_error functions with one `check` that takes a [min, max] heap range; the function exits on mismatch so callers don't have to thread return codes.
- Wraps each test block in a `run_case(desc, lambda)` helper.
- Table-drives the four-way toggle test.

profiler.cpp adds a per-run suffix to Func/Pipeline names so the two run_test invocations produce identical (Func::name())-derived strings to compare against the profiler entries.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
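The `check` and `run_case` helpers described above might look roughly like the following. This is a sketch under assumptions, not the code in memory_profiler.cpp: the four `Stats` field names, the `heap_in_range` helper, and the exact parameter lists are hypothetical, since the commit message names only `Stats`, `check`, and `run_case`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <utility>

// Hypothetical stand-in for the four captured fields; the real names are
// not given in the commit message.
struct Stats {
    int heap_allocs = 0;
    uint64_t heap_peak = 0;
    uint64_t heap_total = 0;
    int stack_peak = 0;
};

// True when the observed peak lies in the inclusive range [min_heap, max_heap].
constexpr bool heap_in_range(uint64_t peak, uint64_t min_heap, uint64_t max_heap) {
    return min_heap <= peak && peak <= max_heap;
}

// One check that exits on mismatch, so callers need not thread return codes.
inline void check(const char *desc, const Stats &s, int expected_allocs,
                  uint64_t min_heap, uint64_t max_heap) {
    assert(min_heap <= max_heap);
    if (s.heap_allocs != expected_allocs ||
        !heap_in_range(s.heap_peak, min_heap, max_heap)) {
        std::fprintf(stderr,
                     "%s: allocs=%d peak=%llu, expected allocs=%d peak in [%llu, %llu]\n",
                     desc, s.heap_allocs,
                     static_cast<unsigned long long>(s.heap_peak),
                     expected_allocs,
                     static_cast<unsigned long long>(min_heap),
                     static_cast<unsigned long long>(max_heap));
        std::exit(1);
    }
}

// Wraps one test block: announce it, then run the body.
template<typename Body>
void run_case(const char *desc, Body &&body) {
    std::printf("Running %s\n", desc);
    std::forward<Body>(body)();
}
```

A table-driven test would then loop over a small array of expected values and call `run_case` once per row, with `check` inside the lambda.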

- profiler_instances_generator: the pre-generate assert was Profile-only; let ProfileByTimer through too, since everything past the assert works the same for both target features.
- user_context_generator: drop the assert that excluded Profile. The comment claimed "the profiler insists on calling malloc with nullptr user context", but the profiler runtime only uses bare libc malloc for its bookkeeping; it doesn't go through halide_malloc at all, so the test's custom_malloc never sees a profiler call.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

CI flagged them as unused after the earlier check pruning. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Two padding issues that caused bus errors on 32-bit builds:
- kind was a uint8_t followed by an int (buffer_func_id), which inserts 3 bytes of implicit padding. Use the enum type itself (an int) so the field is exactly 4 bytes with no implicit padding.
- With a 4-byte pointer at the start (name), the uint64_t counter region would land at a 4-aligned but not 8-aligned offset on 32-bit x86 (the i386 ABI allows uint64_t to be 4-aligned). Atomic 64-bit ops on cmpxchg8b require 8-byte alignment, so this would bus-error. Move the non-counter fields (parent, canonical_id, kind, buffer_func_id) to precede `time`, and mark `time` explicitly with HALIDE_ATTRIBUTE_ALIGN(8) so the compiler always inserts the needed padding regardless of target.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
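The two layout fixes can be illustrated with a pair of stand-in structs. Everything here is hypothetical except the field names taken from the commit message: `EntryBefore`, `EntryAfter`, `EntryKind`, and the field order in the "before" struct are assumptions, and standard `alignas(8)` stands in for HALIDE_ATTRIBUTE_ALIGN(8).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical enum standing in for the entry kind; its underlying type is int.
enum EntryKind : int { kFunc = 0, kAllocation = 1 };

// Before (assumed layout): the 64-bit counter directly follows the pointer,
// so its offset is 4 on a target with 4-byte pointers and 4-aligned uint64_t.
// The 1-byte kind followed by an int also leaves 3 bytes of implicit padding.
struct EntryBefore {
    const char *name;
    uint64_t time;
    uint8_t kind;
    int buffer_func_id;
};

// After: non-counter fields come first, kind is stored as the int-sized enum,
// and the first counter is explicitly 8-byte aligned on every target.
struct EntryAfter {
    const char *name;
    int parent;
    int canonical_id;
    EntryKind kind;
    int buffer_func_id;
    alignas(8) uint64_t time;
};
```

Checking `offsetof(EntryAfter, time) % 8 == 0` in a `static_assert` is one way to keep such a layout from regressing silently on a 32-bit build.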

find_or_create_pipeline took const uint64_t *func_names and indexed it with an 8-byte stride. Halide's IR-side make_struct of the per-Func Handle pointers codegens as [N x ptr], so on 32-bit each entry is only 4 bytes — the runtime would read every other entry as garbage past the first half of the array. Main avoided this by using Allocate(Handle(), {num_funcs}) + Stores (Halide's Handle type is fixed at 64 bits regardless of target), but this branch uses make_struct. Change the runtime signature to const char *const *func_names so the stride matches the actual pointer width emitted by make_struct. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
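The stride mismatch is easy to see with plain offset arithmetic. The helpers below are an illustrative model only (none of these names exist in Halide): they treat the name table as `num_entries` entries of `entry_width` bytes and ask where a reader using a given stride ends up.

```cpp
#include <cstddef>

// Byte offset of the i-th read when the reader assumes `stride` bytes per entry.
constexpr std::size_t read_offset(std::size_t i, std::size_t stride) {
    return i * stride;
}

// True if the i-th read of `stride` bytes stays inside a table of
// `num_entries` entries that are each `entry_width` bytes wide.
constexpr bool read_in_bounds(std::size_t i, std::size_t stride,
                              std::size_t num_entries, std::size_t entry_width) {
    return read_offset(i, stride) + stride <= num_entries * entry_width;
}

// Index of the emitted entry at which the i-th read starts.
constexpr std::size_t first_entry_touched(std::size_t i, std::size_t stride,
                                          std::size_t entry_width) {
    return read_offset(i, stride) / entry_width;
}
```

For a four-entry table of 4-byte pointers, a 4-byte stride reaches all four entries, while an 8-byte stride starts its second read at entry 2 and runs off the end of the table on its third read.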

mcourteaux · 2026-05-29T09:59:28Z

Given the column width of the name is short right now, I'd do a few things:

- Lengthen it? Compute the required width before printing?
- Mark device-to-host copies with a prefix [D2H].
- Mark host-to-device copies with a prefix [H2D].

abadams · 2026-05-29T15:41:05Z

The follow-up PRs use up all that space, so I'd rather not mess with formatting right now, and instead tweak it once the whole thing is in place.

Allocation-group buffers (created by FuseGPUThreadLoops when it fuses shared/heap allocations) have names that don't match any Func, so the previous code skipped tracking them — the runtime would have aborted on func_id == -1, and even if it didn't, the bytes would have been dropped on the floor. Mint an allocation-kind entry for those buffers and bill the memory_allocate/memory_free calls to it. Render the row as the participating Funcs (e.g. "f1$0.0,f2$0.1.buffer") by splitting on the "allocgroup__" tag and joining with commas. The rendering is a placeholder pending a follow-up PR that splits the size across the participants via the counter machinery; the TODO in id_for_entry sketches the plan. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
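The split-and-join rendering could be sketched as below. This is not the code from the PR: `render_alloc_group` is a hypothetical name, and the exact shape of an allocation-group buffer name is an assumption, since the commit message shows only the rendered result and the "allocgroup__" tag.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Split `name` on every occurrence of `tag`, drop empty pieces, and join the
// remaining pieces with commas. The default tag is the one named in the
// commit message; the input format itself is an assumption.
std::string render_alloc_group(const std::string &name,
                               const std::string &tag = "allocgroup__") {
    assert(!tag.empty());
    std::vector<std::string> parts;
    std::size_t pos = 0;
    while (pos <= name.size()) {
        std::size_t next = name.find(tag, pos);
        if (next == std::string::npos) {
            next = name.size();
        }
        if (next > pos) {
            parts.push_back(name.substr(pos, next - pos));
        }
        pos = next + tag.size();
    }
    std::string out;
    for (std::size_t i = 0; i < parts.size(); i++) {
        if (i > 0) {
            out += ",";
        }
        out += parts[i];
    }
    return out;
}
```

Given a hypothetical input in which each participant is introduced by the tag, such as "allocgroup__f1$0.0allocgroup__f2$0.1.buffer", this yields the row text quoted above; a name without the tag passes through unchanged.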

…pute_without_counters

abadams and others added 30 commits April 30, 2026 10:23

Add a system of performance counters. WIP.

f8de2ce

Add performance warnings and Func nesting

69b6f59

Add more performance warnings; better printing

842222d

Add warning for large expensive frees

1c9fab9

Merge remote-tracking branch 'origin/main' into abadams/profile_recom…

12f238d

…pute

Move profiling later so flatten nested ramps is accounted for

a963da7

Add warning for allocation in a serial loop

189be1f

Profiler: drop stray debug(0) print

c1e0ee9

Removes a "Clearing func stats" debug-level-0 print left over from earlier development. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into abadams/profile_recom…

79b1549

…pute

Target is a struct, not a class

1b54a18

Merge branch 'abadams/profile_recompute' of https://github.com/halide…

3e575c2

…/Halide into abadams/profile_recompute

Delete dead code; clang-format

9d10419

Merge branch 'abadams/fix_exponential_skip_stages' into abadams/profi…

d7b52e7

…le_recompute

abadams and others added 16 commits May 26, 2026 09:45

Merge remote-tracking branch 'origin/main' into abadams/profile_recom…

18321b3

…pute

Merge remote-tracking branch 'origin/abadams/sliding_window_bounds_re…

e94034c

…write_level' into abadams/profile_recompute

Remove inlined_doubling_chain test

e74f39d

This stress test for inliner cached-body blow-up belongs with the inliner-speedup work on a separate branch, not the profiler branch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Remove stale commented-out abort_if_false in profiler reset

4adb721

It was intentionally removed; no need to keep the dead line. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Merge remote-tracking branch 'origin/main' into abadams/profile_recom…

8826b8d

…pute_without_counters

Apply pre-commit auto-fixes

9dc238e

abadams and others added 7 commits May 28, 2026 11:10

Add profiler_instances build to CMake

76b8424

profiler_instances aottest: drop unused parent_name/parent_id_of helpers

9b3d3a4

CI flagged them as unused after the earlier check pruning. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

abadams and others added 3 commits May 29, 2026 09:08

Merge remote-tracking branch 'origin/main' into abadams/profile_recom…

1f6e526

…pute_without_counters

Apply pre-commit auto-fixes

ff230c2

abadams mentioned this pull request May 29, 2026

Profiler: Add counters and recompute statistics #9164

Draft


abadams wants to merge 75 commits into main from abadams/profile_recompute_without_counters


3 participants
