Feature/issue 3311 test thread tbb exp by drezap · Pull Request #3314 · stan-dev/math

drezap · 2026-04-29T21:17:21Z

Summary

I wrote a class that contains an operator for exp, which allows us to use TBB (tbb::parallel_for) to parallelize the for loop. At a low number of observations the gain from parallelization is marginal, but at a higher number of observations (around 32,000) there seems to be a speedup at 4 threads that is sustained as we increase the size of the container.

Tests

I tested for numerical accuracy, which checks out. Moreover, I did the following performance tests:

Low number of observations with threading, without threading, and scaling the number of threads: the result seems to vary with the number of processes running on my computer, but the speedup is marginal.
N=10mm, scaling the number of threads: does not crash, but after a certain number of threads the speedup plateaus and there is no gain from adding more threads.
Scaling N with a fixed number of threads: beyond a certain number of observations (2^15) there is a definite speedup even at 4 threads. At 2 threads, we don't start to see an advantage until N=2^30, but with more threads the advantage kicks in at a lower number of observations.

Side Effects

Yes. If we kick in threads too early, there is actually a slowdown when computing exp on a vector with a low number of observations. It might be good to have a default minimum size for threading, so that threads kick in only when the dataset is large enough. Moreover, this is just one function, so the result may be different for a composite function (Gaussian). I think threading may be advantageous there at a lower number of observations, but I have not evaluated this.
What I've done is add a preprocessor directive that runs the multithreaded code only for vectors, and otherwise calls the original exp code (which is copy-pasted into the STAN_THREADS section). I'd be open to a quick refactor if we wanted to set it up like OpenCL, with a threads directory under stan\math\prim.
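
The size cutoff discussed above can be sketched as follows. This is a minimal illustration, not the code in this PR: it uses std::thread from the standard library in place of tbb::parallel_for so that it compiles without TBB, and the names exp_thresholded, exp_range, and kMinParallelSize are hypothetical. The 2^15 cutoff only echoes the crossover reported in the tests above and would differ per machine.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical cutoff: inputs smaller than this stay on the serial path.
// 2^15 mirrors the crossover reported for 4 threads; it is machine dependent.
constexpr std::size_t kMinParallelSize = std::size_t{1} << 15;

// Apply exp to the half-open index range [begin, end) of `in`, writing to `out`.
inline void exp_range(const std::vector<double>& in, std::vector<double>& out,
                      std::size_t begin, std::size_t end) {
  for (std::size_t i = begin; i < end; ++i) {
    out[i] = std::exp(in[i]);
  }
}

// Elementwise exp that only spawns threads when the input is large enough.
inline std::vector<double> exp_thresholded(const std::vector<double>& x,
                                           std::size_t n_threads) {
  std::vector<double> out(x.size());
  if (n_threads < 2 || x.size() < kMinParallelSize) {
    exp_range(x, out, 0, x.size());  // serial path: no thread overhead
    return out;
  }
  // Split the index range into contiguous chunks, one per thread.
  const std::size_t chunk = (x.size() + n_threads - 1) / n_threads;
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < n_threads; ++t) {
    const std::size_t begin = std::min(x.size(), t * chunk);
    const std::size_t end = std::min(x.size(), begin + chunk);
    if (begin == end) break;
    // Each worker writes a disjoint slice of `out`, so no locking is needed.
    workers.emplace_back(
        [&x, &out, begin, end] { exp_range(x, out, begin, end); });
  }
  for (auto& w : workers) w.join();
  return out;
}
```

Calling exp_thresholded(x, 4) on a 100-element vector takes the serial branch, while a 100,000-element vector is split across four threads. A fixed compile-time cutoff like this is the simplest option; it does not adapt to the machine it runs on.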

Release notes

?

Checklist

Copyright holder: (Andre Zapico, Likely LLC, 2026)

The copyright holder is typically you or your assignee, such as a university or company. By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:
- Code: BSD 3-clause (https://opensource.org/licenses/BSD-3-Clause)
- Documentation: CC-BY 4.0 (https://creativecommons.org/licenses/by/4.0/)
the basic tests are passing
- unit tests pass (to run, use: ./runTests.py test/unit)
- header checks pass, (make test-headers)
- dependencies checks pass, (make test-math-dependencies)
- docs build, (make doxygen)
- code passes the built in C++ standards checks (make cpplint)
the code is written in idiomatic C++ and changes are documented in the doxygen
the new changes are tested

parallel_for, blocked range compiles for stan::math::exp compiling blocked_range works fine some progress, now a type deduction issue? ok something closer... implement struct version for parallel_for... uncompiled begin new class to use parallel for almost compiles... getting close, have template deduction failed which we can figure out almost compiles hold on compiles remove dead code compiled parallel_for, blocked_range for stan::math::exp compiled parallel_for, blocked_range for stan::math::exp

drezap · 2026-04-29T21:18:41Z

Hold on, sorry I should re-base. I have some questions, wondering if anyone had comments or is this all on me? Refactor, and using threads at lower number of observations.

…rezap/math into feature/issue-3311-test-thread-tbb-exp

SteveBronder · 2026-04-30T21:45:56Z

Do you have a graph that shows the speedup? Overall I'd be kind of cautious introducing lower level threading like this. Like you saw, whether you get a speedup or slowdown depends a lot on the number of observations. So for every vector operation we would have to have a check that the size exceeded some threshold. That threshold is going to vary a lot per computer, and I think if we are not careful it could make the codebase kind of funky.

The other piece here is that this works for prim functions of double type, but parallelism is much harder for reverse mode which is the main piece of the math library we worry about. The main issue is handling how the global AD tape should sync when we have jobs across N threads. @andrjohns thought for a long while trying to figure out how to do a nice parallel map(...) style function for reverse mode autodiff. I'm not sure he came up with something he found satisfying. I have not either honestly. Essentially you need to shard the operation over N shards which will have N autodiff stacks, then once the parallel computation is done you have to pass those autodiff stacks back and put them onto the main thread's stack. So there you would get performance benefits for setting up the forward pass in parallel, but then the reverse pass would still be serial and you pay the cost of the sharding and thread startup. I'm very certain there is a way to do it so you can do the forward and reverse pass in parallel, but nothing has ever come to me for this problem.
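
The sharding scheme described above can be illustrated with a toy tape. This is only a sketch under strong assumptions: it does not use Stan's autodiff types or stack, the only recorded operation is exp feeding a sum, and TapeEntry, ShardResult, forward_shard, and sum_exp_sharded are hypothetical names. It shows the shape of the idea: each shard records its own tape in parallel, the tapes are appended to a main tape, and the reverse sweep is still serial.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// One recorded operation y = exp(x[input]); stores the local partial dy/dx.
struct TapeEntry {
  std::size_t input;
  double partial;
};

// Per-shard state: a private tape plus the shard's partial sum.
struct ShardResult {
  std::vector<TapeEntry> tape;
  double value = 0.0;
};

// Forward pass for one shard; it only touches its own ShardResult.
inline void forward_shard(const std::vector<double>& x, std::size_t begin,
                          std::size_t end, ShardResult& shard) {
  for (std::size_t i = begin; i < end; ++i) {
    const double y = std::exp(x[i]);
    shard.value += y;
    shard.tape.push_back({i, y});  // d exp(x)/dx = exp(x)
  }
}

struct ValueAndGradient {
  double value;
  std::vector<double> grad;
};

// Computes f(x) = sum_i exp(x_i) and its gradient.
inline ValueAndGradient sum_exp_sharded(const std::vector<double>& x,
                                        std::size_t n_shards) {
  n_shards = std::max<std::size_t>(1, n_shards);
  std::vector<ShardResult> shards(n_shards);
  const std::size_t chunk = (x.size() + n_shards - 1) / n_shards;

  // Parallel forward pass: one thread and one private tape per shard.
  std::vector<std::thread> workers;
  for (std::size_t s = 0; s < n_shards; ++s) {
    const std::size_t begin = std::min(x.size(), s * chunk);
    const std::size_t end = std::min(x.size(), begin + chunk);
    workers.emplace_back([&x, &shards, s, begin, end] {
      forward_shard(x, begin, end, shards[s]);
    });
  }
  for (auto& w : workers) w.join();

  // Serial merge: append every shard tape to the main tape.
  std::vector<TapeEntry> main_tape;
  double value = 0.0;
  for (const auto& shard : shards) {
    main_tape.insert(main_tape.end(), shard.tape.begin(), shard.tape.end());
    value += shard.value;
  }

  // Serial reverse sweep over the merged tape, last entry first.
  std::vector<double> grad(x.size(), 0.0);
  const double adjoint_of_sum = 1.0;  // d f / d f
  for (auto it = main_tape.rbegin(); it != main_tape.rend(); ++it) {
    grad[it->input] += adjoint_of_sum * it->partial;
  }
  return {value, grad};
}
```

Even in this toy, the merge and the reverse sweep run on one thread, which is the cost the comment above points at: only the forward recording is parallel.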

drezap · 2026-04-30T22:05:06Z

I'm thinking about it; I haven't thought too far ahead yet. Thank you.

drezap · 2026-05-01T20:26:11Z

I'm doing continuous integration tests, it looks like it's mostly passing now.
Remaining:

I haven't thought about rev autodiff yet.
I want to see what's going to happen with posteriordb tests
I'll do a run with some Stan models on this branch locally so nothing breaks
Refactor so that the threaded code is in its own directory, like OpenCL. What I did was add a preprocessor directive, copy-paste the unthreaded prim code, and then thread only the part that was vectorized. This is kind of sloppy.
I also just wrapped the complex tests and code in #if 0 / #endif in the threaded version. I guess I could template it if it's really desired, but to speed things up I just didn't compile complex number support. If it's used a lot, I can fix it.

And I need to consider threading the rev autodiff stack, that would be cool, if different threads could build different expression trees, I think that's what Steve was saying.

But if this adds incremental speed increase, why not?

WRT Steve's comment, I can think about it, but here I'm not parallelizing anything on the stack, just the evaluation of exp, so that's a bit of a different topic.

stan-buildbot · 2026-05-02T00:26:56Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_regr/gp_regr.stan	0.1	0.09	1.12	10.42% faster
gp_regr/gen_gp_data.stan	0.03	0.02	1.12	10.64% faster
arK/arK.stan	2.01	1.73	1.16	13.67% faster
eight_schools/eight_schools.stan	0.06	0.05	1.13	11.44% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	9.35	8.41	1.11	10.11% faster
pkpd/one_comp_mm_elim_abs.stan	20.42	18.56	1.1	9.14% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.27	0.24	1.11	9.86% faster
sir/sir.stan	75.1	72.27	1.04	3.76% faster
gp_pois_regr/gp_pois_regr.stan	3.05	2.94	1.04	3.61% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan	2.83	2.79	1.02	1.56% faster
irt_2pl/irt_2pl.stan	4.54	4.41	1.03	2.86% faster
arma/arma.stan	0.32	0.31	1.01	1.22% faster
garch/garch.stan	0.48	0.46	1.04	3.4% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.01	0.01	1.05	4.86% faster
performance.compilation	221.46	228.05	0.97	-2.98% slower
Mean result: 1.069128278114437

Jenkins Console Log
Blue Ocean
Commit hash: e0729e1cdec40e8ec3da60b40b20a2cfc223fc94

Machine information

No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 20.04.3 LTS Release: 20.04 Codename: focal

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping: 4
CPU MHz: 2400.000
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 1.3 MiB
L1i cache: 1.3 MiB
L2 cache: 40 MiB
L3 cache: 55 MiB
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Vmscape: Mitigation; IBPB before exit to userspace
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d arch_capabilities

G++:
g++ (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Clang:
clang version 10.0.0-4ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

drezap · 2026-05-02T02:18:30Z

Not sure why Jenkins emailed me SUCCESS when there are so many errors? I'm not seeing these locally.

I also named the branch wrong, but I'll just leave it until it's closed...

stan-buildbot · 2026-05-02T09:53:31Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_regr/gp_regr.stan	0.1	0.1	0.99	-1.25% slower
gp_regr/gen_gp_data.stan	0.03	0.02	1.02	1.91% faster
arK/arK.stan	1.89	1.89	1.0	-0.06% slower
eight_schools/eight_schools.stan	0.06	0.06	1.02	1.5% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	9.1	9.2	0.99	-1.08% slower
pkpd/one_comp_mm_elim_abs.stan	20.18	20.28	0.99	-0.51% slower
pkpd/sim_one_comp_mm_elim_abs.stan	0.26	0.26	0.99	-0.98% slower
sir/sir.stan	74.16	74.3	1.0	-0.19% slower
gp_pois_regr/gp_pois_regr.stan	2.89	2.91	1.0	-0.48% slower
low_dim_gauss_mix/low_dim_gauss_mix.stan	2.82	2.79	1.01	1.0% faster
irt_2pl/irt_2pl.stan	4.52	4.47	1.01	1.22% faster
arma/arma.stan	0.31	0.31	1.01	0.53% faster
garch/garch.stan	0.46	0.45	1.01	1.4% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.01	0.01	1.13	11.68% faster
performance.compilation	234.12	225.05	1.04	3.87% faster
Mean result: 1.0136082132108795

Jenkins Console Log
Blue Ocean
Commit hash: e0729e1cdec40e8ec3da60b40b20a2cfc223fc94

stan-buildbot · 2026-05-06T03:19:47Z

Name	Old Result	New Result	Ratio	Performance change( 1 - new / old )
gp_regr/gp_regr.stan	0.1	0.09	1.12	10.83% faster
gp_regr/gen_gp_data.stan	0.03	0.02	1.1	8.74% faster
arK/arK.stan	2.02	1.81	1.12	10.37% faster
eight_schools/eight_schools.stan	0.06	0.06	1.11	10.25% faster
low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan	9.78	8.93	1.1	8.72% faster
pkpd/one_comp_mm_elim_abs.stan	21.57	19.97	1.08	7.38% faster
pkpd/sim_one_comp_mm_elim_abs.stan	0.29	0.26	1.1	8.85% faster
sir/sir.stan	81.14	72.68	1.12	10.42% faster
gp_pois_regr/gp_pois_regr.stan	3.25	2.89	1.12	11.09% faster
low_dim_gauss_mix/low_dim_gauss_mix.stan	3.03	2.77	1.1	8.8% faster
irt_2pl/irt_2pl.stan	5.74	4.46	1.29	22.31% faster
arma/arma.stan	0.45	0.31	1.47	31.98% faster
garch/garch.stan	1.37	0.45	3.06	67.32% faster
low_dim_corr_gauss/low_dim_corr_gauss.stan	0.01	0.01	1.18	15.48% faster
performance.compilation	237.47	233.99	1.01	1.46% faster
Mean result: 1.2714977998013617

Jenkins Console Log
Blue Ocean
Commit hash: e0729e1cdec40e8ec3da60b40b20a2cfc223fc94

Machine information

No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 20.04.3 LTS Release: 20.04 Codename: focal

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 80
On-line CPU(s) list: 0-79
Thread(s) per core: 2
Core(s) per socket: 20
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
Stepping: 4
CPU MHz: 3270.084
CPU max MHz: 3700.0000
CPU min MHz: 1000.0000
BogoMIPS: 4800.00
Virtualization: VT-x
L1d cache: 1.3 MiB
L1i cache: 1.3 MiB
L2 cache: 40 MiB
L3 cache: 55 MiB
NUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78
NUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP conditional; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Vmscape: Mitigation; IBPB before exit to userspace
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd mba ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx smap clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req pku ospke md_clear flush_l1d arch_capabilities

G++:
g++ (Ubuntu 9.4.0-1ubuntu1~20.04) 9.4.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Clang:
clang version 10.0.0-4ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

drezap · 2026-05-07T19:47:13Z

I'm getting Docker/Jenkins set up.

But it looks like it's passed on jenkins: https://jenkins.flatironinstitute.org/blue/organizations/jenkins/Stan%2FMath/detail/PR-3314/11/pipeline

But not showing up on github? Am I doing something wrong?

WardBrian · 2026-05-07T20:06:10Z

The most recent commit appears to have passed on Jenkins and failed about half of the Github Actions checks, seemingly all the Windows-based ones.

SteveBronder · 2026-05-08T16:48:53Z

@drezap it would be better to test your new code via Google benchmark instead of running jenkins a bunch. I have a repo below with everything setup to run Google benchmark and Stan with specific branches.

https://github.com/SteveBronder/stan-perf

drezap · 2026-05-08T18:00:42Z

@SteveBronder Thank you, very helpful. And I'm taking a closer look at Jenkins: https://jenkins.flatironinstitute.org/blue/organizations/jenkins/Stan%2FMath/activity?branch=PR-3314,

Am I correct to say that push 6837a52 passed? I don't intend on pushing this, but if it's passing this is helpful information. That push was faster. But I'm seeing a bunch of red on github, and usually the red X will turn into a green check-mark.

Thank you.

SteveBronder · 2026-05-08T21:28:58Z

> Am I correct to say that push 6837a52 passed? I don't intend on pushing this, but if it's passing this is helpful information.

Looking at the jenkins it seems like all of the commits after 6837a52 passed, though it seems like your current commit is passing jenkins while failing on the other CI. So I'm not sure if those commits passed the other CI either

> That push was faster. But I'm seeing a bunch of red on github, and usually the red X will turn into a green check-mark.

Do you mean the results from the stan build-bot running the performance regression tests? For a lower level change like this those performance tests are probably too high level for us to be able to reason about the effects. You should set this up as a google benchmark test. That should be nice to run and analyze the results locally.

I also want to reference my earlier comment #3314 (comment) . "When will parallelism be worth it" is going to be a pretty hard question to decide at runtime and I'm not confident we want that complexity just for speeding up the prim functions. If you are interested in this I would focus on seeing if there are ways you can do parallelism on the reverse mode code. That is a pretty hard problem though that I have not found a nice answer to yet.

drezap · 2026-05-09T09:29:58Z

> Looking at the jenkins it seems

Ok, let me check Travis CI. I have to change my local software to match so I can reproduce.

> Do you mean the results from the stan build-bot running the performance regression tests?

No, I mean the local benchmarks, in C++ only, which were objectively faster in the suspected cases, measured using internal print statements. I attached a file, benchmarking_multhreading.txt, but it's not fun to sift through. There's only one iteration for the non-threaded case, because there is nothing to scale there, but initially I tried scaling by threads, blocks, dataset size, etc. It's a repeatable experiment. The non-perfect-forwarding version is faster. I can do more robust tests.

EDIT:
And then the benchmarks were run with these scripts, on this branch: `test/unit/math/prim/fun/exp_test.cpp`. But I was modifying the typing, and you can reproduce it via the pushes. If not, I'm down to do a screen share and I can just show you. Maybe 10-15 minutes.

Reading some literature today, it agreed with some of my thoughts about speed when distributing and collecting threads. I was looking at C++ Concurrency in Action: Practical Multithreading (Williams, 2012). But when I removed const, passed by reference, and added an lvalue instantiation (type &blah = a;), it made it way slower. You guys would probably know why. In the last commit, I did an lvalue instantiation with my_a, not const, and then also initialized another lvalue instantiation within a function in the class, and it made it slower. I think I'm making extra copies somewhere in memory? But commit 6837a62 was the fastest and agreed with some of the literature (i.e. too many threads caused a slowdown, but with the right number of threads this was faster), and it also passed all Jenkins tests.

> If you are interested in this I would focus on seeing if there are ways you can do parallelism on the reverse mode code. That is a pretty hard problem though that I have not found a nice answer to yet.

Ok, but if simple concurrency (multithreading) adds performance gains and is worth adding, add it; if not, I'm not offended.

For reverse mode autodiff, can you give me a more formal project spec in an issue? Then I'll look into it. Are you asking whether, given functions f_i(.), we can send a different thread through each function to build the expression trees in parallel? So suppose we need to compute derivatives for f(.) and g(.): do we want to build the two expression trees at once using concurrency? I'm trying to specify the problem more clearly. Perhaps I'm not understanding.

EDIT: WRT Travis CI:
https://app.travis-ci.com/github/stan-dev/stan-dev.github.io/builds/259743285

The last updates I'm seeing are from 3 years ago? Am I missing something?

SteveBronder · 2026-05-11T18:43:18Z

> And then the benchmarks were run with these scripts, on this branch: `test/unit/math/prim/fun/exp_test.cpp`. But I was modifying the typing, and you can reproduce it via the pushes. If not, I'm down to do a screen share and I can just show you. May be 10-15 minutes.

I would recommend forking the stan-perf repo. That has everything set up for benchmarking Stan. When you use Google Benchmark via that repo it is also easy to export the benchmark results to JSON or CSV. Plus, Google Benchmark has a lot of tools for testing multiple matrix sizes and numbers of threads.

https://github.com/SteveBronder/stan-perf

> Ok, but if concurrency (multithreading) simple stuff adds performance gains, worth adding, add it, if not, I'm not offended.

The main issue is, when is the overhead of threading low enough that running in parallel is worth it? I.e., if a user has a vector of 100 elements, how many threads should be used for exp(x)? None? Two? Threading has a decently high overhead cost, and you pay for it on each instantiation of threading. So for small problems the answer is most likely single threaded. When we are trying to add automatic parallelization we will need to ask at runtime "how many threads should this operation use given the data size?" That requires understanding a lot of information about how fast a user's particular machine can calculate a function, and it is a pretty hard problem. Eigen does this, but only for matrix multiplication, as they have very good runtime logic to detect whether sharding a large matrix multiply across threads is worth it.
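
To make the runtime question concrete, here is a minimal sketch of the simplest possible answer: cap the thread count so that each thread gets at least some minimum amount of work. The function choose_threads and its min_per_thread parameter are hypothetical, and picking a good min_per_thread for a given function and machine is exactly the hard part described above; this sketch does not solve it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Pick a thread count for an elementwise operation over n elements.
// min_per_thread is a hypothetical, machine-specific tuning value: the
// smallest chunk for which starting a thread is assumed to pay off.
inline std::size_t choose_threads(std::size_t n, std::size_t min_per_thread,
                                  std::size_t max_threads) {
  if (min_per_thread == 0 || max_threads == 0) {
    return 1;  // degenerate settings: fall back to single threaded
  }
  const std::size_t by_size = n / min_per_thread;  // threads the data can feed
  return std::max<std::size_t>(1, std::min(by_size, max_threads));
}
```

With min_per_thread = 8192 and max_threads = 8, a 100-element vector gets 1 thread, a 2^15-element vector gets 4, and a 2^20-element vector is capped at 8. A real heuristic would also have to account for the cost per element of the function being applied, which differs between exp and, say, addition.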

> For reverse mode autodiff, can you give me a more formal project spec in an issue? Then I'll look into it. You're seeing if, given a functions_i, f(.), we can send a different thread through each function to build the expression tree in parallel? So suppose we need to compute derivatives for f(.) and g(.), we want to build two expression trees at once using concurrency? I'm trying to specify the problem more clearly. Perhaps I'm not understanding.

I'll try to do a writeup this week or next. There was a previous discussion on doing this (#1918), which shows a bit of the scope.

drezap · 2026-05-26T11:56:56Z

I am talking to myself, but my computer was stolen. I plan on continuing this, so please do not delete this PR. I'm going to revert to the faster version and then try to use Steve's test suite. But no computer. Cheers everyone, happy developing. :)

SteveBronder · 2026-05-27T16:26:42Z

Sorry to hear your computer was stolen! Until we have a better idea of what this PR should cover and the benchmarks are clearer, I think we should close the PR. To be clear, that will not delete the branch with your code. All code will still be accessible via the branch issue-3311-test-thread-tbb-exp.

drezap · 2026-05-27T16:44:19Z

If you'd like to define more clearly what should be covered, via Discourse, that would be great. Until then, I see no reason why this should be closed. What would you like to see in order for a merge to happen?


drezap · 2026-05-27T17:11:52Z

I have clearly shown that this increases speed. I'm open to suggestions as to what might increase execution speed. I couldn't set up your benchmark report quickly enough.


SteveBronder · 2026-05-27T17:50:52Z

I'm reposting my reply from here

The main issue is: when does the overhead of threading make it worth running in parallel? For example, if a user has a vector of 100 elements, how many threads should be used for exp(x)? None? Two? Threading has a decently high overhead cost, and you pay that cost for each instantiation of threading. So for small problems the answer is most likely single threaded. When we try to add automatic parallelization, we will need to ask at runtime "how many threads should this operation use given the data size?" That requires understanding a lot of information about how fast a user's particular machine can calculate a function, and it is a pretty hard problem. Eigen does this, but only for matrix multiplication, as they have very good runtime logic to detect whether sharding a large matrix multiply across threads is worth it.

The logic for deciding at runtime whether a particular function is worth moving over to the parallel cpu version is going to be a lot of developer and runtime overhead. imo the maintenance would not be worth it.

This has been attempted previously by @andrjohns (and I took a crack at it myself). You can see that whole conversation here

EDIT: Accidentally said gpu
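
For illustration, the runtime question above ("how many threads should this operation use given the data size?") can be sketched as a size cutoff. This is a minimal sketch, not the PR's code: it swaps in plain `std::thread` for `tbb::parallel_for` so that it is self-contained, and the names `exp_dispatch` and `kParallelThreshold` are hypothetical. The cutoff of 2^15 is only the size at which the PR summary reports a speedup at 4 threads on one machine; it is not a tuned or portable value.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical cutoff: inputs smaller than this are processed serially.
// 2^15 comes from the PR summary's 4-thread observation, not from tuning.
constexpr std::size_t kParallelThreshold = std::size_t{1} << 15;

// Elementwise exp that only spawns threads when the input is large enough.
inline std::vector<double> exp_dispatch(const std::vector<double>& x,
                                        unsigned n_threads) {
  std::vector<double> out(x.size());
  auto work = [&](std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) out[i] = std::exp(x[i]);
  };
  if (n_threads < 2 || x.size() < kParallelThreshold) {
    work(0, x.size());  // small input: thread startup would dominate
    return out;
  }
  // Large input: split into contiguous chunks, one chunk per thread.
  const std::size_t chunk = (x.size() + n_threads - 1) / n_threads;
  std::vector<std::thread> pool;
  for (unsigned t = 0; t < n_threads; ++t) {
    const std::size_t begin = std::min(x.size(), t * chunk);
    const std::size_t end = std::min(x.size(), begin + chunk);
    if (begin < end) pool.emplace_back(work, begin, end);
  }
  for (auto& th : pool) th.join();
  return out;
}
```

A fixed cutoff like this is the simplest possible policy. The hard part raised above is that the right cutoff depends on the machine and on the function being evaluated, which a single constant cannot capture.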

drezap · 2026-05-27T21:13:30Z

It's just a few directives, which are essentially if statements that determine whether a certain region of code gets compiled or not.

I'm only skimming this; I'm waiting on a bootloader for a free Mac I got.

I'm not seeing any direct comparisons between threaded and non-threaded code, and there seems to be some confusion between concurrency and running a process on different cores.

I'm with Bob: #1918 (comment)

Instead of chatting, let's come up with a concrete way of determining whether something is faster.

WRT maintenance, it's like 3 lines of code and some directives. Easy to maintain.

I seem to have accidentally rediscovered Amdahl's law. So I propose we come up with concrete objectives for the benchmarks, and if it's faster we proceed. Also, typing matters.

I'm not sure how reliable the posteriorDB estimates are, but locally in Stan/math, parallelization with tbb was faster within limits (the number of threads matters, etc.). If we run this on exp with many evaluations of a Gaussian distribution, for example, over thousands of iterations, this could be worth it. But to play devil's advocate, recollecting threads could also slow it down.

Again, I'm handicapped with no computer.

In summary, I don't think the linked thread effectively evaluates whether this is faster or not. All HPC devs use threading, no? Any ringers we can bring in?

But wrt maintenance, easy.
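
To make the "directives" point concrete, here is a minimal sketch of a compile-time guard. Only the `STAN_THREADS` macro name comes from this PR's description; the function names `exp_range` and `exp_vec` are hypothetical, and the threaded branch uses a single extra `std::thread` as a stand-in for the `tbb::parallel_for` call in the actual branch.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Shared kernel: writes exp(x[i]) into out[i] for i in [begin, end).
inline void exp_range(const std::vector<double>& x, std::vector<double>& out,
                      std::size_t begin, std::size_t end) {
  for (std::size_t i = begin; i < end; ++i) out[i] = std::exp(x[i]);
}

#ifdef STAN_THREADS
#include <thread>
// Compiled only when STAN_THREADS is defined: two halves run concurrently.
inline std::vector<double> exp_vec(const std::vector<double>& x) {
  std::vector<double> out(x.size());
  const std::size_t mid = x.size() / 2;
  std::thread upper([&] { exp_range(x, out, mid, x.size()); });
  exp_range(x, out, 0, mid);
  upper.join();
  return out;
}
#else
// Compiled otherwise: the original serial loop.
inline std::vector<double> exp_vec(const std::vector<double>& x) {
  std::vector<double> out(x.size());
  exp_range(x, out, 0, x.size());
  return out;
}
#endif
```

The guard itself is cheap to maintain. What it does not answer is the runtime question: once `STAN_THREADS` is defined, this sketch threads every call regardless of input size.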


drezap · 2026-05-27T21:19:05Z

And again, in @andrjohns's thread I'm not seeing any direct comparisons between threaded and unthreaded code, i.e. there's no control and treatment group. We can't just guess. We need to evaluate it systematically.
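
A control-versus-treatment comparison could look roughly like the timing helper below. This is a sketch under stated assumptions rather than a replacement for Google Benchmark via the stan-perf repo recommended earlier in the thread: `median_ms` is a hypothetical name, it uses only `std::chrono`, and it assumes `reps` is at least 1.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Hypothetical helper: runs f() `reps` times (reps >= 1) and returns the
// median wall-clock time of a single run, in milliseconds.
template <typename F>
double median_ms(F&& f, int reps) {
  std::vector<double> times;
  times.reserve(reps);
  for (int r = 0; r < reps; ++r) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    times.push_back(
        std::chrono::duration<double, std::milli>(stop - start).count());
  }
  std::sort(times.begin(), times.end());
  return times[times.size() / 2];  // median is robust to occasional slow runs
}
```

The idea would be to call it on the same input for the unthreaded version (control) and the threaded version (treatment), across a grid of vector sizes and thread counts, and to make sure each run's result is actually used so the compiler cannot optimize the work away.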


drezap · 2026-05-27T21:25:31Z

And I answered these questions with my benchmarks, so it's not a big mystery: Amdahl's law seems to apply.
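
For reference, Amdahl's law in its standard form, where $f$ is the fraction of the runtime that can be parallelized and $p$ is the number of threads:

$$
S(p) = \frac{1}{(1 - f) + \dfrac{f}{p}}, \qquad \lim_{p \to \infty} S(p) = \frac{1}{1 - f}
$$

For example, with $f = 0.9$ and $p = 4$, $S = 1 / (0.1 + 0.225) \approx 3.08$, and no number of threads can exceed a speedup of $10$. This matches the plateau seen when scaling threads at fixed N. Note that the law by itself does not predict the slowdown at small N. One way to account for that, as an assumption rather than something measured in this PR, is to add a per-call threading overhead $c(p)$ to the parallel runtime:

$$
T(p) = T(1)\left[(1 - f) + \frac{f}{p}\right] + c(p)
$$

Threading only pays off when $T(p) < T(1)$, which for small inputs (small $T(1)$) fails because $c(p)$ dominates.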


drezap · 2026-05-28T00:07:58Z

What I am suggesting is that we ignore MCMC for now and just look at the runtime of evaluating probability distributions. Pretty much all of them use an exponential, so if there's even a slight gain in evaluating those computations, then it's totally worth adding, no? I think a lot of developers do this under the hood but don't expose it to users. Do the threads navigate through composite functions (e.g. the normal distribution)? No idea. But the tests I ran seemed to show improved performance, if we're not considering autodiff, and the code passed its tests. I am going for performance, not fancy publications, if that makes sense. I'm sure devs do this under the hood for game dev, etc. The code I added was only a few lines and some directives.


drezap · 2026-05-28T01:05:10Z

Here, I found this informative: ***@***.***/parallel-reduction-in-cuda-bba5e3d124b9


drezap · 2026-05-28T03:09:29Z

OK, I am reading through the old threading discussion a bit more thoroughly. It's cool, but it has many degrees of freedom, and it would be better to define specifically what we're trying to thread. Something as simple as concurrency in an operation that requires a lot of FLOPs could potentially add some speed. And then isolate autodiff later?

The conversation is going in a bunch of different directions and it's not concrete as to what we're trying to do. Systematically threading simple functions and evaluations of values for PDFs might be a starting point, if that adds speed. Threading autodiff is a different problem. But starting simple on an iterative algorithm could add cumulative gains. See what I'm saying? so there's a concrete gain as opposed to a convoluted research question?

So OK: thread this, benchmark on all PDFs, and then continue. Just evaluation, not gradients; then we could mess with autodiff more. I'm not sure what proportion of Stan's HMC is purely evaluation, but I think it's non-negligible and could be sped up. And then focus on autodiff after. Sound stupid?

On Wed, May 27, 2026, 9:04 PM Andre Zapico <***@***.***> wrote:

Here, I found this informative. ***@***.***/parallel-reduction-in-cuda-bba5e3d124b9

On Wed, May 27, 2026, 8:07 PM Andre Zapico <***@***.***> wrote:

> What I am suggesting is we ignore MCMC for now, and just go with runtime at evaluating prob distributions. Pretty much all of them use an exponential. So if there's a slight gain on evaluating computations then it's totally worth it to add, no? I think a lot of developers do this under the hood but don't expose it to users. Do the threads navigate through composite functions (i.e. normal distribution)? No idea. But the tests I ran seemed to improve performance, if we're not considering autodiff. They passed tests. I am going for performance, not fancy publications, if that makes sense. But I'm sure devs do this under the hood for game dev etc. The code I added was only a few lines, and some declaratives.
>
> On Wed, May 27, 2026, 5:25 PM Andre Zapico <***@***.***> wrote:
>
>> And I answered these questions with my benchmarks, so it's not a big mystery: Amdahl's law seems to apply.

SteveBronder · 2026-05-28T16:45:47Z

> I'm with Bob:
> #1918 (comment)
>
> Instead of chatting, let's come up with a concrete way of determining
> whether something is faster.

Yes, this is what stan-perf is specifically built for. It allows you to use Google Benchmark with the Stan math library and a branch to see how performance varies. You can see in the examples in that repo that I use multiple sizes of matrices, and you can also use Google Benchmark to benchmark a varying number of threads and matrix sizes.

> WRT maintenance, it's like 3 lines of code and some declaratives. Easy to
> maintain.

What goes in the if statement is the question. For instance, Eigen has ways to query information about the CPU to determine ballparks for whether it is worth dispatching to parallel versions of matrix multiplication. To do this well we would need something similar, and that is a lot of very hairy code. Then there is also the question of how many threads you should use for a given operation. The if statement will not just be if (N > some_number) -> parallel.
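
To make the shape of that decision concrete, here is a minimal sketch of the naive size-based rule. It is not Stan's or Eigen's logic: `choose_num_threads` and the tuning constant `min_elems_per_thread` are hypothetical names, and the constant would have to be measured per machine and per function, which is exactly the hard part described above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical heuristic (an assumption for illustration, not Stan or Eigen code):
// pick a thread count from the problem size alone.
// - n: number of elements in the vector
// - min_elems_per_thread: assumed tuning constant; in practice it depends on the
//   CPU, cache sizes, SIMD width, and the cost of the function being applied
// - hw_threads: e.g. the value of std::thread::hardware_concurrency(),
//   which may report 0 when it cannot be determined
inline std::size_t choose_num_threads(std::size_t n,
                                      std::size_t min_elems_per_thread,
                                      std::size_t hw_threads) {
  assert(min_elems_per_thread > 0);
  if (hw_threads == 0) hw_threads = 1;  // fall back to serial when unknown
  const std::size_t by_work = n / min_elems_per_thread;
  // Never fewer than one thread, never more than the hardware offers.
  return std::max<std::size_t>(1, std::min(by_work, hw_threads));
}
```

With an assumed constant of 8192 elements per thread, a 100-element vector stays serial and a 32,768-element vector gets 4 threads on an 8-thread machine. The rule ignores everything except N, which is why it is only a starting point.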

> What I am suggesting is we ignore MCMC for now, and just go with runtime at
> evaluating prob distributions.

For this we want to use Google Benchmark via stan-perf to test the exponential function's performance directly.

> See what I'm saying? so there's a concrete gain as opposed to a convoluted
> research question?

I'm honestly rather confused about what you are looking to do. As I've said before, handling the edge cases around parallelization for simple unary, binary, etc. operations is actually pretty difficult; otherwise we would have done this quite a while ago. And imo that level of parallelization is something that would be better to have in Eigen rather than Stan math. To make the overhead cost of spinning up threads worth it, you would want to chain together many operations on that thread so that the computation is worth it (Amdahl's Law). That is the reason why we have reduce_sum: it gives the user the ability to break an lpdf into smaller batches to compute on multiple threads. Doing that chunking automatically is a pretty large challenge.
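
For reference, Amdahl's law bounds the speedup on n threads as 1 / ((1 - p) + p / n), where p is the fraction of the runtime that can run in parallel. The helper below is only a sketch of that formula; the name `amdahl_speedup` is made up for illustration, and the formula does not model thread start-up cost, which is a separate overhead.

```cpp
#include <cassert>

// Amdahl's law: upper bound on speedup with n threads when a fraction p
// (0 <= p <= 1) of the serial runtime is parallelizable.
// Thread creation and scheduling overhead are NOT modeled here.
inline double amdahl_speedup(double p, unsigned n) {
  assert(p >= 0.0 && p <= 1.0 && n >= 1);
  return 1.0 / ((1.0 - p) + p / n);
}
```

For example, if only half of the work is parallelizable, two threads give at most a speedup of about 1.33, and no number of threads can exceed 2.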

drezap · 2026-05-28T18:22:08Z

I am not sure matrix operations are the best way to test whether parallelization is effective in the math library. Is Cholesky decomposition parallelizable? No, it's recursive. Block diagonal Cholesky, sure, since you can decompose each block in parallel. Is Gauss-Jordan elimination parallelizable? I don't think so.

Re: confusion: I'm looking to throw threads at anything possible. And not everything that's parallelizable requires a reduce sum. And it's also possible to abstract that away from the user.

Re: how many threads. We can get an approximation via MC simulation, am I right? This is the point.

Which PDFs and which functions have you tried parallelizing? Not sure how deep the threads travel. But if a log likelihood evaluation per iteration in an MCMC sampler has different parameter estimates every time (until convergence), it's not really parallelizable.

I'm thinking every vectorized operation can be parallelized and we can evaluate #threads through simulation, as a proxy, not a proof, and it can be abstracted away from the user.

And again, matrix operations are not the best "benchmark" for parallelization. I'm thinking anything vectorized (i.e. performing the same operation multiple times); you can't really parallelize something recursive, no? You're just adding overhead and collecting threads, etc.

What all did you benchmark on your Stan perf report besides matrix factorizations? I can't look at this right now, no computer.


SteveBronder · 2026-05-28T18:56:09Z

I think it would be helpful if you responded inline with quotes, as I'm having a hard time following your messages here.

> I am not sure matrix operations are the best way to test whether
> parallelization is effective in the math library. Is Cholesky decomp
> parallelizable? No, it's recursive. Block diagonal Cholesky, sure, since
> you can decompose each block in parallel. Is gauss Jordan elimination
> parallelizable? I don't think so.

I don't understand how this is related to your PR's goal. The matrix multiply example in stan-perf is just an example. You can then add your own examples for parallelism.

> Re: confusion: I'm looking to throw threads at anything possible.

There is a real cost to spinning up threads, both in terms of compute overhead and hardware resources. We need to be very mindful about this.

> And not everything that's parallelizable requires a reduce sum. And it's
> also possible to abstract that away from the user.

Many useful cases that we can do reverse mode on require the use of reduce_sum. As of now it is one of the few ways we know how to do reverse-mode autodiff in parallel.

I think a simple place to start with your project is doing the Google Benchmark for the exponential function as you have it, with varying thread counts and vector sizes.

> Re: how many threads. We can get an approximation via MC simulation, am I
> right? This is the point.

No, we cannot. The things we care about are the literal CPU, how many cores it has, the L1 and L2 cache sizes, the size of the vector, whether it is ARM, x86, etc., and what SIMD is available. If this were just a few simulations and if statements, I promise we would have done it already.

> What PDFs have you tried parallelizing which functions? Not sure how deep
> the threads travel. But if a log likelihood evaluation per iteration in an
> MCMC sampler has different parameters estimates every time (until
> convergence) not really parallelizable.

The reason we have reduce_sum is so that users can parallelize lpdf functions for their specific workload. Automatic parallelization is a rather hard problem, so we leave it to users to specify what batches of a problem should be done in parallel.
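
The batching idea can be sketched in plain C++ as follows. This is not Stan's reduce_sum and it does no autodiff; `chunked_sum_exp` and its `grainsize` parameter are hypothetical names used only to show a sum being split into user-sized batches that are evaluated as separate tasks and then combined.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <future>
#include <vector>

// Illustrative only (an assumption, not Stan's implementation): sums exp(x[i])
// by splitting the index range into batches of at most `grainsize` elements.
// Each batch runs as its own asynchronous task; partial sums are combined
// in batch order once all tasks have been launched.
inline double chunked_sum_exp(const std::vector<double>& x,
                              std::size_t grainsize) {
  assert(grainsize > 0);
  std::vector<std::future<double>> parts;
  for (std::size_t begin = 0; begin < x.size(); begin += grainsize) {
    const std::size_t end = std::min(begin + grainsize, x.size());
    parts.push_back(std::async(std::launch::async, [&x, begin, end] {
      double partial = 0.0;
      for (std::size_t i = begin; i < end; ++i) partial += std::exp(x[i]);
      return partial;
    }));
  }
  double total = 0.0;
  for (auto& part : parts) total += part.get();  // blocks until the batch is done
  return total;
}
```

The caller picks `grainsize`, so a small value creates many short tasks and pays the task overhead repeatedly, while a value equal to the vector length degenerates to a single task. Choosing it automatically is the open problem discussed here.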

> I'm thinking every vectorized operation can be parallelized and we can
> evaluate #threads through simulation, as a proxy, not a proof, and it can
> be abstracted away from the user.

See my comment above about the different levers in play here. This is a large task and imo not one that the Stan math library wants to maintain.

> And again, matrix operations are not the best "benchmark" for
> parallelization. I'm thinking anything vectorized (i.e. performing the same
> operation multiple times), you can't really parallelize something
> recursive, no? You're just adding overhead and collecting threads, etc.

The example in the stan-perf repo tests Struct of Array and Array of Struct matrices; it is not testing parallelism.

> What all did you benchmark on your Stan perf report besides matrix
> factorizations? I can't look at this rn no computer.

You can see the different benchmarks in the benchmarks folder.

I'm going to close this until we have a more definite idea of what we want to do. If you have interest in this, I think it would be better to start with an issue along with Google Benchmark code that shows the performance of your idea. fyi, this does not remove your code; it is still at drezap:feature/issue-3311-test-thread-tbb-exp

drezap · 2026-05-28T19:39:07Z

I've already benchmarked it and answered several of your questions. I have an issue open already. Your matrix multiplication benchmarks are in no way comprehensive about answering whether parallelizing functions will scale. And nothing I have built for this project, or any project, has ever needed maintenance.

If you want to take a look at the "levers in play," you can take a look at me benchmarking certain parallelization parameters (i.e. #threads, block size) which will give some insight as to how the different levers perform. You can run multiple threads on one core. I am talking specifically about parallelism. So why are you pointing me to this report?

And sure, if your "levers" are something you'd like to evaluate, please itemize them. I've already done some tests to show overhead of initiating threads. What levers would you like to see pulled?


drezap · 2026-05-28T20:19:12Z

@SteveBronder thoughts? I am not sure why you're talking about something other than parallelization when that is the main topic I'm talking about. It does not make sense.


SteveBronder · 2026-05-28T20:27:57Z

> I've already benchmarked it and answered several of your questions. I have
> an issue open already. Your matrix multiplication benchmarks are in no way
> comprehensive about answering whether parallelizing functions will scale.

Can you point me to them? I am not seeing Google Benchmark results in this thread.

> And nothing I have built for this project, or any project, has ever needed
> maintenance.

Can you show me a file in Stan math you wrote that does not later have other contributors? Looking at the git blame for the gp functions, they have been refactored and rewritten over the years. All code adds maintenance.

> If you want to take a look at the "levers in play," you can take a look at
> me benchmarking certain parallelization parameters (i.e. #threads, block
> size) which will give some insight as to how the different levers perform.

I want an actual Google Benchmark.

> You can run multiple threads on one core.
> I am talking specifically about parallelism. So why are you pointing me to
> this report?

What report are you talking about? stan-perf is a repository with everything set up so that you can use Google Benchmark for building benchmark experiments. I feel like you are not reading or looking at the resources I am sending you.

> And sure, if your "levers" are something you'd like to evaluate, please
> itemize them.
> I've already done some tests to show overhead of initiating threads.
> What levers would you like to see pulled?

A good start would be experiments that check how performance changes, relative to serial execution of the function, as the size of the vector and the number of threads vary. You can use the stan-perf repository to set up your benchmarks and execute them. You can fork the stan-perf repository, write your benchmarks, make some nice graphs, and share your code and results.

I've directed you several times to the stan-perf repository to set up your benchmarking code. Is there something I'm missing as to why you do not go and build the benchmarks via that repo and run them? Using google/benchmark will give you consistent and shareable results. You can also output the results to csv/json so you can make nice plots in R or Python.
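
To show the kind of measurement being asked for, here is a minimal, standard-library-only sketch. It is a stand-in, not a replacement for Google Benchmark or stan-perf: `exp_with_threads`, `mean_seconds`, and `csv_row` are hypothetical helpers, plain `std::thread` is assumed in place of the PR's TBB code, and a simple mean over a few repetitions is far less rigorous than what google/benchmark reports.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Stand-in for the threaded code under test (assumption: std::thread, not TBB).
// Applies exp elementwise over contiguous slices; n_threads == 1 runs serially.
inline std::vector<double> exp_with_threads(const std::vector<double>& x,
                                            std::size_t n_threads) {
  assert(n_threads >= 1);
  std::vector<double> out(x.size());
  auto work = [&](std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) out[i] = std::exp(x[i]);
  };
  if (n_threads == 1) {
    work(0, x.size());
    return out;
  }
  std::vector<std::thread> pool;
  const std::size_t chunk = (x.size() + n_threads - 1) / n_threads;
  for (std::size_t t = 0; t < n_threads; ++t) {
    const std::size_t begin = std::min(t * chunk, x.size());
    const std::size_t end = std::min(begin + chunk, x.size());
    if (begin < end) pool.emplace_back(work, begin, end);  // disjoint slices
  }
  for (auto& th : pool) th.join();
  return out;
}

// Mean wall-clock seconds per call over `reps` repetitions (x must be non-empty).
inline double mean_seconds(const std::vector<double>& x, std::size_t n_threads,
                           int reps) {
  assert(reps > 0 && !x.empty());
  const auto start = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) {
    volatile double sink = exp_with_threads(x, n_threads).back();
    (void)sink;  // keep the result from being optimized away
  }
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return elapsed.count() / reps;
}

// One CSV row: size, threads, serial seconds, threaded seconds, speedup.
inline std::string csv_row(std::size_t n, std::size_t n_threads,
                           double serial_s, double threaded_s) {
  std::ostringstream os;
  os << n << ',' << n_threads << ',' << serial_s << ',' << threaded_s << ','
     << serial_s / threaded_s;
  return os.str();
}
```

Calling `mean_seconds(x, 1, reps)` and `mean_seconds(x, t, reps)` for each size and thread count, and writing one `csv_row` per pair, gives a control and a treatment for every configuration. The same comparison expressed as Google Benchmark cases would add warm-up, repetition control, and statistics that this sketch lacks.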

drezap · 2026-05-28T21:12:25Z

Give me a clear ordered list of what you would like to see. There have been minimal changes to the GP functions I've written. Yes, I am aware code requires maintenance. Go ahead and run ./runTests.py test/unit/math/prim/fun/exp_test both with and without STAN_THREADS=TRUE in make/local, and you will see a huge performance gain, as well as an example of Amdahl's law in practice. Again, I'm not sure why you are referencing benchmarks that have nothing to do with parallelization when this is what we are discussing. Again, I am working on obtaining a functional machine for programming.


SteveBronder · 2026-05-29T00:51:16Z

I'm very confused. Your benchmarks are very ad hoc and I'm asking for more formal benchmarks. I feel I'm not being heard. You can read the stan-perf repo and post clear benchmarks using the Google Benchmark suite provided in that repo.

drezap · 2026-05-29T03:23:44Z

Give me a list of the parameters you'd like benchmarked. Keep in mind that varying them in conjunction can affect the benchmarking results. Can you provide an ordered, enumerated list of what you'd like to see benchmarked, so we're all not wasting time? Thanks!


SteveBronder · 2026-05-29T06:02:11Z

I would want vectors from size 2 to 32,768 by powers of 2 (so 2, 4, 8, 16, ...). Then I would want threads from 1, 2, 4, ... up to the max threads on your computer. If you use stan-perf, Google Benchmark has the tooling set up for this.
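For readers following along, the requested grid can be sketched with the standard library alone. This is not the stan-perf or Google Benchmark setup asked for above, and it is not the PR's code: `std::thread` stands in for `tbb::parallel_for`, and `doubling_range`, `exp_serial`, `exp_threaded`, `median_seconds`, `Row`, and `run_grid` are all hypothetical names made up for illustration. It only shows the shape of the experiment (sizes 2 to 32,768 by powers of 2, thread counts 1, 2, 4, ..., each compared against a serial baseline).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Values lo, 2*lo, 4*lo, ... that do not exceed hi.
// doubling_range(2, 32768) gives the 15 requested vector sizes.
// If hi is not lo times a power of two, the last value stays below hi.
std::vector<std::size_t> doubling_range(std::size_t lo, std::size_t hi) {
  assert(lo >= 1 && lo <= hi);
  std::vector<std::size_t> out;
  for (std::size_t v = lo; v <= hi; v *= 2) out.push_back(v);
  return out;
}

// Serial baseline: elementwise std::exp.
std::vector<double> exp_serial(const std::vector<double>& x) {
  std::vector<double> y(x.size());
  std::transform(x.begin(), x.end(), y.begin(),
                 [](double v) { return std::exp(v); });
  return y;
}

// Chunked threaded version. Assumption: plain std::thread workers are used
// here as a stand-in for tbb::parallel_for; each worker owns one contiguous
// block, so no two threads write to the same element.
std::vector<double> exp_threaded(const std::vector<double>& x,
                                 std::size_t num_threads) {
  assert(num_threads >= 1);
  const std::size_t n = x.size();
  std::vector<double> y(n);
  const std::size_t chunk = (n + num_threads - 1) / num_threads;
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < num_threads; ++t) {
    const std::size_t begin = std::min(n, t * chunk);
    const std::size_t end = std::min(n, begin + chunk);
    if (begin == end) break;  // fewer elements than threads
    workers.emplace_back([&x, &y, begin, end] {
      for (std::size_t i = begin; i < end; ++i) y[i] = std::exp(x[i]);
    });
  }
  for (auto& w : workers) w.join();
  return y;
}

// Written after every timed call so the compiler cannot discard the result.
static volatile double g_sink = 0.0;

// Median wall-clock seconds over reps calls of f, where f returns a
// non-empty std::vector<double>.
template <typename F>
double median_seconds(F&& f, int reps) {
  assert(reps >= 1);
  std::vector<double> times;
  for (int r = 0; r < reps; ++r) {
    const auto t0 = std::chrono::steady_clock::now();
    g_sink = f().front();
    const auto t1 = std::chrono::steady_clock::now();
    times.push_back(std::chrono::duration<double>(t1 - t0).count());
  }
  std::sort(times.begin(), times.end());
  return times[times.size() / 2];
}

struct Row {
  std::size_t n;        // vector size
  std::size_t threads;  // worker threads used
  double serial_s;      // median serial time in seconds
  double threaded_s;    // median threaded time in seconds
};

// One row per (size, thread count) pair of the grid.
std::vector<Row> run_grid(std::size_t max_n, std::size_t max_threads,
                          int reps) {
  std::vector<Row> rows;
  for (std::size_t n : doubling_range(2, max_n)) {
    const std::vector<double> x(n, 1.0);
    const double serial = median_seconds([&] { return exp_serial(x); }, reps);
    for (std::size_t t : doubling_range(1, max_threads)) {
      const double threaded =
          median_seconds([&] { return exp_threaded(x, t); }, reps);
      rows.push_back({n, t, serial, threaded});
    }
  }
  return rows;
}
```

Calling `run_grid(32768, std::max(1u, std::thread::hardware_concurrency()), 5)` from `main` and printing each `Row` as CSV would give a table that can be plotted in R or Python (build with `-pthread` where the toolchain requires it). Timings from a hand-rolled loop like this are noisy and include thread start-up cost on every call, which is part of why a dedicated harness such as Google Benchmark is the better tool for results meant to be shared.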

drezap and others added 15 commits April 25, 2026 14:56

change return to referenced argument

a333398

intermediate commit playing with pointers

794e2a5

ok, numerical tests 1,..,10 pass _threading_

bee0559

remove dead code

99114d8

begin benchmarks

2db1474

try no threading

0d66c6f

scale to 10mm obs and 2^17 threads

73c3c76

add only 10 test for multithreading

c539a38

add some unthreaded tests, for varying N and numerical value of exp(10)

442e814

scale N tests

e9583fa

unthreaded tests compile

c1f263f

remove print statement

5b15382

Merge commit '105bfcc395c1ab824dcb588324dd57724a1cf527' into HEAD

b331dd8

[Jenkins] auto-formatting by clang-format version 10.0.0-4ubuntu1

b1c8b69

drezap added 4 commits April 29, 2026 21:14

fix unit test names, rusty, sorry

84675ab

Merge branch 'feature/issue-3311-test-thread-tbb-exp' of github.com:drezap/math into feature/issue-3311-test-thread-tbb-exp

aec6db9

fix return to satisfy compiler

94db162

change exp to std::exp for numerical accuracy

3fcf593

drezap added 3 commits May 1, 2026 10:59

investigate drift in tests

4b08e16

ifdef0endif don't compile mix tests, they're only supporting complex

d143a04

change investigate drift naming conventions

6837a52

perfect forwarding in parallelizing class

7ca2f6d

SteveBronder closed this May 28, 2026
