Speed up scVelo recover_dynamics: up to 5.67× faster, identical output

Q: Why is scVelo recover_dynamics slow?

scVelo recover_dynamics is CPU-bound, and the stock implementation in scvelo leaves performance on the table in its core numerical work: tau assignment, the per-gene ODE solution, and the matrix-vector products. On the gastrulation_erythroid benchmark (Windows, 16 threads), the original takes 2.41 min where the AutoZyme path takes 25.52 s (5.67× faster).

Q: How do I make scVelo recover_dynamics faster?

Install AutoZyme and activate the scvelo patch, then keep using scVelo recover_dynamics exactly as before. AutoZyme transparently substitutes the faster, output-validated path, with no pipeline or API changes. On the benchmark datasets this is up to 5.67× faster on Windows at 16 threads, and up to 10.7× faster in the single-thread macOS runs.

Q: Does the AutoZyme speedup change the scVelo recover_dynamics output?

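No, within the tested scope. On every benchmark dataset a frozen concordance gate reports a maximum absolute difference of 0 against the original scvelo implementation. The supported-scope notes add one caveat: the numba kernels are compiled with fastmath=True, so results match upstream only up to floating-point rounding, which the acceptance gates tolerate (pearson>=0.99, n_genes_fit_diff<=5).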

Q: How do I install the scvelo speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scvelo"). The patch applies automatically the next time you call scvelo.tl.recover_dynamics.
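The steps above can be sketched as a small helper. This is a minimal sketch, not an official recipe: the helper name recover_dynamics_with_autozyme is hypothetical, the preprocessing parameter values are illustrative assumptions, and the only AutoZyme call used is the autozyme.activate("scvelo") call quoted above. Both packages are assumed to be installed.

```python
def recover_dynamics_with_autozyme(adata, n_jobs=None):
    """Fit scVelo's dynamical model with the AutoZyme patch active.

    Hypothetical helper for illustration; `adata` is an AnnData object
    with spliced/unspliced layers, as scVelo expects.
    """
    # Imports are deferred so the helper can be defined even where the
    # packages are not installed yet.
    import autozyme
    import scvelo as scv

    # Activation call as quoted in the text; it patches scvelo in place.
    autozyme.activate("scvelo")

    # Standard scVelo preprocessing. The parameter values are illustrative
    # assumptions, not settings taken from the benchmarks.
    scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=2000)
    scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

    # The public API call is unchanged.
    scv.tl.recover_dynamics(adata, n_jobs=n_jobs)
    return adata
```

For example, recover_dynamics_with_autozyme(scv.datasets.pancreas(), n_jobs=4) would run the fit on scVelo's bundled pancreas dataset.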

Benchmark charts

Speedup distribution

Each dot is one finalized dataset/thread run on Windows.

gastrulation_erythroid: 5.67×
pancreas: 5.39×
bonemarrow: 4.92×
pbmc68k_50k: 3.81×
pancreas_subset_1500: 2.99×

Thread sweep

Chart: speedup across finalized thread counts on Windows, one series per benchmark dataset.

Memory

Chart: baseline vs optimized peak memory on Windows.

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets scvelo.tl.recover_dynamics in scvelo. The benchmarked result preserves the declared scientific output gate while reducing CPU runtime on the listed datasets.
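An output gate of this kind can be sketched in a few lines of NumPy. This is an illustration only, not AutoZyme's actual gate: the function name concordance_gate is hypothetical, the thresholds are the pearson>=0.99 and n_genes_fit_diff<=5 values quoted under Supported scope, and the choice of compared quantity (one per-gene value, with NaN marking a gene that was not fit) and the reading of n_genes_fit_diff as the difference in fitted-gene counts are assumptions.

```python
import numpy as np


def concordance_gate(baseline, optimized, min_pearson=0.99, max_fit_diff=5):
    """Compare per-gene values from a baseline run and an optimized run.

    NaN marks a gene that was not fit in that run (an assumption made
    for this sketch).
    """
    baseline = np.asarray(baseline, dtype=float)
    optimized = np.asarray(optimized, dtype=float)

    fit_base = ~np.isnan(baseline)
    fit_opt = ~np.isnan(optimized)

    # Assumed meaning: absolute difference in the number of fitted genes.
    n_genes_fit_diff = abs(int(fit_base.sum()) - int(fit_opt.sum()))

    # Correlate only genes that were fit in both runs.
    both = fit_base & fit_opt
    pearson = float(np.corrcoef(baseline[both], optimized[both])[0, 1])
    max_abs_diff = float(np.max(np.abs(baseline[both] - optimized[both])))

    return {
        "pearson": pearson,
        "n_genes_fit_diff": n_genes_fit_diff,
        "max_abs_diff": max_abs_diff,
        "passed": pearson >= min_pearson and n_genes_fit_diff <= max_fit_diff,
    }


# Identical runs pass with a maximum absolute difference of zero.
reference = np.array([0.8, 1.5, np.nan, 2.1, 0.3])
report = concordance_gate(reference, reference.copy())
```

A maximum absolute difference of 0 is the stricter, bit-for-bit outcome; the Pearson and gene-count thresholds are the looser bounds that tolerate rounding.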


Supported scope

The fast path is a faithful structural mirror of upstream and handles all assignment_mode values correctly.

(1) fast_assign_tau (lines 172-200) reproduces upstream assign_tau branch for branch. The projection family (assignment_mode in {full_projection, partial_projection}, or projection with beta<gamma) is accelerated with a streaming-argmin numba kernel that drops the per-cell constant ||x_obs||^2 term from the squared-distance argmin (verified mathematically equivalent against upstream's 3D-broadcast argmin). Every other mode (including 'projection' with beta>=gamma, and any non-projection mode) falls into the same else-branch as upstream and calls the unchanged original tau_inv, so it is bit-identical.

(2) _fast_get_solution (lines 123-147) JITs the ODE solution only when t is 1D and u0/s0/alpha/beta/gamma are all scalars (the recover_dynamics per-gene fit path). It falls back to the captured upstream get_solution for 2D t, array initial_state, and per-cell array rate params (the get_divergence / velocity(mode='dynamical') path); the fallback is guarded.

(3) NumbaConn.dot routes 1D, 2-col, and 4-col matvecs to specialized kernels and any other n_cols to a correct generic kernel. An F-contiguous fast view is used only for n<=2000, with the rest copied to C-contiguous; all branches are covered.

(4) fast_get_n_jobs only rewrites the worker count (parallelism), not the math. Per-gene EM fits are independent, so worker count does not affect results.

All five overrides must be active together, and the wrapper re-installs assign_tau + get_solution inside each loky worker so parallel runs match serial.

Numerical caveat: _splicing_solve and the numba matvec/argmin kernels use fastmath=True (scalar exp vs upstream vectorized exp), so results match upstream only up to fp rounding (the task tolerates this with pearson>=0.99 / n_genes_fit_diff<=5 acceptance gates).
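The argmin equivalence in item (1) follows from ||x_obs - x_t||^2 = ||x_obs||^2 - 2 x_obs·x_t + ||x_t||^2: the first term does not depend on the time point t, so dropping it cannot change which t is the minimizer. The sketch below checks this in plain NumPy. It is not AutoZyme's numba kernel; the function names, array shapes, and random data are assumptions chosen for illustration.

```python
import numpy as np


def argmin_broadcast(x_obs, x_model):
    """Reference: build the full (n_cells, n_times, n_dims) difference array."""
    diff = x_obs[:, None, :] - x_model[None, :, :]
    return np.argmin(np.sum(diff ** 2, axis=2), axis=1)


def argmin_streaming(x_obs, x_model):
    """Keep a running minimum per cell instead of a 3D intermediate.

    The score omits ||x_obs||^2, which is constant per cell.
    """
    n_cells = x_obs.shape[0]
    best_score = np.full(n_cells, np.inf)
    best_idx = np.zeros(n_cells, dtype=np.intp)
    for t, x_t in enumerate(x_model):
        # ||x_t||^2 - 2 x_obs.x_t for every cell at this time point.
        score = x_t @ x_t - 2.0 * (x_obs @ x_t)
        # Strict '<' keeps the first minimum, as np.argmin does.
        better = score < best_score
        best_score[better] = score[better]
        best_idx[better] = t
    return best_idx


rng = np.random.default_rng(0)
x_obs = rng.normal(size=(200, 2))    # e.g. (unspliced, spliced) per cell
x_model = rng.normal(size=(300, 2))  # model trajectory at 300 time points

idx_ref = argmin_broadcast(x_obs, x_model)
idx_fast = argmin_streaming(x_obs, x_model)
print("indices agree:", np.array_equal(idx_ref, idx_fast))
```

The streaming form needs only O(n_cells) working memory per time point, whereas the broadcast form allocates n_cells × n_times × n_dims values. In floating point the two scores are rounded differently, so exact ties or near-ties could in principle resolve to different indices.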

Out-of-scope behavior

silent fallback to upstream
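A guard of this kind can be sketched as a dispatcher that takes the fast route only for the supported input shapes and otherwise calls the original function. This is an illustration under stated assumptions, not AutoZyme's code: all function names are hypothetical, upstream_solution is a stand-in that uses the standard closed-form solution of the splicing ODE (valid for beta != gamma), and the plain Python loop marks where a JIT-compiled kernel would go.

```python
import math

import numpy as np


def upstream_solution(t, alpha, beta, gamma, u0, s0):
    """Stand-in for the captured original; vectorized, accepts arrays."""
    exp_u = np.exp(-beta * t)
    exp_s = np.exp(-gamma * t)
    u = u0 * exp_u + alpha / beta * (1 - exp_u)
    s = (s0 * exp_s + alpha / gamma * (1 - exp_s)
         + (alpha - beta * u0) / (gamma - beta) * (exp_s - exp_u))
    return u, s


def fast_solution(t, alpha, beta, gamma, u0, s0):
    """Scalar loop over a 1D t; the shape a JIT kernel would compile."""
    u = np.empty_like(t)
    s = np.empty_like(t)
    c = (alpha - beta * u0) / (gamma - beta)
    for i in range(t.shape[0]):
        exp_u = math.exp(-beta * t[i])
        exp_s = math.exp(-gamma * t[i])
        u[i] = u0 * exp_u + alpha / beta * (1 - exp_u)
        s[i] = s0 * exp_s + alpha / gamma * (1 - exp_s) + c * (exp_s - exp_u)
    return u, s


def is_fast_path(t, alpha, beta, gamma, u0, s0):
    """True only for 1D t with scalar rates and a scalar initial state."""
    scalars = (alpha, beta, gamma, u0, s0)
    return np.ndim(t) == 1 and all(np.ndim(v) == 0 for v in scalars)


def solution(t, alpha, beta, gamma, u0=0.0, s0=0.0):
    """Dispatch: fast path when supported, silent fallback otherwise."""
    t = np.asarray(t, dtype=float)
    if is_fast_path(t, alpha, beta, gamma, u0, s0):
        return fast_solution(t, alpha, beta, gamma, u0, s0)
    return upstream_solution(t, alpha, beta, gamma, u0, s0)


t_1d = np.linspace(0.0, 5.0, 11)
u_fast, s_fast = solution(t_1d, 2.0, 1.0, 0.5)                # fast path
u_back, s_back = solution(t_1d.reshape(1, -1), 2.0, 1.0, 0.5)  # fallback
```

Because unsupported inputs are routed to the original function rather than rejected, callers see no error and no behavior change outside the accelerated case.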

Detailed speedup table (10 runs)

Dataset	Tier	Platform	Threads	Baseline	Optimized	Speedup	Peak memory (baseline → optimized)	Concordance	Pass
`bonemarrow`	ood_large	Windows	16	1.32 min	16.14 s	4.92×	1.6 → 1.6 GB	—	pass
`gastrulation_erythroid`	large	Windows	16	2.41 min	25.52 s	5.67×	2.6 → 2.7 GB	—	pass
`pancreas`	medium	Windows	16	1.93 min	21.44 s	5.39×	1.3 → 1.3 GB	—	pass
`pancreas_subset_1500`	small	Windows	16	28.27 s	9.46 s	2.99×	0.7 → 0.8 GB	—	pass
`pbmc68k_50k`	ood_xlarge	Windows	16	2.33 min	36.78 s	3.81×	7.5 → 7.4 GB	—	pass
`bonemarrow`	ood_large	macOS	4	56.52 s	20.94 s	2.70×	—	—	pass
`gastrulation_erythroid`	large	macOS	1	5.96 min	36.05 s	9.92×	—	—	pass
`pancreas`	medium	macOS	1	5.47 min	30.69 s	10.7×	—	—	pass
`pancreas_subset_1500`	small	macOS	4	27.29 s	13.87 s	1.97×	—	—	pass
`pbmc68k_50k`	ood_xlarge	macOS	1	5.17 min	34.86 s	9.18×	—	—	pass
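The Speedup column is the baseline time divided by the optimized time. The sketch below recomputes it for the Windows rows of the table. The helper to_seconds is hypothetical, and because the displayed times are rounded, a recomputed value can differ from the listed one in the last digit.

```python
def to_seconds(text):
    """Parse a duration such as '2.41 min' or '25.52 s' into seconds."""
    value, unit = text.split()
    return float(value) * {"s": 1.0, "min": 60.0}[unit]


# Windows rows from the table: (dataset, baseline, optimized, listed speedup).
rows = [
    ("bonemarrow", "1.32 min", "16.14 s", 4.92),
    ("gastrulation_erythroid", "2.41 min", "25.52 s", 5.67),
    ("pancreas", "1.93 min", "21.44 s", 5.39),
    ("pancreas_subset_1500", "28.27 s", "9.46 s", 2.99),
    ("pbmc68k_50k", "2.33 min", "36.78 s", 3.81),
]

recomputed = {}
for name, baseline, optimized, listed in rows:
    # Speedup = baseline runtime / optimized runtime, both in seconds.
    recomputed[name] = to_seconds(baseline) / to_seconds(optimized)
    print(f"{name}: recomputed {recomputed[name]:.2f}x, listed {listed}x")
```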
