scVelo recover_dynamics is one of the slower steps in many single-cell genomics workflows. AutoZyme ships a
verified, drop-in patch that is up to 5.67× faster, returning bit-for-bit identical results with no change to how you call it.
Best speedup: 5.67×
Median speedup: 5.15×
Output equivalence: Bit-exact
Best runtime: 2.41 min baseline → 25.52 s optimized
Datasets: 5
Pass rate: 10/10
Benchmark charts
Speedup distribution: each row is one finalized dataset/thread run on Windows.

| Dataset | Tier | Threads | Baseline | Optimized | Speedup | Memory |
|---|---|---|---|---|---|---|
| gastrulation_erythroid | large | 1 | 9.19 min | 3.52 min | 2.61× | 2.6 → 2.6 GB |
| gastrulation_erythroid | large | 4 | 3.39 min | 1.08 min | 3.10× | 2.6 → 2.7 GB |
| gastrulation_erythroid | large | 8 | 2.81 min | 34.00 s | 4.96× | 2.6 → 2.7 GB |
| gastrulation_erythroid | large | 16 | 2.41 min | 25.52 s | 5.67× | 2.6 → 2.7 GB |
| pancreas | medium | 1 | 7.55 min | 2.87 min | 2.63× | 1.3 → 1.3 GB |
| pancreas | medium | 4 | 3.05 min | 1.17 min | 2.59× | 1.3 → 1.3 GB |
| pancreas | medium | 8 | 2.31 min | 38.78 s | 3.58× | 1.3 → 1.3 GB |
| pancreas | medium | 16 | 1.93 min | 21.44 s | 5.39× | 1.3 → 1.3 GB |
| bonemarrow | ood_large | 1 | 4.99 min | 2.02 min | 2.48× | 1.6 → 1.6 GB |
| bonemarrow | ood_large | 4 | 1.69 min | 34.26 s | 2.95× | 1.6 → 1.6 GB |
| bonemarrow | ood_large | 8 | 1.31 min | 20.55 s | 3.84× | 1.6 → 1.6 GB |
| bonemarrow | ood_large | 16 | 1.32 min | 16.14 s | 4.92× | 1.6 → 1.6 GB |
| pbmc68k_50k | ood_xlarge | 1 | 7.57 min | 3.53 min | 2.14× | 7.5 → 7.3 GB |
| pbmc68k_50k | ood_xlarge | 4 | 2.73 min | 1.06 min | 2.58× | 7.5 → 7.3 GB |
| pbmc68k_50k | ood_xlarge | 8 | 2.17 min | 47.81 s | 2.74× | 7.5 → 7.4 GB |
| pbmc68k_50k | ood_xlarge | 16 | 2.33 min | 36.78 s | 3.81× | 7.5 → 7.4 GB |
| pancreas_subset_1500 | small | 1 | 2.21 min | 1.22 min | 1.81× | 0.7 → 0.8 GB |
| pancreas_subset_1500 | small | 4 | 40.31 s | 21.36 s | 1.89× | 0.7 → 0.8 GB |
| pancreas_subset_1500 | small | 8 | 28.32 s | 13.37 s | 2.12× | 0.7 → 0.8 GB |
| pancreas_subset_1500 | small | 16 | 28.27 s | 9.46 s | 2.99× | 0.7 → 0.8 GB |
The public API stays the same; AutoZyme replaces only the supported fast path.
This task targets scvelo.tl.recover_dynamics in scvelo. The benchmarked result
preserves the declared scientific output gate while reducing CPU runtime on the listed datasets.
Supported scope
The fast path is a faithful structural mirror of upstream and handles all assignment_mode values correctly.

(1) fast_assign_tau (lines 172-200) reproduces upstream assign_tau branch for branch. The projection family (assignment_mode in {full_projection, partial_projection}, or projection with beta < gamma) is accelerated with a streaming-argmin numba kernel that drops the per-cell constant ||x_obs||^2 term from the squared-distance argmin (verified mathematically equivalent to upstream's 3D-broadcast argmin). Every other mode (including 'projection' with beta >= gamma, and any non-projection mode) falls into the same else-branch as upstream and calls the unchanged original tau_inv, so the result is bit-identical.

(2) _fast_get_solution (lines 123-147) JIT-compiles the ODE solution only when t is 1D and u0, s0, alpha, beta, and gamma are all scalars (the recover_dynamics per-gene fit path). It falls back to the captured upstream get_solution for 2D t, array initial_state, and per-cell array rate parameters (the get_divergence / velocity(mode='dynamical') path), and this fallback is guarded.

(3) NumbaConn.dot routes 1D, 2-column, and 4-column matvecs to specialized kernels and any other n_cols to a correct generic kernel. An F-contiguous fast view is used only for n <= 2000, with the rest copied to C-contiguous layout; all branches are covered.

(4) fast_get_n_jobs rewrites only the worker count (parallelism), not the math. Per-gene EM fits are independent, so the worker count does not affect results.

All five overrides must be active together, and the wrapper re-installs assign_tau and get_solution inside each loky worker so that parallel runs match serial runs.

Numerical caveat: _splicing_solve and the numba matvec/argmin kernels use fastmath=True (scalar exp versus upstream's vectorized exp), so results match upstream only up to floating-point rounding. The task tolerates this with acceptance gates of pearson >= 0.99 and n_genes_fit_diff <= 5.
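The argmin equivalence claimed for the projection family follows from expanding ||x - p||^2 = ||x||^2 - 2 x·p + ||p||^2: the ||x||^2 term is the same for every candidate point p, so dropping it cannot change which p is nearest. The following is a minimal pure-Python sketch of that idea, not the numba kernel itself; the function names and the random 2-D test points are assumptions made for illustration. Integer coordinates keep the arithmetic exact; with floating-point inputs, near-ties could in principle resolve differently after rounding.

```python
import random

def argmin_full(x, points):
    """Index of the point nearest to x, scanning with the full squared distance."""
    best_j, best_d = -1, None
    for j, p in enumerate(points):
        d = sum((xi - pi) ** 2 for xi, pi in zip(x, p))
        if best_d is None or d < best_d:
            best_j, best_d = j, d
    return best_j

def argmin_reduced(x, points):
    """Same scan, scoring by ||p||^2 - 2 x.p (the constant ||x||^2 is dropped)."""
    best_j, best_s = -1, None
    for j, p in enumerate(points):
        s = sum(pi * pi for pi in p) - 2 * sum(xi * pi for xi, pi in zip(x, p))
        if best_s is None or s < best_s:
            best_j, best_s = j, s
    return best_j

# Hypothetical data: integer 2-D points standing in for a model trajectory and observed cells.
rng = random.Random(0)
points = [(rng.randint(-50, 50), rng.randint(-50, 50)) for _ in range(200)]
cells = [(rng.randint(-50, 50), rng.randint(-50, 50)) for _ in range(100)]

agree = all(argmin_full(c, points) == argmin_reduced(c, points) for c in cells)
print(agree)  # True: both scans select the same index for every cell
```

Both functions stream over the candidates and keep only the running best, so neither materializes a full cell-by-point distance array.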
Out-of-scope behavior: silent fallback to upstream
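A guarded fast path with silent fallback can be sketched as follows. This is an illustrative pattern under stated assumptions, not AutoZyme's code: reference_unspliced, fast_unspliced, and the calls counter are hypothetical names, and the formula is the standard unspliced solution u(t) = u0·exp(-beta·t) + (alpha/beta)·(1 - exp(-beta·t)).

```python
import math
from numbers import Real

def _broadcast(v, n):
    """Repeat a scalar n times, or pass a per-cell sequence through as a list."""
    return [v] * n if isinstance(v, Real) else list(v)

def reference_unspliced(t, u0, alpha, beta):
    """Stand-in for the general upstream routine: scalars or per-cell lists."""
    n = len(t)
    rows = zip(t, _broadcast(u0, n), _broadcast(alpha, n), _broadcast(beta, n))
    return [u * math.exp(-b * ti) + a / b * (1.0 - math.exp(-b * ti))
            for ti, u, a, b in rows]

calls = {"fast": 0, "fallback": 0}  # instrumentation for this demo only

def fast_unspliced(t, u0, alpha, beta):
    """Take the specialized path only for scalar parameters; otherwise defer."""
    if all(isinstance(v, Real) for v in (u0, alpha, beta)):
        calls["fast"] += 1
        ratio = alpha / beta  # hoisting is valid only because alpha, beta are scalars
        out = []
        for ti in t:
            e = math.exp(-beta * ti)
            out.append(u0 * e + ratio * (1.0 - e))
        return out
    calls["fallback"] += 1
    return reference_unspliced(t, u0, alpha, beta)

in_scope = fast_unspliced([0.0, 1.0], 0.0, 2.0, 1.0)             # scalar parameters
out_of_scope = fast_unspliced([0.0, 1.0], 0.0, [2.0, 3.0], 1.0)  # per-cell alpha
print(calls)
```

The caller sees one function either way; inputs outside the guard never reach the specialized code, which is what makes the fallback safe to leave silent.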
Detailed speedup table (10 runs)

| Dataset | Tier | Platform | Threads | Baseline | Optimized | Speedup | Memory | Concordance | Pass |
|---|---|---|---|---|---|---|---|---|---|
| bonemarrow | ood_large | Windows | 16 | 1.32 min | 16.14 s | 4.92× | 1.6 → 1.6 GB | — | pass |
| gastrulation_erythroid | large | Windows | 16 | 2.41 min | 25.52 s | 5.67× | 2.6 → 2.7 GB | — | pass |
| pancreas | medium | Windows | 16 | 1.93 min | 21.44 s | 5.39× | 1.3 → 1.3 GB | — | pass |
| pancreas_subset_1500 | small | Windows | 16 | 28.27 s | 9.46 s | 2.99× | 0.7 → 0.8 GB | — | pass |
| pbmc68k_50k | ood_xlarge | Windows | 16 | 2.33 min | 36.78 s | 3.81× | 7.5 → 7.4 GB | — | pass |
| bonemarrow | ood_large | macOS | 4 | 56.52 s | 20.94 s | 2.70× | — | — | pass |
| gastrulation_erythroid | large | macOS | 1 | 5.96 min | 36.05 s | 9.92× | — | — | pass |
| pancreas | medium | macOS | 1 | 5.47 min | 30.69 s | 10.7× | — | — | pass |
| pancreas_subset_1500 | small | macOS | 4 | 27.29 s | 13.87 s | 1.97× | — | — | pass |
| pbmc68k_50k | ood_xlarge | macOS | 1 | 5.17 min | 34.86 s | 9.18× | — | — | pass |
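The headline figures can be approximately rechecked from the ten runs in the detailed table. The sketch below assumes that the median is taken over the ten reported speedups and that a speedup is baseline time divided by optimized time; because the page shows rounded times, recomputed ratios may differ from the reported ones in the last digit.

```python
from statistics import median

# Reported speedups for the ten runs in the detailed table (Windows first, then macOS).
reported = [4.92, 5.67, 5.39, 2.99, 3.81, 2.70, 9.92, 10.7, 1.97, 9.18]

# With ten values, the median is the mean of the 5th and 6th sorted values.
median_speedup = median(reported)

# Headline Windows case: gastrulation_erythroid, 16 threads, 2.41 min -> 25.52 s.
best_case = (2.41 * 60) / 25.52

print(f"median of reported speedups: {median_speedup:.3f}")
print(f"recomputed headline speedup: {best_case:.2f}")
```

The median comes out at about 5.155, which corresponds to the 5.15× shown in the summary, and the recomputed headline ratio is about 5.67.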
Frequently asked questions
Speeding up scVelo recover_dynamics
Why is scVelo recover_dynamics slow?
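scVelo recover_dynamics is CPU-bound, and the stock implementation in scvelo leaves performance on the table in its core numerical work: the per-cell time-assignment argmin, the per-gene ODE solution, and the connectivity matrix-vector products. In the best Windows benchmark run (gastrulation_erythroid, 16 threads), the original takes 2.41 min where the AutoZyme path takes 25.52 s (5.67× faster).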
How do I make scVelo recover_dynamics faster?
Install AutoZyme and activate the scvelo patch, then keep using scVelo recover_dynamics exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 5.67× faster on the benchmark datasets, with no pipeline or API changes.
Does the AutoZyme speedup change the scVelo recover_dynamics output?
No. The accelerated path returns bit-for-bit identical results to the original scvelo implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.
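A concordance check of this kind can be sketched in a few lines. This is not the frozen gate itself: the helper names and the sample values are made up for illustration, and the 0.99 Pearson threshold is taken from the acceptance gates mentioned in the supported-scope notes.

```python
import math

def max_abs_diff(a, b):
    """Largest element-wise absolute difference; exactly 0.0 means bit-identical values."""
    return max(abs(x - y) for x, y in zip(a, b))

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Made-up per-gene values standing in for a fitted parameter from each code path.
baseline = [0.12, 0.48, 0.95, 1.30, 2.10]
identical = list(baseline)
perturbed = [v * (1 + 1e-12) for v in baseline]  # simulated rounding-level drift

print(max_abs_diff(baseline, identical) == 0.0)  # strict bit-exact criterion
print(pearson(baseline, perturbed) >= 0.99)      # tolerance-style criterion
```

The two criteria answer different questions: a maximum absolute difference of 0 requires identical values, while a correlation threshold tolerates rounding-level differences.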
How do I install the scvelo speedup?
From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scvelo"). The patch applies automatically the next time you call scvelo.tl.recover_dynamics.
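Put together, activation and the unchanged scVelo call might look like the sketch below. The autozyme.activate("scvelo") call is taken from the answer above. The wrapper function and its n_jobs default are illustrative assumptions, and adata is assumed to be an AnnData object that has already been through the usual scVelo preprocessing.

```python
def run_recover_dynamics(adata, n_jobs=4):
    """Activate the scvelo patch, then fit the dynamical model as usual."""
    # Imports are deferred so this sketch can be defined without either package installed.
    import autozyme
    import scvelo as scv

    autozyme.activate("scvelo")                     # install the patch once per process
    scv.tl.recover_dynamics(adata, n_jobs=n_jobs)   # same call as without the patch
    return adata
```

Nothing downstream needs to change: later steps such as scv.tl.velocity(adata, mode="dynamical") are called exactly as before.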