Scanpy rank_genes_groups is one of the slower steps in many single-cell genomics workflows. AutoZyme ships a
verified, drop-in patch that is up to 97.4× faster, returning bit-for-bit identical results with no change to how you call it.
Best speedup: 97.4×
Median speedup: 45.3×
Output equivalence: Bit-exact
Best runtime: baseline 12.54 s → optimized 149 ms
Datasets: 6
Pass rate: 11/11
Benchmark charts
Speedup distribution (log scale). Each dot is one finalized dataset/thread run on Windows.

| Dataset | Tier | Threads | Baseline | Optimized | Speedup | Memory |
|---|---|---|---|---|---|---|
| pbmc68k | small | 1 | 21.74 s | 410 ms | 53.2× | 0.7 GB → 0.8 GB |
| pbmc68k | small | 4 | 11.71 s | 183 ms | 64.5× | 0.8 GB → 1.0 GB |
| pbmc68k | small | 32 | 12.54 s | 149 ms | 97.4× | 0.8 GB → 0.9 GB |
| heart_adult | large | 1 | 20.62 min | 43.15 s | 28.7× | 19 GB → 21 GB |
| heart_adult | large | 4 | 14.09 min | 16.85 s | 50.2× | 19 GB → 21 GB |
| heart_adult | large | 32 | 13.86 min | 8.60 s | 96.7× | 19 GB → 21 GB |
| splitseq_rosenberg | ood_large1 | 1 | 3.25 min | 7.48 s | 26.1× | 4.0 GB → 4.7 GB |
| splitseq_rosenberg | ood_large1 | 4 | 1.66 min | 3.21 s | 31.0× | 4.0 GB → 4.7 GB |
| splitseq_rosenberg | ood_large1 | 32 | 1.59 min | 1.60 s | 59.8× | 4.1 GB → 4.7 GB |
| pbmc200k_glaucoma | medium | 1 | 5.21 min | 15.19 s | 20.6× | 7.3 GB → 8.3 GB |
| pbmc200k_glaucoma | medium | 4 | 2.95 min | 6.67 s | 26.6× | 7.3 GB → 8.3 GB |
| pbmc200k_glaucoma | medium | 32 | 2.84 min | 3.27 s | 52.2× | 7.3 GB → 8.3 GB |
| gastrulation_pijuansala | ood_large3 | 1 | 5.93 min | 39.64 s | 8.70× | 15 GB → 17 GB |
| gastrulation_pijuansala | ood_large3 | 4 | 3.88 min | 13.53 s | 16.3× | 15 GB → 17 GB |
| gastrulation_pijuansala | ood_large3 | 32 | 3.46 min | 6.30 s | 33.1× | 15 GB → 17 GB |
| tms_ss2 | ood_large2 | 1 | 2.74 min | 18.87 s | 8.73× | 8.6 GB → 9.9 GB |
| tms_ss2 | ood_large2 | 4 | 1.94 min | 7.88 s | 14.5× | 8.6 GB → 9.9 GB |
| tms_ss2 | ood_large2 | 32 | 1.71 min | 4.08 s | 25.2× | 8.6 GB → 9.9 GB |
The public API stays the same; AutoZyme replaces only the supported fast path.
This task targets rank_genes_groups in Scanpy. The benchmarked result meets the declared scientific output gate (bit-exact agreement with the upstream implementation) while reducing CPU runtime on the listed datasets.
Supported scope: the fast path runs (and is benchmarked) only when all of the following hold:
- method == "wilcoxon"
- reference == "rest"
- tie_correct == False
- numba is available
- X is a SciPy sparse matrix (CSBase, converted to CSC internally at line 508)

Within that scope:
- groups == "all" or an explicit sequence of group names
- corr_method in {benjamini-hochberg, bonferroni, other (= no adjustment)}
- n_genes=None (full output, descending-score sort) or n_genes < n_genes_total (top-N branch)
- rankby_abs True/False
- pts True/False
- use_raw, layer, mask_var, key_added, and copy are honored
- the groupby column must be categorical (uses .cat.categories and .cat.codes)

logfoldchanges are computed by reversing the log1p transform via expm1, reading the log base from adata.uns['log1p']['base'] (natural log if absent).
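The log fold change step described above can be sketched as follows. This is a minimal illustration, not AutoZyme's code: the function name, the 1e-9 pseudocount, and the final log2 are assumptions modeled on how upstream Scanpy reports logfoldchanges. Only the expm1 inversion and the base lookup come from the scope statement.

```python
import math

def log2_fold_change(mean_group, mean_rest, base=None, pseudocount=1e-9):
    """Hypothetical helper: log2 fold change from means of log1p-scale data.

    base is the logarithm base used by log1p (None means natural log),
    as would be read from adata.uns['log1p']['base'].
    """
    # Rescale to natural-log units when log1p used a different base.
    scale = 1.0 if base is None else math.log(base)
    # expm1 undoes log1p, mapping the log1p-scale means back to the count scale.
    group = math.expm1(mean_group * scale)
    rest = math.expm1(mean_rest * scale)
    # The pseudocount (an assumption) guards against division by zero.
    return math.log2((group + pseudocount) / (rest + pseudocount))

# Natural-log data: values 3 and 1 before log1p give a fold change near 3.
print(log2_fold_change(math.log1p(3.0), math.log1p(1.0)))
# Base-2 data: log2(1 + 3) = 2 and log2(1 + 1) = 1 describe the same values.
print(log2_fold_change(2.0, 1.0, base=2))
```

Both calls describe the same underlying values, so they should agree closely; this is why reading the stored base matters when the data were not transformed with the natural log.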
Out-of-scope behavior: silent fallback to upstream Scanpy.
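The scope conditions and the fallback rule can be summarized as a small dispatch check. The sketch below is hypothetical: the function and parameter names are illustrative and not part of AutoZyme or Scanpy, and the sparsity of X is passed in as a flag so the example needs only the standard library.

```python
import importlib.util

def uses_fast_path(method, reference, tie_correct, x_is_sparse,
                   numba_available=None):
    """Hypothetical predicate mirroring the fast-path conditions listed above."""
    if numba_available is None:
        # Detect numba without importing it.
        numba_available = importlib.util.find_spec("numba") is not None
    return bool(
        method == "wilcoxon"
        and reference == "rest"
        and tie_correct is False
        and numba_available
        and x_is_sparse
    )

def choose_implementation(**call):
    """Return which path a call would take; anything out of scope falls back."""
    return "fast path" if uses_fast_path(**call) else "upstream Scanpy"

# An in-scope call, then one that falls back because of the method.
print(choose_implementation(method="wilcoxon", reference="rest",
                            tie_correct=False, x_is_sparse=True,
                            numba_available=True))
print(choose_implementation(method="t-test", reference="rest",
                            tie_correct=False, x_is_sparse=True,
                            numba_available=True))
```

Because the fallback is silent, a check of this kind can be handy before a long run to confirm that a given call is expected to be accelerated.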
Detailed speedup table (11 runs)

| Dataset | Tier | Platform | Threads | Baseline | Optimized | Speedup | Memory | Concordance | Pass |
|---|---|---|---|---|---|---|---|---|---|
| gastrulation_pijuansala | ood_large3 | Windows | 32 | 3.46 min | 6.30 s | 33.1× | 14.8 → 16.8 GB | — | pass |
| heart_adult | large | Windows | 32 | 13.86 min | 8.60 s | 96.7× | 19.0 → 21.3 GB | — | pass |
| pbmc200k_glaucoma | medium | Windows | 32 | 2.84 min | 3.27 s | 52.2× | 7.3 → 8.3 GB | — | pass |
| pbmc68k | small | Windows | 32 | 12.54 s | 149 ms | 97.4× | 0.8 → 0.9 GB | — | pass |
| splitseq_rosenberg | ood_large1 | Windows | 32 | 1.59 min | 1.60 s | 59.8× | 4.1 → 4.7 GB | — | pass |
| tms_ss2 | ood_large2 | Windows | 32 | 1.71 min | 4.08 s | 25.2× | 8.6 → 9.9 GB | — | pass |
| gastrulation_pijuansala | ood_large3 | macOS | 14 | 1.88 min | 6.21 s | 18.1× | 14.0 → 12.9 GB | — | pass |
| pbmc200k_glaucoma | medium | macOS | 14 | 1.85 min | 2.96 s | 38.9× | 10.2 → 10.3 GB | — | pass |
| pbmc68k | small | macOS | 14 | 19.63 s | 66 ms | 293.3× | 2.9 → 1.0 GB | — | pass |
| splitseq_rosenberg | ood_large1 | macOS | 14 | 59.53 s | 1.33 s | 45.3× | 6.6 → 4.5 GB | — | pass |
| tms_ss2 | ood_large2 | macOS | 14 | 1.06 min | 3.13 s | 20.2× | 8.9 → 9.0 GB | — | pass |
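As a quick check on the summary figures, the median can be recomputed from the 11 speedups in the detailed table. The snippet below is a small sketch using only the standard library; the values are copied from the Speedup column, and the variable names are illustrative.

```python
from statistics import median

# Speedups copied from the detailed table (Windows at 32 threads, macOS at 14).
windows = [33.1, 96.7, 52.2, 97.4, 59.8, 25.2]
macos = [18.1, 38.9, 293.3, 45.3, 20.2]

all_runs = windows + macos
print(len(all_runs))     # 11
print(median(all_runs))  # 45.3
```

The result agrees with the median speedup reported at the top of the page, which suggests that figure is taken across all 11 runs on both platforms.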
Frequently asked questions
Speeding up Scanpy rank_genes_groups
Why is Scanpy rank_genes_groups slow?
Scanpy rank_genes_groups is CPU-bound, and the stock implementation leaves performance on the table in its core numerical work. On the pbmc68k benchmark (Windows, 32 threads), the original takes 12.54 s where the AutoZyme path takes 149 ms (97.4× faster).
How do I make Scanpy rank_genes_groups faster?
Install AutoZyme and activate the Scanpy patch, then keep using Scanpy rank_genes_groups exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 97.4× faster on the benchmark datasets, with no pipeline or API changes.
Does the AutoZyme speedup change the Scanpy rank_genes_groups output?
No. The accelerated path returns bit-for-bit identical results to the original Scanpy implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.
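One way to picture a bit-exact check is to compare the raw IEEE 754 bytes of each value rather than using a tolerance. The concordance gate itself is not described here, so the helper names below are hypothetical and the sketch only illustrates the idea.

```python
import struct

def bit_exact(a, b):
    """Hypothetical check: True if two float sequences match bit for bit."""
    if len(a) != len(b):
        return False
    # Packing to 8-byte doubles compares exact bit patterns, so 0.0 is
    # distinguished from -0.0 and no tolerance is involved.
    return all(struct.pack("<d", x) == struct.pack("<d", y)
               for x, y in zip(a, b))

def max_abs_diff(a, b):
    """Largest absolute elementwise difference between two sequences."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)

# Illustrative values only, not benchmark output.
baseline = [12.5, -3.25, 0.0]
optimized = [12.5, -3.25, 0.0]
print(bit_exact(baseline, optimized), max_abs_diff(baseline, optimized))
```

A maximum absolute difference of 0 is the weaker of the two conditions, since 0.0 and -0.0 differ in bits but not in value; a bit-for-bit comparison covers both.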
How do I install the Scanpy speedup?
From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scanpy"). The patch applies automatically the next time you call rank_genes_groups.
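Putting the installation steps together, a session might look like the sketch below. The autozyme import and the activate("scanpy") call are taken from the answer above; the wrapper function, the groupby column name, and the key_added value are illustrative assumptions. The arguments are chosen to match the supported fast-path scope (Wilcoxon, reference "rest", no tie correction).

```python
# Install first from a shell: pip install autozyme

def rank_markers(adata, groupby="cell_type"):
    """Illustrative wrapper: activate the patch, then call Scanpy as usual.

    Assumes adata.X is a SciPy sparse matrix and adata.obs[groupby] is
    categorical, as the supported scope requires.
    """
    # Imports live inside the function so the sketch can be defined
    # without scanpy or autozyme installed.
    import autozyme
    import scanpy as sc

    autozyme.activate("scanpy")  # activation call as described above
    sc.tl.rank_genes_groups(
        adata,
        groupby=groupby,
        method="wilcoxon",
        reference="rest",
        tie_correct=False,
        key_added="rank_genes_groups",
    )
    return adata.uns["rank_genes_groups"]
```

The Scanpy call itself is unchanged from an unpatched workflow; the only addition is the activation line.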