Speed up Scanpy rank_genes_groups

Scanpy rank_genes_groups is one of the slower steps in many single-cell genomics workflows. AutoZyme ships a verified, drop-in patch that is up to 97.4× faster, returning bit-for-bit identical results with no change to how you call it.

Best speedup 97.4×
Median speedup 45.3×
Output equivalence Bit-exact
Best runtime baseline 12.54 s optimized 149 ms
Datasets 6
Pass rate 11/11

Benchmark charts

Speedup distribution
Each dot is one finalized dataset/thread run on Windows (log scale). Datasets: pbmc68k, heart_adult, splitseq_rosenberg, pbmc200k_glaucoma, gastrulation_pijuansala, tms_ss2.

Thread sweep
Speedup across finalized thread counts on Windows (1, 4, and full (32) threads)
Dataset Tier Threads Baseline Optimized Speedup Memory
pbmc68k small 1 21.74 s 410 ms 53.2× 0.7 → 0.8 GB
pbmc68k small 4 11.71 s 183 ms 64.5× 0.8 → 1.0 GB
pbmc68k small 32 12.54 s 149 ms 97.4× 0.8 → 0.9 GB
heart_adult large 1 20.62 min 43.15 s 28.7× 19 → 21 GB
heart_adult large 4 14.09 min 16.85 s 50.2× 19 → 21 GB
heart_adult large 32 13.86 min 8.60 s 96.7× 19 → 21 GB
splitseq_rosenberg ood_large1 1 3.25 min 7.48 s 26.1× 4.0 → 4.7 GB
splitseq_rosenberg ood_large1 4 1.66 min 3.21 s 31.0× 4.0 → 4.7 GB
splitseq_rosenberg ood_large1 32 1.59 min 1.60 s 59.8× 4.1 → 4.7 GB
pbmc200k_glaucoma medium 1 5.21 min 15.19 s 20.6× 7.3 → 8.3 GB
pbmc200k_glaucoma medium 4 2.95 min 6.67 s 26.6× 7.3 → 8.3 GB
pbmc200k_glaucoma medium 32 2.84 min 3.27 s 52.2× 7.3 → 8.3 GB
gastrulation_pijuansala ood_large3 1 5.93 min 39.64 s 8.70× 15 → 17 GB
gastrulation_pijuansala ood_large3 4 3.88 min 13.53 s 16.3× 15 → 17 GB
gastrulation_pijuansala ood_large3 32 3.46 min 6.30 s 33.1× 15 → 17 GB
tms_ss2 ood_large2 1 2.74 min 18.87 s 8.73× 8.6 → 9.9 GB
tms_ss2 ood_large2 4 1.94 min 7.88 s 14.5× 8.6 → 9.9 GB
tms_ss2 ood_large2 32 1.71 min 4.08 s 25.2× 8.6 → 9.9 GB

Memory
Baseline vs optimized peak memory on Windows (32 threads)
Dataset Tier Baseline Optimized Optimized/baseline
heart_adult large 19 GB 21 GB 1.12×
gastrulation_pijuansala ood_large3 15 GB 17 GB 1.13×
tms_ss2 ood_large2 8.6 GB 9.9 GB 1.14×
pbmc200k_glaucoma medium 7.3 GB 8.3 GB 1.14×
splitseq_rosenberg ood_large1 4.1 GB 4.7 GB 1.16×
pbmc68k small 0.8 GB 0.9 GB 1.08×

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets rank_genes_groups in Scanpy. Every benchmarked run passes the declared output-concordance gate while reducing CPU runtime on the listed datasets.

Also searched as: marker genes, find markers, FindAllMarkers, differential expression, DEG, rank genes, wilcoxon, tl.rank_genes_groups.

Supported scope

The fast path runs (and is benchmarked) only when all of the following hold: method=="wilcoxon", reference=="rest", tie_correct==False, numba is available, and X is a SciPy sparse matrix (CSBase, converted to CSC internally at line 508).

Within that scope: groups=="all" or an explicit sequence of group names; corr_method in {benjamini-hochberg, bonferroni, other (= no adjustment)}; n_genes=None (full output, descending-score sort) or n_genes<n_genes_total (top-N branch); rankby_abs True/False; pts True/False; use_raw, layer, mask_var, key_added, and copy are honored; the groupby column must be categorical (the patch uses .cat.categories and .cat.codes).

logfoldchanges reverses the log1p transform via expm1, reading the log base from adata.uns['log1p']['base'] (defaults to natural log if absent).
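The log-fold-change step described above can be sketched in plain Python. This is a minimal illustration, not AutoZyme's or Scanpy's actual code: the function name log2_fold_change is hypothetical, and the 1e-9 pseudocount and the final log2 are assumptions modeled on how upstream Scanpy is understood to compute this value. Only the expm1 reversal and the base lookup come from the scope description.

```python
import math

def log2_fold_change(mean_group, mean_rest, base=None, eps=1e-9):
    """Log2 fold change from means of log1p-transformed expression.

    base: log base used by the log1p step (None means natural log),
          i.e. what adata.uns['log1p'].get('base') would hold.
    eps:  small pseudocount guarding against division by zero
          (an assumption for this sketch).
    """
    # Rescale to natural-log units when a different base was used.
    scale = 1.0 if base is None else math.log(base)
    # expm1 undoes the log1p transform on each mean.
    group = math.expm1(mean_group * scale)
    rest = math.expm1(mean_rest * scale)
    return math.log2((group + eps) / (rest + eps))

# Natural-log data: means equal to log1p(3) and log1p(1).
natural = log2_fold_change(math.log1p(3.0), math.log1p(1.0))
# Base-2 data: log2(1 + 3) and log2(1 + 1) describe the same counts.
base2 = log2_fold_change(math.log2(4.0), math.log2(2.0), base=2)
print(natural, base2)
```

Both calls describe a threefold difference, so both values come out close to log2(3), which shows why the stored base matters: ignoring it would change the result for data that was not transformed with the natural log.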

Out-of-scope behavior

Calls outside the supported scope fall back silently to the upstream Scanpy implementation.
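The scope conditions and the fallback can be read as a simple dispatch rule. The sketch below is only an illustration of that rule: in_fast_path_scope and choose_path are hypothetical names, the availability flags are passed in by hand, and AutoZyme's real dispatch logic is not shown in this page.

```python
def in_fast_path_scope(*, method, reference, tie_correct,
                       numba_available, x_is_sparse):
    """Return True when a call matches the top-level supported scope."""
    return bool(
        method == "wilcoxon"
        and reference == "rest"
        and not tie_correct
        and numba_available
        and x_is_sparse
    )

def choose_path(**call):
    # Out-of-scope calls go to upstream without raising or warning.
    return "fast" if in_fast_path_scope(**call) else "upstream"

in_scope = dict(method="wilcoxon", reference="rest", tie_correct=False,
                numba_available=True, x_is_sparse=True)
print(choose_path(**in_scope))                           # fast
print(choose_path(**{**in_scope, "method": "t-test"}))   # upstream
print(choose_path(**{**in_scope, "tie_correct": True}))  # upstream
```

Because the fallback is silent, a check like this is one way to confirm ahead of time that a given call should land on the accelerated path rather than discovering it from the runtime.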

Detailed speedup table (11 runs)
Dataset Tier Platform Threads Baseline Optimized Speedup Memory Concordance Pass
gastrulation_pijuansala ood_large3 Windows 32 3.46 min 6.30 s 33.1× 14.8 → 16.8 GB pass
heart_adult large Windows 32 13.86 min 8.60 s 96.7× 19.0 → 21.3 GB pass
pbmc200k_glaucoma medium Windows 32 2.84 min 3.27 s 52.2× 7.3 → 8.3 GB pass
pbmc68k small Windows 32 12.54 s 149 ms 97.4× 0.8 → 0.9 GB pass
splitseq_rosenberg ood_large1 Windows 32 1.59 min 1.60 s 59.8× 4.1 → 4.7 GB pass
tms_ss2 ood_large2 Windows 32 1.71 min 4.08 s 25.2× 8.6 → 9.9 GB pass
gastrulation_pijuansala ood_large3 macOS 14 1.88 min 6.21 s 18.1× 14.0 → 12.9 GB pass
pbmc200k_glaucoma medium macOS 14 1.85 min 2.96 s 38.9× 10.2 → 10.3 GB pass
pbmc68k small macOS 14 19.63 s 66 ms 293.3× 2.9 → 1.0 GB pass
splitseq_rosenberg ood_large1 macOS 14 59.53 s 1.33 s 45.3× 6.6 → 4.5 GB pass
tms_ss2 ood_large2 macOS 14 1.06 min 3.13 s 20.2× 8.9 → 9.0 GB pass
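The headline figures can be cross-checked against the Speedup column of this table. The snippet below is a small worked calculation using only the values listed above. The grouping by platform is an assumption made for the check, not something the page states about how the summary was produced.

```python
from statistics import median

# (platform, speedup) pairs copied from the Speedup column of the table.
runs = [
    ("Windows", 33.1), ("Windows", 96.7), ("Windows", 52.2),
    ("Windows", 97.4), ("Windows", 59.8), ("Windows", 25.2),
    ("macOS", 18.1), ("macOS", 38.9), ("macOS", 293.3),
    ("macOS", 45.3), ("macOS", 20.2),
]

median_speedup = median(s for _, s in runs)
best_windows = max(s for p, s in runs if p == "Windows")
best_macos = max(s for p, s in runs if p == "macOS")
print(median_speedup, best_windows, best_macos)
```

The median over all 11 runs is 45.3×, consistent with the reported median speedup. The 97.4× best speedup matches the best Windows run, while the macOS pbmc68k row reports a larger 293.3×.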

Frequently asked questions

Speeding up Scanpy rank_genes_groups
Why is Scanpy rank_genes_groups slow?

Scanpy rank_genes_groups is CPU-bound, and the stock implementation leaves performance on the table in its core numerical work. On the pbmc68k benchmark dataset (Windows, 32 threads), the original takes 12.54 s where the AutoZyme path takes 149 ms (97.4× faster).

How do I make Scanpy rank_genes_groups faster?

Install AutoZyme and activate the Scanpy patch, then keep using Scanpy rank_genes_groups exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 97.4× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the Scanpy rank_genes_groups output?

No. The accelerated path returns bit-for-bit identical results to the original Scanpy implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.

How do I install the Scanpy speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scanpy"). The patch applies automatically the next time you call rank_genes_groups.
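A minimal sketch of those steps in a script might look like the following. The wrapper names activate_scanpy_patch and rank_markers are hypothetical, and adata is assumed to be an AnnData object with a SciPy sparse X and a categorical groupby column. The autozyme.activate("scanpy") call is taken from the answer above, and the rank_genes_groups arguments are chosen to stay inside the supported scope.

```python
def activate_scanpy_patch():
    """Activate the AutoZyme Scanpy patch if the package is installed."""
    try:
        import autozyme  # installed with: pip install autozyme
    except ImportError:
        return False  # nothing is patched; upstream Scanpy runs as usual
    autozyme.activate("scanpy")
    return True

def rank_markers(adata, groupby):
    """Call rank_genes_groups with arguments inside the supported scope."""
    import scanpy as sc  # imported lazily so the sketch loads without it
    sc.tl.rank_genes_groups(
        adata,
        groupby,
        method="wilcoxon",
        reference="rest",
        tie_correct=False,
    )
    # Scanpy stores the result under this key by default.
    return adata.uns["rank_genes_groups"]

patched = activate_scanpy_patch()
print("AutoZyme patch active:", patched)
```

The call in rank_markers is ordinary Scanpy code, so the same script runs with or without the patch; only the runtime is expected to differ.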