Speed up Scanpy rank_genes_groups: up to 97.4× faster, identical output

Q: Why is Scanpy rank_genes_groups slow?

Scanpy rank_genes_groups is CPU-bound, and the stock implementation leaves performance on the table in its core numerical work. On the pbmc68k benchmark dataset (Windows, 32 threads), the original takes 12.54 s and the AutoZyme path takes 149 ms, a reported 97.4× speedup.

Q: How do I make Scanpy rank_genes_groups faster?

Install AutoZyme and activate the Scanpy patch, then keep using Scanpy rank_genes_groups exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 97.4× faster on the benchmark datasets, with no pipeline or API changes.

Q: Does the AutoZyme speedup change the Scanpy rank_genes_groups output?

No. The accelerated path returns bit-for-bit identical results to the original Scanpy implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.
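This page does not describe how the concordance gate is implemented. A reader who wants a comparable check on their own data could compare the numeric fields of a baseline run and an accelerated run and confirm the maximum absolute difference is 0. The helper below is a minimal sketch of that idea; the function name `max_abs_diff` and the sample values are hypothetical, and in practice the inputs would come from fields such as scores or pvals in adata.uns['rank_genes_groups'].

```python
import math


def max_abs_diff(baseline, optimized):
    """Return the largest |x - y| over paired values.

    Pairs that are exactly equal, or both NaN, contribute 0.
    A pair where only one side is NaN counts as an infinite difference.
    """
    if len(baseline) != len(optimized):
        raise ValueError("inputs must have the same length")
    worst = 0.0
    for x, y in zip(baseline, optimized):
        if x == y or (math.isnan(x) and math.isnan(y)):
            continue
        diff = abs(x - y)
        if math.isnan(diff):
            return math.inf
        worst = max(worst, diff)
    return worst


# Hypothetical score vectors standing in for two runs on the same data.
baseline_scores = [12.5, -3.25, 0.0, float("nan")]
optimized_scores = [12.5, -3.25, 0.0, float("nan")]

print(max_abs_diff(baseline_scores, optimized_scores))
```

A result of exactly 0.0 across every numeric field is what "maximum absolute difference 0" means here; any nonzero value would indicate the two runs diverged.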

Q: How do I install the Scanpy speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scanpy"). The patch applies automatically the next time you call rank_genes_groups.
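The steps above can be sketched as a small wrapper. Only autozyme.activate("scanpy") comes from this page; the wrapper name `rank_markers_accelerated` and the constant `IN_SCOPE_KWARGS` are hypothetical, and the keyword arguments are chosen to match the supported scope this page lists for the fast path. The imports sit inside the function so the snippet can be loaded even where Scanpy or AutoZyme is not installed.

```python
# Arguments that match the fast-path conditions listed on this page.
IN_SCOPE_KWARGS = {
    "method": "wilcoxon",
    "reference": "rest",
    "tie_correct": False,
}


def rank_markers_accelerated(adata, groupby):
    """Activate the AutoZyme Scanpy patch, then rank genes as usual.

    Assumes `pip install autozyme` has been run and that `adata.X` is a
    SciPy sparse matrix with a categorical `groupby` column.
    """
    import scanpy as sc
    import autozyme

    autozyme.activate("scanpy")  # call taken from this page
    sc.tl.rank_genes_groups(adata, groupby, **IN_SCOPE_KWARGS)
    return adata.uns["rank_genes_groups"]
```

Scanpy's documented default method is t-test, so method="wilcoxon" has to be passed explicitly for a call to be eligible for the fast path.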

Benchmark charts

Speedup distribution (log scale)

Each dot is one finalized dataset/thread run on Windows.

pbmc68k: 97.4×
heart_adult: 96.7×
splitseq_rosenberg: 59.8×
pbmc200k_glaucoma: 52.2×
gastrulation_pijuansa…: 33.1×
tms_ss2: 25.2×

Thread sweep

Speedup across finalized thread counts on Windows, for the same six datasets.

Memory

Baseline vs optimized peak memory on Windows, for the same six datasets.

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

The target is rank_genes_groups in Scanpy. On the listed datasets, the accelerated path passes the declared output-concordance gate while reducing CPU runtime.

Also searched as: marker genes, find markers, FindAllMarkers, differential expression, DEG, rank genes, wilcoxon, tl.rank_genes_groups.

Supported scope

The fast path runs (and is benchmarked) only when all of the following hold: method=="wilcoxon", reference=="rest", tie_correct==False, numba is available, and X is a SciPy sparse matrix (CSBase, converted to CSC internally at line 508).

Within that scope: groups=="all" or an explicit sequence of group names; corr_method in {benjamini-hochberg, bonferroni, other (= no adjustment)}; n_genes=None (full output, descending-score sort) or n_genes<n_genes_total (top-N branch); rankby_abs True/False; pts True/False; use_raw, layer, mask_var, key_added, and copy are honored; the groupby column must be categorical (the fast path uses .cat.categories/.cat.codes). logfoldchanges reverses the log1p transform via expm1, reading the log base from adata.uns['log1p']['base'] (natural log if absent).
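The logfoldchanges step can be illustrated with a short worked example. This is a sketch under assumptions, not the AutoZyme code: the expm1 reversal and the base lookup come from the description above, while the log2 ratio and the 1e-9 pseudocount are assumed to follow the upstream Scanpy convention. The function name `log2_fold_change` is hypothetical.

```python
import math


def log2_fold_change(mean_group, mean_rest, base=None, eps=1e-9):
    """log2 fold change between two means of log1p-transformed expression.

    base: log base used by the log1p step (None means natural log),
          i.e. the value that would be read from adata.uns['log1p']['base'].
    eps:  assumed pseudocount that keeps the ratio finite when a mean is 0.
    """
    scale = 1.0 if base is None else math.log(base)

    def undo_log1p(value):
        # Convert a base-`base` log1p value to natural log, then invert it.
        return math.expm1(value * scale)

    ratio = (undo_log1p(mean_group) + eps) / (undo_log1p(mean_rest) + eps)
    return math.log2(ratio)


# Means of 3 and 1 on the original scale, stored with natural-log log1p.
natural = log2_fold_change(math.log1p(3.0), math.log1p(1.0))

# The same means stored with base-2 log1p: log2(1 + 3) = 2, log2(1 + 1) = 1.
base_two = log2_fold_change(2.0, 1.0, base=2)

print(natural, base_two)
```

Both calls describe the same underlying ratio of 3 to 1, so they agree up to the pseudocount. Passing the wrong base would silently change the reported fold changes, which is why the base recorded in adata.uns matters.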

Out-of-scope behavior

Calls outside the supported scope silently fall back to the upstream Scanpy implementation.
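Because the fallback is silent, it can help to check ahead of time whether a call meets the fast-path conditions listed under Supported scope. The helper below is a hypothetical sketch that mirrors those conditions; it is not part of AutoZyme or Scanpy, and the name `fast_path_blockers` and its parameters are illustrative only.

```python
import importlib.util


def fast_path_blockers(*, method, reference="rest", tie_correct=False,
                       x_is_sparse, groupby_is_categorical=True,
                       numba_available=None):
    """List the reasons a call would miss the fast path (empty if none)."""
    if numba_available is None:
        # Detect numba without importing it.
        numba_available = importlib.util.find_spec("numba") is not None

    blockers = []
    if method != "wilcoxon":
        blockers.append(f"method is {method!r}, not 'wilcoxon'")
    if reference != "rest":
        blockers.append(f"reference is {reference!r}, not 'rest'")
    if tie_correct:
        blockers.append("tie_correct is True")
    if not numba_available:
        blockers.append("numba is not available")
    if not x_is_sparse:
        blockers.append("X is not a SciPy sparse matrix")
    if not groupby_is_categorical:
        blockers.append("groupby column is not categorical")
    return blockers


# An in-scope call and an out-of-scope call (dense X with the t-test method).
in_scope = fast_path_blockers(method="wilcoxon", x_is_sparse=True,
                              numba_available=True)
out_of_scope = fast_path_blockers(method="t-test", x_is_sparse=False,
                                  numba_available=True)
print(in_scope)
print(out_of_scope)
```

In real use, x_is_sparse could be obtained from scipy.sparse.issparse(adata.X); it is passed as a plain flag here to keep the sketch free of third-party imports.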

Detailed speedup table (11 runs)

Dataset	Tier	Platform	Threads	Baseline runtime	Optimized runtime	Speedup	Peak memory (baseline → optimized)	Concordance	Pass
`gastrulation_pijuansala`	ood_large3	Windows	32	3.46 min	6.30 s	33.1×	14.8 → 16.8 GB	—	pass
`heart_adult`	large	Windows	32	13.86 min	8.60 s	96.7×	19.0 → 21.3 GB	—	pass
`pbmc200k_glaucoma`	medium	Windows	32	2.84 min	3.27 s	52.2×	7.3 → 8.3 GB	—	pass
`pbmc68k`	small	Windows	32	12.54 s	149 ms	97.4×	0.8 → 0.9 GB	—	pass
`splitseq_rosenberg`	ood_large1	Windows	32	1.59 min	1.60 s	59.8×	4.1 → 4.7 GB	—	pass
`tms_ss2`	ood_large2	Windows	32	1.71 min	4.08 s	25.2×	8.6 → 9.9 GB	—	pass
`gastrulation_pijuansala`	ood_large3	macOS	14	1.88 min	6.21 s	18.1×	14.0 → 12.9 GB	—	pass
`pbmc200k_glaucoma`	medium	macOS	14	1.85 min	2.96 s	38.9×	10.2 → 10.3 GB	—	pass
`pbmc68k`	small	macOS	14	19.63 s	66 ms	293.3×	2.9 → 1.0 GB	—	pass
`splitseq_rosenberg`	ood_large1	macOS	14	59.53 s	1.33 s	45.3×	6.6 → 4.5 GB	—	pass
`tms_ss2`	ood_large2	macOS	14	1.06 min	3.13 s	20.2×	8.9 → 9.0 GB	—	pass
