Scanpy highly_variable_genes is one of the slower steps in many single-cell genomics workflows. AutoZyme ships a
verified, drop-in patch that is up to 19.6× faster, returning bit-for-bit identical results with no change to how you call it.
Best speedup: 19.6×
Median speedup: 8.38×
Output equivalence: Bit-exact
Best runtime: baseline 283 ms → optimized 15 ms
Datasets: 6
Pass rate: 12/12
Benchmark charts
Speedup distribution (log scale). Each dot is one finalized dataset/thread run on Windows.
Best speedup per dataset as labeled on the chart: pbmc68k 19.6×, pbmc200k_glaucoma 10.3×, heart_adult 9.05×, tms_ss2 8.24×, gastrulation_pijuansala 7.53×.

| Dataset | Tier | Threads | Baseline | Optimized | Speedup | Memory |
|---|---|---|---|---|---|---|
| pbmc68k | small | 1 | 313 ms | 32 ms | 10.1× | 0.6 GB → 1.0 GB |
| pbmc68k | small | 4 | 283 ms | 15 ms | 19.6× | 0.6 GB → 1.0 GB |
| pbmc68k | small | 32 | 238 ms | 15 ms | 15.6× | 0.6 GB → 0.8 GB |
| pbmc200k_glaucoma | medium | 1 | 4.67 s | 1.04 s | 4.86× | 5.7 GB → 7.5 GB |
| pbmc200k_glaucoma | medium | 1 | 2.57 s | 1.21 s | 2.02× | 8.9 GB → 7.7 GB |
| pbmc200k_glaucoma | medium | 4 | 5.14 s | 592 ms | 8.54× | 9.3 GB → 9.3 GB |
| pbmc200k_glaucoma | medium | 4 | 2.43 s | 413 ms | 5.67× | 8.9 GB → 7.7 GB |
| pbmc200k_glaucoma | medium | 32 | 2.21 s | 216 ms | 10.3× | 8.9 GB → 7.6 GB |
| pbmc200k_glaucoma | medium | 32 | 5.05 s | 543 ms | 9.31× | 9.3 GB → 9.3 GB |
| heart_adult | large | 1 | 10.75 s | 2.57 s | 4.18× | 14 GB → 19 GB |
| heart_adult | large | 1 | 6.49 s | 3.41 s | 1.87× | 23 GB → 19 GB |
| heart_adult | large | 4 | 10.26 s | 1.48 s | 7.25× | 24 GB → 24 GB |
| heart_adult | large | 4 | 4.74 s | 1.03 s | 4.57× | 23 GB → 19 GB |
| heart_adult | large | 32 | 5.16 s | 570 ms | 9.05× | 23 GB → 19 GB |
| heart_adult | large | 32 | 10.42 s | 1.43 s | 7.53× | 24 GB → 24 GB |
| tms_ss2 | small | 1 | 5.18 s | 1.52 s | 3.95× | 6.8 GB → 8.9 GB |
| tms_ss2 | ood_large2 | 1 | 3.37 s | 2.08 s | 1.57× | 11 GB → 9.0 GB |
| tms_ss2 | small | 4 | 6.66 s | 1.07 s | 5.60× | 11 GB → 11 GB |
| tms_ss2 | ood_large2 | 4 | 3.04 s | 778 ms | 3.98× | 11 GB → 9.0 GB |
| tms_ss2 | ood_large2 | 32 | 2.78 s | 329 ms | 8.24× | 11 GB → 8.9 GB |
| tms_ss2 | small | 32 | 6.62 s | 1.01 s | 5.92× | 11 GB → 11 GB |
| gastrulation_pijuansala | ood_large2 | 1 | 9.28 s | 2.48 s | 4.53× | 11 GB → 15 GB |
| gastrulation_pijuansala | ood_large3 | 1 | 5.14 s | 2.65 s | 1.93× | 18 GB → 15 GB |
| gastrulation_pijuansala | ood_large2 | 4 | 11.26 s | 1.77 s | 6.37× | 19 GB → 19 GB |
| gastrulation_pijuansala | ood_large3 | 4 | 4.59 s | 947 ms | 4.84× | 18 GB → 15 GB |
| gastrulation_pijuansala | ood_large2 | 32 | 11.75 s | 1.50 s | 7.53× | 19 GB → 19 GB |
| gastrulation_pijuansala | ood_large3 | 32 | 3.64 s | 530 ms | 6.86× | 18 GB → 15 GB |
| splitseq_rosenberg | ood_large1 | 1 | 2.38 s | 634 ms | 3.62× | 3.3 GB → 4.3 GB |
| splitseq_rosenberg | ood_large1 | 1 | 1.39 s | 714 ms | 1.95× | 4.9 GB → 4.5 GB |
| splitseq_rosenberg | ood_large1 | 4 | 2.86 s | 340 ms | 6.73× | 5.2 GB → 5.2 GB |
| splitseq_rosenberg | ood_large1 | 4 | 1.20 s | 218 ms | 5.59× | 4.9 GB → 4.5 GB |
| splitseq_rosenberg | ood_large1 | 32 | 884 ms | 120 ms | 7.42× | 4.9 GB → 4.3 GB |
| splitseq_rosenberg | ood_large1 | 32 | 2.24 s | 352 ms | 6.49× | 5.3 GB → 5.3 GB |
The public API stays the same; AutoZyme replaces only the supported fast path.
This task targets highly_variable_genes in Scanpy. The benchmarked result
preserves the declared scientific output gate while reducing CPU runtime on the listed datasets.
The dispatcher _patched_hvg routes by flavor and batch_key. The benchmarked configuration (flavor="seurat", batch_key=None, sparse CSR log1p-normalized input, numba available) is handled by _fast_hvg_seurat. That fast path supports:
- flavor="seurat" only;
- sparse input (auto-converted to CSR float32);
- both selection modes: n_top_genes (argpartition top-N on normalized dispersion) and the cutoff mode with min_mean/max_mean/min_disp/max_disp (all four honored, lines 296-299);
- n_bins (honored, passed into the kernel);
- layer= (reads adata.layers[layer]);
- subset= and inplace= (both honored, lines 304-321).
It stores log1p(mean) for means to match the upstream Scanpy seurat contract (line 302).
Separately, flavor in {seurat_v3, seurat_v3_paper} with batch_key set routes to _fast_hvg_seurat_v3_batch (a heavily guarded CSR-raw-counts batch path), but that is not the benchmarked path.
The eval metric is hvg_jaccard>=0.95 (set overlap of selected genes), which is tolerant of small numeric drift.
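To make the hvg_jaccard>=0.95 gate concrete, here is a minimal sketch of a Jaccard overlap between two sets of selected genes. The page does not show how the gate is actually implemented, so the function below is the textbook Jaccard index, and the gene names are illustrative only.

```python
def hvg_jaccard(selected_a, selected_b):
    """Jaccard index |A ∩ B| / |A ∪ B| of two gene selections.

    Assumption: the real gate's implementation is not shown on the page;
    this is only the standard set-overlap definition.
    """
    a, b = set(selected_a), set(selected_b)
    if not a and not b:
        # Two empty selections are treated as identical here (a convention).
        return 1.0
    return len(a & b) / len(a | b)


# Illustrative gene lists, not taken from any benchmark dataset.
baseline = ["CD3E", "MS4A1", "NKG7", "LYZ"]
patched = ["CD3E", "MS4A1", "NKG7", "GNLY"]

score = hvg_jaccard(baseline, patched)  # 3 shared genes out of 5 distinct
passes_gate = score >= 0.95             # threshold quoted in the text above
```

Identical selections score 1.0, so a bit-exact result clears a 0.95 threshold with room to spare.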
Out-of-scope behavior: silent fallback to upstream
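The routing described above can be pictured with a small sketch. This is not AutoZyme's code: choose_route and its is_sparse and numba_available parameters are hypothetical names, and the conditions are a simplified reading of the scope text. The real _fast_hvg_seurat_v3_batch path is described as heavily guarded, and those guards are not modeled here.

```python
def choose_route(flavor="seurat", batch_key=None, is_sparse=True,
                 numba_available=True):
    """Return the name of the path a call would take (hypothetical sketch).

    Assumption: conditions are inferred from the scope description only;
    the actual dispatcher _patched_hvg may check more than this.
    """
    if (flavor == "seurat" and batch_key is None
            and is_sparse and numba_available):
        return "_fast_hvg_seurat"           # the benchmarked fast path
    if flavor in {"seurat_v3", "seurat_v3_paper"} and batch_key is not None:
        return "_fast_hvg_seurat_v3_batch"  # separate path, not benchmarked
    return "upstream"                       # silent fallback to stock Scanpy


routes = [
    choose_route(),                                        # benchmarked config
    choose_route(flavor="seurat_v3", batch_key="sample"),  # batch path
    choose_route(flavor="cell_ranger"),                    # out of scope
]
```

A design like this keeps unsupported calls on the original code path, so out-of-scope inputs behave as they did before the patch.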
Detailed speedup table (12 runs)

| Dataset | Tier | Platform | Threads | Baseline | Optimized | Speedup | Memory | Concordance | Pass |
|---|---|---|---|---|---|---|---|---|---|
| gastrulation_pijuansala | ood_large2 | Windows | 32 | 11.75 s | 1.50 s | 7.53× | 18.7 → 18.7 GB | — | pass |
| heart_adult | large | Windows | 32 | 5.16 s | 570 ms | 9.05× | 23.3 → 19.3 GB | — | pass |
| pbmc200k_glaucoma | medium | Windows | 32 | 2.21 s | 216 ms | 10.3× | 8.9 → 7.6 GB | — | pass |
| pbmc68k | small | Windows | 4 | 283 ms | 15 ms | 19.6× | 0.6 → 1.0 GB | — | pass |
| splitseq_rosenberg | ood_large1 | Windows | 32 | 884 ms | 120 ms | 7.42× | 4.9 → 4.3 GB | — | pass |
| tms_ss2 | ood_large2 | Windows | 32 | 2.78 s | 329 ms | 8.24× | 10.6 → 8.9 GB | — | pass |
| gastrulation_pijuansala | ood_large2 | macOS | 14 | 7.63 s | 896 ms | 8.52× | 10.2 → 9.7 GB | — | pass |
| heart_adult | large | macOS | 14 | 7.95 s | 1.25 s | 6.26× | 14.5 → 16.2 GB | — | pass |
| pbmc200k_glaucoma | medium | macOS | 14 | 1.46 s | 205 ms | 7.45× | 10.2 → 10.3 GB | — | pass |
| pbmc68k | small | macOS | 14 | 58 ms | 13 ms | 6.46× | 1.0 → 1.0 GB | — | pass |
| splitseq_rosenberg | ood_large1 | macOS | 14 | 1.53 s | 178 ms | 8.61× | 4.4 → 3.5 GB | — | pass |
| tms_ss2 | small | macOS | 4 | 6.70 s | 757 ms | 8.85× | 11.1 → 11.1 GB | — | pass |
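The headline figures can be checked against the 12 runs in the detailed speedup table. The sketch below copies the Speedup column by hand and computes the best and median values; the variable names are illustrative.

```python
from statistics import median

# Speedup column of the 12-run table, keyed by (platform, dataset).
speedups = {
    ("Windows", "gastrulation_pijuansala"): 7.53,
    ("Windows", "heart_adult"): 9.05,
    ("Windows", "pbmc200k_glaucoma"): 10.3,
    ("Windows", "pbmc68k"): 19.6,
    ("Windows", "splitseq_rosenberg"): 7.42,
    ("Windows", "tms_ss2"): 8.24,
    ("macOS", "gastrulation_pijuansala"): 8.52,
    ("macOS", "heart_adult"): 6.26,
    ("macOS", "pbmc200k_glaucoma"): 7.45,
    ("macOS", "pbmc68k"): 6.46,
    ("macOS", "splitseq_rosenberg"): 8.61,
    ("macOS", "tms_ss2"): 8.85,
}

best_speedup = max(speedups.values())
# With 12 values the median is the mean of the 6th and 7th sorted values.
median_speedup = round(median(speedups.values()), 2)

print(f"best {best_speedup}x, median {median_speedup}x")
```

The two middle values are 8.24× and 8.52×, which average to the reported median of 8.38×; the maximum is the 19.6× pbmc68k run on Windows.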
Frequently asked questions
Speeding up Scanpy highly_variable_genes
Why is Scanpy highly_variable_genes slow?
Scanpy highly_variable_genes is CPU-bound, and the stock implementation leaves performance on the table in its core numerical work. In the best benchmark run (pbmc68k, 4 threads, Windows), the original takes 283 ms where the AutoZyme path takes 15 ms (19.6× faster).
How do I make Scanpy highly_variable_genes faster?
Install AutoZyme and activate the Scanpy patch, then keep using Scanpy highly_variable_genes exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 19.6× faster on the benchmark datasets, with no pipeline or API changes.
Does the AutoZyme speedup change the Scanpy highly_variable_genes output?
No. The accelerated path returns bit-for-bit identical results to the original Scanpy implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.
How do I install the Scanpy speedup?
From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scanpy"). The patch applies automatically the next time you call highly_variable_genes.
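A minimal sketch of those steps, assuming autozyme has already been installed with pip. Only autozyme.activate("scanpy") comes from the text above; the helper names activate_scanpy_patch and select_hvgs are hypothetical, and n_top_genes=2000 is an arbitrary illustrative value.

```python
import importlib.util


def activate_scanpy_patch():
    """Activate the AutoZyme Scanpy patch if the package is installed.

    Returns True when the patch was activated, False when autozyme is
    missing (stock Scanpy is used in that case).
    """
    if importlib.util.find_spec("autozyme") is None:
        return False
    import autozyme
    autozyme.activate("scanpy")  # the call quoted in the text above
    return True


def select_hvgs(adata, n_top_genes=2000):
    """Call Scanpy as usual; the call site does not change after activation.

    Assumes adata holds log1p-normalized data, as the seurat flavor expects.
    """
    import scanpy as sc
    sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=n_top_genes)
    return adata.var["highly_variable"]


patch_active = activate_scanpy_patch()
```

Calling select_hvgs(adata) afterwards looks the same whether or not the patch is active.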