Speed up Scanpy highly_variable_genes (v3 batched)
Scanpy highly_variable_genes (v3 batched) is one of the slower steps in many single-cell genomics workflows. AutoZyme ships a
verified, drop-in patch that is up to 9.31× faster, returning bit-for-bit identical results with no change to how you call it.
Best speedup: 9.31×
Median speedup: 8.38×
Output equivalence: Bit-exact
Best runtime: baseline 5.05 s → optimized 543 ms
Datasets: 6
Pass rate: 12/12
Benchmark charts
Speedup distribution
Each dot is one finalized dataset/thread run on Windows (log scale).
pbmc68k
pbmc68k · small · 1 thread · 10.1× speedup · 313 ms baseline → 32 ms optimized · memory 0.6 GB → 1.0 GB
pbmc68k · small · 4 threads · 19.6× speedup · 283 ms baseline → 15 ms optimized · memory 0.6 GB → 1.0 GB
pbmc68k · small · 32 threads · 15.6× speedup · 238 ms baseline → 15 ms optimized · memory 0.6 GB → 0.8 GB
19.6×
pbmc200k_glaucoma
pbmc200k_glaucoma · medium · 1 thread · 4.86× speedup · 4.67 s baseline → 1.04 s optimized · memory 5.7 GB → 7.5 GB
pbmc200k_glaucoma · medium · 1 thread · 2.02× speedup · 2.57 s baseline → 1.21 s optimized · memory 8.9 GB → 7.7 GB
pbmc200k_glaucoma · medium · 4 threads · 8.54× speedup · 5.14 s baseline → 592 ms optimized · memory 9.3 GB → 9.3 GB
pbmc200k_glaucoma · medium · 4 threads · 5.67× speedup · 2.43 s baseline → 413 ms optimized · memory 8.9 GB → 7.7 GB
pbmc200k_glaucoma · medium · 32 threads · 10.3× speedup · 2.21 s baseline → 216 ms optimized · memory 8.9 GB → 7.6 GB
pbmc200k_glaucoma · medium · 32 threads · 9.31× speedup · 5.05 s baseline → 543 ms optimized · memory 9.3 GB → 9.3 GB
10.3×
heart_adult
heart_adult · large · 1 thread · 4.18× speedup · 10.75 s baseline → 2.57 s optimized · memory 14 GB → 19 GB
heart_adult · large · 1 thread · 1.87× speedup · 6.49 s baseline → 3.41 s optimized · memory 23 GB → 19 GB
heart_adult · large · 4 threads · 7.25× speedup · 10.26 s baseline → 1.48 s optimized · memory 24 GB → 24 GB
heart_adult · large · 4 threads · 4.57× speedup · 4.74 s baseline → 1.03 s optimized · memory 23 GB → 19 GB
heart_adult · large · 32 threads · 9.05× speedup · 5.16 s baseline → 570 ms optimized · memory 23 GB → 19 GB
heart_adult · large · 32 threads · 7.53× speedup · 10.42 s baseline → 1.43 s optimized · memory 24 GB → 24 GB
9.05×
tms_ss2
tms_ss2 · small · 1 thread · 3.95× speedup · 5.18 s baseline → 1.52 s optimized · memory 6.8 GB → 8.9 GB
tms_ss2 · ood_large2 · 1 thread · 1.57× speedup · 3.37 s baseline → 2.08 s optimized · memory 11 GB → 9.0 GB
tms_ss2 · small · 4 threads · 5.60× speedup · 6.66 s baseline → 1.07 s optimized · memory 11 GB → 11 GB
tms_ss2 · ood_large2 · 4 threads · 3.98× speedup · 3.04 s baseline → 778 ms optimized · memory 11 GB → 9.0 GB
tms_ss2 · ood_large2 · 32 threads · 8.24× speedup · 2.78 s baseline → 329 ms optimized · memory 11 GB → 8.9 GB
tms_ss2 · small · 32 threads · 5.92× speedup · 6.62 s baseline → 1.01 s optimized · memory 11 GB → 11 GB
8.24×
gastrulation_pijuansa…
gastrulation_pijuansala · ood_large2 · 1 thread · 4.53× speedup · 9.28 s baseline → 2.48 s optimized · memory 11 GB → 15 GB
gastrulation_pijuansala · ood_large3 · 1 thread · 1.93× speedup · 5.14 s baseline → 2.65 s optimized · memory 18 GB → 15 GB
gastrulation_pijuansala · ood_large2 · 4 threads · 6.37× speedup · 11.26 s baseline → 1.77 s optimized · memory 19 GB → 19 GB
gastrulation_pijuansala · ood_large3 · 4 threads · 4.84× speedup · 4.59 s baseline → 947 ms optimized · memory 18 GB → 15 GB
gastrulation_pijuansala · ood_large2 · 32 threads · 7.53× speedup · 11.75 s baseline → 1.50 s optimized · memory 19 GB → 19 GB
gastrulation_pijuansala · ood_large3 · 32 threads · 6.86× speedup · 3.64 s baseline → 530 ms optimized · memory 18 GB → 15 GB
7.53×
splitseq_rosenberg
splitseq_rosenberg · ood_large1 · 1 thread · 3.62× speedup · 2.38 s baseline → 634 ms optimized · memory 3.3 GB → 4.3 GB
splitseq_rosenberg · ood_large1 · 1 thread · 1.95× speedup · 1.39 s baseline → 714 ms optimized · memory 4.9 GB → 4.5 GB
splitseq_rosenberg · ood_large1 · 4 threads · 6.73× speedup · 2.86 s baseline → 340 ms optimized · memory 5.2 GB → 5.2 GB
splitseq_rosenberg · ood_large1 · 4 threads · 5.59× speedup · 1.20 s baseline → 218 ms optimized · memory 4.9 GB → 4.5 GB
splitseq_rosenberg · ood_large1 · 32 threads · 7.42× speedup · 884 ms baseline → 120 ms optimized · memory 4.9 GB → 4.3 GB
splitseq_rosenberg · ood_large1 · 32 threads · 6.49× speedup · 2.24 s baseline → 352 ms optimized · memory 5.3 GB → 5.3 GB
The public API stays the same; AutoZyme replaces only the supported fast path.
This task targets highly_variable_genes (v3 · batched) in Scanpy. The benchmarked result
passes the declared output-equivalence gate (bit-exact) while reducing CPU runtime on the listed datasets.
The v3-batch fast path (_fast_hvg_seurat_v3_batch) correctly handles the following scope.
Arguments: flavor in {"seurat_v3", "seurat_v3_paper"} with a non-None batch_key, n_top_genes a concrete int (not None), subset=False, inplace=True, and no extra/unknown kwargs.
Input: a CSR-sparse raw-count matrix (adata.X or a named layer).
Honored options: span (passed into loess) and check_values (drives the non-integer warning).
Multiple batches: per-batch loess fit, per-batch clipped variance, median-rank aggregation, and the seurat_v3 vs seurat_v3_paper lexsort tiebreak ordering (lines 519-522).
Outputs: highly_variable, highly_variable_rank, means (overall), variances (overall), variances_norm, and highly_variable_nbatches, written to adata.var and uns["hvg"].
Dependencies: Numba (parallel prange) and skmisc.loess must both be importable.
Data requirements: >=2 obs, >=1 var, all batch sizes >=2, valid (non-negative) batch codes, and >=2 non-constant genes per batch.
Anything outside this scope is delegated verbatim to the captured upstream original (__autozyme_original__). A separate flavor="seurat" non-batch path (_fast_hvg_seurat) also exists but is not the benchmarked target here.
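To make the scope easier to reason about, here is a minimal sketch of the argument-level conditions as a standalone predicate. The function name in_v3_batch_scope and its parameters are hypothetical and are not AutoZyme's actual check. It covers only part of the scope: the gene-level and dependency requirements (non-constant genes per batch, Numba and skmisc.loess being importable) are left out. The matrix_format argument is assumed to be a string such as the format attribute of a SciPy sparse matrix.

```python
SUPPORTED_FLAVORS = {"seurat_v3", "seurat_v3_paper"}


def in_v3_batch_scope(*, flavor, batch_key, n_top_genes, matrix_format,
                      batch_sizes, subset=False, inplace=True,
                      extra_kwargs=None):
    """Return (ok, reason) for a subset of the scope conditions listed above.

    Hypothetical illustration only; not AutoZyme's real implementation.
    """
    if flavor not in SUPPORTED_FLAVORS:
        return False, "flavor is not seurat_v3 or seurat_v3_paper"
    if batch_key is None:
        return False, "batch_key is None"
    if matrix_format != "csr":
        return False, "matrix is not CSR-sparse"
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(n_top_genes, bool) or not isinstance(n_top_genes, int):
        return False, "n_top_genes is not a concrete int"
    if subset or not inplace:
        return False, "requires subset=False and inplace=True"
    if extra_kwargs:
        return False, "extra or unknown kwargs present"
    if any(size < 2 for size in batch_sizes):
        return False, "a batch has fewer than 2 observations"
    return True, "in scope"


ok, reason = in_v3_batch_scope(
    flavor="seurat_v3", batch_key="batch", n_top_genes=2000,
    matrix_format="csr", batch_sizes=[500, 300],
)
print(ok, reason)
```

A call with flavor="seurat" or n_top_genes=None would return False with the matching reason, which corresponds to the cases the page says are handed back to upstream Scanpy.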
Out-of-scope behavior: silent fallback to upstream
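The fallback behavior can be pictured as a thin wrapper around the upstream function. The sketch below shows the general monkey-patching pattern only. The helper with_fallback and the toy functions are hypothetical, and storing the original as an attribute on the wrapper is an assumption: the page gives the name __autozyme_original__ but does not say where it lives.

```python
import functools


def with_fallback(original, fast_path, in_scope):
    """Wrap `original` so in-scope calls use `fast_path`; all others go upstream."""

    @functools.wraps(original)
    def patched(*args, **kwargs):
        if in_scope(*args, **kwargs):
            return fast_path(*args, **kwargs)
        # Out of scope: delegate with the arguments untouched, no warning raised.
        return original(*args, **kwargs)

    # Assumption: keep a handle on the upstream function on the wrapper itself.
    patched.__autozyme_original__ = original
    return patched


# Toy stand-ins (hypothetical) to show the dispatch.
def upstream(x, flavor="seurat"):
    return ("upstream", x * 2)


def fast(x, flavor="seurat"):
    return ("fast", x * 2)


patched = with_fallback(
    upstream, fast,
    in_scope=lambda x, flavor="seurat": flavor == "seurat_v3",
)

print(patched(3, flavor="seurat_v3"))  # handled by the fast path
print(patched(3))                      # silently delegated to upstream
```

Because the fallback is silent, a caller cannot tell from the return value which path ran; both toy functions return the same numeric result by construction.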
Detailed speedup table (12 runs)
| Dataset | Tier | Platform | Threads | Baseline | Optimized | Speedup | Memory | Concordance | Pass |
|---|---|---|---|---|---|---|---|---|---|
| gastrulation_pijuansala | ood_large2 | Windows | 32 | 11.75 s | 1.50 s | 7.53× | 18.7 → 18.7 GB | — | pass |
| heart_adult | large | Windows | 32 | 5.16 s | 570 ms | 9.05× | 23.3 → 19.3 GB | — | pass |
| pbmc200k_glaucoma | medium | Windows | 32 | 2.21 s | 216 ms | 10.3× | 8.9 → 7.6 GB | — | pass |
| pbmc68k | small | Windows | 4 | 283 ms | 15 ms | 19.6× | 0.6 → 1.0 GB | — | pass |
| splitseq_rosenberg | ood_large1 | Windows | 32 | 884 ms | 120 ms | 7.42× | 4.9 → 4.3 GB | — | pass |
| tms_ss2 | ood_large2 | Windows | 32 | 2.78 s | 329 ms | 8.24× | 10.6 → 8.9 GB | — | pass |
| gastrulation_pijuansala | ood_large2 | macOS | 14 | 7.63 s | 896 ms | 8.52× | 10.2 → 9.7 GB | — | pass |
| heart_adult | large | macOS | 14 | 7.95 s | 1.25 s | 6.26× | 14.5 → 16.2 GB | — | pass |
| pbmc200k_glaucoma | medium | macOS | 14 | 1.46 s | 205 ms | 7.45× | 10.2 → 10.3 GB | — | pass |
| pbmc68k | small | macOS | 14 | 58 ms | 13 ms | 6.46× | 1.0 → 1.0 GB | — | pass |
| splitseq_rosenberg | ood_large1 | macOS | 14 | 1.53 s | 178 ms | 8.61× | 4.4 → 3.5 GB | — | pass |
| tms_ss2 | small | macOS | 4 | 6.70 s | 757 ms | 8.85× | 11.1 → 11.1 GB | — | pass |
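As a quick cross-check, the headline median speedup can be reproduced from the Speedup column of the 12 runs in the detailed table. The snippet below is a small worked calculation using the values as displayed on this page, so it inherits their rounding.

```python
from statistics import median

# Speedup column of the 12 listed runs: six Windows rows, then six macOS rows.
speedups = [
    7.53, 9.05, 10.3, 19.6, 7.42, 8.24,   # Windows
    8.52, 6.26, 7.45, 6.46, 8.61, 8.85,   # macOS
]

# With an even number of runs, the median is the mean of the two middle values.
median_speedup = median(speedups)
print(f"median speedup over {len(speedups)} runs: {median_speedup:.2f}x")
```

The two middle values after sorting are 8.24× and 8.52×, whose mean is 8.38×, matching the median reported at the top of the page.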
Frequently asked questions
Speeding up Scanpy highly_variable_genes (v3 batched)
Why is Scanpy highly_variable_genes (v3 batched) slow?
Scanpy highly_variable_genes (v3 batched) is CPU-bound, and the stock implementation leaves performance on the table in its core numerical work. In the headline benchmark run (pbmc200k_glaucoma, medium tier, 32 threads, Windows), the original takes 5.05 s where the AutoZyme path takes 543 ms (9.31× faster).
How do I make Scanpy highly_variable_genes (v3 batched) faster?
Install AutoZyme and activate the Scanpy patch, then keep using Scanpy highly_variable_genes (v3 batched) exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 9.31× faster on the benchmark datasets, with no pipeline or API changes.
Does the AutoZyme speedup change the Scanpy highly_variable_genes (v3 batched) output?
No. The accelerated path returns bit-for-bit identical results to the original Scanpy implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.
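For readers who want to see what "bit-for-bit identical" means in practice, here is a minimal standard-library sketch of such a comparison. The helper names are hypothetical and this is not the concordance gate AutoZyme uses. It compares the raw IEEE-754 bytes of each value, which is stricter than ==: for example, 0.0 and -0.0 compare equal with == but differ in their bits.

```python
import struct


def bits(x):
    """Return the raw IEEE-754 double-precision bytes of a float."""
    return struct.pack("<d", x)


def bit_identical(a, b):
    """True if two float sequences have the same length and identical bytes."""
    return len(a) == len(b) and all(bits(x) == bits(y) for x, y in zip(a, b))


def max_abs_diff(a, b):
    """Largest absolute element-wise difference (0.0 for empty input)."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


baseline = [0.1 + 0.2, 1.5, 0.0]
same = [0.1 + 0.2, 1.5, 0.0]    # produced by the same arithmetic
close = [0.3, 1.5, 0.0]         # differs in the last bits of the first value

print(bit_identical(baseline, same), max_abs_diff(baseline, same))
print(bit_identical(baseline, close), max_abs_diff(baseline, close) > 0)
```

In a real check, the same idea would be applied to each output column written to adata.var, comparing a run of upstream Scanpy against a run with the patch active.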
How do I install the Scanpy speedup?
From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scanpy"). The patch applies automatically the next time you call highly_variable_genes (v3 batched).
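The activation step can be wrapped so that a pipeline still runs on machines where AutoZyme is not installed. This is a minimal sketch: the wrapper function activate_scanpy_patch is hypothetical, and only the autozyme.activate("scanpy") call is taken from the instructions above. Its return value is not documented here, so the sketch ignores it.

```python
import importlib.util


def activate_scanpy_patch():
    """Activate the AutoZyme Scanpy patch if the package is installed.

    Returns True if activation was called, False if autozyme is not installed.
    """
    if importlib.util.find_spec("autozyme") is None:
        return False
    import autozyme

    autozyme.activate("scanpy")  # call taken from the install instructions
    return True


patched = activate_scanpy_patch()
print("AutoZyme patch active" if patched else "autozyme not installed; using stock Scanpy")
```

After activation, an in-scope call looks the same as before, for example sc.pp.highly_variable_genes(adata, flavor="seurat_v3", batch_key="batch", n_top_genes=2000), where the batch column name and gene count are placeholders.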