
Speed up Scanpy highly_variable_genes (v3 batched)

Scanpy highly_variable_genes (v3 batched) is one of the slower steps in many single-cell genomics workflows. AutoZyme ships a verified, drop-in patch that is up to 9.31× faster, returning bit-for-bit identical results with no change to how you call it.

Best speedup: 9.31×
Median speedup: 8.38×
Output equivalence: bit-exact
Best runtime: 5.05 s baseline, 543 ms optimized
Datasets: 6
Pass rate: 12/12

Benchmark charts

Speedup distribution (log scale)
Each dot is one finalized dataset/thread run on Windows. Datasets: pbmc68k, pbmc200k_glaucoma, heart_adult, tms_ss2, gastrulation_pijuansala, splitseq_rosenberg.
Thread sweep
Speedup across finalized thread counts on Windows (1, 4, and full (32) threads)

Dataset Tier Threads Baseline Optimized Speedup Memory
pbmc68k small 1 313 ms 32 ms 10.1× 0.6 → 1.0 GB
pbmc68k small 4 283 ms 15 ms 19.6× 0.6 → 1.0 GB
pbmc68k small 32 238 ms 15 ms 15.6× 0.6 → 0.8 GB
pbmc200k_glaucoma medium 1 4.67 s 1.04 s 4.86× 5.7 → 7.5 GB
pbmc200k_glaucoma medium 4 5.14 s 592 ms 8.54× 9.3 → 9.3 GB
pbmc200k_glaucoma medium 32 2.21 s 216 ms 10.3× 8.9 → 7.6 GB
heart_adult large 1 10.75 s 2.57 s 4.18× 14 → 19 GB
heart_adult large 4 10.26 s 1.48 s 7.25× 24 → 24 GB
heart_adult large 32 5.16 s 570 ms 9.05× 23 → 19 GB
tms_ss2 small 1 5.18 s 1.52 s 3.95× 6.8 → 8.9 GB
tms_ss2 small 4 6.66 s 1.07 s 5.60× 11 → 11 GB
tms_ss2 ood_large2 32 2.78 s 329 ms 8.24× 11 → 8.9 GB
gastrulation_pijuansala ood_large2 1 9.28 s 2.48 s 4.53× 11 → 15 GB
gastrulation_pijuansala ood_large2 4 11.26 s 1.77 s 6.37× 19 → 19 GB
gastrulation_pijuansala ood_large2 32 11.75 s 1.50 s 7.53× 19 → 19 GB
splitseq_rosenberg ood_large1 1 2.38 s 634 ms 3.62× 3.3 → 4.3 GB
splitseq_rosenberg ood_large1 4 2.86 s 340 ms 6.73× 5.2 → 5.2 GB
splitseq_rosenberg ood_large1 32 884 ms 120 ms 7.42× 4.9 → 4.3 GB
Memory
Baseline vs optimized peak memory on Windows (32 threads)

Dataset Tier Memory Optimized/baseline Speedup
heart_adult large 24 → 24 GB 1.00× 7.53×
gastrulation_pijuansala ood_large2 19 → 19 GB 1.00× 7.53×
tms_ss2 small 11 → 11 GB 1.00× 5.92×
pbmc200k_glaucoma medium 9.3 → 9.3 GB 1.00× 9.31×
splitseq_rosenberg ood_large1 5.3 → 5.3 GB 1.00× 6.49×
pbmc68k small 0.6 → 0.8 GB 1.38× 15.6×

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets highly_variable_genes (v3 · batched) in Scanpy. The benchmarked result passes the declared scientific output gate while reducing CPU runtime on the listed datasets.

Also searched as: HVG, highly variable genes, batched HVG, batch HVG, seurat_v3 batch, batched feature selection.

Supported scope

The v3-batch fast path (_fast_hvg_seurat_v3_batch) correctly handles calls that meet all of the following conditions:

flavor is in {"seurat_v3","seurat_v3_paper"} WITH a non-None batch_key.
The input is a CSR-sparse raw-count matrix (adata.X or a named layer).
n_top_genes is a concrete int (not None), subset=False, inplace=True, and there are no extra/unknown kwargs.
Numba (parallel prange) and skmisc.loess are both importable.
The data has >=2 obs, >=1 var, all batch sizes >=2, valid (non-negative) batch codes, and >=2 non-constant genes per batch.

Within that scope, the fast path honors span (passed into loess) and check_values (drives the non-integer warning). It supports multiple batches: per-batch loess fit, per-batch clipped variance, median-rank aggregation, and the seurat_v3 vs seurat_v3_paper lexsort tiebreak ordering (lines 519-522). It writes highly_variable, highly_variable_rank, means (overall), variances (overall), variances_norm, and highly_variable_nbatches to adata.var and uns["hvg"].

Anything outside this scope is delegated verbatim to the captured upstream original (__autozyme_original__). A separate flavor="seurat" non-batch path (_fast_hvg_seurat) also exists but is not the benchmarked target here.
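
The median-rank aggregation and the flavor-dependent tiebreak mentioned above can be illustrated in plain Python. This is a minimal sketch, not AutoZyme's or Scanpy's code: the function name aggregate_batch_ranks and the input layout (one list of per-gene ranks per batch, with None for genes outside that batch's top n_top_genes) are assumptions made for the example. It assumes the documented Scanpy behavior that seurat_v3 sorts by median rank first and breaks ties by the number of batches in which a gene was selected, while seurat_v3_paper sorts by number of batches first and breaks ties by median rank.

```python
from statistics import median


def aggregate_batch_ranks(ranks_per_batch, n_top_genes, flavor="seurat_v3"):
    """Select genes from per-batch ranks (lower rank = more variable).

    ranks_per_batch[b][g] is gene g's rank in batch b, or None when the gene
    fell outside that batch's top n_top_genes (hypothetical input layout).
    Returns the selected gene indices in sort order.
    """
    n_genes = len(ranks_per_batch[0])
    keyed = []
    for g in range(n_genes):
        ranks = [batch[g] for batch in ranks_per_batch if batch[g] is not None]
        if not ranks:
            continue  # unranked in every batch; skipped in this sketch
        med, n_batches = median(ranks), len(ranks)
        if flavor == "seurat_v3":
            key = (med, -n_batches)   # median rank first, then more batches
        elif flavor == "seurat_v3_paper":
            key = (-n_batches, med)   # more batches first, then median rank
        else:
            raise ValueError(f"unsupported flavor: {flavor!r}")
        keyed.append((key, g))
    keyed.sort()  # remaining ties fall back to gene index in this sketch
    return [g for _, g in keyed[:n_top_genes]]


# Two batches, four genes: gene 1 is ranked in both batches, genes 0 and 2 in one.
ranks = [
    [0, 1, None, None],  # batch 0
    [None, 1, 0, None],  # batch 1
]
v3_selection = aggregate_batch_ranks(ranks, 2, "seurat_v3")
paper_selection = aggregate_batch_ranks(ranks, 2, "seurat_v3_paper")
```

With this toy input the two flavors select different genes: seurat_v3 prefers the genes with the best median rank, while seurat_v3_paper prefers the gene that was selected in both batches.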

Out-of-scope behavior

Calls outside the supported scope fall back silently to the upstream Scanpy implementation.
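
The fallback behavior can be pictured as a wrapper that checks the call's arguments and otherwise delegates unchanged. The sketch below is illustrative only: make_patched and in_scope are hypothetical names, the predicate covers argument-level conditions but none of the data checks (CSR input, batch sizes, non-constant genes), and storing the original as an attribute named __autozyme_original__ is an assumption about how the captured upstream function might be kept.

```python
import functools

SUPPORTED_FLAVORS = {"seurat_v3", "seurat_v3_paper"}


def in_scope(adata, *, flavor="seurat", batch_key=None, n_top_genes=None,
             subset=False, inplace=True, layer=None, span=0.3,
             check_values=True, **unknown):
    """Argument-level scope check only (hypothetical, simplified)."""
    return (
        flavor in SUPPORTED_FLAVORS
        and batch_key is not None
        and isinstance(n_top_genes, int)
        and subset is False
        and inplace is True
        and not unknown  # any extra/unknown kwarg forces the fallback
    )


def make_patched(original, fast_path, scope_check):
    """Route in-scope calls to fast_path; delegate everything else verbatim."""
    @functools.wraps(original)
    def patched(*args, **kwargs):
        if scope_check(*args, **kwargs):
            return fast_path(*args, **kwargs)
        return original(*args, **kwargs)  # silent fallback to upstream

    patched.__autozyme_original__ = original  # assumed storage location
    return patched


# Stand-in functions so the dispatch can be observed without Scanpy installed.
def upstream(adata, **kwargs):
    return "upstream"


def fast(adata, **kwargs):
    return "fast"


hvg = make_patched(upstream, fast, in_scope)
batched_call = hvg(None, flavor="seurat_v3", batch_key="batch", n_top_genes=2000)
unbatched_call = hvg(None, flavor="seurat_v3", n_top_genes=2000)
subset_call = hvg(None, flavor="seurat_v3", batch_key="batch",
                  n_top_genes=2000, subset=True)
```

Only the first call satisfies every condition; the other two take the upstream path without raising or warning, which is what makes the fallback silent.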

Detailed speedup table (12 runs)
Dataset Tier Platform Threads Baseline Optimized Speedup Memory Concordance Pass
gastrulation_pijuansala ood_large2 Windows 32 11.75 s 1.50 s 7.53× 18.7 → 18.7 GB pass
heart_adult large Windows 32 5.16 s 570 ms 9.05× 23.3 → 19.3 GB pass
pbmc200k_glaucoma medium Windows 32 2.21 s 216 ms 10.3× 8.9 → 7.6 GB pass
pbmc68k small Windows 4 283 ms 15 ms 19.6× 0.6 → 1.0 GB pass
splitseq_rosenberg ood_large1 Windows 32 884 ms 120 ms 7.42× 4.9 → 4.3 GB pass
tms_ss2 ood_large2 Windows 32 2.78 s 329 ms 8.24× 10.6 → 8.9 GB pass
gastrulation_pijuansala ood_large2 macOS 14 7.63 s 896 ms 8.52× 10.2 → 9.7 GB pass
heart_adult large macOS 14 7.95 s 1.25 s 6.26× 14.5 → 16.2 GB pass
pbmc200k_glaucoma medium macOS 14 1.46 s 205 ms 7.45× 10.2 → 10.3 GB pass
pbmc68k small macOS 14 58 ms 13 ms 6.46× 1.0 → 1.0 GB pass
splitseq_rosenberg ood_large1 macOS 14 1.53 s 178 ms 8.61× 4.4 → 3.5 GB pass
tms_ss2 small macOS 4 6.70 s 757 ms 8.85× 11.1 → 11.1 GB pass
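
As a cross-check, the median of the Speedup column in the 12 rows above reproduces the 8.38× median speedup quoted in the summary. The snippet below simply copies those twelve values; the variable names are illustrative.

```python
from statistics import median

# Speedup column of the detailed table: six Windows rows, then six macOS rows.
speedups = [
    7.53, 9.05, 10.3, 19.6, 7.42, 8.24,   # Windows
    8.52, 6.26, 7.45, 6.46, 8.61, 8.85,   # macOS
]

# With an even number of runs, the median is the mean of the two middle values.
median_speedup = median(speedups)
print(f"{median_speedup:.2f}x")
```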

Frequently asked questions

Why is Scanpy highly_variable_genes (v3 batched) slow?

Scanpy highly_variable_genes (v3 batched) is CPU-bound, and the stock implementation in Scanpy leaves performance on the table in its core numerical work. In the best benchmarked run, the original takes 5.05 s and the AutoZyme path takes 543 ms (9.31× faster).

How do I make Scanpy highly_variable_genes (v3 batched) faster?

Install AutoZyme and activate the Scanpy patch, then keep using Scanpy highly_variable_genes (v3 batched) exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 9.31× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the Scanpy highly_variable_genes (v3 batched) output?

No. The accelerated path returns bit-for-bit identical results to the original Scanpy implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.
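
To make the distinction concrete, here is a small sketch of what a bit-for-bit comparison checks beyond a maximum absolute difference of 0. It is not AutoZyme's concordance gate, whose implementation this page does not describe; the helper names bit_exact and max_abs_diff are hypothetical.

```python
import struct


def _bits(x: float) -> bytes:
    """IEEE 754 double-precision byte pattern of x."""
    return struct.pack("<d", x)


def bit_exact(a, b) -> bool:
    """True when both sequences hold identical float bit patterns."""
    return len(a) == len(b) and all(_bits(x) == _bits(y) for x, y in zip(a, b))


def max_abs_diff(a, b) -> float:
    """Largest elementwise absolute difference (0.0 for empty input)."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


same = bit_exact([1.0, 0.5], [1.0, 0.5])
rounding_differs = bit_exact([0.1 + 0.2], [0.3])
# Signed zeros differ in their bit pattern even though their difference is 0.
signed_zero_bits = bit_exact([0.0], [-0.0])
signed_zero_diff = max_abs_diff([0.0], [-0.0])
```

A bit-pattern comparison is the stricter of the two checks: it also catches differences such as signed zeros, which a difference-based check reports as 0.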

How do I install the Scanpy speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scanpy"). The patch applies automatically the next time you call highly_variable_genes (v3 batched).
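
One way to wire the activation into a pipeline is sketched below. The autozyme.activate("scanpy") call is the one named above; the helper activate_if_available and its import guard are hypothetical additions so that the script still runs, unaccelerated, on machines where AutoZyme is not installed. The batch column name and the n_top_genes value in the trailing comment are placeholders.

```python
import importlib.util


def activate_if_available(target: str = "scanpy") -> bool:
    """Activate the AutoZyme patch when the package is installed.

    Returns True if the patch was activated, False if autozyme is absent.
    """
    if importlib.util.find_spec("autozyme") is None:
        return False
    import autozyme

    autozyme.activate(target)
    return True


patched = activate_if_available()

# After activation, call Scanpy exactly as before, for example:
#   import scanpy as sc
#   sc.pp.highly_variable_genes(
#       adata, flavor="seurat_v3", batch_key="batch", n_top_genes=2000
#   )
```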