Speed up Scanpy scale

Scanpy scale is one of the slower steps in many single-cell genomics workflows. AutoZyme ships a verified, drop-in patch that is up to 9.50× faster, returning bit-for-bit identical results with no change to how you call it.

Best speedup: 9.50×
Median speedup: 6.66×
Output equivalence: bit-exact
Best runtime: 540 ms baseline, 57 ms optimized
Datasets: 6
Pass rate: 11/11

Benchmark charts

Speedup distribution: each dot is one finalized dataset/thread run on Windows.

Thread sweep: speedup across finalized thread counts on Windows.
Dataset Tier Threads Baseline Optimized Speedup Memory
pbmc68k small 1 568 ms 61 ms 9.00× 1.9 → 1.3 GB
pbmc68k small 4 540 ms 57 ms 9.50× 1.9 → 1.4 GB
pbmc68k small 32 545 ms 71 ms 7.72× 1.9 → 1.3 GB
tms_ss2 ood_large2 1 1.03 s 148 ms 7.35× 8.6 → 8.9 GB
tms_ss2 ood_large2 4 1.35 s 145 ms 7.48× 8.6 → 8.9 GB
tms_ss2 ood_large2 32 1.09 s 154 ms 7.04× 8.6 → 8.9 GB
splitseq_rosenberg ood_large1 1 1.05 s 158 ms 6.65× 5.7 → 4.6 GB
splitseq_rosenberg ood_large1 4 1.04 s 158 ms 6.66× 5.7 → 4.6 GB
splitseq_rosenberg ood_large1 32 1.07 s 158 ms 6.66× 5.7 → 4.6 GB
heart_adult large 1 3.22 s 615 ms 5.25× 21 → 19 GB
heart_adult large 4 3.18 s 577 ms 5.59× 21 → 19 GB
heart_adult large 32 3.56 s 582 ms 5.54× 21 → 19 GB
pbmc200k_glaucoma medium 1 1.51 s 307 ms 4.92× 8.6 → 7.6 GB
pbmc200k_glaucoma medium 4 1.49 s 299 ms 5.06× 8.6 → 7.6 GB
pbmc200k_glaucoma medium 32 1.84 s 294 ms 5.15× 8.6 → 7.6 GB
gastrulation_pijuansala ood_large3 1 1.20 s 312 ms 4.56× 15 → 15 GB
gastrulation_pijuansala ood_large3 4 1.47 s 280 ms 5.08× 15 → 15 GB
gastrulation_pijuansala ood_large3 32 1.42 s 295 ms 4.82× 15 → 15 GB

Memory: baseline vs optimized peak memory on Windows (32 threads).
Dataset Tier Baseline Optimized Optimized / baseline
heart_adult large 21 GB 19 GB 0.92×
gastrulation_pijuansala ood_large3 15 GB 15 GB 1.02×
tms_ss2 ood_large2 8.6 GB 8.9 GB 1.03×
pbmc200k_glaucoma medium 8.6 GB 7.6 GB 0.88×
splitseq_rosenberg ood_large1 5.7 GB 4.6 GB 0.81×
pbmc68k small 1.9 GB 1.3 GB 0.69×

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This patch targets the scale function in Scanpy (pp.scale). On the listed datasets, the optimized path passes the declared output-equivalence gate while reducing CPU runtime.

Also searched as: scaling, z-score, standardize, ScaleData, pp.scale.

Supported scope

The fast numba path activates only when all of the following hold: numba is importable; data is an anndata.AnnData; zero_center=True; layer is None; obsm is None; mask_obs is None; and adata.X is a SciPy CSR sparse matrix (sparse.isspmatrix_csr).

On this path, AutoZyme computes the per-gene mean and unbiased (ddof=1) variance directly from the CSR data/indices arrays with a numba accumulation kernel, densifies X once to float32, and applies a fused numba kernel that subtracts the mean, divides by the standard deviation, and clips symmetrically.

max_value is fully supported: None means +inf (no clipping), and a finite value clips symmetrically to [-max_value, +max_value]. The 2026-05-21 fix restored two-sided clipping to match upstream; the benchmark's old run.py used an upper-only clip, but the shipped kernel is symmetric. Both copy=True (returns a scaled copy) and copy=False (scales in place and returns None) are handled. For columns with std == 0, the std is set to 1.0, matching upstream handling of constant genes. Like upstream, the path also writes adata.var['mean'], adata.var['var'], and adata.var['std'].
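The steps described above can be sketched in plain Python to make the arithmetic concrete. This is only an illustration under stated assumptions, not AutoZyme's kernel: the function names (csr_column_mean_var, csr_to_dense, scale_dense_rows) are hypothetical, the loops are the kind numba would JIT-compile but are left uncompiled here, and values are Python floats (float64) rather than float32.

```python
import math

def csr_column_mean_var(data, indices, n_rows, n_cols):
    """Per-column mean and unbiased (ddof=1) variance from CSR arrays.

    Only stored (non-zero) entries are visited; implicit zeros add nothing
    to either running sum. Requires n_rows >= 2.
    """
    sums = [0.0] * n_cols
    sq_sums = [0.0] * n_cols
    for value, col in zip(data, indices):
        sums[col] += value
        sq_sums[col] += value * value
    means = [s / n_rows for s in sums]
    # sum((x - m)^2) == sum(x^2) - n * m^2; clamp tiny negatives from rounding.
    variances = [
        max(sq - n_rows * m * m, 0.0) / (n_rows - 1)
        for sq, m in zip(sq_sums, means)
    ]
    return means, variances

def csr_to_dense(data, indices, indptr, n_cols):
    """Densify a CSR matrix once into a list of row lists."""
    rows = []
    for i in range(len(indptr) - 1):
        row = [0.0] * n_cols
        for k in range(indptr[i], indptr[i + 1]):
            row[indices[k]] = data[k]
        rows.append(row)
    return rows

def scale_dense_rows(rows, means, variances, max_value=None):
    """Fused pass: subtract mean, divide by std, clip to [-max_value, +max_value]."""
    limit = math.inf if max_value is None else max_value
    # Constant columns (std == 0) divide by 1.0 instead of 0.
    stds = [math.sqrt(v) or 1.0 for v in variances]
    for row in rows:
        for j, x in enumerate(row):
            z = (x - means[j]) / stds[j]
            row[j] = min(max(z, -limit), limit)
    return rows

# Toy 3 x 2 matrix [[1, 0], [0, 0], [2, 0]] in CSR form (assumed example data).
data, indices, indptr = [1.0, 2.0], [0, 0], [0, 1, 1, 2]
means, variances = csr_column_mean_var(data, indices, n_rows=3, n_cols=2)
scaled = scale_dense_rows(csr_to_dense(data, indices, indptr, 2), means, variances)
clipped = scale_dense_rows(
    csr_to_dense(data, indices, indptr, 2), means, variances, max_value=0.5
)
```

The second column is constant, so it exercises the std == 0 case, and the max_value=0.5 call shows the two-sided clip acting on both the negative and the positive z-score.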

Out-of-scope behavior

Any call outside the supported scope falls back silently to the upstream Scanpy implementation.
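A fallback of this kind is commonly written as a thin dispatcher. The sketch below is one possible shape, not AutoZyme's code: with_silent_fallback, keywords_in_scope, and the data_in_scope predicate are hypothetical names, and the data checks (AnnData type, CSR X, numba availability) are left to the injected predicate so the example needs no third-party imports.

```python
from typing import Any, Callable

def keywords_in_scope(zero_center=True, layer=None, obsm=None, mask_obs=None,
                      **_other) -> bool:
    """Keyword conditions from the supported scope; max_value and copy pass through."""
    return zero_center is True and layer is None and obsm is None and mask_obs is None

def with_silent_fallback(
    upstream: Callable[..., Any],
    fast_path: Callable[..., Any],
    data_in_scope: Callable[[Any], bool],
) -> Callable[..., Any]:
    """Wrap upstream so in-scope calls use fast_path and everything else is unchanged."""
    def scale(data: Any, **kwargs: Any) -> Any:
        if data_in_scope(data) and keywords_in_scope(**kwargs):
            return fast_path(data, **kwargs)
        # Out of scope: no warning and no error, just the original behavior.
        return upstream(data, **kwargs)
    return scale

# Stand-in callables that record which path ran (illustration only).
calls = []
patched = with_silent_fallback(
    upstream=lambda d, **k: calls.append("upstream") or "upstream-result",
    fast_path=lambda d, **k: calls.append("fast") or "fast-result",
    data_in_scope=lambda d: d == "csr-anndata",
)
in_scope = patched("csr-anndata", max_value=10)
no_centering = patched("csr-anndata", zero_center=False)
with_layer = patched("csr-anndata", layer="counts")
dense_input = patched("dense-anndata")
```

Because the caller sees the same signature and return value either way, code that leaves the supported scope keeps working but runs at upstream speed.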

Detailed speedup table (11 runs)
Dataset Tier Platform Threads Baseline Optimized Speedup Memory Concordance Pass
gastrulation_pijuansala ood_large3 Windows 4 1.47 s 280 ms 5.08× 14.8 → 15.0 GB pass
heart_adult large Windows 4 3.18 s 577 ms 5.59× 20.9 → 19.2 GB pass
pbmc200k_glaucoma medium Windows 32 1.84 s 294 ms 5.15× 8.6 → 7.6 GB pass
pbmc68k small Windows 4 540 ms 57 ms 9.50× 1.9 → 1.4 GB pass
splitseq_rosenberg ood_large1 Windows 32 1.07 s 158 ms 6.66× 5.7 → 4.6 GB pass
tms_ss2 ood_large2 Windows 4 1.35 s 145 ms 7.48× 8.6 → 8.9 GB pass
gastrulation_pijuansala ood_large3 macOS 8 576 ms 142 ms 5.12× 14.6 → 14.6 GB pass
pbmc200k_glaucoma medium macOS 14 966 ms 149 ms 7.42× 10.2 → 10.3 GB pass
pbmc68k small macOS 14 286 ms 38 ms 8.68× 2.4 → 1.5 GB pass
splitseq_rosenberg ood_large1 macOS 14 729 ms 106 ms 6.88× 8.1 → 5.8 GB pass
tms_ss2 ood_large2 macOS 8 544 ms 90 ms 6.23× 8.9 → 8.1 GB pass
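The headline figures can be recomputed from the Speedup column of this table. The short script below copies the 11 rows by hand (dataset, platform, speedup) and derives the best speedup, the median speedup, and the number of distinct datasets; the variable names are illustrative.

```python
from statistics import median

# (dataset, platform, speedup) transcribed from the detailed speedup table.
runs = [
    ("gastrulation_pijuansala", "Windows", 5.08),
    ("heart_adult", "Windows", 5.59),
    ("pbmc200k_glaucoma", "Windows", 5.15),
    ("pbmc68k", "Windows", 9.50),
    ("splitseq_rosenberg", "Windows", 6.66),
    ("tms_ss2", "Windows", 7.48),
    ("gastrulation_pijuansala", "macOS", 5.12),
    ("pbmc200k_glaucoma", "macOS", 7.42),
    ("pbmc68k", "macOS", 8.68),
    ("splitseq_rosenberg", "macOS", 6.88),
    ("tms_ss2", "macOS", 6.23),
]

speedups = [speedup for _, _, speedup in runs]
best_speedup = max(speedups)          # best single run
median_speedup = median(speedups)     # middle of the 11 sorted values
n_datasets = len({dataset for dataset, _, _ in runs})

print(f"runs={len(runs)} datasets={n_datasets} "
      f"best={best_speedup:.2f}x median={median_speedup:.2f}x")
```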

Frequently asked questions

Speeding up Scanpy scale
Why is Scanpy scale slow?

Scanpy scale is CPU-bound, and the stock implementation in Scanpy leaves performance on the table in its core numerical work. In the best benchmark run (pbmc68k, 4 threads, Windows), the original takes 540 ms where the AutoZyme path takes 57 ms (9.50× faster).

How do I make Scanpy scale faster?

Install AutoZyme and activate the Scanpy patch, then keep using Scanpy scale exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 9.50× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the Scanpy scale output?

No. The accelerated path returns bit-for-bit identical results to the original Scanpy implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.
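To run a similar check on your own data, the two criteria mentioned above can be written as small helpers. This is a minimal sketch, not AutoZyme's concordance gate: max_abs_diff and bit_identical_float32 are hypothetical names, and the inputs are assumed to be flat sequences of values representable as float32.

```python
import struct

def max_abs_diff(a, b):
    """Largest elementwise |a - b|; 0.0 for empty inputs."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)

def bit_identical_float32(a, b):
    """True when both sequences encode to the same float32 bytes."""
    if len(a) != len(b):
        return False
    return struct.pack(f"<{len(a)}f", *a) == struct.pack(f"<{len(b)}f", *b)

# Flattened outputs from two hypothetical runs of scale.
baseline = [0.0, -1.0, 1.0, 0.5]
optimized = [0.0, -1.0, 1.0, 0.5]
same_values = max_abs_diff(baseline, optimized) == 0.0
same_bits = bit_identical_float32(baseline, optimized)

# Signed zeros compare equal numerically but differ in their bit pattern.
zero_diff = max_abs_diff([0.0], [-0.0])
zero_bits_match = bit_identical_float32([0.0], [-0.0])
```

The byte comparison is the stricter of the two: it also distinguishes 0.0 from -0.0, which a maximum absolute difference of 0 does not.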

How do I install the Scanpy speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scanpy"). The patch applies automatically the next time you call scale.