Speed up Scanpy scale: up to 9.50× faster, identical output

Benchmark charts

Speedup distribution

Each dot is one finalized dataset/thread run on Windows: pbmc68k 9.50×; tms_ss2 7.48×; splitseq_rosenberg 6.66×; heart_adult 5.59×; pbmc200k_glaucoma 5.15×; gastrulation_pijuansa… 5.08×.

Thread sweep

Speedup across finalized thread counts on Windows, one series per dataset.

Memory

Baseline vs optimized peak memory on Windows.

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets scale (sc.pp.scale) in Scanpy. The benchmarked result passes the declared scientific output gate while reducing CPU runtime on the listed datasets.

Also searched as: scaling, z-score, standardize, ScaleData, pp.scale.

Supported scope

The fast numba path activates only when all of the following hold: numba is importable; data is an anndata.AnnData; zero_center=True; layer is None; obsm is None; mask_obs is None; and adata.X is a scipy CSR sparse matrix (sparse.isspmatrix_csr).

On this path it computes the per-gene mean and unbiased (ddof=1) variance directly from the CSR data/indices arrays via a numba accumulation kernel, densifies X once to float32, and applies a fused numba kernel (mean-subtract, divide-by-std, symmetric clip).

max_value is fully supported: None means +inf (no clip), and a finite value clips symmetrically to [-max_value, +max_value]. The 2026-05-21 fix restored two-sided clipping to match upstream; the benchmark/old run.py used an upper-only clip, but the shipped kernel is symmetric.

copy=True (returns a scaled copy) and copy=False (in place, returns None) are both handled. For columns with std == 0, the std is set to 1.0, matching upstream constant-gene handling. The path also writes adata.var['mean'], adata.var['var'], and adata.var['std'], as upstream does.
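The numerical steps described above can be sketched in plain Python. This is an illustrative stand-in, not the shipped kernel: the function names `csr_column_stats` and `scale_dense` are hypothetical, the loops run in float64 rather than float32, and treating any non-positive variance as std = 1.0 is an assumption made here to guard the square root. Loops of this shape are the kind of code a numba kernel would compile.

```python
import math


def csr_column_stats(data, indices, indptr, n_cols):
    """Per-column mean and unbiased (ddof=1) variance from CSR arrays.

    Hypothetical helper; requires at least two rows for ddof=1.
    """
    n_rows = len(indptr) - 1
    sums = [0.0] * n_cols
    sq_sums = [0.0] * n_cols
    # Only stored entries are visited; implicit zeros contribute nothing
    # to either accumulator.
    for row in range(n_rows):
        for k in range(indptr[row], indptr[row + 1]):
            col = indices[k]
            value = data[k]
            sums[col] += value
            sq_sums[col] += value * value
    means = [s / n_rows for s in sums]
    # var = (sum(x^2) - n * mean^2) / (n - 1)
    variances = [
        (sq - n_rows * m * m) / (n_rows - 1)
        for sq, m in zip(sq_sums, means)
    ]
    return means, variances


def scale_dense(dense, means, variances, max_value=None):
    """Fused mean-subtract, divide-by-std, symmetric clip, done in place."""
    limit = math.inf if max_value is None else max_value
    # Assumption: non-positive variance (constant column) gets std = 1.0.
    stds = [math.sqrt(v) if v > 0 else 1.0 for v in variances]
    for row in dense:
        for j in range(len(row)):
            z = (row[j] - means[j]) / stds[j]
            row[j] = min(max(z, -limit), limit)
    return dense


# Toy 3x2 matrix [[1, 0], [0, 0], [2, 0]] in CSR form; column 1 is constant.
means, variances = csr_column_stats([1.0, 2.0], [0, 0], [0, 1, 1, 2], n_cols=2)
scaled = scale_dense(
    [[1.0, 0.0], [0.0, 0.0], [2.0, 0.0]], means, variances, max_value=0.5
)
```

The statistics pass touches only stored non-zeros, so its cost follows the number of stored entries; the dense matrix is visited once, in the fused scale-and-clip pass.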

Out-of-scope behavior

Calls outside the supported scope silently fall back to the upstream Scanpy implementation.
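The scope conditions and the fallback can be pictured as a single predicate plus a dispatcher. The sketch below is a hypothetical illustration, not AutoZyme's code: `fast_path_applies` and `choose_scale_impl` are invented names, and the two boolean arguments stand in for checks such as `isinstance(data, anndata.AnnData)` and `scipy.sparse.isspmatrix_csr(adata.X)`.

```python
import importlib.util


def fast_path_applies(*, is_anndata, x_is_csr, zero_center=True,
                      layer=None, obsm=None, mask_obs=None,
                      numba_available=None):
    """Return True only when every listed scope condition holds."""
    if numba_available is None:
        # Probe for numba without importing it.
        numba_available = importlib.util.find_spec("numba") is not None
    return bool(
        numba_available
        and is_anndata
        and x_is_csr
        and zero_center is True
        and layer is None
        and obsm is None
        and mask_obs is None
    )


def choose_scale_impl(fast_impl, upstream_impl, **conditions):
    """Pick the fast kernel when in scope; otherwise fall back silently."""
    return fast_impl if fast_path_applies(**conditions) else upstream_impl
```

Because the fallback emits no warning, a call that passes, say, a layer name or a dense X would run at upstream speed without any visible signal.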

Detailed speedup table (11 runs)

Dataset	Tier	Platform	Threads	Baseline	Optimized	Speedup	Peak memory (baseline → optimized)	Concordance	Pass
`gastrulation_pijuansala`	ood_large3	Windows	4	1.47 s	280 ms	5.08×	14.8 → 15.0 GB	—	pass
`heart_adult`	large	Windows	4	3.18 s	577 ms	5.59×	20.9 → 19.2 GB	—	pass
`pbmc200k_glaucoma`	medium	Windows	32	1.84 s	294 ms	5.15×	8.6 → 7.6 GB	—	pass
`pbmc68k`	small	Windows	4	540 ms	57 ms	9.50×	1.9 → 1.4 GB	—	pass
`splitseq_rosenberg`	ood_large1	Windows	32	1.07 s	158 ms	6.66×	5.7 → 4.6 GB	—	pass
`tms_ss2`	ood_large2	Windows	4	1.35 s	145 ms	7.48×	8.6 → 8.9 GB	—	pass
`gastrulation_pijuansala`	ood_large3	macOS	8	576 ms	142 ms	5.12×	14.6 → 14.6 GB	—	pass
`pbmc200k_glaucoma`	medium	macOS	14	966 ms	149 ms	7.42×	10.2 → 10.3 GB	—	pass
`pbmc68k`	small	macOS	14	286 ms	38 ms	8.68×	2.4 → 1.5 GB	—	pass
`splitseq_rosenberg`	ood_large1	macOS	14	729 ms	106 ms	6.88×	8.1 → 5.8 GB	—	pass
`tms_ss2`	ood_large2	macOS	8	544 ms	90 ms	6.23×	8.9 → 8.1 GB	—	pass

Frequently asked questions

Speeding up Scanpy scale

Why is Scanpy scale slow?

Scanpy scale is CPU-bound, and the stock Scanpy implementation leaves performance on the table in its core numerical work. On the pbmc68k benchmark (Windows, 4 threads), the original takes 540 ms where the AutoZyme path takes 57 ms (9.50× faster).

How do I make Scanpy scale faster?

Install AutoZyme and activate the Scanpy patch, then keep using Scanpy scale exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 9.50× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the Scanpy scale output?

No. The accelerated path returns bit-for-bit identical results to the original Scanpy implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.
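A maximum-absolute-difference check of the kind mentioned above could look like the following. This is a minimal sketch under stated assumptions, not the actual concordance gate: `max_abs_diff` and `passes_exact_gate` are hypothetical names, inputs are plain lists of rows, and NaN values are assumed absent.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two 2-D row lists."""
    if len(reference) != len(candidate):
        raise ValueError("row counts differ")
    worst = 0.0
    for ref_row, cand_row in zip(reference, candidate):
        if len(ref_row) != len(cand_row):
            raise ValueError("column counts differ")
        for r, c in zip(ref_row, cand_row):
            worst = max(worst, abs(r - c))
    return worst


def passes_exact_gate(reference, candidate):
    """Exact gate: pass only when no element differs at all."""
    return max_abs_diff(reference, candidate) == 0.0
```

A threshold of exactly 0 is stricter than the tolerance-based comparisons common in numerical testing; any reordering of floating-point operations that changes a single bit would fail it.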

How do I install the Scanpy speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scanpy"). The patch applies automatically the next time you call scale.
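For pipelines that must also run where the package is not installed, activation can be guarded. The sketch below uses only the `autozyme.activate("scanpy")` call named above; the wrapper `activate_autozyme_for_scanpy` is a hypothetical helper, and the choice to fall through quietly when the package is missing is an assumption of this example.

```python
import importlib.util


def activate_autozyme_for_scanpy():
    """Activate the Scanpy patch if autozyme is installed.

    Returns True when the patch was activated, False when the package
    is absent and Scanpy keeps its stock scale.
    """
    if importlib.util.find_spec("autozyme") is None:
        return False
    import autozyme
    autozyme.activate("scanpy")  # call named in the text above
    return True


patched = activate_autozyme_for_scanpy()
```

After activation, existing calls such as sc.pp.scale(adata, max_value=10) stay exactly as written.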