Speed up Scanpy normalize_total: up to 62.9× faster, near-identical output

Q: Why is Scanpy normalize_total slow?

Scanpy normalize_total is CPU-bound, and the stock implementation leaves performance on the table in its core numerical work. On the pbmc68k benchmark (Windows, 32 threads), the original takes 322 ms where the AutoZyme path takes 4 ms, a reported 62.9× speedup.

Q: How do I make Scanpy normalize_total faster?

Install AutoZyme and activate the Scanpy patch, then keep using Scanpy normalize_total exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 62.9× faster on the benchmark datasets, with no pipeline or API changes.

Q: Does the AutoZyme speedup change the Scanpy normalize_total output?

Effectively no. The output is tolerance-equivalent: held within a frozen concordance gate (up to about 0.6% drift from the original Scanpy result) on every benchmark dataset.
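One way to check this kind of tolerance on your own data is to compare the two results element by element. The sketch below is an illustration only. The page does not define the concordance metric, so the maximum relative difference used here, the 0.006 threshold (read off the "about 0.6%" figure), and the helper name max_relative_drift are all assumptions.

```python
import numpy as np


def max_relative_drift(reference, candidate, eps=1e-12):
    """Largest element-wise relative difference between two dense arrays.

    Hypothetical helper: the actual AutoZyme concordance metric is not
    specified on this page and may be defined differently.
    """
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError("reference and candidate must have the same shape")
    # Guard the denominator so exact zeros do not cause a division by zero.
    denom = np.maximum(np.abs(reference), eps)
    return float(np.max(np.abs(candidate - reference) / denom))


# Toy matrices standing in for the upstream and accelerated results.
reference = np.array([[0.0, 1.0], [2.0, 4.0]])
candidate = np.array([[0.0, 1.001], [2.0, 3.99]])

drift = max_relative_drift(reference, candidate)
within_gate = drift <= 0.006  # assumed threshold, from "about 0.6%"
```

For sparse results, the same comparison could be run on the .data arrays of two CSR matrices that share a sparsity pattern.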

Q: How do I install the Scanpy speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scanpy"). The patch applies automatically the next time you call normalize_total.
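Put together, the steps might look like the sketch below. The autozyme.activate("scanpy") call is taken from this page; the wrapper function normalize_with_autozyme and its target_sum default are hypothetical, and the imports are placed inside the function only so the snippet can be loaded on a machine where autozyme or scanpy is not installed.

```python
def normalize_with_autozyme(adata, target_sum=1e4):
    """Activate the AutoZyme patch, then run the usual Scanpy calls.

    Hypothetical wrapper for illustration; in a real script the imports
    and the activate() call would normally sit at the top of the file.
    """
    import autozyme  # assumes `pip install autozyme` has been run
    import scanpy as sc

    autozyme.activate("scanpy")  # activation call as given on this page

    # Unchanged Scanpy API: both calls modify adata in place by default.
    sc.pp.normalize_total(adata, target_sum=target_sum)
    sc.pp.log1p(adata)
    return adata
```

Nothing in the Scanpy calls themselves changes; only the activation line is new.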

Benchmark charts

Speedup distribution (Windows, log scale; each dot is one finalized dataset/thread run): pbmc68k 62.9×, gastrulation_pijuansa… 25.4×, splitseq_rosenberg 20.2×, tms_ss2 19.0×, pbmc200k_glaucoma 15.5×, heart_adult 14.8×.

Thread sweep: speedup across finalized thread counts on Windows, per dataset.

Memory: baseline vs optimized peak memory on Windows, per dataset.

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets normalize_total in Scanpy. The benchmarked result stays within the declared scientific output gate while reducing CPU runtime on the listed datasets.

Also searched as: normalization, normalize, library size normalization, NormalizeData, pp.normalize_total.

Supported scope


The fast path handles the standard single-cell tutorial pair (normalize_total followed by log1p) on sparse data. It applies when X is sparse-convertible (converted to CSR float32 with int32 indices when nnz <= 2^31-1), inplace=True, copy is True or False, and no extra keyword arguments are passed.

Both target_sum modes are implemented. An explicit target_sum (e.g. 1e4) routes to the fused per-row sum+scale kernel _fused_normalize_only. target_sum=None routes to _row_sums, then a median-based target, then _scale_only.

log1p applies a parallel numba np.log1p over the CSR .data array (natural log, base=None) and falls back to np.log1p(out) for dense X.

The numeric output of the .X matrix matches upstream on the benchmarked default-dtype CSR counts path; this is what the concordance metric checks.
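To make the two target_sum modes concrete, here is a minimal NumPy sketch of the arithmetic on raw CSR arrays. It is not the AutoZyme kernel code: the function names normalize_total_csr and log1p_csr are hypothetical, nothing here is fused or parallel, and taking the median over rows with nonzero totals is an assumption based on how upstream Scanpy is understood to pick its default target.

```python
import numpy as np


def normalize_total_csr(data, indptr, target_sum=None):
    """Scale each CSR row so its stored values sum to target_sum.

    Hypothetical reference helper. Assumes at least one row has a
    nonzero total when target_sum is None.
    """
    data = np.asarray(data, dtype=np.float32)
    indptr = np.asarray(indptr, dtype=np.int64)
    n_rows = indptr.size - 1

    # Row index of every stored value, derived from the CSR row pointer.
    row_ids = np.repeat(np.arange(n_rows), np.diff(indptr))
    row_sums = np.bincount(row_ids, weights=data, minlength=n_rows)

    if target_sum is None:
        # Median-based target over rows that have any counts (assumption).
        target_sum = float(np.median(row_sums[row_sums > 0]))

    # Empty rows keep a scale of target_sum / 1; they store no values anyway.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    scale = (target_sum / safe_sums).astype(np.float32)
    return data * scale[row_ids]


def log1p_csr(data):
    """Natural-log log1p over the stored values only (zeros stay zero)."""
    return np.log1p(np.asarray(data, dtype=np.float32))


# Three rows: [1, 3], an empty row, and [2, 2, 4].
data = [1, 3, 2, 2, 4]
indptr = [0, 2, 2, 5]

explicit = normalize_total_csr(data, indptr, target_sum=100.0)
by_median = normalize_total_csr(data, indptr)  # median of row sums 4 and 8 is 6
logged = log1p_csr(explicit)
```

Applying log1p to the stored values alone is enough on sparse input because log1p(0) = 0, so implicit zeros are unchanged.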

Out-of-scope behavior

Calls outside the supported scope fall back silently to the upstream Scanpy implementation.
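A silent fallback of this kind is commonly built as a thin dispatcher around the original function. The sketch below shows the general pattern only; with_silent_fallback, the toy functions, and the scope test are hypothetical and do not reflect AutoZyme's actual internals.

```python
import functools


def with_silent_fallback(upstream, fast_path, in_scope):
    """Return a drop-in replacement that uses fast_path only when in scope."""

    @functools.wraps(upstream)
    def patched(*args, **kwargs):
        if in_scope(*args, **kwargs):
            return fast_path(*args, **kwargs)
        # Out of scope: defer to the original function without warning.
        return upstream(*args, **kwargs)

    return patched


# Toy stand-ins for an upstream function and its accelerated variant.
def upstream_total(values, **kwargs):
    return ("upstream", sum(values))


def fast_total(values):
    return ("fast", sum(values))


def in_scope(values, **kwargs):
    # Mirrors the "no extra keyword args" condition from the scope above.
    return not kwargs


total = with_silent_fallback(upstream_total, fast_total, in_scope)

fast_result = total([1, 2, 3])
fallback_result = total([1, 2, 3], exclude_highly_expressed=True)
```

Because the fallback is silent, a call with unsupported arguments still returns the upstream result but sees no speedup.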

Detailed speedup table (12 runs)

Dataset	Tier	Platform	Threads	Baseline	Optimized	Speedup	Peak memory (baseline → optimized)	Concordance	Pass
`gastrulation_pijuansala`	ood_large3	Windows	32	5.33 s	205 ms	25.4×	18.4 → 18.6 GB	—	pass
`heart_adult`	large	Windows	32	6.78 s	344 ms	14.8×	23.5 → 24.0 GB	—	pass
`pbmc200k_glaucoma`	medium	Windows	32	2.27 s	129 ms	15.5×	9.0 → 9.4 GB	—	pass
`pbmc68k`	small	Windows	32	322 ms	4 ms	62.9×	0.6 → 1.0 GB	—	pass
`splitseq_rosenberg`	ood_large1	Windows	32	1.07 s	54 ms	20.2×	5.0 → 5.2 GB	—	pass
`tms_ss2`	ood_large2	Windows	32	3.13 s	165 ms	19.0×	10.7 → 11.1 GB	—	pass
`gastrulation_pijuansala`	ood_large3	macOS	8	1.58 s	224 ms	7.10×	11.3 → 11.5 GB	—	pass
`heart_adult`	large	macOS	14	2.09 s	319 ms	6.54×	16.1 → 18.2 GB	—	pass
`pbmc200k_glaucoma`	medium	macOS	14	831 ms	73 ms	11.3×	6.6 → 6.8 GB	—	pass
`pbmc68k`	small	macOS	14	77 ms	3 ms	25.7×	0.8 → 0.8 GB	—	pass
`splitseq_rosenberg`	ood_large1	macOS	14	465 ms	43 ms	10.2×	3.4 → 3.4 GB	—	pass
`tms_ss2`	ood_large2	macOS	14	957 ms	91 ms	10.6×	6.8 → 6.9 GB	—	pass
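To take comparable measurements on your own data, a small timing harness is enough. This is a sketch under assumptions: the page does not describe its benchmark procedure or how the Speedup column is aggregated, so the median-of-repeats approach and the helper names median_runtime and speedup are illustrative choices.

```python
import statistics
import time


def median_runtime(fn, repeats=5):
    """Median wall-clock time in seconds of calling fn() `repeats` times."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)


def speedup(baseline_seconds, optimized_seconds):
    """Ratio of baseline to optimized runtime; larger means faster."""
    return baseline_seconds / optimized_seconds


# Toy workload standing in for a normalize_total call.
elapsed = median_runtime(lambda: sum(range(10_000)), repeats=3)
example_ratio = speedup(2.0, 0.5)  # hypothetical timings, not from the table
```

Since normalize_total modifies its input in place by default, each timed call should run on a fresh copy of the data prepared before the timer starts.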
