Speed up Scanpy normalize_total

Scanpy normalize_total is one of the slower steps in many single-cell genomics workflows. AutoZyme ships a verified, drop-in patch that is up to 62.9× faster. Its output stays within a verified tolerance of the original, and you call the function exactly as before.

Best speedup: 62.9×
Median speedup: 15.2×
Output equivalence: within tolerance
Best runtime: 322 ms baseline → 4 ms optimized
Datasets: 6
Pass rate: 12/12

Benchmark charts

Speedup distribution
Each dot is one finalized dataset/thread run on Windows (log scale). Datasets: pbmc68k, gastrulation_pijuansala, splitseq_rosenberg, tms_ss2, pbmc200k_glaucoma, heart_adult.

Thread sweep
Speedup across finalized thread counts on Windows (1, 4, and the full 32 threads; log scale).
Dataset Tier Threads Baseline Optimized Speedup Memory
pbmc68k small 1 200 ms 45 ms 5.47× 0.8 → 0.9 GB
pbmc68k small 4 245 ms 8 ms 29.2× 0.6 → 1.0 GB
pbmc68k small 32 322 ms 4 ms 62.9× 0.6 → 1.0 GB
gastrulation_pijuansala ood_large3 1 4.70 s 1.95 s 2.67× 18 → 19 GB
gastrulation_pijuansala ood_large3 4 5.25 s 818 ms 6.36× 18 → 19 GB
gastrulation_pijuansala ood_large3 32 5.33 s 205 ms 25.4× 18 → 19 GB
splitseq_rosenberg ood_large1 1 1.12 s 466 ms 2.33× 5.0 → 5.2 GB
splitseq_rosenberg ood_large1 4 1.07 s 121 ms 8.98× 5.0 → 5.2 GB
splitseq_rosenberg ood_large1 32 1.07 s 54 ms 20.2× 5.0 → 5.2 GB
tms_ss2 ood_large2 1 2.76 s 1.99 s 1.57× 11 → 11 GB
tms_ss2 ood_large2 4 4.10 s 659 ms 4.75× 11 → 11 GB
tms_ss2 ood_large2 32 3.13 s 165 ms 19.0× 11 → 11 GB
pbmc200k_glaucoma medium 1 2.01 s 1.41 s 1.41× 9.0 → 9.4 GB
pbmc200k_glaucoma medium 4 1.95 s 479 ms 4.15× 9.0 → 9.4 GB
pbmc200k_glaucoma medium 32 2.27 s 129 ms 15.5× 9.0 → 9.4 GB
heart_adult large 1 5.01 s 6.42 s 0.79× 24 → 24 GB
heart_adult large 4 4.91 s 967 ms 5.28× 24 → 24 GB
heart_adult large 32 6.78 s 344 ms 14.8× 24 → 24 GB

Memory
Baseline vs optimized peak memory on Windows (axis from 0 to 50 GB).
Dataset Tier Memory Optimized/baseline Run shown
heart_adult large 24 → 24 GB 1.02× 32 threads, 14.8× speedup
gastrulation_pijuansala ood_large3 18 → 19 GB 1.01× 32 threads, 25.4× speedup
tms_ss2 ood_large2 11 → 11 GB 1.04× 32 threads, 19.0× speedup
pbmc200k_glaucoma medium 9.0 → 9.4 GB 1.05× 32 threads, 15.5× speedup
splitseq_rosenberg ood_large1 5.0 → 5.2 GB 1.06× 32 threads, 20.2× speedup
pbmc68k small 0.8 → 0.9 GB 1.17× 1 thread, 5.47× speedup

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets normalize_total in Scanpy. The benchmarked result passes the declared output-concordance gate while reducing CPU runtime on the listed datasets.

Also searched as: normalization, normalize, library size normalization, NormalizeData, pp.normalize_total.

Supported scope

The fast path handles the standard single-cell tutorial pair, normalize_total followed by log1p, on sparse data. It applies when X is sparse-convertible (converted to CSR float32 with int32 indices when nnz <= 2^31-1), inplace=True, copy is True or False, and no extra keyword arguments are passed.

Both target_sum modes are implemented. An explicit target_sum (e.g. 1e4) routes to the fused per-row sum+scale kernel _fused_normalize_only; target_sum=None routes to _row_sums, then a median-based target, then _scale_only.

log1p applies a parallel numba np.log1p over the CSR .data array (natural log, base=None) and falls back to np.log1p(out) for dense X.

The numeric output of the .X matrix matches upstream on the benchmarked default-dtype CSR counts path; this is what the concordance metric checks.
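For orientation, the arithmetic described above can be sketched in plain NumPy/SciPy. This is not AutoZyme's kernel code, and the fused and parallel numba kernels named above are not reproduced here. The function names normalize_total_reference and log1p_reference are hypothetical, and taking the median over cells with nonzero totals when target_sum is None is an assumption about upstream behavior.

```python
import numpy as np
import scipy.sparse as sp


def normalize_total_reference(X, target_sum=None):
    """Scale each row (cell) of a count matrix so it sums to target_sum."""
    # Work on a float32 CSR copy so the caller's matrix is left untouched.
    X = sp.csr_matrix(X, dtype=np.float32, copy=True)
    row_sums = np.asarray(X.sum(axis=1), dtype=np.float64).ravel()
    if target_sum is None:
        # Assumption: median of the per-cell totals, ignoring empty cells.
        target_sum = float(np.median(row_sums[row_sums > 0]))
    # Avoid dividing by zero; empty rows hold no stored values to scale.
    safe_sums = np.where(row_sums > 0, row_sums, 1.0)
    scale = target_sum / safe_sums
    # Repeat each row's factor once per stored value in that row.
    per_value = np.repeat(scale, np.diff(X.indptr))
    X.data *= per_value.astype(np.float32)
    return X


def log1p_reference(X):
    """Natural-log log1p over the stored values only (zeros stay zero)."""
    X = X.copy()
    np.log1p(X.data, out=X.data)
    return X


# Tiny toy matrix: 3 cells x 3 genes, with one empty cell.
counts = sp.csr_matrix(np.array([[1, 0, 3], [0, 0, 0], [2, 2, 0]]))
normalized = normalize_total_reference(counts, target_sum=8)
logged = log1p_reference(normalized)
```

Because log1p(0) is 0, applying it to the stored values alone preserves sparsity, which is why the sparse path can operate directly on .data.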

Out-of-scope behavior

Calls outside the supported scope fall back silently to the upstream Scanpy implementation.

Detailed speedup table (12 runs)
Dataset Tier Platform Threads Baseline Optimized Speedup Memory Concordance Pass
gastrulation_pijuansala ood_large3 Windows 32 5.33 s 205 ms 25.4× 18.4 → 18.6 GB pass
heart_adult large Windows 32 6.78 s 344 ms 14.8× 23.5 → 24.0 GB pass
pbmc200k_glaucoma medium Windows 32 2.27 s 129 ms 15.5× 9.0 → 9.4 GB pass
pbmc68k small Windows 32 322 ms 4 ms 62.9× 0.6 → 1.0 GB pass
splitseq_rosenberg ood_large1 Windows 32 1.07 s 54 ms 20.2× 5.0 → 5.2 GB pass
tms_ss2 ood_large2 Windows 32 3.13 s 165 ms 19.0× 10.7 → 11.1 GB pass
gastrulation_pijuansala ood_large3 macOS 8 1.58 s 224 ms 7.10× 11.3 → 11.5 GB pass
heart_adult large macOS 14 2.09 s 319 ms 6.54× 16.1 → 18.2 GB pass
pbmc200k_glaucoma medium macOS 14 831 ms 73 ms 11.3× 6.6 → 6.8 GB pass
pbmc68k small macOS 14 77 ms 3 ms 25.7× 0.8 → 0.8 GB pass
splitseq_rosenberg ood_large1 macOS 14 465 ms 43 ms 10.2× 3.4 → 3.4 GB pass
tms_ss2 ood_large2 macOS 14 957 ms 91 ms 10.6× 6.8 → 6.9 GB pass
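As a quick sanity check, the headline figures can be recomputed from the Speedup column of the table above. The short sketch below uses only the rounded values as printed; their median works out to 15.15, which is consistent with the reported 15.2× median.

```python
from statistics import median

# Speedup column of the 12-run table (6 Windows rows, then 6 macOS rows).
speedups = [25.4, 14.8, 15.5, 62.9, 20.2, 19.0,
            7.10, 6.54, 11.3, 25.7, 10.2, 10.6]

best = max(speedups)
# With an even number of runs, the median is the mean of the two middle values.
median_speedup = median(speedups)

print(f"best {best}x, median {median_speedup:.2f}x over {len(speedups)} runs")
```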

Frequently asked questions

Speeding up Scanpy normalize_total
Why is Scanpy normalize_total slow?

Scanpy normalize_total is CPU-bound, and the stock implementation in Scanpy leaves performance on the table in its core numerical work. In the best benchmarked case (pbmc68k on Windows with 32 threads), the original takes 322 ms where the AutoZyme path takes 4 ms (62.9× faster).

How do I make Scanpy normalize_total faster?

Install AutoZyme and activate the Scanpy patch, then keep using Scanpy normalize_total exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 62.9× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the Scanpy normalize_total output?

Effectively no. The output is tolerance-equivalent: it is held within a frozen concordance gate, which allows up to about 0.6% drift from the original Scanpy result, and it passed that gate on every benchmark run (12/12).
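If you want to check equivalence on your own data, a comparison along the following lines is one option. It is a sketch only: this page does not define the concordance metric, so the maximum-relative-difference measure, the helper name max_relative_drift, and the 0.006 threshold (taken from the "about 0.6%" figure above) are all assumptions.

```python
import numpy as np


def max_relative_drift(reference, candidate, eps=1e-12):
    """Largest element-wise relative difference between two dense arrays."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    if reference.shape != candidate.shape:
        raise ValueError("shape mismatch")
    # Floor the denominator so exact zeros do not cause division by zero.
    denom = np.maximum(np.abs(reference), eps)
    return float(np.max(np.abs(candidate - reference) / denom))


# Toy values standing in for the upstream output and the patched output.
upstream = np.array([[2.0, 0.0, 6.0], [4.0, 4.0, 0.0]])
patched = upstream * (1 + 1e-4)  # simulated 0.01% drift

drift = max_relative_drift(upstream, patched)
within_gate = drift <= 0.006  # assumed threshold, about 0.6%
```

For sparse results, convert both matrices with .toarray() first (or compare their .data arrays once you have confirmed the sparsity patterns match).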

How do I install the Scanpy speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scanpy"). The patch applies automatically the next time you call normalize_total.