Scanpy normalize_total is one of the slower steps in many single-cell genomics workflows. AutoZyme ships a
verified, drop-in patch that is up to 62.9× faster, returning output within a strict, verified tolerance with no change to how you call it.
Best speedup: 62.9×
Median speedup: 15.2×
Output equivalence: Tolerance
Best runtime: baseline 322 ms → optimized 4 ms
Datasets: 6
Pass rate: 12/12
Benchmark charts
Speedup distribution (Windows; one row per finalized dataset/thread run; the original chart used a log scale)

| Dataset | Tier | Threads | Baseline | Optimized | Speedup | Memory |
|---|---|---|---|---|---|---|
| pbmc68k | small | 1 | 200 ms | 45 ms | 5.47× | 0.8 GB → 0.9 GB |
| pbmc68k | small | 4 | 245 ms | 8 ms | 29.2× | 0.6 GB → 1.0 GB |
| pbmc68k | small | 32 | 322 ms | 4 ms | 62.9× | 0.6 GB → 1.0 GB |
| gastrulation_pijuansala | ood_large3 | 1 | 4.70 s | 1.95 s | 2.67× | 18 GB → 19 GB |
| gastrulation_pijuansala | ood_large3 | 4 | 5.25 s | 818 ms | 6.36× | 18 GB → 19 GB |
| gastrulation_pijuansala | ood_large3 | 32 | 5.33 s | 205 ms | 25.4× | 18 GB → 19 GB |
| splitseq_rosenberg | ood_large1 | 1 | 1.12 s | 466 ms | 2.33× | 5.0 GB → 5.2 GB |
| splitseq_rosenberg | ood_large1 | 4 | 1.07 s | 121 ms | 8.98× | 5.0 GB → 5.2 GB |
| splitseq_rosenberg | ood_large1 | 32 | 1.07 s | 54 ms | 20.2× | 5.0 GB → 5.2 GB |
| tms_ss2 | ood_large2 | 1 | 2.76 s | 1.99 s | 1.57× | 11 GB → 11 GB |
| tms_ss2 | ood_large2 | 4 | 4.10 s | 659 ms | 4.75× | 11 GB → 11 GB |
| tms_ss2 | ood_large2 | 32 | 3.13 s | 165 ms | 19.0× | 11 GB → 11 GB |
| pbmc200k_glaucoma | medium | 1 | 2.01 s | 1.41 s | 1.41× | 9.0 GB → 9.4 GB |
| pbmc200k_glaucoma | medium | 4 | 1.95 s | 479 ms | 4.15× | 9.0 GB → 9.4 GB |
| pbmc200k_glaucoma | medium | 32 | 2.27 s | 129 ms | 15.5× | 9.0 GB → 9.4 GB |
| heart_adult | large | 1 | 5.01 s | 6.42 s | 0.79× | 24 GB → 24 GB |
| heart_adult | large | 4 | 4.91 s | 967 ms | 5.28× | 24 GB → 24 GB |
| heart_adult | large | 32 | 6.78 s | 344 ms | 14.8× | 24 GB → 24 GB |
The public API stays the same; AutoZyme replaces only the supported fast path.
This task targets normalize_total in Scanpy. The benchmarked result stays within the declared scientific output gate and reduces CPU runtime in every listed run except one: heart_adult at 1 thread on Windows, which ran slower (0.79×).
Supported scope
Fast path handles the standard single-cell tutorial pair (normalize_total followed by log1p) on sparse data. It applies when X is sparse-convertible (converted to CSR float32 with int32 indices when nnz <= 2^31-1), inplace=True, copy is True or False, and no extra keyword args are passed.

Both target_sum modes are implemented. An explicit target_sum (e.g. 1e4) routes to the fused per-row sum+scale kernel _fused_normalize_only. target_sum=None routes to _row_sums, derives a median-based target, then calls _scale_only.

log1p applies a parallel numba np.log1p over CSR .data (natural log, base=None) and falls back to np.log1p(out) for dense X.

Numeric output of the .X matrix matches upstream on the benchmarked default-dtype CSR counts path; this is what the concordance metric checks.
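The two target_sum modes and the log1p step described above can be sketched in pure Python over raw CSR arrays. This is an illustration of the arithmetic only, not AutoZyme's numba kernels: the function names are hypothetical, and two details are assumptions rather than statements from this page (the median is taken over rows with a nonzero total, and all-zero rows are left untouched).

```python
import math
from statistics import median

INT32_MAX = 2**31 - 1  # nnz limit quoted in the supported scope


def row_sums(data, indptr):
    """Sum the stored values of each CSR row (column indices are not needed)."""
    return [sum(data[indptr[i]:indptr[i + 1]]) for i in range(len(indptr) - 1)]


def scale_rows(data, indptr, sums, target):
    """Rescale each row so its stored values add up to `target`."""
    out = [float(v) for v in data]
    for i, total in enumerate(sums):
        if total == 0:
            continue  # assumption: all-zero rows are left unchanged
        factor = target / total
        for k in range(indptr[i], indptr[i + 1]):
            out[k] = data[k] * factor
    return out


def normalize_total_sketch(data, indptr, target_sum=None):
    """Hypothetical stand-in for the normalize_total arithmetic on CSR input."""
    if len(data) > INT32_MAX:
        raise ValueError("nnz exceeds the int32 index range")
    sums = row_sums(data, indptr)
    if target_sum is None:
        # assumption: the median-based target ignores rows whose total is zero
        target_sum = median(s for s in sums if s > 0)
    return scale_rows(data, indptr, sums, target_sum)


def log1p_sketch(data):
    """Natural-log log1p over stored values only; log1p(0) == 0 keeps sparsity."""
    return [math.log1p(v) for v in data]


if __name__ == "__main__":
    # Three cells: totals 4, 0 (empty row) and 8.
    data = [1, 3, 2, 2, 4]
    indptr = [0, 2, 2, 5]
    print(normalize_total_sketch(data, indptr, target_sum=8))
    print(normalize_total_sketch(data, indptr))  # median-based target
    print(log1p_sketch(normalize_total_sketch(data, indptr)))
```

Because both steps only touch the stored values, neither needs the column indices, which is why a kernel can run over CSR .data row by row.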
Out-of-scope behavior
silent fallback to upstream
Detailed speedup table (12 runs)

| Dataset | Tier | Platform | Threads | Baseline | Optimized | Speedup | Memory | Concordance | Pass |
|---|---|---|---|---|---|---|---|---|---|
| gastrulation_pijuansala | ood_large3 | Windows | 32 | 5.33 s | 205 ms | 25.4× | 18.4 → 18.6 GB | — | pass |
| heart_adult | large | Windows | 32 | 6.78 s | 344 ms | 14.8× | 23.5 → 24.0 GB | — | pass |
| pbmc200k_glaucoma | medium | Windows | 32 | 2.27 s | 129 ms | 15.5× | 9.0 → 9.4 GB | — | pass |
| pbmc68k | small | Windows | 32 | 322 ms | 4 ms | 62.9× | 0.6 → 1.0 GB | — | pass |
| splitseq_rosenberg | ood_large1 | Windows | 32 | 1.07 s | 54 ms | 20.2× | 5.0 → 5.2 GB | — | pass |
| tms_ss2 | ood_large2 | Windows | 32 | 3.13 s | 165 ms | 19.0× | 10.7 → 11.1 GB | — | pass |
| gastrulation_pijuansala | ood_large3 | macOS | 8 | 1.58 s | 224 ms | 7.10× | 11.3 → 11.5 GB | — | pass |
| heart_adult | large | macOS | 14 | 2.09 s | 319 ms | 6.54× | 16.1 → 18.2 GB | — | pass |
| pbmc200k_glaucoma | medium | macOS | 14 | 831 ms | 73 ms | 11.3× | 6.6 → 6.8 GB | — | pass |
| pbmc68k | small | macOS | 14 | 77 ms | 3 ms | 25.7× | 0.8 → 0.8 GB | — | pass |
| splitseq_rosenberg | ood_large1 | macOS | 14 | 465 ms | 43 ms | 10.2× | 3.4 → 3.4 GB | — | pass |
| tms_ss2 | ood_large2 | macOS | 14 | 957 ms | 91 ms | 10.6× | 6.8 → 6.9 GB | — | pass |
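As a sanity check, the headline figures can be recomputed from the Speedup column of the 12 runs in the detailed table. The sketch below assumes the headline median is taken over exactly those 12 runs, which this page does not state explicitly. It uses the listed speedups rather than dividing the rounded runtimes, since the displayed times (for example 4 ms) are too coarse to reproduce the ratios.

```python
from statistics import median

# Speedups copied from the 12 rows of the detailed table (Windows first, then macOS).
speedups = [
    25.4, 14.8, 15.5, 62.9, 20.2, 19.0,   # Windows, 32 threads
    7.10, 6.54, 11.3, 25.7, 10.2, 10.6,   # macOS, 8 or 14 threads
]

best = max(speedups)
mid = median(speedups)  # mean of the two middle values, 14.8 and 15.5

print(f"runs: {len(speedups)}, best: {best}x, median: {mid:.2f}x")
```

The best value matches the 62.9× headline, and the median lands at about 15.15, which is consistent with the 15.2× shown at the top of the page.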
Frequently asked questions
Speeding up Scanpy normalize_total
Why is Scanpy normalize_total slow?
Scanpy normalize_total is CPU-bound, and the stock implementation leaves performance on the table in its core numerical work. In the best benchmarked case (pbmc68k, 32 threads, Windows) the original takes 322 ms where the AutoZyme path takes 4 ms (62.9× faster); the median speedup across the benchmark runs is 15.2×.
How do I make Scanpy normalize_total faster?
Install AutoZyme and activate the Scanpy patch, then keep using Scanpy normalize_total exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 62.9× faster on the benchmark datasets, with no pipeline or API changes.
Does the AutoZyme speedup change the Scanpy normalize_total output?
Effectively no. The output is tolerance-equivalent: held within a frozen concordance gate (up to about 0.6% drift from the original Scanpy result) on every benchmark dataset.
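One way to spot-check this on your own data is to compare the stored values of the upstream and patched outputs element by element. The page does not describe how the concordance gate is computed, so the sketch below is a hypothetical check: the function names are invented, and the 0.006 threshold is only the "about 0.6%" figure above read as a maximum relative difference.

```python
def max_relative_drift(reference, candidate, eps=1e-12):
    """Largest element-wise relative difference between two equal-length sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of stored values")
    worst = 0.0
    for ref, cand in zip(reference, candidate):
        denom = max(abs(ref), eps)  # guard against division by zero
        worst = max(worst, abs(cand - ref) / denom)
    return worst


def within_gate(reference, candidate, gate=0.006):
    """Hypothetical gate: pass when the worst relative drift is at most `gate`."""
    return max_relative_drift(reference, candidate) <= gate


if __name__ == "__main__":
    upstream = [1.0, 2.0, 4.0]   # stand-in for upstream .X values
    patched = [1.0, 2.01, 4.0]   # stand-in for patched .X values
    print(max_relative_drift(upstream, patched), within_gate(upstream, patched))
```

With real AnnData objects, the two sequences would be the .data arrays of the sparse .X matrices produced by each path, assuming both have the same sparsity pattern.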
How do I install the Scanpy speedup?
Run pip install autozyme, then in Python import autozyme and call autozyme.activate("scanpy"). The patch applies automatically the next time you call normalize_total.
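Put together, activation might look like the sketch below. Only autozyme.activate("scanpy") is taken from the answer above; the helper names are hypothetical, and the importlib guard is an added convenience so the script still runs on a machine where autozyme is not installed. The normalize_total and log1p calls are the usual Scanpy tutorial pair with target_sum=1e4 and are unchanged from an unpatched pipeline.

```python
import importlib.util


def activate_scanpy_patch() -> bool:
    """Activate the AutoZyme Scanpy patch if autozyme is installed.

    Returns True when the patch was activated, False when autozyme is missing.
    """
    if importlib.util.find_spec("autozyme") is None:
        return False  # not installed: Scanpy keeps running its upstream code
    import autozyme

    autozyme.activate("scanpy")  # call quoted from the install answer above
    return True


def normalize(adata):
    """Hypothetical helper: the call site is the same with or without the patch."""
    import scanpy as sc  # imported lazily so the activation check has no Scanpy dependency

    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    return adata


if __name__ == "__main__":
    print("AutoZyme patch active:", activate_scanpy_patch())
```

Calls outside the supported scope fall back silently to upstream, so activating the patch once at the top of a pipeline should not require any other change.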