Speed up Scanpy pca: up to 36.0× faster, near-identical output

Benchmark charts

Speedup distribution (Windows, log scale). Each dot is one finalized dataset/thread run.

pbmc200k_glaucoma	36.0×
pbmc68k	25.2×
heart_adult	23.7×
splitseq_rosenberg	23.1×
gastrulation_pijuansa…	15.7×
tms_ss2	13.6×

Thread sweep: speedup across finalized thread counts on Windows, for pbmc200k_glaucoma, pbmc68k, heart_adult, splitseq_rosenberg, gastrulation_pijuan…, and tms_ss2.

Memory: baseline vs optimized peak memory on Windows.

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets pca in Scanpy. On the listed datasets, the optimized path reduces CPU runtime while keeping the output within the declared concordance gate.

Supported scope

The fast path computes a full, zero-centered PCA via a Gram matrix (X^T X scaled by 1/(n_cells-1)) plus a symmetric eigendecomposition, then projects centered X onto the top n_comps eigenvectors.

It correctly handles dense or sparse adata.X (sparse input is densified via toarray, line 159); any n_comps valid for the matrix; copy=True/False; and the implicit upstream case of zero_center=True with the arpack, auto, or full deterministic solver. In that case the output is mathematically equivalent up to sign, which is what the eval metric min_pc_cor>=0.95 checks.

use_highly_variable is honored explicitly (lines 121-145). When an HVG annotation is present (or use_highly_variable=True) and the mask is a proper subset, the fast path recurses on the HVG-subset matrix and lifts the PCs back into full var space, with zeros for non-HVG genes. use_highly_variable=False forces the full-gene path.

Matrices with n_genes>8000 fall through to upstream ARPACK (lines 154-156), where all kwargs are forwarded, and zyme=False (lines 105-106) forwards everything to upstream.

On macOS the fast path uses Apple Accelerate cblas_sgemm plus numpy.linalg.eigh; elsewhere it uses NumPy BLAS plus scipy.linalg.eigh(driver="evr"). All computation is done in float32 internally, regardless of the requested dtype.
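The approach described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not AutoZyme's implementation: the function names gram_pca and lift_components are hypothetical, the sketch centers X explicitly before forming X^T X, and it calls numpy.linalg.eigh on every platform rather than the Accelerate or scipy.linalg.eigh(driver="evr") routes named in the text.

```python
import numpy as np


def gram_pca(X, n_comps):
    """Zero-centered PCA via a gene-by-gene Gram matrix and a symmetric eigensolver.

    Hypothetical helper for illustration only. X is a dense (n_cells, n_genes) array.
    """
    X = np.asarray(X, dtype=np.float32)           # float32 internally, as in the text
    n_cells = X.shape[0]
    Xc = X - X.mean(axis=0, keepdims=True)        # center each gene
    gram = (Xc.T @ Xc) / np.float32(n_cells - 1)  # X^T X scaled by 1/(n_cells-1)
    evals, evecs = np.linalg.eigh(gram)           # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:n_comps]     # indices of the top n_comps
    components = evecs[:, order].T                # (n_comps, n_genes)
    variances = evals[order]
    scores = Xc @ components.T                    # project centered X
    return scores, components, variances


def lift_components(components_hvg, hvg_mask):
    """Place HVG-only loadings into full gene space, with zeros for non-HVG genes."""
    hvg_mask = np.asarray(hvg_mask, dtype=bool)
    full = np.zeros((components_hvg.shape[0], hvg_mask.size),
                    dtype=components_hvg.dtype)
    full[:, hvg_mask] = components_hvg
    return full
```

Because the eigendecomposition runs on an n_genes by n_genes matrix, its cost depends on the number of genes rather than the number of cells, which is consistent with the text's n_genes>8000 cutoff for falling back to upstream ARPACK. Each eigenvector is defined only up to sign, so scores from this sketch may be flipped relative to another solver's.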

Out-of-scope behavior

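Silent fallback to upstream: calls outside the supported scope are forwarded to the stock Scanpy implementation without raising an error.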
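The fallback rules can be pictured as a small dispatcher. The sketch below is hypothetical: dispatch_pca, fast_path, and upstream are illustrative names, and using NotImplementedError to signal an unsupported input is an assumption made for the example. Only the zyme=False switch and the 8000-gene threshold come from the text.

```python
MAX_FAST_PATH_GENES = 8000  # threshold quoted under "Supported scope"


def dispatch_pca(n_genes, fast_path, upstream, *args, zyme=True, **kwargs):
    """Route a pca call to the fast path or to upstream (hypothetical sketch)."""
    if not zyme:
        # zyme=False forwards everything to upstream.
        return upstream(*args, **kwargs)
    if n_genes > MAX_FAST_PATH_GENES:
        # Wide matrices fall through to upstream with all kwargs forwarded.
        return upstream(*args, **kwargs)
    try:
        return fast_path(*args, **kwargs)
    except NotImplementedError:
        # Assumed signal for "out of scope": fall back silently.
        return upstream(*args, **kwargs)
```

With this shape the caller always gets a result from one of the two implementations, so an unsupported option costs speed but never raises.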

Detailed speedup table (11 runs)

Dataset	Tier	Platform	Threads	Baseline runtime	Optimized runtime	Speedup	Peak memory (baseline → optimized)	Concordance	Pass
`gastrulation_pijuansala`	ood_large3	Windows	32	18.52 s	1.17 s	15.7×	14.8 → 15.0 GB	—	pass
`heart_adult`	large	Windows	32	1.28 min	3.27 s	23.7×	32.7 → 19.3 GB	—	pass
`pbmc200k_glaucoma`	medium	Windows	32	58.94 s	1.63 s	36.0×	14.0 → 7.9 GB	—	pass
`pbmc68k`	small	Windows	4	20.90 s	829 ms	25.2×	3.5 → 1.6 GB	—	pass
`splitseq_rosenberg`	ood_large1	Windows	32	27.56 s	1.19 s	23.1×	9.8 → 4.9 GB	—	pass
`tms_ss2`	ood_large2	Windows	4	17.89 s	1.34 s	13.6×	9.7 → 8.9 GB	—	pass
`gastrulation_pijuansala`	ood_large3	macOS	4	8.75 s	815 ms	12.3×	14.4 → 14.7 GB	—	pass
`pbmc200k_glaucoma`	medium	macOS	4	28.54 s	976 ms	29.6×	16.5 → 10.3 GB	—	pass
`pbmc68k`	small	macOS	14	12.35 s	600 ms	20.5×	7.3 → 2.2 GB	—	pass
`splitseq_rosenberg`	ood_large1	macOS	4	13.93 s	911 ms	17.0×	14.1 → 7.1 GB	—	pass
`tms_ss2`	ood_large2	macOS	14	7.47 s	766 ms	9.30×	9.2 → 9.0 GB	—	pass
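The table mixes ms, s, and min, so it helps to normalize the timings before comparing rows. The helper below is a small sketch for doing that. The names to_seconds and speedup are hypothetical, and the parser assumes the "value unit" format used in the table.

```python
_UNIT_SECONDS = {"ms": 1e-3, "s": 1.0, "min": 60.0}


def to_seconds(text):
    """Convert a table timing such as '829 ms' or '1.28 min' to seconds."""
    value, unit = text.split()
    return float(value) * _UNIT_SECONDS[unit]


def speedup(baseline, optimized):
    """Ratio of baseline to optimized runtime, both given as table strings."""
    return to_seconds(baseline) / to_seconds(optimized)


# Example: the pbmc68k Windows row.
ratio = speedup("20.90 s", "829 ms")
```

Ratios recomputed this way can differ in the last digit from the Speedup column, presumably because the displayed timings are rounded.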

Frequently asked questions

Speeding up Scanpy pca

Why is Scanpy pca slow?

Scanpy pca is CPU-bound, and the stock implementation in Scanpy leaves performance on the table in its core numerical work. In the best benchmark run (pbmc200k_glaucoma on Windows with 32 threads), the original takes 58.94 s where the AutoZyme path takes 1.63 s (36.0× faster).

How do I make Scanpy pca faster?

Install AutoZyme and activate the Scanpy patch, then keep using Scanpy pca exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 36.0× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the Scanpy pca output?

Effectively no. The output is tolerance-equivalent: held within a frozen concordance gate (up to about 0.6% drift from the original Scanpy result) on every benchmark dataset.
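To check concordance on your own data, a sign-invariant comparison of the PC scores is one option. The sketch below is an assumption about what a metric like the min_pc_cor named under "Supported scope" might compute: the minimum absolute Pearson correlation across matching components. The exact definition used by the benchmark is not given here.

```python
import numpy as np


def min_pc_cor(baseline_scores, optimized_scores):
    """Minimum |Pearson r| over matching PC columns (assumed definition)."""
    a = np.asarray(baseline_scores, dtype=np.float64)
    b = np.asarray(optimized_scores, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError("score matrices must have the same shape")
    a = a - a.mean(axis=0)
    b = b - b.mean(axis=0)
    num = (a * b).sum(axis=0)
    den = np.sqrt((a * a).sum(axis=0) * (b * b).sum(axis=0))
    # The absolute value makes the check indifferent to per-component sign flips.
    return float(np.min(np.abs(num / den)))
```

Under this definition, a run would pass the gate quoted in the text when min_pc_cor(baseline, optimized) >= 0.95.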

How do I install the Scanpy speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("scanpy"). The patch applies automatically the next time you call pca.