Speed up cell2location: up to 2.28× faster, near-identical output

Benchmark charts

Speedup distribution: each dot is one finalized dataset/thread run on Windows.

lymph_node_4035: 2.28×
lymph_node_1500: 2.26×
lymph_node_3000_seed4…: 2.13×
lymph_node_500: 2.10×
lymph_node_10000_boot…: 2.07×

Thread sweep: speedup across finalized thread counts on Windows.

Memory: baseline vs optimized peak memory on Windows.

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

The patch targets cell2location.models.Cell2location.train. On the listed datasets it reduces CPU runtime while the output stays within the declared scientific output gate.

Also searched as: deconvolution, spatial deconvolution, spatial mapping, cell type mapping.

Supported scope

The fast path is correct ONLY for the single-batch, full-batch SVI regime that the benchmark uses. Specifically:

(1) n_batch == 1. fast_forward checks `if self.n_batch != 1: return _orig_forward(...)` (line 132), so multi-sample data correctly falls back to upstream.

(2) Full-batch training, i.e. batch_size=None / train_size=1, so the entire count matrix x_data is passed every step as the SAME tensor object. This is required by the lgamma cache keyed on value.data_ptr() (lines 46-49, 123-128), which assumes lgamma(value+1) is constant across all 300 epochs.

(3) The (alpha, mu)-parameterized GammaPoisson data likelihood as constructed in the LocationModelLinearDependentWMultiExperimentLocationBackgroundNormLevelGeneAlpha model forward. _GPLogProbFn (lines 73-119) hardcodes the explicit forward+backward for exactly that NegativeBinomial/GammaPoisson form (alpha = 1/alpha_g_inverse^2, mu = (w_sf @ (cell_state*m_g) + s_g_gene_add)*detection_y_s).

(4) The fall-back upstream path (n_batch != 1) is still accelerated by fast_gp_log_prob, which is mathematically equivalent to GammaPoisson.log_prob and adds a row-redundancy shortcut only when concentration rows are bit-identical (verified via torch.equal, lines 53-62). Otherwise it computes the full result.

(5) The validation-disable toggle is saved and restored locally around the n_batch=1 forward (lines 135-139, 268), so it has no process-global side effect.

Verified bit-close: pearson_loss=1.0, pearson_w_sf=1.0, tiny max_abs diffs across all benchmarked tiers (speedups_finalized.tsv).
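The caching idea in item (2) and the likelihood form in item (3) can be illustrated with a minimal pure-Python sketch. It is not AutoZyme's code. It assumes the standard negative binomial log-probability with concentration alpha and mean mu, uses scalar alpha and mu for brevity, and uses Python object identity as a stand-in for value.data_ptr(). The names lgamma_counts and gamma_poisson_log_prob are hypothetical.

```python
import math

# Cache of lgamma(x + 1) terms, keyed on the identity of the counts object.
# Storing the object itself alongside the values keeps it alive, so its id
# cannot be reused by a different object while the entry exists.
_LGAMMA_CACHE = {}


def lgamma_counts(counts):
    """Return [lgamma(x + 1) for x in counts], computed once per counts object."""
    key = id(counts)
    hit = _LGAMMA_CACHE.get(key)
    if hit is None or hit[0] is not counts:
        hit = (counts, [math.lgamma(x + 1.0) for x in counts])
        _LGAMMA_CACHE[key] = hit
    return hit[1]


def gamma_poisson_log_prob(counts, alpha, mu):
    """Per-count log NB(x | alpha, mu), with mean mu and concentration alpha."""
    log_w_alpha = math.log(alpha / (alpha + mu))  # multiplies alpha
    log_w_count = math.log(mu / (alpha + mu))     # multiplies each count x
    lgamma_alpha = math.lgamma(alpha)
    return [
        math.lgamma(x + alpha) - lgamma_alpha - lg_x1
        + alpha * log_w_alpha + x * log_w_count
        for x, lg_x1 in zip(counts, lgamma_counts(counts))
    ]


# The same counts object is passed on every step, so lgamma(x + 1) is
# computed on the first call only and reused afterwards.
counts = (0, 1, 2, 5, 10)
for step in range(3):
    log_probs = gamma_poisson_log_prob(counts, alpha=2.0, mu=3.0)
```

The cache is only valid while the counts never change between steps, which is why the sketch, like the constraint described above, depends on the same object being passed every time. With minibatches a different object arrives each step and the cache gives no benefit.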

Out-of-scope behavior

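Silent and possibly wrong: inputs outside the supported scope are not rejected with an error, and the results may be incorrect.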

Detailed speedup table (10 runs)

Dataset	Tier	Platform	Threads	Baseline	Optimized	Speedup	Memory	Concordance	Pass
`lymph_node_10000_boot101_10lineage`	ood_xlarge	Windows	1	41.00 min	19.43 min	2.07×	7.7 → 6.2 GB	—	pass
`lymph_node_1500`	medium	Windows	1	4.85 min	2.37 min	2.26×	1.9 → 1.6 GB	—	pass
`lymph_node_3000_seed42_10lineage`	ood_large	Windows	1	11.78 min	5.72 min	2.13×	3.0 → 2.4 GB	—	pass
`lymph_node_4035`	large	Windows	1	15.86 min	7.16 min	2.28×	3.6 → 3.1 GB	—	pass
`lymph_node_500`	small	Windows	1	1.60 min	50.07 s	2.10×	1.2 → 1.1 GB	—	pass
`lymph_node_10000_boot101_10lineage`	ood_xlarge	macOS	4	13.02 min	5.15 min	2.53×	8.4 → 6.8 GB	—	pass
`lymph_node_1500`	medium	macOS	1	6.33 min	2.52 min	2.51×	1.9 → 1.7 GB	—	pass
`lymph_node_3000_seed42_10lineage`	ood_large	macOS	4	4.42 min	1.57 min	2.82×	3.1 → 2.6 GB	—	pass
`lymph_node_4035`	large	macOS	1	17.00 min	6.74 min	2.52×	3.9 → 3.3 GB	—	pass
`lymph_node_500`	small	macOS	1	2.19 min	53.81 s	2.44×	1.3 → 1.1 GB	—	pass

Frequently asked questions

Speeding up cell2location

Why is cell2location slow?

cell2location training is CPU-bound, and the stock implementation leaves performance on the table in its core numerical work. On the lymph_node_4035 benchmark (Windows, 1 thread), the original takes 15.86 min where the AutoZyme path takes 7.16 min (2.28× faster).

How do I make cell2location faster?

Install AutoZyme and activate the cell2location patch, then keep using cell2location exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 2.28× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the cell2location output?

Effectively no. The output is tolerance-equivalent: held within a frozen concordance gate (up to about 0.6% drift from the original cell2location result) on every benchmark dataset.
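To check tolerance-equivalence on your own data, one option is to compare a baseline output against an optimized one with a few summary metrics. The sketch below is an illustration only: the page does not spell out how the concordance gate is computed, so the metric choice, the function names, and the use of 0.006 (taken from "about 0.6%") as a relative tolerance are all assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences of floats."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def concordance(baseline, optimized, rel_tol=0.006):
    """Summarize agreement; rel_tol is an assumed stand-in for the gate."""
    pairs = list(zip(baseline, optimized))
    max_abs = max(abs(x - y) for x, y in pairs)
    # Relative drift is measured against the baseline; zero baselines are skipped.
    max_rel = max((abs(x - y) / abs(x) for x, y in pairs if x != 0), default=0.0)
    return {
        "pearson": pearson(baseline, optimized),
        "max_abs_diff": max_abs,
        "max_rel_drift": max_rel,
        "passes": max_rel <= rel_tol,
    }


# Toy values standing in for flattened model outputs from the two paths.
baseline = [1.0, 2.0, 4.0]
optimized = [1.001, 2.0, 3.99]
report = concordance(baseline, optimized)
print(report)
```

Reporting the correlation together with the maximum absolute and relative differences is useful because a correlation near 1.0 can still hide a uniform scale shift.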

How do I install the cell2location speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("cell2location"). The patch applies automatically the next time you call cell2location.models.Cell2location.train.
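The activation step can be wrapped so that a script still runs on machines where AutoZyme is not installed. This is a minimal sketch: autozyme.activate("cell2location") is the call named above, while the helper activate_cell2location_patch and its fallback behavior are hypothetical conveniences rather than part of AutoZyme.

```python
import importlib
import importlib.util


def activate_cell2location_patch():
    """Activate the AutoZyme patch if installed; return True on success."""
    if importlib.util.find_spec("autozyme") is None:
        # Not installed: cell2location keeps running its stock code path.
        return False
    autozyme = importlib.import_module("autozyme")
    autozyme.activate("cell2location")
    return True


patched = activate_cell2location_patch()
print("AutoZyme patch active:", patched)
```

Call the helper once, before constructing the model and calling train, and leave the rest of the pipeline unchanged.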