Speed up cell2location

cell2location is often one of the slower steps in single-cell genomics workflows. AutoZyme ships a verified, drop-in patch that runs up to 2.28× faster on the Windows benchmark (up to 2.82× in the macOS runs) and returns output within a verified tolerance, with no change to how you call it.

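Best speedup: 2.28× (Windows)
Median speedup: 2.36× (across the 10 runs in the detailed table, Windows and macOS)
Output equivalence: within tolerance
Best runtime (lymph_node_4035, 1 thread, Windows): 15.86 min baseline, 7.16 min optimized
Datasets: 5
Pass rate: 10/10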

Benchmark charts

Platform shown: Windows

Speedup distribution
Each dot is one finalized dataset/thread run on Windows. The underlying runs are listed under Thread sweep.

Thread sweep
Speedup across finalized thread counts (1, 4, and full (8)) on Windows

| Dataset | Tier | Threads | Baseline | Optimized | Speedup | Memory |
|---|---|---|---|---|---|---|
| lymph_node_4035 | large | 1 | 15.86 min | 7.16 min | 2.28× | 3.6 → 3.1 GB |
| lymph_node_4035 | large | 4 | 6.12 min | 3.45 min | 1.90× | 3.6 → 3.0 GB |
| lymph_node_4035 | large | 8 | 5.94 min | 2.81 min | 2.05× | 3.6 → 3.2 GB |
| lymph_node_1500 | medium | 1 | 4.85 min | 2.37 min | 2.26× | 1.9 → 1.6 GB |
| lymph_node_1500 | medium | 4 | 2.06 min | 1.14 min | 1.89× | 1.9 → 1.7 GB |
| lymph_node_1500 | medium | 8 | 2.06 min | 1.03 min | 2.20× | 1.9 → 1.7 GB |
| lymph_node_3000_seed42_10lineage | ood_large | 1 | 11.78 min | 5.72 min | 2.13× | 3.0 → 2.4 GB |
| lymph_node_3000_seed42_10lineage | ood_large | 4 | 6.02 min | 3.78 min | 1.54× | 3.0 → 2.4 GB |
| lymph_node_3000_seed42_10lineage | ood_large | 8 | 4.24 min | 1.97 min | 2.12× | 3.0 → 2.4 GB |
| lymph_node_500 | small | 1 | 1.60 min | 50.07 s | 2.10× | 1.2 → 1.1 GB |
| lymph_node_500 | small | 4 | 43.05 s | 23.97 s | 1.97× | 1.2 → 1.1 GB |
| lymph_node_500 | small | 8 | 43.58 s | 23.56 s | 1.89× | 1.2 → 1.1 GB |
| lymph_node_10000_boot101_10lineage | ood_xlarge | 1 | 41.00 min | 19.43 min | 2.07× | 7.7 → 6.2 GB |
| lymph_node_10000_boot101_10lineage | ood_xlarge | 4 | 21.04 min | 13.05 min | 1.62× | 7.7 → 6.2 GB |
| lymph_node_10000_boot101_10lineage | ood_xlarge | 8 | 14.39 min | 6.80 min | 2.04× | 7.7 → 6.2 GB |

Memory
Baseline vs optimized peak memory on Windows (8 threads)

| Dataset | Tier | Baseline | Optimized | Optimized / baseline | Speedup |
|---|---|---|---|---|---|
| lymph_node_10000_boot101_10lineage | ood_xlarge | 7.7 GB | 6.2 GB | 0.80× | 2.04× |
| lymph_node_4035 | large | 3.6 GB | 3.2 GB | 0.87× | 2.05× |
| lymph_node_3000_seed42_10lineage | ood_large | 3.0 GB | 2.4 GB | 0.81× | 2.12× |
| lymph_node_1500 | medium | 1.9 GB | 1.7 GB | 0.87× | 2.20× |
| lymph_node_500 | small | 1.2 GB | 1.1 GB | 0.91× | 1.89× |

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

The patch targets cell2location.models.Cell2location.train. On the listed datasets, the patched path passes the declared output-concordance gate while reducing CPU runtime.

Supported scope

The fast path is correct only for the single-batch, full-batch SVI regime that the benchmark uses. Specifically:

(1) n_batch == 1. fast_forward checks `if self.n_batch != 1: return _orig_forward(...)` (line 132), so multi-sample data falls back to the upstream forward.
(2) Full-batch training, i.e. batch_size=None / train_size=1, so the entire count matrix x_data is passed at every step as the same tensor object. The lgamma cache keyed on value.data_ptr() (lines 46-49, 123-128) requires this, because it assumes lgamma(value+1) is constant across all 300 epochs.
(3) The (alpha, mu)-parameterized GammaPoisson data likelihood as constructed in the forward of the LocationModelLinearDependentWMultiExperimentLocationBackgroundNormLevelGeneAlpha model. _GPLogProbFn (lines 73-119) hardcodes the explicit forward and backward passes for exactly that NegativeBinomial/GammaPoisson form (alpha = 1/alpha_g_inverse^2, mu = (w_sf @ (cell_state*m_g) + s_g_gene_add)*detection_y_s).
(4) The fallback upstream path (n_batch != 1) is still accelerated by fast_gp_log_prob, which is mathematically equivalent to GammaPoisson.log_prob. It applies a row-redundancy shortcut only when the concentration rows are bit-identical (verified via torch.equal, lines 53-62); otherwise it computes the full result.
(5) The validation-disable toggle is saved and restored locally around the n_batch == 1 forward (lines 135-139, 268), so it has no process-global side effect.

Verification: outputs are numerically close to upstream, with pearson_loss=1.0, pearson_w_sf=1.0, and tiny max_abs differences across all benchmarked tiers (speedups_finalized.tsv).
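The role of the cached lgamma term in point (2) can be illustrated with a small standard-library sketch. It is not the patch's code: `nb_log_prob`, `LgammaCache`, and the use of `id()` as a stand-in for `value.data_ptr()` are assumptions made for illustration. The sketch writes the (alpha, mu) negative-binomial log-probability in closed form and shows that lgamma(x + 1) depends only on the counts, so it can be computed once when the same count object is passed at every step.

```python
import math


def nb_log_prob(x, alpha, mu, lgamma_x1=None):
    """Log-probability of count x under a GammaPoisson / negative binomial
    with concentration alpha and mean mu (i.e. rate = alpha / mu)."""
    if lgamma_x1 is None:
        lgamma_x1 = math.lgamma(x + 1.0)
    return (
        math.lgamma(alpha + x) - math.lgamma(alpha) - lgamma_x1
        + alpha * math.log(alpha) + x * math.log(mu)
        - (alpha + x) * math.log(alpha + mu)
    )


class LgammaCache:
    """Caches lgamma(x + 1) for one count container, keyed on object identity.
    Hypothetical analog of keying a cache on a tensor's data pointer."""

    def __init__(self):
        self._key = None
        self._values = None
        self.recomputed = 0  # how many times the cache had to be rebuilt

    def get(self, counts):
        key = id(counts)
        if key != self._key:
            self._values = [math.lgamma(c + 1.0) for c in counts]
            self._key = key
            self.recomputed += 1
        return self._values


def total_log_prob(counts, alpha, mus, cache):
    """Sum of per-observation log-probabilities, reusing cached lgamma(x + 1)."""
    cached = cache.get(counts)
    return sum(nb_log_prob(x, alpha, mu, lg) for x, mu, lg in zip(counts, mus, cached))


counts = [0, 3, 7, 1, 12]
cache = LgammaCache()

# Two "training steps": parameters change, the counts object does not.
step1 = total_log_prob(counts, 2.0, [1.0, 2.5, 6.0, 0.8, 10.0], cache)
mus2 = [1.1, 2.4, 6.2, 0.9, 9.5]
step2 = total_log_prob(counts, 2.5, mus2, cache)

# Uncached reference for the second step.
reference = sum(nb_log_prob(x, 2.5, mu) for x, mu in zip(counts, mus2))
print(cache.recomputed, abs(step2 - reference))
```

Keying on identity is only safe while the same object is reused at every step. In this sketch, that is why a regime that passes a different slice of the counts each step (minibatching) would not fit the cache's assumption.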

Out-of-scope behavior

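Silent, possibly wrong: outside the supported scope, the patch is not guaranteed to raise an error and may return incorrect results.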

Detailed speedup table (10 runs)

| Dataset | Tier | Platform | Threads | Baseline | Optimized | Speedup | Memory | Concordance | Pass |
|---|---|---|---|---|---|---|---|---|---|
| lymph_node_10000_boot101_10lineage | ood_xlarge | Windows | 1 | 41.00 min | 19.43 min | 2.07× | 7.7 → 6.2 GB | | pass |
| lymph_node_1500 | medium | Windows | 1 | 4.85 min | 2.37 min | 2.26× | 1.9 → 1.6 GB | | pass |
| lymph_node_3000_seed42_10lineage | ood_large | Windows | 1 | 11.78 min | 5.72 min | 2.13× | 3.0 → 2.4 GB | | pass |
| lymph_node_4035 | large | Windows | 1 | 15.86 min | 7.16 min | 2.28× | 3.6 → 3.1 GB | | pass |
| lymph_node_500 | small | Windows | 1 | 1.60 min | 50.07 s | 2.10× | 1.2 → 1.1 GB | | pass |
| lymph_node_10000_boot101_10lineage | ood_xlarge | macOS | 4 | 13.02 min | 5.15 min | 2.53× | 8.4 → 6.8 GB | | pass |
| lymph_node_1500 | medium | macOS | 1 | 6.33 min | 2.52 min | 2.51× | 1.9 → 1.7 GB | | pass |
| lymph_node_3000_seed42_10lineage | ood_large | macOS | 4 | 4.42 min | 1.57 min | 2.82× | 3.1 → 2.6 GB | | pass |
| lymph_node_4035 | large | macOS | 1 | 17.00 min | 6.74 min | 2.52× | 3.9 → 3.3 GB | | pass |
| lymph_node_500 | small | macOS | 1 | 2.19 min | 53.81 s | 2.44× | 1.3 → 1.1 GB | | pass |
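
The headline figures can be cross-checked against the Speedup column of the table. The short sketch below recomputes the median and the per-platform best from the ten listed values; the variable names are illustrative.

```python
from statistics import median

# Speedup column of the detailed table, in row order.
windows = [2.07, 2.26, 2.13, 2.28, 2.10]
macos = [2.53, 2.51, 2.82, 2.52, 2.44]

all_runs = windows + macos
median_speedup = round(median(all_runs), 2)
best_windows = max(windows)
best_macos = max(macos)

print(f"median over {len(all_runs)} runs: {median_speedup}x")
print(f"best on Windows: {best_windows}x, best on macOS: {best_macos}x")
```

With these inputs the median is 2.36×, which matches the headline median, while the 2.28× headline best corresponds to the Windows runs; the largest value in the table is 2.82×, from a macOS run.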

Frequently asked questions

Speeding up cell2location
Why is cell2location slow?

cell2location training is CPU-bound, and the stock implementation leaves performance on the table in its core numerical work. On the lymph_node_4035 benchmark dataset (1 thread, Windows), the original takes 15.86 min where the AutoZyme path takes 7.16 min (2.28× faster).

How do I make cell2location faster?

Install AutoZyme and activate the cell2location patch, then keep using cell2location exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 2.28× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the cell2location output?

Effectively no. The output is tolerance-equivalent: held within a frozen concordance gate (up to about 0.6% drift from the original cell2location result) on every benchmark dataset.
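
A concordance check of this kind can be sketched as follows. This is an illustration, not AutoZyme's actual gate: the function names, the choice of Pearson correlation plus maximum relative drift, the `min_pearson` value, and the use of 0.6% as the drift threshold (taken from the figure quoted above) are all assumptions.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def max_relative_drift(baseline, optimized):
    """Largest absolute difference, relative to the largest baseline magnitude."""
    scale = max(abs(x) for x in baseline)
    return max(abs(x - y) for x, y in zip(baseline, optimized)) / scale


def passes_gate(baseline, optimized, max_drift=0.006, min_pearson=0.999):
    """Hypothetical gate: high correlation and drift under the threshold."""
    return (
        pearson(baseline, optimized) >= min_pearson
        and max_relative_drift(baseline, optimized) <= max_drift
    )


baseline = [1.0, 2.0, 3.5, 8.0]
close = [1.001, 1.999, 3.502, 8.004]   # small numerical differences
drifted = [1.0, 2.0, 3.5, 8.5]         # one value off by about 6%

print(passes_gate(baseline, close), passes_gate(baseline, drifted))
```

The thresholds are frozen before the comparison in this sketch, so a run either clears the same bar as every other dataset or fails it.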

How do I install the cell2location speedup?

Run pip install autozyme, then in Python import autozyme and call autozyme.activate("cell2location"). The patch takes effect automatically the next time you call cell2location.models.Cell2location.train.
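
The steps above can be sketched as a small script. The call `autozyme.activate("cell2location")` is taken from the text; the helper name `activate_cell2location_patch` and the guard for a missing installation are illustrative assumptions, added so the script runs even where AutoZyme is not installed.

```python
import importlib.util


def activate_cell2location_patch() -> bool:
    """Activate the patch if autozyme is importable; report whether it was."""
    if importlib.util.find_spec("autozyme") is None:
        return False
    import autozyme

    autozyme.activate("cell2location")  # call as described in the text
    return True


if activate_cell2location_patch():
    # From here on, existing code that calls Cell2location.train is unchanged.
    print("AutoZyme patch activated for cell2location")
else:
    print("autozyme is not installed; run `pip install autozyme` first")
```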