cell2location is one of the slower steps in many single-cell genomics workflows. AutoZyme ships a
verified, drop-in patch that is up to 2.28× faster on the Windows benchmark runs, returning output within a strict tolerance with no change to how you call it.
Best speedup: 2.28×
Median speedup: 2.36×
Output equivalence: Tolerance
Best runtime: baseline 15.86 min → optimized 7.16 min
Datasets: 5
Pass rate: 10/10
Benchmark charts
Speedup distribution (Windows): each dot is one finalized dataset/thread run.
Windows runs by dataset and thread count:

| Dataset | Tier | Threads | Baseline | Optimized | Speedup | Memory |
| --- | --- | --- | --- | --- | --- | --- |
| lymph_node_4035 | large | 1 | 15.86 min | 7.16 min | 2.28× | 3.6 → 3.1 GB |
| lymph_node_4035 | large | 4 | 6.12 min | 3.45 min | 1.90× | 3.6 → 3.0 GB |
| lymph_node_4035 | large | 8 | 5.94 min | 2.81 min | 2.05× | 3.6 → 3.2 GB |
| lymph_node_1500 | medium | 1 | 4.85 min | 2.37 min | 2.26× | 1.9 → 1.6 GB |
| lymph_node_1500 | medium | 4 | 2.06 min | 1.14 min | 1.89× | 1.9 → 1.7 GB |
| lymph_node_1500 | medium | 8 | 2.06 min | 1.03 min | 2.20× | 1.9 → 1.7 GB |
| lymph_node_3000_seed42_10lineage | ood_large | 1 | 11.78 min | 5.72 min | 2.13× | 3.0 → 2.4 GB |
| lymph_node_3000_seed42_10lineage | ood_large | 4 | 6.02 min | 3.78 min | 1.54× | 3.0 → 2.4 GB |
| lymph_node_3000_seed42_10lineage | ood_large | 8 | 4.24 min | 1.97 min | 2.12× | 3.0 → 2.4 GB |
| lymph_node_500 | small | 1 | 1.60 min | 50.07 s | 2.10× | 1.2 → 1.1 GB |
| lymph_node_500 | small | 4 | 43.05 s | 23.97 s | 1.97× | 1.2 → 1.1 GB |
| lymph_node_500 | small | 8 | 43.58 s | 23.56 s | 1.89× | 1.2 → 1.1 GB |
| lymph_node_10000_boot101_10lineage | ood_xlarge | 1 | 41.00 min | 19.43 min | 2.07× | 7.7 → 6.2 GB |
| lymph_node_10000_boot101_10lineage | ood_xlarge | 4 | 21.04 min | 13.05 min | 1.62× | 7.7 → 6.2 GB |
| lymph_node_10000_boot101_10lineage | ood_xlarge | 8 | 14.39 min | 6.80 min | 2.04× | 7.7 → 6.2 GB |
The public API stays the same; AutoZyme replaces only the supported fast path.
This task targets cell2location.models.Cell2location.train in cell2location. The benchmarked result
preserves the declared scientific output gate while reducing CPU runtime on the listed datasets.
Also searched as: deconvolution, spatial deconvolution, spatial mapping, cell type mapping.
Supported scope
The fast path is correct ONLY for the single-batch, full-batch SVI regime that the benchmark uses. Specifically:
(1) n_batch == 1: fast_forward checks `if self.n_batch != 1: return _orig_forward(...)` (line 132), so multi-sample data correctly falls back to upstream.
(2) Full-batch training, i.e. batch_size=None / train_size=1, so the entire count matrix x_data is passed every step as the SAME tensor object. This is required by the lgamma cache keyed on value.data_ptr() (lines 46-49, 123-128), which assumes lgamma(value+1) is constant across all 300 epochs.
(3) The (alpha, mu)-parameterized GammaPoisson data likelihood as constructed in the LocationModelLinearDependentWMultiExperimentLocationBackgroundNormLevelGeneAlpha model forward: _GPLogProbFn (lines 73-119) hardcodes the explicit forward+backward for exactly that NegativeBinomial/GammaPoisson form (alpha = 1/alpha_g_inverse^2, mu = (w_sf @ (cell_state*m_g) + s_g_gene_add)*detection_y_s).
(4) The fall-back upstream path (n_batch != 1) is still accelerated by fast_gp_log_prob, which is mathematically equivalent to GammaPoisson.log_prob and adds a row-redundancy shortcut only when concentration rows are bit-identical (verified via torch.equal, lines 53-62); otherwise it computes the full result.
(5) The validation-disable toggle is save/restored locally around the n_batch=1 forward (lines 135-139, 268), so it has no process-global side effect.
Verified bit-close: pearson_loss=1.0, pearson_w_sf=1.0, tiny max_abs diffs across all benchmarked tiers (speedups_finalized.tsv).
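Points (2) and (3) can be illustrated with a minimal pure-Python sketch. It is not the patch's code, which operates on PyTorch tensors; the names `gamma_poisson_log_prob`, `cached_lgamma` and `total_log_lik` are hypothetical, and the counts are toy values. The sketch writes out the (alpha, mu)-parameterized GammaPoisson log-probability and shows that the lgamma(value+1) term depends only on the observed counts, so it can be computed once and reused as long as the same data is passed every step.

```python
import math

def gamma_poisson_log_prob(x, alpha, mu, lgamma_x_plus_1=None):
    """Log-pmf of a GammaPoisson (negative binomial) with concentration
    alpha and mean mu, evaluated at the count x."""
    if lgamma_x_plus_1 is None:
        # Data-only term: it does not depend on alpha or mu.
        lgamma_x_plus_1 = math.lgamma(x + 1.0)
    return (
        math.lgamma(x + alpha) - math.lgamma(alpha) - lgamma_x_plus_1
        + alpha * math.log(alpha / (alpha + mu))
        + x * math.log(mu / (alpha + mu))
    )

# Toy counts standing in for x_data (illustrative values only).
counts = [0, 3, 7, 1]

# Computed once; only valid while the counts never change between steps.
cached_lgamma = [math.lgamma(c + 1.0) for c in counts]

def total_log_lik(alpha, mu, use_cache=True):
    """Sum of log-probabilities over the toy counts."""
    if use_cache:
        return sum(
            gamma_poisson_log_prob(c, alpha, mu, lg)
            for c, lg in zip(counts, cached_lgamma)
        )
    return sum(gamma_poisson_log_prob(c, alpha, mu) for c in counts)
```

With mini-batches, each step would see different counts, so a cache like `cached_lgamma` would no longer match the data. That is why the supported scope requires full-batch training.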
Out-of-scope behavior
Silent, possibly wrong: inputs outside the supported scope are not flagged, and the results may be incorrect.
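Because out-of-scope use is silent, a caller may want an explicit check before relying on the fast path. The helper below is a hypothetical sketch, not part of AutoZyme or cell2location. It encodes only conditions (1) and (2) of the supported scope, and its parameter names mirror the ones quoted there.

```python
def scope_violations(n_batch, batch_size=None, train_size=1):
    """Return a list of reasons why a run falls outside the benchmarked
    regime (single batch, full-batch training). Empty list means in scope."""
    problems = []
    if n_batch != 1:
        # Per the supported scope, this case falls back to the upstream forward.
        problems.append("n_batch != 1: the single-batch fast path is not used")
    if batch_size is not None or train_size != 1:
        # The lgamma cache assumes the same full count matrix every step.
        problems.append("not full-batch: batch_size must be None and train_size must be 1")
    return problems
```

For example, `scope_violations(1)` returns an empty list, while `scope_violations(2, batch_size=2500, train_size=0.9)` reports both conditions.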
Detailed speedup table (10 runs)

| Dataset | Tier | Platform | Threads | Baseline | Optimized | Speedup | Memory | Concordance | Pass |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| lymph_node_10000_boot101_10lineage | ood_xlarge | Windows | 1 | 41.00 min | 19.43 min | 2.07× | 7.7 → 6.2 GB | — | pass |
| lymph_node_1500 | medium | Windows | 1 | 4.85 min | 2.37 min | 2.26× | 1.9 → 1.6 GB | — | pass |
| lymph_node_3000_seed42_10lineage | ood_large | Windows | 1 | 11.78 min | 5.72 min | 2.13× | 3.0 → 2.4 GB | — | pass |
| lymph_node_4035 | large | Windows | 1 | 15.86 min | 7.16 min | 2.28× | 3.6 → 3.1 GB | — | pass |
| lymph_node_500 | small | Windows | 1 | 1.60 min | 50.07 s | 2.10× | 1.2 → 1.1 GB | — | pass |
| lymph_node_10000_boot101_10lineage | ood_xlarge | macOS | 4 | 13.02 min | 5.15 min | 2.53× | 8.4 → 6.8 GB | — | pass |
| lymph_node_1500 | medium | macOS | 1 | 6.33 min | 2.52 min | 2.51× | 1.9 → 1.7 GB | — | pass |
| lymph_node_3000_seed42_10lineage | ood_large | macOS | 4 | 4.42 min | 1.57 min | 2.82× | 3.1 → 2.6 GB | — | pass |
| lymph_node_4035 | large | macOS | 1 | 17.00 min | 6.74 min | 2.52× | 3.9 → 3.3 GB | — | pass |
| lymph_node_500 | small | macOS | 1 | 2.19 min | 53.81 s | 2.44× | 1.3 → 1.1 GB | — | pass |
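As a quick arithmetic check, the headline figures can be recomputed from the Speedup column of the detailed table. The sketch assumes that the median is taken over all 10 listed runs and that the headline best speedup refers to the Windows runs. Neither is stated explicitly on the page, though both are consistent with the numbers.

```python
from statistics import median

# Speedup column from the detailed table, in row order.
windows = [2.07, 2.26, 2.13, 2.28, 2.10]
macos = [2.53, 2.51, 2.82, 2.52, 2.44]

all_runs = windows + macos

# Median over all 10 runs (average of the two middle values).
median_speedup = round(median(all_runs), 2)

# Best speedup per scope.
best_windows = max(windows)
best_overall = max(all_runs)

print(median_speedup, best_windows, best_overall)
```

Under these assumptions the median comes out at 2.36× and the Windows maximum at 2.28×, matching the headline figures; the largest value in the table is the 2.82× macOS run.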
Frequently asked questions
Speeding up cell2location
Why is cell2location slow?
cell2location is CPU-bound, and the stock implementation leaves performance on the table in its core numerical work. On the lymph_node_4035 benchmark dataset (1 thread, Windows), the original takes 15.86 min where the AutoZyme path takes 7.16 min (2.28× faster).
How do I make cell2location faster?
Install AutoZyme and activate the cell2location patch, then keep using cell2location exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 2.28× faster on the benchmark datasets, with no pipeline or API changes.
Does the AutoZyme speedup change the cell2location output?
Effectively no. The output is tolerance-equivalent: held within a frozen concordance gate (up to about 0.6% drift from the original cell2location result) on every benchmark dataset.
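A tolerance gate of this kind can be sketched as follows. The exact gate AutoZyme uses is not specified here, so everything below is an illustrative assumption: the function names are hypothetical, the 0.6% figure from the answer above is used as the default drift threshold, and the Pearson threshold of 0.999 is a placeholder. The check compares a baseline output vector with an optimized one by Pearson correlation and by maximum relative drift.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def max_relative_drift(baseline, optimized):
    """Largest |optimized - baseline| / |baseline|; assumes nonzero baselines."""
    return max(abs(o - b) / abs(b) for b, o in zip(baseline, optimized))

def within_gate(baseline, optimized, max_drift=0.006, min_pearson=0.999):
    """True when both the drift and the correlation criteria hold."""
    return (
        max_relative_drift(baseline, optimized) <= max_drift
        and pearson(baseline, optimized) >= min_pearson
    )
```

A real gate would be applied to the model outputs named in the supported scope (for example the loss trace and w_sf); the toy vectors in a quick test only show the mechanics.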
How do I install the cell2location speedup?
Run pip install autozyme in your shell. Then, in Python, import autozyme and call autozyme.activate("cell2location"). The patch applies automatically the next time you call cell2location.models.Cell2location.train.
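Put together, the activation step looks roughly like this. The call `autozyme.activate("cell2location")` is taken from the answer above; the `importlib` guard and the `activate_cell2location_patch` wrapper are illustrative additions so that the snippet degrades gracefully when AutoZyme is not installed.

```python
# Shell step, run once beforehand: pip install autozyme
import importlib.util

def activate_cell2location_patch():
    """Activate the AutoZyme patch if the package is installed.

    Returns True when activation was requested, False when autozyme
    cannot be imported in the current environment.
    """
    if importlib.util.find_spec("autozyme") is None:
        return False
    import autozyme
    autozyme.activate("cell2location")
    return True

patched = activate_cell2location_patch()
print("AutoZyme patch active:", patched)
```

After activation, the existing training call stays unchanged; no arguments or imports in the cell2location pipeline need to be edited.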