Speed up celda decontX: up to 2.80× faster, near-identical output

Benchmark charts

Speedup distribution (Windows; each dot is one finalized dataset/thread run): ext_hgmm_chemistry_20… 2.80×, ext_scmixology_sc_10x… 1.80×, ext_pfc_snrna_herring… 1.70×, pbmc68k 1.25×, ifnb 1.24×.

Thread sweep (Windows): speedup across finalized thread counts, per dataset.

Memory (Windows): baseline vs optimized peak memory, per dataset.

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets decontX in celda. The benchmarked result preserves the declared scientific output gate while reducing CPU runtime on the listed datasets.

Supported scope

The patch overrides four celda namespace functions that decontX calls internally. The upstream decontX driver (argument parsing, the per-batch loop over .decontXoneBatch, the z=NULL vs user-z dispatch, the EM convergence test on max|Δθ|) is left fully intact, so the fast path runs for the same parameter space decontX itself accepts. The four overrides are:

(1) decontXLogLik is restored to a passthrough of celda's original C++ log-likelihood (LL), so LL-based diagnostics stay correct. The header comment calling it a no-op is stale; the function computes the real LL.

(2) .decontxInitializeZ reproduces celda's UMAP + dbscan (with kmeans fallback) initialization on a raw counts matrix or SingleCellExperiment, differing only by passing auto_threads(cap=8) to scater::calculateUMAP. It is reached only when z=NULL (celda skips initialization when the user supplies z), and it respects varGenes, dbscanEps, estimateCellTypes, and seed.

(3) calculateNativeMatrix is a sparse R reimplementation of celda's normp-weighted native-count formula, keeping res$decontXcounts a real output.

(4) decontXEM is an RcppEigen + std::thread reimplementation of one EM iteration, math-equivalent to upstream and reported bit-exact at all tested tiers. It honors estimate_eta, estimate_delta, delta (2-element prior), pseudocount, theta, and per-cell counts. Parallelism engages only when nC >= 500 and the requested thread count is > 1 (otherwise an identical serial path runs), and estimate_delta delegates to MCMCprecision::fit_dirichlet exactly as upstream.

Because the patch sits below the batch loop (celda invokes decontXEM once per batch), batch != NULL is handled correctly via upstream's preserved dispatch.

Benchmarks cover only the default config (delta=c(10,10), estimateDelta=TRUE, varGenes=5000, dbscanEps=1, maxIter=500, seed=12345) across the tiny, small, medium, and large tiers plus two OOD tiers. Measured speedups are roughly 1.1-1.5× at 1 thread and up to about 2.85× at 4 threads.
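The two gates named in the scope text, the parallelism condition and the EM convergence test on max|Δθ|, can be sketched in a few lines of base R. This is an illustration only: the helper names use_parallel and has_converged are hypothetical and do not come from AutoZyme or celda, and the tolerance 0.001 and the strict `<` comparison are assumptions, not values stated in the text.

```r
# Hypothetical helper: mirrors the stated rule "nC >= 500 AND requested threads > 1".
use_parallel <- function(n_cells, threads) {
  n_cells >= 500 && threads > 1
}

# Hypothetical helper: convergence test on the largest absolute change in theta.
# tol = 0.001 is an assumed threshold for illustration.
has_converged <- function(theta_old, theta_new, tol = 0.001) {
  max(abs(theta_new - theta_old)) < tol
}

use_parallel(499, 4)     # below the cell threshold: serial path
use_parallel(20000, 1)   # one thread requested: serial path
use_parallel(20000, 4)   # both conditions met: parallel path

theta_old <- c(0.10, 0.20, 0.30)
theta_new <- c(0.1004, 0.2002, 0.2999)
has_converged(theta_old, theta_new)   # largest change is 0.0004
```

Under the first rule, small inputs take the serial path even when several threads are requested, which is consistent with the smaller speedups the scope text reports at 1 thread.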

Out-of-scope behavior

Silent, possibly wrong: outside the supported scope the patch raises no warning or error, and results may be incorrect.

Detailed speedup table (10 runs)

Dataset	Tier	Platform	Threads	Baseline	Optimized	Speedup	Memory	Concordance	Pass
`ext_hgmm_chemistry_20k_3p_HT`	large	Windows	4	5.40 min	1.93 min	2.80×	6.2 → 7.9 GB	—	pass
`ext_pfc_snrna_herring_RL2103_ga22`	medium	Windows	4	1.35 min	47.81 s	1.70×	2.9 → 3.1 GB	—	pass
`ext_scmixology_sc_10x_3cl`	small	Windows	4	14.71 s	8.17 s	1.80×	1.4 → 1.4 GB	—	pass
`ifnb`	ood_large	Windows	8	54.02 s	43.70 s	1.24×	2.2 → 2.4 GB	—	pass
`pbmc68k`	ood_xlarge	Windows	4	5.55 min	4.43 min	1.25×	6.9 → 7.1 GB	—	pass
`ext_hgmm_chemistry_20k_3p_HT`	large	macOS	8	4.99 min	1.38 min	3.67×	10.4 → 10.5 GB	—	pass
`ext_pfc_snrna_herring_RL2103_ga22`	medium	macOS	8	40.27 s	20.12 s	2.04×	5.0 → 5.8 GB	—	pass
`ext_scmixology_sc_10x_3cl`	small	macOS	8	7.66 s	3.75 s	2.13×	1.6 → 1.9 GB	—	pass
`ifnb`	ood_large	macOS	4	19.02 s	14.82 s	1.28×	3.5 → 3.3 GB	—	pass
`pbmc68k`	ood_xlarge	macOS	4	2.44 min	1.70 min	1.41×	9.0 → 8.8 GB	—	pass
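The Speedup column can be cross-checked as baseline time divided by optimized time. The base R sketch below does this for the five Windows rows using the displayed (rounded) times. The data frame and its column names are made up for this illustration, and the reported speedups may have been computed from unrounded timings, so small differences in the last digit are expected.

```r
# Windows rows from the table above; minutes converted to seconds.
runs <- data.frame(
  dataset = c("ext_hgmm_chemistry_20k_3p_HT",
              "ext_pfc_snrna_herring_RL2103_ga22",
              "ext_scmixology_sc_10x_3cl",
              "ifnb",
              "pbmc68k"),
  baseline_s  = c(5.40 * 60, 1.35 * 60, 14.71, 54.02, 5.55 * 60),
  optimized_s = c(1.93 * 60, 47.81, 8.17, 43.70, 4.43 * 60),
  reported    = c(2.80, 1.70, 1.80, 1.24, 1.25)
)

# Speedup = baseline runtime / optimized runtime.
runs$recomputed <- round(runs$baseline_s / runs$optimized_s, 2)
runs$abs_diff   <- abs(runs$recomputed - runs$reported)

print(runs[, c("dataset", "reported", "recomputed", "abs_diff")])
```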

Frequently asked questions

Speeding up celda decontX

Why is celda decontX slow?

celda decontX is CPU-bound, and the stock implementation in celda leaves performance on the table in its core numerical work. On the large-tier benchmark dataset (ext_hgmm_chemistry_20k_3p_HT, Windows, 4 threads) the original takes 5.40 min where the AutoZyme path takes 1.93 min (2.80× faster).

How do I make celda decontX faster?

Install AutoZyme and activate the celda patch, then keep using celda decontX exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 2.80× faster on the Windows benchmark runs (up to 3.67× on macOS), with no pipeline or API changes.

Does the AutoZyme speedup change the celda decontX output?

Effectively no. The output is tolerance-equivalent: held within a frozen concordance gate (up to about 0.6% drift from the original celda result) on every benchmark dataset.

How do I install the celda speedup?

In R: install the autozyme package, then run library(autozyme) and autozyme::activate("celda"). The patch applies automatically the next time you call celda decontX.
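A minimal session might look like the sketch below. It assumes autozyme and celda are already installed; this page does not say where autozyme is distributed, so no install command is shown. The activation call is the one quoted above. The simulated dataset and the decontX call follow celda's own documented example and serve only as illustration.

```r
# Assumes autozyme and celda are already installed.
library(autozyme)
autozyme::activate("celda")   # activation call as given above

library(celda)

# Small simulated dataset from celda's own helper (illustration only).
s <- simulateContamination(seed = 12345)

# Same decontX call as without the patch. Supplying z skips the
# UMAP + dbscan initialization; omit z to let decontX estimate clusters.
res <- decontX(s$observedCounts, z = s$z, seed = 12345)

summary(res$contamination)    # per-cell contamination estimates
dim(res$decontXcounts)        # decontaminated counts matrix
```

The default simulation is small (a few hundred cells), so by the supported-scope rule it would run on the serial path; it checks that the patched call works rather than demonstrating the speedup.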