Benchmark charts
Speedup distribution
Each dot is one finalized dataset/thread run on Windows.
Thread sweep
Speedup across finalized thread counts on Windows.
Memory
Baseline vs optimized peak memory on Windows.
What is accelerated
This task targets decontX in celda. The benchmarked result
preserves the declared scientific output gate while reducing CPU runtime on the listed datasets.
Also searched as: celda, ambient RNA, decontamination, contamination removal, quality control.
Supported scope
The patch overrides four celda namespace functions that decontX calls internally. The upstream decontX driver (argument parsing, the per-batch loop over .decontXoneBatch, the z=NULL vs user-z dispatch, the EM convergence test on max|Δθ|) is left fully intact, so the fast path runs for the same parameter space decontX itself accepts. Specifically:
(1) decontXLogLik is restored to a passthrough of celda's original C++ LL (the header comment calling it a no-op is stale; it computes the real LL), so LL-based diagnostics stay correct.
(2) .decontxInitializeZ reproduces celda's UMAP+dbscan(+kmeans fallback) initialization on a raw counts matrix or SingleCellExperiment, differing only by passing auto_threads(cap=8) to scater::calculateUMAP. It is reached only when z=NULL (celda skips init when the user supplies z), and respects varGenes, dbscanEps, estimateCellTypes, and seed.
(3) calculateNativeMatrix is a sparse R reimplementation of celda's normp-weighted native-count formula, keeping res$decontXcounts a real output.
(4) decontXEM is an RcppEigen + std::thread reimplementation of one EM iteration, math-equivalent to upstream and reported bit-exact at all tested tiers. It honors estimate_eta, estimate_delta, delta (2-element prior), pseudocount, theta, and per-cell counts.
Parallelism engages only when nC >= 500 AND requested threads > 1 (otherwise an identical serial path runs), and estimate_delta delegates to MCMCprecision::fit_dirichlet exactly as upstream. Because the patch sits below the batch loop (decontXEM is invoked once per batch by celda), batch != NULL is handled correctly via upstream's preserved dispatch.
Benchmarked only at the default config (delta=c(10,10), estimateDelta=TRUE, varGenes=5000, dbscanEps=1, maxIter=500, seed=12345) across tiny/small/medium/large + two OOD tiers; honest speedups ~1.1-1.5x at 1 thread, up to ~2.85x at 4 threads.
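The two decision rules named in the scope notes (the parallelism gate and the max|Δθ| convergence test) can be sketched in base R. This is only an illustration: `use_parallel_path` and `has_converged` are hypothetical names, not functions from celda or the patch, and the strict `<` comparison and the 0.001 tolerance (decontX's documented default for `convergence`) are assumptions about how the check is applied.

```r
# Hypothetical helper: mirrors the stated rule "nC >= 500 AND requested threads > 1".
# When it returns FALSE, the serial path described in the scope notes would run.
use_parallel_path <- function(n_cells, threads) {
  n_cells >= 500 && threads > 1
}

# Hypothetical helper: EM is treated as converged once the largest absolute
# change in theta (per-cell native proportion) drops below the tolerance.
# The strict "<" and tol = 0.001 are assumptions for this sketch.
has_converged <- function(theta_old, theta_new, tol = 0.001) {
  max(abs(theta_new - theta_old)) < tol
}

use_parallel_path(499, 8)     # FALSE: too few cells
use_parallel_path(20000, 1)   # FALSE: only one thread requested
use_parallel_path(20000, 4)   # TRUE: both conditions hold

has_converged(c(0.90, 0.80), c(0.9004, 0.8001))  # TRUE: max change is 0.0004
```

Both conditions of the gate must hold, so a single-threaded request on a large dataset and a multi-threaded request on a tiny one would both take the serial path.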
Out-of-scope behavior
Silent, possibly wrong
Detailed speedup table (10 runs)
| Dataset | Tier | Platform | Threads | Baseline runtime | Optimized runtime | Speedup | Peak memory (baseline → optimized) | Concordance | Pass |
|---|---|---|---|---|---|---|---|---|---|
| ext_hgmm_chemistry_20k_3p_HT | large | Windows | 4 | 5.40 min | 1.93 min | 2.80× | 6.2 → 7.9 GB | — | pass |
| ext_pfc_snrna_herring_RL2103_ga22 | medium | Windows | 4 | 1.35 min | 47.81 s | 1.70× | 2.9 → 3.1 GB | — | pass |
| ext_scmixology_sc_10x_3cl | small | Windows | 4 | 14.71 s | 8.17 s | 1.80× | 1.4 → 1.4 GB | — | pass |
| ifnb | ood_large | Windows | 8 | 54.02 s | 43.70 s | 1.24× | 2.2 → 2.4 GB | — | pass |
| pbmc68k | ood_xlarge | Windows | 4 | 5.55 min | 4.43 min | 1.25× | 6.9 → 7.1 GB | — | pass |
| ext_hgmm_chemistry_20k_3p_HT | large | macOS | 8 | 4.99 min | 1.38 min | 3.67× | 10.4 → 10.5 GB | — | pass |
| ext_pfc_snrna_herring_RL2103_ga22 | medium | macOS | 8 | 40.27 s | 20.12 s | 2.04× | 5.0 → 5.8 GB | — | pass |
| ext_scmixology_sc_10x_3cl | small | macOS | 8 | 7.66 s | 3.75 s | 2.13× | 1.6 → 1.9 GB | — | pass |
| ifnb | ood_large | macOS | 4 | 19.02 s | 14.82 s | 1.28× | 3.5 → 3.3 GB | — | pass |
| pbmc68k | ood_xlarge | macOS | 4 | 2.44 min | 1.70 min | 1.41× | 9.0 → 8.8 GB | — | pass |
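The Speedup column is the ratio of baseline to optimized runtime. As a minimal base-R sketch, the snippet below recomputes it for three Windows rows, using the runtimes as displayed in the table (the data frame and its column names are illustrative only). Because the displayed runtimes are rounded, a ratio recomputed this way can differ slightly from the published figure on some rows.

```r
# Runtimes in seconds, copied from three Windows rows of the table.
# Minute values are converted to seconds so all rows share one unit.
runs <- data.frame(
  dataset   = c("ext_hgmm_chemistry_20k_3p_HT",
                "ext_scmixology_sc_10x_3cl",
                "ifnb"),
  baseline  = c(5.40 * 60, 14.71, 54.02),
  optimized = c(1.93 * 60, 8.17, 43.70)
)

# Speedup = baseline runtime / optimized runtime, rounded to two decimals.
runs$speedup <- round(runs$baseline / runs$optimized, 2)

print(runs[, c("dataset", "speedup")])
# speedup column: 2.80, 1.80, 1.24
```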
Frequently asked questions
Why is celda decontX slow?
celda decontX is CPU-bound, and the stock implementation in celda leaves performance on the table in its core numerical work. On the large benchmark dataset (ext_hgmm_chemistry_20k_3p_HT, Windows, 4 threads), the original takes 5.40 min where the AutoZyme path takes 1.93 min (2.80× faster).
How do I make celda decontX faster?
Install AutoZyme and activate the celda patch, then keep using celda decontX exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 2.80× faster on the Windows benchmark datasets, with no pipeline or API changes.
Does the AutoZyme speedup change the celda decontX output?
Effectively no. The output is tolerance-equivalent: held within a frozen concordance gate (up to about 0.6% drift from the original celda result) on every benchmark dataset.
How do I install the celda speedup?
In R: install the autozyme package, then run library(autozyme) and autozyme::activate("celda"). The patch applies automatically the next time you call celda decontX.
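The activation steps above can be wrapped in a short R sketch. `autozyme::activate("celda")` is taken from the text, and the decontX arguments are the default configuration listed in the supported scope; `run_decontx`, `default_config`, and `counts` (a genes-by-cells count matrix) are hypothetical names introduced for illustration. The installation source for autozyme is not stated here, so the sketch only checks that both packages are already installed.

```r
# Default decontX configuration used for the benchmarks (from the scope notes).
default_config <- list(
  delta         = c(10, 10),
  estimateDelta = TRUE,
  varGenes      = 5000,
  dbscanEps     = 1,
  maxIter       = 500,
  seed          = 12345
)

# Hypothetical wrapper: activates the patch, then calls decontX as usual.
run_decontx <- function(counts, config = default_config) {
  # Fail early with a clear message if either package is missing.
  for (pkg in c("autozyme", "celda")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Package '", pkg, "' must be installed first.")
    }
  }
  autozyme::activate("celda")  # activation call as given in the text
  do.call(celda::decontX, c(list(x = counts), config))
}
```

With a count matrix in hand, `res <- run_decontx(counts)` would then be expected to return the usual decontX result, with the decontaminated counts in `res$decontXcounts`.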