Speed up celda decontX

celda decontX is one of the slower steps in many single-cell genomics workflows. AutoZyme ships a verified, drop-in patch that is up to 2.80× faster on the Windows benchmark runs. Its output stays within a strict, verified tolerance, and you call decontX exactly as before.

Best speedup: 2.80× (Windows, 4 threads)
Median speedup: 1.75×
Output equivalence: within tolerance
Best runtime: 5.40 min baseline, 1.93 min optimized
Datasets: 5
Pass rate: 10/10

Benchmark charts

Speedup distribution (Windows): each dot is one finalized dataset/thread run.

Thread sweep (Windows): speedup across finalized thread counts
Dataset Tier Threads Baseline Optimized Speedup Memory
ext_hgmm_chemistry_20k_3p_HT large 1 5.67 min 3.68 min 1.51× 6.2 → 7.9 GB
ext_hgmm_chemistry_20k_3p_HT large 4 5.40 min 1.93 min 2.80× 6.2 → 7.9 GB
ext_hgmm_chemistry_20k_3p_HT large 8 5.87 min 2.85 min 2.06× 6.2 → 7.9 GB
ext_scmixology_sc_10x_3cl small 1 17.67 s 15.52 s 1.15× 1.4 → 1.4 GB
ext_scmixology_sc_10x_3cl small 4 14.71 s 8.17 s 1.80× 1.4 → 1.4 GB
ext_scmixology_sc_10x_3cl small 8 17.72 s 12.94 s 1.36× 1.4 → 1.4 GB
ext_pfc_snrna_herring_RL2103_ga22 medium 1 1.48 min 1.22 min 1.21× 2.9 → 3.1 GB
ext_pfc_snrna_herring_RL2103_ga22 medium 4 1.35 min 47.81 s 1.70× 2.9 → 3.1 GB
ext_pfc_snrna_herring_RL2103_ga22 medium 8 1.43 min 1.06 min 1.35× 2.9 → 3.1 GB
pbmc68k ood_xlarge 1 5.71 min 5.13 min 1.11× 6.9 → 7.1 GB
pbmc68k ood_xlarge 4 5.55 min 4.43 min 1.25× 6.9 → 7.1 GB
pbmc68k ood_xlarge 8 5.64 min 4.65 min 1.21× 6.9 → 7.1 GB
ifnb ood_large 1 46.69 s 43.82 s 1.11× 2.2 → 2.4 GB
ifnb ood_large 4 49.96 s 41.31 s 1.21× 2.2 → 2.4 GB
ifnb ood_large 8 54.02 s 43.70 s 1.24× 2.2 → 2.4 GB

Memory (Windows): baseline vs optimized peak memory
Dataset Tier Baseline Optimized Optimized/baseline
pbmc68k ood_xlarge 6.9 GB 7.1 GB 1.02×
ext_hgmm_chemistry_20k_3p_HT large 6.2 GB 7.9 GB 1.28×
ext_pfc_snrna_herring_RL2103_ga22 medium 2.9 GB 3.1 GB 1.07×
ifnb ood_large 2.2 GB 2.4 GB 1.08×
ext_scmixology_sc_10x_3cl small 1.4 GB 1.4 GB 1.01×
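
One way to read the Windows thread sweep is to ask which thread count gave the largest speedup for each dataset. The base R sketch below does that using the speedup figures reported above; the data frame layout and variable names are illustrative choices, not part of AutoZyme.

```r
# Windows thread-sweep speedups, copied from the figures reported on this page.
sweep <- data.frame(
  dataset = rep(c("ext_hgmm_chemistry_20k_3p_HT",
                  "ext_scmixology_sc_10x_3cl",
                  "ext_pfc_snrna_herring_RL2103_ga22",
                  "pbmc68k",
                  "ifnb"), each = 3),
  threads = rep(c(1, 4, 8), times = 5),
  speedup = c(1.51, 2.80, 2.06,
              1.15, 1.80, 1.36,
              1.21, 1.70, 1.35,
              1.11, 1.25, 1.21,
              1.11, 1.21, 1.24),
  stringsAsFactors = FALSE
)

# For each dataset, keep the row with the highest reported speedup.
best <- do.call(rbind, lapply(split(sweep, sweep$dataset), function(d) {
  d[which.max(d$speedup), ]
}))
print(best[, c("dataset", "threads", "speedup")], row.names = FALSE)
```

In these runs 4 threads is the best setting for four of the five datasets and 8 threads only for ifnb, so it may be worth timing your own data at more than one thread count rather than assuming that more threads is always faster.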

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets decontX in celda. The benchmarked result preserves the declared scientific output gate while reducing CPU runtime on the listed datasets.

Also searched as: celda, ambient RNA, decontamination, contamination removal, quality control.

Supported scope

The patch overrides four celda namespace functions that decontX calls internally. The upstream decontX driver (argument parsing, the per-batch loop over .decontXoneBatch, the z=NULL vs user-z dispatch, and the EM convergence test on max|Δθ|) is left fully intact, so the fast path runs for the same parameter space that decontX itself accepts. Specifically:

(1) decontXLogLik is restored to a passthrough of celda's original C++ log-likelihood (LL). The header comment calling it a no-op is stale; it computes the real LL, so LL-based diagnostics stay correct.
(2) .decontxInitializeZ reproduces celda's UMAP + dbscan (with kmeans fallback) initialization on a raw counts matrix or SingleCellExperiment, differing only in that it passes auto_threads(cap=8) to scater::calculateUMAP. It is reached only when z=NULL (celda skips initialization when the user supplies z), and it respects varGenes, dbscanEps, estimateCellTypes, and seed.
(3) calculateNativeMatrix is a sparse R reimplementation of celda's normp-weighted native-count formula, which keeps res$decontXcounts a real output.
(4) decontXEM is an RcppEigen + std::thread reimplementation of one EM iteration, math-equivalent to upstream and reported bit-exact at all tested tiers. It honors estimate_eta, estimate_delta, delta (2-element prior), pseudocount, theta, and per-cell counts.

Parallelism engages only when nC >= 500 and more than one thread is requested; otherwise an identical serial path runs. estimate_delta delegates to MCMCprecision::fit_dirichlet exactly as upstream does. Because the patch sits below the batch loop (celda invokes decontXEM once per batch), batch != NULL is handled correctly through upstream's preserved dispatch.

The patch was benchmarked only at the default configuration (delta=c(10,10), estimateDelta=TRUE, varGenes=5000, dbscanEps=1, maxIter=500, seed=12345) across the tiny, small, medium, and large tiers plus two out-of-distribution (OOD) tiers. Speedups were about 1.1-1.5x at 1 thread and up to ~2.85x at 4 threads.
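
Two rules in the scope text are easy to state as code: the condition under which the parallel EM path engages, and the fact that only the default configuration was benchmarked. The base R sketch below encodes both. The helper names (uses_parallel_em, outside_benchmarked_config) are hypothetical and are not AutoZyme or celda functions. It also assumes that nC denotes the number of cells in a batch.

```r
# Benchmarked defaults, copied from the supported-scope text above.
benchmarked_config <- list(
  delta = c(10, 10),
  estimateDelta = TRUE,
  varGenes = 5000,
  dbscanEps = 1,
  maxIter = 500,
  seed = 12345
)

# Hypothetical helper: TRUE when the parallel decontXEM path would engage,
# following the stated rule "nC >= 500 AND requested threads > 1".
# Assumes nC is the number of cells.
uses_parallel_em <- function(n_cells, threads) {
  n_cells >= 500 && threads > 1
}

# Hypothetical helper: names of supplied arguments whose values differ from
# the benchmarked defaults. Arguments not listed above are ignored.
outside_benchmarked_config <- function(args) {
  shared <- intersect(names(args), names(benchmarked_config))
  differs <- vapply(shared, function(nm) {
    !isTRUE(all.equal(args[[nm]], benchmarked_config[[nm]]))
  }, logical(1))
  shared[differs]
}

uses_parallel_em(n_cells = 499, threads = 8)    # FALSE: below the 500-cell threshold
uses_parallel_em(n_cells = 20000, threads = 1)  # FALSE: only one thread requested
uses_parallel_em(n_cells = 20000, threads = 4)  # TRUE
outside_benchmarked_config(list(delta = c(10, 10), maxIter = 200))  # "maxIter"
```

A non-empty result from the second helper does not mean the patch is unsupported for that call, since the scope text says the fast path covers the same parameter space as decontX. It only flags that the published timings were not measured at those settings.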

Out-of-scope behavior

Silent, possibly wrong: outside the supported scope the patch raises no warning or error, and results may be incorrect.

Detailed speedup table (10 runs)
Dataset Tier Platform Threads Baseline Optimized Speedup Memory Concordance Pass
ext_hgmm_chemistry_20k_3p_HT large Windows 4 5.40 min 1.93 min 2.80× 6.2 → 7.9 GB pass
ext_pfc_snrna_herring_RL2103_ga22 medium Windows 4 1.35 min 47.81 s 1.70× 2.9 → 3.1 GB pass
ext_scmixology_sc_10x_3cl small Windows 4 14.71 s 8.17 s 1.80× 1.4 → 1.4 GB pass
ifnb ood_large Windows 8 54.02 s 43.70 s 1.24× 2.2 → 2.4 GB pass
pbmc68k ood_xlarge Windows 4 5.55 min 4.43 min 1.25× 6.9 → 7.1 GB pass
ext_hgmm_chemistry_20k_3p_HT large macOS 8 4.99 min 1.38 min 3.67× 10.4 → 10.5 GB pass
ext_pfc_snrna_herring_RL2103_ga22 medium macOS 8 40.27 s 20.12 s 2.04× 5.0 → 5.8 GB pass
ext_scmixology_sc_10x_3cl small macOS 8 7.66 s 3.75 s 2.13× 1.6 → 1.9 GB pass
ifnb ood_large macOS 4 19.02 s 14.82 s 1.28× 3.5 → 3.3 GB pass
pbmc68k ood_xlarge macOS 4 2.44 min 1.70 min 1.41× 9.0 → 8.8 GB pass
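
The headline figures can be cross-checked against the ten rows above. The base R sketch below recomputes the median and the per-platform summary from the reported speedups. It assumes the median headline is taken over exactly these ten runs, which the page does not state explicitly.

```r
# Reported speedups from the detailed table, in row order.
runs <- data.frame(
  platform = rep(c("Windows", "macOS"), each = 5),
  speedup = c(2.80, 1.70, 1.80, 1.24, 1.25,   # Windows rows
              3.67, 2.04, 2.13, 1.28, 1.41)   # macOS rows
)

# Median over all ten runs; this matches the 1.75x headline.
overall_median <- median(runs$speedup)

# Per-platform medians and maxima.
platform_median <- tapply(runs$speedup, runs$platform, median)
platform_max <- tapply(runs$speedup, runs$platform, max)

# Ratio of the rounded runtimes for the best Windows run (5.40 min vs 1.93 min).
best_windows_ratio <- 5.40 / 1.93

print(overall_median)
print(platform_median)
print(platform_max)
print(round(best_windows_ratio, 2))
```

The per-platform maxima show that the 2.80× headline corresponds to the best Windows run, while the best macOS run in the table is 3.67×. Ratios recomputed from the rounded runtimes in the other rows can differ from the reported speedups in the last digit.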

Frequently asked questions

Speeding up celda decontX
Why is celda decontX slow?

celda decontX is CPU-bound, and the stock implementation in celda leaves performance on the table in its core numerical work. In the best benchmarked case (ext_hgmm_chemistry_20k_3p_HT on Windows at 4 threads), the original takes 5.40 min where the AutoZyme path takes 1.93 min (2.80× faster).

How do I make celda decontX faster?

Install AutoZyme and activate the celda patch, then keep using celda decontX exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 2.80× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the celda decontX output?

Effectively no. The output is tolerance-equivalent: held within a frozen concordance gate (up to about 0.6% drift from the original celda result) on every benchmark dataset.
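
To run a similar check on your own data, you can compare a baseline result with a patched result numerically. The page does not say which metric the concordance gate uses, so the base R sketch below assumes a maximum relative difference and a 0.006 threshold taken from the "about 0.6%" figure. The function names are hypothetical.

```r
# Hypothetical helper: largest relative difference between two numeric
# results of the same shape. `eps` guards against division by zero.
max_rel_drift <- function(baseline, optimized, eps = 1e-12) {
  stopifnot(length(baseline) == length(optimized))
  max(abs(optimized - baseline) / pmax(abs(baseline), eps))
}

# Hypothetical gate: TRUE when the drift stays within `tol`.
# The 0.006 default is an assumption based on the "about 0.6%" figure above.
within_gate <- function(baseline, optimized, tol = 0.006) {
  max_rel_drift(baseline, optimized) <= tol
}

# Toy per-cell contamination estimates (made-up numbers for illustration).
baseline <- c(0.10, 0.20, 0.05)
within_gate(baseline, baseline * 1.004)  # TRUE: 0.4% drift
within_gate(baseline, baseline * 1.010)  # FALSE: 1% drift
```

With real runs, baseline and optimized would be the same output field (for example the per-cell contamination estimates) from stock celda and from the patched path, computed with the same seed.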

How do I install the celda speedup?

In R: install the autozyme package, then run library(autozyme) and autozyme::activate("celda"). The patch applies automatically the next time you call celda decontX.
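
A minimal sketch of that flow is shown below. The activate("celda") call comes from the answer above, and decontX is celda's usual entry point. The wrapper function and its guard are illustrative additions so the snippet degrades gracefully when either package is missing. How autozyme is installed is not stated on this page, so that step is left out.

```r
# Hypothetical wrapper: activate the patch, then call decontX as usual.
# `counts` is a raw counts matrix (genes x cells) or a SingleCellExperiment.
run_decontx_with_autozyme <- function(counts) {
  have_pkgs <- requireNamespace("autozyme", quietly = TRUE) &&
    requireNamespace("celda", quietly = TRUE)
  if (!have_pkgs) {
    message("autozyme and/or celda is not installed; skipping.")
    return(invisible(NULL))
  }
  # Per the text above: activate the celda patch once per session.
  autozyme::activate("celda")
  # The call itself is unchanged from stock celda.
  celda::decontX(counts)
}
```

For a counts matrix input, the result is expected to carry the decontaminated counts in res$decontXcounts, as described in the supported scope.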