Speed up spacexr RCTD: up to 23.7× faster, validated output

Benchmark charts

Speedup distribution

Each dot is one finalized dataset/thread run on Windows (log scale).

lymph_node_rctd_10000…: 23.7×
lymph_node_rctd_4033: 23.7×
lymph_node_rctd_1498: 18.8×
lymph_node_rctd_500: 11.6×

Thread sweep

Speedup across finalized thread counts on Windows, for lymph_node_rctd_100…, lymph_node_rctd_4033, lymph_node_rctd_1498 and lymph_node_rctd_500.

Memory

Baseline vs optimized peak memory on Windows (series: baseline, optimized).

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets run.RCTD in spacexr. The benchmarked result preserves the declared scientific output gate while reducing CPU runtime on the listed datasets.

Supported scope

Accelerates ONLY the doublet-mode pixel-fitting loop reached by run.RCTD(., doublet_mode="doublet"), and the "full" mode loop (decompose_batch is patched).

It patches 10 spacexr namespace internals with C++/closed-form kernels: calc_log_l_vec, get_der_fast, solveWLS, solveIRWLS.weights, psd, process_bead_doublet, decompose_sparse, gather_results, process_beads_batch, decompose_batch.

Fast C++ paths fire only for the non-bulk, non-constrained branches that fitPixels invokes (fitPixels calls process_beads_batch with constrain=F). Specifically:

get_der_fast: non-bulk.
solveWLS, p=1: closed-form scalar Newton (exact), only when !bulk_mode && !constrain.
solveWLS, p=2: 4-vertex active-set, equivalent to solve.QP for the 2x2 bound problem, only when !bulk_mode && !constrain.
solveWLS, p>2: non-bulk unconstrained; uses R quadprog single Newton step with the patched get_der_fast.
solveIRWLS.weights: non-bulk unconstrained, ncol(S)>2, via rctd_cpp_irwls_full_nonbulk.
decompose_sparse: unconstrained, p<=2, via rctd_cpp_irwls_sparse_p12.
psd: closed-form for 1x1 and 2x2.
process_bead_doublet: fused C++ candidate scoring + sparse pair refit in the !constrain branch.

Validation: full cell-type weights are reproduced exactly (pearson_weights=1.0), and the small/medium/large/ood_xlarge tiers pass all thresholds on macOS and Windows.

Threads: {1,4,8} supported (mclapply fork on non-Windows when cores>1; PSOCK on Windows only when N>=6000 pixels, capped at 4 workers; otherwise serial).

Precondition: the likelihood globals Q_mat/SQ_mat/X_vals/K_val must be populated by spacexr::set_likelihood_vars (done by run.RCTD before fitPixels) before any fast_* runs.
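The thread rule in the scope text can be read as a small decision function. The following is a minimal Python sketch of that reading, not AutoZyme's actual dispatch code: the function name `choose_backend` and its return labels are hypothetical, and it assumes that a single-core request stays serial on Windows even at or above 6000 pixels, which the text does not state explicitly.

```python
def choose_backend(is_windows: bool, cores: int, n_pixels: int) -> tuple[str, int]:
    """Return (backend, workers) under the rule quoted in the scope text.

    Hypothetical helper for illustration only; the labels "fork", "psock"
    and "serial" are this sketch's names, not AutoZyme identifiers.
    """
    if cores <= 1:
        # Assumption: one requested core always means serial execution.
        return ("serial", 1)
    if not is_windows:
        # Non-Windows: fork-based mclapply whenever cores > 1.
        return ("fork", cores)
    if n_pixels >= 6000:
        # Windows: PSOCK only for N >= 6000 pixels, capped at 4 workers.
        return ("psock", min(cores, 4))
    # Windows with fewer than 6000 pixels: serial, whatever was requested.
    return ("serial", 1)


print(choose_backend(is_windows=True, cores=8, n_pixels=10000))
print(choose_backend(is_windows=True, cores=8, n_pixels=4033))
print(choose_backend(is_windows=False, cores=4, n_pixels=500))
```

Under this reading, an 8-thread request on Windows uses 4 PSOCK workers for a 10000-pixel dataset and runs serially for the 4033-pixel one.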

Out-of-scope behavior

Silent and possibly wrong: inputs outside the supported scope raise no warning, and their results may be incorrect.

Detailed speedup table (10 runs)

Dataset	Tier	Platform	Threads	Baseline	Optimized	Speedup	Peak memory (baseline → optimized)	Concordance	Pass
`lymph_node_rctd_10000_boot`	ood_xlarge	Windows	1	38.95 min	1.72 min	23.7×	2.1 → 1.8 GB	—	pass
`lymph_node_rctd_1498`	medium	Windows	1	5.12 min	16.34 s	18.8×	1.2 → 1.0 GB	—	pass
`lymph_node_rctd_3000_seed42`	ood_large	Windows	1	11.31 min	36.80 s	18.4×	1.4 → 1.2 GB	—	fail
`lymph_node_rctd_4033`	large	Windows	1	13.51 min	34.21 s	23.7×	1.6 → 1.3 GB	—	pass
`lymph_node_rctd_500`	small	Windows	1	1.83 min	9.50 s	11.6×	1.1 → 1.0 GB	—	pass
`lymph_node_rctd_10000_boot`	ood_xlarge	macOS	1	18.94 min	53.51 s	21.4×	3.5 → 2.6 GB	—	pass
`lymph_node_rctd_1498`	medium	macOS	1	2.95 min	8.45 s	21.0×	2.5 → 1.5 GB	—	pass
`lymph_node_rctd_3000_seed42`	ood_large	macOS	1	5.96 min	21.96 s	16.5×	2.1 → 1.7 GB	—	pass
`lymph_node_rctd_4033`	large	macOS	1	7.74 min	21.83 s	21.3×	2.6 → 2.0 GB	—	pass
`lymph_node_rctd_500`	small	macOS	1	1.12 min	8.70 s	7.70×	2.0 → 1.3 GB	—	pass
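
The Speedup column is the baseline time divided by the optimized time once both are in the same unit. A minimal Python sketch of that arithmetic, using two Windows rows from the table; the helper names `to_seconds` and `speedup` are illustrative and not part of AutoZyme or spacexr:

```python
def to_seconds(value: float, unit: str) -> float:
    """Convert a table duration given in "min" or "s" to seconds."""
    factors = {"s": 1.0, "min": 60.0}
    return value * factors[unit]


def speedup(baseline: tuple[float, str], optimized: tuple[float, str]) -> float:
    """Return baseline time divided by optimized time."""
    return to_seconds(*baseline) / to_seconds(*optimized)


# Rows copied from the table (Windows, 1 thread).
large = speedup((13.51, "min"), (34.21, "s"))   # lymph_node_rctd_4033
medium = speedup((5.12, "min"), (16.34, "s"))   # lymph_node_rctd_1498

print(f"{large:.1f}x, {medium:.1f}x")  # prints "23.7x, 18.8x"
```

Because the table times are rounded, a ratio recomputed this way can differ slightly from the listed speedup for some rows.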

Frequently asked questions

Speeding up spacexr RCTD

Why is spacexr RCTD slow?

spacexr RCTD is CPU-bound, and the stock implementation in spacexr leaves performance on the table in its core numerical work. On the largest benchmark dataset (lymph_node_rctd_10000_boot, Windows, 1 thread), the original takes 38.95 min where the AutoZyme path takes 1.72 min (23.7× faster).

How do I make spacexr RCTD faster?

Install AutoZyme and activate the spacexr patch, then keep using spacexr RCTD exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 23.7× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the spacexr RCTD output?

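Differences are small and bounded. On the benchmark runs that pass, the output is concordance-validated to within roughly 1.5 to 5% of the original spacexr result, inside a frozen gate. The speedup table marks one Windows run (lymph_node_rctd_3000_seed42, ood_large tier) as fail.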
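
To make the idea of a concordance gate concrete, here is a minimal Python sketch that compares baseline and optimized cell-type weights for one pixel. It is not AutoZyme's gate: the function names, the Pearson threshold of 0.99, and the 5% relative-difference limit (taken from the upper end of the range quoted above) are assumptions for illustration, and the weight vectors are made-up numbers.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def max_relative_diff(baseline: list[float], optimized: list[float],
                      floor: float = 1e-12) -> float:
    """Largest |optimized - baseline| / |baseline| over all entries."""
    return max(abs(o - b) / max(abs(b), floor)
               for b, o in zip(baseline, optimized))


def passes_gate(baseline: list[float], optimized: list[float],
                min_pearson: float = 0.99, max_rel: float = 0.05) -> bool:
    """Hypothetical gate: high correlation and bounded relative difference."""
    return (pearson(baseline, optimized) >= min_pearson
            and max_relative_diff(baseline, optimized) <= max_rel)


# Made-up weights for one pixel across three cell types.
baseline_w = [0.50, 0.30, 0.20]
close_w = [0.51, 0.295, 0.195]   # every entry within 2.5% of baseline
far_w = [0.60, 0.25, 0.15]       # first entry is 20% off

print(passes_gate(baseline_w, close_w))  # True
print(passes_gate(baseline_w, far_w))    # False
```

A real gate would run over all pixels and cell types and use the thresholds frozen for the benchmark; the sketch only shows the shape of such a check.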

How do I install the spacexr speedup?

In R: install the autozyme package, then run library(autozyme) and autozyme::activate("spacexr"). The patch applies automatically the next time you call run.RCTD.