
Speed up tradeSeq fitGAM

tradeSeq fitGAM is one of the slower steps in many single-cell genomics workflows. AutoZyme ships a verified, drop-in patch that is up to 5.13× faster, returning bit-for-bit identical results with no change to how you call it.

Best speedup: 5.13×
Median speedup: 5.15×
Output equivalence: bit-exact
Best runtime: 16.44 min baseline, 3.20 min optimized
Datasets: 5
Pass rate: 10/10

Benchmark charts

Speedup distribution
Each dot is one finalized dataset/thread run on Windows.
Thread sweep
Speedup across finalized thread counts on Windows
Dataset Tier Threads Baseline Optimized Speedup Memory
fitgam_ood_2lin_6000g_5000c ood_xlarge 1 16.44 min 12.12 min 1.35× 8.2 → 4.0 GB
fitgam_ood_2lin_6000g_5000c ood_xlarge 4 15.35 min 4.11 min 4.00× 5.8 → 2.8 GB
fitgam_ood_2lin_6000g_5000c ood_xlarge 8 16.44 min 3.20 min 5.13× 7.0 → 2.8 GB
fitgam_ood_2lin_3000g_3500c ood_large 4 5.31 min 1.69 min 3.14× 3.3 → 1.2 GB
fitgam_ood_2lin_3000g_3500c ood_large 8 5.31 min 1.64 min 3.23× 3.3 → 1.6 GB
fitgam_synth_4000g_3000c large 1 4.84 min 3.52 min 1.33× 3.2 → 2.1 GB
fitgam_synth_4000g_3000c large 4 4.63 min 1.56 min 3.00× 2.7 → 1.6 GB
fitgam_synth_4000g_3000c large 8 4.68 min 1.63 min 2.87× 3.0 → 1.6 GB
fitgam_synth_2000g_3000c medium 1 2.41 min 2.11 min 1.14× 2.0 → 1.5 GB
fitgam_synth_2000g_3000c medium 4 2.32 min 1.07 min 2.25× 2.0 → 1.2 GB
fitgam_synth_2000g_3000c medium 8 2.58 min 1.59 min 1.52× 2.0 → 1.2 GB
fitgam_synth_750g_3000c small 1 59.69 s 53.60 s 1.11× 1.2 → 1.1 GB
fitgam_synth_750g_3000c small 4 59.68 s 41.78 s 1.42× 1.2 → 1.0 GB
fitgam_synth_750g_3000c small 8 57.13 s 44.94 s 1.32× 1.2 → 1.0 GB
Memory
Baseline vs optimized peak memory on Windows
Dataset Tier Threads Baseline Optimized Optimized/baseline Speedup
fitgam_ood_2lin_6000g_5000c ood_xlarge 1 8.2 GB 4.0 GB 0.49× 1.35×
fitgam_ood_2lin_3000g_3500c ood_large 8 3.3 GB 1.6 GB 0.47× 3.23×
fitgam_synth_4000g_3000c large 1 3.2 GB 2.1 GB 0.67× 1.33×
fitgam_synth_2000g_3000c medium 8 2.0 GB 1.2 GB 0.60× 1.52×
fitgam_synth_750g_3000c small 1 1.2 GB 1.1 GB 0.94× 1.11×

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets fitGAM in tradeSeq. The benchmarked result passes the declared scientific output gate while reducing CPU runtime on the listed datasets.

Also searched as: trajectory differential expression, pseudotime DE, GAM, fitGAM, trajectory DE.

Supported scope


The fast path (fast_fitGAM_hoist_formula) is entered only when zyme=TRUE and conditions is NULL. Within that scope it handles the standard tradeSeq NB-GAM fitting workload: family="nb" with mgcv's log link, weights=NULL, an offset with no dim (a vector offset, the .get_offset default), sce=TRUE, and aic=FALSE. Both single-lineage input (ncol(pseudotime)==1, quantile-based knot placement with duplicate repair) and multi-lineage input (ncol(pseudotime)>=2, knot placement delegated to tradeSeq::.findKnots) are supported.

Both cases are benchmarked: single-lineage on the dev tiers (small/medium/large) and 2-lineage on the OOD tiers (ood_large/ood_xlarge, which force ncol==2 and the .findKnots branch). All pass the bit-exact gates (beta/sigma/X/knot max_abs_diff <=1e-8..1e-12, converged_match=1.0).

The hot-path acceleration (prefit_G reuse, hoisted formula template, fork-pool via .zyme_mclapply on macOS / PSOCK on Windows, and the mgcv::nb / gam.fit4 crossprod overrides) activates only when use_prefit_G is TRUE, that is, when is.null(weights) && is.null(dim(offset)) && family=="nb" && length(id)>0 (patch.R:335-336). The fork/parallel compact path additionally requires !verbose && sce && !aic && worker_count>1 (patch.R:416). When use_prefit_G is FALSE, the call falls through to a serial/pbapply per-gene mgcv::gam() with the same formula.

The scBLAS NB kernels (fast_dDeta and scblasR routing in fast_nb) are gated off by default via .az_feature_enabled('scblas_nb', default=FALSE); the default path uses the inline Rcpp nb_Dd_cpp/linkinv_log_cpp/nb_dev_resids_cpp, which are kept bit-exact. The mgcv gam.fit4 crossprod rewrite is a textual substitution pinned to mgcv 1.9.4 (a sentinel warns if the patterns miss).
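The gating conditions quoted above can be restated as a small predicate. This is a minimal sketch, not AutoZyme's code: the function name fast_path_status, its argument defaults, and the returned labels are hypothetical, and n_genes stands in for length(id). Only the boolean conditions themselves come from the text.

```r
# Hypothetical helper (not part of AutoZyme or tradeSeq): restates the gating
# conditions quoted in the text so a planned call can be checked up front.
fast_path_status <- function(zyme = TRUE, conditions = NULL, weights = NULL,
                             offset = NULL, family = "nb", n_genes = 1L,
                             verbose = FALSE, sce = TRUE, aic = FALSE,
                             worker_count = 1L) {
  # Entry gate: zyme=TRUE and conditions is NULL.
  if (!isTRUE(zyme) || !is.null(conditions)) {
    return("upstream fallback")
  }
  # use_prefit_G gate; n_genes stands in for length(id) in the quoted condition.
  use_prefit_G <- is.null(weights) && is.null(dim(offset)) &&
    identical(family, "nb") && n_genes > 0
  if (!use_prefit_G) {
    return("serial per-gene mgcv::gam()")
  }
  # The parallel compact path needs the additional conditions.
  if (!verbose && sce && !aic && worker_count > 1) {
    return("prefit_G with parallel compact path")
  }
  "prefit_G without parallel compact path"
}

status_parallel <- fast_path_status(worker_count = 8L)
status_weights  <- fast_path_status(weights = matrix(1, 2, 2))
status_cond     <- fast_path_status(conditions = factor(c("a", "b")))
```

Under these assumptions, a default call with eight workers reaches the parallel compact path, supplying weights drops to the serial per-gene fit, and supplying conditions skips the fast path entirely.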

Out-of-scope behavior

Calls outside the supported scope silently fall back to the upstream tradeSeq implementation.

Detailed speedup table (10 runs)
Dataset Tier Platform Threads Baseline Optimized Speedup Memory Concordance Pass
fitgam_ood_2lin_3000g_3500c ood_large Windows 8 5.31 min 1.64 min 3.23× 3.3 → 1.6 GB pass
fitgam_ood_2lin_6000g_5000c ood_xlarge Windows 8 16.44 min 3.20 min 5.13× 7.0 → 2.8 GB pass
fitgam_synth_2000g_3000c medium Windows 4 2.32 min 1.07 min 2.25× 2.0 → 1.2 GB pass
fitgam_synth_4000g_3000c large Windows 4 4.63 min 1.56 min 3.00× 2.7 → 1.6 GB pass
fitgam_synth_750g_3000c small Windows 4 59.68 s 41.78 s 1.42× 1.2 → 1.0 GB pass
fitgam_ood_2lin_3000g_3500c ood_large macOS 8 3.48 min 29.67 s 7.04× 5.1 → 1.5 GB pass
fitgam_ood_2lin_6000g_5000c ood_xlarge macOS 8 8.70 min 1.37 min 6.41× 13.0 → 2.4 GB pass
fitgam_synth_2000g_3000c medium macOS 8 59.92 s 10.66 s 5.63× 2.8 → 1.2 GB pass
fitgam_synth_4000g_3000c large macOS 8 2.13 min 20.20 s 6.32× 4.7 → 1.4 GB pass
fitgam_synth_750g_3000c small macOS 8 23.02 s 4.57 s 5.18× 1.7 → 1.1 GB pass
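The headline figures can be cross-checked against the rows above. The sketch below copies the Speedup column by hand (variable names are illustrative). It suggests that the reported median is taken across all ten runs on both platforms, while the 5.13× headline matches the best Windows run.

```r
# Speedup column of the detailed table, in row order:
# five Windows rows first, then five macOS rows.
speedup <- c(3.23, 5.13, 2.25, 3.00, 1.42,
             7.04, 6.41, 5.63, 6.32, 5.18)
platform <- rep(c("Windows", "macOS"), each = 5)

# Median over all ten runs: the mean of the 5th and 6th sorted values
# (5.13 and 5.18), i.e. about 5.155, in line with the reported 5.15x.
overall_median <- median(speedup)

# Per-platform summaries, looked up by name to avoid relying on level order.
median_by_platform <- tapply(speedup, platform, median)
best_by_platform   <- tapply(speedup, platform, max)

print(overall_median)
print(median_by_platform)
print(best_by_platform)
```

Because the published speedups were presumably computed from unrounded timings, recomputing them from the rounded Baseline and Optimized columns can differ in the last digit.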

Frequently asked questions

Speeding up tradeSeq fitGAM
Why is tradeSeq fitGAM slow?

tradeSeq fitGAM is CPU-bound: it fits a negative binomial GAM for each gene, and the stock implementation leaves performance on the table in that core numerical work. On the largest benchmark dataset (fitgam_ood_2lin_6000g_5000c, Windows, 8 threads) the original takes 16.44 min where the AutoZyme path takes 3.20 min (5.13× faster).

How do I make tradeSeq fitGAM faster?

Install AutoZyme and activate the tradeSeq patch, then keep using tradeSeq fitGAM exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 5.13× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the tradeSeq fitGAM output?

No. The accelerated path returns bit-for-bit identical results to the original tradeSeq implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.
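As an illustration of what such a check involves, here is a minimal sketch in base R. It is not AutoZyme's concordance gate: the function name concordance and the toy matrices are hypothetical. It only shows how a maximum absolute difference and a bit-exact comparison can be computed for two coefficient matrices.

```r
# Hypothetical check (not AutoZyme's gate): compare two numeric matrices,
# e.g. per-gene coefficients from a baseline run and an optimized run.
concordance <- function(baseline, optimized) {
  stopifnot(identical(dim(baseline), dim(optimized)))
  list(
    max_abs_diff = max(abs(baseline - optimized)),
    bit_exact    = identical(baseline, optimized)
  )
}

set.seed(1)
beta_baseline  <- matrix(rnorm(12), nrow = 4)  # toy stand-in for fitted coefficients
beta_optimized <- beta_baseline                # exact copy, so the check passes

res <- concordance(beta_baseline, beta_optimized)

# A tiny perturbation is caught by both measures.
beta_perturbed <- beta_baseline
beta_perturbed[1, 1] <- beta_perturbed[1, 1] + 1e-6
res_perturbed <- concordance(beta_baseline, beta_perturbed)
```

Reporting both values is a deliberate choice in this sketch: identical() is the stricter test, while max_abs_diff shows how far apart two results are when they are not bit-identical.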

How do I install the tradeSeq speedup?

In R: install the autozyme package, then run library(autozyme) and autozyme::activate("tradeseq"). The patch applies automatically the next time you call tradeSeq fitGAM.
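The activation step can be sketched as follows. Only library(autozyme) and autozyme::activate("tradeseq") come from the text; the wrapper activate_tradeseq_patch is a hypothetical convenience. The install source for autozyme is not stated here, so the sketch checks for the package rather than guessing an install command.

```r
# Sketch only: activate the tradeSeq patch if autozyme is available.
# activate_tradeseq_patch is a hypothetical wrapper, not an AutoZyme function.
activate_tradeseq_patch <- function() {
  if (!requireNamespace("autozyme", quietly = TRUE)) {
    message("autozyme is not installed; install it first, then re-run.")
    return(invisible(FALSE))
  }
  library(autozyme)
  autozyme::activate("tradeseq")  # the call named in the text
  invisible(TRUE)
}

patched <- activate_tradeseq_patch()

# After activation, fitGAM is called as usual. For example, with your own
# counts, pseudotime, and cellWeights objects:
# sce <- tradeSeq::fitGAM(counts = counts, pseudotime = pseudotime,
#                         cellWeights = cellWeights, nknots = 6)
```

The wrapper returns FALSE instead of failing when the package is missing, which keeps a pipeline script runnable on machines where the patch has not been installed.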