Speed up WGCNA blockwise: up to 66.4× faster, identical output

Benchmark charts

Figure: Speedup distribution (log scale). Each dot is one finalized dataset/thread run on Windows. Labeled values: heart_adult_25kx5p5k 66.4×, pbmc200k_glaucoma_20k… 60.2×, pbmc68k_15kx3k 40.2×, gastrulation_mouse_20… 36.9×, pbmc68k_5kx2k 20.6×.

Figure: Thread sweep. Speedup across finalized thread counts on Windows, for the same five datasets.

Figure: Memory. Baseline vs optimized peak memory on Windows.

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

The patch targets WGCNA::blockwiseModules. On the listed datasets, the benchmarked result passes the declared scientific output gate while reducing CPU runtime.


Supported scope

The fast path is correct only for WGCNA's "common case" as explicitly gated in .fast_tom_kernel_dispatch (patch.R lines 359-367): corType="pearson" (CcorType==0), networkType="unsigned" (CnetworkType==0), TOMType="signed" (CTOMType==2), TOMDenom="min" (TOMDenomC==0), no observation weights (weights NULL), cosineCorrelation FALSE, replaceMissingAdjacencies FALSE, suppressTOMForZeroAdjacencies FALSE, suppressNegativeTOM FALSE, useInternalMatrixAlgebra FALSE, and no NAs in the per-block expression submatrix (!anyNA(selExpr)).

When all of those hold, the per-block TOM is computed via matrixStats column z-score + BLAS crossprod (Apple Accelerate on macOS, dynamic BLAS on Windows, forked-chunk crossprod on other Unix, direct crossprod fallback) and is claimed bit-perfect vs WGCNA's C kernel. For any other combination the dispatch falls through to the original .Call("tomSimilarity_call", PACKAGE="WGCNA"), so non-common-case TOM is handled correctly by upstream.

The other three namespace overrides are independently guarded: fast_moduleEigengenes defers to the original when zyme=FALSE and reimplements the upstream eigengene pipeline (irlba truncated SVD, matrixStats row-scale) for arbitrary colors/nPC/align/impute/subHubs; fast_goodSamplesGenes short-circuits to all-TRUE only after verifying no weights, no NAs, and all-finite nonzero column variances, otherwise defers to upstream; fast_collectGarbage is an unconditional no-op.

blockwiseModules itself is body-patched (dead scale() skip + TOM .Call redirection) with a guarded fallback to the unmodified original if either string substitution fails to match (e.g. upstream version drift); tested against WGCNA 1.74.

The benchmarked params (power=4, signed TOM, unsigned network, pearson, min denom, no weights, clean HVG matrix) sit squarely inside the common-case gate.
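
The z-score plus crossprod step named in the supported scope rests on a standard identity: for a complete (NA-free) matrix, the Pearson correlation of its columns equals the cross-product of the column z-scores divided by n - 1. The sketch below illustrates that identity in base R only. It uses scale() where the patch is described as using matrixStats, it stops at the correlation matrix rather than the TOM, and it checks agreement to numerical tolerance, not the bit-perfect claim. The names expr, z, and cor_crossprod are illustrative.

```r
# Illustrative only: Pearson correlation via column z-scores and crossprod.
set.seed(42)
expr <- matrix(rnorm(50 * 8), nrow = 50)  # 50 samples x 8 genes, no NAs
n <- nrow(expr)

# Column z-scores: subtract the column mean, divide by the column sd
# (scale() uses the n - 1 denominator, matching cor()).
z <- scale(expr)

# A single matrix product, which R hands to the linked BLAS.
cor_crossprod <- crossprod(z) / (n - 1)

# Agreement with cor() up to floating-point tolerance.
agrees <- isTRUE(all.equal(cor_crossprod, cor(expr), check.attributes = FALSE))
```

Because the whole computation is one dense matrix product, its speed depends mostly on which BLAS R is linked against, which is consistent with the per-platform BLAS choices listed above.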

Out-of-scope behavior

Inputs outside the supported scope fall back silently to the upstream WGCNA implementation.
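
The gate-then-fallback pattern can be sketched as follows. This is a user-level restatement of the conditions listed under Supported scope, not the package's .fast_tom_kernel_dispatch: the names common_case, in_common_case, and dispatch_tom are hypothetical, and the two path functions are placeholders supplied by the caller.

```r
# Hypothetical sketch of a common-case gate with fallback to upstream.
common_case <- list(
  corType = "pearson", networkType = "unsigned",
  TOMType = "signed", TOMDenom = "min",
  cosineCorrelation = FALSE, replaceMissingAdjacencies = FALSE,
  suppressTOMForZeroAdjacencies = FALSE, suppressNegativeTOM = FALSE,
  useInternalMatrixAlgebra = FALSE
)

# TRUE only when every setting matches the common case, there are no
# observation weights, and the expression submatrix has no NAs.
in_common_case <- function(selExpr, opts = list(), weights = NULL) {
  merged <- modifyList(common_case, opts)
  settings_match <- identical(merged[names(common_case)], common_case)
  settings_match && is.null(weights) && !anyNA(selExpr)
}

# Route to the fast path inside the gate, otherwise to the upstream path.
dispatch_tom <- function(selExpr, fast_path, upstream_path,
                         opts = list(), weights = NULL) {
  if (in_common_case(selExpr, opts, weights)) {
    fast_path(selExpr)
  } else {
    upstream_path(selExpr)
  }
}

x <- matrix(seq_len(20) / 7, nrow = 5)
fast <- function(m) "fast"
upstream <- function(m) "upstream"

dispatch_tom(x, fast, upstream)                                  # "fast"
dispatch_tom(x, fast, upstream, opts = list(corType = "bicor"))  # "upstream"
```

Keeping the gate as a single predicate means any unsupported setting, weight vector, or missing value takes the upstream route without the caller having to change anything.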

Detailed speedup table (10 runs)

Dataset	Tier	Platform	Threads	Baseline runtime	Optimized runtime	Speedup	Peak memory (baseline → optimized)	Concordance	Pass
`gastrulation_mouse_20kx4k`	ood_large	Windows	8	9.50 min	15.30 s	36.9×	6.1 → 3.9 GB	—	pass
`heart_adult_25kx5p5k`	ood_xlarge	Windows	8	22.96 min	21.24 s	66.4×	10.2 → 7.0 GB	—	pass
`pbmc200k_glaucoma_20kx4k`	large	Windows	8	11.11 min	11.06 s	60.2×	6.4 → 4.4 GB	—	pass
`pbmc68k_15kx3k`	medium	Windows	8	4.50 min	6.72 s	40.2×	3.6 → 2.5 GB	—	pass
`pbmc68k_5kx2k`	small	Windows	8	55.29 s	2.69 s	20.6×	1.3 → 0.9 GB	—	pass
`gastrulation_mouse_20kx4k`	ood_large	macOS	4	6.68 min	7.13 s	56.6×	10.6 → 7.4 GB	—	pass
`heart_adult_25kx5p5k`	ood_xlarge	macOS	4	19.51 min	10.33 s	115.7×	15.1 → 11.2 GB	—	pass
`pbmc200k_glaucoma_20kx4k`	large	macOS	8	9.24 min	5.53 s	101.9×	10.2 → 6.7 GB	—	pass
`pbmc68k_15kx3k`	medium	macOS	8	3.73 min	3.27 s	69.7×	6.2 → 3.8 GB	—	pass
`pbmc68k_5kx2k`	small	macOS	8	39.45 s	1.25 s	31.6×	2.3 → 1.3 GB	—	pass

Frequently asked questions

Speeding up WGCNA blockwise

Why is WGCNA blockwise slow?

WGCNA blockwise is CPU-bound, and the stock implementation in WGCNA leaves performance on the table in its core numerical work. On the heart_adult_25kx5p5k benchmark dataset (Windows, 8 threads), the original takes 22.96 min where the AutoZyme path takes 21.24 s (66.4× faster).

How do I make WGCNA blockwise faster?

Install AutoZyme and activate the WGCNA patch, then keep using WGCNA blockwise exactly as before. For inputs inside the supported scope, AutoZyme transparently substitutes the faster, output-validated path, up to 66.4× faster on the Windows benchmark runs (up to 115.7× on macOS), with no pipeline or API changes.

Does the AutoZyme speedup change the WGCNA blockwise output?

No. The accelerated path returns bit-for-bit identical results to the original WGCNA implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.
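
A check of this kind can be sketched in a few lines of base R. The helper below is hypothetical and is not the frozen concordance gate itself; it only shows what "maximum absolute difference 0" and "bit-for-bit identical" mean for two numeric matrices, with small made-up inputs.

```r
# Hypothetical helper: compare a baseline result with an optimized one.
concordance_check <- function(baseline, optimized) {
  stopifnot(identical(dim(baseline), dim(optimized)))
  list(
    max_abs_diff  = max(abs(baseline - optimized)),
    bit_identical = identical(baseline, optimized)
  )
}

baseline <- matrix(c(0.1, 0.2, 0.3, 0.4), nrow = 2)

same <- concordance_check(baseline, baseline)
# same$max_abs_diff is 0 and same$bit_identical is TRUE

perturbed <- baseline
perturbed[1, 1] <- perturbed[1, 1] + 1e-12
drift <- concordance_check(baseline, perturbed)
# drift$max_abs_diff is greater than 0 and drift$bit_identical is FALSE
```

A tolerance-based comparison such as all.equal() would accept the perturbed matrix, which is why an exact gate uses identical() or a zero maximum difference instead.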

How do I install the WGCNA speedup?

In R: install the autozyme package, then run library(autozyme) and autozyme::activate("wgcna"). The patch applies automatically the next time you call WGCNA::blockwiseModules.
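
Put together, a session might look like the sketch below. The library(autozyme) and autozyme::activate("wgcna") calls and the parameter values are taken from this page; the random matrix is a stand-in for real expression data, and the code assumes both packages are already installed (this page does not say where autozyme is installed from). The guard skips the run when either package is missing.

```r
# Benchmarked settings quoted from the supported-scope text above.
params <- list(
  power = 4, corType = "pearson", networkType = "unsigned",
  TOMType = "signed", TOMDenom = "min"
)

if (requireNamespace("autozyme", quietly = TRUE) &&
    requireNamespace("WGCNA", quietly = TRUE)) {
  library(WGCNA)
  library(autozyme)
  autozyme::activate("wgcna")  # activation call as given on this page

  # Placeholder data: 60 samples x 300 genes of random noise, no NAs.
  # Random noise is unlikely to yield meaningful modules.
  set.seed(1)
  datExpr <- matrix(
    rnorm(60 * 300), nrow = 60,
    dimnames = list(paste0("sample", 1:60), paste0("gene", 1:300))
  )

  # Same public API as before; the patch is described as applying on this call.
  net <- do.call(blockwiseModules, c(list(datExpr = datExpr), params))
  print(table(net$colors))
}
```

Changing any of these settings to a value outside the supported scope should, per the scope description, route the call to the upstream implementation rather than fail.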