Speed up DIPY DTI: up to 6.57× faster, near-identical output

Benchmark charts

Speedup distribution

Each dot is one finalized dataset/thread run on Windows

stanford_hardi_30x: 6.57×
stanford_hardi_60x: 6.26×
cfin_multib_60x: 6.09×
stanford_hardi_8x: 5.97×
sherbrooke_3shell_180x: 5.02×

Thread sweep

Speedup across finalized thread counts on Windows

No finalized multi-thread sweep for this platform.

Memory

Baseline vs optimized peak memory on Windows

Series: baseline, optimized

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets the DTI (diffusion tensor imaging) fit in dipy. The benchmarked results stay within the declared scientific output gate while reducing CPU runtime on the listed datasets.

Also searched as: diffusion tensor imaging, DTI, diffusion MRI, dMRI, tractography.

Supported scope

Correctly handles fit_method="WLS" (the upstream default) tensor fits:

(1) Masked WLS fit (mask not None) uses the chunked fast path: per-chunk (step=1250) masked-voxel unpacking and a batched 7x7 normal-equations solve via hand-rolled Cholesky (_chol7_solve), with a per-voxel SVD-pinv fallback (_pinv_wls_fit) for any voxel whose normal matrix is not safely positive-definite, so singular or ill-conditioned voxels are guarded and matched to upstream.
(2) Unmasked WLS fit (mask=None) is also accelerated via the patched dti.wls_fit_tensor through the upstream-style reshape path.
(3) return_S0_hat True and False are both handled (lines 235-250, 280-299).
(4) weights= is passed through to the kernel (lines 144-145).
(5) return_lower_triangular is handled (lines 177-178).
(6) return_leverages=True is routed to an upstream-equivalent SVD-pinv solve (lines 165-172).

All numerics are float64, matching upstream (no float32 downcast). Empty masks and single-voxel masks work. The supported numeric domain is the standard rank-2 diffusion tensor: a 7-column design matrix (6 tensor + 1 log-S0) producing the (...,12) eig_from_lo_tri output contract.
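The masked fast path described above, a batched normal-equations solve with a Cholesky factorization and a per-voxel pseudo-inverse fallback, can be sketched in NumPy as follows. This is an illustrative sketch only, not AutoZyme's _chol7_solve or _pinv_wls_fit: the function name batched_wls_solve, the rel_tol pivot threshold, and the exact weighting convention are assumptions made for the example.

```python
import numpy as np


def batched_wls_solve(X, y, w, rel_tol=1e-12):
    """Minimise sum_n w[v, n] * (X[n] @ beta - y[v, n])**2 for every voxel v.

    X: (n, p) design matrix shared by all voxels (p = 7 for DTI).
    y: (v, n) log-signals; w: (v, n) non-negative weights.
    Hypothetical helper for illustration; rel_tol is an assumed threshold.
    """
    X = np.asarray(X, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)

    # Per-voxel normal equations: A = X^T diag(w) X, b = X^T diag(w) y.
    A = np.einsum("ni,vn,nj->vij", X, w, X)
    b = np.einsum("ni,vn->vi", X, w * y)
    n_vox, p = b.shape

    # Vectorised Cholesky A = L L^T; `ok` marks voxels whose pivots stay positive.
    L = np.zeros_like(A)
    ok = np.ones(n_vox, dtype=bool)
    for j in range(p):
        d = A[:, j, j] - np.sum(L[:, j, :j] ** 2, axis=1)
        ok &= d > rel_tol * np.abs(A[:, j, j])
        L[:, j, j] = np.sqrt(np.where(ok, d, 1.0))  # dummy pivot for failed voxels
        for i in range(j + 1, p):
            s = A[:, i, j] - np.sum(L[:, i, :j] * L[:, j, :j], axis=1)
            L[:, i, j] = s / L[:, j, j]

    # Forward substitution (L z = b), then back substitution (L^T beta = z).
    z = np.zeros_like(b)
    for i in range(p):
        z[:, i] = (b[:, i] - np.sum(L[:, i, :i] * z[:, :i], axis=1)) / L[:, i, i]
    beta = np.zeros_like(b)
    for i in reversed(range(p)):
        tail = np.sum(L[:, i + 1:, i] * beta[:, i + 1:], axis=1)
        beta[:, i] = (z[:, i] - tail) / L[:, i, i]

    # Fallback: pseudo-inverse solve, one voxel at a time, where Cholesky failed.
    for v in np.flatnonzero(~ok):
        beta[v] = np.linalg.pinv(A[v]) @ b[v]
    return beta, ok
```

The returned ok mask shows which voxels took the Cholesky path. Voxels flagged False (for example, a voxel whose weights are all zero) receive the pseudo-inverse solution instead, which mirrors the guard for singular or ill-conditioned voxels described in item (1).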

Out-of-scope behavior

Inputs outside the supported scope fall back silently to the upstream dipy implementation.

Detailed speedup table (9 runs)

Dataset	Tier	Platform	Threads	Baseline	Optimized	Speedup	Memory	Concordance	Pass
`cfin_multib_60x`	ood_large	Windows	1	4.58 min	45.18 s	6.09×	25.1 → 19.6 GB	—	pass
`sherbrooke_3shell_180x`	ood_xlarge	Windows	1	16.05 min	3.21 min	5.02×	79.4 → 78.4 GB	—	pass
`stanford_hardi_30x`	medium	Windows	1	3.04 min	27.77 s	6.57×	14.0 → 11.9 GB	—	pass
`stanford_hardi_60x`	large	Windows	1	6.05 min	58.01 s	6.26×	27.8 → 23.8 GB	—	pass
`stanford_hardi_8x`	small	Windows	1	38.61 s	6.47 s	5.97×	3.9 → 3.2 GB	—	pass
`cfin_multib_60x`	ood_large	macOS	1	3.67 min	14.12 s	15.6×	18.9 → 18.2 GB	—	pass
`stanford_hardi_30x`	medium	macOS	1	2.50 min	12.18 s	12.3×	15.0 → 12.6 GB	—	pass
`stanford_hardi_60x`	large	macOS	1	5.06 min	23.95 s	12.7×	18.4 → 17.2 GB	—	pass
`stanford_hardi_8x`	small	macOS	1	39.64 s	4.01 s	9.89×	5.5 → 4.1 GB	—	pass
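The Speedup column is the baseline runtime divided by the optimized runtime. The short script below recomputes it from the rounded durations in the table; because the inputs are already rounded, the recomputed ratios can differ from the reported ones in the last digit. The helper names to_seconds and recomputed_speedups are illustrative only.

```python
def to_seconds(text):
    """Convert a table duration such as '3.04 min' or '27.77 s' to seconds."""
    value, unit = text.split()
    return float(value) * {"s": 1.0, "min": 60.0}[unit]


# (dataset, platform, baseline, optimized, reported speedup), copied from the table.
RUNS = [
    ("cfin_multib_60x", "Windows", "4.58 min", "45.18 s", 6.09),
    ("sherbrooke_3shell_180x", "Windows", "16.05 min", "3.21 min", 5.02),
    ("stanford_hardi_30x", "Windows", "3.04 min", "27.77 s", 6.57),
    ("stanford_hardi_60x", "Windows", "6.05 min", "58.01 s", 6.26),
    ("stanford_hardi_8x", "Windows", "38.61 s", "6.47 s", 5.97),
    ("cfin_multib_60x", "macOS", "3.67 min", "14.12 s", 15.6),
    ("stanford_hardi_30x", "macOS", "2.50 min", "12.18 s", 12.3),
    ("stanford_hardi_60x", "macOS", "5.06 min", "23.95 s", 12.7),
    ("stanford_hardi_8x", "macOS", "39.64 s", "4.01 s", 9.89),
]


def recomputed_speedups(runs=RUNS):
    """Return {(dataset, platform): baseline_seconds / optimized_seconds}."""
    return {
        (name, platform): to_seconds(base) / to_seconds(opt)
        for name, platform, base, opt, _reported in runs
    }


if __name__ == "__main__":
    for (name, platform), ratio in recomputed_speedups().items():
        print(f"{name:24s} {platform:8s} {ratio:6.2f}x")
```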

Frequently asked questions

Speeding up DIPY DTI

Why is DIPY DTI slow?

DIPY DTI is CPU-bound, and the stock implementation in dipy leaves performance on the table in its core numerical work. For example, on stanford_hardi_30x (Windows, 1 thread) the original takes 3.04 min where the AutoZyme path takes 27.77 s (6.57× faster).

How do I make DIPY DTI faster?

Install AutoZyme and activate the dipy patch, then keep using DIPY DTI exactly as before. AutoZyme transparently substitutes the faster, output-validated path, with no pipeline or API changes. On the benchmark datasets this is up to 6.57× faster on Windows and up to 15.6× faster on macOS.

Does the AutoZyme speedup change the DIPY DTI output?

Effectively no. The output is tolerance-equivalent: held within a frozen concordance gate (up to about 0.6% drift from the original dipy result) on every benchmark dataset.
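To make the idea of a tolerance gate concrete, here is a minimal sketch of one way such a check could be written. The page does not define how drift is measured, so the metric below (maximum absolute difference relative to the largest reference magnitude) and the function name within_gate are assumptions for illustration; only the 0.6% figure comes from the text above.

```python
import numpy as np


def within_gate(reference, candidate, max_rel_drift=0.006):
    """Return True if candidate stays within max_rel_drift of reference.

    Assumed metric: max |candidate - reference| / max |reference|.
    The actual AutoZyme concordance gate may be defined differently.
    """
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    scale = np.max(np.abs(ref))
    if scale == 0.0:
        # Degenerate reference: only an all-zero candidate passes.
        return bool(np.all(cand == 0.0))
    drift = np.max(np.abs(cand - ref)) / scale
    return bool(drift <= max_rel_drift)
```

In practice such a check would be applied to a derived map (for example fractional anisotropy) computed once with upstream dipy and once with the patched path.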

How do I install the dipy speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("dipy"). The patch applies automatically the next time you call DIPY DTI.
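Put together, a typical masked WLS fit might look like the sketch below. The autozyme.activate("dipy") call is taken from the answer above; TensorModel and gradient_table are standard DIPY names. The wrapper fit_dti_wls and its use_autozyme flag are hypothetical conveniences for this example, and activating the patch before importing dipy is an assumption, not a documented requirement.

```python
def fit_dti_wls(data, bvals, bvecs, mask=None, use_autozyme=True):
    """Fit a diffusion tensor with WLS, optionally with the AutoZyme patch active.

    data: 4D diffusion array (x, y, z, n_volumes); mask: optional 3D boolean array.
    Hypothetical wrapper for illustration only.
    """
    if use_autozyme:
        import autozyme

        # Activation call as quoted on this page; assumed to be safe to call
        # before dipy is imported.
        autozyme.activate("dipy")

    from dipy.core.gradients import gradient_table
    from dipy.reconst.dti import TensorModel

    gtab = gradient_table(bvals, bvecs=bvecs)
    # fit_method="WLS" is the upstream default and the path the patch supports.
    model = TensorModel(gtab, fit_method="WLS")
    return model.fit(data, mask=mask)
```

The returned object is DIPY's usual tensor fit, so downstream code that reads attributes such as fa or evals should not need to change. Calling the same function with use_autozyme=False gives an upstream baseline for comparison.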