Speed up DIPY DTI

DIPY DTI is one of the slower steps in many biomedical imaging workflows. AutoZyme ships a verified, drop-in patch that is up to 6.57× faster, returning output within a strict, verified tolerance with no change to how you call it.

Best speedup: 6.57×
Median speedup: 6.57×
Output equivalence: tolerance
Best runtime: baseline 3.04 min, optimized 27.77 s
Datasets: 5
Pass rate: 9/9

Benchmark charts

[Interactive charts, shown for the Windows platform]
Speedup distribution: each dot is one finalized dataset/thread run (stanford_hardi_8x, stanford_hardi_30x, stanford_hardi_60x, cfin_multib_60x, sherbrooke_3shell_180x).
Thread sweep: no finalized multi-thread sweep for this platform.
Memory: baseline vs optimized peak memory on a 0 to 100 GB axis. Optimized / baseline ratios are 0.84× (stanford_hardi_8x), 0.85× (stanford_hardi_30x), 0.85× (stanford_hardi_60x), 0.78× (cfin_multib_60x), and 0.99× (sherbrooke_3shell_180x), all at 1 thread.

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets DIPY · DTI in dipy. The benchmarked result preserves the declared scientific output gate while reducing CPU runtime on the listed datasets.

Also searched as: diffusion tensor imaging, DTI, diffusion MRI, dMRI, tractography.

Supported scope

Handles fit_method="WLS" (the upstream default) tensor fits:

(1) Masked WLS fit (mask is not None) uses the chunked fast path: per-chunk (step=1250) masked-voxel unpacking and a batched 7x7 normal-equations solve via a hand-rolled Cholesky routine (_chol7_solve), with a per-voxel SVD-pinv fallback (_pinv_wls_fit) for any voxel whose normal matrix is not safely positive-definite. Singular and ill-conditioned voxels are therefore guarded and matched to upstream.
(2) Unmasked WLS fit (mask=None) is also accelerated, via the patched dti.wls_fit_tensor through the upstream-style reshape path.
(3) return_S0_hat=True and return_S0_hat=False are both handled (lines 235-250, 280-299).
(4) weights= is passed through to the kernel (lines 144-145).
(5) return_lower_triangular is handled (lines 177-178).
(6) return_leverages=True is routed to an upstream-equivalent SVD-pinv solve (lines 165-172).

All numerics are float64, matching upstream (no float32 downcast). Empty masks and single-voxel masks work. The supported numeric domain is the standard rank-2 diffusion tensor: a 7-column design matrix (6 tensor + 1 log-S0) producing the (...,12) eig_from_lo_tri output contract.
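The masked fast path described in (1) can be illustrated with a minimal NumPy sketch. This is not AutoZyme's code: the function name batched_wls_fit, the eigenvalue-based positive-definiteness guard, and its rcond threshold are assumptions made for illustration, and NumPy's batched np.linalg.cholesky stands in for the hand-rolled _chol7_solve. Only the chunk size of 1250, the 7x7 normal equations, and the per-voxel pseudo-inverse fallback are taken from the scope description.

```python
import numpy as np

def batched_wls_fit(design, log_signal, weights, step=1250, rcond=1e-10):
    """Solve (X^T W X) beta = X^T W y per voxel, in chunks of `step` voxels.

    design:     (n_meas, 7) design matrix (6 tensor columns + 1 log-S0 column)
    log_signal: (n_vox, n_meas) log of the measured signal
    weights:    (n_vox, n_meas) non-negative WLS weights
    """
    design = np.asarray(design, dtype=np.float64)
    n_vox = log_signal.shape[0]
    beta = np.empty((n_vox, design.shape[1]), dtype=np.float64)

    for start in range(0, n_vox, step):
        sl = slice(start, start + step)
        w = np.asarray(weights[sl], dtype=np.float64)
        y = np.asarray(log_signal[sl], dtype=np.float64)

        # Normal matrices A[v] = X^T diag(w_v) X and right-hand sides b[v].
        A = np.einsum("mi,vm,mj->vij", design, w, design)
        b = np.einsum("mi,vm->vi", design, w * y)

        # Assumed guard: a voxel counts as safely positive-definite only if
        # its smallest eigenvalue is well above zero relative to its largest.
        evals = np.linalg.eigvalsh(A)
        ok = evals[:, 0] > rcond * evals[:, -1]

        out = np.empty_like(b)
        if ok.any():
            L = np.linalg.cholesky(A[ok])             # A = L L^T
            z = np.linalg.solve(L, b[ok][..., None])  # solve L z = b
            # solve L^T beta = z
            out[ok] = np.linalg.solve(np.swapaxes(L, -1, -2), z)[..., 0]

        # Fallback: SVD-based pseudo-inverse, one voxel at a time.
        for v in np.flatnonzero(~ok):
            sw = np.sqrt(w[v])
            out[v] = np.linalg.pinv(design * sw[:, None]) @ (sw * y[v])

        beta[sl] = out
    return beta
```

Batching the 7x7 solves keeps the per-voxel work inside vectorized calls, while the fallback loop only runs for the voxels that fail the guard. A voxel whose weights are all zero, for example, produces a zero normal matrix and is routed to the pseudo-inverse branch.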

Out-of-scope behavior

Inputs outside the supported scope silently fall back to the upstream dipy implementation.

Detailed speedup table (9 runs)
Dataset Tier Platform Threads Baseline Optimized Speedup Memory Concordance Pass
cfin_multib_60x ood_large Windows 1 4.58 min 45.18 s 6.09× 25.1 → 19.6 GB pass
sherbrooke_3shell_180x ood_xlarge Windows 1 16.05 min 3.21 min 5.02× 79.4 → 78.4 GB pass
stanford_hardi_30x medium Windows 1 3.04 min 27.77 s 6.57× 14.0 → 11.9 GB pass
stanford_hardi_60x large Windows 1 6.05 min 58.01 s 6.26× 27.8 → 23.8 GB pass
stanford_hardi_8x small Windows 1 38.61 s 6.47 s 5.97× 3.9 → 3.2 GB pass
cfin_multib_60x ood_large macOS 1 3.67 min 14.12 s 15.6× 18.9 → 18.2 GB pass
stanford_hardi_30x medium macOS 1 2.50 min 12.18 s 12.3× 15.0 → 12.6 GB pass
stanford_hardi_60x large macOS 1 5.06 min 23.95 s 12.7× 18.4 → 17.2 GB pass
stanford_hardi_8x small macOS 1 39.64 s 4.01 s 9.89× 5.5 → 4.1 GB pass
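The speedup column is the ratio of baseline to optimized runtime. The sketch below recomputes it from the rounded times shown in the table; the RUNS list and the speedup helper are illustrative names, not part of AutoZyme.

```python
from statistics import median

# (dataset, platform, baseline_s, optimized_s, reported_speedup)
# Times are the rounded values shown in the table, converted to seconds.
RUNS = [
    ("cfin_multib_60x", "Windows", 4.58 * 60, 45.18, 6.09),
    ("sherbrooke_3shell_180x", "Windows", 16.05 * 60, 3.21 * 60, 5.02),
    ("stanford_hardi_30x", "Windows", 3.04 * 60, 27.77, 6.57),
    ("stanford_hardi_60x", "Windows", 6.05 * 60, 58.01, 6.26),
    ("stanford_hardi_8x", "Windows", 38.61, 6.47, 5.97),
    ("cfin_multib_60x", "macOS", 3.67 * 60, 14.12, 15.6),
    ("stanford_hardi_30x", "macOS", 2.50 * 60, 12.18, 12.3),
    ("stanford_hardi_60x", "macOS", 5.06 * 60, 23.95, 12.7),
    ("stanford_hardi_8x", "macOS", 39.64, 4.01, 9.89),
]

def speedup(baseline_s, optimized_s):
    """Speedup factor: how many times faster the optimized run is."""
    return baseline_s / optimized_s

recomputed = [speedup(b, o) for _, _, b, o, _ in RUNS]
reported = [r for *_, r in RUNS]

# Summary figures derived from the reported column.
best_windows = max(r for _, p, _, _, r in RUNS if p == "Windows")
median_all = median(reported)

for (name, platform, *_), value in zip(RUNS, recomputed):
    print(f"{name:24s} {platform:8s} {value:5.2f}x")
```

Because the table shows rounded times, the recomputed ratios differ slightly from the reported ones, but each stays within 0.05 of the reported value. On these rows, the best Windows speedup and the median of all nine reported speedups are both 6.57×.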

Frequently asked questions

Speeding up DIPY DTI
Why is DIPY DTI slow?

DIPY DTI is CPU-bound, and the stock implementation in dipy leaves performance on the table in its core numerical work. On the stanford_hardi_30x benchmark (Windows, 1 thread), the original takes 3.04 min where the AutoZyme path takes 27.77 s (6.57× faster).

How do I make DIPY DTI faster?

Install AutoZyme and activate the dipy patch, then keep using DIPY DTI exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 6.57× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the DIPY DTI output?

Effectively no. The output is tolerance-equivalent: held within a frozen concordance gate (up to about 0.6% drift from the original dipy result) on every benchmark dataset.
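A tolerance check of this kind can be sketched as follows. The exact metric behind the concordance gate is not described here, so the elementwise relative drift, the function name within_concordance, and the abs_tol floor are assumptions; only the 0.6% figure comes from the answer above.

```python
import numpy as np

def within_concordance(baseline, optimized, rel_tol=0.006, abs_tol=1e-12):
    """Return (passed, max_drift) for elementwise relative drift.

    rel_tol=0.006 mirrors the "about 0.6%" figure; abs_tol only avoids
    division by zero where the baseline value is exactly 0.
    """
    baseline = np.asarray(baseline, dtype=np.float64)
    optimized = np.asarray(optimized, dtype=np.float64)
    denom = np.maximum(np.abs(baseline), abs_tol)
    drift = np.abs(optimized - baseline) / denom
    max_drift = float(drift.max()) if drift.size else 0.0
    return max_drift <= rel_tol, max_drift
```

In practice one might apply such a check to derived maps (for example fractional anisotropy) from a baseline run and a patched run on the same data and mask.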

How do I install the dipy speedup?

Run pip install autozyme in a shell. Then, in Python, import autozyme and call autozyme.activate("dipy"). The patch applies automatically the next time you call DIPY DTI.
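A minimal sketch of what this might look like in a script is shown below. The autozyme.activate("dipy") call is the one quoted above; the wrapper name fit_dti_wls is hypothetical, and activating before importing dipy is an assumption rather than a documented requirement. The dipy calls (gradient_table, TensorModel, fit) are the standard upstream API.

```python
def fit_dti_wls(data, bvals, bvecs, mask=None):
    """Fit a WLS diffusion tensor with the AutoZyme dipy patch active.

    data:  4D diffusion-weighted volume (x, y, z, n_measurements)
    bvals: b-values, one per measurement
    bvecs: gradient directions, one per measurement
    mask:  optional 3D boolean brain mask
    """
    # Imports are local so the function can be defined even where the
    # packages are not installed.
    import autozyme
    autozyme.activate("dipy")  # assumed to be safe to call before importing dipy

    from dipy.core.gradients import gradient_table
    from dipy.reconst.dti import TensorModel

    gtab = gradient_table(bvals, bvecs=bvecs)
    model = TensorModel(gtab, fit_method="WLS")  # WLS is the supported fast path
    return model.fit(data, mask=mask)
```

The returned object is the usual dipy tensor fit, so downstream code such as fit.fa or fit.md stays unchanged.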