Benchmark charts
Speedup distribution: each dot is one finalized dataset/thread run on Windows.
Thread sweep: speedup across finalized thread counts on Windows.
Memory: baseline vs optimized peak memory on Windows.
What is accelerated
This task targets DIPY DTI in dipy. The benchmarked result preserves the declared scientific output gate while reducing CPU runtime on the listed datasets.
Also searched as: diffusion tensor imaging, DTI, diffusion MRI, dMRI, tractography.
Supported scope
Correctly handles fit_method="WLS" (the upstream default) tensor fits:
(1) Masked WLS fit (mask not None) uses the chunked fast path: per-chunk (step=1250) masked-voxel unpacking and a batched 7x7 normal-equations solve via hand-rolled Cholesky (_chol7_solve), with a per-voxel SVD-pinv fallback (_pinv_wls_fit) for any voxel whose normal matrix is not safely positive-definite, so singular or ill-conditioned voxels are guarded and matched to upstream.
(2) Unmasked WLS fit (mask=None) is also accelerated via the patched dti.wls_fit_tensor through the upstream-style reshape path.
(3) return_S0_hat True/False are both handled (lines 235-250, 280-299).
(4) weights= is passed through to the kernel (lines 144-145).
(5) return_lower_triangular is handled (lines 177-178).
(6) return_leverages=True is routed to an upstream-equivalent SVD-pinv solve (lines 165-172).
All numerics are float64, matching upstream (no float32 downcast). Empty masks and single-voxel masks work. The supported numeric domain is the standard rank-2 diffusion tensor: a 7-column design matrix (6 tensor + 1 log-S0) producing the (...,12) eig_from_lo_tri output contract.
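The masked fast path described in item (1) can be illustrated with a minimal pure-Python sketch for a single voxel. It is not the AutoZyme kernel: the function names normal_equations and chol_solve, the absolute tolerance tol, and the use of plain lists are assumptions made for illustration. It builds the weighted normal equations (X^T W X) beta = X^T W y, solves them by Cholesky factorization, and returns None when a pivot is not safely positive, which is the point at which a caller would switch to a pseudo-inverse fallback.

```python
import math

def normal_equations(design, log_signal, weights):
    """Build A = X^T W X and b = X^T W y for one voxel (hypothetical helper)."""
    p = len(design[0])
    A = [[sum(w * row[i] * row[j] for row, w in zip(design, weights))
          for j in range(p)] for i in range(p)]
    b = [sum(w * row[i] * y for row, w, y in zip(design, weights, log_signal))
         for i in range(p)]
    return A, b

def chol_solve(A, b, tol=1e-12):
    """Solve A x = b via Cholesky; return None if A is not safely positive-definite.

    The absolute tolerance is an assumption of this sketch, not a documented value.
    """
    n = len(b)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= tol:
            return None  # caller should fall back to a pseudo-inverse solve
        L[j][j] = math.sqrt(d)
        for i in range(j + 1, n):
            L[i][j] = (A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    # Forward substitution: L z = b
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    # Back substitution: L^T x = z
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

# Toy 2-column example; a real DTI fit would use the 7-column design matrix.
design = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
A, b = normal_equations(design, [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
solution = chol_solve(A, b)

# All-zero weights give a singular normal matrix, so the guard triggers.
A0, b0 = normal_equations(design, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
guarded = chol_solve(A0, b0)
```

The helpers work for any number of columns, so the same code applies to a 7x7 system. A batched implementation would run the same recurrence across all voxels in a chunk at once rather than one voxel at a time.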
Out-of-scope behavior
Inputs outside the supported scope fall back silently to the upstream dipy implementation.
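One way to picture a silent fallback is a small dispatcher that routes supported calls to the fast path and everything else to the original function. The sketch below is only an assumption about the general pattern: make_dispatcher, fast_fit, and upstream_fit are hypothetical names, and the actual AutoZyme routing logic is not documented here.

```python
def make_dispatcher(fast_fit, upstream_fit, supported_methods=("WLS",)):
    """Return a fit function that uses fast_fit only for supported fit methods."""
    def fit(data, fit_method="WLS", **kwargs):
        if fit_method in supported_methods:
            return fast_fit(data, **kwargs)
        # Out of scope: defer to the unmodified upstream implementation.
        return upstream_fit(data, fit_method=fit_method, **kwargs)
    return fit

# Stub implementations, used only to show which branch is taken.
fit = make_dispatcher(
    fast_fit=lambda data, **kw: ("fast", data),
    upstream_fit=lambda data, fit_method, **kw: ("upstream", fit_method),
)
wls_result = fit([1.0, 2.0])
nlls_result = fit([1.0, 2.0], fit_method="NLLS")
```

Because the caller sees the same signature in both branches, an unsupported fit method produces the upstream result without raising an error or a warning.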
Detailed speedup table (9 runs)
| Dataset | Tier | Platform | Threads | Baseline | Optimized | Speedup | Memory | Concordance | Pass |
|---|---|---|---|---|---|---|---|---|---|
| cfin_multib_60x | ood_large | Windows | 1 | 4.58 min | 45.18 s | 6.09× | 25.1 → 19.6 GB | — | pass |
| sherbrooke_3shell_180x | ood_xlarge | Windows | 1 | 16.05 min | 3.21 min | 5.02× | 79.4 → 78.4 GB | — | pass |
| stanford_hardi_30x | medium | Windows | 1 | 3.04 min | 27.77 s | 6.57× | 14.0 → 11.9 GB | — | pass |
| stanford_hardi_60x | large | Windows | 1 | 6.05 min | 58.01 s | 6.26× | 27.8 → 23.8 GB | — | pass |
| stanford_hardi_8x | small | Windows | 1 | 38.61 s | 6.47 s | 5.97× | 3.9 → 3.2 GB | — | pass |
| cfin_multib_60x | ood_large | macOS | 1 | 3.67 min | 14.12 s | 15.6× | 18.9 → 18.2 GB | — | pass |
| stanford_hardi_30x | medium | macOS | 1 | 2.50 min | 12.18 s | 12.3× | 15.0 → 12.6 GB | — | pass |
| stanford_hardi_60x | large | macOS | 1 | 5.06 min | 23.95 s | 12.7× | 18.4 → 17.2 GB | — | pass |
| stanford_hardi_8x | small | macOS | 1 | 39.64 s | 4.01 s | 9.89× | 5.5 → 4.1 GB | — | pass |
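The Speedup column is the baseline runtime divided by the optimized runtime. The short sketch below recomputes it from the displayed values for two Windows rows; to_seconds and speedup are hypothetical helper names, and the parser assumes only the two units that appear in the table (s and min).

```python
def to_seconds(text):
    """Convert a table duration such as '3.04 min' or '27.77 s' to seconds."""
    value, unit = text.split()
    factor = {"s": 1.0, "min": 60.0}[unit]
    return float(value) * factor

def speedup(baseline, optimized):
    """Speedup = baseline runtime / optimized runtime."""
    return to_seconds(baseline) / to_seconds(optimized)

hardi_30x = round(speedup("3.04 min", "27.77 s"), 2)  # stanford_hardi_30x, Windows
hardi_8x = round(speedup("38.61 s", "6.47 s"), 2)     # stanford_hardi_8x, Windows
print(hardi_30x)  # 6.57
print(hardi_8x)   # 5.97
```

Some rows reproduce only to within the last digit (for example, 4.58 min over 45.18 s gives about 6.08 rather than the listed 6.09×), presumably because the displayed runtimes are themselves rounded.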
Frequently asked questions
Why is DIPY DTI slow?
DIPY DTI is CPU-bound, and the stock implementation in dipy leaves performance on the table in its core numerical work. For example, on stanford_hardi_30x (Windows, 1 thread) the original takes 3.04 min where the AutoZyme path takes 27.77 s (6.57× faster).
How do I make DIPY DTI faster?
Install AutoZyme and activate the dipy patch, then keep using DIPY DTI exactly as before. AutoZyme transparently substitutes the faster, output-validated path, which is up to 6.57× faster on Windows and up to 15.6× faster on macOS on the benchmark datasets, with no pipeline or API changes.
Does the AutoZyme speedup change the DIPY DTI output?
Effectively no. The output is tolerance-equivalent: held within a frozen concordance gate (up to about 0.6% drift from the original dipy result) on every benchmark dataset.
How do I install the dipy speedup?
From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("dipy"). The patch applies automatically the next time you call DIPY DTI.
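A minimal sketch of the activation step, written so that a script still runs with stock dipy when AutoZyme is not installed. Only import autozyme and autozyme.activate("dipy") come from the answer above; the wrapper activate_if_available is a hypothetical helper, and the return value of activate is deliberately not relied on.

```python
import importlib.util

def activate_if_available(package="dipy"):
    """Activate the AutoZyme patch if autozyme is installed; otherwise do nothing."""
    if importlib.util.find_spec("autozyme") is None:
        return False  # stock dipy remains in use
    import autozyme
    autozyme.activate(package)  # the call named in the FAQ answer
    return True

patched = activate_if_available()
print("AutoZyme dipy patch active:", patched)
```

Run this once at the top of a pipeline, before the first tensor fit; the rest of the DIPY code stays unchanged.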