Speed up Astropy BoxLeastSquares

Astropy BoxLeastSquares is one of the slower steps in many astronomy workflows. AutoZyme ships a verified, drop-in patch that is up to 6.69× faster, returning bit-for-bit identical results with no change to how you call it.

Best speedup: 6.69×
Median speedup: 6.51×
Output equivalence: bit-exact
Best runtime: 16.68 min baseline, 2.74 min optimized
Datasets: 5
Pass rate: 9/9

Benchmark charts

Platform shown: Windows

Speedup distribution
Each dot is one finalized dataset/thread run on Windows. Datasets: kepler11_q3_q4_months1_5_sc, kepler9_q2_full_sc, kepler8_q2_full_sc, kepler10_q3_months1_2_sc, kepler_tres2_q2_month1_sc.

Thread sweep
Speedup across finalized thread counts on Windows
Dataset Tier Threads Baseline Optimized Speedup Memory
kepler11_q3_q4_months1_5_sc ood_xlarge 1 15.40 min 15.66 min 1.17× 0.2 → 0.2 GB
kepler11_q3_q4_months1_5_sc ood_xlarge 4 21.08 min 5.77 min 3.17× 0.2 → 0.3 GB
kepler11_q3_q4_months1_5_sc ood_xlarge 8 16.68 min 2.74 min 6.69× 0.2 → 0.3 GB
kepler9_q2_full_sc large 1 4.51 min 4.52 min 1.09× 0.1 → 0.1 GB
kepler9_q2_full_sc large 4 5.79 min 1.46 min 3.38× 0.1 → 0.2 GB
kepler9_q2_full_sc large 8 4.95 min 46.35 s 6.51× 0.1 → 0.2 GB
kepler8_q2_full_sc ood_large 1 4.67 min 4.69 min 1.17× 0.1 → 0.1 GB
kepler8_q2_full_sc ood_large 4 5.69 min 1.63 min 3.36× 0.1 → 0.2 GB
kepler8_q2_full_sc ood_large 8 5.30 min 52.22 s 6.30× 0.1 → 0.2 GB
kepler10_q3_months1_2_sc medium 1 2.36 min 2.33 min 0.99× 0.1 → 0.1 GB
kepler10_q3_months1_2_sc medium 4 2.29 min 49.51 s 2.79× 0.1 → 0.2 GB
kepler10_q3_months1_2_sc medium 8 2.30 min 22.38 s 6.17× 0.1 → 0.2 GB
kepler_tres2_q2_month1_sc small 1 29.30 s 29.30 s 0.99× 0.1 → 0.1 GB
kepler_tres2_q2_month1_sc small 4 29.20 s 10.48 s 2.77× 0.1 → 0.1 GB
kepler_tres2_q2_month1_sc small 8 28.73 s 5.02 s 5.78× 0.1 → 0.1 GB

Memory
Baseline vs optimized peak memory on Windows
Dataset Tier Memory Optimized/baseline Speedup Threads
kepler11_q3_q4_months1_5_sc ood_xlarge 0.2 → 0.3 GB 1.56× 3.17× 4
kepler9_q2_full_sc large 0.1 → 0.2 GB 1.42× 3.38× 4
kepler8_q2_full_sc ood_large 0.1 → 0.2 GB 1.42× 3.36× 4
kepler10_q3_months1_2_sc medium 0.1 → 0.2 GB 1.36× 6.17× 8
kepler_tres2_q2_month1_sc small 0.1 → 0.1 GB 1.23× 5.78× 8

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

This task targets astropy.timeseries.BoxLeastSquares.autopower in astropy. The benchmarked patch passes the declared scientific output gate while reducing runtime on the listed datasets.

Also searched as: BLS, box least squares, periodogram, transit search, exoplanet, period search.

Supported scope

The patch replaces the native kernel astropy.timeseries.periodograms.bls.methods.bls_fast, the hot function under BoxLeastSquares.autopower and .power for method='fast'. It is an embarrassingly parallel split of the upstream bls_fast over the trial-period axis: it slices period into n_workers disjoint contiguous ranges, calls the captured upstream _orig_bls_fast (bls_impl) on each slice with the same (t, y, ivar, duration, oversample, use_likelihood) it received, then reassembles the 7-tuple result by field-wise np.concatenate in the original period order.

Because each trial period's BLS computation in bls_impl is independent of every other period, this reproduces upstream bit-exactly: speedups_finalized.tsv shows pearson_power=1.0, q99_rel_diff_power=0.0, rel_peak_period_err=0.0, top10_peak_jaccard=1.0.

The patch receives oversample and use_likelihood as already-resolved arguments and passes them through unchanged, so it handles any objective ('likelihood' or 'snr'), any oversample>=1, any duration array, and any period grid produced by autopower/autoperiod (any minimum_n_transit, minimum_period, maximum_period, or frequency_factor). method='slow' is not patched (only bls_fast is in targets=), so it transparently uses the unmodified upstream bls_slow.

The parallel path engages only when len(period) >= 2*16384 (32768) and the resolved worker count (_thread_count) is >= 2; otherwise (a single-thread environment or a small grid) it returns the original bls_fast result unchanged. Threads resolve from ZYME_THREADS, AUTOZYME_THREADS, AUTOZYMER_THREADS, or OMP_NUM_THREADS, else os.cpu_count()+6.
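The split-and-concatenate idea can be illustrated with a minimal sketch. This is not AutoZyme's code: toy_kernel is a hypothetical stand-in for the upstream kernel (it only mimics the "7 arrays, one entry per trial period" shape), chunked_kernel is an illustrative name, and the use of ThreadPoolExecutor and np.array_split is an assumption about how such a split could be written.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def toy_kernel(t, y, ivar, period, duration, oversample, use_likelihood):
    """Hypothetical stand-in for the upstream kernel: returns a 7-tuple of
    arrays, one entry per trial period, each computed independently."""
    phase_spread = np.array([np.std((t % p) / p) for p in period])
    weight = float(np.sum(ivar * y))
    return tuple(phase_spread * (k + 1) + weight for k in range(7))


def chunked_kernel(kernel, t, y, ivar, period, duration, oversample,
                   use_likelihood, n_workers):
    """Run `kernel` on contiguous slices of `period` and stitch the results."""
    # Disjoint, contiguous slices of the trial-period axis, in original order.
    chunks = [c for c in np.array_split(period, n_workers) if len(c)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Every slice gets the same data and settings; only `period` differs.
        # pool.map yields results in the order the chunks were submitted.
        parts = list(pool.map(
            lambda p: kernel(t, y, ivar, p, duration, oversample, use_likelihood),
            chunks,
        ))
    # Field-wise concatenation restores one array per output field.
    return tuple(np.concatenate(field) for field in zip(*parts))


# Illustrative data only.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 10.0, 200))
y = rng.normal(1.0, 0.01, 200)
ivar = np.full(200, 1.0e4)
period = np.linspace(0.5, 5.0, 1000)
duration = np.array([0.1])

serial = toy_kernel(t, y, ivar, period, duration, 10, False)
split = chunked_kernel(toy_kernel, t, y, ivar, period, duration, 10, False, 4)
identical = all(np.array_equal(a, b) for a, b in zip(serial, split))
print(identical)
```

Because no value for one trial period depends on any other, the stitched arrays match the single-call arrays element for element, which is the property the scope statement relies on.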

Out-of-scope behavior

Calls outside the supported scope fall back silently to the unmodified upstream implementation.
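The engage-or-fall-back rule from the supported scope can be sketched as below. This is an illustration under stated assumptions, not the patch's source: resolve_thread_count and use_parallel_path are hypothetical names, the precedence order among the environment variables is assumed to follow the order in which they are listed, and the default worker count is left to the caller rather than encoded here.

```python
import os

MIN_PERIODS = 2 * 16384  # 32768, the grid-size threshold quoted in the scope
THREAD_ENV_VARS = (
    "ZYME_THREADS",
    "AUTOZYME_THREADS",
    "AUTOZYMER_THREADS",
    "OMP_NUM_THREADS",
)


def resolve_thread_count(environ, default_threads):
    """Return the first positive integer found among the listed variables.

    Assumption: variables are checked in listed order; the real precedence
    is not documented in the text.
    """
    for name in THREAD_ENV_VARS:
        value = environ.get(name, "").strip()
        if value.isdigit() and int(value) > 0:
            return int(value)
    return default_threads


def use_parallel_path(n_periods, n_threads):
    """True only for a large enough grid and at least two workers."""
    return n_periods >= MIN_PERIODS and n_threads >= 2


threads = resolve_thread_count(os.environ, default_threads=os.cpu_count() or 1)
print(use_parallel_path(50_000, threads))
```

Under this rule, setting OMP_NUM_THREADS=1, or searching a grid shorter than 32768 trial periods, leaves the call on the upstream path.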

Detailed speedup table (9 runs)
Dataset Tier Platform Threads Baseline Optimized Speedup Memory Concordance Pass
kepler_tres2_q2_month1_sc small Windows 8 28.73 s 5.02 s 5.78× 0.1 → 0.1 GB pass
kepler10_q3_months1_2_sc medium Windows 8 2.30 min 22.38 s 6.17× 0.1 → 0.2 GB pass
kepler11_q3_q4_months1_5_sc ood_xlarge Windows 8 16.68 min 2.74 min 6.69× 0.2 → 0.3 GB pass
kepler8_q2_full_sc ood_large Windows 8 5.30 min 52.22 s 6.30× 0.1 → 0.2 GB pass
kepler9_q2_full_sc large Windows 8 4.95 min 46.35 s 6.51× 0.1 → 0.2 GB pass
kepler_tres2_q2_month1_sc small macOS 8 27.77 s 3.64 s 7.64× 0.1 → 0.1 GB pass
kepler10_q3_months1_2_sc medium macOS 8 2.58 min 19.64 s 7.84× 0.1 → 0.2 GB pass
kepler8_q2_full_sc ood_large macOS 4 5.44 min 1.41 min 3.89× 0.1 → 0.2 GB pass
kepler9_q2_full_sc large macOS 8 5.22 min 39.73 s 7.79× 0.1 → 0.2 GB pass

Frequently asked questions

Speeding up Astropy BoxLeastSquares
Why is Astropy BoxLeastSquares slow?

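Astropy BoxLeastSquares is CPU-bound: with method='fast', the stock kernel evaluates every trial period of the grid in a single call, although each period is independent of the others. AutoZyme splits that grid across worker threads. On the largest benchmark dataset (kepler11_q3_q4_months1_5_sc, Windows, 8 threads), the original takes 16.68 min where the AutoZyme path takes 2.74 min (6.69× faster).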

How do I make Astropy BoxLeastSquares faster?

Install AutoZyme and activate the astropy patch, then keep using Astropy BoxLeastSquares exactly as before. AutoZyme transparently substitutes the faster, output-validated path, up to 6.69× faster on the benchmark datasets, with no pipeline or API changes.

Does the AutoZyme speedup change the Astropy BoxLeastSquares output?

No. The accelerated path returns bit-for-bit identical results to the original astropy implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.
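For readers who want to repeat this kind of check on their own data, here is a minimal sketch of a comparison between two power arrays. The metric definitions are assumptions loosely modeled on the metric names quoted in the supported scope, not the definitions used by the concordance gate; compare_power is a hypothetical helper, and "top peaks" here simply means the highest-power samples.

```python
import numpy as np


def compare_power(period, power_ref, power_new, top_n=10):
    """Compare a reference periodogram with a candidate (assumed metrics)."""
    period = np.asarray(period, dtype=float)
    ref = np.asarray(power_ref, dtype=float)
    new = np.asarray(power_new, dtype=float)
    # Indices of the top_n highest-power samples in each periodogram.
    top_ref = set(np.argsort(ref)[-top_n:].tolist())
    top_new = set(np.argsort(new)[-top_n:].tolist())
    peak_ref = period[np.argmax(ref)]
    peak_new = period[np.argmax(new)]
    return {
        "bit_exact": bool(np.array_equal(ref, new)),
        "max_abs_diff": float(np.max(np.abs(ref - new))),
        "pearson": float(np.corrcoef(ref, new)[0, 1]),
        "rel_peak_period_err": float(abs(peak_new - peak_ref) / peak_ref),
        "top_peak_jaccard": len(top_ref & top_new) / len(top_ref | top_new),
    }


# Illustrative data: a periodogram compared with itself and with a
# slightly perturbed copy.
rng = np.random.default_rng(1)
period = np.linspace(1.0, 10.0, 500)
power = rng.random(500)
same = compare_power(period, power, power.copy())
perturbed = compare_power(period, power, power + 1e-9)
print(same["bit_exact"], perturbed["bit_exact"])
```

A bit-exact result is the strictest outcome: np.array_equal must hold and the maximum absolute difference must be exactly 0, which a merely well-correlated result would not satisfy.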

How do I install the astropy speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("astropy"). The patch applies automatically the next time you call astropy.timeseries.BoxLeastSquares.autopower.
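A short usage sketch, assuming autozyme was installed with pip as described: autozyme.activate("astropy") is taken from the answer above, the BoxLeastSquares calls are the standard astropy API, and make_light_curve and run_search are hypothetical helpers with synthetic data. The package imports are deferred into run_search so the helper definitions load even where the packages are absent.

```python
import numpy as np


def make_light_curve(n=2000, period=3.0, depth=0.01, duration=0.2, seed=0):
    """Synthetic box-shaped transit light curve (illustrative data only)."""
    rng = np.random.default_rng(seed)
    t = np.sort(rng.uniform(0.0, 30.0, n))
    y = 1.0 + 1e-3 * rng.standard_normal(n)
    in_transit = (t % period) < duration
    y[in_transit] -= depth
    return t, y


def run_search(t, y, duration=0.2):
    """Activate the patch, then call BoxLeastSquares exactly as before."""
    import autozyme
    from astropy.timeseries import BoxLeastSquares

    autozyme.activate("astropy")
    model = BoxLeastSquares(t, y)
    # method="fast" is the only patched path; method="slow" stays upstream.
    results = model.autopower(duration, method="fast")
    return results.period[np.argmax(results.power)]


t, y = make_light_curve()
# best_period = run_search(t, y)  # requires autozyme and astropy
```

Per the supported scope, a grid shorter than 32768 trial periods or a single-thread environment runs the upstream kernel unchanged, so a small example like this one may see no speedup.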