Speed up Astropy BoxLeastSquares: up to 6.69× faster, identical output

Benchmark charts

Speedup distribution

Each dot is one finalized dataset/thread run on Windows.

kepler11_q3_q4_months…	6.69×
kepler9_q2_full_sc	6.51×
kepler8_q2_full_sc	6.30×
kepler10_q3_months1_2…	6.17×
kepler_tres2_q2_month…	5.78×

Thread sweep

Speedup across finalized thread counts on Windows.

Memory

Baseline vs optimized peak memory on Windows.

What is accelerated

The public API stays the same; AutoZyme replaces only the supported fast path.

The target is astropy.timeseries.BoxLeastSquares.autopower in astropy. Every benchmarked run passes the declared scientific output gate while reducing runtime on the listed datasets.

Also searched as: BLS, box least squares, periodogram, transit search, exoplanet, period search.

Supported scope

The patch replaces the native kernel astropy.timeseries.periodograms.bls.methods.bls_fast (the hot function under BoxLeastSquares.autopower/.power for method='fast'). It is an embarrassingly-parallel split of the upstream bls_fast over the trial-period axis: it slices period[chunk] into n_workers disjoint contiguous ranges, calls the captured upstream _orig_bls_fast (bls_impl) on each slice with the SAME (t, y, ivar, duration, oversample, use_likelihood) it received, then re-assembles the 7-tuple result by field-wise np.concatenate in original period order.

Because each trial period's BLS computation in bls_impl is independent of every other period, this reproduces upstream bit-exactly (speedups_finalized.tsv shows pearson_power=1.0, q99_rel_diff_power=0.0, rel_peak_period_err=0.0, top10_peak_jaccard=1.0).

Crucially, the patch receives oversample and use_likelihood as already-resolved arguments and passes them through unchanged, so it correctly handles ANY objective ('likelihood' or 'snr'), ANY oversample>=1, ANY duration array, and any period grid produced by autopower/autoperiod (any minimum_n_transit/minimum_period/maximum_period/frequency_factor).

method='slow' is NOT patched (only bls_fast is in targets=), so method='slow' transparently uses unmodified upstream bls_slow.

The parallel path only engages when len(period) >= 2*16384 (32768) AND the resolved worker count (_thread_count) >= 2; otherwise (single-thread env or small grid) it returns the original bls_fast result unchanged.

Threads resolve from ZYME_THREADS/AUTOZYME_THREADS/AUTOZYMER_THREADS/OMP_NUM_THREADS, else os.cpu_count()+6.
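The split described above can be sketched in a few lines. This is a minimal illustration, not AutoZyme's actual code: toy_kernel is a hypothetical stand-in for the upstream kernel (it returns a 2-tuple rather than the real 7-tuple), resolve_threads and split_over_periods are made-up names, the use of a thread pool is an assumption, and so is the idea that the environment variables are checked in the order they are listed.

```python
import os
from concurrent.futures import ThreadPoolExecutor

import numpy as np

MIN_PERIODS = 2 * 16384  # grid-size threshold stated in the text (32768)


def resolve_threads(env=os.environ):
    """Hypothetical helper: first positive integer among the listed variables.

    Checking them in listed order is an assumption, not a documented rule.
    """
    for name in ("ZYME_THREADS", "AUTOZYME_THREADS",
                 "AUTOZYMER_THREADS", "OMP_NUM_THREADS"):
        value = env.get(name, "").strip()
        if value.isdigit() and int(value) > 0:
            return int(value)
    return (os.cpu_count() or 1) + 6  # fallback as described in the text


def toy_kernel(t, y, ivar, period):
    """Stand-in kernel: every trial period is computed independently.

    Returns a tuple of per-period arrays, like the real kernel's result fields.
    """
    stat = np.array([np.sum(ivar * y * np.cos(2 * np.pi * t / p)) for p in period])
    return period.copy(), stat


def split_over_periods(kernel, t, y, ivar, period, n_workers,
                       min_periods=MIN_PERIODS):
    """Run `kernel` on contiguous period slices and concatenate field-wise."""
    # Small grid or single worker: return the unsplit result unchanged.
    if n_workers < 2 or len(period) < min_periods:
        return kernel(t, y, ivar, period)
    # Disjoint contiguous slices, in original period order.
    chunks = np.array_split(period, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map preserves chunk order, so concatenation restores the grid.
        parts = list(pool.map(lambda p: kernel(t, y, ivar, p), chunks))
    # zip(*parts) groups the same field from every chunk together.
    return tuple(np.concatenate(field) for field in zip(*parts))
```

Because each slice sees the same t, y, and ivar and only the period slice differs, comparing the split result against a single kernel call with np.array_equal is a direct way to confirm that chunking did not change any value.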

Out-of-scope behavior

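Silent fallback to upstream: calls outside the supported scope (method='slow', a period grid shorter than 32768, or a resolved worker count below 2) run the unmodified upstream code.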

Detailed speedup table (9 runs)

Dataset	Tier	Platform	Threads	Baseline	Optimized	Speedup	Memory	Concordance	Pass
`kepler_tres2_q2_month1_sc`	small	Windows	8	28.73 s	5.02 s	5.78×	0.1 → 0.1 GB	—	pass
`kepler10_q3_months1_2_sc`	medium	Windows	8	2.30 min	22.38 s	6.17×	0.1 → 0.2 GB	—	pass
`kepler11_q3_q4_months1_5_sc`	ood_xlarge	Windows	8	16.68 min	2.74 min	6.69×	0.2 → 0.3 GB	—	pass
`kepler8_q2_full_sc`	ood_large	Windows	8	5.30 min	52.22 s	6.30×	0.1 → 0.2 GB	—	pass
`kepler9_q2_full_sc`	large	Windows	8	4.95 min	46.35 s	6.51×	0.1 → 0.2 GB	—	pass
`kepler_tres2_q2_month1_sc`	small	macOS	8	27.77 s	3.64 s	7.64×	0.1 → 0.1 GB	—	pass
`kepler10_q3_months1_2_sc`	medium	macOS	8	2.58 min	19.64 s	7.84×	0.1 → 0.2 GB	—	pass
`kepler8_q2_full_sc`	ood_large	macOS	4	5.44 min	1.41 min	3.89×	0.1 → 0.2 GB	—	pass
`kepler9_q2_full_sc`	large	macOS	8	5.22 min	39.73 s	7.79×	0.1 → 0.2 GB	—	pass

Frequently asked questions

Speeding up Astropy BoxLeastSquares

Why is Astropy BoxLeastSquares slow?

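Astropy BoxLeastSquares is CPU-bound. The kernel behind method='fast', bls_fast, evaluates every trial period in the grid, so runtime grows with the size of the period grid. Each trial period is independent of the others, which means the grid can be divided across workers. On the largest Windows benchmark (kepler11_q3_q4_months1_5_sc, 8 threads) the original takes 16.68 min where the AutoZyme path takes 2.74 min (6.69× faster).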

How do I make Astropy BoxLeastSquares faster?

Install AutoZyme and activate the astropy patch, then keep using Astropy BoxLeastSquares exactly as before. AutoZyme transparently substitutes the faster, output-validated path, with no pipeline or API changes. On the benchmark datasets the speedup reaches 6.69× on Windows and 7.84× on macOS.

Does the AutoZyme speedup change the Astropy BoxLeastSquares output?

No. The accelerated path returns bit-for-bit identical results to the original astropy implementation (maximum absolute difference 0), checked by a frozen concordance gate on every benchmark dataset.
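A check of this kind can be sketched as follows. The metric names come from speedups_finalized.tsv as quoted in the supported scope, but the formulas here are assumptions, since the gate's actual definitions are not given on this page; the concordance function itself is hypothetical, and the top-10 overlap uses the ten highest power samples rather than detected peaks.

```python
import numpy as np


def concordance(period, power_ref, power_new, top_k=10):
    """Compare two periodogram power arrays on the same period grid.

    Illustrative definitions only; the real gate may compute these differently.
    """
    abs_diff = np.abs(power_new - power_ref)
    # Guard the denominator so zero reference power does not divide by zero.
    rel_diff = abs_diff / np.maximum(np.abs(power_ref), np.finfo(float).tiny)
    # Indices of the top_k largest power values in each array.
    top_ref = set(np.argsort(power_ref)[-top_k:].tolist())
    top_new = set(np.argsort(power_new)[-top_k:].tolist())
    peak_ref = period[np.argmax(power_ref)]
    peak_new = period[np.argmax(power_new)]
    return {
        "bit_identical": bool(np.array_equal(power_ref, power_new)),
        "max_abs_diff": float(abs_diff.max()),
        "pearson_power": float(np.corrcoef(power_ref, power_new)[0, 1]),
        "q99_rel_diff_power": float(np.quantile(rel_diff, 0.99)),
        "rel_peak_period_err": float(abs(peak_new - peak_ref) / peak_ref),
        "top10_peak_jaccard": len(top_ref & top_new) / len(top_ref | top_new),
    }
```

For a bit-identical result, np.array_equal alone is the strictest test; the remaining metrics matter mainly for paths that are allowed small numerical differences.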

How do I install the astropy speedup?

From a shell, run pip install autozyme. Then, in Python, import autozyme and call autozyme.activate("astropy"). The patch applies automatically the next time you call astropy.timeseries.BoxLeastSquares.autopower.