Observable Universe

The Big Picture

As we enter a new era of astronomical exploration, telescopes like the Atacama Large Millimeter/submillimeter Array (ALMA) are undergoing major upgrades such as the Wideband Sensitivity Upgrade (WSU), and the upcoming next-generation Very Large Array (ngVLA) will soon begin collecting data at an unprecedented scale. While these advances let us see the universe in more detail than ever before, they also produce a "data deluge": a flood of information so massive and complex that human scientists cannot sort through it manually. Our working group is building the bridge between these raw signals from space and the scientific discoveries they hold.

We are developing advanced AI models that can automatically clean and prepare this raw data as it arrives from the telescope. By “teaching” computers to distinguish meaningful cosmic signals from random background noise, we can accelerate the process of turning raw radio waves into clear, science-ready images and maps. This automation ensures that astronomers spend less time fixing technical data issues and more time answering fundamental questions about how stars and galaxies form.

These AI advancements will eventually be part of CosmicCoder, a virtual research assistant designed to help astronomers navigate these vast digital archives. By making data more accessible and easier to understand, we aim to make the next generation of great discoveries faster and more widely shared.

Research Overview

The Observable Universe working group is establishing a "data-to-discovery" pipeline for radio astronomy, specifically designed to address the challenges posed by high-dimensional, low signal-to-noise datasets. Modern radio telescopes generate massive volumes of raw data that must undergo an elaborate and computationally intensive process of calibration and reduction, including removing interference and correcting for atmospheric effects and time-varying antenna gains, before the data become science-ready. This raw data is inherently multidimensional, and once processed, it yields massive hyperspectral cubes: three-dimensional data structures where two dimensions represent the sky's coordinates and the third represents frequency (or velocity). As next-generation facilities like the ngVLA and the ALMA WSU come online, our goal is to support astronomical data processing by transitioning from largely manual supervision to foundational AI automation.
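
For intuition, a hyperspectral cube can be pictured as a plain 3-D array. The sketch below is a toy, noise-only stand-in; all sizes and names are illustrative, not drawn from any real dataset:

```python
import numpy as np

# Toy, noise-only stand-in for a hyperspectral cube; all sizes are
# illustrative. Axes: (RA pixels, Dec pixels, frequency channels).
n_ra, n_dec, n_chan = 64, 64, 128
rng = np.random.default_rng(0)
cube = rng.normal(0.0, 1.0, size=(n_ra, n_dec, n_chan))

# A spectrum is the 1-D slice through frequency at one sky position;
# a channel map is the 2-D sky image at one frequency.
spectrum = cube[32, 32, :]     # shape (128,)
channel_map = cube[:, :, 64]   # shape (64, 64)
```

Real cubes differ mainly in scale: next-generation instruments produce many thousands of channels and far larger sky grids, which is what makes manual inspection impractical.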

The first focus of our research is developing new statistical methods for anomaly detection in raw telescope data. In this context, anomaly detection refers to identifying problematic data—such as Radio-Frequency Interference (RFI), hardware malfunctions, or calibration errors—that can corrupt the final image. We are working toward AI models designed to flag these "anomalies" automatically, pushing the boundaries of what can be recovered from datasets where the signal is nearly buried. By isolating and mitigating these data-quality issues at the source, we aim to fully automate the radio data processing pipeline, ensuring that only high-fidelity signals reach the final archive.
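
As a minimal illustration of the idea (not our production pipeline), a robust statistic such as the median absolute deviation (MAD) can flag RFI-like spikes without the spikes themselves corrupting the noise estimate. All values here are synthetic:

```python
import numpy as np

def mad_flags(x, threshold=5.0):
    """Flag samples whose robust z-score exceeds `threshold`.

    The MAD-based noise estimate is insensitive to a handful of
    strong RFI spikes, unlike a plain standard deviation.
    """
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    sigma = 1.4826 * mad  # scale MAD to a Gaussian sigma
    return np.abs(x - med) > threshold * sigma

rng = np.random.default_rng(1)
signal = rng.normal(0.0, 1.0, 1024)
signal[[100, 500, 900]] += 50.0  # inject three RFI-like spikes
flags = mad_flags(signal)
```

The production methods are considerably more sophisticated, but the principle is the same: estimate what "normal" looks like robustly, then flag departures from it.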

Our second focus is segmentation and identification within these high-dimensional cubes. Using deep learning, we intend to automatically isolate and "mask" regions of scientific interest—such as rotating proto-stellar disks or distant evolving galaxies. Segmentation is the process of partitioning a digital image (or a 3D volume) into multiple segments to simplify its representation. Our objective is to train an AI model to accurately trace the boundaries of these complex structures, providing researchers with precisely defined targets for follow-up analysis without the need for manual inspection.
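
A crude baseline for such masking, purely for illustration (the project itself uses learned deep models rather than fixed thresholds), is a robust sigma-cut on a single channel map with synthetic data:

```python
import numpy as np

def threshold_mask(image, nsigma=4.0):
    """Crude emission mask: pixels more than `nsigma` noise levels
    above the median, with the noise estimated robustly via the MAD."""
    med = np.median(image)
    sigma = 1.4826 * np.median(np.abs(image - med))
    return image > med + nsigma * sigma

rng = np.random.default_rng(2)
image = rng.normal(0.0, 1.0, (128, 128))
image[40:60, 40:60] += 8.0  # a bright square "source"
mask = threshold_mask(image)
```

Simple thresholds like this break down for faint, extended, or spectrally complex emission, which is precisely where learned segmentation models are expected to help.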

A significant offshoot of these efforts will be the automated generation of high-quality training data for community AI/ML applications. Historically, the "bottleneck" in astronomical AI has been the lack of "ground truth" labels—the human-verified examples needed to train neural networks. As our detection and segmentation algorithms begin to process the archives, they are expected to naturally produce a vast, self-consistent library of labeled data. This secondary outcome will abstract complex raw data into manageable, pre-labeled categories, creating a foundational resource for the broader community to train their own specialized models.

Finally, these research outputs are planned for integration into CosmicCoder, a virtual research assistant. By feeding our future automated labels and segmented data into a large language model (LLM) framework, we aim to allow researchers to query archives using natural language. This would transform the global scientific community's interaction with radio archives from a manual search process into an intuitive, AI-guided exploration.

Projects

Astronomy-Related

  • Lead Scientists: Brian Mason, Jeff M. Phillips

    During the data acquisition of interferometric radio observations, bandpass calibration normalizes the instrumental and atmospheric frequency response by computing each antenna's response as a function of observing frequency (the resulting curves are referred to as calibration solutions). The reliability of these solutions is currently assessed using bespoke, expert-crafted heuristics implemented in the ALMA pipeline. To improve diagnostic performance and better guide data reduction actions, we are developing a supervised classifier for detecting anomalies in the bandpass solutions. The project uses features derived from kernel-based scan statistics that characterize issues such as bandpass platforming. The model is trained on a dataset of ~100,000 bandpass solutions from ALMA Cycle 9 observations, with historical data flags from human experts serving as training labels. As part of the evaluation, we distinguish between anomalies in bandpass solutions that have a consequential impact on the calibrated visibilities and those that do not.

  • Lead Scientists: Brian Mason, El Kindi Rezig

    Our work focuses on understanding and improving anomaly detection by identifying label outliers in calibration data. In practice, some signals are marked as anomalous even when their values do not align with typical anomaly patterns, suggesting inconsistencies in the labeling rather than data quality problems. Rather than analyzing signals individually, this work looks for agreement within groups of signals, where signal patterns that are similar are expected to share the same label. We are working towards a framework to surface these inconsistencies and enable closer examination of the affected signals, with the goal of improving the overall reliability and robustness of anomaly detection.

  • Lead Scientist: Omkar Bait

    This research explores the radio continuum properties of low-redshift galaxies to identify the physical mechanisms driving the escape of ionizing Lyman continuum (LyC) photons, a process essential for understanding cosmic reionization. By analyzing multi-frequency radio observations (1–15 GHz) from the Low-redshift LyC Survey (LzLCS) and a sample of metal-poor extreme starburst galaxies (xSFGs), the studies establish a link between radio spectral indices and LyC escape fractions (f_esc).

    Strong LyC leakers appear to exhibit flat radio spectra, indicating a dominance of thermal free-free emission and young stellar populations (under 5 million years old). In contrast, non-leakers show steeper spectra associated with older starbursts and cosmic-ray aging. The findings suggest that extreme star-forming environments, characterized by dense gas and young massive star clusters, facilitate LyC leakage, possibly before strong supernova (SN) feedback clears the interstellar medium. However, the exact roles of SN feedback, cosmic rays, and magnetic fields in facilitating (or suppressing) the escape of LyC photons, as well as the extreme nature of these sources, remain to be studied.
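
The spectral-index diagnostic in the LyC project above reduces to a power-law slope between two frequencies, with S_nu proportional to nu**alpha. A minimal sketch with made-up flux values (the fluxes and frequencies below are illustrative, not survey measurements):

```python
import numpy as np

def spectral_index(s1, nu1, s2, nu2):
    """Radio spectral index alpha, assuming S_nu ∝ nu**alpha.
    Values near -0.1 indicate flat, thermal free-free dominated
    spectra; values near -0.8 indicate steep synchrotron spectra."""
    return np.log(s2 / s1) / np.log(nu2 / nu1)

# Hypothetical fluxes (mJy) at 1.5 and 15 GHz, for illustration only.
alpha_flat = spectral_index(1.00, 1.5, 0.79, 15.0)   # ≈ -0.10
alpha_steep = spectral_index(1.00, 1.5, 0.16, 15.0)  # ≈ -0.80
```

In practice the studies fit multi-frequency data rather than two points, but the two-point slope captures the flat-versus-steep distinction the text describes.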

AI-Related

  • Lead Scientists: Omkar Bait, Brian Mason, Ryan Loomis, Ci Xue

    Modern radio telescopes generate massive volumes of raw multi-dimensional data that must undergo extensive calibration and cleaning before becoming science-ready. To expedite this, we are developing Variational Auto-Encoders (VAEs) to learn the underlying distribution of "good" calibrations. By training on vast archives of expert-verified ALMA calibrated visibility data, the VAE learns to map the characteristics of high-fidelity signals into a compressed latent representation.

    This approach enables robust anomaly detection by training a classifier that uses both reconstruction error and latent-space outliers. By capturing the nuances of expert heuristics at scale, this framework provides an adaptive diagnostic tool for future instruments like the ALMA WSU and ngVLA. Ultimately, this will automate the identification of problematic calibration tables and corrupted visibilities, ensuring high-integrity data with minimal human intervention.

  • Lead Scientists: Ziad Al-Halah, Ryan Loomis

    Reconstructing an image from discretely sampled interferometric data involves identifying astronomical emission components or regions within hyperspectral data cubes. A mask defines the spatial and spectral regions used during deconvolution, and accurate masking prevents unnecessary processing of signal-free regions while enabling deeper, more reliable reconstruction of real signals. Mask generation is challenged by data sparsity and limited labels, due to the high-dimensional (both spatial and spectral) structure of these data cubes. This project is adapting pre-trained computer vision segmentation models for hyperspectral astronomical signal detection through transfer learning, leveraging self-supervised feature extraction methods and extensive image and video datasets to mitigate annotation challenges. The aim is to provide better mask quality, faster masking, and fewer major deconvolution cycles, which dominate overall runtime in the current ALMA imaging pipeline. (This project was awarded CosmicAI 2025 Seed Funding.)
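
The reconstruction-error scoring at the heart of the VAE project above can be illustrated with a linear (PCA) stand-in: learn a low-dimensional latent space from "good" examples, then score new examples by how poorly that space reconstructs them. This is a deliberate simplification of the learned VAE, and all data below are synthetic:

```python
import numpy as np

rng = np.random.default_rng(5)

# "Good" calibration vectors: noisy copies of one smooth template.
template = np.sin(np.linspace(0, np.pi, 64))
good = template + 0.05 * rng.normal(size=(500, 64))

# Linear "encoder/decoder": a 4-dimensional PCA basis fit to the
# good examples (standing in for the VAE's learned latent space).
mean = good.mean(axis=0)
_, _, vt = np.linalg.svd(good - mean, full_matrices=False)
basis = vt[:4]

def recon_error(x):
    z = (x - mean) @ basis.T          # encode into the latent space
    x_hat = mean + z @ basis          # decode back to the data space
    return np.linalg.norm(x - x_hat)  # anomaly score

normal = template + 0.05 * rng.normal(size=64)
corrupt = normal.copy()
corrupt[20:30] += 1.0                 # platform-like corruption
# recon_error(corrupt) greatly exceeds recon_error(normal)
```

A VAE replaces the linear basis with a learned nonlinear one and adds a probabilistic latent space, which is what enables the combined reconstruction-error and latent-outlier scoring described above.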

Publications/Works in Progress

Robust High-Dimensional Mean Estimation With Low Data Size, an Empirical Study

Anderson, C. and Phillips, J. M.

Transactions on Machine Learning Research, February 2025

A Topology-Preserving Coreset for Kernel Regression in Scientific Visualization

Lyu, W.; Gorski, N.; Phillips, J. M.; and Wang, B.

IEEE PacificVis 2026 (TVCG track), April 2026

Efficient and Stable Multi-Dimensional Kolmogorov-Smirnov Distance

Jacobs, P. M.; Namjoo, F.; and Phillips, J. M.

April 2025

Team

  • Arya Farahi

    Faculty Researcher
    UT Austin

  • Jeff Phillips

    Faculty Researcher
    University of Utah

  • Brian Mason

    Senior Researcher
    NRAO

  • Ryan Loomis

    Senior Researcher
    NRAO

  • El Kindi Rezig

    Faculty Researcher
    University of Utah

  • Ziad Al-Halah

    Faculty Researcher
    University of Utah

  • Omkar Bait

    Postdoctoral Scholar
    NRAO

  • Ci Xue

    Postdoctoral Scholar
    NRAO

  • Josh Taylor

    Postdoctoral Scholar
    UT Austin

  • Gazi Rakib

    Graduate Student
    University of Utah

  • Rabeya Hossain

    Graduate Student
    University of Utah

  • Nikesh Subedi

    Graduate Student
    University of Utah