Explorable Universe

The Big Picture

Astronomers face a growing challenge: modern telescopes generate massive amounts of data that must be carefully analyzed, visualized, and interpreted to unlock scientific discoveries. Creating high-quality visualizations and statistical models from this data typically requires years of specialized training and knowledge of astronomy-specific standards. While AI assistants can help, how can they effectively collaborate with scientists? And how can we make such systems run efficiently on massive amounts of data?

The Explorable Working Group is developing better AI tools that are more useful for expert astronomers and efficient enough to run at scale. Our AI systems create, critique, and improve their work through multiple rounds of refinement, much as a scientist would approach a problem. Since astronomy research often requires large amounts of data, we are developing methods to train and deploy large AI models more efficiently; our techniques reduce training data requirements and speed up both training and inference. Finally, for AI assistants to help astronomers conduct research, astronomers need to be able to trust the models' output, so we have created frameworks to quantify uncertainty in AI outputs and have examined how human-AI collaboration reshapes scientific creativity.

By automating time-consuming tasks with efficient, trustworthy systems, our work helps astronomers spend less time on manual data processing and more time making discoveries about the universe.

Research Overview

The Explorable Working Group is developing efficient AI-driven frameworks to accelerate scientific discovery in astronomy by using code assistants to automate key components of the research workflow: data processing, visualization, model building, and scientific literature review. High-quality scientific visualizations and robust statistical models are central to interpreting astronomical observations, yet traditional AI assistants often produce outputs that do not adhere to field-specific standards or that poorly reflect underlying phenomena, owing to a lack of specialized domain knowledge. The Explorable WG addresses these challenges by leveraging Vision-Language Model (VLM)-based AI agents, backed by expert scientific reasoning. Given the scale of astronomical datasets and computational demands of modern foundation models, the WG also develops methods for efficient training, inference, and deployment that enable these AI systems to operate at scale.

Exploratory Coding Agents for Processing, Visualizing, and Modeling Data

We are developing CosmicCoder, an AI coding assistant designed to accelerate scientific discovery in astronomy through robust multimodal reasoning and autonomous environment exploration. CosmicCoder will push the boundaries of generative AI for scientific research by: performing automated model discovery; integrating exploration policies with the model’s own uncertainty estimates; using tools to better handle the underlying data under increasingly realistic observational conditions; and improving its ability to understand scientific queries and assess their scientific utility. We are developing evaluation benchmarks both to validate these advances and to encourage broader community involvement. Our vision is for CosmicCoder to serve as a sophisticated partner, capable of navigating the high-cost "exploration" of scientific coding to surface new research directions in data-intensive domains.

Efficient Training and Inference for AI Models

Given the scale of astronomical datasets and computational demands of modern foundation models, the WG develops methods to improve efficiency across training, inference, and deployment of AI systems.

First, high-quality training data is crucial for enabling CosmicCoder to reason about its (multimodal) generations and accelerate scientific discovery. To this end, we are developing methods to generate high-quality data that makes training CosmicCoder efficient and its outputs accurate. This includes generating targeted (multimodal) synthetic data that teaches the model to reason efficiently, and using reinforcement learning to further improve its generalizability across scientific tasks. We are also theoretically analyzing and identifying the characteristics of high-quality data that enable effective reasoning and test-time scaling.

For inference and deployment, we focus on efficient job scheduling and resource management to better support large-scale inference workloads. By designing schedulers that intelligently allocate incoming requests across model instances, we aim to reduce query latency, increase throughput, and improve overall system responsiveness. This work enables more timely and cost-effective scientific question answering for astronomy researchers and users.
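The core load-balancing idea can be illustrated with a least-loaded dispatcher: each incoming request goes to the model instance with the smallest amount of queued work. This is a minimal sketch of the general technique, not our production scheduler; the instance names and per-request cost estimates below are hypothetical.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Instance:
    pending_work: float                  # estimated seconds of queued work
    name: str = field(compare=False)     # ignored when comparing instances

class LeastLoadedScheduler:
    """Route each request to the instance with the least outstanding work."""

    def __init__(self, instance_names):
        self.heap = [Instance(0.0, n) for n in instance_names]
        heapq.heapify(self.heap)

    def dispatch(self, estimated_cost: float) -> str:
        inst = heapq.heappop(self.heap)      # currently least-loaded instance
        inst.pending_work += estimated_cost  # account for the new request
        heapq.heappush(self.heap, inst)
        return inst.name

# Four requests with different estimated costs spread across two instances:
# the expensive first job goes to one instance, the rest drain to the other.
sched = LeastLoadedScheduler(["gpu-0", "gpu-1"])
assignments = [sched.dispatch(cost) for cost in [3.0, 1.0, 1.0, 2.0]]
print(assignments)
```

Real schedulers like the ones we study also account for workload classes, batching, and KV-cache placement, but the same queue-depth accounting is the starting point.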

Trustworthiness and Interpretability of AI Systems

As AI systems integrate into scientific workflows, ensuring their reliability and understanding their behavior becomes essential. Our work addresses three critical aspects: understanding failure modes, quantifying uncertainty, and examining human-AI collaboration dynamics. We have identified attention heads in VLMs responsible for prompt-induced hallucinations and developed mitigation strategies. To support reliable deployment, we created a framework that predicts continuous-valued metrics with calibrated uncertainty intervals for free-form generation tasks, enabling systems to quantify confidence across applications, from code generation to scientific summarization. Beyond technical reliability, we examined sociotechnical implications by studying LLM-supported research ideation, revealing how automation reshapes creativity. Ownership becomes a negotiation between humans and AI, while human effort shifts from ideation to verification. These efforts advance both technical foundations and responsible deployment of AI in scientific discovery.
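One standard recipe for turning a point predictor of evaluation scores into calibrated intervals is split conformal prediction: residuals on a held-out calibration set determine an interval width with a target coverage rate. The sketch below uses synthetic scores and illustrates the generic technique, not the specific framework described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: a "quality predictor" emits point estimates of an
# evaluation score; the true scores differ from the estimates by noise.
n_cal, n_test = 500, 200
pred_cal = rng.uniform(0, 1, n_cal)
true_cal = pred_cal + rng.normal(0, 0.1, n_cal)
pred_test = rng.uniform(0, 1, n_test)
true_test = pred_test + rng.normal(0, 0.1, n_test)

# Split conformal: the (1 - alpha) quantile of calibration residuals
# sizes an interval that covers the true score ~90% of the time.
alpha = 0.1
residuals = np.abs(true_cal - pred_cal)
q = np.quantile(residuals, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

lower, upper = pred_test - q, pred_test + q
coverage = np.mean((true_test >= lower) & (true_test <= upper))
print(f"empirical coverage: {coverage:.2f}")  # close to the 0.90 target
```

The appeal of this family of methods is that the coverage guarantee holds without assumptions about the predictor itself, which matters when the downstream task is free-form generation.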

Use-inspired Research

Use-inspired research within the Explorable Universe working group leverages advanced machine learning to extract physical insights from the next generation of large astronomical datasets. One focus area targets galaxy evolution using foundation models and probabilistic autoencoders applied to massively multiplexed spectroscopic surveys such as the Sloan Digital Sky Survey (SDSS) and the Dark Energy Spectroscopic Instrument (DESI) survey. These efforts probe the structure of learned latent spaces, revealing interpretable axes linked to key evolutionary processes such as galaxy quenching, while also uncovering unexpected correlations, including spectral signatures that encode information typically associated with galaxy morphology in imaging data. In parallel, we are developing robust ML frameworks to refine DESI redshift measurements for millions of spectra, enabling automated identification of pipeline failures and systematic uncertainties, as well as the discovery of rare or anomalous astrophysical sources. Complementing these approaches, we are building a new multi-modal foundation model that jointly learns from DESI spectra and Legacy Surveys imaging, enabling unified representations across data modalities. Together, these projects establish a path toward AI-driven models that simultaneously capture intrinsic astrophysical processes (e.g., governing the co-evolution of galaxies and their central supermassive black holes) while accounting for observational selection effects and instrumental signatures embedded in the data. This research will both inform and demonstrate the capabilities of the emerging CosmicAI Data Platform, including integrated AI assistants such as CosmicCoder, designed to make these complex analyses accessible, scalable, and reproducible for the broader community.

Collectively, our research emphasizes exploratory scientific reasoning: we use AI coding agents to automate astronomical research and develop methods for efficiently training, deploying, and interpreting large-scale AI assistants. By synthesizing textual and visual modalities, leveraging domain-specific feedback, and improving the efficiency of training and inference pipelines, the Explorable WG aims to develop generalizable tools that reduce manual effort, improve the quality of scientific outputs, and accelerate discovery in high-volume astronomical research environments.

Publications/Works in Progress

AstroVisBench: A Code Benchmark for Scientific Computing and Visualization in Astronomy (NeurIPS 2025)

The first benchmark for scientific computing and visualization in the astronomy domain, AstroVisBench, judges language models’ capabilities in producing astronomy-specific data processing and visualization workflows.

Website | arXiv

Understanding the Role of Training Data in Test-Time Scaling (submitted)

Test-time scaling improves LLM reasoning through longer chains-of-thought, but effectiveness depends on training data diversity, task relevance, and whether required skills are sufficiently present.

arXiv

AdaGen: Workload-Adaptive Cluster Scheduler for Latency-Optimal LLM Inference Serving (EuroSys 2026)

AdaGen is a workload-adaptive LLM cluster scheduler that optimizes compute layouts via classification, load balancing, and selective distribution, improving SLO attainment by 3.6× and cost-efficiency by 2×.

Who Owns Creativity and Who Does the Work? Trade-offs in LLM-Supported Research Ideation (under review)

LLM agents reshape scientific creativity by shifting human effort from ideation to verification, with ownership becoming negotiated rather than inherent, requiring designs emphasizing researcher empowerment.

arXiv

LoRA is All You Need For Safety Alignment of Reasoning LLMs (under review)

LoRA-based safety alignment preserves reasoning abilities by restricting weight updates to low-rank spaces, achieving strong safety without the "Safety Tax" at minimal computational cost.

arXiv
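The low-rank restriction behind LoRA can be sketched in a few lines: the pretrained weight stays frozen, and only two small factors are trained, so the effective weight update is confined to a low-rank subspace. This is a generic NumPy illustration of the technique, with dimensions and initialization chosen for the example rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
d_in, d_out, rank = 64, 64, 4

# Frozen pretrained weight: LoRA never updates W directly.
W = rng.normal(size=(d_out, d_in))

# Trainable low-rank factors; B starts at zero so training begins
# from the unmodified base model.
A = rng.normal(scale=0.01, size=(rank, d_in))
B = np.zeros((d_out, rank))

def lora_forward(x):
    # Base output plus a correction confined to a rank-`rank` subspace.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
assert np.allclose(lora_forward(x), W @ x)  # zero-init B: identical to base

# After training B is nonzero, but the weight update B @ A can never
# exceed rank `rank`; this constraint is what limits drift from the base.
B = rng.normal(size=(d_out, rank))
update_rank = np.linalg.matrix_rank(B @ A)
print(update_rank)
```

Because the update touches only `(d_in + d_out) * rank` parameters instead of `d_in * d_out`, the same construction is also what keeps the alignment step computationally cheap.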

PIE: Performance Interval Estimation for Free-Form Generation Tasks (under review)

We introduce Performance Interval Estimation (PIE), which predicts continuous evaluation scores with calibrated uncertainty for generative tasks, and show regression-based methods outperform LLM-as-judge across 11 datasets.

arXiv

Data Selection for Fine-Tuning Vision Language Models via Cross Modal Alignment Trajectories (under review)

XMAS enables data-efficient LVLM training by clustering examples with similar cross-modal attention patterns, removing redundancy while preserving performance and reducing datasets by up to 85%.

arXiv

Mechanisms of Prompt-Induced Hallucination in Vision–Language Models (under review)

Vision-language models hallucinate by favoring prompts over visual evidence. Ablating specific attention heads reduces prompt-induced hallucinations by 40% without training, revealing model-specific mechanisms.

arXiv

Team

  • Jessy Li

    AI Lead
    UT Austin

  • Stephanie Juneau

    Astro Lead
    NOIRLab

  • Matt Lease

    AI Lead
    UT Austin

  • Adam Bolton

    Astro Lead
    SLAC Stanford

  • Greg Durrett

    Faculty Researcher
    NYU

  • Haiying Shen

    Faculty Researcher
    University of Virginia

  • Baharan Mirzasoleiman

    Faculty Researcher
    UCLA

  • Kyle Mahowald

    Faculty Researcher
    UT Austin

  • Nicholas Tomlin

    Faculty Researcher
    NYU

  • Robert Nikutta

    Scientist
    NOIRLab

  • William Rudman

    Postdoctoral Scholar
    UT Austin

  • Sanjana Gautam

    Postdoctoral Scholar
    UT Austin

  • Sebastian Joseph

    Graduate Student
    UT Austin

  • Yating Wu

    Graduate Student
    UT Austin

  • Wenxuan Ding

    Graduate Student
    UT Austin

  • Houjiang Liu

    Graduate Student
    UT Austin

  • Yiheng (Sam) Su

    Graduate Student
    UT Austin

  • Ethan Hsu

    Graduate Student
    UT Austin

  • Yufeng Luo

    Graduate Student
    University of Wyoming

  • Andy Morgan

    Undergraduate Student
    NOIRLab

  • Abhishek Divekar

    Applied Scientist
    Amazon

  • Kyle Lo

    Research Scientist
    Allen Institute for AI

  • Kanishk Jain

    Data Scientist
    Procter & Gamble