From Visualization to Knowledge Discovery

PASC 2026

Bernd Doser & Sebastian Trujillo-Gomez

Heidelberg Institute for Theoretical Studies (HITS)

Motivation

  • The amount of astronomical data is growing exponentially

    • Exascale cosmological simulations produce petabytes of data
    • Surveys (e.g. Euclid, Rubin) generate terabytes of data daily
  • Machine learning methods are needed to explore data at this scale and to extract knowledge from it

  • Our goal: To develop modular open-source tools for self-supervised knowledge discovery and interactive visualization of large-scale cosmological data

Data Acquisition

  • PEST preprocesses universal cosmological simulation data into multi-channel images, data cubes, and point clouds

  • ETL (Extract \(\rightarrow\) Transform \(\rightarrow\) Load) pipeline driven by YAML

  • Apache Parquet stores multi-modal data efficiently in a columnar format

  • Upload to Hugging Face or Zenodo for easy sharing and integration with ML frameworks
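
The Extract, Transform, Load flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PEST's actual interface: the config keys, function names, and toy records are all hypothetical, the `CONFIG` dict stands in for a parsed YAML file, and the column-oriented dict returned by `load` stands in for a Parquet table.

```python
# Minimal ETL sketch. All names and config keys are hypothetical;
# this is not PEST's actual API.

# Stand-in for a parsed YAML configuration file.
CONFIG = {
    "extract": {"fields": ["id", "mass", "radius"]},
    "transform": {"min_mass": 1.0},
    "load": {"columns": ["id", "mass"]},
}

# Toy records standing in for objects read from a simulation snapshot.
RAW = [
    {"id": 0, "mass": 0.5, "radius": 1.2, "extra": "x"},
    {"id": 1, "mass": 2.0, "radius": 3.4, "extra": "y"},
    {"id": 2, "mass": 5.0, "radius": 0.7, "extra": "z"},
]


def extract(records, cfg):
    """Keep only the fields requested in the config."""
    fields = cfg["fields"]
    return [{key: rec[key] for key in fields} for rec in records]


def transform(records, cfg):
    """Drop records below the configured mass threshold."""
    return [rec for rec in records if rec["mass"] >= cfg["min_mass"]]


def load(records, cfg):
    """Pivot rows into a column-oriented dict (one list per column)."""
    return {col: [rec[col] for rec in records] for col in cfg["columns"]}


def run_pipeline(records, config):
    """Run the three stages in order, each driven by its config section."""
    rows = extract(records, config["extract"])
    rows = transform(rows, config["transform"])
    return load(rows, config["load"])


table = run_pipeline(RAW, CONFIG)
```

Keeping each stage a plain function of `(records, cfg)` means the whole pipeline is described by the config alone, which is the property a YAML-driven design relies on.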

Representation Learning

  • Representation learning using a variational autoencoder

Polsterer et al. (2024); Doser et al. (2026)

  • Dimensionality reduction to a (hyper-)spherical latent space
  • Completely self-supervised - no labels required
  • Export to ONNX for interoperability with other frameworks

Why a (Hyper-)Spherical Latent Space?

  • Unbiased representation learning: The (hyper-)sphere has no preferred directions
  • Better interpolation: The (hyper-)sphere allows for smooth interpolation between points in the latent space

De Cao and Aziz (2020)
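
The interpolation argument can be made concrete with spherical linear interpolation (slerp), a standard technique for moving between two points along a great-circle arc so that every intermediate point stays on the sphere. The sketch below is an illustration only and is not taken from Spherinator's code; the function names `normalize` and `slerp` are chosen here for the example.

```python
import math


def normalize(v):
    """Project a non-zero vector onto the unit (hyper-)sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in v]


def slerp(a, b, t):
    """Spherical linear interpolation between unit vectors a and b.

    t = 0 returns a, t = 1 returns b; values in between follow the
    great-circle arc, so the result always has unit length.
    """
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    omega = math.acos(dot)  # angle between a and b
    if omega < 1e-12:
        return list(a)  # the points coincide; nothing to interpolate
    sin_omega = math.sin(omega)
    if sin_omega < 1e-12:
        raise ValueError("antipodal points have no unique great-circle arc")
    weight_a = math.sin((1.0 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]


# Works in any dimension; three dimensions are used here for readability.
start = normalize([2.0, 0.0, 0.0])
end = normalize([0.0, 3.0, 0.0])
midpoint = slerp(start, end, 0.5)
```

By contrast, a straight-line interpolation between `start` and `end` would cut through the interior of the sphere and leave the region where latent points actually live.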

How many dimensions?

Optimal latent space dimensionality selected by reconstruction quality.

Reconstruction Quality

Original IllustrisTNG SKIRT SDSS images

ResNet-18 autoencoder, 512 features: Sufficient reconstruction

ResNet-18 VAE-S\(^{128}\): Details are present, but blurry

ResNet-18 VAE-S\(^2\): Details are lost, but the overall structure is preserved

UMAP vs. Spherinator

  • UMAP is a popular dimensionality reduction technique, but it typically distorts the global data structure by using a 2D projection
  • Spherinator preserves the global data structure by using a sphere, leading to more meaningful representations

HiPSter

  • Takes the spherical latent positions from Spherinator
  • Generates a HiPS (Hierarchical Progressive Survey) map
  • Enables progressive zoom - more detail as you zoom in
  • Works with any standard HiPS viewer (e.g. Aladin-Lite)


Fernique et al. (2015)
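
To illustrate how a point in a two-dimensional spherical latent space can be placed on a sky map, the sketch below converts a unit vector to longitude and latitude and counts the HEALPix cells available at each order. HiPS is built on the HEALPix tessellation, which has 12 base cells that each split into four at every further order. The function names are hypothetical and this is not HiPSter's actual code.

```python
import math


def unit_vector_to_lonlat(x, y, z):
    """Convert a point on the unit sphere to (longitude, latitude) in degrees.

    Longitude is wrapped into [0, 360); latitude lies in [-90, 90].
    """
    lon = math.degrees(math.atan2(y, x)) % 360.0
    lat = math.degrees(math.asin(max(-1.0, min(1.0, z))))
    return lon, lat


def healpix_cells(order):
    """Number of HEALPix cells at a given order: 12 * 4**order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return 12 * 4 ** order


lon, lat = unit_vector_to_lonlat(0.0, 1.0, 0.0)
cells_per_order = [healpix_cells(order) for order in range(4)]
```

The fourfold growth per order is what makes progressive zoom possible: each step down the hierarchy resolves a cell into four finer ones.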

End-to-End ML Pipeline

  • PEST processes the large data volumes on-site and uploads a standardised Parquet dataset
  • Spherinator learns a compact spherical representation and exports it via ONNX

VAE-S\(^2\) with Emojis

Projected

Generated

valhalla/emoji-dataset (2,749 emojis)

Projected (zoomed)

VAE-S\(^2\) with Celebrities: “Fill the sky with stars”

Projected

Generated

tonyassi/celebrity-1000 (18,184 images of 1,000 celebrities)

Gaia Explorer

Largest and most uniform all-sky spectrophotometric survey (over 220 million sources)

Doser et al. (2026)

Outlook: Shared Universe Engine

  • DynaVerse is a German Cluster of Excellence that combines astrophysics, mathematics, and computer science to study cosmic processes on different timescales
  • The SUE (Shared Universe Engine) is a unified platform connecting data, models, and simulations across scales and disciplines

Thank you for your attention!

Acknowledgement & Disclaimer

Funded by the European Union. This work has received funding from the European High Performance Computing Joint Undertaking (JU) and Belgium, Czech Republic, France, Germany, Greece, Italy, Norway, and Spain under grant agreement No 101093441.

Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European High Performance Computing Joint Undertaking (JU) and Belgium, Czech Republic, France, Germany, Greece, Italy, Norway, and Spain. Neither the European Union nor the granting authority can be held responsible for them.

References

De Cao, Nicola, and Wilker Aziz. 2020. “The Power Spherical Distribution.” Proceedings of the 37th International Conference on Machine Learning, INNF+, ahead of print, June. https://doi.org/10.48550/arXiv.2006.04437.
Doser, Bernd, Kai L. Polsterer, and Sebastian Trujillo-Gomez. 2026. “Representation Learning for Gaia XP DR3.” Astronomical Data Analysis Software and Systems XXXV, ahead of print. https://doi.org/10.5281/zenodo.20689266.
Fernique, P., M. G. Allen, T. Boch, et al. 2015. “Hierarchical Progressive Surveys: Multi-Resolution HEALPix Data Structures for Astronomical Images, Catalogues, and 3-Dimensional Data Cubes.” Astronomy & Astrophysics 578 (June): A114. https://doi.org/10.1051/0004-6361/201526075.
Polsterer, Kai L., Bernd Doser, Andreas Fehlner, and Sebastian Trujillo-Gomez. 2024. “Spherinator and HiPSter: Representation Learning for Unbiased Knowledge Discovery from Simulations.” Astronomical Data Analysis Software and Systems XXXIII, ahead of print. https://doi.org/10.48550/arXiv.2406.03810.