Representation Learning with Spherinator

High-Performance Computing Visualization

Bernd Doser & Sebastian Trujillo-Gomez
Kai Polsterer, Andreas Fehlner, Fenja Schweder, Romain Chazotte

Heidelberg Institute for Theoretical Studies (HITS)

May 2025

Agenda

  • Project X: The Big Picture
  • PEST: Data Preprocessing
  • Spherinator: The Training
  • HiPSter: The Inference
  • Live demo: Illustris TNG
  • Multimodality: 1D Spectral Data
  • Flyte and StreamFlow: Workflow Orchestration

Associated Materials

Project X: The Big Picture

PEST: Data Preprocessing

  • PEST preprocesses universal cosmological simulation data into multi-channel images, data cubes, and point clouds
  • Apache Parquet stores multi-modal data in a uniform way
    • Efficient columnar data storage
    • Fast access via Apache Arrow
    • Interoperable with many frameworks (PyTorch, TensorFlow) and programming languages (Python, Julia, C++, Rust)

Spherinator: The Training

  • Representation learning using a Variational Autoencoder
  • Dimensionality reduction to a (hyper-)spherical latent space
  • Training with PyTorch Lightning

Spherinator: The Power Spherical Distribution

An analogue of the normal distribution on the hypersphere:

\[\begin{aligned} p_{X}(x; \mu, \kappa) = N_{X}(\kappa, d)^{-1}(1 + \mu^{\top}x)^{\kappa} \end{aligned}\]

\(d\): Dimension

\(\mu\): Direction

\(\kappa\): Concentration

\(N_{X}\): Normalization factor
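
The density above can be evaluated and sampled with the Python standard library alone. The following is a minimal sketch, not the Spherinator implementation: it assumes the closed-form normalizer \(N_{X}(\kappa, d) = 2^{\alpha+\beta}\pi^{\beta}\,\Gamma(\alpha)/\Gamma(\alpha+\beta)\) with \(\alpha = (d-1)/2 + \kappa\) and \(\beta = (d-1)/2\), and the Beta-plus-Householder sampling scheme from De Cao and Aziz (2020). The function names `log_normalizer`, `log_density`, and `sample` are illustrative and do not come from the slides.

```python
import math
import random


def log_normalizer(kappa: float, d: int) -> float:
    """log N_X(kappa, d) for unit vectors in R^d (assumed closed form)."""
    beta = (d - 1) / 2.0
    alpha = beta + kappa
    return ((alpha + beta) * math.log(2.0) + beta * math.log(math.pi)
            + math.lgamma(alpha) - math.lgamma(alpha + beta))


def log_density(x, mu, kappa: float) -> float:
    """log p_X(x; mu, kappa); x and mu are unit vectors of equal length."""
    dot = sum(m * xi for m, xi in zip(mu, x))
    if kappa == 0.0:
        return -log_normalizer(kappa, len(mu))  # uniform on the sphere
    if dot <= -1.0:
        return -math.inf  # the density vanishes at the point opposite to mu
    return kappa * math.log1p(dot) - log_normalizer(kappa, len(mu))


def sample(mu, kappa: float, rng: random.Random):
    """Draw one unit vector from the power spherical distribution."""
    d = len(mu)
    beta = (d - 1) / 2.0
    # Component along mu: t = 2z - 1 with z ~ Beta(beta + kappa, beta).
    t = 2.0 * rng.betavariate(beta + kappa, beta) - 1.0
    # Uniform direction in the remaining d - 1 coordinates.
    g = [rng.gauss(0.0, 1.0) for _ in range(d - 1)]
    g_norm = math.sqrt(sum(gi * gi for gi in g))
    scale = math.sqrt(max(0.0, 1.0 - t * t)) / g_norm
    y = [t] + [scale * gi for gi in g]
    # Householder reflection that maps the first axis e1 onto mu.
    u = [(1.0 if i == 0 else 0.0) - m for i, m in enumerate(mu)]
    u_norm = math.sqrt(sum(ui * ui for ui in u))
    if u_norm < 1e-12:
        return y  # mu already equals e1, no reflection needed
    u = [ui / u_norm for ui in u]
    proj = sum(ui * yi for ui, yi in zip(u, y))
    return [yi - 2.0 * proj * ui for yi, ui in zip(y, u)]
```

Larger \(\kappa\) concentrates the samples around \(\mu\); for \(\kappa = 0\) the density reduces to the uniform distribution on the sphere.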

HiPSter: The Inference

  • The HEALPix framework is used to generate a Hierarchical Progressive Survey (HiPS) for the corresponding spherical latent space positions.
  • Aladin-Lite is designed to visualize the HiPS representation.
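
To illustrate how a spherical latent space lines up with a sky-survey viewer, the sketch below converts a 3D latent vector into longitude and latitude and counts the HEALPix cells per order. It is an assumption-laden illustration rather than HiPSter code: the slides do not specify the coordinate convention, so the mapping (longitude from `atan2(y, x)`, latitude from `asin(z)`) and the function names are hypothetical. The cell count of 12 · 4^order is a general property of HEALPix.

```python
import math


def unit_vector_to_lonlat(x: float, y: float, z: float):
    """Convert a 3D latent vector to (longitude, latitude) in degrees."""
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.degrees(math.atan2(y, x)) % 360.0
    # Clamp to guard against rounding slightly outside [-1, 1].
    lat = math.degrees(math.asin(max(-1.0, min(1.0, z / norm))))
    return lon, lat


def healpix_npix(order: int) -> int:
    """Number of equal-area HEALPix cells at a given order (nside = 2**order)."""
    nside = 2 ** order
    return 12 * nside * nside
```

Each additional order splits every cell into four, which is what allows a viewer to load finer tiles progressively while zooming in.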

Let’s begin the demonstration!

Multimodality: 1D Spectra

Gaia DR3 XP contains over 200 million blue (BP) and red (RP) photometer spectra, each represented as a continuous function with 55 parameters per spectrum.

Workflow Orchestration with Flyte

Flyte is a highly scalable, cloud-native workflow orchestration platform built on containers and Kubernetes.

Workflow Orchestration with StreamFlow

StreamFlow executes Common Workflow Language (CWL) using a deployment model that includes containers, Slurm (HPC), and Kubernetes.

Summary and Outlook

  • Modular and flexible data workflow (Project X)
  • Uniform interconnectivity
    • Parquet for data storage
    • ONNX for model exchange
  • Workflow Orchestration (ML Workflow Seminar)
    • Flyte for cloud-native workflows
    • StreamFlow for HPC workflows
  • Prototype for Illustris TNG and Gaia DR3 XP is available at space.h-its.org

Acknowledgement & Disclaimer

References

De Cao, Nicola, and Wilker Aziz. 2020. “The Power Spherical Distribution.” Proceedings of the 37th International Conference on Machine Learning, INNF+, June. https://doi.org/10.48550/arXiv.2006.04437.
Doser, Bernd, Kai L. Polsterer, Andreas Fehlner, and Sebastian Trujillo-Gomez. 2025. “Machine Learning Workflow for Morphological Classification of Galaxies.” Astronomical Data Analysis Software and Systems XXXIV. https://doi.org/10.48550/arXiv.2505.04676.
Fernique, P., M. G. Allen, T. Boch, A. Oberto, F-X. Pineau, D. Durand, C. Bot, et al. 2015. “Hierarchical Progressive Surveys: Multi-Resolution HEALPix Data Structures for Astronomical Images, Catalogues, and 3-Dimensional Data Cubes.” Astronomy & Astrophysics 578 (June): A114. https://doi.org/10.1051/0004-6361/201526075.
Polsterer, Kai L., Bernd Doser, Andreas Fehlner, and Sebastian Trujillo-Gomez. 2024. “Spherinator and HiPSter: Representation Learning for Unbiased Knowledge Discovery from Simulations.” Astronomical Data Analysis Software and Systems XXXIII. https://doi.org/10.48550/arXiv.2406.03810.