PASC Posters

Accelerating Lattice QCD Dirac GCR Solvers with Multiple Right-Hand Sides (MRHS)

Lattice QCD simulations are often limited by the memory-bandwidth bottlenecks of solving the Dirac equation for numerous source vectors. We present an optimised Multiple Right-Hand Side (MRHS) Generalised Conjugate Residual (GCR) solver in the openQxD framework that addresses these limitations. By transitioning the core computational kernel from matrix-vector to matrix-matrix multiplication, we significantly increase arithmetic intensity and improve cache locality. Additionally, we reduce communication overhead in distributed-memory systems by bundling MPI boundary exchanges, effectively lowering communication latency from O(number of RHS) to O(1). Our MRHS implementation offers a four-fold performance increase for Dirac operator applications on modern HPC architectures. The MRHS implementation in the established QUDA codebase is also examined. QUDA’s GPU implementation of MRHS improves GPU saturation and increases energy efficiency, reducing the energy cost per solve by a factor of 1.6. This provides a scalable and efficient solution for high-throughput lattice simulations requiring multiple solves of the same Dirac equation.
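
The benefit of bundling right-hand sides can be illustrated with a toy numpy sketch (illustrative only, not openQxD code): applying an operator to k vectors one at a time costs k matrix-vector products, each of which would trigger its own boundary exchange in a distributed solver, whereas the MRHS formulation performs one matrix-matrix product and one exchange.

```python
import numpy as np

# Toy stand-in for the Dirac operator applied to k right-hand sides.
rng = np.random.default_rng(0)
n, k = 512, 8                      # flattened lattice size, number of RHS
D = rng.standard_normal((n, n))    # stand-in for the Dirac matrix
B = rng.standard_normal((n, k))    # k source vectors, stored column-wise

# Vector-at-a-time: k matrix-vector products; distributed, each one would
# also need its own boundary exchange -> O(k) messages.
X_mv = np.column_stack([D @ B[:, j] for j in range(k)])

# MRHS: one matrix-matrix product on the bundled block; a single (larger)
# boundary exchange suffices -> O(1) messages, higher arithmetic intensity.
X_mm = D @ B

assert np.allclose(X_mv, X_mm)
```

The results are identical; the gain is purely in data movement and kernel efficiency, which is what the abstract's four-fold speedup reflects.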

Author(s): JingJing Li (University of Bern), Roman Gruber (University of Bern), and Marina Krstic Marinkovic (ETH Zurich)

Domain: Physics


Advancing The Data Assimilation Research Testbed (DART) as an Early-Career Software Engineer

The Data Assimilation Research Testbed (DART) is an open-source software facility for ensemble data assimilation that combines information from numerical model predictions with measurements of the Earth system to enhance the value of both. It has supported a diverse community of users for over 20 years. DART allows for the use of a variety of assimilation algorithms, including novel methods that are especially effective for addressing bounded environmental metrics such as pollutants, sea ice concentration, and soil moisture. Our modular software design and generalized model interface facilitate the easy use of existing models and observations and the addition of new ones. DART utilizes the power of HPC through distributed-memory parallel computation with the Message Passing Interface to run complex, high-resolution models, providing scalability on a variety of systems. My poster will detail the versatility of DART and its functionalities. It will describe my varied technical contributions as a software engineer, and also highlight a few of my larger projects that showcase the interdisciplinary nature of my role, including collaborative work with the NASA Goddard Space Flight Center and the NSF NCAR High Altitude Observatory, performing data assimilation with space weather models on the NASA Pleiades Supercomputer.

Author(s): Marlena Smith (NSF National Center for Atmospheric Research)

Domain: Climate, Weather and Earth Sciences


Algebraic Multi-Level Methods for Lattice Dirac Operators in LQCD

The main computational challenge in Lattice QCD is the efficient and scalable approximate solution of the Dirac equation Dz = b, where D denotes the Dirac matrix on a four-dimensional space-time lattice. Modern solvers for this case are based on Adaptive Multigrid. Among them, Domain Decomposition Adaptive Algebraic Multigrid (DDalphaAMG) is particularly effective and serves as the foundation for our work. On the other hand, Aggregative Multiscale Algebraic Multigrid (AM-AMG) is an AMG setup method originally developed for reservoir simulations: it follows the idea of a geometric solver for reservoir modeling, but is completely algebraic. We adapted AM-AMG to a point-based approach that is able to handle the Dirac equation in the Schwinger model, a testbed for QCD. This approach exploits the block structure of the 2d Dirac matrix. We compare the performance of AM-AMG vs DDalphaAMG when used as preconditioners for FGMRES to invert D. Generally, AM-AMG handles the Schwinger model well, but there are well-known issues when approaching the critical masses. For the time being, DDalphaAMG still has the better performance, but the current results show that AM-AMG has the potential to become the solver of choice.

Author(s): Pauline Schauerte (University of Bonn, Fraunhofer SCAI), and Jaime Fabian Nieto Castellanos (Forschungszentrum Jülich, University of Bonn)

Domain: Physics


Algorithms and Optimizations for Global Non-Linear Hybrid Fluid-Kinetic Finite Element Stellarator Simulations

Predictive modeling of stellarator plasmas is crucial for advancing nuclear fusion energy, yet it faces unique computational difficulties. A primary challenge is accurately simulating the dynamics of specific particle species not well captured by fluid models, necessitating the use of hybrid fluid-kinetic models. The non-axisymmetric geometry of stellarators fundamentally couples toroidal Fourier modes, in contrast to tokamaks, requiring specialized numerical treatment. This work presents a novel, globally coupled projection scheme inside the JOREK finite element framework. The approach ensures a self-consistent and physically accurate transfer of kinetic markers to the fluid grid, effectively handling complex 3D meshes by solving a unified linear system that encompasses all toroidal harmonics simultaneously. To manage computational complexity, matrix construction is significantly accelerated using the Fast Fourier Transform. Efficient localization of millions of particles is enabled by a 3D R-Tree spatial index, ensuring computational tractability at scale. On realistic Wendelstein 7-X stellarator geometries, the framework’s fidelity is rigorously demonstrated. In sharp contrast to uncoupled approaches, quantitative convergence tests verify that the coupled scheme attains the theoretically anticipated spectral convergence. This study offers a crucial capability for the predictive analysis and optimization of next-generation stellarator designs by providing a validated, high-fidelity computational tool.

Author(s): Luca Venerando Greco (Max Planck Institute for Plasma Physics), Matthias Hoelzl (Max Planck Institute for Plasma Physics), Guido Huijsmans (CEA, IRFM), and Edoardo Carrà (Max Planck Institute for Plasma Physics)

Domain: Computational Methods and Applied Mathematics


Bridging Python Flexibility and GPU Performance with Aithon: Kernel-Level Optimization, Scaling, and Extreme-Resolution MHD Turbulence Simulations

We present Aithon, a GPU-accelerated incompressible flow solver for hydrodynamics and magnetohydrodynamics, designed for extreme-scale supercomputing. Optimized for AMD MI250X GPUs and deployed on the Frontier system, Aithon combines kernel-level GPU optimizations, CUDA/HIP-aware MPI, and Python integration via pybind11 to achieve near-ideal scaling up to 32,768 GPUs. These capabilities enable some of the highest-resolution MHD turbulence simulations to date, reaching $8192^2$ (2D) and $1536^3$ (3D) grids. Using multiple independent diagnostics, we resolve a decades-old debate on inertial-range spectral scaling, providing compelling numerical evidence in favor of Kolmogorov scaling. This work demonstrates how extreme-scale GPU computing can directly enable definitive scientific breakthroughs.

Author(s): Manthan Verma (Indian Institute of Technology Kanpur), Gina Sitaraman (AMD), and Mahendra Verma (Indian Institute of Technology Kanpur)

Domain: Physics


Co-designing Regional High-Performance Computing Ecosystems in Africa: A Pilot Focus on Kenya and West Africa

High-performance computing is becoming more important for bioinformatics, genomics, and public health research. However, in Africa, its growth and use remain uneven, scattered, and poorly documented. This study examines current HPC capacity and access models in West, East, and Southern Africa, drawing on both published and grey literature to describe institutional clusters, national facilities, and collaborative platforms. We find that, even where computing capacity exists, access is governed by disparate institutional, project, and consortium models, which makes cross-regional collaboration difficult and equitable use of these resources hard to achieve. While cloud computing offers a partial solution, better-coordinated HPC systems remain crucial for data-intensive research. We propose a regional plan to improve HPC by connecting existing resources rather than creating new ones, linking large national centers with smaller institutional clusters through shared resources and access, spearheaded by a community-led group. By incorporating community governance and skills building into the co-design framework, the approach aims to make African HPC a collaborative research network that aligns with both regional goals and global standards.

Author(s): Pauline Karega (University of Manchester, Bioinformatics Hub of Kenya initiative)

Domain: Applied Social Sciences and Humanities


Correlated Electrons on Accelerated Architectures from Frequency-Dependent Response Functions

Understanding, characterizing and engineering spectral properties of correlated materials is crucial for next-generation technologies, including energy harvesting and quantum technologies. These properties encode a material’s response to external stimuli, and while important in general, they are even more critical for correlated materials, which exhibit diverse many-body and topological phenomena that challenge our physical understanding. Accurately describing these many-body systems requires advanced electronic structure methods relying on frequency-dependent response functions, whose calculation currently represents a major computational bottleneck in ab-initio simulations. Here, we present a novel formulation for the evaluation of dynamical response functions in localized manifolds, specifically maximally localized Wannier functions, based on time-dependent density functional perturbation theory, its implementation in Quantum ESPRESSO, and its application to prototypical correlated materials. The most time-consuming step of our implementation is the solution of linear problems of the type Ax=b, with A a large, non-Hermitian matrix. We present and discuss our strategy for speeding up the solution of this problem, based on an interface with SIRIUS, a domain-specific library for electronic-structure calculations. The flexibility of SIRIUS allows for a robust deployment on complex accelerated architectures, specifically Alps (NVIDIA Grace-Hopper) and LUMI (AMD Epyc/Instinct).
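
The kind of iterative solve that dominates this workload can be sketched with a textbook Krylov method (a bare-bones GMRES in numpy, offered only as an illustration of solving Ax=b with non-Hermitian A; it is not the Quantum ESPRESSO/SIRIUS implementation, and all names here are our own):

```python
import numpy as np

def gmres(A, b, m):
    """Minimal GMRES sketch: minimise ||b - Ax|| over an m-dim Krylov space."""
    n = b.size
    Q = np.zeros((n, m + 1))        # Arnoldi basis of the Krylov space
    H = np.zeros((m + 1, m))        # upper Hessenberg projection of A
    beta = np.linalg.norm(b)
    Q[:, 0] = b / beta
    k_end = m
    for k in range(m):
        v = A @ Q[:, k]
        for j in range(k + 1):      # modified Gram-Schmidt orthogonalisation
            H[j, k] = Q[:, j] @ v
            v -= H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(v)
        if H[k + 1, k] < 1e-12:     # "happy breakdown": invariant subspace found
            k_end = k + 1
            break
        Q[:, k + 1] = v / H[k + 1, k]
    # minimise the residual via a small Hessenberg least-squares problem
    e1 = np.zeros(k_end + 1)
    e1[0] = beta
    y, *_ = np.linalg.lstsq(H[:k_end + 1, :k_end], e1, rcond=None)
    return Q[:, :k_end] @ y

rng = np.random.default_rng(0)
n = 30
A = rng.standard_normal((n, n)) / np.sqrt(n) + 2.0 * np.eye(n)  # non-Hermitian, well conditioned
b = rng.standard_normal(n)
x = gmres(A, b, m=n)
```

In production the matrices are far too large to form explicitly, and the payoff of the SIRIUS interface is precisely in applying A and preconditioning on accelerators.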

Author(s): Paolo Settembri (Paul Scherrer Institute), Nicola Colonna (Paul Scherrer Institute), Anton Kozhevnikov (ETH Zurich / CSCS), and Nicola Marzari (EPFL, Paul Scherrer Institute)

Domain: Physics


Coupling km-Scale Earth System Model to Hierarchical Output for Analysis-Ready Dataset

Kilometer-scale Earth System Model (ESM) simulations produce petabyte-scale outputs that are difficult to access, analyse, and share due to their size, heterogeneity, and the overhead of ad-hoc workflows. We introduce Hiopy (Hierarchical Output in Python), a lightweight in-situ output component that is coupled to the ICON model with YAC (Yet Another Coupler) and writes multi-resolution, self-describing datasets directly to the cloud in Zarr format. Hiopy computes hierarchical temporal and spatial aggregates on-the-fly, aligning model domain decomposition with Zarr chunking to minimise communication, balance workload across processes, and preserve simulation throughput. The framework also streamlines metadata handling, eliminates redundant buffers, and streams data straight to its final storage location. Hiopy supports native ICON grids, regular latitude-longitude grids, and HEALPix, and has been validated by generating publicly accessible, analysis-ready datasets from several kilometre-scale ESM projects. This work demonstrates a practical, scalable addition to the high-resolution climate-modelling software stack, enabling seamless, cost-effective access from coarse to native resolution without post-processing bottlenecks.
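
The hierarchical-aggregation idea can be condensed to a few lines (illustrative only; Hiopy's actual API, grouping, and HEALPix parent/child relations differ): each coarser level of the output hierarchy stores an aggregate over groups of finer cells, so readers fetch whichever resolution the analysis needs.

```python
import numpy as np

# Toy spatial hierarchy: every parent cell holds the mean of 4 children,
# computed on the fly as data stream through the output component.
fine = np.arange(16.0)                        # values on 16 fine cells
level1 = fine.reshape(-1, 4).mean(axis=1)     # 4 children -> 1 parent cell
level2 = level1.reshape(-1, 4).mean(axis=1)   # next coarser level
print(level1)   # [ 1.5  5.5  9.5 13.5]
print(level2)   # [7.5]
```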

Author(s): Nils-Arne Dreier (DKRZ), and Siddhant Tibrewal (Max Planck Institute for Meteorology)

Domain: Climate, Weather and Earth Sciences


Developing and Evaluating Performance-Portable Physical Parametrization Codes

We present results and ongoing work in the porting of physical parametrizations to Python using the GridTools for Python (GT4Py) library. Our basis is the Fortran code from the Integrated Forecasting System (IFS) which is run operationally at the European Centre for Medium-Range Weather Forecasts (ECMWF). Based on the IFS code, the parametrizations for cloud microphysics (CLOUDSC) and radiation (ecRad) have been rewritten as hardware-agnostic Python code. Using the different backends of the GT4Py library, the parametrizations can then be run on both CPUs and GPUs. In the first part of the poster, we highlight the performance results of the GT4Py codes for CLOUDSC and ecRad. Preliminary results for the ECMWF land-surface scheme ecLand are also presented. In the second part, we shed light on the computational challenges associated with ecRad, both in terms of memory footprint and time-to-solution. In the context of real-case high-resolution weather forecasts with PMAP, we present an evaluation of the model sensitivity to the calling frequency of the radiation parametrization.

Author(s): Gabriel Vollenweider (ETH Zurich), Stefano Ubbiali (ETH Zurich), Christian Kühnlein (ECMWF), and Heini Wernli (ETH Zurich)

Domain: Climate, Weather and Earth Sciences


Discretization Error Quantification in Plane-Wave Density Functional Theory

Density functional theory (DFT) has become a workhorse of computational materials science. DFT computations in materials typically use a plane wave basis set, truncated at a so-called kinetic energy cutoff Ecut. Estimates for the truncation error of the basis set open opportunities for error balancing, for example by reducing the size of the basis set if other sources of errors are found to be dominating, leading to cheaper simulations with the same overall accuracy. In this work, we follow up on promising recent developments by Cancès et al., who proposed an estimate for the discretization error due to the choice of the kinetic energy cutoff. Building on top of this method, we present our strategy to choose its key numerical parameters, with the goal of turning these error estimates into a routinely applicable technique. We then benchmark the method on an extended set of systems, demonstrating its accuracy on fundamental properties such as total energies and interatomic forces. Finally, we explore its usage to reduce the cost of data generation for the training of machine-learned interatomic potentials.

Author(s): Bruno Ploumhans (EPFL), and Michael Herbst (EPFL)

Domain: Chemistry and Materials


An Ensemble Machine Learning Model to Predict 2- and 10-Year Breast Cancer Recurrence Using Routine Hematological and Clinical Data

Accurate prediction of breast cancer recurrence remains difficult because prognosis varies across molecular subtypes, and genomic tests are often expensive or unavailable, leading to broad risk categories that may cause overtreatment or undertreatment. We developed machine learning models integrating routine hematological indices with clinicopathologic data to predict 2- and 10-year recurrence or death. We retrospectively analyzed 4,277 women with primary breast cancer (2008–2022) from a single institution. The cohort included hormone receptor-positive (HR+; 60%), HER2-positive (21%), and triple-negative (TNBC; 18%) subtypes. We trained multiple classifiers and integrated them into a stacked ensemble using logistic regression as the final learner. Class imbalance was addressed with SMOTE applied only to training sets. The ensemble achieved strong discrimination: general cohort AUC 0.859 (2-year) and 0.814 (10-year), with specificity 88–86% and sensitivity 67–59%. Subtype-specific performance remained robust: HR+ AUC 0.862/0.804, HER2+ AUC 0.892/0.831, and TNBC AUC 0.834/0.829 (2-year/10-year). SHAP analysis identified advanced tumor stage, elevated inflammatory ratios (NLR, PLR, MLR), elevated red cell distribution width, and older age as key adverse predictors with stronger effects on early recurrence. This interpretable tool uses only routine blood tests, requiring no additional infrastructure, enabling scalable risk stratification where genomic testing is unavailable.
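
The class-imbalance treatment can be illustrated with a minimal SMOTE-style sketch (our own toy numpy version, not the authors' pipeline; the helper name is hypothetical): synthetic minority-class samples are created by interpolating between a minority sample and one of its nearest minority neighbours, and, crucially, this is done only on the training split so the test set keeps its natural class balance.

```python
import numpy as np

def smote_oversample(X_min, n_synthetic, k=5, seed=0):
    """Generate n_synthetic minority-class points by nearest-neighbour
    interpolation (simplified SMOTE; apply to training data only)."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        # k nearest minority neighbours of sample i (excluding itself)
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]
        j = rng.choice(neighbours)
        lam = rng.random()   # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_train_minority = np.random.default_rng(1).normal(size=(20, 4))
X_new = smote_oversample(X_train_minority, n_synthetic=30)
print(X_new.shape)  # (30, 4)
```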

Author(s): Patricia Moreira (Inteli – Institute of Technology and Leadership)

Domain: Computational Methods and Applied Mathematics


Estimation of Global Surface Carbon Fluxes at the Grid Scale Using Machine Learning Techniques

Machine learning (ML) techniques have recently been applied in geoscience, as in other fields, and have shown significant progress. One of the major advantages of ML is its remarkable effectiveness in overcoming, from a computational science perspective, the problem of high computational costs. This study applies ML techniques to inverse modeling for estimating global carbon dioxide emissions, to see how easily and accurately ML techniques can perform the calculations. Drawing on our experience with data assimilation techniques based on ensemble Kalman filters, we can appreciate the differences in methodology and the associated effort. In the meantime, applying ML techniques is also essential given recent changes in HPC architecture. The KISTI (Korea Institute of Science and Technology Information) national supercomputing center is building the KISTI-6 HPC system with a performance of approximately 600 PF, of which 588.28 PF will come from GPUs, aiming for official service in the second half of this year. Therefore, testing whether a GPU-based ML model can efficiently produce results similar to or more accurate than the existing CPU-based numerical inversion modeling is also meaningful for enhancing support capabilities for future users of KISTI’s supercomputing service, especially those in the geosciences.

Author(s): Ji-Sun Kang (Korea Institute of Science and Technology Information)

Domain: Climate, Weather and Earth Sciences


Evaluating Open-Source Infrastructure-As-Code Virtual Clusters against SuperMUC-NG Phase 1

Traditional high-performance computing (tHPC) infrastructure requires weeks to months for hardware procurement, network configuration and software integration, which limits agility for short-term projects and hampers reproducibility through non-standardized configurations. Infrastructure-as-Code (IaC) promises rapid, version-controlled cluster deployment, yet production-grade open-source IaC frameworks for communication-intensive workloads remain underexplored. Prior work reports 5–10% single-node virtualization overhead but highlights multi-node scaling challenges dominated by network latency. We benchmark virtual HPC (vHPC) clusters deployed via Magic Castle within Germany’s InHPC-DE project, focusing on open-source IaC rather than proprietary offerings such as AWS ParallelCluster or Azure CycleCloud. An IaC-based vHPC cluster is compared against SuperMUC-NG Phase 1, a traditional bare-metal HPC system, using four biophysical/chemical simulation codes: Quantum ESPRESSO, GROMACS, LAMMPS, and CP2K. At single-node and low core counts, vHPC performance closely matches tHPC for all applications, indicating minimal computational overhead. For communication-intensive workloads (GROMACS, CP2K), strong-scaling efficiency degrades significantly beyond one node due to limited network bandwidth and high latency, far below the 100 Gbit/s Omni-Path bandwidth and sub-microsecond latency of tHPC systems. Our results show that IaC-based vHPC is production-ready for workloads with moderate communication requirements and is immediately applicable for burst computing, education, benchmarking, development workflows, and federated multi-site infrastructure.

Author(s): Prasanth Babu Ganta (Leibniz Supercomputing Centre), Elmira Birang (Leibniz Supercomputing Centre), Plamen Dobrev (Leibniz Supercomputing Centre), Birkan Emrem (Leibniz Supercomputing Centre), Matteo Foglieni (Leibniz Supercomputing Centre), and Ferdinand Jamitzky (Leibniz Supercomputing Centre)

Domain: Chemistry and Materials


Exploring Performance and Efficiency of State-Of-The-Art Deep Learning Protein Structure Prediction Frameworks on the Frontier Exascale Supercomputer

Accurately predicting the structure of a protein has been a long-standing and extremely challenging problem in biology. In recent years, the rapid evolution and adoption of artificial intelligence have made it possible to predict protein structures with deep learning frameworks at an accuracy rivaling that of experimental crystal structures. These advances are key to understanding protein function and play a central role in accelerating the drug discovery process. As the number of frameworks available continues to grow and improved versions emerge, determining which tool is best suited for an experiment has become increasingly challenging. This study presents an in-depth performance and efficiency comparison across AlphaFold3, AF3Complex, and Boltz-2 on the Oak Ridge Leadership Computing Facility’s Frontier supercomputer. The evaluation compares the performance, accuracy, energy and power consumption of each model across a predefined set of three biomolecule categories: (i) proteins with long intrinsically disordered regions, (ii) proteins with functional modified or mutated residues, and (iii) off-target effects in multimers. The results from this work and the lessons learned provide practical guidance for selecting appropriate protein structure prediction frameworks under performance and energy constraints and can be useful in the evaluation of future versions of the models discussed.

Author(s): Verónica G. Melesse Vergara (Oak Ridge National Laboratory), Elijah MacCarthy (Oak Ridge National Laboratory), Asim YarKhan (Oak Ridge National Laboratory), John Holmen (Oak Ridge National Laboratory), Manesh Shah (Oak Ridge National Laboratory), Érica Teixeira Prates (Oak Ridge National Laboratory), and Dan Jacobson (Oak Ridge National Laboratory)

Domain: Engineering


A Flexible Interface for Neural Network Potentials in GROMACS

We present a new interface for hybrid machine learning/molecular mechanics (ML/MM) simulations implemented in the molecular dynamics engine GROMACS. The interface enables neural network potentials (NNPs) trained in the PyTorch framework to contribute energies and forces during molecular dynamics (MD) simulations, either for selected subsystems or entire molecular systems. By defining a flexible set of model inputs and outputs, the interface is agnostic to specific NNP architectures and can accommodate a wide range of descriptor-based and message-passing models. The design integrates NNP inference into established GROMACS workflows while remaining compatible with advanced sampling and free energy methodologies. We demonstrate the capabilities of the interface using several representative applications, including solvation structure calculations, enhanced sampling of peptide torsional free energies, absolute solvation free energy calculations, and protein-ligand binding simulations. Across these examples, ML/MM simulations reproduce established reference results and, in some cases, improve upon classical force field descriptions, at substantially reduced cost compared to QM/MM approaches. The interface is available in recent GROMACS releases and provides a practical foundation for incorporating machine learning potentials into production MD simulations, with ongoing development aimed at extending embedding schemes and improving performance and scalability.
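
The architecture-agnostic contract such an interface implies can be sketched abstractly (this is our own toy illustration of the idea, not the GROMACS API or a real NNP): the engine hands the model atomic positions and gets back an energy plus per-atom forces; any model exposing that signature could plug in.

```python
import numpy as np

def harmonic_model(positions, k=100.0, r0=0.1):
    """Toy stand-in for an NNP: a harmonic bond between atoms 0 and 1.
    Returns (energy, per-atom forces), the contract an MD engine needs."""
    d = positions[1] - positions[0]
    r = np.linalg.norm(d)
    energy = 0.5 * k * (r - r0) ** 2
    f = k * (r - r0) * d / r          # force on atom 0, along the bond
    forces = np.array([f, -f])        # Newton's third law pair
    return energy, forces

# Two atoms 0.2 apart with equilibrium distance 0.1: stretched bond.
energy, forces = harmonic_model(np.array([[0.0, 0.0, 0.0],
                                          [0.2, 0.0, 0.0]]))
```

A trained PyTorch NNP would replace the analytic potential here, but the in/out shapes are what make the interface architecture-agnostic.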

Author(s): Lukas Müllender (KTH Royal Institute of Technology), Berk Hess (KTH Royal Institute of Technology), and Erik Lindahl (KTH Royal Institute of Technology, Stockholm University)

Domain: Physics


A Flux-Form Semi-Lagrangian WENO Scheme on Triangular Meshes

The icosahedral model for weather and climate simulations utilises flux-form semi-Lagrangian (FFSL) schemes for the transport of species. The motivation is the higher Courant-Friedrich-Lewy (CFL) number compared to Eulerian approaches. The schemes are implemented on the triangular mesh on a sphere which we simplify for analysis to a planar equilateral triangular mesh. A second-order scheme with four-point stencil and a third-order scheme with ten-point stencil are considered. We extend these linear schemes to weighted essentially non-oscillatory (WENO) FFSL schemes. The efficient implementation of these schemes on graphic processors (GPUs) reveals that the more floating point operations demanding WENO scheme has only marginally more cost than the simple FFSL scheme. Since the latter scheme requires the application of a flux limiter, the WENO scheme has overall the lower cost. For the second order approximation the solutions of the linear FFSL scheme with one least square approximation and the sub-stencils of the WENO scheme are identical. For the third order scheme the dispersion relation for the WENO approach with three seven-point stencils is superior to the complete evaluated ten-point stencil. Steep gradient solutions are handled in a similar quality between linear FFSL scheme combined with flux limiter and the WENO FFSL scheme.

Author(s): Andreas Jocksch (ETH Zurich / CSCS), Daniel Reinert (Deutscher Wetterdienst (DWD)), Christoph Müller (MeteoSwiss), David Strassmann (ETH Zurich), and Nina Burgdorfer (MeteoSwiss)

Domain: Climate, Weather and Earth Sciences


FPGA-Specific Optimizations for Multi-Device Shallow Water Simulations with SYCL

The shallow water equations are an essential tool for modeling tides, tsunamis, and storm surges. At PASC 24, we presented an implementation of the shallow water equations running on CPUs, GPUs and FPGAs. While the numerical code is shared across the different architectures, the implementation uses SYCL as a portability layer to support architecture-specific memory layouts and communication routines. This poster provides a detailed overview of the FPGA-specific optimisations of this portable codebase. Unlike CPUs and GPUs, FPGAs do not provide a cache-based memory hierarchy that mitigates the cost of accessing slow off-chip memory. To reduce the bandwidth bottleneck, the shallow water solver makes use of RAM blocks on the FPGA device, which act as static, array-specific caches storing necessary data in fast on-chip memory. When the entire mesh fits into the on-chip caches, the FPGA designs nearly achieve the ideal throughput of one element per clock cycle. Along with an MPI-based communication scheme for CPUs and GPUs, the implementation also supports direct streaming communication for FPGAs. In combination with the on-chip caches, this achieves super-linear scaling in a strong scaling scenario, as the optimal performance per FPGA is reached when the complete partition fits into the on-chip caches.

Author(s): Christoph Alt (Paderborn University, Friedrich-Alexander-Universität Erlangen-Nürnberg), Markus Büttner (University of Bayreuth), Tobias Kenter (Paderborn University), Harald Köstler (Friedrich-Alexander-Universität Erlangen-Nürnberg), Christian Plessl (Paderborn University), and Vadym Aizinger (University of Bayreuth)

Domain: Computational Methods and Applied Mathematics


Generalization of Long-Range Machine Learning Potentials in Complex Chemical Spaces

The vastness of chemical space makes generalization a fundamental challenge for machine learning interatomic potentials (MLIPs). Although MLIPs enable near–quantum-accuracy atomistic simulations at greatly reduced computational cost, their practical reliability is often limited by poor transferability to out-of-distribution systems. Here, we systematically assess how explicit long-range modeling influences both accuracy and transferability of MLIPs across diverse chemical spaces, using metal–organic frameworks as a stringent test case. We benchmark Allegro, MACE, and DimeNet++ architectures combined with Euclidean Fast Attention (EFA), Charge Equilibration Layer for Long-range Interactions (CELLI), and Latent Ewald Summation (LES). To rigorously probe generalization, we introduce biased train–test splitting strategies that enforce structural and chemical dissimilarity between training and test sets. We find that long-range corrections are essential for robust transferability, with physics-based models showing the most consistent performance across out-of-distribution regimes. In contrast, neither CELLI nor LES can reliably infer meaningful partial charges from energies and forces alone in complex systems without reference data. These results demonstrate that out-of-distribution transferability is a prerequisite for trust in MLIPs and provide a general framework for diagnosing systematic failures across chemical space.
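
A biased split of this kind can be sketched in a few lines (a hypothetical helper of our own, not the authors' code; a real structural or chemical descriptor distance would replace the Euclidean one): the points most dissimilar from a reference subset are held out, so the test set is deliberately out-of-distribution.

```python
import numpy as np

def dissimilarity_split(X, test_fraction=0.2, seed=0):
    """Hold out the samples farthest from a randomly chosen anchor,
    forcing the test set to be dissimilar from the training set."""
    rng = np.random.default_rng(seed)
    anchor = X[rng.integers(len(X))]          # reference structure
    d = np.linalg.norm(X - anchor, axis=1)    # stand-in for a descriptor distance
    order = np.argsort(d)                     # near -> far from the anchor
    n_test = int(len(X) * test_fraction)
    return order[:-n_test], order[-n_test:]   # train indices, test indices

X = np.random.default_rng(2).normal(size=(100, 8))   # toy descriptor vectors
train_idx, test_idx = dissimilarity_split(X)
```

A random split would interleave near and far points; this one guarantees every test point is farther from the anchor than every training point, which is what stresses transferability.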

Author(s): Michał Sanocki (Technical University of Munich)

Domain: Computational Methods and Applied Mathematics


GPU-Accelerated Methods for Numerically Stable Resampling in Fluid-Structure Interaction

Fluid-structure interaction simulations require accurate transfer of scalar fields between overlapping meshes with different topologies. We address the problem of transferring fields from unstructured tetrahedral to structured hexahedral meshes. This problem is challenging because direct quadrature methods suffer from numerical instability due to hexahedral grid undersampling, while more sophisticated solutions are difficult to parallelize on GPUs efficiently. We developed and compared three GPU-optimized approaches, tested on the Alps GH200 at CSCS, analyzing their trade-offs in stability, accuracy, and performance. The first approach uses iterative refinement of tetrahedral quadrature rules, achieving high accuracy but with limited GPU parallelism. The second employs geometric adaptivity through explicit refinement rules instead of recursion, demonstrating high GPU throughput. The third uses hexahedral quadrature rules for sampling, which is ideal for GPU parallelization but creates topological conflicts when multiple tetrahedra contribute to the same quadrature node. We resolve this using an adapted Cell-List data structure that efficiently handles conflicts while maintaining near-optimal GPU performance. By combining geometric refinement with GPU-optimized spatial indexing, we achieve both numerical stability and parallel efficiency. Our methods enable robust, high-performance mesh transfer for fluid-structure interaction and other multiphysics frameworks coupling different mesh topologies.
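
The conflict the third approach must handle can be shown in a serial numpy miniature (illustrative only, not the authors' GPU Cell-List code): when several tetrahedra target the same quadrature node, a plain gathered write silently drops all but one contribution, while a conflict-aware accumulation keeps every one, which is what the adapted Cell-List structure makes efficient on the GPU.

```python
import numpy as np

n_nodes = 5
contrib_node = np.array([0, 2, 2, 4, 2])       # target node of each contribution
contrib_val = np.array([1.0, 0.5, 0.25, 2.0, 0.25])

naive = np.zeros(n_nodes)
naive[contrib_node] = contrib_val              # colliding writes: only one survives

correct = np.zeros(n_nodes)
np.add.at(correct, contrib_node, contrib_val)  # unbuffered: all contributions summed
# correct[2] == 0.5 + 0.25 + 0.25 == 1.0
```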

Author(s): Simone Riva (Università della Svizzera italiana), and Patrick Zulian (UniDistance Suisse, Università della Svizzera italiana)

Domain: Computational Methods and Applied Mathematics


Graph Neural Network Potentials for Million-Atom Molecular Dynamics Simulations of Aluminum Solidification

Solidification is ubiquitous in the fabrication of metal parts. Molecular dynamics simulations can predict the microstructure and the corresponding mechanical properties. However, both high accuracy of interatomic potential energy and scalability to millions of atoms are required to capture physically relevant grain and microstructure evolution. Classical potentials for metals are limited in accuracy and chemical complexity. Machine-learning interatomic potentials approach first-principles accuracy, but their applicability to large-scale systems remains an open challenge due to high computational cost. In this work, we train equivariant graph neural network potentials (GNNPs) for pure aluminum, assess their performance, and compare them to existing potentials. We find that classical potentials perform well in low-energy solid states but deteriorate in high-energy liquid states. In contrast, our GNNPs remain accurate across all phases, improving the accuracy of the conducted simulations. We employ the developed model to predict key solidification properties and to conduct a million-atom solidification simulation, demonstrating a pathway for feasible simulations with near-first-principles-level accuracy at experimentally relevant system sizes, benefiting multiscale materials design.

Author(s): Ian Störmer (Technical University of Munich), and Julija Zavadlav (Technical University of Munich)

Domain: Chemistry and Materials


A High-Performance, GPGPU-Enabled Discontinuous Galërkin Solver Using OpenMP Offloading and MPI

We present a GPGPU-enabled modal Discontinuous Galërkin solver that uses OpenMP+MPI. Device code is generated via OpenMP offload pragmas, and inter-device/inter-node communication is handled by MPI. Our test case implements a diffusion-advection solver with a Runge-Kutta-Chebyshev time stepping scheme. The fully explicit, matrix-free operator evaluation aims to keep a minimal memory footprint to address both the comparatively low amount of RAM on the device and the high latency of its access. Thanks to the high locality of DG methods, differential operators can be evaluated on a per-element basis, using a SIMD scheme. We use an implicitly defined structured grid, further aligning with the philosophy of reducing memory accesses in favour of more arithmetic computation. The grid is partitioned into rectangular regions, each assigned to an MPI process, with “ghost” cells at the boundary between processes; ghost DoF values are updated using a halo exchange scheme. We show efficiency and scalability results for a multi-device, multi-node configuration and compare it to the host-only counterpart, deriving efficiency metrics to assess whether the device-enabled implementation is actually advantageous from a time-to-solution standpoint.
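
The ghost-cell/halo-exchange scheme described above can be sketched in a single process with NumPy arrays standing in for MPI ranks; each partition stores one ghost cell on either side of its interior DoFs (periodic neighbours; all names are illustrative, not the solver's API):

```python
import numpy as np

def exchange_halos(parts):
    """Fill one-cell ghost layers from (periodic) neighbouring partitions,
    emulating an MPI halo exchange on a 1D structured grid."""
    n = len(parts)
    for i in range(n):
        j = (i + 1) % n
        parts[i][-1] = parts[j][1]   # right ghost <- neighbour's first interior DoF
        parts[j][0] = parts[i][-2]   # neighbour's left ghost <- last interior DoF

def lap(p):
    # Second-difference (diffusion) operator over interior DoFs
    return p[:-2] - 2.0 * p[1:-1] + p[2:]

u = np.array([0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0])
# Two "ranks", each holding 4 interior DoFs plus two ghost cells
a = np.concatenate([[0.0], u[:4], [0.0]])
b = np.concatenate([[0.0], u[4:], [0.0]])
exchange_halos([a, b])
local = np.concatenate([lap(a), lap(b)])
reference = np.roll(u, 1) - 2.0 * u + np.roll(u, -1)  # global periodic operator
```

After the exchange, each partition evaluates the operator purely on local data, and the concatenated result matches the global evaluation.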

Author(s): Marco Scarpelli (Politecnico di Milano), Paola Francesca Antonietti (Politecnico di Milano), Carlo De Falco (Politecnico di Milano), Luca Formaggia (Politecnico di Milano), and Giovanni Viciconte (ENI S.p.A.)

Domain: Computational Methods and Applied Mathematics


Hybrid Block-Structured Grids for Coastal Ocean Domains

Achieving high performance and performance portability is critical for next-generation climate and ocean modelling on heterogeneous computing systems. Ocean models face complex, fractal-like coastlines and rapidly varying bathymetry, making unstructured triangular meshes attractive for their flexibility, but these grids often incur significant performance penalties due to irregular memory access. We examine whether the newly proposed generation method for hybrid Block-Structured Grids (hBSG) can bridge the gap between geometric flexibility and computational efficiency. Using the shallow water equations solver UTBEST-SYCL, we evaluate hBSGs, comprising structured and unstructured blocks that both consist of triangular elements, and compare them to fully unstructured grids and Block-Structured Grids (BSG; structured blocks only). Preliminary results show both promising speedups relative to the fully unstructured computation and challenges arising from increased algorithmic complexity.

Author(s): Jonathan Schmalfuß (University of Bayreuth), and Vadym Aizinger (University of Bayreuth)

Domain: Climate, Weather and Earth Sciences


Hypergraph Partitioning for Sparse Matrix Reordering

Fill-in during sparse matrix factorization remains a critical bottleneck in scientific computing. We present an efficient hypergraph partitioning approach for sparse matrix reordering based on the Clique-Node Hypergraph (CNH) representation, building on prior work by Çatalyürek et al. and Selvitopi et al. Our method transforms the sparsity pattern through an edge-clique cover, creating a hypergraph where cliques become nodes and original vertices become nets. Using a hypergraph partitioner, we generate a symmetric diagonal block form with a separator, then apply established ordering methods to each block. Across a benchmark suite of SuiteSparse matrices, our approach achieves fill-in reductions competitive with METIS, often outperforming it. This work demonstrates that hypergraph partitioning is a practical alternative for fill-in minimization.
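
The CNH construction described above can be illustrated with the trivial edge-clique cover in which every graph edge is its own 2-clique (a real implementation would first compute a much smaller cover); cliques become hypergraph nodes and each original vertex becomes a net spanning the cliques that contain it. Names here are illustrative:

```python
def clique_node_hypergraph(edges, n_vertices):
    """Build a Clique-Node Hypergraph from a trivial edge-clique cover.

    Each edge (u, v) is treated as a 2-clique with id c; the returned dict
    maps each original vertex (a net) to the list of clique nodes it connects.
    """
    nets = {v: [] for v in range(n_vertices)}
    for c, (u, v) in enumerate(edges):
        nets[u].append(c)
        nets[v].append(c)
    return nets
```

A hypergraph partitioner applied to this structure then yields the symmetric diagonal block form with a separator.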

Author(s): Ritvik Ranjan (ETH Zurich), Vincent Maillou (ETH Zurich), Alexandros Nikolaos Ziogas (ETH Zurich), and Mathieu Luisier (ETH Zurich)

Domain: Computational Methods and Applied Mathematics


An Integrated HPC Workflow for AI-Driven Immunogenic Peptide Prediction

Immunogenic peptides play important roles as drivers for the adaptive immune response – our bodies’ ultimate protection against infections and cancers. Parts of these peptides, called epitopes, are recognized by either major histocompatibility complexes or antibodies, which then interact with T-cells or B-cells, respectively. Identifying and designing these epitopes is crucial for immunotherapy and vaccine development, yet remains challenging due to the vast space of possible sequence combinations, limited experimental data, and the need to understand detailed atomic interactions and how they contribute to binding affinities. While AI tools, docking, and molecular dynamics simulations address parts of this problem, no single method is sufficiently accurate for practical vaccine development. We present an integrated computational workflow for AI-driven immunogenic peptide prediction designed for high-performance computing systems. The workflow combines three components: generative AI that designs immunogenic peptides using data from sequence and immunogenicity databases; docking that builds molecular complexes; and molecular dynamics simulations that characterize dynamic interactions and estimate binding affinities through an alchemical mutation screen. Crucially, structural and thermodynamic data from simulations feed back into the AI, iteratively improving predictions. Thus, by combining generative AI with physics-based methods, this workflow aims to approach the prediction accuracy necessary for practical vaccine development.

Author(s): Cathrine Bergh (KTH Royal Institute of Technology), Leonardo Salicari (CINECA), Danai Kotzampasi (Utrecht University), Victor Reys (Utrecht University), Narendra Kumar (National Institute of Immunology), Archana Achalere (Center for the Development of Advanced Computing), Sunitha Manjari Kasibhatla (Center for the Development of Advanced Computing), Alessandra Villa (KTH Royal Institute of Technology), Uddhavesh Sonavane (Center for the Development of Advanced Computing), and Alexandre Bonvin (Utrecht University)

Domain: Life Sciences


Large-Scale Molecular Dynamics Simulations for Advances in Biomimetic Carbon Capture Materials

Sustainable carbon capture and greenhouse gas mitigation require solutions that are innovative, reproducible, and scalable. Biomolecular catalysts are promising for low-energy CO2 capture, yet industrial deployment is limited by reduced stability under harsh operating conditions and the difficulty of systematic optimization. Using high-performance computing (HPC) and large-scale molecular dynamics (MD) simulations, we develop and validate reproducible workflows for biomimetic material design. Carbonic anhydrase (CA), a metalloenzyme that catalyzes reversible CO2 hydration, serves as the model system due to its high catalytic efficiency and well-characterized mechanism. We evaluate CA structure and dynamics across pH, temperature, and immobilization conditions relevant to industrial settings and benchmark simulations against experimental secondary-structure signatures and residue-level flexibility trends. These results demonstrate how HPC enables high-throughput, experimentally grounded screening of complex enzyme–surface systems, supporting scalable CA-based strategies for sustainable CO2 mitigation.

Author(s): Merve Fedai (North Carolina State University)

Domain: Chemistry and Materials


A Machine Learning Framework for CFD Applications

In the present study, we introduce an automated framework comprising two modules: Computational Fluid Dynamics (CFD) simulation and surrogate modelling. CFD simulations are performed to model and thermally assess battery air cooling under different air-stream conditions (i.e., stream velocity and initial temperature) and various battery-cell heat-generation rates. The open-source code OpenFOAM is used for the simulations. The primary aim of the surrogate module is to train a predictive model that approximates the outcome of a CFD simulation from previously generated CFD data. Because only a limited number of CFD results are available and the outputs are smooth functions of the inputs, the surrogate model is built on Gaussian Process Regression (GPR). The quantities of interest (QoI) were inlet velocity, initial temperature, and the battery heat-generation source. This method can also be useful for exploring large parameter spaces, performing sensitivity analyses, or enabling faster design iterations, where running many full CFD simulations would be too costly or time-consuming. It is noteworthy that machine learning is applied to the averaged data and not the instantaneous fluctuations; hence, the predictions are made for the trend and not the seasonality of the result.
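
A GPR surrogate of the kind described above reduces, in its simplest form, to a kernel regression with the posterior mean as predictor. The following plain-NumPy sketch is illustrative only; the kernel choice, length scale, noise level, and the sine stand-in for a smooth CFD response are placeholders, not the study's settings:

```python
import numpy as np

def rbf(a, b, length_scale=0.2):
    # Squared-exponential kernel on 1-D inputs
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length_scale**2)

def gpr_predict(x_train, y_train, x_test, noise=1e-8):
    """Posterior mean of a zero-mean Gaussian process with an RBF kernel."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    return rbf(x_test, x_train) @ np.linalg.solve(K, y_train)

x = np.linspace(0.0, 1.0, 9)
y = np.sin(2.0 * np.pi * x)                       # stand-in smooth response
pred = gpr_predict(x, y, np.array([0.25, 0.3125]))  # one training point, one gap
```

In practice one would use a maintained library (e.g. scikit-learn's GaussianProcessRegressor) with fitted hyperparameters rather than this fixed-kernel sketch.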

Author(s): Masumeh Gholamisheeri (STFC), Harry Durnberger (STFC), and Tim Powell (STFC)

Domain: Engineering


Maintainable, Sustainable, and Generalisable Datalayouts and Vectorisation for Rigid Body Molecular Dynamics

ls1-MarDyn (ls1) is a Molecular Dynamics (MD) simulator designed for large-scale simulations of multi-site molecules and has been successfully used in a variety of scientific studies. It represents molecules as rigid bodies composed of multiple interaction sites that each exert forces on their neighbours, with neighbours determined as those molecules whose centres of mass lie within a cutoff distance. Previous work integrated AutoPas, a modern algorithm-selection particle-simulation library, into ls1, replacing its older particle container and force calculation mechanisms. However, limitations of AutoPas confine this integration to single-site molecules only. The main challenge lies in extending AutoPas’s compile-time automatic Structure-of-Arrays generation, which is unsuitable for variable site counts per molecule. In this work, we discuss adding this functionality, including a novel data layout for this molecular representation that is better suited for vectorisation, and explore modern C++20 features to mitigate the otherwise increased code complexity. Additionally, we wish to avoid the maintainability and sustainability issues of ls1’s vectorisation, where a different version of the force calculation must be implemented for each instruction set. To address this, we have introduced Google Highway to replace hand-written vector instructions with a portable, maintainable solution, and we will discuss its performance and limitations.

Author(s): Samuel James Newcome (Technical University of Munich), Luis Gall (Technical University of Munich), David Martin (Technical University of Munich), Markus Mühlhäußer (Technical University of Munich), and Hans-Joachim Bungartz (Technical University of Munich)

Domain: Computational Methods and Applied Mathematics


Mapping the Productivity-To-Energy Trade-Off in Memory-Bound HPC via DVFS, Core Scaling, and C-State Control on Repurposed Hardware

High-performance computing (HPC) systems in resource-constrained environments, such as Africa, often rely on repurposed hardware, shifting the primary financial burden from capital expenditure to operational energy costs. This work aims to identify realistic and reproducible performance–energy sweet spots, providing practical guidance for cost-efficient HPC operation by balancing modest reductions in performance with substantial energy savings. In this context, Productivity-to-Energy (PTE) provides a more meaningful metric than raw performance alone for evaluating system efficiency and operational value. While Dynamic Voltage and Frequency Scaling (DVFS) is a well-known energy-optimisation tool, the combined effects of DVFS, core scaling, and deep CPU C-states remain poorly explored on repurposed HPC infrastructure. This project investigates these interactions for memory-bound HPC workloads on repurposed hardware. Guided by recent findings (PEARC25) and the Knights Landing (KNL) architecture, CPU configurations are adjusted on repurposed hardware to better align with memory throughput and reduce latency. Extensive benchmarks are conducted, systematically varying CPU frequency and core-count configurations, with unused cores explicitly power-gated using deep C-states (C10) to ensure controlled hardware activation. Using real scientific workloads, such as HPCG, OpenFOAM, and CalculiX, this research captures memory- and latency-dominated behaviours to define energy-aware operating points relevant for the African HPC community.

Author(s): Suné Toerien (University of the Witwatersrand), Vele Nefale (University of the Witwatersrand), Ntandoyenkosi Memela (University of the Witwatersrand), Mubeen Dewan (University of the Witwatersrand), Bryan Johnston (CSIR), Charles Crosby (CSIR), and Anand Patel (Private)

Domain: Applied Social Sciences and Humanities


On the Nexus of Data, Models, and Supercomputing: Optimization and Uncertainty Quantification in HPC

The future of HPC will blend advanced simulation with model training, integrating multi-fidelity stochastic ensembles, computational steering, active learning, and interactive visualization. As we move beyond single “hero” simulations, HPC must support dynamic workflows that allow scientific questions to be defined and redefined in real time. This shift demands not only high-resolution simulations but also the robust statistical treatment of uncertainty—from propagation and calibration to assimilation of streaming data. Concurrently, the advent of AI poses challenges in ensuring model trust and interpretability, crucial for high-stakes applications where traditional PDE-based models have set the standard. With this poster we present scientific results and current methodology to explore how HPC currently meets these evolving needs by balancing complex model evaluations and ensemble predictions efficiently under time and energy constraints.

Author(s): Antigni Georgiadou (Oak Ridge National Laboratory)

Domain: Computational Methods and Applied Mathematics


Optimizing the ICON Dynamical Core for GPUs Utilizing GT4Py and DaCe

Numerical weather prediction relies on numerical models running on large supercomputers. Improving the performance of these models is an active field of research that benefits society. The ICON model is a finite volume model running on an icosahedral mesh. Finite volume stencil computations on an icosahedral mesh pose a memory-bound optimization problem that benefits heavily from inlining and fusion, resulting in the demotion of fully realised fields, which are written to and read from global memory, to scalars, which can live in registers. In this poster we showcase an optimization pipeline which improves the performance of the dynamical core of ICON for production-relevant MeteoSwiss experiments by 1.3x over the OpenACC baseline on Nvidia H100 GPUs and by 1.15x on Nvidia A100 GPUs. The steps of the pipeline are a code elimination stage done in GT4Py, where all dynamical core code branches not relevant for the current experiment are deleted, followed by an inlining and fusion stage in DaCe, which combines the remaining stencil computations into as few CUDA kernels as possible. Performance results for production MeteoSwiss experiments on A100 and H100 GPUs are presented and the difference to the OpenACC baseline is discussed.
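
The fusion-and-demotion effect can be illustrated with a toy 1-D example: an unfused pipeline materializes an intermediate field in memory between stages, while the fused loop keeps the intermediate in scalars and produces identical results. This is a pure-Python sketch of the concept, unrelated to the actual GT4Py/DaCe code:

```python
import numpy as np

def unfused(u):
    grad = u[1:] - u[:-1]          # stage 1: temporary field written to memory
    return grad[1:] - grad[:-1]    # stage 2: temporary field read back

def fused(u):
    out = np.empty(len(u) - 2)
    for i in range(len(u) - 2):
        g_left = u[i + 1] - u[i]       # intermediate demoted to a scalar
        g_right = u[i + 2] - u[i + 1]  # (lives in a register, never in memory)
        out[i] = g_right - g_left
    return out
```

On memory-bound hardware the fused form removes one full write and one full read of the intermediate field per grid point.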

Author(s): Christoph Müller (MeteoSwiss), Magdalena Luz (ETH Zurich / CSCS), Nicoletta Farabullini (ETH Zurich / CSCS), Till Ehrengruber (ETH Zurich / CSCS), Chia Rui Ong (ETH Zurich / CSCS), Daniel Hupp (ETH Zurich / CSCS), Philip Müller (ETH Zurich / CSCS), Edoardo Paone (ETH Zurich / CSCS), Ioannis Magkanaris (ETH Zurich / CSCS), Christos Kotsalos (ETH Zurich / CSCS), Yilu Chen (ETH Zurich / CSCS), Jacopo Canton (ETH Zurich / CSCS), Hannes Vogt (ETH Zurich / CSCS), Enrique González Paredes (ETH Zurich / CSCS), Rico Häuselmann (ETH Zurich / CSCS), Anurag Dipankar (ETH Zurich / CSCS), Mauro Bianco (ETH Zurich / CSCS), William Sawyer (ETH Zurich / CSCS), and Mikael Simberg (ETH Zurich / CSCS)

Domain: Climate, Weather and Earth Sciences


Parallel Tempering on Boundary Conditions with Normalizing Flows to Solve Topological Freezing

In particle physics, Lattice Quantum Chromodynamics (LQCD) studies the strong interaction, responsible, for example, for the binding of atomic nuclei, through computational methods. An essential part of LQCD consists in sampling high-dimensional multi-modal distributions, for which direct sampling methods are not available. Standard methods based on Markov Chain Monte Carlo algorithms have proven useful, but face shortcomings such as long autocorrelations, often due to being unable to sample distributions where regions of high probability are separated by extended regions of low probability, a problem known as Topological Freezing in LQCD. In this work, we explore a solution to Topological Freezing with Parallel Tempering on Boundary Conditions (PTBC) and normalizing flows. The former is an algorithm that allows traveling through low-probability regions by evolving several Markov chains in parallel with slightly different conditions and proposing exchanges between the different chains. Normalizing flows are a machine learning method that can learn to generate samples from a complex distribution starting from samples of a simpler distribution. By accelerating the PTBC algorithm with normalizing flows, we aim to obtain samples with lower autocorrelations.
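
The exchange step in parallel tempering can be sketched with the standard Metropolis swap rule. Here chains are labelled by an inverse-temperature-like parameter for simplicity, whereas in PTBC the chains actually differ in their boundary conditions; the names below are illustrative only:

```python
import math
import random

def swap_prob(beta_i, beta_j, e_i, e_j):
    """Metropolis acceptance probability for exchanging the configurations
    of two replicas with energies/actions e_i and e_j."""
    return min(1.0, math.exp((beta_i - beta_j) * (e_i - e_j)))

def attempt_swap(chains, i, rng=random):
    """Propose swapping neighbouring replicas i and i+1 in place.

    Each chain is stored as a (beta, energy) pair; on acceptance the
    configurations (energies) are exchanged while the betas stay put.
    """
    (b1, e1), (b2, e2) = chains[i], chains[i + 1]
    if rng.random() < swap_prob(b1, b2, e1, e2):
        chains[i], chains[i + 1] = (b1, e2), (b2, e1)
```

Configurations frozen in one topological sector can thus migrate toward chains whose conditions let them tunnel, and back.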

Author(s): Victor Granados (University of Bern)

Domain: Physics


Performance-Portable and Highly Scalable Spectral Transforms with ecTrans

The continued increase in the skill of weather forecasts observed over the past decades depends crucially on the efficient exploitation of the next generation of high-performance computers by Earth-system models. The European Centre for Medium-Range Weather Forecasts (ECMWF)’s model, the IFS, is one such model. The IFS atmospheric component relies on a spectral transform method, a decades-old technique that is still competitive today. This is one of the most important parts of the entire system, and it has been tested even up to 750 m resolution. Development of the spectral transform code has accelerated since it was released as the open-source package ecTrans in 2022, managed by ECMWF. Recently, support for both Nvidia and AMD GPUs has been added, and the spectral transform has proven to be a natural fit for accelerators. In this poster I will summarise the recent developments to ecTrans and present an overview of its performance-portability characteristics in particular. I will also present a comparison against a new, simpler spectral transform recently implemented in pure PyTorch, developed in order to train data-driven weather forecasting models with spectral loss functions.

Author(s): Sam Hatfield (ECMWF)

Domain: Climate, Weather and Earth Sciences


Perspectives on Teamwork and AI in Scientific Computing

The development and use of high-quality software—a primary mechanism for sustained collaboration and progress in scientific computing—is undergoing profound change, driven by increasing complexity in scientific drivers and computing architectures and the rapid adoption of AI, placing demands on how teams collaborate and share expertise. The Next-Generation Ecosystems for Scientific Computing project addresses challenges in team-based scientific software through coordinated research and community engagement. A central component is understanding collaboration practices and patterns of AI use in scientific computing to inform the development and curation of resources that support cross-disciplinary teamwork. This poster presents foundational empirical work for the project, focusing on the design, execution, and analysis of a community survey and follow-up interviews. We surveyed 79 scientists and software professionals to understand needs for enhancing teamwork and leveraging AI in service of accelerated scientific discovery. Results indicate that while teamwork is widely valued in scientific computing, awareness of collaboration best practices and access to training remain limited. Although most participants reported using AI, less than half indicated that their teams had considered AI policies. Findings underscore the need for intentional socio-technical co-design in scientific computing (see https://arxiv.org/abs/2510.03413) to support reproducible, reliable, and trustworthy science.

Author(s): Olivia B. Newton (University of Montana), Anshu Dubey (Argonne National Laboratory), Denice Ward Hood (University of Illinois Urbana-Champaign), Lois Curfman McInnes (Argonne National Laboratory), and Santiago Ospina Tabares (University of Illinois Urbana-Champaign)

Domain: Applied Social Sciences and Humanities


The Portable Model for Multi-Scale Atmospheric Prediction (PMAP): Towards Sub-Kilometer Scale and Large-Eddy Simulation of Real Weather

The Portable Model for multi-scale Atmospheric Prediction (PMAP) is an advanced high-resolution numerical model. Written entirely in Python, it leverages the GT4Py domain-specific language to achieve high performance and portability – running straightforwardly on laptops and GPU-accelerated HPC systems alike. The systematic separation of concerns between domain science and performance engineering provides new avenues for model development, setup, and refinement. Here, we highlight PMAP’s strengths as a model framework to refine numerical algorithms, physical parameterizations, and diagnostics, as well as to optimize computational performance to enable efficient sub-kilometer-scale and large-eddy simulation of real weather. First, we demonstrate competitive performance of the model with respect to time-to-solution and weak scalability, as measured on the Alps infrastructure for a well-established benchmark. Second, we illustrate how the Python-based model formulation facilitates evaluating and improving numerical aspects of the model, exemplified here in terms of a tracer transport experiment in complex, exceptionally steep terrain. Third, we showcase PMAP’s capabilities for simulating extreme weather events at the hectometer-scale with the example of Hurricane Melissa. The results not only show a much more realistic intensification of the storm, as compared to an established kilometer-scale model, but they also reveal exceptional detail in extreme winds and precipitation.

Author(s): Lukas Papritz (ETH Zurich), Nicolai Krieger (ETH Zurich), Christian Kühnlein (ECMWF), Till Ehrengruber (ETH Zurich / CSCS), Sara Faghih-Naini (ECMWF), Stefano Ubbiali (ETH Zurich), Gabriel Vollenweider (ETH Zurich), Heini Wernli (ETH Zurich), and Jan Zibell (ETH Zurich)

Domain: Climate, Weather and Earth Sciences


Predictive Alerts and Atmospheric Data for Airport Windshear by CPAS 200m Weather Model

Low-level wind shear (LLWS) at Hong Kong International Airport (HKIA) poses significant aviation risks, primarily driven by terrain-induced turbulence from Lantau Island and sea-breeze interactions. While current mitigation relies on real-time detection and short-term forecasting, for operational purposes it is desirable to extend forecast horizons for managing wind shear (WS). This study evaluates a 200-m resolution limited-area simulation using the MPAS-based ClusterTech Platform for Atmospheric Simulation (CPAS) to predict sea-breeze shear 24 hours in advance, building on its historical success in resolving coastal and terrain features. The framework utilises Adaptive Mesh Refinement (AMR) to optimise computational efficiency, triggering the 200-m high-resolution Lantau mesh, all running on a conventional CPU-based HPC system. Validated against 10 cases from spring 2025 (7 WS and 3 non-WS events), the model successfully captured sea-breeze onset within 2 hours of METAR warnings in 6 out of 7 WS cases. Furthermore, it correctly predicted the 3 non-WS cases, accurately simulating the suppression of local circulation by slightly stronger prevailing winds. These results demonstrate the viability of the 200-m CPAS configuration for day-ahead operational forecasting. Future work will expand validation to include terrain-induced wake turbulence and microburst scenarios.

Author(s): Sai Lun Tin (ClusterTech Limited), Chi Chiu Cheung (ClusterTech Limited), Ka Ki Ng (ClusterTech Limited), and Wai Pang Sze (ClusterTech Limited)

Domain: Climate, Weather and Earth Sciences


Profile-Guided-Optimisation of Lattice QCD Contractions on CPU and GPU

We present a performance optimisation study of the 2+2 disconnected component, a bottleneck of the Lattice QCD computation of the hadronic light-by-light contribution to the muon’s anomalous magnetic moment. Optimisations on CPU and GPU architectures were guided by the popular profiling tools perf, valgrind, nsys, and ncu. The 2+2 contraction is first decomposed into smaller modules for easier profiling and optimisation. The CPU optimisations included loop re-arrangement to improve data locality, the replacement of multi-level nested memory allocations with contiguous 1D pointers, and the resolution of thread-synchronisation bottlenecks in parallel regions. L1 cache miss rates see multi-factor reductions, and parallelised regions show near-perfect strong scaling across the modules. On GPUs, we rearranged data access patterns, improved parallel work distribution, and identified latency issues. Finally, kernels are tied together with asynchronous streaming to overlap workloads at small problem sizes and hide communication and data-copy latencies. The optimised code achieved an overall 30% runtime reduction in production environments (GPU) on CSCS Daint. This project highlights how systematic profiling and targeted optimisations can yield significant resource savings in computationally intensive legacy code.

Author(s): JingJing Li (University of Bern), Urs Wenger (University of Bern), and Roman Gruber (University of Bern)

Domain: Physics


RDQ: A Zero-Copy Remote Data Queue for In-Situ Machine Learning in HPC

The integration of machine learning into the computational sciences is increasingly pursued to reduce time-to-solution, alleviate I/O bottlenecks, and enable adaptive analysis during simulation. We present RDQ (Remote Data Queue), a library for coupling HPC simulations and machine learning training using an MPMD (Multiple Program Multiple Data) MPI (Message Passing Interface) approach. RDQ is part of an ongoing effort to combine HPC and ML workflows and serves as a research platform for investigating in-situ training and data reduction techniques. RDQ extends existing data staging mechanisms by enabling zero-copy tensor concatenation and put-side data sampling for in-situ learning scenarios. RDQ is based on MPI-3 RMA operations and requires no user-managed compute resources, allowing flexible placement of data queues within the communicator. Internally, RDQ employs a multi-ring-buffer design that supports zero-copy receives, including direct access from Python and PyTorch via the DLPack tensor interface. Performance measurements show that RDQ achieves up to 2600 messages per second for small messages and saturates a 100 Gb/s network interface with message sizes of 8 MiB.

Author(s): Maximilian Sander (TU Dresden), and Jens Domke (RIKEN)

Domain: Engineering


Scaling Linear Algebra: Eigenvalue Solvers and Performance Trends on Contemporary HPC Systems

Large-scale eigenvalue problems are a fundamental component of modern scientific simulations in fields such as materials science, computational chemistry, and theoretical physics, where they often represent a dominant computational bottleneck. The efficiency and scalability of linear algebra and eigenvalue algorithms are therefore critical to fully exploiting contemporary high-performance computing (HPC) architectures. This work presents a systematic performance comparison of widely used dense eigensolver libraries (ScaLAPACK, cuSOLVER, rocSOLVER, cuSOLVERMp, ELPA) across several HPC platforms, including CPU-only and heterogeneous GPU-accelerated systems (Leonardo, Pitagora, LUMI, MareNostrum 5, and Fugaku). The study evaluates performance and scaling behaviour over a broad range of matrix sizes, data types, and precision levels. Our results show that the GPU-based cuSOLVER library achieves the best performance for small matrix sizes (up to 10^8 elements), while the cuSOLVERMp library provides better performance for medium to large matrices on a limited number of nodes. However, ScaLAPACK and ELPA show the best strong-scaling behaviour. The reproducible benchmark suite provides a practical reference for selecting and tuning eigensolvers for large-scale scientific applications in various HPC environments.

Author(s): Maria Montagna (CINECA), and Sergio Orlandini (CINECA)

Domain: Computational Methods and Applied Mathematics


Shedding Light on the Solar Dynamo Using Bayesian Data Science

Solar magnetic activity exhibits an approximately 11-year cycle that is far from stable, showing strong long-term variability and recurrent episodes of strongly reduced activity known as Grand Minima. Understanding the origin of these features remains a fundamental challenge in solar physics and is crucial for improving predictions of solar activity. We investigate the hypothesis that the solar dynamo operates near a critical bifurcation point at which two oscillatory regimes coexist, corresponding to normal and Grand-Minimum-like activity. We employ a zero-dimensional stochastic dynamo model formulated as a stochastic delay differential equation, which exhibits bistability and noise-induced transitions between weak and strong dynamo modes. Model parameters are inferred using simulation-based Bayesian methods that rely on forward simulations. We apply two independent approaches: a Simulated Annealing Approximate Bayesian Computation algorithm and a neural simulation-based inference method based on normalizing flows. Using both the observed sunspot record and a millennial-scale reconstruction, we obtain consistent posterior distributions concentrated near the critical bifurcation point, supporting a near-critical interpretation of the solar dynamo.
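
Both inference approaches build on the likelihood-free, simulation-based idea, whose simplest form is rejection ABC: draw parameters from the prior, run the forward simulation, and keep draws whose summary statistic lands close to the observation. The toy below illustrates only this basic principle; the poster's Simulated Annealing ABC and neural methods are far more sample-efficient, and all names and the toy model are hypothetical:

```python
import random

def abc_rejection(observed, simulate, sample_prior, eps, n_draws, rng):
    """Rejection ABC: accept theta when the simulated summary statistic
    lies within eps of the observed value."""
    accepted = []
    for _ in range(n_draws):
        theta = sample_prior(rng)
        if abs(simulate(theta, rng) - observed) < eps:
            accepted.append(theta)
    return accepted

rng = random.Random(0)
# Toy forward model: the summary is theta itself plus small simulator noise
posterior = abc_rejection(
    observed=0.0,
    simulate=lambda t, r: t + 0.01 * r.gauss(0.0, 1.0),
    sample_prior=lambda r: r.uniform(-1.0, 1.0),
    eps=0.1,
    n_draws=2000,
    rng=rng,
)
```

The accepted draws approximate the posterior concentrated around the observation, here within the eps tolerance of zero.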

Author(s): Simone Ulzega (Zurich University of Applied Sciences, Institute of Computational Life Sciences), and Carlo Albert (Swiss Federal Institute of Aquatic Science and Technology (EAWAG))

Domain: Physics


siaMAP: A Sequence Integrity–Aware Mapping Framework for Genome-Wide Pooled Genetic Screens

Genome-wide pooled genetic screens enabled by CRISPR-Cas9 allow high-throughput functional interrogation of thousands of genes in a single experiment. Analysis of these screens relies on computational pipelines that process massive NGS datasets through quality control, large-scale read mapping, and statistical modeling, where the mapping stage must efficiently handle hundreds of millions of short reads. Current analysis pipelines primarily focus on read-level quality metrics while largely overlooking the structural integrity of perturbation-identifying sequences such as sgRNAs or barcodes. As a result, reads containing truncated or structurally altered sequences may be incorporated into quantification, potentially introducing systematic bias. Previous studies have shown that molecular processes during library construction and delivery can compromise perturbation sequence structures, underscoring the need for integrity-aware analysis. Here, we present siaMAP, a sequence integrity–aware mapping framework that integrates structural quality assessment directly into the mapping process. siaMAP detects and quantifies structural defects in perturbation-identifying sequences and selectively maps integrity-preserving reads. Although demonstrated using genome-wide shRNA screen data, the framework is designed to be readily applicable to CRISPR-Cas9 sgRNA-based pooled screens. Our results indicate that integrity-aware mapping significantly alters quantitative signals and hit composition, highlighting sequence integrity as a critical factor for reliable large-scale pooled genetic screen analysis.

Author(s): Jihyeob Mun (Korea Institute of Science and Technology Information)

Domain: Life Sciences


Stencil Computation on Tenstorrent Wormhole

The rapid ascent of large language models (LLMs) has prioritized domain-specific accelerators (DSAs) optimized for dense matrix-based deep learning. However, the suitability of these architectures for traditional high-performance computing (HPC) kernels, like stencil-based partial differential equation (PDE) solvers, remains largely unexplored. This research investigates mapping stencil operations, the computational core of weather forecasting, fluid dynamics, and seismic imaging, onto Tenstorrent’s Wormhole, a RISC-V-based AI accelerator. The study evaluates two heterogeneous CPU-Wormhole methodologies: an “axpy-style” approach using scaled vector additions and a matrix-multiplication formulation requiring complex stencil-to-row transformations. By delegating irregular scalar logic and boundary conditions to the CPU while leveraging Wormhole for parallel tiled execution, the design maximizes the accelerator’s throughput. Experimental results indicate that while optimized CPUs currently lead in raw speed, Wormhole demonstrates superior energy efficiency per grid update. Profiling identifies bottlenecks in fixed tile layouts and scratchpad data movement. Consequently, this work proposes hardware and software refinements, including flexible tile sizes, enhanced scalar units, and unified memory architectures, to evolve AI accelerators into competitive, general-purpose engines for scientific discovery. This work highlights that AI accelerators represent a promising path forward; their efficiency is vital for reducing energy consumption in HPC environments while maintaining high performance.
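The "axpy-style" formulation mentioned above amounts to expressing a stencil sweep as a handful of scaled vector additions over shifted views of the grid, an operation shape that maps naturally onto an accelerator's vector units. A minimal 1-D sketch (coefficients and boundary handling are illustrative, not the Wormhole kernel itself):

```python
import numpy as np

def stencil_axpy(u, c_left, c_center, c_right):
    """One sweep of a 1-D 3-point stencil written as scaled vector
    additions (axpy operations) on shifted views; irregular boundary
    logic stays on the host CPU, as in the heterogeneous design."""
    out = np.zeros_like(u)
    interior = out[1:-1]
    interior += c_left * u[:-2]       # axpy: out += a * shift(u, -1)
    interior += c_center * u[1:-1]    # axpy: out += a * u
    interior += c_right * u[2:]       # axpy: out += a * shift(u, +1)
    out[0], out[-1] = u[0], u[-1]     # boundary conditions handled by the CPU
    return out

u = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
# Second-difference stencil (1, -2, 1): the interior of a linear ramp is zero.
print(stencil_axpy(u, 1.0, -2.0, 1.0))  # → [0. 0. 0. 0. 4.]
```

The alternative matrix-multiplication formulation instead gathers each stencil neighborhood into a row, trading the transformation cost for the accelerator's dense matrix throughput.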

Author(s): Lorenzo Piarulli (Sapienza University), and Daniele De Sensi (Sapienza University)

Domain: Computational Methods and Applied Mathematics


Structured Reinforcement Learning for Loop Transformation in MLIR

Optimizing polyhedral kernels for modern multicore architectures is a high-dimensional, non-convex problem where small structural changes often yield orders-of-magnitude runtime variation. While traditional compilers rely on rigid static heuristics and autotuners require prohibitive search times, this poster presents RL Tuner, a reinforcement learning-based compilation framework that automatically discovers high-performance optimization schedules within the MLIR ecosystem. Unlike prior approaches that treat code as text, RL Tuner utilizes a novel, geometry-aware state representation derived from Linalg iterator semantics and affine access patterns. This allows the agent to reason directly about loop hierarchy, parallelism, and data reuse. To guarantee semantic correctness, we implement a dependency-driven action masking mechanism that restricts the agent to a subspace of legal transformations. By leveraging the MLIR Transform Dialect, RL Tuner applies hierarchical optimizations—including tiling, interchange, and fusion—via handle-based scheduling. Experimental evaluations on PolyBench kernels demonstrate that RL Tuner consistently outperforms the state-of-the-art Pluto compiler, achieving up to a 12.9× speedup on complex triangular kernels where traditional heuristics stagnate. These results highlight the promise of structure-aware reinforcement learning as a scalable and effective approach for automated, performance-portable kernel optimization.
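The dependency-driven action masking described above can be sketched as masking illegal transformations out of the policy's softmax before sampling. The logits, action set, and legality mask below are stand-ins for illustration, not the RL Tuner API.

```python
import numpy as np

def masked_policy_sample(logits, legal_mask, rng):
    """Sample an action from a softmax policy with illegal transformations
    masked to probability zero, so the agent only ever explores the
    subspace of semantically legal schedules."""
    masked = np.where(legal_mask, logits, -np.inf)   # illegal -> -inf logit
    z = masked - masked.max()                        # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(logits), p=probs)

# Hypothetical action set: 0 = tile, 1 = interchange, 2 = fuse.
# Suppose dependence analysis rules out interchange for this loop nest.
rng = np.random.default_rng(0)
action = masked_policy_sample(np.array([0.2, 1.5, 0.3]),
                              np.array([True, False, True]), rng)
```

Because masked actions receive exactly zero probability, correctness is guaranteed by construction rather than by penalizing illegal transformations in the reward.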

Author(s): Abrar Hossain (University of Toledo)

Domain: Computational Methods and Applied Mathematics


Tracking Mechanistic Evolution Across Brain Tissues and Cell Types Using Multiplex Networks

Tracking the evolution of biological function remains a major challenge in computational biology, as existing approaches are often limited to sequence conservation, gene presence, or predefined pathways. These methods can fail to identify conserved functional mechanisms even though constituent genes change. To address these limitations, we present a network-native framework for tracking functional evolution in the brain using large multiplex biological networks. We construct whole-brain, brain tissue–specific, and brain cell type–specific multiplexes by integrating hundreds of human biological networks. For each gene, we compute Random Walk with Restart (RWR) embeddings that capture topological context within the multiplex. Pairwise distances between embeddings are hierarchically clustered to identify data-driven mechanistic modules without reliance on predefined pathways. To assess evolutionary conservation, we integrate comparative genomics data across multiple species and align ortholog presence with modules using phylogenetically ordered heat maps. This enables quantitative comparison of conserved, diverged, and context-specific mechanisms across biological scales. All RWR and distance matrix computations are implemented using an MPI-distributed, GPU-accelerated pipeline and executed on the Oak Ridge Leadership Computing Facility Frontier supercomputer. This framework enables analysis of functional evolution and supports informed model organism selection based on conserved biological mechanisms.
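A Random Walk with Restart embedding for one seed gene can be computed by iterating the standard RWR fixed point. The dense sketch below shows the idea on a toy graph; the poster's actual pipeline is MPI-distributed and GPU-accelerated over multiplex networks, and the restart probability here is an arbitrary choice.

```python
import numpy as np

def rwr_embedding(adj, seed_idx, restart=0.5, tol=1e-10, max_iter=1000):
    """Random Walk with Restart steady state for one seed node:
    p = (1 - r) * W^T p + r * e_seed, iterated to convergence, where W
    is the row-normalized adjacency. The converged p is the node's
    topological-context embedding."""
    w = adj / adj.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    e = np.zeros(adj.shape[0])
    e[seed_idx] = 1.0
    p = e.copy()
    for _ in range(max_iter):
        p_next = (1.0 - restart) * (w.T @ p) + restart * e
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Toy graph: a triangle (0, 1, 2) plus a pendant node 3 attached to node 2.
adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
p = rwr_embedding(adj, seed_idx=0)
```

Pairwise distances between such per-gene vectors are what the hierarchical clustering step operates on to recover mechanistic modules.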

Author(s): Kenneth Smith (Oak Ridge National Laboratory), Matthew Lane (University of Tennessee), Alice Townsend (Oak Ridge National Laboratory, University of Tennessee), Jean Merlet (Oak Ridge National Laboratory, University of Tennessee), Anna Vlot (Oak Ridge National Laboratory), Alana Wells (Oak Ridge National Laboratory), and Daniel Jacobson (Oak Ridge National Laboratory, University of Tennessee)

Domain: Life Sciences


Tuning the Performance of Three-Body Interactions in Molecular Dynamics

Molecular Dynamics (MD) simulations predict thermophysical properties, yet standard pair potentials can lack the desired accuracy for certain applications. Introducing three-body potentials, such as the Axilrod-Teller-Muto model, improves results but poses significant computational challenges. This work investigates neighbor-finding algorithms using AutoPas, a C++ library for auto-tuning short-range MD simulations. We adapt classical pairwise approaches, such as Linked Cells and Verlet Lists, to handle three-body interactions. Results show that runtime disparities between three-body configurations are much larger than for their pairwise counterparts. We identify the “hit rate” (the ratio of non-zero force computations to distance checks) as a critical factor, which is inherently lower for particle triplets than for pairs. An effective algorithm must therefore balance the number of redundant checks against its memory costs. We provide an extensive comparison of cell traversals and neighbor list generation methods and discuss performance results across multiple scenarios. Furthermore, we highlight optimizations for SIMD vectorization of the force kernel and analyze how the lower hit rate hinders achieving optimal theoretical speedups. Consequently, we demonstrate why dynamic auto-tuning is an even more important tool for the computation of three-body interactions.
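Why the hit rate is inherently lower for triplets follows from the definition: a triplet only contributes a non-zero force when all three mutual distances fall within the cutoff. The brute-force sketch below illustrates the metric on random points; it is not the AutoPas implementation, and the cutoff and box size are arbitrary.

```python
import itertools
import numpy as np

def hit_rates(positions, cutoff):
    """Hit rate = non-zero force evaluations / distance checks, computed
    for pairs and for triplets (a triplet hits only if all three mutual
    distances are within the cutoff)."""
    within = lambda i, j: np.linalg.norm(positions[i] - positions[j]) < cutoff
    n = len(positions)
    pairs = list(itertools.combinations(range(n), 2))
    pair_hits = sum(within(i, j) for i, j in pairs)
    triples = list(itertools.combinations(range(n), 3))
    trip_hits = sum(within(i, j) and within(j, k) and within(i, k)
                    for i, j, k in triples)
    return pair_hits / len(pairs), trip_hits / len(triples)

rng = np.random.default_rng(1)
pts = rng.uniform(0.0, 3.0, size=(40, 3))   # 40 particles in a 3x3x3 box
pair_rate, triplet_rate = hit_rates(pts, cutoff=1.0)
# The triplet hit rate is far below the pair hit rate, so many more
# distance checks are wasted per useful force evaluation.
```

This is the effect that makes the trade-off between redundant checks and neighbor-list memory so much sharper for three-body interactions.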

Author(s): Markus Mühlhäußer (Technical University of Munich), Samuel James Newcome (Technical University of Munich), Fabio Gratl-Gaßner (Technical University of Munich), Manish Kumar Mishra (Technical University of Munich), and Hans-Joachim Bungartz (Technical University of Munich)

Domain: Computational Methods and Applied Mathematics


Uncertainty Quantification for Energy Efficiency Analysis of Scientific Applications at Exascale

Energy efficiency has become a critical constraint in high-performance computing (HPC) as systems scale toward larger node counts. In modern HPC platforms, energy consumption is influenced by complex interactions among hardware and software parameters, including operating frequencies, concurrency levels, memory behavior, and runtime policies. These interactions are further affected by execution-time variability arising from hardware heterogeneity, resource contention, and system noise. As a result, energy measurements often exhibit significant fluctuations that are not captured by deterministic models or single-run experiments, limiting the reliability of traditional energy-aware optimization approaches. This work proposes using uncertainty quantification (UQ) as a systematic method for analyzing energy efficiency in scientific HPC applications. Instead of relying on average-case behavior, we explicitly model the variability in execution time and energy consumption as functions of system configuration parameters. We present a tool-based methodology that combines controlled experimental measurements with variance-based sensitivity analysis, surrogate modeling, and uncertainty propagation. This approach enables the identification of configuration parameters that most strongly affect energy consumption and performance variability, as well as the assessment of energy-performance trade-offs under uncertainty. Moreover, the proposed tool is application-agnostic and applicable across a wide range of scientific HPC workloads and architectures.
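Variance-based sensitivity analysis of the kind described above attributes the variance of an output (e.g., energy) to individual configuration parameters. The crude binned Monte Carlo estimator below illustrates first-order Sobol indices on a hypothetical energy model; the poster's tool uses dedicated surrogate models and proper UQ machinery rather than this sketch.

```python
import numpy as np

def first_order_sensitivity(f, n=20000, seed=0):
    """Crude first-order Sobol indices S_i = Var(E[Y|X_i]) / Var(Y),
    estimated by binning Monte Carlo samples of two normalized
    configuration parameters. Illustrative estimator only."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n, 2))
    y = f(x[:, 0], x[:, 1])
    var_y = y.var()
    indices = []
    for i in range(2):
        bins = np.digitize(x[:, i], np.linspace(0.0, 1.0, 21))
        cond_means = np.array([y[bins == b].mean() for b in np.unique(bins)])
        indices.append(cond_means.var() / var_y)   # Var of conditional means
    return indices

# Hypothetical energy model: dominated by parameter 0 (say, frequency),
# with a weak contribution from parameter 1 (say, concurrency level).
s = first_order_sensitivity(lambda freq, conc: 5.0 * freq**2 + 0.3 * conc)
```

A configuration parameter with a large index is one whose variation explains most of the observed energy fluctuation, which is precisely the screening the methodology needs before building surrogates.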

Author(s): Matheus Machado (UFRGS), Mariana Costa (UFRGS), Matheus Costa (UFRGS), Philippe Navaux (UFRGS), Arthur Lorenzon (UFRGS), Antigoni Georgiadou (Oak Ridge National Laboratory), and Bronson Messer (Oak Ridge National Laboratory)

Domain: Computational Methods and Applied Mathematics


Using Generative Machine Learning to Produce High-Resolution Weather Data

Generative machine learning techniques show promise for performing atmospheric downscaling (super-resolution for meteorological data) to produce high-resolution weather and climate simulations. Previous work has evaluated the quality of these models using standard error scores such as RMSE, absolute error, or power spectra. These metrics are of limited use for meteorological data, where long-term trends and the prediction of extreme events are important; to truly test the performance of ML methods for downscaling, we must use metrics better suited to meteorological models. In this work, we adapt the CorrDiff downscaling model (Mardani et al. [1]) to the Alpine region, projecting from a global dataset on a ~30 km resolution grid (ERA5) to target regional datasets on 2 km (COSMO) and 1 km (ICON) resolution grids. We report on the computational performance of the model and introduce evaluation techniques to analyze model quality and compare with the results of numerical weather prediction. We evaluate the quality of the model’s outputs on a single-sample, daily, and seasonal basis.

Author(s): Petar Stamenkovic (MeteoSwiss, ETH Zurich), Mary McGlohon (MeteoSwiss, ETH Zurich), David Leutwyler (MeteoSwiss), Xavier Lapillonne (MeteoSwiss), Fabian Bösch (ETH Zurich / CSCS), Lukas Drescher (ETH Zurich / CSCS), Henrique Mendonça (ETH Zurich / CSCS), Sebastian Schemm (University of Cambridge), Siddhartha Mishra (ETH Zurich), and Oliver Fuhrer (MeteoSwiss)

Domain: Climate, Weather and Earth Sciences