Abstract
Significant innovations in the field of High Performance Computing (HPC) have contributed to application
reproducibility and portability, enabled power-aware schedulers, and integrated microgrids into the
datacenter ecosystem. The unprecedented growth in Artificial Intelligence and Machine Learning, driven in
part by HPC capabilities, has led to increasing integration of computational methods into primary
supercomputer software components, such as schedulers. However, despite these advancements, long-term impact
analysis is currently lacking, as are tools designed to lower the barrier to HPC adoption and advanced
frameworks for ensuring long-term software reproducibility and portability. Additionally, while microgrid
integration with HPC systems has the potential to alleviate concerns about power quality and load-following
capabilities, current implementations give little attention to integrating microreactors with microgrids
for use with HPC.
This research introduces a novel Reinforcement Learning (RL) scheduler based on the Decentralized Distributed
Proximal Policy Optimization (DD-PPO) algorithm, which supports large-scale distributed training across
multiple workers without requiring parameter synchronization at every step. By eliminating reliance on
centralized updates to a shared policy, the DD-PPO scheduler enhances scalability, efficiency, and sample
utilization. Experimental validation using a large real-world dataset containing 11.5 million job traces
collected over seven years demonstrates superior performance in comparison to both heuristic-based schedulers
and existing RL-based scheduling algorithms. Additionally, our work quantifies the impact of a science
gateway on HPC access and provides a detailed Software Quality Assurance framework for HPC software.
Furthermore, this research introduces the Power Gateway, a novel approach to managing power in HPC systems
through the integration of a microgrid with a mobile HPC datacenter. Power Gateway explores the use of a
microreactor as the primary power source supplemented by other distributed energy sources. The integration is
examined from multiple perspectives, such as utilizing it for peak shaving, addressing load-following
challenges in microreactors, and enabling the HPC scheduler to influence its decisions.