What Is HPC Performance Engineering?Performance engineering for HPC refers to techniques applied to software to optimize the performance of that software on given hardware without changing the functionality. This often translates to shorter run-times and consumption of fewer resources. Shorter run-times result in a faster time-to-solution. Sometimes this is an absolute requirement—imagine large weather-modeling software that must run within an hour window to provide an immediate forecast for daily decision-making. Consuming fewer resources can be important as well. Besides saving on costs, using fewer resources can leave compute cycles, memory, disk space, etc., for other important applications.
HPC Performance Engineering (HPCPE) follows a certain process. One typically starts with a profiling tool that shows the parts of the code consuming the most time. Those parts get examined first. This is like taking a physical to determine your own body’s performance. One can then run different types of profiling experiments on those code sections to determine what aspect of the system is involved. HPCPE is multifaceted. Imagine you want to enhance your own performance. Optimal training, diet, exercise, and equipment are just a few factors that fuel success. Likewise, there are several HPC components that can be optimized:
- Pure computation - the fastest possible execution of computational instructions on the hardware. This can include improved math, efficient branch prediction and newer concerns like vectorization, out-of-order execution and maximizing the complex interactions between all of the above and cache hierarchies. Profiling and optional compiler reports can reveal when these issues arise, and standard solutions exist for addressing many performance situations.
- Synchronization - modern processors consist of many independent processing cores that must synchronize with each other to achieve complex parallelization as well as coordinate shared resources. Good synchronization prevents cores from stalling—they don’t need to wait for other cores or wait to access shared resources.
- Memory hierarchy - this includes cache, traditional memory and possibly non-volatile memory, all of which function as a hierarchy of faster to slower (but larger) workspaces for the computations. Newer architectures are introducing new levels on the “slow end “of the hierarchy, while others are simplifying the “faster levels” of the hierarchy. There are numerous strategies for improving cache and memory utilization depending upon how the code needs to access memory.
- Communication - larger scale computing and HPC require numerous computing units, commonly called “nodes,” to be linked together by a very fast network, the basis of which forms an HPC system capable of parallel computing. Similar to synchronization above, these nodes must coordinate efficiently via the network. What constitutes efficient synchronization and efficient communication can vary significantly.
- I/O - input is the ingestion of data into software, and output is the production of data out of the software for analysis and/or permanent storage. This is often the slowest part of the process. I/O and file devices are similar to memory in that they store data, but they are orders of magnitude slower. Parallel file systems provide the best overall performance, but they are complex to use effectively, so optimization of I/O should often occur first.
- Energy - all of the above components use energy—some more than others. For example, moving data can use more energy than computing, so given the option to reuse items from memory/storage (which requires moving those items) or recompute the items, it may make sense to recompute. This is the opposite of the strategy that one might have used on only moderately older platforms. Energy is expected to become a major factor in HPC in the future.
- Software constructs and design - this touches all of the above components but is geared toward allowing the programmer to express algorithms in human-understandable fashion. Sometimes those goals compete; that is, an elegant-looking algorithm may not perform very well. This is a complex issue that involves choice of programming language, synchronization/communication methods and libraries, math libraries, etc., and how those choices interact with the above components. By rearranging the tools applied and introducing abstraction layers to hide complex computation and data structures, it is possible to write elegant, maintainable, extensible code that performs well.
The PayoffThere is a lot to consider in performance engineering and it is rarely easy – a large reason why our customers seek out the specialized knowledge of Engility’s HPC computational scientists. However, the payoffs can be huge:
- Cost-effective HPC and HPC-in-the-Cloud - reduction of resources translates directly to improved cost-effectiveness, especially for pay-per-time cloud configurations.
- Fast Decision Support - fast turnaround times for simulations and analytics yield actionable information sooner.
- Extreme Parallelism - simulating and/or analyzing very large and complex problems requires significantly more processors and other resources executing in parallel. HPC Performance Engineering is often required, sometimes several times, to break through performance ceilings when adding more parallelism.