Study of the processor and memory power consumption of coupled sparse/dense solvers

1. Introduction

Aeroacoustics is the study of the coupling between acoustic phenomena and fluid mechanics. In the aeronautical industry, this discipline is used to model the propagation of acoustic waves in the air flows enveloping an aircraft in flight. In particular, it allows one to simulate the noise produced at ground level by an aircraft during the takeoff and landing phases. Beyond its technical importance for validating that regulatory environmental standards are met, it is also both a societal challenge (reducing noise to alleviate its impact on health) and an economic one (designing future planes that can be authorized to access suburban airports).

The approach adopted in this study consists in considering the airflow around the aircraft as uniform in almost all the space (a zone modeled by the Boundary Element Method, or BEM) except for the jet of the engines, where the flow is considered non-uniform (a zone modeled by the volume Finite Element Method, or FEM). The resulting linear system couples FEM and BEM, and therefore has both dense parts (from BEM) and sparse parts (from FEM). This coupled linear system, called FEM/BEM, thus has a particular structure that must be taken into account when solving it. Moreover, in order to produce a physically realistic result, the number of unknowns can be extremely large, which makes the treatment of this system a computational challenge. In a previous study RRIPDPS, we presented two classes of algorithms, namely the multi-solve and multi-factorization algorithms, for solving such systems. They rely on numerical compression techniques, which allow aircraft designers to process relatively large industrial test cases on a shared-memory workstation.

Until now, the dimensioning variables for such a calculation were the computation time, the consumed RAM and the disk occupation. The study of these quantities allowed us to understand the behavior of the software and to push its limits in order to handle larger cases. The consideration of carbon footprint issues in industry in general and in computing centers in particular leads the research community and the industry to consider another physical dimension: the power consumption of computations. The objective of this paper is to study the power consumption of the solution of a coupled FEM/BEM linear system, and to assess how the energy consumption of the processor and the memory varies with the computation time, the amount of memory used and the multiple algorithmic choices possible at the solver level. We consider both shared-memory multicore machines and small clusters of such nodes, typically used for relatively large problems in the aeroacoustics industry.

The rest of the paper is organized as follows. In Section 2, we further introduce the coupled FEM/BEM systems arising in aeroacoustics and then present the multi-solve and the multi-factorization algorithms from RRIPDPS we rely on to solve them. We also discuss related work on the analysis of the power consumption in the context of numerical simulation in general and numerical linear algebra in particular. The hardware, software and instrumentation setup we employ for the present study is introduced in Section 3. The power consumption of the multi-solve and multi-factorization algorithms on a shared-memory node is then analyzed in Sections 4 and 5, respectively, before being extended to a multi-node context in Section 6. We conclude in Section 7.

2. Background

2.1. Coupled FEM/BEM systems arising in aeroacoustics

We are interested in the solution of very large linear systems of equations $$Ax = b$$ with the particularity of having both sparse and dense parts. Such systems appear in an industrial context when we couple two types of finite element methods, namely the volume Finite Element Method (FEM) RaviartThomas1977,ern2013theory and the Boundary Element Method (BEM) banerjee1981boundary,Wang2004413. This coupling is used to simulate the propagation of acoustic waves around aircraft (see Fig. 2, left). In the jet flow created by the engines, the propagation medium (the air) is highly heterogeneous in terms of temperature, density, flow, etc. Hence we need a FEM approach to compute the propagation of acoustic waves in it. Elsewhere, we approximate the medium as homogeneous and the flow as uniform, and use BEM to compute the wave propagation. This leads casenave-phd,casenave2014coupled to a coupled sparse/dense FEM/BEM linear system (\eqref{org6865ac1}) with two groups of unknowns: $$x_v$$ related to a FEM volume mesh \colorbox{green}{$v$} of the jet flow and $$x_s$$ related to a BEM surface mesh \colorbox{red}{$s$} covering the surface of the aircraft as well as the outer surface of the volume mesh (see Fig. 2, right). The linear system $$Ax = b$$ to be solved may be written in more detail as:

\begin{equation} \label{org6865ac1} \begin{bmatrix} A_{vv} & A_{sv}^T \\ A_{sv} & A_{ss} \end{bmatrix} \times \begin{bmatrix} x_v \\ x_s \end{bmatrix} = \begin{bmatrix} b_v \\ b_s \end{bmatrix}. \end{equation}

In (\eqref{org6865ac1}), $$A$$ is a $$2 \times 2$$ block symmetric coefficient matrix, where $$A_{vv}$$ is a large sparse submatrix representing the action of the volume part on itself, $$A_{ss}$$ is a smaller dense submatrix representing the action of the exterior surface on itself, and $$A_{sv}$$ is a sparse submatrix representing the action of the volume part on the exterior surface (see Fig. 1).

2.2. Multi-solve and multi-factorization algorithms

Various approaches may be employed to solve such coupled sparse/dense systems. In this work, we seek to solve (\eqref{org6865ac1}) using a direct method. In a sparse context, direct methods are known to potentially consume a lot of memory due to a phenomenon referred to as fill-in duff2017direct (zeros of the original matrix $$A$$ become non-zeros in the factorized matrix). On the other hand, when they fit in memory, they are extremely robust from a numerical point of view and are a must-have in an industrial solution, where they are commonly employed to solve moderate to relatively large problems.

In the present study, we consider the multi-solve and multi-factorization algorithms proposed in RRIPDPS (see also references therein for a discussion of other approaches in the literature). Their common principle consists in composing existing parallel sparse and dense methods on well-chosen submatrices. Their main strength is to rely on state-of-the-art sparse and dense direct solvers and to exploit their most advanced features, such as compression techniques strumpack,amestoy2015improving,PastixLR,falco2017compas, in an effort to lower the memory footprint and potentially reduce the computation time so as to process larger problems. In this section, we present the main algorithmic steps of both these methods. The objective is neither to motivate them nor to describe them in detail (we refer the reader to RRIPDPS for that) but to provide a high-level view of the steps and their nature (such as whether they involve dense, sparse or compressed computation). Both methods must assemble the following dense matrix $$S = A_{ss} - A_{sv}A_{vv}^{-1}A_{sv}^T$$ associated with the $$A_{ss}$$ block and referred to as the Schur complement.
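For reference, the role of $$S$$ can be recovered by eliminating $$x_v$$ block-wise from (\eqref{org6865ac1}). The derivation below is the standard block elimination, written here under the assumption that $$A_{vv}$$ is invertible; it is a reminder rather than a description of how the solvers proceed internally:

```latex
\begin{aligned}
A_{vv} x_v + A_{sv}^T x_s &= b_v
  \;\Rightarrow\; x_v = A_{vv}^{-1}\left(b_v - A_{sv}^T x_s\right), \\
A_{sv} x_v + A_{ss} x_s &= b_s
  \;\Rightarrow\; \underbrace{\left(A_{ss} - A_{sv} A_{vv}^{-1} A_{sv}^T\right)}_{S}\, x_s
  = b_s - A_{sv} A_{vv}^{-1} b_v .
\end{aligned}
```

Once $$S$$ is assembled and factorized, $$x_s$$ follows from the second line and $$x_v$$ from the first.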

2.2.1. Multi-solve algorithm.

Most sparse direct solvers do not provide an API to handle coupled sparse/dense systems and can process exclusively sparse systems. The multi-solve approach accommodates this constraint by delegating only the $$A_{vv}$$ block to the sparse direct solver. Using the latter, the $$A_{vv}$$ block is factorized through a so-called sparse factorization. The $$A_{ss}$$ block is handled by the dense direct solver. Because this block may not fully fit in memory, it is split into multiple vertical slices (see Fig. 3) which are assembled one by one, all the processing units tackling the same slice $$i$$ at the same time. To compute such a slice $$S_i$$ of $$S$$, a slice $$A_{{sv}_i}^T$$ of $$A_{sv}^T$$ is first processed through a sparse solve step of the sparse direct solver, yielding a dense temporary slice $$Y_i$$. The latter is multiplied by the sparse $$A_{sv}$$ block. Then, we perform a final assembly ($$A_{{ss}_i} - A_{sv}Y_i$$) to produce the dense $$S_i$$ slice.
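The steps above can be sketched as follows. This is a minimal illustration and not the MUMPS/SPIDO implementation: SciPy's `splu` stands in for the sparse direct solver, NumPy arrays stand in for the dense solver, and the function name `multi_solve_schur` as well as the parameter name `n_c` (slice width) are chosen here for illustration only.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu


def multi_solve_schur(A_vv, A_sv, A_ss, n_c):
    """Assemble S = A_ss - A_sv A_vv^{-1} A_sv^T by vertical slices of n_c columns."""
    lu = splu(sp.csc_matrix(A_vv))        # sparse factorization, performed once
    A_sv = sp.csr_matrix(A_sv)            # sparse coupling block (n_s x n_v)
    S = np.array(A_ss, dtype=float)       # dense block, updated slice by slice
    n_s = S.shape[0]
    for start in range(0, n_s, n_c):
        cols = slice(start, min(start + n_c, n_s))
        rhs = A_sv[cols, :].T.toarray()   # slice of A_sv^T, as dense right-hand sides
        Y_i = lu.solve(rhs)               # sparse solve: Y_i = A_vv^{-1} A_sv_i^T
        S[:, cols] -= A_sv @ Y_i          # sparse-dense product and final assembly
    return S
```

The factorization of $$A_{vv}$$ is reused by every slice, so only the sparse solve, the product and the assembly are repeated in the loop.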

In the baseline multi-solve case, the block $$S_i$$ is kept dense. Conversely, in the compressed Schur multi-solve variant, it is compressed (through hierarchical low-rank techniques). Note that $$A_{{ss}_i}$$ is initially compressed, so the final assembly implies a recompression of the block at each iteration of the loop on $$i$$. This is why this variant allows for computing multiple (typically 4 in the experiments below) slices $$S_i$$ before compressing and assembling them (see Fig. 4).

2.2.2. Multi-factorization algorithm.

The multi-factorization algorithm is based on a more advanced usage of sparse direct methods, which consists in also delegating the management of the dense $$A_{ss}$$ block to the sparse direct solver. Supported by only a few fully-featured sparse direct solvers, this functionality (referred to as Schur) has the advantage of efficiently handling off-diagonal blocks thanks to the advanced combinatorial (such as management of the fill-in), numerical (such as low-rank compression) and computational (such as level-3 BLAS usage) features of modern sparse direct solvers when processing the off-diagonal $$A_{sv}^T$$ and $$A_{sv}$$ sparse-dense coupling parts (see RRIPDPS for more details). In the baseline multi-factorization algorithm, the Schur complement $$S$$ is no longer computed by vertical slices but tile-wise. Computing a tile $$S_{ij}$$ (see Fig. 5) amounts to forming a temporary non-symmetric (except when $$i = j$$) submatrix $$W$$ from $$A_{vv}$$, $$A_{{sv}_i}$$ and $$A^T_{{sv}_j}$$ and calling a sparse factorization+Schur step on $$W$$, relying on the Schur feature of the sparse direct solver. This call returns the Schur complement block $$X_{ij} = - A_{{sv}_i}(L_{vv}U_{vv})^{-1}A^T_{{sv}_j}$$ associated with $$W$$. To determine $$S_{ij}$$, we perform a final assembly ($$A_{{ss}_{ij}} + X_{ij}$$). Due to a current limitation in the API of modern sparse direct solvers (see the extended discussion in RRIPDPS), the sparse factorization+Schur step involving $$W$$ implies a re-factorization of $$A_{vv}$$ in $$W$$ at each iteration, although $$A_{vv}$$ does not change during the computation. Furthermore, the API of modern fully-featured sparse direct solvers (see once again the discussion in RRIPDPS) only allows retrieving the Schur complement itself as a non-compressed dense matrix, even if compression occurs within the rest of the processing.
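The tile-wise computation can be sketched as below. This is a dense NumPy stand-in written for illustration, not the MUMPS Schur API: the helper `schur_of_w` is hypothetical and simply eliminates the $$A_{vv}$$ block of $$W$$, and the sketch assumes that only one triangle of the symmetric $$S$$ is computed and then mirrored.

```python
import numpy as np


def schur_of_w(A_vv, A_sv_i, A_sv_j):
    """Hypothetical stand-in for the 'sparse factorization+Schur' call on W.

    W = [[A_vv, A_sv_j^T], [A_sv_i, 0]]; eliminating the A_vv block gives
    X_ij = -A_sv_i A_vv^{-1} A_sv_j^T, returned as a dense array.
    """
    n_v = A_vv.shape[0]
    zero = np.zeros((A_sv_i.shape[0], A_sv_j.shape[0]))
    W = np.block([[A_vv, A_sv_j.T], [A_sv_i, zero]])
    # A_vv is factorized again on every call, mirroring the API limitation.
    return W[n_v:, n_v:] - W[n_v:, :n_v] @ np.linalg.solve(W[:n_v, :n_v], W[:n_v, n_v:])


def multi_factorization_schur(A_vv, A_sv, A_ss, n_b):
    """Assemble S tile by tile with n_b block rows and columns."""
    S = np.array(A_ss, dtype=float)
    blocks = np.array_split(np.arange(S.shape[0]), n_b)
    for i, rows in enumerate(blocks):
        for j, cols in enumerate(blocks[: i + 1]):    # one triangle only (assumption)
            X_ij = schur_of_w(A_vv, A_sv[rows], A_sv[cols])
            S[np.ix_(rows, cols)] = A_ss[np.ix_(rows, cols)] + X_ij   # final assembly
            if i != j:
                S[np.ix_(cols, rows)] = S[np.ix_(rows, cols)].T       # mirror by symmetry
    return S
```

Unlike the multi-solve sketch, each tile pays for a new factorization of the $$A_{vv}$$ part of $$W$$, which is why the number of tiles matters for the execution time.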

In the compressed Schur multi-factorization variant (see Fig. 6), we compress the $$X_{ij}$$ Schur block into a temporary compressed matrix as soon as the sparse solver returns it. Hence, the final assembly step becomes a compressed assembly $$A_{{ss}_{ij}} \leftarrow A_{{ss}_{ij}} + {\tt Compress}(X_{ij})$$. As in the case of compressed Schur multi-solve, this operation implies a recompression of the initially compressed $$A_{{ss}_{ij}}$$.
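To give an intuition of why the compressed assembly implies a recompression, the sketch below uses a single global low-rank block compressed by truncated SVD. This is a deliberate simplification and an assumption of this example: an actual $$\mathcal{H}$$-matrix solver works on a hierarchy of blocks and does not form the dense sum. The names `compress` and `compressed_add` and the relative threshold `eps` are illustrative.

```python
import numpy as np


def compress(X, eps=1e-3):
    """Truncated SVD: keep singular values above eps times the largest one.

    Returns factors (U, V) such that X is approximately U @ V.
    """
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    rank = max(1, int(np.sum(s > eps * s[0])))
    return U[:, :rank] * s[:rank], Vt[:rank]


def compressed_add(A_factors, X, eps=1e-3):
    """A <- A + Compress(X): stack the low-rank factors, then recompress the sum."""
    U_a, V_a = A_factors
    U_x, V_x = compress(X, eps)
    U = np.hstack([U_a, U_x])      # the rank of the stacked representation grows...
    V = np.vstack([V_a, V_x])
    return compress(U @ V, eps)    # ...so the sum is recompressed (dense here for brevity)
```

Adding two compressed blocks concatenates their factors, so without the final recompression the rank, and hence the storage, would grow with every assembly.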

2.3. Related work on studies of the power consumption of numerical algorithms

While many studies like 10.1007/s00450-014-0267-7,9092332 analyze the power and energy consumption of various applications on different architectures, fewer studies focus on dense or sparse solvers. doi:10.1177/1094342018792079 presents the energy consumption of OpenMP runtime systems on three dense linear algebra kernels. 10.1002/cpe.3341 and doi:10.1177/1094342016672081 focus on sparse solvers. While the former studies the behavior of the Conjugate Gradient method on different CPU-only architectures, the latter focuses on sparse solvers on heterogeneous architectures. In 10.1145/2712386.2712387, the authors present an energy and performance study of state-of-the-art sparse matrix solvers on GPUs. Note that many studies have been conducted regarding the improvement of the energy consumption of sparse or dense linear algebra algorithms https://doi.org/10.1002/cpe.4460,8370686. In this work, we focus on the analysis of an application coupling both sparse and dense techniques.

3. Experimental setup

We conducted an energy consumption study of the previously discussed algorithms for solving large coupled sparse/dense FEM/BEM linear systems such as defined in (\eqref{org6865ac1}). The multi-solve and multi-factorization algorithms (see Section 2.2) are implemented on top of the coupling of the sparse direct solver MUMPS amestoy1998mumps with either the proprietary ScaLAPACK-like dense direct solver SPIDO (for the baseline variants) or the hierarchical low-rank $$\mathcal{H}$$-matrix compressed solver HMAT lize2014 (for the compressed variants). In the following, we thus refer to these baseline and compressed couplings as MUMPS/SPIDO and MUMPS/HMAT, respectively. For the experiments, we used a short pipe test case (see Fig. 7) yielding linear systems close enough to those arising from real-life models (see Fig. 2) while relying on a reproducible example (https://gitlab.inria.fr/solverstack/test_fembem) available to the scientific community. MUMPS and HMAT both provide low-rank compression and expose a precision parameter $$\epsilon$$, which we set to $$10^{-3}$$. Low-rank compression in the sparse solver MUMPS is enabled for all the benchmarks presented in this paper. The solver test suite is compiled with GNU C Compiler (gcc) 9.4.0, OpenMPI 4.1.1, Intel(R) MKL library 2019.1.144, and MUMPS 5.2.1.

Our experiments were carried out on the PlaFRIM platform (https://plafrim.fr/), where we used the miriel computing nodes, each equipped with 2 $$\times$$ 12-core Haswell Intel(R) Xeon(R) E5-2680 v3 @ 2.5 GHz processors with a Thermal Design Power (TDP) of 120 W, 128 GiB (5.3 GiB/core) of RAM @ 2933 MT/s, and OmniPath 100 Gbit/s and 10 Gbit/s Ethernet network links. Note that TDP represents the average power, in watts, the processor dissipates when operating at base frequency with all cores active under an Intel-defined, high-complexity workload. Finally, Turbo-Boost and Hyperthreading are disabled on all the nodes in order to ensure the reproducibility of the experiments.

We measure the energy consumption of our application with energy_scope, a software package dedicated to performing energy measurements and identifying energy profiles of HPC codes ESjcad2021. It consists of an acquisition module, which also delivers energy statistics and runs on the cluster, and a post-processing and data analysis module running on a dedicated server. Measurements are performed at a user-defined frequency on both the processor and the RAM. We also monitor RAM usage and flop rate. To measure RAM usage, we use a Python script relying on the /proc/[PID]/statm file (the resident size field). Finally, for the measurement of the flop rate, we use the likwid likwid software tool. All the data have been acquired at a 1 Hz frequency. We assessed the potential overhead of these three software probes on our application by running the tests with and without them. Results show an overhead of at most 5.7%, and under 5% in most runs (see Table 1).
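The RAM probe described above can be sketched as follows. This is not the script used in the study but a minimal illustration under stated assumptions: it relies on the documented layout of the statm file (the second field is the resident set size, in pages), it is Linux-only, and the function names are chosen here for illustration.

```python
import os
import time


def parse_statm_resident(statm_text, page_size):
    """Return the resident set size in bytes from the content of a statm file."""
    fields = statm_text.split()
    return int(fields[1]) * page_size    # second field: resident pages


def sample_rss(pid, n_samples, period_s=1.0):
    """Collect (absolute timestamp, resident bytes) pairs at a fixed period."""
    page_size = os.sysconf("SC_PAGE_SIZE")
    samples = []
    for _ in range(n_samples):
        with open(f"/proc/{pid}/statm") as f:
            samples.append((time.time(), parse_statm_resident(f.read(), page_size)))
        time.sleep(period_s)             # 1 Hz by default, as in the experiments
    return samples
```

For instance, `sample_rss(os.getpid(), 3)` would record three samples of the calling process, one per second.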

4. Study of the multi-solve algorithm in shared-memory

First, we study the evolution of power consumption (in Watts) with respect to execution time of the multi-solve algorithm (see Section 2.2) on a single computational node. We consider both the baseline MUMPS/SPIDO and the further compressed MUMPS/HMAT couplings, assessed on a coupled FEM/BEM linear system of 1,000,000 unknowns in total.

Regarding the baseline multi-solve algorithm relying on the MUMPS/SPIDO coupling, we set the size $$n_c$$ of the $$A_{{sv}_i}^T$$ and $$S_i$$ slices to 256 columns. For the compressed Schur multi-solve relying on the MUMPS/HMAT coupling, the size of the $$S_i$$ slices and the size of the $$A_{{sv}_i}^T$$ slices are handled by two different parameters, $$n_S$$ and $$n_c$$ respectively. In this case, we set $$n_c$$ to 256 and $$n_S$$ to 1,024 columns so that compression is delayed until four slices involving $$A_{{sv}_i}^T$$ are completed. The choice of these values is motivated by our previous study of the multi-solve algorithm in RRIPDPS. In Fig. 8, for each solver coupling, the two plots at the top show the evolution of the CPU and RAM power consumption as well as the flop rate for the multi-solve algorithm. The two other plots, at the bottom of the figure, show the corresponding evolution of RAM usage. The labels mark the beginning (alpha) and the end (omega) of the most important computation phases.
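The interplay between $$n_c$$ and $$n_S$$ can be illustrated with the small scheduling sketch below. It only enumerates column ranges and performs no linear algebra; the function name and the argument `n_s` (number of columns of $$S$$) are hypothetical.

```python
def compression_schedule(n_s, n_c=256, n_S=1024):
    """Yield (solve_slices, compress_range) for each compression step.

    solve_slices lists the column ranges [start, end) handled by individual
    sparse solves; compress_range is the wider range compressed afterwards.
    """
    for big_start in range(0, n_s, n_S):
        big_end = min(big_start + n_S, n_s)
        solves = [(s, min(s + n_c, big_end)) for s in range(big_start, big_end, n_c)]
        yield solves, (big_start, big_end)
```

With the values used here, each compression step follows $$n_S / n_c = 4$$ sparse solves, so the cost of recompression is paid once every four slices.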

In the case of multi-solve, the Schur complement computation dominates the execution time as well as the RAM usage. From the point of view of the computational intensity, illustrated by the flop rate, it is the opposite: this phase is not the most computationally intensive one. The factorization phase of the dense Schur complement matrix is very short for the MUMPS/HMAT coupling thanks to the usage of low-rank compression. For the MUMPS/SPIDO coupling, however, this phase consists of a dense factorization, which is a computationally intensive operation. Indeed, in this case, we reach the flop rate peak.

Fig. 9 is a zoom of Fig. 8 between the execution times 498 and 525 s, i.e. within the Schur complement computation phase. In this figure, we can observe cycles of high and low power consumption and flop rate. As for the RAM usage, the cycles are also present but less noticeable. Based on the labels, we can see that these cycles are almost entirely due to the sparse solve operation involved in the computation of each of the Schur complement blocks $$S_i$$ (see Section 2.2).

Fig. 10 finally compares the total energy consumption (in Joules), the total execution time and the peak RAM usage, once again of both variants (baseline MUMPS/SPIDO and further compressed MUMPS/HMAT) of the multi-solve algorithm. We consider three coupled FEM/BEM systems of a total of 1,000,000, 3,000,000 and 5,000,000 unknowns, respectively. The results confirm, as one may expect, that the energy consumption, the execution time as well as the peak RAM usage rise with increasing size of the linear system. Moreover, the results show that the compressed MUMPS/HMAT variant (the one that also performs compression within the Schur) consumes less energy, in addition to being faster. This is an interesting result from an industrial point of view, further motivating the usage of low-rank compression techniques.
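As a reminder of how a total energy (in Joules) relates to the power traces (in Watts) discussed above, the sketch below integrates sampled power over time with the trapezoidal rule. How energy_scope derives its totals internally is not described here, so this is only an illustrative assumption; the function name is hypothetical.

```python
def energy_joules(timestamps_s, power_w):
    """Integrate power samples (W) over time (s) with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have the same length")
    total = 0.0
    for k in range(1, len(power_w)):
        dt = timestamps_s[k] - timestamps_s[k - 1]
        total += 0.5 * (power_w[k] + power_w[k - 1]) * dt
    return total
```

A constant 100 W draw sampled at 1 Hz over 3 s gives 300 J, which shows why a variant that is both faster and no more power-hungry consumes less energy.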

5. Study of the multi-factorization algorithm in shared-memory

We now consider the multi-factorization algorithm. For the 1,000,000 unknowns test case (Fig. 11), we have set the number of Schur complement block rows and columns $$n_b$$ to 3 for both the baseline and the further compressed solver couplings. As for multi-solve, in the case of multi-factorization, the execution time is dominated by the Schur complement computation phase. With $$n_b = 3$$, we have a total of 6 Schur complement blocks $$S_{ij}$$ to compute. We can identify the moment of computation of each $$S_{ij}$$ thanks to the apparent cycles of high and low power consumption and flop rate, and especially of RAM usage. Here, the Schur complement computation phase again consumes most of the RAM but is not the most computationally intensive part of the algorithm. The peak power consumption and flop rates are reached during the dense factorization of $$S$$ in the case of the baseline MUMPS/SPIDO coupling.
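The count of 6 blocks is consistent with computing only one triangle (diagonal included) of the symmetric $$S$$, which is the assumption made in the following worked count:

```latex
\#\{S_{ij} : 1 \le j \le i \le n_b\} = \frac{n_b (n_b + 1)}{2} = \frac{3 \times 4}{2} = 6 .
```

Computing all tiles without exploiting symmetry would instead require $$n_b^2 = 9$$ sparse factorization+Schur calls.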

Fig. 12 compares the total energy consumption (in Joules), the total execution time and the peak RAM usage of the multi-factorization algorithm with problems of 1,000,000, 1,500,000 and 2,000,000 total unknowns. We can draw the same conclusion as in the case of multi-solve.

6. Multi-node study

We now consider a platform composed of four computational nodes and assess a coupled system of 2,000,000 total unknowns.

6.1. Multi-solve algorithm

Fig. 13 shows the processor and RAM power consumption over all four monitored nodes for both multi-solve variants. The results show that the power consumption in a distributed parallel test case evolves similarly to the single node test case (see Section 4). Similarly, the RAM usage follows the pattern of the single node test case.

6.2. Multi-factorization algorithm

Fig. 14 shows the behavior of the multi-factorization algorithm. The evolution of RAM usage of MUMPS/HMAT follows the pattern of the single-node test case (see Section 5). Regarding the power consumption, the high and low cycles corresponding to the computation of the different Schur complement blocks differ considerably from the single-node test case. At the beginning of each cycle, there is a peak, but the consumption drops long before the end of the sparse factorization+Schur step. This may indicate an under-optimized usage of this routine in a parallel distributed environment. Interestingly, this study thus allowed us to identify a potential bottleneck in our usage of the distributed-memory parallelization of a key component of our solver stack, which we had not identified in the performance-only studies we conducted previously. This shows the value of such multi-metric profiles beyond the assessment of the overall energy consumption.

7. Conclusion and perspectives

We have studied the energy profile of a complex HPC application involving dense, sparse, and compressed operations. The study confirmed that the further compressed algorithms (MUMPS/HMAT) are also worthwhile from an energy perspective. The profiles of the processor and memory power together with the memory usage and flop rate gave us a more comprehensive understanding of the behavior of the application, to the point that we identified a potential improvement in our usage of the Schur functionality of the sparse direct solver.

Acknowledgements

This work was supported by the 'Projet Région Nouvelle-Aquitaine 2018-1R50119 « HPC scalable ecosystem »', the European High-Performance Computing Joint Undertaking EuroHPC under grant agreement №955495 (MICROCARD) co-funded by the Horizon 2020 program of the European Union (EU), the French National Research Agency ANR, the German Federal Ministry of Education and Research, the Italian ministry of economic development, the Swiss State Secretariat for Education, Research and Innovation, the Austrian Research Promotion Agency FFG, and the Research Council of Norway.

Bibliography

• [RRIPDPS] Agullo, Felšöci & Sylvand, Direct solution of larger coupled sparse/dense linear systems using low-rank compression on single-node multi-core machines in an industrial context, Inria Bordeaux Sud-Ouest, (2022).
• [RaviartThomas1977] Raviart & Thomas, A mixed finite element method for 2-nd order elliptic problems, 292-315, in: Mathematical Aspects of Finite Element Methods, edited by Galligani & Magenes, Lecture Notes in Mathematics, 606, Springer Berlin Heidelberg (1977). doi:10.1007/BFb0064470.
• [ern2013theory] Ern & Guermond, Theory and practice of finite elements, Springer Science & Business Media (2013).
• [banerjee1981boundary] Banerjee & Butterfield, Boundary element methods in engineering science, McGraw-Hill London (1981).
• [Wang2004413] Wang, Vlahopoulos & Wu, Development of an energy boundary element formulation for computing high-frequency sound radiation from incoherent intensity boundary conditions, Journal of Sound and Vibration, 278(1-2), 413-436 (2004).
• [casenave-phd] Casenave, Méthodes de réduction de modèles appliquées à des problèmes d’aéroacoustique résolus par équations intégrales, PhD thesis, Université Paris-Est (2013).
• [casenave2014coupled] Casenave, Ern & Sylvand, Coupled BEM-FEM for the convected Helmholtz equation with non-uniform flow in a bounded domain, Journal of Computational Physics, 257, 627-644 (2014).
• [Sebaso] Sebaso, Jet engine airflow during take-off, https://commons.wikimedia.org/wiki/File:20140308-Jet_engine_airflow_during_take-off.jpg
• [duff2017direct] Duff, Erisman & Reid, Direct methods for sparse matrices, Oxford University Press (2017).
• [strumpack] Ghysels, Li, Rouet, Williams & Napov, An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling, SIAM Journal on Scientific Computing, 38 (2015).
• [amestoy2015improving] Amestoy, Ashcraft, Boiteau, Buttari, L'Excellent & Weisbecker, Improving multifrontal methods by means of block low-rank representations, SIAM Journal on Scientific Computing, 37(3), A1451-A1474 (2015).
• [PastixLR] Pichon, Darve, Faverge, Ramet & Roman, Sparse supernodal solver using block low-rank compression: Design, performance and analysis, Journal of Computational Science, 27, 255-270 (2018).
• [falco2017compas] Agullo, Falco, Giraud & Sylvand, Vers une factorisation symbolique hiérarchique de rang faible pour des matrices creuses, in: Conférence d'informatique en Parallélisme, Architecture et Système (ComPAS'17) (2017).
• [10.1007/s00450-014-0267-7] Charles, Sawyer, Dolz & Catalán, Evaluating the Performance and Energy Efficiency of the COSMO-ART Model System, Comput. Sci., 30(2), 177–186 (2015).
• [9092332] Solis-Vasquez, Santos-Martins, Koch & Forli, Evaluating the Energy Efficiency of OpenCL-accelerated AutoDock Molecular Docking, 162-166, in: 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP) (2020).
• [doi:10.1177/1094342018792079] Lima, Raïs, Lefèvre & Gautier, Performance and energy analysis of OpenMP runtime systems with dense linear algebra algorithms, The International Journal of High Performance Computing Applications, 33(3), 431-443 (2019).
• [10.1002/cpe.3341] Aliaga, Anzt, Castillo, Fernández, León, Pérez & Quintana-Ortí, Unveiling the Performance-Energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors, Concurrency and Computation: Practice and Experience, 27(4), 885–904 (2015).
• [doi:10.1177/1094342016672081] Anzt, Tomov & Dongarra, On the performance and energy efficiency of sparse linear algebra on GPUs, The International Journal of High Performance Computing Applications, 31(5), 375-390 (2017).
• [10.1145/2712386.2712387] Anzt, Tomov & Dongarra, Energy Efficiency and Performance Frontiers for Sparse Computations on GPU Supercomputers, 1–10, in: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, Association for Computing Machinery (2015).
• [https://doi.org/10.1002/cpe.4460] Anzt, Dongarra, Flegar, Higham & Quintana-Ortí, Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers, Concurrency and Computation: Practice and Experience, 31(6), e4460 (2019).
• [8370686] Abdelfattah, Haidar, Tomov & Dongarra, Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs, IEEE Transactions on Parallel and Distributed Systems, 29(12), 2700-2712 (2018).
• [amestoy1998mumps] Amestoy, Duff & L'Excellent, MUMPS multifrontal massively parallel solver version 2.0 (1998).
• [lize2014] Lizé, Résolution Directe Rapide pour les Éléments Finis de Frontière en Électromagnétisme et Acoustique : $$\mathcal{H}$$-Matrices. Parallélisme et Applications Industrielles, PhD thesis, Université Paris 13 (2014).
• [ESjcad2021] Mathieu, Energy Scope: a tool for measuring the energy profile of HPC and AI applications (2021). https://jcad2021.sciencesconf.org/data/Herve_Mathieu_energy_scope.pdf
• [likwid] Gruber, Eitzinger, Hager & Wellein, LIKWID (December 2021). https://doi.org/10.5281/zenodo.5752537. Note: this research has been partially funded by grants BMBF 01IH13009 and BMBF 01IH16012C.
• [ESstatm] proc/[pid]/statm man page, https://man7.org/linux/man-pages/man5/proc.5.html

8. Appendix

8.3. Energy scope

A more detailed description can be found on the energy_scope website https://sed-bso.gitlabpages.inria.fr/datacenter/energy_scope.html

The same website explains how to get energy_scope. The most convenient way is to ask the system administrator of the cluster to install energy_scope as a module (environment modules).

The basic use of energy_scope is to measure the energy consumption of an HPC application or of sub-parts of an HPC application. You can then compare the energy consumption of different algorithms or of runs on different hardware architectures. In speedup experiments, you can also add the energy consumption to quantify the energy cost of increasing the number of nodes. To go beyond the global consumption, you can move on to the energy profile of the application, to discover and explore its energy behavior: when the energy efficiency is at its best, and when it could be further optimized.

The behavior of energy_scope is mainly configured through the sampling frequency of the acquired data and the type of data (energy, core temperature).

For measuring the overhead induced by energy_scope on the application we used the following configurations:

• 2 Hz sampling and energy+core temperature: this configuration gives a high-quality energy image of the application. The overhead may be tangible depending on the code.
• 1 Hz sampling and energy only: this configuration yields less data, but it is sufficient for the majority of the applications described in this paper. The overhead is then minimal.

For the measurements shown in the other sections we use the second configuration.

CPU memory dump. We use the information available in /proc/[PID]/statm ESstatm. Every second, we read the resident value from this file together with a timestamp (absolute date).

Data are synchronized by construction because each sample carries the absolute date of its acquisition.
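Because every sample carries an absolute timestamp, traces coming from different probes can be merged after the fact. The sketch below pairs each power sample with the closest RAM sample; it is a minimal illustration with hypothetical names and sample layouts, not the post-processing module of energy_scope.

```python
def align_by_timestamp(power_samples, rss_samples):
    """Pair each (t, watts) sample with the (t, bytes) sample closest in absolute time."""
    aligned = []
    for t_p, watts in power_samples:
        _, rss = min(rss_samples, key=lambda sample: abs(sample[0] - t_p))
        aligned.append((t_p, watts, rss))
    return aligned
```

A nearest-neighbour pairing is enough here because both probes sample at the same 1 Hz rate; probes with different rates would rather call for interpolation.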

Table 1: Overhead of energy_scope (detailed table).

| Solver coupling | Time with energy_scope | Total energy consumption | Time without energy_scope | Delta | Overhead |
|---|---|---|---|---|---|
| MUMPS/HMAT | 1059.22 s | 220335 J | 1033.07 s | 26.19 s | 2.5% |
| MUMPS/HMAT | 1046.52 s | 214192 J | 1028.43 s | 18.04 s | 1.8% |
| MUMPS/HMAT | 1041.74 s | 215934 J | 1002.76 s | 39.05 s | 3.9% |
| MUMPS/SPIDO | 1349.64 s | 280491 J | 1295.93 s | 53.79 s | 4.1% |
| MUMPS/SPIDO | 1366.57 s | 275652 J | 1314.22 s | 52.8 s | 4% |
| MUMPS/SPIDO | 1323.55 s | 272902 J | 1251.67 s | 71.06 s | 5.7% |

Created: 2022-02-25 Fri 18:12
