# Study of the processor and memory power consumption of coupled sparse/dense solvers

## 1. Introduction

Aeroacoustics is the study of the coupling between acoustic phenomena and fluid mechanics. In the aeronautical industry, this discipline is used to model the propagation of acoustic waves in the air flows enveloping an aircraft in flight. In particular, it allows one to simulate the noise produced at ground level by an aircraft during the takeoff and landing phases. Beyond its technical importance for validating that regulatory environmental standards are met, it is also both a societal challenge (reducing noise to alleviate its impact on health) and an economic one (designing future planes that can be authorized to access suburban airports).

The approach adopted in this study consists in considering the airflow around the aircraft as uniform in almost all the space (in a zone modeled by the Boundary Element Method, or BEM) except for the jet of the engines, where the flow is considered as non-uniform (in a zone modeled by the volume Finite Element Method, or FEM). The resulting linear system couples FEM and BEM, and therefore has both dense parts (from BEM) and sparse parts (from FEM). This coupled linear system, called FEM/BEM, consequently has a particular structure that must be taken into account during its solution. Moreover, in order to produce a physically realistic result, the number of unknowns can be extremely large, which makes the treatment of this system a computational challenge. In a previous study RRIPDPS, we presented two classes of algorithms, namely the multi-solve and multi-factorization algorithms, for solving such systems. They rely on numerical compression techniques, which allows aircraft designers to process relatively large industrial test cases on a shared-memory workstation.

Until now, the dimensioning variables for such a computation were the computation time, the RAM consumption and the disk usage. The study of these quantities allowed us to understand the behavior of the software and to push its limits in order to handle larger cases. The growing consideration of carbon footprint issues in industry in general, and in computing centers in particular, leads the research community and the industry to consider another physical dimension: the power consumption of computations. The objective of this paper is to study the power consumption of the solution of a coupled FEM/BEM linear system, and to assess how the energy consumption of the processor and the memory varies with the computation time, the amount of memory used and the multiple algorithmic choices possible at the solver level. We consider both shared-memory multicore machines and small clusters of such nodes, typically used for relatively large problems in the aeroacoustics industry.

The rest of the paper is organized as follows. In Section 2, we further introduce the coupled FEM/BEM systems arising in aeroacoustics and then present the multi-solve and multi-factorization algorithms from RRIPDPS we rely on to solve them. We also discuss related work on the analysis of power consumption in the context of numerical simulation in general and numerical linear algebra in particular. The hardware, software and instrumentation setup we employ for the present study is introduced in Section 3. The power consumption of the multi-solve and multi-factorization algorithms on a shared-memory node is then analyzed in Sections 4 and 5, respectively, before being extended to a multi-node context in Section 6. We conclude in Section 7.

## 2. Background

### 2.1. Coupled FEM/BEM systems arising in aeroacoustics

We are interested in the solution of very large linear systems of equations \(Ax = b\) with the particularity of having both sparse and dense parts. Such systems appear in an industrial context when we couple two types of finite element methods, namely the volume Finite Element Method (FEM) RaviartThomas1977,ern2013theory and the Boundary Element Method (BEM) banerjee1981boundary,Wang2004413. This coupling is used to simulate the propagation of acoustic waves around aircraft (see Fig. 2, left). In the jet flow created by the engines, the propagation medium (the air) is highly heterogeneous in terms of temperature, density, flow, etc. Hence we need a FEM approach to compute the acoustic wave propagation in it. Elsewhere, we approximate the medium as homogeneous and the flow as uniform, and use BEM to compute the wave propagation. This leads casenave-phd,casenave2014coupled to a coupled sparse/dense FEM/BEM linear system (\eqref{org6865ac1}) with two groups of unknowns: \(x_v\) related to a FEM volume mesh \colorbox{green}{$v$} of the jet flow and \(x_s\) related to a BEM surface mesh \colorbox{red}{$s$} covering the surface of the aircraft as well as the outer surface of the volume mesh (see Fig. 2, right). The linear system \(Ax = b\) to be solved may be more finely written as:
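In block form, splitting the right-hand side \(b\) accordingly into \(b_v\) and \(b_s\), the system reads (reconstructed here to be consistent with the block partitioning described around Fig. 1):

```latex
\begin{equation}
\label{org6865ac1}
\underbrace{\begin{pmatrix} A_{vv} & A_{sv}^{T} \\ A_{sv} & A_{ss} \end{pmatrix}}_{A}
\underbrace{\begin{pmatrix} x_{v} \\ x_{s} \end{pmatrix}}_{x}
=
\underbrace{\begin{pmatrix} b_{v} \\ b_{s} \end{pmatrix}}_{b}
\end{equation}
```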

Figure 1: Partitioning and sparsity of \(A\) in (\eqref{org6865ac1}).

In (\eqref{org6865ac1}), \(A\) is a 2 \(\times\) 2 block symmetric coefficient matrix, where \(A_{vv}\) is a large sparse submatrix representing the action of the volume part on itself, \(A_{ss}\) is a smaller dense submatrix representing the action of the exterior surface on itself, \(A_{sv}\) is a sparse submatrix representing the action of the volume part on the exterior surface (see Fig. 1).

Figure 2: An acoustic wave (blue arrow) emitted by the aircraft's engine, reflected on the wing and crossing the jet flow. Real-life case (left) Sebaso and a numerical model example (right). The red surface mesh covers the aircraft's surface as well as the outer surface of the green volume mesh of the jet flow.

### 2.2. Multi-solve and multi-factorization algorithms

Various approaches may be employed to solve such coupled sparse/dense systems.
In this work, we seek to solve (\eqref{org6865ac1}) using a direct method.
In a sparse context, direct methods are known to possibly consume a lot of
memory due to a phenomenon referred to as *fill-in* duff2017direct (zeros
of the original matrix \(A\) become non-zeros in the factorized matrix). On the
other hand, when they fit in memory, they are extremely robust from a numerical
point of view and represent a must-have in an industrial solution, where they
are commonly employed to solve moderate to relatively large problems.

In the present study, we consider the multi-solve and multi-factorization algorithms proposed in RRIPDPS (see also references therein for a discussion of other approaches from the literature). Their common principle consists in composing existing parallel sparse and dense methods on well-chosen submatrices. Their main strength is to rely on state-of-the-art sparse and dense direct solvers and exploit their most advanced features, such as compression techniques strumpack,amestoy2015improving,PastixLR,falco2017compas, in an effort to lower the memory footprint and potentially reduce the computation time so as to process larger problems. In this section, we present the main algorithmic steps of both these methods. The objective is neither to motivate them nor to describe them in detail (we refer the reader to RRIPDPS for that) but to provide a high-level view of the steps and their nature (such as whether they involve dense, sparse or compressed computation). Both methods must assemble the dense matrix \(S = A_{ss} - A_{sv}A_{vv}^{-1}A_{sv}^T\), associated with the \(A_{ss}\) block and referred to as the Schur complement.

#### 2.2.1. Multi-solve algorithm.

Most sparse direct solvers do not provide an API to handle coupled sparse/dense
systems and can only process purely sparse systems. The multi-solve approach
accommodates this constraint by delegating only the \(A_{vv}\) block to the
sparse direct solver. Using the latter, the \(A_{vv}\) block is factorized through
a so-called *sparse factorization*. The \(A_{ss}\) block is handled by the dense
direct solver. Because this block may not fully fit in memory, it is split into
multiple vertical slices (see Fig. 3) which are assembled one by
one, all the processing units tackling the same slice \(i\) at the same time. To
compute such a slice \(S_i\) of \(A_{ss}\), a slice \(A_{{sv}_i}\) is first processed
through a *sparse solve* step of the sparse direct solver, yielding a dense
temporary slice \(Y_i\). The latter is multiplied by the sparse \(A_{sv}\) block.
Then, we perform a final assembly (\(A_{{ss}_i} - A_{sv}Y_i\)) to produce the
dense \(S_i\) slice.
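The slice loop can be sketched with SciPy on a small random system. This is a sequential emulation, not the actual MUMPS/SPIDO implementation; the sizes, densities and the slice width `n_c` are hypothetical:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Hypothetical small problem; the real A_vv has millions of unknowns.
n_v, n_s, n_c = 60, 12, 4  # volume unknowns, surface unknowns, slice width
rng = np.random.default_rng(42)
A_vv = sp.csc_matrix(sp.random(n_v, n_v, density=0.1, random_state=2)
                     + sp.eye(n_v) * n_v)          # well-conditioned sparse block
A_sv = sp.csc_matrix(sp.random(n_s, n_v, density=0.2, random_state=3))
A_ss = rng.standard_normal((n_s, n_s))             # dense BEM block

lu = spla.splu(A_vv)                               # sparse factorization (done once)
S = np.empty_like(A_ss)
for i in range(0, n_s, n_c):                       # loop over vertical slices
    A_sv_i_T = A_sv[i:i + n_c, :].toarray().T      # slice A_sv_i^T as dense RHS
    Y_i = lu.solve(A_sv_i_T)                       # sparse solve -> dense temporary Y_i
    # Final assembly of slice S_i = A_ss_i - A_sv * Y_i
    S[:, i:i + n_c] = A_ss[:, i:i + n_c] - A_sv @ Y_i
```

Each iteration only materializes one dense slice `Y_i`, which is what keeps the memory footprint of the Schur computation bounded.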

Figure 3: *baseline multi-solve*.

Figure 4: *compressed Schur multi-solve*.

In the *baseline multi-solve* case, the block \(S_i\) is kept dense. Conversely, in the
*compressed Schur multi-solve* variant, it is compressed (through hierarchical low-rank
techniques). Note that \(A_{{ss}_i}\) is initially compressed, and assembling a new
slice into it implies a recompression of the block at each iteration of the loop
on \(i\). This is why this variant allows for computing multiple slices \(S_i\)
(typically 4 in the experiments below) before compressing and assembling them
(see Fig. 4).

#### 2.2.2. Multi-factorization algorithm.

The multi-factorization algorithm is based on a more advanced usage of sparse
direct methods, consisting in also delegating the management of the dense
\(A_{ss}\) block to the sparse direct solver. Only supported by a few
fully-featured sparse direct solvers, this functionality (referred to as Schur)
has the advantage of efficiently handling off-diagonal blocks thanks to the
advanced combinatorial (such as management of the fill-in), numerical (such as
low-rank compression) and computational (such as level-3 BLAS usage) features of
modern sparse direct solvers when processing the off-diagonal \(A_{sv}^T\) and
\(A_{sv}\) sparse-dense coupling parts (see RRIPDPS for more details). In the
*baseline multi-factorization* algorithm, the Schur complement \(S\) is no longer
computed by vertical slices but tile-wise. Computing a tile \(S_{ij}\)
(see Fig. 5) amounts to forming a temporary *non-symmetric* (except
when \(i = j\)) submatrix \(W\) from \(A_{vv}\), \(A_{{sv}_i}\) and \(A^T_{{sv}_j}\) and
calling a *sparse factorization+Schur* step on \(W\) relying on the Schur feature of the sparse
direct solver. This call returns the Schur complement block
\(X_{ij} = - A_{{sv}_i}(L_{vv}U_{vv})^{-1}A^T_{{sv}_j}\) associated with \(W\). To
determine \(S_{ij}\), we perform a final assembly (\(A_{{ss}_{ij}} + X_{ij}\)). Due
to a current limitation in the API of modern sparse direct solvers (see extended
discussion in RRIPDPS), the *sparse factorization+Schur* step involving \(W\) implies a
re-factorization of \(A_{vv}\) in \(W\) at each iteration, although \(A_{vv}\)
does not change during the computation. Furthermore, the API of
modern fully-featured sparse direct solvers (see once again the discussion in
RRIPDPS) only allows retrieving the Schur complement itself as a non-compressed
dense matrix, even if compression occurs within the rest of the processing.
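SciPy exposes no Schur feature, so the tile-wise computation can only be emulated: the sketch below forms each \(X_{ij} = -A_{{sv}_i}(L_{vv}U_{vv})^{-1}A^T_{{sv}_j}\) through a sparse LU that is re-factorized at every tile, mirroring the API limitation mentioned above. All sizes are hypothetical; with \(n_b = 3\), the symmetric lower triangle yields 6 tiles:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n_v, n_s, n_b = 60, 12, 3                 # n_b x n_b tiling of the Schur complement
t = n_s // n_b                            # tile size (n_s assumed divisible by n_b)
rng = np.random.default_rng(5)
A_vv = sp.csc_matrix(sp.random(n_v, n_v, density=0.1, random_state=8)
                     + sp.eye(n_v) * n_v)
A_sv = sp.csc_matrix(sp.random(n_s, n_v, density=0.2, random_state=9))
A_ss = rng.standard_normal((n_s, n_s))

S = A_ss.copy()
for i in range(n_b):
    for j in range(i + 1):                # S symmetric: lower-triangular tiles only
        # Emulates one *sparse factorization+Schur* call on W; A_vv is
        # re-factorized at every tile, as imposed by current solver APIs.
        lu = spla.splu(A_vv)
        A_sv_i = A_sv[i * t:(i + 1) * t, :]
        A_sv_j = A_sv[j * t:(j + 1) * t, :]
        X_ij = -(A_sv_i @ lu.solve(A_sv_j.toarray().T))   # Schur block X_ij
        S[i * t:(i + 1) * t, j * t:(j + 1) * t] += X_ij   # final assembly S_ij
```

The repeated `splu` call inside the loop is precisely the redundant work the text attributes to the API limitation; a single factorization would suffice mathematically.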

Figure 5: *baseline multi-factorization*.

Figure 6: *compressed Schur multi-factorization*.

In the *compressed Schur multi-factorization* variant (see Fig. 6), we compress the \(X_{ij}\) Schur
block into a temporary compressed matrix as soon as the sparse solver returns
it. Hence, the final assembly step becomes a compressed assembly \(A_{{ss}_{ij}}
\leftarrow A_{{ss}_{ij}} + {\tt Compress}(X_{ij})\). As in the case of
*compressed Schur multi-solve*, this operation implies a recompression of the initially compressed
\(A_{{ss}_{ij}}\).

### 2.3. Related work on studies of the power consumption of numerical algorithms

While many studies like 10.1007/s00450-014-0267-7,9092332 analyze the power and energy consumption of various applications on different architectures, fewer studies focus on dense or sparse solvers. doi:10.1177/1094342018792079 presents the energy consumption of OpenMP runtime systems on three dense linear algebra kernels. 10.1002/cpe.3341 and doi:10.1177/1094342016672081 focus on sparse solvers. While the former studies the behavior of the Conjugate Gradient method on different CPU-only architectures, the latter focuses on sparse solvers on heterogeneous architectures. In 10.1145/2712386.2712387, the authors present an energy and performance study of state-of-the-art sparse matrix solvers on GPUs. Note that many studies have been conducted regarding the improvement of the energy consumption of sparse or dense linear algebra algorithms https://doi.org/10.1002/cpe.4460,8370686. In this work, we focus on the analysis of an application coupling both sparse and dense techniques.

## 3. Experimental setup

Figure 7: Short pipe test case (length: 2 m, radius: 4 m) with BEM surface mesh in red and FEM volume mesh in green.

We conducted an energy consumption study of the previously discussed algorithms
for solving large coupled sparse/dense FEM/BEM linear systems as defined
in (\eqref{org6865ac1}). The multi-solve and multi-factorization
algorithms (see Section 2.2) are implemented on top of the coupling
of the sparse direct solver MUMPS amestoy1998mumps with either the
proprietary ScaLAPACK-like dense direct solver SPIDO (for the *baseline*
variants) or the hierarchical low-rank \(\mathcal{H}\)-matrix compressed solver
HMAT lize2014 (for the *compressed* variants). In the following,
we thus refer to these *baseline* and *compressed* couplings as MUMPS/SPIDO
and MUMPS/HMAT, respectively. For the experiments, we used a short pipe test
case (see Fig. 7) yielding linear systems close enough to
those arising from real-life models (see Fig. 2) while
relying on a reproducible example
(https://gitlab.inria.fr/solverstack/test_fembem) available to the
scientific community. MUMPS and HMAT both provide low-rank compression and
expose a precision parameter \(\epsilon\), set to \(10^{-3}\) here. Low-rank
compression in the sparse solver MUMPS is enabled for all the benchmarks
presented in this paper. The solver test suite is compiled with the GNU C
Compiler (gcc) 9.4.0, OpenMPI 4.1.1, the Intel(R) MKL library 2019.1.144, and
MUMPS 5.2.1.

Our experiments were carried out on the PlaFRIM platform (https://plafrim.fr/)
where we used the `miriel` computing nodes, each equipped with 2 \(\times\)
12-core Haswell Intel(R) Xeon(R) E5-2680 v3 processors @ 2.5 GHz with a Thermal
Design Power (TDP) of 120 W, 128 GiB (5.3 GiB/core) of RAM @ 2933 MT/s, an
OmniPath 100 Gbit/s link and a 10 Gbit/s Ethernet link. Note that the TDP
represents the average power, in watts, the processor dissipates when operating
at base frequency with all cores active under an Intel-defined, high-complexity
workload. Finally, Turbo-Boost and Hyperthreading are disabled on all the nodes
in order to ensure the reproducibility of the experiments.

We measure the energy consumption of our application with `energy_scope`. It is
a software package dedicated to performing energy measurements and identifying
the energy profiles of HPC codes ESjcad2021. It consists of an acquisition
module delivering energy statistics, running on the cluster, and a
post-processing and data analysis module running on a dedicated server.
Measurements are performed at a user-defined frequency on both the processor and
the RAM. We also monitor the RAM usage and the flop rate. To measure the RAM
usage, we use a Python script relying on the `/proc/[PID]/statm` file (the
resident size field). Finally, to measure the flop rate, we use the `likwid`
software tool likwid. All the data have been acquired at a 1 Hz frequency.
We assessed the potential overhead of these three software probes on our
application by running the tests with and without them. The results show an
overhead under 5%.

## 4. Study of the multi-solve algorithm in shared-memory

We first study the evolution of the power consumption (in watts) with respect to the execution time of the multi-solve algorithm (see Section 2.2) on a single computational node. We consider both the baseline MUMPS/SPIDO and the further compressed MUMPS/HMAT couplings, assessed on a coupled FEM/BEM linear system of 1,000,000 unknowns in total.

Regarding the *baseline multi-solve* algorithm relying on the MUMPS/SPIDO coupling, we set
the size \(n_c\) of the \(A_{{sv}_j}^T\) and \(S_i\) slices to 256 columns. For the
*compressed Schur multi-solve* relying on the MUMPS/HMAT coupling, the size of the \(S_i\) and the size
of the \(A_{{sv}_j}^T\) slices are handled by two different parameters, \(n_S\) and
\(n_c\) respectively. In this case, we set \(n_c\) to 256 and \(n_S\) to 1,024 columns
so that compression is delayed until four slices involving \(A_{{sv}_j}^T\) are
completed. The choice of these values is motivated by our previous study of the
multi-solve algorithm in RRIPDPS. In Fig.
8, for each solver coupling, the two plots
at the top show the evolution of the CPU and RAM power consumption as well as
the flop rate for the multi-solve algorithm. The two other plots at the
bottom of the figure show the corresponding RAM usage evolution. The labels
mark the alpha and the omega, i.e. the beginning and the end, of the most
important computation phases.

In the case of multi-solve, the Schur complement computation dominates the execution time as well as the RAM usage. From the point of view of the computational intensity, illustrated by the flop rate, it is the opposite. The factorization phase of the dense Schur complement matrix is very short for the MUMPS/HMAT coupling thanks to the usage of low-rank compression. For the MUMPS/SPIDO coupling, however, this phase consists of a dense factorization, which is a computationally intensive operation. Indeed, in this case, we reach the flop rate peak.

Figure 8: Power consumption, flop rate and RAM usage evolution of **multi-solve** for the MUMPS/SPIDO and the MUMPS/HMAT couplings on a FEM/BEM system with 1,000,000 unknowns. Parallel runs using 24 threads on a single `miriel` node. \(\alpha\) and \(\omega\) mark the beginning and the end of a computation phase.

Fig. 9 is a zoom of Fig.
8 between the execution times 498 and 525 s,
i.e. within the Schur complement computation phase. In this figure, we can
observe cycles of high and low power consumption and flop rate. As for the RAM
usage, the cycles are also present but less noticeable. Based on the labels, we
can see that these cycles are almost entirely due to the *sparse solve* operation
involved in the computation of each of the Schur complement blocks \(S_i\) (see
Section 2.2).

Figure 9: Zoom on Fig. 8 between the execution times 498 and 525 s. The Y-axis has been adapted to the extremum values in the selected time span.

Fig. 10 finally compares the total energy consumption (in Joules), the total execution time and the peak RAM usage, once again of both variants (baseline MUMPS/SPIDO and further compressed MUMPS/HMAT) of the multi-solve algorithm. We consider three coupled FEM/BEM systems of a total of 1,000,000, 3,000,000 and 5,000,000 unknowns, respectively. The results confirm, as one may expect, that the energy consumption, the execution time as well as the peak RAM usage rise with increasing size of the linear system. Moreover, the results show that the compressed MUMPS/HMAT variant (the one that also performs compression within the Schur) consumes less energy, in addition to being faster. This is an interesting result from an industrial point of view, further motivating the usage of low-rank compression techniques.

Figure 10: Total energy consumption, computation time and peak RAM usage of **multi-solve** for the MUMPS/SPIDO and the MUMPS/HMAT couplings on FEM/BEM systems with 1,000,000, 3,000,000 and 5,000,000 unknowns. Parallel runs using 24 threads on a single `miriel` node.

## 5. Study of the multi-factorization algorithm in shared-memory

We now consider the multi-factorization algorithm. For the 1,000,000 unknowns test case (Fig. 11), we set the number of Schur complement block rows and columns \(n_b\) to 3 for both the baseline and the further compressed solver couplings. As for multi-solve, in the case of multi-factorization, the execution time is dominated by the Schur complement computation phase. With \(n_b = 3\), we have a total of 6 Schur complement blocks \(S_{ij}\) to compute. We can identify the moment of computation of each \(S_{ij}\) thanks to the apparent cycles of high and low power consumption and flop rate, and especially of RAM usage. Here, the Schur complement computation phase again consumes most of the RAM but is not the most computationally intensive part of the algorithm. The peak power consumption and flop rates are reached during the dense factorization of \(S\) in the case of the baseline MUMPS/SPIDO coupling.

Figure 11: Power consumption, flop rate and RAM usage evolution of **multi-factorization** for the MUMPS/SPIDO and the MUMPS/HMAT couplings on a FEM/BEM system with 1,000,000 unknowns. Parallel runs using 24 threads on a single `miriel` node. \(\alpha\) and \(\omega\) mark the beginning and the end of a computation phase.

Fig. 12 compares the total energy consumption (in Joules), the total execution time and the peak RAM usage of the multi-factorization algorithm with problems of 1,000,000, 1,500,000 and 2,000,000 total unknowns. We can draw the same conclusion as in the case of multi-solve.

Figure 12: Total energy consumption, computation time and peak RAM usage of **multi-factorization** for the MUMPS/SPIDO and the MUMPS/HMAT couplings on FEM/BEM systems with 1,000,000, 1,500,000 and 2,000,000 unknowns. Parallel runs using 24 threads on a single `miriel` node.

## 6. Multi-node study

We now consider a platform composed of four computational nodes and assess a coupled system of 2,000,000 total unknowns.

### 6.1. Multi-solve algorithm

Fig. 13 shows the processor and RAM power consumption over all four monitored nodes for both multi-solve variants. The results show that the power consumption in a distributed parallel test case evolves similarly to the single-node test case (see Section 4). The RAM usage also follows the pattern of the single-node test case.

Figure 13: Power consumption, flop rate and RAM usage evolution of **multi-solve** for the MUMPS/SPIDO and the MUMPS/HMAT couplings on a FEM/BEM system with 2,000,000 unknowns. Distributed parallel runs using a total of 96 threads and 4 `miriel` nodes. \(\alpha\) and \(\omega\) mark the beginning and the end of a computation phase.

### 6.2. Multi-factorization algorithm

Fig. 14 shows the behavior of the
multi-factorization algorithm. The evolution of RAM usage of MUMPS/HMAT follows
the pattern of the single node test case (see Section 5).
Regarding the power consumption, the high and low cycles corresponding to the
computation of different Schur complement blocks differ considerably compared to
the single-node test case. At the beginning of each cycle, there is a peak, but
the consumption drops long before the end of the *sparse factorization+Schur* step.
This may indicate an under-optimized usage of this routine in a distributed
parallel environment. Interestingly, this study allowed us to identify a
potential bottleneck in our usage of the distributed-memory parallelization of
a key component of our solver stack, which previous performance-only studies
had not revealed. This shows the value of such multi-metric profiles beyond the
assessment of the overall energy consumption.

Figure 14: Power consumption, flop rate and RAM usage evolution of **multi-factorization** for the MUMPS/SPIDO and the MUMPS/HMAT couplings on a FEM/BEM system with 2,000,000 unknowns. Distributed parallel runs using a total of 96 threads and 4 `miriel` nodes. \(\alpha\) and \(\omega\) mark the beginning and the end of a computation phase.

## 7. Conclusion and perspectives

We have studied the energy profile of a complex HPC application involving dense, sparse and compressed operations. The study confirmed that the further compressed algorithms (MUMPS/HMAT) are also worthwhile from an energy perspective. The profiles of the processor and memory power, together with the memory usage and flop rate, allowed us to gain a more comprehensive understanding of the behavior of the application, up to the point that we identified a potential improvement in our usage of the Schur functionality of the sparse direct solver.

## Acknowledgements

This work was supported by the 'Projet Région Nouvelle-Aquitaine 2018-1R50119 « HPC scalable ecosystem »', the European High-Performance Computing Joint Undertaking EuroHPC under grant agreement №955495 (MICROCARD) co-funded by the Horizon 2020 program of the European Union (EU), the French National Research Agency ANR, the German Federal Ministry of Education and Research, the Italian ministry of economic development, the Swiss State Secretariat for Education, Research and Innovation, the Austrian Research Promotion Agency FFG, and the Research Council of Norway.

# Bibliography

- [RRIPDPS] Agullo, Felšöci & Sylvand, Direct solution of larger coupled sparse/dense linear systems using low-rank compression on single-node multi-core machines in an industrial context, Inria Bordeaux Sud-Ouest (2022).
- [RaviartThomas1977] Raviart & Thomas, A mixed finite element method for 2-nd order elliptic problems, in Mathematical Aspects of Finite Element Methods, Lecture Notes in Mathematics 606, 292-315, Springer Berlin Heidelberg (1977). doi:10.1007/BFb0064470.
- [ern2013theory] Ern & Guermond, Theory and practice of finite elements, Springer Science & Business Media (2013).
- [banerjee1981boundary] Banerjee & Butterfield, Boundary element methods in engineering science, McGraw-Hill London (1981).
- [Wang2004413] Wang, Vlahopoulos & Wu, Development of an energy boundary element formulation for computing high-frequency sound radiation from incoherent intensity boundary conditions, *Journal of Sound and Vibration*, **278(1-2)**, 413-436 (2004).
- [casenave-phd] Casenave, Méthodes de réduction de modèles appliquées à des problèmes d'aéroacoustique résolus par équations intégrales, PhD thesis, Université Paris-Est (2013).
- [casenave2014coupled] Casenave, Ern & Sylvand, Coupled BEM-FEM for the convected Helmholtz equation with non-uniform flow in a bounded domain, *Journal of Computational Physics*, **257**, 627-644 (2014).
- [Sebaso] Sebaso, Jet engine airflow during take-off, https://commons.wikimedia.org/wiki/File:20140308-Jet_engine_airflow_during_take-off.jpg
- [duff2017direct] Duff, Erisman & Reid, Direct methods for sparse matrices, Oxford University Press (2017).
- [strumpack] Ghysels, Li, Rouet, Williams & Napov, An Efficient Multicore Implementation of a Novel HSS-Structured Multifrontal Solver Using Randomized Sampling, *SIAM Journal on Scientific Computing*, **38** (2015).
- [amestoy2015improving] Amestoy, Ashcraft, Boiteau, Buttari, L'Excellent & Weisbecker, Improving multifrontal methods by means of block low-rank representations, *SIAM Journal on Scientific Computing*, **37(3)**, A1451-A1474 (2015).
- [PastixLR] Pichon, Darve, Faverge, Ramet & Roman, Sparse supernodal solver using block low-rank compression: Design, performance and analysis, *Journal of Computational Science*, **27**, 255-270 (2018).
- [falco2017compas] Agullo, Falco, Giraud & Sylvand, Vers une factorisation symbolique hiérarchique de rang faible pour des matrices creuses, in Conférence d'informatique en Parallélisme, Architecture et Système (ComPAS'17) (2017).
- [10.1007/s00450-014-0267-7] Charles, Sawyer, Dolz & Catalán, Evaluating the Performance and Energy Efficiency of the COSMO-ART Model System, *Comput. Sci.*, **30(2)**, 177-186 (2015).
- [9092332] Solis-Vasquez, Santos-Martins, Koch & Forli, Evaluating the Energy Efficiency of OpenCL-accelerated AutoDock Molecular Docking, in 2020 28th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), 162-166 (2020).
- [doi:10.1177/1094342018792079] Lima, Raïs, Lefèvre & Gautier, Performance and energy analysis of OpenMP runtime systems with dense linear algebra algorithms, *The International Journal of High Performance Computing Applications*, **33(3)**, 431-443 (2019).
- [10.1002/cpe.3341] Aliaga, Anzt, Castillo, Fernández, León, Pérez & Quintana-Ortí, Unveiling the Performance-Energy Trade-off in Iterative Linear System Solvers for Multithreaded Processors, *Concurrency and Computation: Practice and Experience*, **27(4)**, 885-904 (2015).
- [doi:10.1177/1094342016672081] Anzt, Tomov & Dongarra, On the performance and energy efficiency of sparse linear algebra on GPUs, *The International Journal of High Performance Computing Applications*, **31(5)**, 375-390 (2017).
- [10.1145/2712386.2712387] Anzt, Tomov & Dongarra, Energy Efficiency and Performance Frontiers for Sparse Computations on GPU Supercomputers, in Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, 1-10, Association for Computing Machinery (2015).
- [https://doi.org/10.1002/cpe.4460] Anzt, Dongarra, Flegar, Higham & Quintana-Ortí, Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers, *Concurrency and Computation: Practice and Experience*, **31(6)**, e4460 (2019).
- [8370686] Abdelfattah, Haidar, Tomov & Dongarra, Analysis and Design Techniques towards High-Performance and Energy-Efficient Dense Linear Solvers on GPUs, *IEEE Transactions on Parallel and Distributed Systems*, **29(12)**, 2700-2712 (2018).
- [amestoy1998mumps] Amestoy, Duff & L'Excellent, MUMPS multifrontal massively parallel solver version 2.0 (1998).
- [lize2014] Lizé, Résolution Directe Rapide pour les Éléments Finis de Frontière en Électromagnétisme et Acoustique : \(\mathcal{H}\)-Matrices. Parallélisme et Applications Industrielles, PhD thesis, Université Paris 13 (2014).
- [ESjcad2021] Mathieu, Energy Scope: a tool for measuring the energy profile of HPC and AI applications (2021). https://jcad2021.sciencesconf.org/data/Herve_Mathieu_energy_scope.pdf
- [likwid] Gruber, Eitzinger, Hager & Wellein, LIKWID (2021). https://doi.org/10.5281/zenodo.5752537
- [ESstatm] `proc/[pid]/statm` man page. https://man7.org/linux/man-pages/man5/proc.5.html

## 8. Appendix

### 8.1. Sparse FEM systems

### 8.2. Dense BEM systems

### 8.3. Energy scope

A more detailed description can be found on the `energy_scope` website: https://sed-bso.gitlabpages.inria.fr/datacenter/energy_scope.html

The same website explains how to obtain `energy_scope`. The most convenient way is to ask the system administrator of the cluster to install `energy_scope` as a module (environment modules).

The basic use of `energy_scope` is to measure the energy consumption of an HPC application or of sub-parts of it. One can then compare the energy consumption of different algorithms or of different hardware architectures, or complement speedup experiments with the energy cost of increasing the number of nodes. Beyond the global consumption, one can move to the energy profile of the application in order to discover and explore its energy behavior: when its energy efficiency is at its peak, and when it could be further optimized.

The behavior of `energy_scope` can be configured mainly through the sampling frequency and the type of data acquired: energy, core temperature, etc.

To measure the overhead induced by `energy_scope` on the application, we used the following configurations:

- 2 Hz sampling with energy and core temperature: this configuration brings a high-quality energy image of the application. The overhead may be tangible depending on the code.
- 1 Hz sampling with energy only: this configuration brings less data, but it is sufficient for the majority of the applications described in this paper. The overhead is then minimal.

For the measurements shown in the other sections, we use the second configuration.

**CPU memory dump.** We use the information available in `/proc/[PID]/statm` ESstatm. Every second, we read the resident value from this file together with a timestamp (absolute date).

All data are synchronized by construction because each sample carries the absolute date of its acquisition.
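A minimal sketch of such a probe (Linux-only, since it reads `/proc`; the actual monitoring script may differ):

```python
import os

PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")    # bytes per memory page (usually 4096)

def resident_mib(pid="self"):
    """Resident-set size in MiB, read from the second field of /proc/[PID]/statm."""
    with open(f"/proc/{pid}/statm") as f:
        fields = f.read().split()
    return int(fields[1]) * PAGE_SIZE / 2**20

# Sampled once per second with an absolute timestamp in the monitoring script.
print(f"{resident_mib():.1f} MiB resident")
```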

| Solver coupling | Time with `energy_scope` | Total energy consumption | Time without `energy_scope` | Delta | Overhead |
|---|---|---|---|---|---|
| MUMPS/HMAT | 1059.22 s | 220335 J | 1033.07 s | 26.19 s | 2.5% |
| MUMPS/HMAT | 1046.52 s | 214192 J | 1028.43 s | 18.04 s | 1.8% |
| MUMPS/HMAT | 1041.74 s | 215934 J | 1002.76 s | 39.05 s | 3.9% |
| MUMPS/SPIDO | 1349.64 s | 280491 J | 1295.93 s | 53.79 s | 4.1% |
| MUMPS/SPIDO | 1366.57 s | 275652 J | 1314.22 s | 52.8 s | 4% |
| MUMPS/SPIDO | 1323.55 s | 272902 J | 1251.67 s | 71.06 s | 5.7% |
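The overhead column derives from the two time measurements; for instance, for the first MUMPS/HMAT run:

```python
# Times of the first MUMPS/HMAT run, taken from the table above.
t_with, t_without = 1059.22, 1033.07      # seconds, with / without energy_scope

# Relative overhead of the instrumentation.
overhead = (t_with - t_without) / t_without
print(f"{overhead:.1%}")                  # prints "2.5%"
```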