# IPDPS 2022 Rebuttal

## Feedback on Reviews

### Instructions

Use this space to provide feedback on your reviews. For instance you may want to clarify points of confusion, address misunderstandings, or provide additional information. Indicate paragraph breaks by a blank line in the text field. All other text formatting will be lost.

### Our answer

Thank you for the very thorough feedback.

We completely understand the concerns of reviewers 1 and 2 related to sparsity,
as it is indeed key in the design of our algorithms. Is the result of
L_{vv}^{-1} A_{sv}^T dense? From a combinatorial perspective, no. Is it
difficult to exploit sparsity here? Fully exploiting it is as hard as for any
part of the matrix in a sparse direct factorization: it requires advanced
techniques, including symbolic factorization and the management of complex data
structures (e.g. to cope with dense submatrices arising from fill-in). This is
precisely what direct solvers are designed for (MUMPS 5.1.2 has 418,556 lines of
code). We rely on the multiple right-hand-side feature of MUMPS (unique among
fully-featured solvers, to the best of our knowledge) to computationally exploit
the sparsity of A_{sv}^T. We will make this explicit in the revision. However,
because the internal data structures are complex, the user eventually still
retrieves the output as a dense matrix (as in all fully-featured direct solvers
we are aware of).

Let M := L_{vv}^{-1} A_{sv}^T; how about computing M^T M? This is indeed what
sparse direct solvers do internally. However, as discussed above, M is retrieved
as a dense matrix from the sparse direct solver. This scheme would thus make both
the (2,1) and (1,2) blocks dense!
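At toy scale, the scheme under discussion can be sketched with dense NumPy (illustrative names and sizes only; in practice these operations happen inside the sparse direct solver on its internal block structure):

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_s = 8, 3

# SPD "volume" block A_vv (dense here at toy scale) and coupling block A_sv
A_vv = np.eye(n_v) + 0.1 * rng.standard_normal((n_v, n_v))
A_vv = A_vv @ A_vv.T                      # make it SPD
A_sv = rng.standard_normal((n_s, n_v))
A_ss = rng.standard_normal((n_s, n_s))

L = np.linalg.cholesky(A_vv)              # A_vv = L L^T

# M := L^{-1} A_sv^T -- returned as a dense matrix by the solver
M = np.linalg.solve(L, A_sv.T)

# Schur update: S = A_ss - M^T M == A_ss - A_sv A_vv^{-1} A_sv^T
S = A_ss - M.T @ M
S_ref = A_ss - A_sv @ np.linalg.inv(A_vv) @ A_sv.T
assert np.allclose(S, S_ref)
```

The sketch only checks that the M^T M formulation matches the textbook Schur complement; it says nothing about the sparse data structures that make this efficient inside MUMPS.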

In the product of A_{sv} with (A_{vv}^{-1} A_{sv}^T), why is it difficult to
exploit the sparsity? Our original sentence was oversimplified and misleading;
we apologize. A more elaborate statement relates to the above remark on M:
sparse direct solvers follow the scheme suggested above. By doing so, they
exploit a much more appropriate data structure (M has a complex block structure)
to exhibit BLAS-3 kernels, compression, etc., while A_{sv} is a "raw" sparse
matrix yielding a sub-optimal arithmetic intensity (in addition to useless
computation on the zeros stored in A_{vv}^{-1} A_{sv}^T!).

The motivation of the multi-factorization algorithm is precisely that, through the Schur API, we manage to fully delegate these complicated operations to the sparse direct solver, hence benefiting from its highly optimized processing.

We also understand the concerns of reviewers 3 and 4 about the possible loss of accuracy. It comes from the low-rank compression introduced either in MUMPS (with the BLR option, see [14]) or in the H-Matrix solver (see https://tel.archives-ouvertes.fr/tel-01244260 ). In both cases, the accuracy parameter (epsilon) is set to 10^-3. The test in Section V is designed so that we know the exact solution of the problem; hence we can compute the relative error of the results. In all scenarios, we obtain a value below 10^-3 (see https://zenodo.org/record/5749314 ), which is completely satisfactory according to industrial end-users.
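A minimal sketch of such an accuracy check (a hypothetical dense stand-in for the coupled solver, not our actual test harness): a manufactured exact solution lets us compare the relative error against the compression threshold epsilon = 10^-3.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
A = np.eye(n) + 0.01 * rng.standard_normal((n, n))  # well-conditioned system
x_exact = rng.standard_normal(n)
b = A @ x_exact                     # manufactured right-hand side

x = np.linalg.solve(A, b)           # stand-in for the coupled solver

# Relative error E_rel, compared with the accuracy parameter epsilon
e_rel = np.linalg.norm(x - x_exact) / np.linalg.norm(x_exact)
epsilon = 1e-3
assert e_rel < epsilon
```

The same E_rel definition is what the figure below the rebuttal reports for the actual runs.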

We will update the revised version to clarify these concerns about sparsity and accuracy (for the latter, we will also add the proposed plot).

### Figure

Figure 1: Relative error \(E_{rel}\) for the runs of **multi-solve** and **multi-factorization** having the best execution times, for both of the solver couplings MUMPS/HMAT and MUMPS/SPIDO. Parallel runs using 24 threads on a single `miriel` node.

## Meta-review

This paper is well written, and the presented approach is novel and effective in practical applications. On the other hand, several weaknesses and questions were pointed out by the reviewers. Among them, please prioritize your answers to the following questions:

### Sparsity (see comments by Reviewer 1 and 2)

#### \(L_{vv}^{-1} A_{sv}^T\) dense?

Is the result of L_{vv}^{-1} A_{sv}^T dense? (Is it difficult to exploit sparsity here?)

#### \(M := L_{vv}^{-1} A_{sv}^T\)

Let M := L_{vv}^{-1} A_{sv}^T, how about computing M^T M?

#### \(A_{sv} \times (A_{vv}^{-1} A_{sv}^T)\)

In the multiplication of A_{sv} times (A_{vv}^{-1} A_{sv}^T), why is it difficult to exploit the sparsity?

### Accuracy (see comments by Reviewer 3 and 4)

Please comment on the points raised by these reviewers.

### Other comments and questions

If possible, please respond to the other comments and questions.

## 1. Reviewer 1 accept

### 1.1. Summary

This paper presents an algorithm for the solution of a large linear system with a sparse and a dense component, resulting from the coupling of a FEM and a BEM model. This is coming from an industrial application. The algorithm makes use of state-of-the-art fast sparse and dense direct solvers, using low-rank compression in both the dense and sparse solvers. Two versions are presented: the first one constructs the Schur complement in column blocks. The second one calls the sparse direct solver multiple times, with the advanced feature to return the (dense) Schur complement for different sub-parts of the dense Schur block.

### 1.2. Strengths

- clear presentation
- novel approach for a real use case

### 1.3. Weaknesses

#### 1.3.1. TODO current limitations (due to the API)

The algorithms work around the (current) limitations in the (interface) to the used sparse direct solver, by, for instance, redoing the factorization multiple times while this could in theory be avoided.

In theory, yes, but that is precisely one of the limitations of the sparse direct solver's API. We already mention it in the article. I do not really see what else we could say about it. Perhaps put this in perspective in the conclusion?

#### 1.3.2. DONE sparsity of \(A_{sv}\)

"for which it is not evident to exploit the sparsity of A_{sv}" why not, if A_{vv}^{-1}A_{sv}^T is already dense?

#### 1.3.3. TODO randomized algorithms

Randomized construction algorithms can be used to compress the dense Schur complement, with a baseline usage of the sparse direct solver, without forming the dense Schur complement first. See for instance: - https://arxiv.org/abs/1001.0149 - https://epubs.siam.org/doi/abs/10.1137/20M1315853?journalCode=sjoce3 - https://epubs.siam.org/doi/abs/10.1137/15M1016679?casa_token=O5RBWussf-sAAAAA:g_ADav8QFsod0nPXToM_7mt7E0R4XYV3IRASbUtCz2vSBJTSPuuQgCjPGe4azYRvKravsZFElA1x

These articles need to be studied more closely.

### 1.4. DONE Minor

- "mentionned"
- "unknwons"
- "additionnally"
- "unknwons"
- "ondulated"?

## 2. Reviewer 2 weakreject

### 2.1. Summary

Parallel sparse direct solvers are among the fundamental tools used in scientific simulations, and hence have been extensively studied from combinatorial, mathematical, and HPC perspectives. Among the recent trends in this community is the employment of low-rank compression techniques, in an effort to reduce the memory footprint as well as the complexity of the solver at different levels. The focus of this paper is to investigate the potential of low-rank compression when the factorization is expected to generate a dense Schur complement at some point (the reference application being FEM+BEM coupled systems, in aircraft simulations for instance). Existing sparse direct solver packages such as Mumps and Pastix seem to already provide the functionality to perform a partial factorization that generates a dense Schur complement. However, when the dimensions of this dense submatrix grow (which is said to grow quadratically with the chosen frequency), it becomes infeasible to even form this matrix, which is expected to be of low rank. The focus of this paper is to devise algorithms that form this Schur complement in small low-rank pieces to prevent this explosion in dense matrix size. Two algorithms are proposed: the first, called "multi-solve", forms the Schur complement column-wise (for a set of columns) then recompresses; the second, called "multi-factorization", forms the Schur complement block-by-block (on an n_c x n_c tiling), then compresses each block after low-rank updates, thereby reducing the Schur complement size.

### 2.2. Pros

The article is really well written technically; the language is clear, the figures are appropriate, and the provided background is easy to follow for someone who is not from this domain. I learned a lot and enjoyed reading it.

The problem is of potentially high impact as dense Schur complements arise in many scientific simulations.

The two algorithms seem to provide complementary benefits: multi-solve enables larger problem sizes, while multi-factorization is faster for smaller problems.

The paper uses existing methods/libraries as toolboxes and combines them adequately to implement the algorithms, obtaining respectable speedups.

### 2.3. Cons

#### 2.3.1. Optimality of the proposed method

I have a couple of questions and concerns that make me question the optimality of the proposed method, and whether the baseline sparse/dense solver coupling is an underperformer:

- DONE Dense \(L^-T L^-1 A_{sv}^T\)

Why is L^{-T} L^{-1} A_{sv}^T dense, given that L and the columns of A_{sv}^T (the right-hand sides) are sparse? Is this particular to the application?

- DONE \(M = L^-1 A_{sv}^T\)

How about M = L^{-1} A_{sv}^T? If this is not dense (which I believe; consider for instance L = D, or a column of A_{sv}^T being zero), why not update the Schur complement with a sparse GEMM as S -= M^T M? This would avoid the second triangular solve. Then perform b_s -= M^T L^{-1} b_v with a single triangular solve plus an SpMV?

#### 2.3.2. DONE Utility of the baseline multi-factorization

What is the utility of the baseline multi-factorization (w.r.t Mumps complete Schur), given that it does not compress the Schur complement? In particular, why not simply generate the entire Schur complement all-at-once using Mumps (meaning setting n_b = 1)?

It provides a baseline for the compressed version of the algorithm, which would not provide any reduction in RAM consumption if \(n_b = 1\).

#### 2.3.3. DONE Alg. 4 l. 7

Alg. 4, line 7: is A_{s,s_ij} re-compressed after each update, after a certain number of updates, or only once at the end? Could you elaborate on this decision? Is there an initial compression of the sparse block A_{s,s_ij}?

\(A_{ss}\) is compressed from the start. The chosen notation lets us avoid elaborating on how this is implemented with HMAT, namely by initializing \(X_{ij}\) as a hierarchical matrix with the same dimensions (mathematically speaking) as \(A_{ss}\) but filling only the part corresponding to the current Schur block. But, to me, this is purely a question of implementation with HMAT. I am not sure whether it would be appropriate to add this explanation to the purely algorithmic section. I have added a few words specifying when we recompress, but this remains to be seen (depending on the space we have left).

#### 2.3.4. DONE Alg. 2 l. 6

Alg.2 line 6: Are you using sparse matrix dense matrix multiplication kernels or performing dense GEMM?

I am not sure whether what we do is called a GEMM. We multiply a sparse matrix in CSR format by a classical dense block. What is the proper name for this operation?

SpMM: Sparse Matrix-Matrix product (the product of a sparse matrix by a dense matrix).
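For illustration, a minimal NumPy sketch of such an SpMM kernel operating directly on the CSR arrays (function and variable names are hypothetical; the actual code relies on optimized kernels):

```python
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """Multiply a CSR sparse matrix by a dense block B; returns the dense A @ B."""
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]))
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):
            C[i] += data[k] * B[indices[k]]   # accumulate a_ik * B[k, :]
    return C

# A = [[1, 0, 2],
#      [0, 3, 0]] in CSR form
indptr  = np.array([0, 2, 3])
indices = np.array([0, 2, 1])
data    = np.array([1.0, 2.0, 3.0])
B = np.arange(6.0).reshape(3, 2)   # dense block

C = spmm_csr(indptr, indices, data, B)
A_dense = np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]])
assert np.allclose(C, A_dense @ B)
```

Only the stored nonzeros of each CSR row touch the dense block, which is what distinguishes SpMM from a dense GEMM.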

#### 2.3.5. TODO blocksize

One of the main weaknesses in the experiments is that only the best-case scenario w.r.t. the block-size parameters is considered and reported in some experiments, while I do not see a convincing way to choose these best-case parameters for any matrix in practice.

The reviewer is probably referring to Figure 10. There, we show the best times obtained for multi-solve and multi-factorization. It is true that we have not yet properly thought about a criterion for choosing the parameters of the two algorithms, or for choosing between the two algorithms at all. To be added as perspectives in the conclusion, I think.

Nevertheless, we did test several cases with different parameters in order to determine the best times. Moreover, we discuss the influence of the different parameters in the following sections.

### 2.4. Verdict

The paper provides two complementary algorithms with good results on an important problem. However, I do have some major concerns, particularly w.r.t. some sparsity claims made in the paper, which can potentially change the utility of the proposed methods, and w.r.t. the parameter choice in the experiments. For these reasons, I see this paper as borderline.

### 2.5. Other remarks

#### 2.5.1. DONE add x_v marek

Please add a line in the algorithms to compute x_v at the end as well for the sake of completeness of the solver.

#### 2.5.2. DONE 2x2 legend matrix / fig. 10 shape - colors

I must say I am not a big fan of this 2x2 "legend matrix"; it took me a while to figure out. The colors and line types in Fig. 10 are difficult to decipher at first sight. Maybe consider using another shape having only colors for the multi-factorization and multi-solve legends (to underline that that legend only refers to colors)?

Yes, that is true; we could remove the points in the right-hand part of the legend.

### 2.6. Minor

#### 2.6.1. DONE Minor

- sumatrix -> submatrix
- unknwons -> unknowns

#### 2.6.2. DONE To be discussed (4 instances)

- consists in -> consists of

Changed to "consists of" where appropriate. See https://blog.harwardcommunications.com/2014/06/11/the-difference-between-consist-of-and-consist-in/ for the explanation.

## 3. Reviewer 3 weakaccept

### 3.1. Summary

The authors proposed blocked algorithms for the direct solution of the coupled sparse and dense systems. In particular, they looked into the coupling of the FEM (finite element method) and BEM (boundary element method). The FEM part is a large sparse matrix, and the BEM part is a smaller dense matrix. The authors indicated that the existing solvers cannot do the coupling in a memory-efficient way, and they built their implementation using MUMPS and ScaLAPACK libraries (HMAT and SPIDO).

The authors proposed two algorithms: multi-solve and multi-factorization. In the multi-solve implementation, the Schur complement is calculated explicitly by performing sparse solve using the factorized matrices. In the multi-factorization implementation, they used the built-in capability of MUMPS to get the Schur complement from MUMPS. For each variant, they also implemented low-rank compression variants.

The main contributions are the blocking scheme and hierarchical low-rank compression to reduce the memory requirement of the coupled direct solver. Their new implementation managed to solve significantly larger problems up to 9M unknowns compared to 1.3M baseline coupling. The authors also reported the factorization time and memory footprint for different problem sizes and alternative blocking schemes.

### 3.2. Strengths

A well-written paper where the algorithms are described clearly with illustrative sketches. The proposed blocking scheme and low-rank approximation allow solving 7X larger problems with the same memory footprint. The impact of different blockings on timings and memory footprint is reported. There is also some unexpected sparse solve performance data from MUMPS/HMAT, reported as is. The findings and the proposed blocking scheme may inspire API changes and improvements in sparse direct solvers.

### 3.3. Weaknesses

There is only a sentence on the accuracy of the results. The paper would benefit greatly from a more detailed reporting of accuracy for alternative blockings (especially nb) and low-rank compression schemes.

There is no discussion on how the existing solvers could be improved to solve the coupled problems more efficiently in terms of memory and execution time.

There is no performance comparison against baseline implementation for 1M (1.3M with compression) unknowns problem.

Execution times to be retrieved.

#### 3.3.1. TODO Blocking technique

Application of known blocking techniques and existing solvers to reduce the memory consumption on a single node.

These known blocking techniques are used in a distributed environment to share the work between different processes (see refs in related work). Here, we make use of blocking to reduce memory consumption (through compression, for instance). From my point of view, this is what is novel, and not the block-wise computation (generally speaking) itself.

Do we add something about this in the article?

### 3.4. Modifications for improving

#### 3.4.1. DONE More details on accuracy

Could you report the accuracy in more detail? I would expect that there would be no significant accuracy difference between multi-solve vs. multi-factorize, however, it will still be good to see if there is some unexpected difference due to round-off / extra compute. It would also be interesting to see whether accuracy changes for different blocking values (especially nb since this seems to break the symmetry).

#### 3.4.2. TODO include a section on existing sparse/dense solvers

Would it be possible to include a section summarizing how the existing direct sparse/dense solvers could be improved to solve your problems more efficiently?

#### 3.4.3. DONE vanilla

Could you report the execution times of the vanilla implementations II.C and II.D? I understand that this could only be done for the smallest problem size (1M) due to memory consumption. However, it may still be good to have this as a reference. Perhaps Fig 12 already reports this for nb=1 values? If so, it would be good to highlight it in the text.

#### 3.4.4. A few detailed suggestions for the text/plots:

- DONE II.C

II.C: "A matrix-matrix product then follows to compute Asv A−1 vv ATsv, for which it is not evident to exploit the sparsity of Asv." Couldn't we use sparse matrix - dense matrix multiply here? Could you provide more details on why we cannot exploit the sparsity of Asv?

This is indeed what we do here, a sparse matrix-matrix product (SpMM). I don't remember why we added this part of the sentence: "…for which it is not evident to exploit the sparsity of A_{sv}."

- DONE graph legend in Fig 10

The graph legend in Figure 10 is a bit confusing. First, there are no black lines in the plot itself. My suggestion would be to explicitly mention what each line is using the exact color in the plot.

- DONE Fig 11 and 12 legend color match plot itself

Fig 11 and Fig 12 are clear since there are only two lines, however, it would still be good to match the legend color with the plot itself.

I do not know how to do that for the moment.

- DONE Fig 11 and 12 caption

Fig 11 and 12 caption says "Residual memory (RAM) usage". What does residual memory mean here? Consider changing it to RAM usage unless residual has a special meaning here, which needs to be described in text.

## 4. Reviewer 4 weakaccept

### 4.1. Summary

Solving larger coupled sparse/dense linear systems by a direct solver is considered. The main difficulty is required memory footprint in naive computation of the Schur complement, in which the result is in dense form. Two classes of algorithms are proposed based on block-wise computation, and for each class, two variants with/without low-rank compression are presented. In the experiments, mainly the tradeoff between memory consumption and computational time, which is determined by the block size parameter, is investigated. The results of the industrial application are also provided.

### 4.2. Strengths

The problem appearing in industrial application is considered, and the examples of applying the proposed algorithms to the industrial problem are also shown. The proposed algorithms are clearly described; it is easy to understand the differences among the algorithms. Experimental results seem to be helpful for those who will use the proposed algorithms in their applications.

### 4.3. Weaknesses

The idea of the block-wise algorithm seems not to be novel, because similar approaches have been widely used in the context of out-of-core matrix algorithms. The use of low-rank compression seems to be one of the features of the proposed algorithms; however, almost no information on it is provided in the current manuscript. Since there are several styles of low-rank compression, as far as the reviewer knows, it is difficult to get the details of the algorithms employing low-rank compression. One of the issues when using low-rank compression is computational accuracy; however, there is no investigation of the algorithms from this viewpoint.

### 4.4. Overall evaluation

The manuscript is well written and provides interesting results, especially those for actual industrial problems. However, the contribution seems to be limited. Thus, the reviewer concludes that weak accept is currently reasonable, and this manuscript should be kept in the second round.

### 4.5. Modifications for improving

#### 4.5.1. DONE more info on low-rank

More information for low-rank compression should be provided.

#### 4.5.2. DONE more on accuracy

Differences in the accuracy of the computed results among algorithms should be more discussed (and investigated).