IPDPS 2022 Rebuttal

Feedback on Reviews

Instructions

Use this space to provide feedback on your reviews. For instance you may want to
clarify points of confusion, address misunderstandings, or provide additional
information. Indicate paragraph breaks by a blank line in the text field. All
other text formatting will be lost.

Our answer

Thank you for the very thorough feedback.

We completely understand the concerns of Reviewers 1 and 2 about sparsity, as it is indeed key to the design of our algorithms. Is the result of \(L_{vv}^{-1} A_{sv}^T\) dense? From a combinatorial perspective, no. Is it difficult to exploit sparsity here? To exploit it fully: yes, very much so, as much as in any part of the matrix in a sparse direct factorization; it requires resorting to advanced techniques, including symbolic factorization and the management of complex data structures (e.g. to cope with dense submatrices arising from fill-in). This is what direct solvers are meant for (MUMPS 5.1.2 has 418,556 lines of code). We rely on the sparse multiple right-hand-sides feature of MUMPS (unique among fully-featured solvers, to the best of our knowledge) to computationally exploit the sparsity of \(A_{sv}^T\). We will make this explicit in the revision. However, because the internal data structures are complex, the user eventually still receives the output as a dense matrix (as in all fully-featured direct solvers we are aware of).
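A minimal scipy-based sketch of this behavior (illustrative sizes, a bidiagonal stand-in for \(L_{vv}\); this is neither the paper's code nor MUMPS): the forward solve fills each column below its first nonzero, so the result keeps structural zeros, yet the interface hands it back as a plain dense array.

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import spsolve_triangular

    n, m = 500, 8
    # Bidiagonal stand-in for the sparse triangular factor L_vv (illustrative).
    L = (sp.eye(n) + sp.diags(np.full(n - 1, 0.5), -1)).tocsr()
    # Sparse coupling block A_sv (illustrative density).
    A_sv = sp.random(m, n, density=0.01, format="csr", random_state=0)

    # Each column of the result is filled below its first nonzero: it is not
    # combinatorially dense, but the interface returns it as a dense array,
    # just like the fully-featured direct solvers discussed above.
    M = spsolve_triangular(L, A_sv.T.toarray(), lower=True)
    print(M.shape, np.count_nonzero(M) / M.size)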

Let \(M := L_{vv}^{-1} A_{sv}^T\); how about computing \(M^T M\)? This is indeed what sparse direct solvers do internally. However, as discussed above, M is retrieved from the sparse direct solver as a dense matrix. This scheme would thus make both the (2,1) and (1,2) blocks dense!
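A sketch of how this scheme then degenerates (illustrative sizes; M stands for the dense matrix handed back by the solver, A_ss for the dense BEM block):

    import numpy as np

    n_v, n_s = 1000, 50
    M = np.random.rand(n_v, n_s)     # dense output of the sparse solve (stand-in)
    A_ss = np.random.rand(n_s, n_s)  # dense BEM block (stand-in)

    # With M dense, the Schur update is a plain dense SYRK-like product:
    # no sparsity is left to exploit in the (2,1) and (1,2) coupling blocks.
    S = A_ss - M.T @ M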

In \(A_{sv} \times (A_{vv}^{-1} A_{sv}^T)\), why is it difficult to exploit the sparsity? Our original sentence was oversimplified and misleading; we apologize. A more elaborate statement relates to the remark on M above: sparse direct solvers follow the scheme suggested there. By doing so, they exploit a much more appropriate data structure (M has a complex block structure) to exhibit BLAS-3 operations, compression, etc., whereas \(A_{sv}\) is a "raw" sparse matrix yielding a sub-optimal arithmetic intensity (in addition to useless computation on the zeros stored in \(A_{vv}^{-1} A_{sv}^T\)!).
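For reference, this baseline product is an SpMM, sketched below with scipy (hypothetical sizes; W stands for the dense \(A_{vv}^{-1} A_{sv}^T\) returned by the solver):

    import numpy as np
    import scipy.sparse as sp

    A_sv = sp.random(100, 2000, density=0.005, format="csr", random_state=0)
    W = np.random.rand(2000, 100)  # stands in for the dense A_vv^{-1} A_sv^T

    # CSR-times-dense product (SpMM): correct, but with a lower arithmetic
    # intensity than the blocked BLAS-3 kernels a direct solver uses
    # internally, and it touches the explicitly stored zeros of W.
    S_update = A_sv @ W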

The motivation for the multi-factorization algorithm is precisely that, through the Schur API, we manage to fully delegate these complicated operations to the sparse direct solver, hence benefiting from its highly optimized processing.
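A conceptual sketch of this block-wise Schur assembly, with scipy's splu standing in for the solver's Schur API (sizes and data are placeholders; note that the sketch reuses one factorization, whereas the actual multi-factorization algorithm re-runs the solver per block because of the API limitation raised by Reviewer 1):

    import numpy as np
    import scipy.sparse as sp
    from scipy.sparse.linalg import splu

    n_v, n_s, n_b = 2000, 300, 3
    A_vv = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n_v, n_v), format="csc")
    A_sv = sp.random(n_s, n_v, density=0.002, format="csc", random_state=0)
    A_ss = np.random.rand(n_s, n_s)  # dense BEM block (placeholder)

    lu = splu(A_vv)
    blocks = np.array_split(np.arange(n_s), n_b)
    for rows in blocks:
        for cols in blocks:
            # Dense tile of A_vv^{-1} A_sv^T restricted to `cols`:
            W = lu.solve(A_sv[cols].toarray().T)
            # Small dense Schur tile, ready for low-rank compression:
            A_ss[np.ix_(rows, cols)] -= A_sv[rows] @ W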

We also understand the concerns of Reviewers 3 and 4 about a possible loss of accuracy. It comes from the low-rank compression introduced either in MUMPS (with the BLR option, see [14]) or in the H-Matrix solver (see https://tel.archives-ouvertes.fr/tel-01244260 ). In both cases, the accuracy parameter (epsilon) is set to 10^-3. The test in Section V is designed so that we know the exact solution of the problem, hence we can compute the relative error of the results. In all scenarios, we obtain a value below 10^-3 (see https://zenodo.org/record/5749314 ), which is completely satisfactory according to industrial end-users.
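The accuracy check itself is simple; a sketch with placeholder vectors (in the actual test of Section V the exact solution is known by construction):

    import numpy as np

    x_exact = np.ones(10_000)                     # manufactured exact solution
    x = x_exact + 1e-5 * np.random.randn(10_000)  # stand-in for the computed one

    E_rel = np.linalg.norm(x - x_exact) / np.linalg.norm(x_exact)
    assert E_rel < 1e-3  # epsilon threshold used for BLR / H-Matrix compression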

We will update the revised version to make these points on sparsity and accuracy clear (for the latter we will also add the proposed plot).

Figure

coupled-epsilon-ipdps.svg

Figure 1: Relative error \(E_{rel}\) for the multi-solve and multi-factorization runs with the best execution times, for both solver couplings MUMPS/HMAT and MUMPS/SPIDO. Parallel runs use 24 threads on a single miriel node.

Meta-review

This paper is well written, and the presented approach is novel and effective in
practical applications. On the other hand, several weaknesses and questions were
pointed out by the reviewers. Among them, please give priority to the answers to
the following questions:

Sparsity (see comments by Reviewers 1 and 2)

\(L_{vv}^{-1} A_{sv}^T\) dense?

Is the result of \(L_{vv}^{-1} A_{sv}^T\) dense? (Is it difficult to exploit
sparsity here?)

\(M := L_{vv}^{-1} A_{sv}^T\)

Let \(M := L_{vv}^{-1} A_{sv}^T\); how about computing \(M^T M\)?

\(A_{sv} \times (A_{vv}^{-1} A_{sv}^T)\)

In the multiplication \(A_{sv} \times (A_{vv}^{-1} A_{sv}^T)\), why is it
difficult to exploit the sparsity?

Accuracy (see comments by Reviewers 3 and 4)

Please comment on the points raised by these reviewers.

Other comments and questions

If possible, please also respond to the other comments and questions.

1. Reviewer 1   accept

1.1. Summary

This paper presents an algorithm for the solution of a large linear system with
a sparse and a dense component, resulting from the coupling of a FEM and a BEM
model. This is coming from an industrial application. The algorithm makes use of
state-of-the-art fast sparse and dense direct solvers, using low-rank
compression in both the dense and sparse solvers.

Two versions are presented: the first one constructs the Schur complement in
column blocks. The second one calls the sparse direct solver multiple times,
with the advanced feature to return the (dense) Schur complement for different
sub-parts of the dense Schur block.

1.2. Strengths

- clear presentation
- novel approach for a real use case

1.3. Weaknesses

1.3.1. TODO current limitations (due to the API)

The algorithms work around the (current) limitations in the (interface) to the
used sparse direct solver, by, for instance, redoing the factorization multiple
times while this could in theory be avoided.

In theory yes, but that is precisely one of the limitations of the API of the sparse direct solver. We already mention it in the article; I do not really see what more we could say about it. Perhaps put this in perspective in the conclusion?

1.3.2. DONE sparsity of \(A_{sv}\)

"for which it is not evident to exploit the sparsity of A_{sv}" why
not, if A_{vv}^{-1}A_{sv}^T is already dense?

1.3.3. TODO randomized algorithms

Randomized construction algorithms can be used to compress the dense Schur
complement, with a baseline usage of the sparse direct solver, without forming
the dense Schur complement first. See for instance:
- https://arxiv.org/abs/1001.0149
- https://epubs.siam.org/doi/abs/10.1137/20M1315853?journalCode=sjoce3
- https://epubs.siam.org/doi/abs/10.1137/15M1016679?casa_token=O5RBWussf-sAAAAA:g_ADav8QFsod0nPXToM_7mt7E0R4XYV3IRASbUtCz2vSBJTSPuuQgCjPGe4azYRvKravsZFElA1x

These papers need to be studied more closely.
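As a first step, here is a minimal sketch of the randomized range finder from the first reference (Halko et al., arXiv:1001.0149), with a placeholder low-rank S; in the reviewer's suggestion, S would be applied to the test matrix through the sparse direct solver without ever being formed:

    import numpy as np

    rng = np.random.default_rng(0)
    n_s, k = 300, 20
    # Placeholder rank-5 Schur complement (illustrative only).
    S = rng.standard_normal((n_s, 5)) @ rng.standard_normal((5, n_s))

    Omega = rng.standard_normal((n_s, k))  # Gaussian test matrix
    Y = S @ Omega                          # in practice: apply S via the solver
    Q, _ = np.linalg.qr(Y)                 # orthonormal basis of the range
    B = Q.T @ S                            # small projected matrix
    S_approx = Q @ B                       # rank-k approximation of S
    print(np.linalg.norm(S - S_approx) / np.linalg.norm(S))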

1.4. DONE Minor

    - "mentionned"
    - "unknwons"
    - "additionnally"
    - "unknwons"
    - "ondulated"?

2. Reviewer 2   weakreject

2.1. Summary

Parallel sparse direct solvers are among the fundamental tools used in
scientific simulations, and hence have been extensively studied in the past from
a combinatorial, mathematical, and HPC perspective. Among the recent trends in
this community is the employment of low-rank compression techniques, in an
effort to reduce the memory as well as the complexity of the solver at different
levels. The focus of this paper is to investigate the potential of low rank
compression when the factorization is expected to generate a dense Schur
complement at some point (reference application being FEM+BEM coupled systems in
aircraft simulations for instance).

Existing sparse direct solver packages such as Mumps and Pastix seem to already
provide the functionality that performs a partial factorization that generates a
dense Schur complement. However, when the dimensions of this dense submatrix
grow (which is said to grow quadratically with the chosen frequency), it becomes
infeasible to even form this matrix, which is expected to be of low-rank. The
focus of this paper is to devise algorithms that form this Schur complement in
low-rank small pieces to prevent this explosion in dense matrix size.

Two algorithms are proposed: the first, called "multi-solve", forms the Schur
complement column-wise (for a set of columns) and then recompresses; the second,
called "multi-factorization", forms the Schur complement block-by-block (on an
n_c x n_c tiling) and then compresses each block after the low-rank updates,
thereby reducing the Schur complement size.

2.2. Pros

The article is really well written technically; the language is clear, the
figures appropriate, and the provided background easy to follow for someone who
is not from this domain. I learned a lot and enjoyed reading it.
The problem is of potentially high impact, as dense Schur complements arise in
many scientific simulations.
The two algorithms seem to provide complementary benefits; multi-solve enables
larger problem sizes while multi-factorization is faster for smaller problems.
The paper uses existing methods/libraries as toolboxes and combines them
adequately to implement the algorithms, obtaining respectable speedups.

2.3. Cons

2.3.1. Optimality of the proposed method

I have a couple of questions and concerns that make me question the optimality of the proposed method and suggest that the baseline sparse/dense solver coupling may be an underperformer:

  1. DONE Dense \(L^{-T} L^{-1} A_{sv}^T\)
    Why is \(L^{-T} L^{-1} A_{sv}^T\) dense, given that L and the columns of
    \(A_{sv}^T\) (the RHS) are sparse? Is this particular to the application?

  2. DONE \(M = L^{-1} A_{sv}^T\)
    How about \(M = L^{-1} A_{sv}^T\)? If this is not dense (which I believe, for
    instance L = D, or a column of \(A_{sv}^T\) being zero), why not update the
    Schur complement with a sparse GEMM S -= M^T M? This would prevent the second
    triangular solve. Then perform b_s -= M^T L^{-1} b_v in a single triangular
    solve plus an SpMV?
    

2.3.2. DONE Utility of the baseline multi-factorization

What is the utility of the baseline multi-factorization (w.r.t. Mumps' complete
Schur), given that it does not compress the Schur complement? In particular, why
not simply generate the entire Schur complement all-at-once using Mumps (meaning
setting n_b = 1)?

It provides a basis for the compressed version of the algorithm, which would yield no RAM consumption reduction if \(n_b = 1\).

2.3.3. DONE Alg. 4 l. 7

Alg. 4, line 7: is \(A_{s,s_{ij}}\) re-compressed after each update, after a
certain number of updates, or only once at the end? Could you elaborate on this
decision? Is there an initial compression of the sparse block \(A_{s,s_{ij}}\)?

\(A_{ss}\) is compressed from the start. The chosen notation allows us not to elaborate on how this is implemented with HMAT, namely by initializing \(X_{ij}\) as a hierarchical matrix with the same dimensions (mathematically speaking) as \(A_{ss}\), but filling only the part corresponding to the current Schur block. In my view, however, this is purely an implementation question specific to HMAT, and I am not sure this explanation belongs in the purely algorithmic section. I added a few words specifying when we recompress, but this remains to be seen (depending on the space we have left).
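If we do add a word on the mechanics, a plain truncated-SVD sketch could convey the epsilon-truncation idea without going into HMAT's hierarchical format (everything below is a placeholder, not HMAT code):

    import numpy as np

    def compress(tile, eps=1e-3):
        # Keep singular values above eps relative to the largest one and
        # return rank-r factors (plain truncated SVD, standing in for HMAT).
        U, s, Vt = np.linalg.svd(tile, full_matrices=False)
        r = max(1, int(np.count_nonzero(s > eps * s[0])))
        return U[:, :r] * s[:r], Vt[:r]

    rng = np.random.default_rng(0)
    # Low-rank-plus-noise placeholder for one Schur tile.
    tile = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 200)) \
           + 1e-8 * rng.standard_normal((200, 200))
    left, right = compress(tile)
    print(left.shape, right.shape)  # (200, r) and (r, 200), r close to 5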

2.3.4. DONE Alg. 2 l. 6

Alg.2 line 6: Are you using sparse matrix dense matrix multiplication kernels or
performing dense GEMM?

I am not sure whether what we do is called a GEMM. We multiply a sparse matrix in CSR format by a classical dense block. What is the proper name for this operation?

SpMM: sparse matrix-matrix product (the product of a sparse matrix by a dense matrix).

2.3.5. TODO blocksize

One of the main weaknesses in the experiments is that only the best-case
scenario w.r.t. the blocksize parameters is considered and provided in some
experiments, while I do not see a convincing way to choose these best-case
parameters for any matrix in practice.

He is most likely referring to Figure 10, where we show the best times obtained
for multi-solve and multi-factorization. It is true that we have not yet
properly thought about a criterion for choosing the parameters of the two
algorithms, or for choosing between the two algorithms in the first place. To be
added as a perspective in the conclusion, I think.

Nevertheless, we did test several cases with different parameters in order to
determine the best times. Moreover, we discuss the influence of the different
parameters in the subsequent sections.

2.4. Verdict

The paper provides two complementary algorithms with good results on an
important problem. However, I do have some major concerns, particularly w.r.t.
some sparsity claims made in the paper which can potentially change the utility
of the methods proposed in the paper, and the parameter choice in the
experiments. For these reasons, I see this paper as borderline.

2.5. Other remarks

2.5.1. DONE add x_v   marek

Please add a line in the algorithms to compute x_v at the end as well for the
sake of completeness of the solver.

2.5.2. DONE 2x2 legend matrix / fig. 10 shape - colors

I must say I am not a big fan of this 2x2 "legend matrix"; it took me a while
to figure it out. The colors and line types in Fig. 10 are difficult to decipher
at first sight. Maybe consider putting another shape having only colors for the
multi-factorization and multi-solve legends (to underline that that part of the
legend only refers to colors)?

Yes, that is true; we could remove the markers on the right-hand part of the legend.

2.6. Minor

2.6.1. DONE Minor

-sumatrix -> submatrix
-unknwons -> unknowns

2.6.2. DONE To be discussed (4 instances)

-consists in -> consists of

Changed to "consists of" where appropriate. See https://blog.harwardcommunications.com/2014/06/11/the-difference-between-consist-of-and-consist-in/ for the explanation.

3. Reviewer 3   weakaccept

3.1. Summary

The authors proposed blocked algorithms for the direct solution of the coupled
sparse and dense systems. In particular, they looked into the coupling of the
FEM (finite element method) and BEM (boundary element method). The FEM part is a
large sparse matrix, and the BEM part is a smaller dense matrix. The authors
indicated that the existing solvers cannot do the coupling in a memory-efficient
way, and they built their implementation using MUMPS and ScaLAPACK libraries
(HMAT and SPIDO).
The authors proposed two algorithms: multi-solve and multi-factorization. In the
multi-solve implementation, the Schur complement is calculated explicitly by
performing sparse solve using the factorized matrices. In the
multi-factorization implementation, they used the built-in capability of MUMPS
to get the Schur complement from MUMPS. For each variant, they also implemented
low-rank compression variants.
The main contributions are the blocking scheme and hierarchical low-rank
compression to reduce the memory requirement of the coupled direct solver. Their
new implementation managed to solve significantly larger problems up to 9M
unknowns compared to 1.3M baseline coupling. The authors also reported the
factorization time and memory footprint for different problem sizes and
alternative blocking schemes.

3.2. Strengths

A well-written paper where the algorithms are described clearly with
illustrative sketches. The proposed blocking scheme and low-rank approximation
allow solving 7X larger problems with the same memory footprint. The impact of
different blockings on timings and memory footprint is reported. There is also
some unexpected sparse solve performance data from MUMPS/HMAT, reported as is.
findings and the proposed blocking scheme may inspire API changes and
improvements in sparse direct solvers.

3.3. Weaknesses

There is only a sentence on the accuracy of the results. The paper would benefit
greatly from a more detailed reporting of accuracy for alternative blockings
(especially nb) and low-rank compression schemes.
There is no discussion on how the existing solvers could be improved to solve
the coupled problems more efficiently in terms of memory and execution time.
There is no performance comparison against the baseline implementation for the 1M (1.3M
with compression) unknowns problem.

Execution times to be retrieved.

3.3.1. TODO Blocking technique

Application of known blocking techniques and existing solvers to reduce the
memory consumption on a single node.

These known blocking techniques are used in a distributed environment to share the work among different processes (see the references in the related work). Here, we use blocking to reduce memory consumption (through compression, for instance). From my point of view, this is what is novel, not the block-wise computation itself (generally speaking).

Do we add something about this in the article?

3.4. Modifications for improving

3.4.1. DONE More details on accuracy

Could you report the accuracy in more detail? I would expect that there would be
no significant accuracy difference between multi-solve vs. multi-factorize,
however, it will still be good to see if there is some unexpected difference due
to round-off / extra compute. It would also be interesting to see whether
accuracy changes for different blocking values (especially nb since this seems
to break the symmetry).

3.4.2. TODO include a section on existing sparse/dense solvers

Would it be possible to include a section summarizing how the existing direct
sparse/dense solvers could be improved to solve your problems more efficiently?

3.4.3. DONE vanilla

Could you report the execution times of the vanilla implementations of II.C and
II.D? I understand that this could only be done for the smallest problem size
(1M) due to memory consumption. However, it may still be good to have this as a
reference. Perhaps Fig 12 already reports this for the nb=1 values? If so, it
would be good to highlight it in the text.

3.4.4. A few detailed suggestions for the text/plots:

  1. DONE II.C
    II.C:
    "A matrix-matrix product then follows to compute \(A_{sv} A_{vv}^{-1} A_{sv}^T\),
    for which it is not evident to exploit the sparsity of \(A_{sv}\)." Couldn't we
    use a sparse matrix-dense matrix multiply here? Could you provide more details
    on why we cannot exploit the sparsity of \(A_{sv}\)?
    

    This is indeed what we do here, a sparse matrix-matrix product (SpMM). I don't remember why we added this part of the sentence: '…for which it is not evident to exploit the sparsity of \(A_{sv}\).'

  2. DONE graph legend in Fig 10
    The graph legend in Figure 10 is a bit confusing. First, there are no black
    lines in the plot itself. My suggestion would be to explicitly mention what
    each line is using the exact color in the plot.
    
  3. DONE Fig 11 and 12 legend color match plot itself
    Fig 11 and Fig 12 are clear since there are only two lines, however, it would
    still be good to match the legend color with the plot itself.
    

    I do not know how to do that for the moment.

  4. DONE Fig 11 and 12 caption
    Fig 11 and 12 captions say "Residual memory (RAM) usage". What does residual
    memory mean here? Consider changing it to "RAM usage" unless residual has a
    special meaning here, which would then need to be described in the text.
    

4. Reviewer 4   weakaccept

4.1. Summary

Solving larger coupled sparse/dense linear systems by a direct solver is
considered. The main difficulty is the memory footprint required by the naive
computation of the Schur complement, whose result is in dense form. Two
classes of algorithms are proposed based on block-wise computation, and for each
class, two variants with/without low-rank compression are presented. In the
experiments, mainly the tradeoff between memory consumption and computational
time, which is determined by the block size parameter, is investigated. The
results of the industrial application are also provided.

4.2. Strengths

The problem appearing in an industrial application is considered, and examples
of applying the proposed algorithms to the industrial problem are also shown.
The proposed algorithms are clearly described; it is easy to understand the
differences among the algorithms. Experimental results seem to be helpful for
those who will use the proposed algorithms in their applications.

4.3. Weaknesses

The idea of the block-wise algorithm does not seem to be novel, because similar
approaches have been widely used in the context of out-of-core matrix
algorithms. The use of low-rank compression seems to be one of the features of
the proposed algorithms; however, almost no information on it is provided in the
current manuscript. Since there are several styles of low-rank compression, as
far as the reviewer knows, it is difficult to get the details of the algorithms
employing low-rank compression. One of the issues when using low-rank
compression is the computational accuracy; however, there is no investigation of
the algorithms from this viewpoint.

4.4. Overall evaluation

The manuscript is well written and provides interesting results, especially
those for actual industrial problems. However, the contribution seems to be
limited. Thus, the reviewer concludes that weak accept is currently reasonable,
and this manuscript should be kept in the second round.

4.5. Modifications for improving

4.5.1. DONE more info on low-rank

More information for low-rank compression should be provided.

4.5.2. DONE more on accuracy

Differences in the accuracy of the computed results among algorithms should be
more discussed (and investigated).

Created: 2022-02-25 Fri 18:09
