
Some advice requested to decrease solution time

0 votes

I wrote some software using FEniCS for the finite element related parts and, although I am happy with the result, I have the feeling it could run a lot faster.

The time-consuming part can be boiled down to this:

Repeatedly solve the following linear system for (a few thousand) different values of a scalar $\varepsilon$:
$\quad \quad \qquad (A + \varepsilon B)\, x = b$
where $A$ and $B$ are known, non-symmetric matrices and $x$ is the unknown; I will spare you the weak form they originate from.

I used to re-assemble every time that $\varepsilon$ changed, but that is silly of course. So now I assemble $A$ and $B$ once, cast them to petsc4py matrices, and add them using A.axpy(epsilon, B, 1), where the 1 (SUBSET_NONZERO_PATTERN) makes sure the nonzero pattern is not recalculated every time (a sketch of this pattern follows the list below). This works, but a few things bother me:

  • For every different $\varepsilon$, it takes approximately 0.05 seconds to solve this (say 9500x9500) sparse system. I tried LUSolver('petsc') and LUSolver('umfpack'), as those are the options available to me. If I cast the sparse matrix to a scipy type (i.e. scipy.sparse.csr_matrix(A.array())), it solves a factor of 6 or even 500 faster, using spsolve and bicg respectively. This is not worth doing, as the casting itself takes too much time. Hence the questions: 1) Is there an efficient way to cast to a scipy-friendly format, i.e. without the need for a dense array? 2) Any idea where the difference comes from, since PETSc should be on an equal footing with scipy's libraries?

  • If I try the same with KrylovSolver, every solve takes several seconds. Is a Krylov solver recommended for this kind of problem? I tried using the solution of the previous iteration as an initial guess for the next, but that did not help much. Probably my preconditioner is badly chosen, but I do not really have an idea how to choose it here. Any light on the matter would be greatly appreciated.
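
A minimal sketch of the assemble-once/axpy pattern mentioned above (the form names a_form and b_form and the list epsilons are placeholders, not from my actual code):

from dolfin import assemble, as_backend_type
from petsc4py import PETSc

A = as_backend_type(assemble(a_form)).mat()  # petsc4py Mat behind the dolfin matrix
B = as_backend_type(assemble(b_form)).mat()
# assumes B's nonzero pattern is a subset of A's;
# otherwise preallocate M with the union pattern
M = A.duplicate(copy=True)                   # work matrix holding A + eps*B

SAME = PETSc.Mat.Structure.SAME_NONZERO_PATTERN
SUBSET = PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN
for eps in epsilons:
    A.copy(M, structure=SAME)  # reset M to A without touching the pattern
    M.axpy(eps, B, SUBSET)     # add eps*B; pattern is not recalculated
    # ... solve with M ...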

PS: I am stuck with version 1.4.

asked Mar 22, 2016 by maartent FEniCS User (3,910 points)
edited Mar 22, 2016 by maartent

If I understand correctly, solving with bicg in scipy is very fast. This is a Krylov method without any preconditioner, right? Why don't you just use an unpreconditioned BiCGstab Krylov solver from PETSc?

Thanks for the comment. I noticed later that, although bicg in scipy was much faster, the solutions were wrong... so in conclusion no Krylov solver works, except very slow ones. The direct solvers in scipy are a few times faster than those of PETSc, but the conversion to a scipy matrix format copies data, which makes the total solution time longer.

Hi maartent, I am stuck trying to use the solution of the previous iteration as an initial guess for the next, so I am asking here. Could you please give me an example of how to implement this? Thank you very much in advance.
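
A minimal sketch of how this can be done with the dolfin 1.4-era KrylovSolver API (M, b, U and epsilons are placeholders; if M is a petsc4py Mat it needs wrapping in a dolfin PETScMatrix first):

from dolfin import KrylovSolver

solver = KrylovSolver("bicgstab", "none")          # unpreconditioned BiCGStab
solver.parameters["nonzero_initial_guess"] = True  # keep x's current values as the start
x = U.vector()                                     # solution vector, reused between solves

for eps in epsilons:
    # ... update the matrix M for this eps ...
    solver.set_operator(M)
    solver.solve(x, b)  # x still holds the previous solution -> initial guess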

1 Answer

0 votes
 
Best answer

For such small systems, I don't think Krylov solvers will help you unless you do something smart. There are techniques such as reusing search vectors that may be possible, but they depend on A, B and eps and may not be straightforward to implement. Reducing the number of epsilons would be the simplest approach, I guess.

answered Mar 22, 2016 by Kent-Andre Mardal FEniCS Expert (14,380 points)
selected Mar 23, 2016 by maartent

Reducing the number of epsilons is not an option, I am afraid :-)
I will soon try the same for 3D geometries and compare with direct solvers again. Is there a rule of thumb for the problem size above which a Krylov solver pays off?

I am still in the dark as to why the direct solvers of PETSc are so slow compared to scipy (LAPACK?). It sometimes also takes SLEPc more than a second to solve a 20x20 eigenvalue problem. Are they just not meant to be used for small systems, or is there another possible culprit?

Hi, there is no need to use a dense array and then convert to a CSR matrix. Depending on your backend, you can use sparray or the conversion function shown here. Without A and B, answering 2) (the second question) is very difficult. As for SLEPc, you are right: it needs large-scale eigenvalue problems to shine.

In general, preconditioned Krylov methods should be better when you have more than a few hundred thousand unknowns - if you have a good preconditioner. E.g. test the following code:

from dolfin import *
from time import time

lu_time = []
cg_time = []
cgilu_time = []
cgamg_time = []
dofs = []

parameters["krylov_solver"]["relative_tolerance"] = 1.0e-8
parameters["krylov_solver"]["absolute_tolerance"] = 1.0e-8

for N in [32, 64, 128, 256]:
    mesh = UnitSquareMesh(N, N)
    print " N ", N, " dofs ", mesh.num_vertices()
    dofs.append(mesh.num_vertices())
    V = FunctionSpace(mesh, "Lagrange", 1)
    u = TrialFunction(V)
    v = TestFunction(V)

    f = Expression("sin(x[0]*12) - x[1]")
    a = u*v*dx + inner(grad(u), grad(v))*dx
    L = f*v*dx

    U = Function(V)

    A = assemble(a)
    b = assemble(L)

    t0 = time()
    solve(A, U.vector(), b, "lu")
    t1 = time()
    print "Time for lu ", t1 - t0
    lu_time.append(t1 - t0)

    t0 = time()
    U.vector()[:] = 0
    solve(A, U.vector(), b, "cg")
    t1 = time()
    print "Time for cg ", t1 - t0
    cg_time.append(t1 - t0)

    t0 = time()
    U.vector()[:] = 0
    solve(A, U.vector(), b, "cg", "ilu")
    t1 = time()
    print "Time for cg/ilu ", t1 - t0
    cgilu_time.append(t1 - t0)

    t0 = time()
    U.vector()[:] = 0
    solve(A, U.vector(), b, "cg", "amg")
    t1 = time()
    print "Time for cg/amg ", t1 - t0
    cgamg_time.append(t1 - t0)

import pylab

pylab.plot(dofs, lu_time)
pylab.plot(dofs, cg_time)
pylab.plot(dofs, cgilu_time)
pylab.plot(dofs, cgamg_time)
pylab.legend(["lu", "cg", "cg/ilu", "cg/amg"])
pylab.show()

Your example shows that for 1000 unknowns, LU solvers are already worse than iterative solvers. In my case (ca. 10 000 unknowns) LU solvers need 0.05 seconds, whereas for cg or cg/ilu or cg/petsc_amg I get:

*** Error: Unable to solve linear system using PETSc Krylov solver.
*** Reason: Solution failed to converge in 10000 iterations (PETSc reason DIVERGED_ITS, residual norm ||r|| = 7.575692e-08).

I guess my system is particularly ill-conditioned, and not well suited for Krylov solvers. Is there anything I can do about that, by any chance? Since I solve a similar system thousands of times, maybe it is worth computing the inverse once and using that as a preconditioner? My knowledge about iterative solvers is rather sparse, so I am just guessing.
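
For what it's worth, that "invert once, precondition later" idea can be tried directly in petsc4py: with PC type lu, PETSc factorizes the preconditioner matrix once and applies its inverse in every iteration. A sketch, assuming M0 is the matrix assembled for the first epsilon and M the current one:

from petsc4py import PETSc

ksp = PETSc.KSP().create()
ksp.setType("gmres")
ksp.getPC().setType("lu")    # exact factorization of the preconditioner matrix
ksp.setOperators(M, M0)      # iterate on M, precondition with M0^{-1}
ksp.setTolerances(rtol=1e-8)

x, rhs = M.createVecs()      # (right, left) work vectors
# ... fill rhs ...
ksp.solve(rhs, x)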

On the other hand, is there any explanation why direct solvers of scipy outperform those of PETSc and UMFPACK so much?

Thanks a lot for your time in any case.

@MiroK: I'm a bit in a pickle here. At the moment I cast the matrices A and B to a petsc4py type, add them and then cast to a PETScMatrix to use in the solver. Since I found the scipy sparse solvers to work ~100x faster, I need a scipy.sparse.csr_matrix type. However, I only have PETSc and uBLAS backends. So if I understand it, I could:

  1. cast the petsc4py object to a uBLASMatrix, and use the method .sparray(). Unfortunately, I can find PETScMatrix and EigenMatrix in the docs, but nothing related to uBLAS.

  2. extract the data from the petsc4py matrix. This will probably be easiest, but I am looking now for a way to do this (preferably without copying the data).

The function you linked to works perfectly, but it is very slow (50 ms), so it is not very useful in my case, where solving the sparse system itself takes only 0.5 ms. Thanks though.

EDIT
I followed the second approach. It looks like this (can also be used instead of the function you linked to):

import scipy.sparse
from dolfin import assemble, as_backend_type

A = assemble(a)  # say A is a PETScMatrix
A_petsc4py = as_backend_type(A).mat()   # underlying petsc4py Mat
ai, aj, av = A_petsc4py.getValuesCSR()  # CSR row pointers, column indices, values
Asp = scipy.sparse.csr_matrix((av, aj, ai))

Unfortunately, this still copies the data ai, aj and av when creating the scipy sparse matrix.

In the example code, the preconditioner is very efficient because the PDE is elliptic. It is _the_ example where preconditioners really work well. But there is only a big difference between the different algorithms when you have systems of, say, 50k dofs. Your example is quite different, since you will do thousands of runs, and even a small improvement may give significant savings. You should look into reusing sparsity patterns or preconditioners from previous runs. You could also give us some more info on the problem, so maybe we can help more.

To be a bit more specific, the problem is to solve the following system for $\mathbf{x}$:
$\left(\mathbf{A} + \varepsilon \mathbf{B} + \sum_i k_i(\varepsilon)\, \mathbf{c}_i \mathbf{d}_i^T\right)\mathbf{x} = \mathbf{b}$

To give an idea, I give the weak forms of $\mathbf{A}$ and $\mathbf{B}$ below. The $\mathbf{c}_i$'s and $\mathbf{d}_i$'s are a bit more complicated, but the important point is that they make the global matrix non-symmetric.

from dolfin import *

V = FunctionSpace(mesh, 'CG', 1)  # mesh defined elsewhere
Vc = V * V                        # mixed space: real and imaginary parts
u_real, u_imag = TrialFunctions(Vc)
v_real, v_imag = TestFunctions(Vc)
rho = SpatialCoordinate(mesh)[1]
U = Function(V)  # a known function
a = (u_real.dx(1) * v_real + u_imag.dx(1) * v_imag) / rho * dx \
    - (u_real.dx(0) * v_real.dx(0) + u_imag.dx(0) * v_imag.dx(0)) * dx \
    - (u_real.dx(1) * v_real.dx(1) + u_imag.dx(1) * v_imag.dx(1)) * dx \
    - (u_real * v_real + u_imag * v_imag) / rho**2 * dx \
    + U * (u_real * v_real + u_imag * v_imag) * dx
b = (u_real * v_real + u_imag * v_imag) * dx

For the moment, I assemble $\mathbf{A}$, $\mathbf{B}$ and the different $\mathbf{c}_i$'s and $\mathbf{d}_i$'s once in advance. Subsequently, I convert them to petsc4py using dolfin.as_backend_type(x).mat() and then recombine them for each $\varepsilon$. To this end, I defined a matrix M with a suitable nonzero pattern.

    from petsc4py import PETSc

    M.zeroEntries()
    # SUBSET_NONZERO_PATTERN (= 1): skip recomputing the nonzero pattern
    nonzero_pattern = PETSc.Mat.Structure.SUBSET_NONZERO_PATTERN
    M.axpy(1.0, A, nonzero_pattern)
    M.axpy(epsilon, B, nonzero_pattern)
    for i in ...:
        M.axpy(k, cd, nonzero_pattern)

By far the most time-consuming part, however, is solving this system. I use an LU solver now. The convergence of Krylov solvers is terrible, if they converge at all. The solutions for different $\varepsilon$'s are very similar, so it would be convenient if I could find a suitable preconditioner and use the previous solution as an initial guess. I also considered computing the inverse of M for the first problem and then using that as a preconditioner.
I have little expertise in the world of Krylov solvers though.

@Kent-Andre Mardal
Can you explain a little bit what you mean by "reusing preconditioners from previous runs", and perhaps how it works in FEniCS?

The stokes-iterative demo illustrates how to create a preconditioner from a matrix that is not the actual matrix you want to invert.
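
In dolfin that mechanism looks roughly like this (a sketch; the preconditioner form p is problem-dependent and only illustrative, and x and b are the solution and right-hand side vectors):

from dolfin import assemble, KrylovSolver

A = assemble(a)  # the operator you actually want to invert
P = assemble(p)  # a simpler/cheaper form used only to build the preconditioner
solver = KrylovSolver("gmres", "ilu")
solver.set_operators(A, P)  # Krylov iterations on A, preconditioner built from P
solver.solve(x, b)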

...