
Improve performance for solving a coupled system in parallel

+1 vote

Hey,

this is a follow-up question to this question. I want to solve a coupled system of three (nonlinear) parabolic PDEs with an IMEX method, in parallel. As written in my last comment, I am able to solve the equations in parallel using multiprocessing and sparse matrices from scipy.sparse, but I am not getting the performance boost I expected compared with solving the equations one after another in serial: the parallel version takes about 80% of the time of the serial version.

Is there any way to improve the performance further? I assemble the matrices once before the time loop and, as mentioned, use sparse matrices and the sparse solver.

Below is the relevant part of my program; most of the code is shown only for the first equation. I am grateful for any suggestions.

from __future__ import division

from dolfin import *
import numpy as np
import scipy.sparse.linalg as spla
import scipy.sparse as sparse
import multiprocessing as mp

import time

# to shut off the output
set_log_level(WARNING)

# ------------------- settings, definitions
nx = 320
T = 1.0
dt = T / nx

mesh = IntervalMesh(nx, 0, 1.0)

# definition of the function space (continuous piecewise linear elements)
V = FunctionSpace(mesh, 'CG', 1)

# ------------------- boundary conditions
# defining subdomains and boundary conditions bcs1, bcs2 and bcs3
...
# ------------------- RHS, exact solution
# defining the right hand sides and the exact solution
...

# ------------------- IMEX
# ---------- Euler: definition of the scheme
# listed only for m1_E; m2_E and m3_E are handled analogously
m1_E = TrialFunction(V)
m1_old_E = interpolate(u1_start, V)
n1_E = TestFunction(V)

G1_E = ((m1_E - m1_old_E)*n1_E*dx
        + dt*inner(nabla_grad(m1_E), nabla_grad(n1_E))*dx
        - dt*rhs1_old(m1_old_E, m2_old_E, m3_old_E)*n1_E*dx)
B1_E = lhs(G1_E)

A1_E = assemble(B1_E)
for bc in bcs1:
  bc.apply(A1_E)

A1_E_mat = A1_E.array()
sparse_A1_E_mat = sparse.csr_matrix(A1_E_mat)

M1_E = rhs(G1_E)
m1_E = interpolate(u1_start, V)   # solution function, initialised with the start value
# ---------- Euler
# ------------------- IMEX

t = dt

b1_E_vec = np.zeros(len(A1_E_mat))

# timing accumulators for the parallel/serial comparison
totaltime_parallel_IMEX = 0.0
totaltime1_IMEX = 0.0

# start 3 worker processes
pool = mp.Pool(3)

# ---------- time steps
while t <= T+1/2*dt:

    #---------- IMEX parallel
    # analogously for m2_E and m3_E 
    start_time1 = time.time()

    # assemble vector b and apply bdcs
    b1_E = assemble(M1_E)
    for bc in bcs1:
      bc.apply(b1_E)
    b1_E_vec[:] = b1_E.array()
    res1 = pool.apply_async(spla.spsolve, [sparse_A1_E_mat, b1_E_vec])

    b2_E = ...        
    b3_E = ...        

    m1_E.vector()[:] = res1.get()
    m2_E.vector()[:] = res2.get()
    m3_E.vector()[:] = res3.get()
    stop_time_parallel = time.time() - start_time1
    totaltime_parallel_IMEX += stop_time_parallel

    #---------- IMEX serial
    # analogously for m2_E and m3_E 
    start_time1 = time.time()

    # assemble vector b and apply bdcs
    b1_E = assemble(M1_E)
    for bc in bcs1:
      bc.apply(b1_E)

    # solve in place: dolfin's solve() writes into m1_E.vector() and does not return the solution
    solve(A1_E, m1_E.vector(), b1_E)

    stop_time1 = time.time() - start_time1
    totaltime1_IMEX += stop_time1

    b2_E = ...        
    b3_E = ...    

    m1_old_E.assign(m1_E)
    m2_old_E.assign(m2_E)
    m3_old_E.assign(m3_E)

    # ---------- computation of error with exact solution

    t += dt

pool.close()
asked Apr 1, 2014 by bobby53 FEniCS User (1,260 points)

1 Answer

+2 votes

The PETSc and Epetra backends are optimized for parallel performance; you could consider one of these. They also offer a variety of Krylov solvers and preconditioners that you can test.
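
As a rough illustration, reusing the assembled operator with a Krylov solver and the PETSc backend could look like the sketch below. This is not from the original answer; it assumes the legacy DOLFIN Python interface and the variable names from the question, and the available solver/preconditioner names depend on your installation.

# minimal sketch, assuming the legacy DOLFIN Python API with the PETSc backend
from dolfin import *

parameters["linear_algebra_backend"] = "PETSc"

# A1_E is assembled once before the time loop, as in the question
solver = KrylovSolver("gmres", "ilu")              # Krylov method + preconditioner
solver.set_operator(A1_E)                          # operator is reused in every time step
solver.parameters["nonzero_initial_guess"] = True  # warm start from the previous step

# inside the time loop, after assembling b1_E and applying bcs1:
solver.solve(m1_E.vector(), b1_E)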

answered Apr 1, 2014 by Kent-Andre Mardal FEniCS Expert (14,380 points)

What I want to do is call the three solvers on different cores efficiently, rather than calling them one after another, even if each individual solver runs in parallel.

OK, I see. Then it might depend on the machine you are using. The application will need quite a bit of memory, and the performance might be dominated by memory transfers rather than computations. This is often the case for FEM codes on unstructured meshes.

Yes, I thought so. What I wanted to know is whether there are ways to speed up the memory transfer and/or reduce the overhead created by the parallelization.
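
One possible way to cut the per-step cost, sketched below under the assumption that the matrices stay constant in time (as in the code above): factorize each sparse matrix once with scipy.sparse.linalg.splu before the loop. Each step then only needs a forward/backward substitution, which may be cheap enough that the multiprocessing pool (and the cost of shipping data to it) is no longer worthwhile. This is a sketch, not from the original thread, using the variable names from the question.

# minimal sketch, assuming the constant-in-time matrices from the question
import scipy.sparse.linalg as spla

# factorize once, before the time loop (splu expects CSC format)
lu1_E = spla.splu(sparse_A1_E_mat.tocsc())

# inside the time loop, after assembling b1_E and applying bcs1:
m1_E.vector()[:] = lu1_E.solve(b1_E_vec)   # forward/backward substitution only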

...