Hello,
I have a question similar to the question here.
I would like to benchmark strong scaling using a modified version of demo_poisson.py on our HPC cluster. This is the code that I am using:
from dolfin import *
import time

# Mesh and function space
time0 = time.time()
mesh = UnitCubeMesh(100, 100, 100)
V = FunctionSpace(mesh, "Lagrange", 1)
time1 = time.time()

# Dirichlet boundary: the two x-faces of the unit cube
def boundary(x):
    return x[0] < DOLFIN_EPS or x[0] > 1.0 - DOLFIN_EPS

u0 = Constant(0.0)
bc = DirichletBC(V, u0, boundary)

# Variational problem
u = TrialFunction(V)
v = TestFunction(V)
f = Expression("10*exp(-(pow(x[0] - 0.5, 2) + pow(x[1] - 0.5, 2)) / 0.02)")
g = Expression("sin(5*x[0])")
a = inner(grad(u), grad(v))*dx
L = f*v*dx + g*v*ds
time2 = time.time()

# Assemble system matrix and right-hand side with the BC applied
A, b = assemble_system(a, L, bc)
time4 = time.time()

# Solve with CG preconditioned by PETSc's built-in AMG
u = Function(V)
solver = KrylovSolver(A, "cg", "petsc_amg")
solver.solve(u.vector(), b)
time5 = time.time()

list_timings()

# Write solution and mesh partitioning for inspection
File("data/u.pvd") << u
File("data/partitions.pvd") << CellFunction("size_t", mesh, MPI.rank(mesh.mpi_comm()))

# Report timings on rank 0 only
comm = mpi_comm_world()
rank = MPI.rank(comm)
if rank == 0:
    print "Mesh+FunctionSpace:", time1-time0
    print "Expressions+Forms:", time2-time1
    print "Assemble Both:", time4-time2
    print "Solve:", time5-time4
I ran this example with a varying number of MPI processes (once with one process per core and once with one process per node) and got the following results:
These results deviate significantly from published results that I found here. Compare e.g. Fig. 3, where scaling appears to be smooth down to only a few hundred DOFs per process. On our cluster the best timings are achieved with 128 processes (approximately 10k DOFs per process), but the behaviour near the optimum is chaotic, and the runtime even increases sharply if the number of processes is increased further.
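For reference, the metrics I am comparing are the usual strong-scaling speedup S(p) = T(1)/T(p) and parallel efficiency E(p) = S(p)/p. A minimal sketch of how I evaluate them from the measured solve times (the wall times in the dict below are placeholders, not my actual measurements):

# evaluate strong scaling from measured wall times
# NOTE: these values are placeholders, not actual measurements
times = {1: 1000.0, 16: 90.0, 128: 100.0}  # process count -> seconds
t1 = times[1]
for p in sorted(times):
    s = t1 / times[p]  # speedup relative to one process
    e = s / p          # parallel efficiency
    print "p=%3d  speedup=%6.1f  efficiency=%5.1f%%" % (p, s, e * 100)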
So here are my questions:
1.) Is this 'normal' behaviour, or is something broken with my example (or the hardware setup)? The maximum speedup is only a factor of 10 on 128 processes, i.e. a parallel efficiency below 10%!?
2.) Unfortunately, cg + petsc_amg was the only combination that I got running in parallel; cg + ilu fails when I execute it with mpirun on more than one process. Are there any alternative solvers or preconditioners that perform better in parallel? (See the sketch after this list for the alternatives I would like to try.)
3.) What is dolfin-hpc? Is it still up to date? Does it provide improved solvers for parallel problems?
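Regarding question 2, this is a minimal sketch of the alternatives I would try, assuming PETSc was built with Hypre (availability can be checked with the two list_* helpers; as far as I understand, PETSc's own ilu is serial-only, while Hypre's Euclid is a parallel ILU):

# show which Krylov methods and preconditioners this DOLFIN build offers
list_krylov_solver_methods()
list_krylov_solver_preconditioners()

# CG with Hypre BoomerAMG instead of petsc_amg (assumes PETSc built with Hypre)
solver = KrylovSolver("cg", "hypre_amg")
solver.set_operator(A)
solver.solve(u.vector(), b)

# CG with Hypre Euclid, a parallel ILU, as a drop-in for serial ilu
solver = KrylovSolver("cg", "hypre_euclid")
solver.set_operator(A)
solver.solve(u.vector(), b)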
Thanks in advance for any advice.
Greetings, Florian