
How to achieve good parallel efficiency with the demo_poisson.py example (2)

+2 votes

Hello,

I have a question similar to the one asked here.

I would like to benchmark strong scaling with a modified version of demo_poisson.py on our HPC cluster. This is the code I am using:

from dolfin import *
import time

time0 = time.time()
mesh = UnitCubeMesh(100,100,100)
V = FunctionSpace(mesh, "Lagrange", 1)
time1 = time.time()

def boundary(x):
  return x[0] < DOLFIN_EPS or x[0] > 1.0 - DOLFIN_EPS

u0 = Constant(0.0)
bc = DirichletBC(V, u0, boundary)
u = TrialFunction(V)
v = TestFunction(V)
f = Expression("10*exp(-(pow(x[0] - 0.5, 2) + pow(x[1] - 0.5, 2)) / 0.02)")
g = Expression("sin(5*x[0])")
a = inner(grad(u), grad(v))*dx
L = f*v*dx + g*v*ds
time2 = time.time()
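# assemble matrix and right-hand side together; assemble_system applies the Dirichlet BC symmetrically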
A, b = assemble_system(a, L, bc)
time4 = time.time()

u = Function(V)
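# conjugate gradients preconditioned with PETSc's algebraic multigrid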
solver = KrylovSolver(A, "cg", "petsc_amg")
solver.solve(u.vector(), b)
time5 = time.time()
list_timings()

File("data/u.pvd") << u
File("data/partitions.pvd") << CellFunction("size_t", mesh, MPI.rank(mesh.mpi_comm()))

comm = mpi_comm_world()
rank = MPI.rank(comm)
if rank == 0: 
  print "Mesh+FunctionSpace:", time1-time0
  print "Expressions+Forms:", time2-time1
  print "Assemble Both:", time4-time2
  print "Solve:", time5-time4
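
To rule out clock skew between the ranks, I am also considering a barrier-synchronized variant of the timing that reduces each interval to the maximum over all processes. This is only a sketch and assumes the MPI.barrier/MPI.max wrappers of the current DOLFIN Python API:

import time
from dolfin import MPI, mpi_comm_world

comm = mpi_comm_world()

def synced_time():
  # all ranks wait here, so each interval measures the slowest process
  MPI.barrier(comm)
  return time.time()

# example for one phase, e.g. wrapping the assemble_system() call above
t0 = synced_time()
# A, b = assemble_system(a, L, bc)
t1 = synced_time()
t_assemble = MPI.max(comm, t1 - t0)
if MPI.rank(comm) == 0:
  print "Assemble (max over ranks):", t_assemble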

I run this example with a varying number of MPI processes (once with one MPI process per core and once with one process per node) and get the following results: [plot: timings for the Poisson problem on the 100x100x100 cube]

These results deviate significantly from published results that I found here. Compare e.g. Fig. 3, where scaling appears to stay smooth down to only a few hundred dofs per process. On our cluster the best timings are achieved with 128 processes (roughly 8,000 dofs per process, since the P1 space on the 100x100x100 cube has 101^3 ≈ 1.03 million dofs), but the behaviour near that optimum is erratic, and the runtime even increases substantially if the number of processes is increased further.

So here are my questions:
1.) Is this 'normal' behaviour, or is something broken in my example (or in the hardware setup)? The maximum speedup is only a factor of 10 on 128 processes!?
2.) Unfortunately cg + petsc_amg was the only combination that I got running in parallel; cg + ilu fails if I execute it with mpirun on more than one process. Are there any alternative solvers or preconditioners that perform better in parallel? (See the sketch after these questions for what I have in mind.)
3.) What is dolfin-hpc? Is it still up to date, and does it provide improved solvers for parallel problems?
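
For question 2, here is a rough sketch of the alternatives I have in mind, reusing A, b and u from the listing above. Whether hypre_amg or hypre_euclid are actually available depends on how PETSc was configured on the cluster:

# list what this DOLFIN/PETSc build offers
list_krylov_solver_methods()
list_krylov_solver_preconditioners()

# cg + BoomerAMG (hypre), often a good alternative to petsc_amg for Poisson
solver = KrylovSolver("cg", "hypre_amg")
solver.parameters["relative_tolerance"] = 1.0e-8
solver.parameters["monitor_convergence"] = True  # print the residual per iteration
solver.set_operator(A)
solver.solve(u.vector(), b)

# a parallel ILU-type preconditioner, in case plain "ilu" only works in serial:
# solver = KrylovSolver("cg", "hypre_euclid")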

Thanks in advance for any advice.
Greetings, Florian

asked Sep 14, 2015 by florian_bruckner FEniCS Novice (380 points)
...