I've been finding that my Matrix assignment and addition operations become the bottleneck when I start scaling out, with 256 cores as my test case. It's enough that performance goes from 60 ms per DOF under a shared-memory mpirun to 700 ms per DOF under an InfiniBand mpirun.
The Matrix assignment and addition operations are the result of breaking apart my bilinear forms so that I can avoid reassembling them during integration, so I'm happy to be at this point. At each iteration I am doing A = A_part1 + dt*A_part2.
I'm using C++ to interface with DOLFIN, with PETSc as my linear algebra backend. I did notice something that appears inconsistent between the Matrix and Vector assignment operator implementations. The PETScMatrix assignment operator throws an error if the Matrix has already been initialized:
const PETScMatrix& PETScMatrix::operator= (const PETScMatrix& A)
{
  if (!A.mat())
  {
    if (_matA)
    {
      dolfin_error("PETScMatrix.cpp",
                   "assign to PETSc matrix",
                   "PETScMatrix may not be initialized more than once.");
      MatDestroy(&_matA);
    }
    _matA = NULL;
  }
  else if (this != &A) // Check for self-assignment
  {
    if (_matA)
    {
      // Get reference count to _matA
      PetscInt ref_count = 0;
      PetscObjectGetReference((PetscObject)_matA, &ref_count);
      if (ref_count > 1)
      {
        dolfin_error("PETScMatrix.cpp",
                     "assign to PETSc matrix",
                     "More than one object points to the underlying PETSc object");
      }
      dolfin_error("PETScMatrix.cpp",
                   "assign to PETSc matrix",
                   "PETScMatrix may not be initialized more than once.");
      MatDestroy(&_matA);
    }

    // Duplicate with the same pattern as A.A
    PetscErrorCode ierr = MatDuplicate(A.mat(), MAT_COPY_VALUES, &_matA);
    if (ierr != 0) petsc_error(ierr, __FILE__, "MatDuplicate");
  }
  return *this;
}
But what I need, or at least what I think I need, is this PETSc operation on my existing Matrix (a copy rather than a duplicate would be ideal):
PetscErrorCode ierr = MatDuplicate(A.mat(), MAT_COPY_VALUES, &_matA);
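I believe the in-place counterpart would be MatCopy, which writes into an already-allocated matrix instead of creating a new one. An untested sketch on my part, assuming _matA is already assembled with the same nonzero pattern as A:

// Assumption: _matA exists and shares A's nonzero pattern, so the
// values can be copied over without any new allocation.
PetscErrorCode ierr = MatCopy(A.mat(), _matA, SAME_NONZERO_PATTERN);
if (ierr != 0) petsc_error(ierr, __FILE__, "MatCopy");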
Interestingly, the PETScVector assignment operator behaves differently:
const PETScVector& PETScVector::operator= (const PETScVector& v)
{
  // Check that vector lengths are equal
  if (size() != v.size())
  {
    dolfin_error("PETScVector.cpp",
                 "assign one vector to another",
                 "Vectors must be of the same length when assigning. "
                 "Consider using the copy constructor instead");
  }

  // Check that vector local ranges are equal (relevant in parallel)
  if (local_range() != v.local_range())
  {
    dolfin_error("PETScVector.cpp",
                 "assign one vector to another",
                 "Vectors must have the same parallel layout when assigning. "
                 "Consider using the copy constructor instead");
  }

  // Check for self-assignment
  if (this != &v)
  {
    // Copy data (local operation)
    dolfin_assert(v._x);
    dolfin_assert(_x);
    PetscErrorCode ierr = VecCopy(v._x, _x);
    if (ierr != 0) petsc_error(ierr, __FILE__, "VecCopy");

    // Update ghost values
    update_ghost_values();
  }
  return *this;
}
Here no error is thrown if the Vector has already been initialized; instead you get a nice PETSc copy from:
PetscErrorCode ierr = VecCopy(v._x, _x);
That aside, how can I implement A = A_part1 + dt*A_part2 as a nice copy + axpy?
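My current guess, dropping down to the raw PETSc handles, looks like this (untested sketch; it assumes A, A_part1, and A_part2 are PETScMatrix objects that were all assembled with identical nonzero patterns):

// A = A_part1: copy values into the existing matrix, no reallocation
PetscErrorCode ierr = MatCopy(A_part1.mat(), A.mat(), SAME_NONZERO_PATTERN);
if (ierr != 0) petsc_error(ierr, __FILE__, "MatCopy");

// A += dt * A_part2: in-place AXPY with a matching sparsity pattern
ierr = MatAXPY(A.mat(), dt, A_part2.mat(), SAME_NONZERO_PATTERN);
if (ierr != 0) petsc_error(ierr, __FILE__, "MatAXPY");

If I remember the DOLFIN interface correctly, GenericMatrix also exposes axpy(a, A, same_nonzero_pattern), which should cover the second step; it's the initial copy that the assignment operator blocks.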
Edit: For vectors I can do the operations locally on each worker like so:
std::vector<double> temp_val;
std::dynamic_pointer_cast<PETScVector>(other.c->vector())->apply("");  // finalize source vector
other.c->vector()->get_local(temp_val);  // read this worker's local values
c->vector()->set_local(temp_val);        // write them into the destination
std::dynamic_pointer_cast<PETScVector>(c->vector())->apply("");  // finalize destination
There is a set_local for a Matrix, but no get_local?
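The closest stand-in I can see is the row-wise interface, something like this (untested sketch using GenericMatrix::getrow/setrow; src and dst are placeholder names, and I'm assuming both matrices have identical parallel layouts so every row below is locally owned):

// Row-by-row "get_local"/"set_local" substitute for a matrix.
// Assumption: src and dst share the same local row ownership, so
// this loop only touches locally owned rows (no communication).
std::vector<std::size_t> cols;
std::vector<double> vals;
const std::pair<std::size_t, std::size_t> range = src.local_range(0);
for (std::size_t row = range.first; row < range.second; ++row)
{
  src.getrow(row, cols, vals);
  dst.setrow(row, cols, vals);
}
dst.apply("insert");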
Edit2: Just to note, these should all be local copies... I just need a purely local copy of the values, with no updates across workers.