I've been finding that my Matrix assignment and addition operations become the bottleneck when I start scaling out, with 256 cores as my test case. It's enough that performance goes from 60 ms per DOF under a shared-memory mpirun to 700 ms per DOF under an InfiniBand mpirun.
The Matrix assignment and addition operations are the result of breaking apart my bilinear forms so that I can avoid reassembling them during integration, so I'm happy to be at this point. At each iteration I am doing A = A_part1 + dt*A_part2.
I'm using C++ to interface with DOLFIN, with PETSc as my linear algebra backend. I did notice something that appears inconsistent between the Matrix and Vector assignment operator implementations. The PETScMatrix assignment operator throws an error if the Matrix has already been initialized:
const PETScMatrix& PETScMatrix::operator= (const PETScMatrix& A)
{
  if (!A.mat())
  {
    if (_matA)
    {
      dolfin_error("PETScMatrix.cpp",
                   "assign to PETSc matrix",
                   "PETScMatrix may not be initialized more than once.");
      MatDestroy(&_matA);
    }
    _matA = NULL;
  }
  else if (this != &A) // Check for self-assignment
  {
    if (_matA)
    {
      // Get reference count to _matA
      PetscInt ref_count = 0;
      PetscObjectGetReference((PetscObject)_matA, &ref_count);
      if (ref_count > 1)
      {
        dolfin_error("PETScMatrix.cpp",
                     "assign to PETSc matrix",
                     "More than one object points to the underlying PETSc object");
      }
      dolfin_error("PETScMatrix.cpp",
                   "assign to PETSc matrix",
                   "PETScMatrix may not be initialized more than once.");
      MatDestroy(&_matA);
    }

    // Duplicate with the same pattern as A.A
    PetscErrorCode ierr = MatDuplicate(A.mat(), MAT_COPY_VALUES, &_matA);
    if (ierr != 0) petsc_error(ierr, __FILE__, "MatDuplicate");
  }
  return *this;
}
But what I need, or at least what I think I need, is this PETSc operation on my existing Matrix (a copy rather than a duplicate would be ideal):
PetscErrorCode ierr = MatDuplicate(A.mat(), MAT_COPY_VALUES, &_matA);
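I believe the in-place counterpart would be MatCopy, which writes into an already-allocated matrix instead of creating a new one. An untested sketch on my part, assuming _matA is already assembled with the same nonzero pattern as A:

// Assumption: _matA exists and shares A's nonzero pattern, so the
// values can be copied over without any new allocation.
PetscErrorCode ierr = MatCopy(A.mat(), _matA, SAME_NONZERO_PATTERN);
if (ierr != 0) petsc_error(ierr, __FILE__, "MatCopy");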
Interestingly, the PETScVector assignment operator behaves differently:
const PETScVector& PETScVector::operator= (const PETScVector& v)
{
  // Check that vector lengths are equal
  if (size() != v.size())
  {
    dolfin_error("PETScVector.cpp",
                 "assign one vector to another",
                 "Vectors must be of the same length when assigning. "
                 "Consider using the copy constructor instead");
  }

  // Check that vector local ranges are equal (relevant in parallel)
  if (local_range() != v.local_range())
  {
    dolfin_error("PETScVector.cpp",
                 "assign one vector to another",
                 "Vectors must have the same parallel layout when assigning. "
                 "Consider using the copy constructor instead");
  }

  // Check for self-assignment
  if (this != &v)
  {
    // Copy data (local operation)
    dolfin_assert(v._x);
    dolfin_assert(_x);
    PetscErrorCode ierr = VecCopy(v._x, _x);
    if (ierr != 0) petsc_error(ierr, __FILE__, "VecCopy");

    // Update ghost values
    update_ghost_values();
  }
  return *this;
}
Here no error is thrown if the Vector has already been initialized; instead you get a nice PETSc copy from:
PetscErrorCode ierr = VecCopy(v._x, _x);
That aside, how can I implement A = A_part1 + dt*A_part2 as a nice copy + axpy?
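My current guess, dropping down to the raw PETSc handles, looks like this (untested sketch; it assumes A, A_part1, and A_part2 are PETScMatrix objects that were all assembled with identical nonzero patterns):

// A = A_part1: copy values into the existing matrix, no reallocation
PetscErrorCode ierr = MatCopy(A_part1.mat(), A.mat(), SAME_NONZERO_PATTERN);
if (ierr != 0) petsc_error(ierr, __FILE__, "MatCopy");

// A += dt * A_part2: in-place AXPY with a matching sparsity pattern
ierr = MatAXPY(A.mat(), dt, A_part2.mat(), SAME_NONZERO_PATTERN);
if (ierr != 0) petsc_error(ierr, __FILE__, "MatAXPY");

If I remember the DOLFIN interface correctly, GenericMatrix also exposes axpy(a, A, same_nonzero_pattern), which should cover the second step; it's the initial copy that the assignment operator blocks.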
Edit: For vectors I can do the operations locally on each worker like so:
std::vector<double> temp_val;
std::dynamic_pointer_cast<PETScVector>(other.c->vector())->apply("");  // finalize source vector
other.c->vector()->get_local(temp_val);  // read this worker's local values
c->vector()->set_local(temp_val);        // write them into the destination
std::dynamic_pointer_cast<PETScVector>(c->vector())->apply("");  // finalize destination
There is a set_local for a Matrix, but no get_local?
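The closest stand-in I can see is the row-wise interface, something like this (untested sketch using GenericMatrix::getrow/setrow; src and dst are placeholder names, and I'm assuming both matrices have identical parallel layouts so every row below is locally owned):

// Row-by-row "get_local"/"set_local" substitute for a matrix.
// Assumption: src and dst share the same local row ownership, so
// this loop only touches locally owned rows (no communication).
std::vector<std::size_t> cols;
std::vector<double> vals;
const std::pair<std::size_t, std::size_t> range = src.local_range(0);
for (std::size_t row = range.first; row < range.second; ++row)
{
  src.getrow(row, cols, vals);
  dst.setrow(row, cols, vals);
}
dst.apply("insert");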
Edit2: Just to note, these should all be local copies... I just need a purely local copy of the values, with no updates across workers.