You probably want to go with HPC, but it depends on your problem. If you have one small problem that you want to solve repeatedly for a wide range of parameters, with each run independent of the others, then HTC is the way to go. If you want to solve a single large problem by building one large mesh that is then decomposed across a large number of processors, then you need HPC.
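To make the HTC pattern concrete, here is a minimal sketch in plain Python (not FEniCS-specific; `solve_one` is a hypothetical stand-in for your small solver) that farms the same problem out over a parameter range, one independent run per parameter:

```python
from multiprocessing import Pool

def solve_one(alpha):
    """Hypothetical stand-in for one small, independent solve."""
    # A tiny fixed-point iteration depending on the parameter alpha
    # (Newton's method for sqrt(alpha), just as a placeholder).
    x = 1.0
    for _ in range(100):
        x = 0.5 * (x + alpha / x)
    return alpha, x

if __name__ == "__main__":
    params = [1.0, 4.0, 9.0, 16.0]
    with Pool(processes=4) as pool:
        # Each parameter is an independent job with no communication
        # between runs -- exactly what makes the workload HTC, not HPC.
        results = dict(pool.map(solve_one, params))
    print(results)
```

On a real cluster the same pattern is usually expressed as a job array (e.g. one scheduler task per parameter) rather than a single multiprocessing pool, but the key property is the same: the runs never talk to each other.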
You can generally use binary installations, but on clusters I think installing from source is the way to go, so the build links against the cluster's own MPI and linear algebra libraries.
Almost all FEniCS demos run in parallel without modification using mpirun, for example:
mpirun -np 2 python demo_poisson.py
There is not much more to it, unless you want to dig deeper and manipulate degrees of freedom (dofs) directly. assemble, solve, and most other functions are collective: you call them exactly as in the single-process case, and each process works on its own partition of the mesh.
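As an illustration, here is a minimal Poisson solve, a sketch assuming the legacy dolfin Python API, that runs unchanged both in serial and under `mpirun -np N`. Under MPI the mesh is partitioned automatically, and the assembly inside solve happens collectively:

```python
from dolfin import *

# The mesh is partitioned across processes automatically under MPI.
mesh = UnitSquareMesh(32, 32)
V = FunctionSpace(mesh, "P", 1)

u, v = TrialFunction(V), TestFunction(V)
f = Constant(1.0)
a = inner(grad(u), grad(v)) * dx
L = f * v * dx
bc = DirichletBC(V, Constant(0.0), "on_boundary")

# solve() assembles and solves collectively; the call looks identical
# whether the script runs on one process or many.
uh = Function(V)
solve(a == L, uh, bc)

# Each process only sees its local part of the mesh.
print("process", MPI.rank(mesh.mpi_comm()), "owns", mesh.num_cells(), "cells")
```

Running this with `mpirun -np 4 python poisson.py` should print one line per process, each reporting roughly a quarter of the cells.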