ONETEP Performance on the ARCHER2 Supercomputer
Scaling tests have been run on ARCHER2, the UK's new national supercomputer, on a system comprising a multi-layer 2D heterostructure (hBN-BP-hBN).
ARCHER2 seems to be a very good machine for ONETEP. ONETEP compiled without trouble using the gfortran compiler (PrgEnv-gnu). While the per-core performance is notably worse than on ARCHER, there are many more cores per node than ARCHER had, and the interconnect performance seems to be generally very good, so the scaling is excellent.
The graph below shows speedup, measured relative to the same runs on 256 cores, for two different balances of MPI vs OpenMP: 16 MPI processes per node, each with 8 threads, and 8 MPI processes per node, each with 16 threads. NB: spanning NUMA regions with OpenMP is a very bad idea - do not do it!
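For the recommended 16 MPI x 8 OpenMP layout, a submission script along the following lines should work on ARCHER2's 128-core nodes. This is a sketch only: the account code and executable name are placeholders, and the partition/QoS names are assumed to be ARCHER2's standard ones.

```shell
#!/bin/bash
#SBATCH --job-name=onetep-scaling
#SBATCH --nodes=8
#SBATCH --ntasks-per-node=16   # 16 MPI processes per node
#SBATCH --cpus-per-task=8      # 8 OpenMP threads each: 16 x 8 = 128 cores/node
#SBATCH --partition=standard
#SBATCH --qos=standard
#SBATCH --account=e123         # placeholder budget code

export OMP_NUM_THREADS=8
export OMP_PLACES=cores        # pin threads to cores

# block:block keeps each process's 8 threads on contiguous cores,
# so no OpenMP team spans a NUMA region (16 cores per region on ARCHER2)
srun --hint=nomultithread --distribution=block:block ./onetep.archer2 input.dat
```

For the 8 MPI x 16 OpenMP balance, change `--ntasks-per-node=8`, `--cpus-per-task=16` and `OMP_NUM_THREADS=16`; each process's threads then fill exactly one NUMA region.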
On the basis of the above, I would generally recommend 16 MPI x 8 OpenMP as a good balance for medium to large jobs. However, as seen at the last data point, this can change at very high MPI process counts. It's worth noting that for this particular system, on going from 75 nodes to 90 nodes with 8 MPI x 16 OpenMP, the maximum number of NGWFs on any MPI process falls from 17 to 16. 16 is also the default size of an FFTbox batch, which means that using 90 nodes allows all density and potential operations to be processed in a single batch with exactly one thread per FFTbox, which is very efficient and gives a big performance boost. Be on the lookout for similar load-balancing effects if you are pushing big systems to the limits of their scaling.
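The effect above is just ceiling division: the number of FFTbox batches a process needs is its NGWF count divided by the batch size, rounded up, so dropping from 17 to 16 NGWFs crosses the batch-size boundary and halves the batch count. A quick sketch of the arithmetic:

```shell
# batches NGWFS BATCH_SIZE: number of FFTbox batches needed on one
# MPI process, i.e. ceil(NGWFS / BATCH_SIZE) via integer arithmetic
batches() { echo $(( ($1 + $2 - 1) / $2 )); }

batches 17 16   # 17 NGWFs: 2 batches (one full batch of 16, plus 1 straggler)
batches 16 16   # 16 NGWFs: exactly 1 full batch
```

The second batch in the 17-NGWF case holds a single FFTbox, so 15 of the 16 threads sit idle while it runs, which is why reaching 16 NGWFs per process gives such a disproportionate speedup.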
There is an ARCHER2 config file checked into the repository; ONETEP should be compiled with it with the PrgEnv-gnu module loaded.
Nicholas Hine (Oct 2020)