ONETEP Performance on the ARCHER Supercomputer
Scaling tests have been run on the new ARCHER supercomputer (the UK's national supercomputer) on various ~4000-atom systems: a carbon nanotube, a GaAs nanorod, bulk silicon, and 64 base pairs of DNA. These were run for 1 NGWF iteration with production-quality settings (large kernel thresholds, 8.0 a0 NGWF radii, 800 eV cutoffs).
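For readers wanting to reproduce runs of this kind, the settings above might look roughly like the following fragment of a ONETEP input file. This is a sketch only: the keyword names (cutoff_energy, ngwf_radius, kernel_cutoff, maxit_ngwf_cg) are taken from the ONETEP keyword documentation as I recall it, and the kernel cutoff value is illustrative, not the value used in these tests.

```text
! Production-quality settings sketch (values per the text; kernel_cutoff illustrative)
cutoff_energy   : 800.0 eV    ! psinc grid kinetic energy cutoff
ngwf_radius     : 8.0         ! NGWF localisation radius in bohr (a0)
kernel_cutoff   : 1000.0      ! large density-kernel truncation radius
maxit_ngwf_cg   : 1           ! single NGWF optimisation iteration, as in the tests
```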
ARCHER seems to be a genuinely excellent machine for ONETEP. It is set up very well and ONETEP compiled with no trouble. The speed is very impressive (one can hit 20 Tflops of FFT performance when running on 3840 cores), interconnect performance is generally very good, and scaling is excellent.
The graphs below show speedup and parallel efficiency, measured relative to the same runs on just 48 cores with all-MPI, for 5 different balances of MPI vs OpenMP, ranging from 24 MPI processes per node with 1 thread each to 2 MPI processes per node with 12 threads each. NB: spanning NUMA regions with OpenMP is a very bad idea - do not do it!
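As a concrete illustration of the NUMA advice, a hybrid launch on ARCHER's 24-core nodes (two 12-core NUMA regions per node) can pin one MPI rank per NUMA region so that no OpenMP team spans a region boundary. The script below is a hypothetical sketch, not the job script used for these tests: the node count, walltime, and executable/input names are all assumptions.

```shell
#!/bin/bash --login
# Hypothetical ARCHER job sketch: 2 MPI ranks per node x 12 OpenMP threads.
# 80 nodes x 24 cores = 1920 cores; 160 MPI ranks in total (illustrative).
#PBS -l select=80
#PBS -l walltime=1:00:00

export OMP_NUM_THREADS=12

# -n: total MPI ranks, -N: ranks per node, -d: threads per rank,
# -S: ranks per NUMA region. With -S 1 each rank's 12 threads stay
# within one NUMA region, so OpenMP never spans regions.
aprun -n 160 -N 2 -d 12 -S 1 ./onetep input.dat > output.out
```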
The figures show that using more OpenMP threads does incur a parallel efficiency hit (some parts of the code still do not scale with thread count), but it preserves the MPI scaling to much higher core counts without nearly as much loss of parallel efficiency. In all cases one can scale to around 2000 cores before losing half the parallel efficiency of the 48-core run.
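For reference, speedup and parallel efficiency relative to a small-run baseline (48 cores here) are computed as follows; the timings in the example are made-up numbers for illustration, not measured ARCHER data.

```python
def speedup(t_base, t_p):
    """Speedup of a run taking t_p seconds relative to a baseline taking t_base."""
    return t_base / t_p

def parallel_efficiency(t_base, cores_base, t_p, cores_p):
    """Parallel efficiency relative to the baseline: speedup scaled by the
    ratio of core counts (1.0 would mean perfect scaling from the baseline)."""
    return speedup(t_base, t_p) * cores_base / cores_p

# Illustrative numbers: a 48-core baseline taking 1000 s, a 2000-core run at 60 s.
s = speedup(1000.0, 60.0)                         # ~16.7x
pe = parallel_efficiency(1000.0, 48, 60.0, 2000)  # ~0.40, i.e. below half efficiency
print(f"speedup = {s:.1f}, efficiency = {pe:.2f}")
```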
There is an ARCHER config file checked into the repository; ONETEP should be compiled with the PrgEnv-intel module loaded.
Nicholas Hine (Nov 2013)