High Performance Computing 

Bigger and faster ...

 

The current trend, which actually started some 20 years ago with the advent of the first modern supercomputers and the MPI standard, is to compute bigger systems in a shorter time. The motivations for this are essentially threefold:

  1. for systems of constant size, capture finer details of the flow dynamics (this direction has led to the development of Direct Numerical Simulation approaches),
  2. for a given accuracy of the computed solution (which basically translates into "for a given grid size"), simulate larger systems in an attempt to get closer to human-scale systems,
  3. make computations faster, partly because humankind is largely impatient by nature, but also to allow researchers to conduct a parametric survey in a reasonable time.

 

Nowadays, supercomputers are built on architectures comprising from a few hundred to a few hundred thousand cores, i.e., an unprecedented computing power at the disposal of researchers. And the power of supercomputers will keep growing in the future at the pace of doubling every 18 months. This is a formidable opportunity to extend our comprehension of the surrounding world, and, for fluid mechanicists, a powerful means to better understand the intricate dynamics of fluid flows.

 

Though the majority of Fluid Mechanics parallel codes are implemented with the classical MPI standard, recent advances in computer hardware are worth noticing:

  • the GPU technology and CUDA-like programming languages,
  • multi-core processors and their grouping on nodes in supercomputers, with in total up to 16 or 32 cores sharing the same memory; this raises new challenges in memory access and data management that require updating the (even serial) architecture of codes and possibly moving to a hybrid OpenMP/MPI strategy (shared-memory OpenMP on each node and MPI communications between the nodes), as illustrated in the sketch after this list.
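
For illustration only, the following minimal sketch shows the hybrid MPI/OpenMP pattern evoked above: OpenMP threads share the memory of a node while MPI handles communications between nodes. The array and the reduction are arbitrary placeholders and are not taken from PeliGRIFF.

```cpp
// Minimal hybrid MPI/OpenMP sketch (illustrative only, not PeliGRIFF code):
// shared-memory OpenMP parallelism inside a node, MPI communication between nodes.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv)
{
  // Request "funneled" thread support: only the master thread makes MPI calls.
  int provided = 0;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Node-local data, shared by the OpenMP threads of this MPI process.
  const int n = 1000000;
  std::vector<double> cell(n, 1.0);

  // Shared-memory parallelism inside the node ...
  double local_sum = 0.0;
  #pragma omp parallel for reduction(+ : local_sum)
  for (int i = 0; i < n; ++i)
    local_sum += cell[i];

  // ... and MPI communication between the nodes.
  double global_sum = 0.0;
  MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0)
    std::printf("Global sum over %d processes: %g\n", size, global_sum);

  MPI_Finalize();
  return 0;
}
```

In this model one would typically launch one MPI process per node (or per socket) and set OMP_NUM_THREADS to the number of cores sharing that memory.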

 

 

Fig 1. Supercomputers: (a) JUQUEEN, the 458,752-core IBM Blue Gene supercomputer in Jülich, Germany, the largest European supercomputer; (b) ENER110, the 6,048-core Bull supercomputer, our more modest local machine at IFPEN-Lyon.

 

 

The PeliGRIFF team would like to acknowledge GENCI (Grand Equipement National de Calcul Intensif) for its support in granting us, every year since 2011 and through the DARI selection process, access to the computing resources of OCCIGEN, the supercomputer of CINES located in Montpellier in the South of France, and CURIE, the supercomputer of TGCC (CEA) located in Bruyères-le-Châtel in the southern suburbs of Paris.

 

 

PeliGRIFF and high performance computing

 

PeliGRIFF and its granular dynamics module Grains3D are fully parallel. Both use a classical domain decomposition technique and the MPI standard to implement inter-processor communications. PeliGRIFF is built on the Pelicans platform, developed and maintained by IRSN Cadarache, France, and freely available under the CeCILL-C licence. Pelicans is essentially a C++ application framework with a set of integrated reusable components, designed to simplify the task of developing applications of numerical mathematics and scientific computing, particularly those concerning partial differential equations and initial boundary value problems (description adapted from the Pelicans documentation). A plug-in technique offers the possibility to couple PeliGRIFF with various granular dynamics solvers; for our own use at IFPEN, it is coupled with our in-house DEM solver Grains3D (see Figure 2).
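
As a purely illustrative sketch of what such a plug-in coupling can look like (the class and method names below are hypothetical and do not reflect the actual Pelicans/PeliGRIFF API), the granular solver can be hidden behind an abstract interface that the fluid solver calls at each time step:

```cpp
// Hypothetical plug-in interface for granular dynamics solvers (illustrative only).
#include <vector>

struct ParticleState
{
  double position[3];
  double velocity[3];
};

// Abstract interface the fluid solver talks to, whatever the underlying DEM code.
class GranularSolverPlugin
{
public:
  virtual ~GranularSolverPlugin() = default;

  // Advance the granular phase by dt, given the hydrodynamic force
  // (3 components per particle) computed by the fluid solver.
  virtual void advance(double dt, const std::vector<double> &hydroForces) = 0;

  // Expose the updated particle states so the fluid solver can enforce the
  // rigid-body motion inside the particles.
  virtual const std::vector<ParticleState> &particles() const = 0;
};

// A concrete plug-in then simply wraps a given DEM code, e.g.
// class Grains3DPlugin : public GranularSolverPlugin { /* calls into Grains3D */ };
```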

 

 

Fig 2. PeliGRIFF code structure.

 

 

The governing equations in the granular dynamics code Grains3D are solved by an explicit time algorithm, which means that there are no matrices to invert. In theory, such a numerical method is supposed to scale quite well, provided the load balancing is reasonable. Surprisingly, in the initial stages of the parallel development of the code (which was previously a serial code), we faced significant difficulties in reaching an acceptable parallel efficiency. We found out that this was not related to any MPI issue but to memory access and management on a multi-core processor or node. Hence, we had to re-think the serial architecture, limiting as much as possible dynamic memory creation and destruction and promoting good data alignment wherever possible, as illustrated in the sketch below. These efforts resulted in a huge improvement of the parallel efficiency of Grains3D. In fact, on reasonably well load-balanced configurations, Grains3D now exhibits a weak scaling efficiency between 0.5 in the worst cases and 0.95 in the most favorable ones.
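
As an illustration of the kind of refactoring involved (this is a generic sketch, not Grains3D source code), particle data can be kept in pre-allocated, contiguous arrays that are sized once and reused at every time step, instead of individually heap-allocated objects:

```cpp
// Generic sketch: contiguous, pre-allocated particle storage (structure of arrays)
// that avoids dynamic memory creation/destruction inside the time loop and keeps
// memory accesses aligned and cache-friendly. Not Grains3D source code.
#include <cstddef>
#include <vector>

struct ParticleArrays
{
  std::vector<double> x, y, z;    // positions
  std::vector<double> vx, vy, vz; // velocities

  explicit ParticleArrays(std::size_t n)
    : x(n), y(n), z(n), vx(n), vy(n), vz(n) {}
};

// Explicit time integration sweeps linearly through the arrays; no allocation
// or deallocation occurs inside the loop.
void advance(ParticleArrays &p, double dt)
{
  const std::size_t n = p.x.size();
  for (std::size_t i = 0; i < n; ++i) {
    p.x[i] += dt * p.vx[i];
    p.y[i] += dt * p.vy[i];
    p.z[i] += dt * p.vz[i];
  }
}

int main()
{
  ParticleArrays particles(100000); // allocated once, before the time loop
  for (int step = 0; step < 100; ++step)
    advance(particles, 1.0e-4);
  return 0;
}
```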

 

 

On the fluid side, we implemented both Finite Element and Finite Volume/Staggered Grid schemes and an operator-splitting time algorithm. The resulting discretized systems are stored in distributed matrices and vectors. The linear algebra part is based on PETSc for distributed matrices, vectors and linear system solvers, and PETSc is coupled to HYPRE to get access to efficient Algebraic Multi-Grid (BoomerAMG) and incomplete LU preconditioners. In particular, the pressure Laplacian linear system involved in the L2 projection step to enforce mass conservation costs the most in terms of computing time. In fact, since it does not contain any transient term, it is not well conditioned. For this particular system, we employ the BoomerAMG parallel preconditioner of HYPRE with the PMIS/HMIS coarsening type and the ext+i interpolation formula, and usually obtain a good weak scalability on jobs up to 512 million cells and 1,000 cores (see Figure 3). Jobs of up to a few billion cells on a larger number of cores can be run with a weak scaling factor above 0.5, which is deemed quite good for this type of linear system solution.
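
For readers who would like to see what such a solver setup looks like, the following stand-alone sketch configures a PETSc Krylov solver preconditioned by HYPRE's BoomerAMG with the HMIS coarsening and ext+i interpolation options mentioned above. It is not an excerpt from PeliGRIFF: the matrix here is a toy 1D Laplacian standing in for the pressure system, and the option-setting calls assume a reasonably recent PETSc version built with HYPRE.

```cpp
// Toy example (not PeliGRIFF code): Conjugate Gradient + BoomerAMG (via the
// PETSc/HYPRE interface) on a 1D Laplacian standing in for the pressure system.
#include <petscksp.h>

int main(int argc, char **argv)
{
  PetscInitialize(&argc, &argv, NULL, NULL);

  const PetscInt n = 1000; // global number of unknowns (toy problem)
  Mat A;
  Vec x, b;
  KSP ksp;
  PC  pc;

  // Assemble a distributed 1D Laplacian.
  MatCreate(PETSC_COMM_WORLD, &A);
  MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, n, n);
  MatSetFromOptions(A);
  MatSetUp(A);
  PetscInt istart, iend;
  MatGetOwnershipRange(A, &istart, &iend);
  for (PetscInt i = istart; i < iend; ++i) {
    if (i > 0)     MatSetValue(A, i, i - 1, -1.0, INSERT_VALUES);
    if (i < n - 1) MatSetValue(A, i, i + 1, -1.0, INSERT_VALUES);
    MatSetValue(A, i, i, 2.0, INSERT_VALUES);
  }
  MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
  MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

  MatCreateVecs(A, &x, &b);
  VecSet(b, 1.0);

  // Krylov solver preconditioned by BoomerAMG.
  KSPCreate(PETSC_COMM_WORLD, &ksp);
  KSPSetOperators(ksp, A, A);
  KSPSetType(ksp, KSPCG);
  KSPGetPC(ksp, &pc);
  PCSetType(pc, PCHYPRE);
  PCHYPRESetType(pc, "boomeramg");

  // Coarsening and interpolation choices quoted in the text
  // (they could equally be passed on the command line).
  PetscOptionsSetValue(NULL, "-pc_hypre_boomeramg_coarsen_type", "HMIS");
  PetscOptionsSetValue(NULL, "-pc_hypre_boomeramg_interp_type", "ext+i");
  KSPSetFromOptions(ksp);

  KSPSolve(ksp, b, x);

  KSPDestroy(&ksp);
  MatDestroy(&A);
  VecDestroy(&x);
  VecDestroy(&b);
  PetscFinalize();
  return 0;
}
```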

 

Fig 3. Weak scaling test with PeliGRIFF on a classical 3D lid-driven cavity problem on Jade (the supercomputer of CINES, Montpellier, France): 512,000 cells per core, up to 512,000,000 cells on 1,000 cores.

 

In the momentum equations, diffusive terms are treated implicitly while convective ones are treated explicitly. However, due to the use of generally small time steps, the diffusion matrix is highly diagonally dominant; as a result, the solution of the diffusive linear system, with the convective terms on the right-hand side, usually takes a very limited part of the total computing time and scales quite well.
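
Schematically (with notation and splitting details that may differ from the exact PeliGRIFF formulation), the semi-implicit momentum step reads

\[
\frac{\tilde{\boldsymbol{u}} - \boldsymbol{u}^{n}}{\Delta t} - \nu \nabla^{2} \tilde{\boldsymbol{u}}
= -\left(\boldsymbol{u}^{n} \cdot \nabla\right) \boldsymbol{u}^{n} + \text{(other explicit terms)},
\]

so that the discrete operator to invert is of the form \(\tfrac{1}{\Delta t} M - \nu L_{h}\) (mass plus viscous contribution). For small \(\Delta t\), the \(\tfrac{1}{\Delta t} M\) term dominates the diagonal, which is why this system is well conditioned and cheap to solve.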

 

Finally, the Fictitious Domain saddle-point problem, solved by an iterative Uzawa algorithm, is implemented in a matrix-free fashion. The matrix-free feature is particularly well adapted to parallel computing when a particle moves from one sub-domain (one process) to another sub-domain (another process). However, additional tests are required to assess the scalability of the implementation.
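
As a generic reminder only (the actual Fictitious Domain operators in PeliGRIFF may differ), a basic Uzawa iteration for a saddle-point system

\[
\begin{pmatrix} A & B^{T} \\ B & 0 \end{pmatrix}
\begin{pmatrix} \boldsymbol{u} \\ \boldsymbol{\lambda} \end{pmatrix}
=
\begin{pmatrix} \boldsymbol{f} \\ \boldsymbol{g} \end{pmatrix}
\]

updates

\[
A \boldsymbol{u}^{k+1} = \boldsymbol{f} - B^{T} \boldsymbol{\lambda}^{k},
\qquad
\boldsymbol{\lambda}^{k+1} = \boldsymbol{\lambda}^{k} + \rho \left( B \boldsymbol{u}^{k+1} - \boldsymbol{g} \right),
\]

and only requires the action of \(A\), \(B\) and \(B^{T}\) on vectors, which is precisely what makes a matrix-free implementation natural.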
