HPCSE 2013

High Performance Computing in Science and Engineering

organized by Centre of Excellence IT4Innovations, VSB-Technical University of Ostrava


Hotel Soláň, May 27 - 30, 2013

List of abstracts - invited talks


Daniel Kellenberger, Cyril Flaig, Peter Arbenz

Bone structure analysis with multiGPGPUs

The state-of-the-art method to predict bone stiffness is micro-finite element analysis based on high-resolution computed tomography (CT). Modern parallel solvers enable simulations with billions of degrees of freedom. In this paper we present a solver that works directly on the CT image and exploits the geometric properties given by the 3D pixels (voxels).

We first discuss our solver ParOSol, which stores the data in a pointer-less octree. The tree data structure provides different resolutions of the image, which are used for the design of a geometric multigrid preconditioner. It makes matrix-free implementations possible on all levels. In this way the memory footprint is reduced by more than a factor of 10 compared to a solver that uses an algebraic preconditioner.
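
As a generic illustration of the pointer-less idea (not necessarily ParOSol's exact encoding): cells can be identified by Morton keys obtained by interleaving the bits of their integer voxel coordinates, so that a sorted key array replaces child pointers and coarser multigrid levels are reached by dropping bits. A minimal sketch:

def morton_key(x, y, z, depth):
    """Interleave the bits of integer voxel coordinates (x, y, z) into a
    Morton key; a sorted array of keys acts as a pointer-less octree."""
    key = 0
    for b in range(depth):
        key |= ((x >> b) & 1) << (3 * b)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b + 2)
    return key

def parent(key):
    """Parent cell on the next coarser (multigrid) level: drop three bits."""
    return key >> 3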

Then we introduce a new version of ParOSol that makes use of attached graphics processing units (GPUs). The tree data structure is not well suited to GPUs because of its uneven memory accesses. It has been replaced by a structure that classifies element vertices according to their index-wise distance to the element origin. The treatment of each class is then parallelized individually. We also discuss how to distribute work between the CPU and the GPU in order to minimize execution time.

We present results that we have obtained on the Cray XK7 at the Swiss National Supercomputing Centre, consisting of 272 nodes, each with a 16-core AMD Opteron CPU and one NVIDIA Tesla K20X GPU.


Peter Bastian

Efficient Numerical Simulation of Multi-phase Flow in Porous Media

Appropriate models, accurate discretization schemes and efficient solvers for the arising linear systems are the basis of any numerical simulation. In this talk I will address these aspects by first considering a model for compositional two-phase flow with equilibrium phase exchange that is able to handle phase appearance/disappearance properly. Then a new fully coupled discontinuous Galerkin scheme for two-phase flow with heterogeneous capillary pressure will be presented. The third part of the talk is devoted to the efficient solution of the arising linear systems by means of algebraic multigrid methods. All numerical schemes have been implemented in the Distributed and Unified Numerics Environment (DUNE) and have been scaled up to 300,000 cores. This is joint work with Olaf Ippisch and Rebecca Neumann.


Zlatko Drmac

Numerical stability issues in model order reduction for dynamical systems

The Iterative Rational Krylov Algorithm (IRKA) for model order reduction (Gugercin, Antoulas, Beattie 2008) has recently attracted attention because of its effectiveness in real-world applications, as well as because of its mathematical elegance. The key idea is to construct a reduced order model that satisfies a set of necessary (Meier-Luenberger) optimality conditions formulated as interpolatory conditions: the reduced $r$-th order transfer function interpolates the full $n$-th order function and its derivative(s) at the reflected (about the imaginary axis) images of the reduced order poles. This is formulated as a fixed point problem, and the interpolation nodes are generated as ${\sigma}^{(k+1)} = \phi({\sigma}^{(k)})$. Here $\phi(\cdot)$ computes the eigenvalues of the Petrov-Galerkin projection of the state matrix onto rational Krylov subspaces computed at ${\sigma}^{(k)}=({\sigma}^{(k)}_1,\ldots, {\sigma}^{(k)}_r)$. We will discuss challenging numerical issues related to this and some other model order reduction methods.
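
A minimal dense sketch of this fixed-point iteration for a SISO system $(A, b, c)$ (illustrative only: it ignores the pairing of complex-conjugate shifts and other safeguards of the actual algorithm):

import numpy as np

def irka(A, b, c, sigma, maxit=50, tol=1e-8):
    """Fixed-point IRKA iteration: sigma^(k+1) = phi(sigma^(k)), where phi
    returns the mirrored eigenvalues of the Petrov-Galerkin projected A."""
    n = A.shape[0]
    sigma_new = np.sort_complex(sigma)
    for _ in range(maxit):
        V = np.column_stack([np.linalg.solve(s * np.eye(n) - A, b) for s in sigma])
        W = np.column_stack([np.linalg.solve(s * np.eye(n) - A.T, c) for s in sigma])
        Ar = np.linalg.solve(W.T @ V, W.T @ A @ V)   # projected state matrix
        sigma_new = np.sort_complex(-np.linalg.eigvals(Ar))
        if np.linalg.norm(sigma_new - np.sort_complex(sigma)) < tol:
            break
        sigma = sigma_new
    return sigma_new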

This is joint work with Chris Beattie and Serkan Gugercin from Virginia Tech, Blacksburg.


Massimiliano Ferronato

Recent developments on parallel preconditioning techniques for sparse linear systems of equations

The efficient parallel solution of large sparse linear systems of equations is a central issue in many numerical computations in science and engineering. Iterative methods based on Krylov subspaces are in principle almost ideally parallel, as their kernels are matrix-vector and scalar products along with vector updates. Unfortunately, the same is not true for most algebraic preconditioners. There are basically two approaches to developing parallel preconditioners: the first tries to extract as much parallelism as possible from existing algorithms with the aim of transferring them to high-performance platforms, while the second is based on developing novel techniques which would not make sense on a scalar computer. Quite obviously, the former approach is easier to understand from a conceptual point of view, as the native algorithm is not modified and the difficulties are mainly technological. The latter implies an additional effort to develop new, explicitly parallel methods. The present talk focuses on the most recent developments in parallel preconditioners for large sparse linear systems. One of the most promising approaches relies on state-of-the-art hybrid techniques that aim at combining the most attractive features of approximate inverse and incomplete LU preconditioners. The importance of using a dynamic pattern generation algorithm combined with domain decomposition techniques for approximate inverse preconditioning is discussed. Finally, a possible generalization to nonsymmetric problems is presented with some numerical results.


Luc Giraud

Hybrid linear solvers on parallel hybrid computers

In this work we investigate the parallel scalability of variants of additive Schwarz preconditioners for three-dimensional non-overlapping domain decomposition methods. To alleviate the computational cost, both in terms of memory and floating-point complexity, we investigate variants based on a sparse approximation. The robustness of the preconditioners is illustrated on a set of linear systems arising from the finite element discretization of academic convection-diffusion problems (unsymmetric matrices), and from real-life structural mechanics problems (symmetric indefinite matrices). Parallel experiments on up to a thousand processors for some problems will be presented, and results of an ongoing implementation on top of runtime systems for heterogeneous computing will be discussed. The efficiency from both a numerical and a parallel performance viewpoint is studied on problems ranging from a few hundred thousand unknowns up to a few tens of millions.


Dominik Göddeke

Energy efficiency aspects of high performance computing for PDEs

Power consumption and energy efficiency are becoming critical aspects in the design and operation of large-scale HPC facilities, and it is unanimously recognised that future exascale supercomputers will be strongly constrained by their power requirements. At current electricity costs, operating an HPC system over its lifetime can already be on par with the initial deployment cost. These power consumption constraints, and the benefits a more energy-efficient HPC platform may have on other societal areas, have motivated the HPC research community to investigate the use of energy-efficient technologies originally developed for the embedded and especially mobile markets. However, lower power does not always mean lower energy consumption, since execution time often also increases. In order to achieve competitive performance, applications then need to efficiently exploit a larger number of processors. In this talk, we discuss how applications can efficiently exploit this new class of low-power architectures to achieve competitive performance, and we evaluate whether they can benefit from the increased energy efficiency that the architecture is supposed to achieve. The applications that we consider cover three different classes of numerical solution methods for partial differential equations, namely a low-order finite element multigrid solver for huge sparse linear systems of equations, a Lattice-Boltzmann code for fluid simulation, and a high-order spectral element method for acoustic or seismic wave propagation modelling. We evaluate weak and strong scalability on a cluster of 96 ARM Cortex-A9 dual-core processors and demonstrate that the ARM-based cluster can be more efficient in terms of energy to solution when executing the three applications compared to an x86-based reference machine.


Laura Grigori

Recent progress in minimizing communication for dense and sparse linear algebra operations

The cost of moving data in an algorithm can surpass by several orders of magnitude the cost of performing arithmetic, and this gap has been growing steadily and exponentially over time. This talk will review work performed in recent years on a new class of algorithms for numerical linear algebra that provably minimize communication. The novel numerical schemes employed, the speedups obtained with respect to conventional algorithms, as well as their impact on applications in computational science will also be discussed.


Gundolf Haase

GPU preconditioners for non-linear elasticity problems

Cardiovascular simulations require the solution of coupled PDE/ODE systems with several internal couplings of the PDEs, depending on the underlying model and the available compute capabilities.

The MPI/OpenMP and GPU (CUDA) parallelization of the ODE solver and of the elliptic/parabolic potential problem solver has been successfully performed in the past. We use a conjugate gradient iteration with an algebraic multigrid (AMG) preconditioner for the elliptic problem, and its general parallelization will be presented in the talk. This parallelization concept has been revisited with respect to load balancing of the subdomain interfaces of the decomposed domain, which resulted in much better strong parallel efficiency, especially on clusters of GPUs. The matrices for the potential problems remain unchanged during the whole calculation, i.e., the matrices are computed and assembled on the CPU and transferred only once to the GPU. The same holds for the AMG setup.

In order to provide a GPU solver for elasticity, we extended the AMG to coupled problems. Several versions have been investigated, and the AMG with coupled degrees of freedom in each node, together with a graph coarsening, proved to be the most robust and the fastest. The relation between the costs of setup and solve changes completely in the case of non-linear elasticity. The original CPU code spent 50% of its time in matrix calculation and accumulation, i.e., transferring only the solver onto the GPU would result in a speedup of less than 2 compared to one CPU core. Therefore we had a closer look at the matrix calculation and assembly. Currently, the calculation of the local stiffness matrices is accelerated by a factor of 1000 on a GTX 680. The assumption that the matrix graph will not change during the non-linear calculation supports the GPU acceleration of this step (and also the CPU acceleration). The assembly of the local contributions into the global stiffness matrix on the GPU is subject to ongoing work.

A similar assumption will be used for the AMG setup on the CPU/GPU resulting in several setup entry points with very different computational costs and data transfers between CPU and GPU.

Apart from the setup phase, the full non-linear iteration algorithm will run completely on the GPU. Also taking into account the dramatically reduced data transfer between host and device, we expect an acceleration of the non-linear iteration by a factor of 30 with respect to one CPU core.


Thomas Huckle

Data dependent regularization methods

We consider ill-posed inverse problems. To avoid the deterioration of the solution by noise, Tikhonov regularization usually also minimizes the solution vector in a suitable seminorm. This seminorm should have small influence on the signal components that we are looking for, but should be large on noise components. For a smooth signal, the gradient operator gives good seminorms with this property. The 1-norm is often used because it also preserves discontinuities. Here we want to derive and compare good seminorms based on the 2-norm that work also for discontinuous signals. We consider this type of problem in the original signal space, but also in the Fourier- or wavelet-transformed space.
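
For orientation, the generic Tikhonov functional with a seminorm penalty (notation chosen here, not taken from the talk) reads
$$x_\lambda = \arg\min_x \; \|Ax-b\|_2^2 + \lambda\,\|Lx\|_2^2,$$
where $L$ is, e.g., a discrete gradient operator; replacing $\|Lx\|_2^2$ by $\|Lx\|_1$ yields the discontinuity-preserving 1-norm penalty mentioned above.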


Boris Khoromskij

Tensor numerical methods for the large-scale multidimensional and multiparametric PDEs

Tensor numerical methods provide efficient low-parametric separable representations of multivariate functions and operators on large tensor grids, which allows the solution of d-dimensional PDEs with linear complexity scaling in the dimension. The recent quantized tensor approximation (QTT) is proven to provide logarithmic data compression for the discretization of multidimensional steady-state and dynamical problems in quantized tensor spaces.

In this talk I show how the grid-based tensor approximation applies to hard problems arising in electronic structure calculations, such as multidimensional convolution, many-electron integrals, and Hartree-Fock calculations for large molecular systems. The tensor approximation method has proved to be efficient for stochastic/parametric PDEs as well as for high-dimensional time-dependent models, in particular the molecular Schrödinger, Fokker-Planck and chemical master equations. Numerical tests indicating the efficiency of tensor methods in some electronic structure, parameter-dependent and dynamical calculations will be presented.

http://personal-homepages.mis.mpg.de/bokh


Axel Klawonn

An approach to adaptive coarse spaces in FETI-DP methods

Adaptive coarse spaces for domain decomposition methods can be used to obtain independence of coefficient jumps for highly heterogeneous problems, even when coefficient jumps inside subdomains are present. In this talk, for FETI-DP methods, we present a new approach to obtaining independence of the coefficient jumps by solving certain local eigenvalue problems and enriching the FETI-DP coarse space with eigenvectors.


Rolf Krause

Parallel-in-time integration in high-performance computing

Upcoming high-performance computing architectures are anticipated to require million-way concurrency. This leads not only to technical and economic challenges but also mandates the development of novel and inherently parallel numerical approaches. Parallelization in the temporal direction has been shown to be a promising approach to provide an additional level of concurrency already at the algorithmic level.

Parareal is a popular method that uses a cheap, coarse integrator and an accurate, expensive one in an iterative fashion to compute the solution on multiple time slices with some degree of concurrency. A more recent approach is the "parallel full approximation scheme in space and time" (PFASST). It improves parallel efficiency by intertwining the iterations of time-serial spectral deferred correction (SDC) methods with a Parareal-like outer iteration. PFASST features a full approximation scheme (FAS) correction to efficiently employ different spatial coarsening strategies in order to reduce the overhead of the coarse propagator.
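
For reference, the classical Parareal update combining the coarse propagator $\mathcal{G}$ and the fine propagator $\mathcal{F}$ reads
$$U_{n+1}^{k+1} = \mathcal{G}\big(U_n^{k+1}\big) + \mathcal{F}\big(U_n^{k}\big) - \mathcal{G}\big(U_n^{k}\big),$$
where the expensive fine evaluations $\mathcal{F}(U_n^{k})$ stem from the previous iteration and can therefore be computed concurrently over all time slices $n$.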

The talk summarizes recent research on the implementation and analysis of parallelization in time on state-of-the-art high-performance computing systems. The potential of time-parallel methods is demonstrated to provide significant additional speedup beyond the inevitable strong scaling limit of a purely space-parallel approach. Furthermore, we report on progress on a hybrid space-time parallel implementation, combining shared-memory Parareal with MPI-based parallelization in space.


Wesley Petersen

A stable finite difference scheme for growth and diffusion on a map

In 1937, Fisher, and Kolmogoroff/Petrovskii/Piscunoff, studied an equation for the diffusion and logistic growth of mutations which have genetic advantages. Human migrations can also be modeled by this equation when the environment, in the form of a "carrying capacity" (often called Net Primary Productivity (NPP)), is considered. In this talk I will show a stable, semi-implicit, finite difference Godunov scheme for integrating this F/KPP equation on a Mercator-projection world map. The numerical solution accurately exhibits the expected traveling wave characteristics. Since environmental carrying-capacity data are very difficult to obtain over 120 thousand years, some regularization against noise is necessary. The scheme has a sparse representation for land/water separation. Via its dimensional splitting, the method is also easily parallelizable.
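
In its classical constant-coefficient form (stated here for orientation; in the talk the carrying capacity varies over the map), the F/KPP equation reads
$$\frac{\partial u}{\partial t} = D\,\Delta u + r\,u\Big(1 - \frac{u}{K}\Big),$$
with diffusivity $D$, growth rate $r$ and carrying capacity $K$; for constant coefficients its solutions develop traveling fronts of speed $2\sqrt{rD}$.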


Oliver Rheinbach

Nonlinear FETI-DP and BDDC Methods

Nonlinear approaches to domain decomposition are characterized by a geometric decomposition of the nonlinear problem, i.e., before linearization. A well-known overlapping DD method is the ASPIN approach. In this talk, new nonlinear nonoverlapping domain decomposition methods, i.e., nonlinear FETI-DP and nonlinear BDDC methods, are presented. In these methods, in each iteration, local nonlinear problems are solved on the nonoverlapping subdomains. The new approaches have the potential to reduce communication and can show significantly improved performance.


Olaf Schenk

Interior point methods for large scale stochastic optimization on high-performance computers

Stochastic programming is the leading optimization-under-uncertainty paradigm for large-scale problems. It requires thousands of simultaneous scenarios, giving problems with billions of variables that need to be solved within an operationally defined time interval. To address this challenge, we propose several algorithmic and implementation advances inside hybrid parallel interior-point optimization solvers. The new developments include a novel incomplete augmented multicore sparse factorization and new multicore- and GPU-based dense matrix implementations. We also adapt and improve the interprocess communication strategy and use the software to solve 24-hour horizon power grid problems with up to 1.95 billion decision variables and 1.94 billion constraints on "Titan" (Cray XK7) and "Piz Daint" (Cray XC30), where we observe very good parallel efficiencies and solution times within an operationally defined time interval. To our knowledge, "real-time"-compatible performance on a broad range of architectures for this class of problems has not been possible prior to the present work.


Bora Ucar

A sparse matrix scaling algorithm and its efficient parallelization

We recently proposed an algorithm that scales the rows and columns of a given matrix so that they have unit length in a given norm. In this talk we will review the algorithm and summarize some of its important characteristics. We will discuss its efficient parallelization on distributed-memory parallel computers with MPI and on shared-memory parallel computers with OpenMP. We will also present numerical tests with a direct solver showing the effects of the proposed scaling algorithm on the solution of linear systems.
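
As a generic illustration of this class of methods, a simultaneous row/column equilibration in the infinity norm in the spirit of Ruiz scaling (a sketch, not necessarily the authors' exact algorithm):

import numpy as np

def equilibrate(A, iters=10):
    """Iteratively compute d1, d2 so that diag(d1) @ A @ diag(d2) has rows
    and columns of (approximately) unit infinity norm (Ruiz-type scheme)."""
    m, n = A.shape
    d1, d2 = np.ones(m), np.ones(n)
    B = np.abs(A).astype(float)           # scaling depends only on magnitudes
    for _ in range(iters):
        r = np.sqrt(B.max(axis=1))        # square roots of row norms
        c = np.sqrt(B.max(axis=0))        # square roots of column norms
        r[r == 0] = 1.0                   # leave empty rows/columns alone
        c[c == 0] = 1.0
        B /= r[:, None]
        B /= c[None, :]
        d1 /= r
        d2 /= c
    return d1, d2                         # scaled matrix: diag(d1) A diag(d2)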


List of abstracts - contributed talks


Jennifer Scott, Miroslav Tůma

Enhanced incomplete Cholesky factorization

Incomplete Cholesky factorizations represent an important class of preconditioners for solving large-scale sparse symmetric positive-definite linear systems of equations. Over the last 60 years or so, many different types of incomplete factorization methods have been developed. Some of them were inspired by particular applications and some were intended to be more general-purpose.

In this talk, we consider two important ideas that were introduced over time: the Jennings-Malik modification (1977) and the Tismenetsky decomposition (1991). We explore their theoretical and practical similarities and differences. Based on our observations, we propose a new implementation of the incomplete Cholesky factorization that uses a limited memory approach. Extensive numerical experimentation appears to confirm that the developed technique may be a method of choice in various practical applications. The proposed algorithm is implemented as a new package HSL_MI28 for the HSL mathematical software library.
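
For orientation, a minimal dense prototype of the classical zero-fill incomplete Cholesky factorization IC(0) is shown below; it is far simpler than the limited-memory approach of HSL_MI28, but shows the basic principle of discarding fill outside the pattern of A (breakdown safeguards such as diagonal shifts are omitted):

import numpy as np

def ic0(A):
    """Zero-fill incomplete Cholesky of an SPD matrix (dense prototype):
    the factor L inherits the sparsity pattern of tril(A); fill is dropped."""
    n = A.shape[0]
    keep = np.tril(A != 0)
    L = np.tril(A).astype(float)
    for k in range(n):
        L[k, k] = np.sqrt(L[k, k])           # may fail without a shift
        L[k+1:, k] /= L[k, k]
        L[k+1:, k] *= keep[k+1:, k]          # drop entries outside the pattern
        for j in range(k + 1, n):
            L[j:, j] -= L[j, k] * L[j:, k]   # right-looking update
    return L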


Martin Hanek, Jakub Šístek, Pavel Burda

Numerical simulation of flow in hydrostatic bearing

In the paper we deal with the numerical solution of the Navier-Stokes equations for stationary incompressible flow in a hydrostatic bearing by means of the finite element method. First we deal with 2D rotationally symmetric flow and then with the 3D problem. Here we use domain decomposition into nonoverlapping subdomains and a BDDC preconditioner.

This work was supported by the Grant Agency of the Czech Technical University in Prague, grant No. SGS13/190/OHK2/3T/12.


Martin Plešinger

The total least squares problem with multiple right-hand sides

Consider a general orthogonally invariant linear approximation problem Ax≈b. In the paper by Paige and Strakoš (2006), it is proved that partial upper bidiagonalization of the extended matrix [b,A] determines a core approximation problem, with the necessary and sufficient information for solving the original problem. The transformed data [b_1,A_{11}] can be computed either directly, using Householder orthogonal transformations, or iteratively, using Golub-Kahan bidiagonalization (see Golub, Kahan (1965)). It is shown how the core problem can be used in a simple and efficient way for solving the total least squares formulation of the original approximation problem (see Golub, Van Loan (1980), and Van Huffel, Vandewalle (1991)).
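
A minimal sketch of the iterative route, the (lower) Golub-Kahan bidiagonalization with starting vector b (standard recurrences, shown without the reorthogonalization needed in practice):

import numpy as np

def golub_kahan(A, b, k):
    """k steps of Golub-Kahan bidiagonalization: A V = U B with orthonormal
    columns in U, V and lower bidiagonal B (alphas on the diagonal,
    betas[1:] on the subdiagonal)."""
    m, n = A.shape
    U = np.zeros((m, k + 1))
    V = np.zeros((n, k))
    alphas = np.zeros(k)
    betas = np.zeros(k + 1)
    betas[0] = np.linalg.norm(b)
    U[:, 0] = b / betas[0]
    for i in range(k):
        v = A.T @ U[:, i] - (betas[i] * V[:, i - 1] if i > 0 else 0.0)
        alphas[i] = np.linalg.norm(v)
        V[:, i] = v / alphas[i]
        u = A @ V[:, i] - alphas[i] * U[:, i]
        betas[i + 1] = np.linalg.norm(u)
        U[:, i + 1] = u / betas[i + 1]
    return U, V, alphas, betas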

In this poster we concentrate on the extension of the idea of the core problem formulation to linear approximation problems AX≈B with multiple right-hand sides. Here the extension of the concept of a (minimally dimensioned) approximation problem containing the necessary and sufficient information for solving the original problem is not straightforward. We will survey results obtained during the investigation of this problem and discuss examples illustrating difficulties which have to be taken into account.


Martina Šimůnková

Processes and threads - two ways to parallel computation

We will discuss two different parallel implementations of a multi-dimensional Galerkin method on a hypercube and compare their performance. Both our implementations are in the C language; one is based on MPI (Message Passing Interface), the other uses the pthread library (POSIX threads).


Vítězslav Žabka

GPU implementation of a multigrid solver for the incompressible Navier-Stokes equations

We present a GPU implementation of a geometric multigrid method for the incompressible Navier-Stokes equations in 2D. The equations are discretized in space by means of the mixed finite element method on unstructured triangular meshes. For the time discretization a semi-implicit scheme is employed. The resulting systems of linear equations are solved using the multigrid V-cycle with a Vanka-type smoother. In the presented implementation, the solver runs completely on the GPU, including the assembly of the linear systems. Its parallelization is based on the red-black ordering of the mesh triangles. We apply the solver to the numerical simulation of air flow in a simplified urban canopy, where a speedup of up to five compared to the corresponding multi-core CPU implementation is achieved.


Helena Švihlová

Preparing meshes from real medical data for finite element method

With the increasing popularity of CT scans for obtaining information about a patient's medical state, aneurysms are often discovered, and the question arises which of them have a tendency to rupture. Rupture is usually fatal. This is a challenge for mathematical modelling: to compute the fluid flow in real geometries. We will introduce the creation of meshes from real CT-scan data, software for their smoothing, and possibilities for prescribing boundary conditions on them. We will also show numerical results for a Navier-Stokes fluid computed by FEM on these meshes. This contribution includes a comparison of various boundary conditions and different discretizations.


Jakub Šístek

Parallel BDDC solver for flows in porous media

The Balancing Domain Decomposition based on Constraints (BDDC) method is extended to saddle point systems arising in the solution of flows in fractured porous media by the finite element method. The problem is discretized using Raviart-Thomas finite elements, and the mixed-hybrid formulation is employed. The geologically important fractures within the rock are modelled by lower-dimensional elements coupled with the main mesh by Robin boundary conditions. A new averaging operator is proposed to handle heterogeneous material coefficients and large variations of element sizes within the engineering models. The extensions are implemented within our open-source solver of systems of linear equations, BDDCML, which is in turn combined with an existing software package for subsurface flow simulations, FLOW123D. Performance of the parallel implementation is studied on several benchmark and geoengineering applications, and scaling is verified on up to 1024 compute cores. This is joint work with Jan Březina (TU Liberec) and Bedřich Sousedík (USC Los Angeles).


Jaroslav Hron

Implicitly constituted materials: mixed formulations, numerical solutions and computations

Implicit constitutive theory is based on the idea of expressing the response of bodies by an implicit relation between the stress and appropriate kinematical variables. It can also be viewed as an approach that lies between the classical primal and dual formulations of explicit models. It leads to a less standard but interesting structure of the governing equations. We will present several examples emphasizing the advantages of this framework at the level of modelling of material responses and the numerical solution of the resulting discrete systems by the finite element method.


Jiří Kopal

Approximate inverse preconditioning for the conjugate gradient method

This contribution deals with approximate inverse preconditioning for the conjugate gradient method. Approximate inverse/direct decompositions are often based on incomplete algorithms, where incompleteness is achieved by dropping small (in some sense) entries. Although error bounds for the basic decompositions in floating-point arithmetic are often well known, the situation for incomplete schemes is different. Such bounds for incomplete decompositions are either very specific or cannot be easily exploited in computation.

In our previous work we analyzed all the main variants of the generalized Gram-Schmidt process in floating-point arithmetic. This is the process on which we base our preconditioner approximating the matrix inverse. The nature of our results seems to allow extending the error bounds to the incomplete case and exploiting them in the course of the computation.

An ideal case would be to obtain a decomposition from the standard floating-point computation that is sparse and accurate at the same time. While this is usually not the case, the developed theoretical results enable us to compute a level at which entries of the decomposition can be dropped such that the overall error stays roughly the same as if we did not drop. In this way a sparse decomposition can be obtained. Additional techniques (pivoting, scaling) can improve the numerical properties of the preconditioner even more.

The error bounds used imply an adaptive dropping strategy. The theoretical results are accompanied by carefully chosen experiments which demonstrate the usefulness of this approach. We hope that this contribution extends the applicability of the considered type of approximate inverse preconditioners.
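
A minimal dense sketch of such a preconditioner: a generalized Gram-Schmidt (AINV-style) A-orthogonalization with threshold dropping, where the fixed threshold tau stands in for the adaptively computed drop level from the error bounds:

import numpy as np

def ainv(A, tau):
    """A-orthogonalize the identity columns for SPD A: Z^T A Z ~ diag(d),
    so Z @ diag(1/d) @ Z.T approximates inv(A); small entries are dropped."""
    n = A.shape[0]
    Z = np.eye(n)
    d = np.zeros(n)
    for k in range(n):
        w = A @ Z[:, k]
        d[k] = Z[:, k] @ w
        for j in range(k + 1, n):
            Z[:, j] -= ((Z[:, j] @ w) / d[k]) * Z[:, k]
            Z[np.abs(Z[:, j]) < tau, j] = 0.0   # adaptive dropping goes here
    return Z, d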


Václav Hapla

FLLOP: A Novel Package for Quadratic Programming including FETI Computations

FLLOP (FETI Light Layer on top of PETSc) is a novel, not yet published package for constrained quadratic programming (like MATLAB's quadprog) and FETI domain decomposition, built on top of PETSc (similarly to TAO or SLEPc). The FLLOP API is carefully designed to be user-friendly, allowing a natural specification of a QP problem, QP transformations independent of a particular solver, and an automatic or manual choice of a sensible solver. Still, it remains efficient and targeted at HPC. Current applications mainly include engineering problems of structural mechanics: linear elasticity, contact problems (also with friction), elasto-plasticity, and shape optimization. Interesting and quite unusual experiments include medical image registration. Our long-term tight cooperation with the Elmer FEM team should soon lead to further applications such as the modelling of ice-sheet melting or electrical engines. We work continuously to improve the scalability of domain decomposition methods based on the FETI hybrid iterative-direct approach. Smart use of parallel direct solvers for the FETI coarse problem solution turned out to be crucial if we want to solve problems with hundreds of millions to billions of unknowns. However, further significant improvement is anticipated if we use direct solvers designed specifically for multi-core processors or accelerators to solve the local and coarse problems and map them appropriately to the compute nodes.


Jiří Kunovský, Václav Šátek

Modern Taylor Series Method

The development project deals with extremely accurate, stable and fast numerical solutions of systems of differential equations. In a natural way, it also involves solutions of problems that can be reduced to solving a system of differential equations.

The Modern Taylor Series Method is based on a recurrent calculation of the Taylor series terms for each time interval. Thus the complicated calculation of higher-order derivatives (much criticised in the literature) need not be performed; rather, the value of each Taylor series term is calculated numerically. Solving convolution operations is another typical algorithm used.

An automatic transformation of the original problem is a necessary part of the Modern Taylor Series Method. The original system of differential equations is automatically transformed into a polynomial form, i.e. a form suitable for easily calculating the Taylor series terms using recurrent formulae.

An important part of the method is the automatic setting of the integration order, i.e. using as many Taylor series terms as the defined accuracy requires. Thus it is usual that the computation uses different numbers of Taylor series terms for different steps of constant length. On the other hand, for a preset integration order, the integration step length may be selected. This positively affects the stability and speed of the computation. These features are accentuated especially when solving large-scale systems of linear differential equations.
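
A minimal sketch of these two ingredients, recurrent term evaluation and automatic order selection, for the linear test system y' = Ay (the full method handles general systems via the polynomial transformation described above):

import numpy as np

def taylor_step(A, y, h, tol, max_order=60):
    """One integration step of y' = A y: Taylor terms are generated by the
    recurrence term_k = (h/k) * A @ term_{k-1} and added until they fall
    below tol, which realizes the automatic order setting."""
    term = y.copy()
    y_new = y.copy()
    for k in range(1, max_order + 1):
        term = (h / k) * (A @ term)
        y_new = y_new + term
        if np.linalg.norm(term) < tol:
            break
    return y_new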


Pavla Sehnalová

Multiderivative multistep methods

The talk summarizes numerical methods for ordinary differential equations and their generalization to multistep methods such as Adams methods, and their implementation as predictor-corrector pairs in both PEC and PECE modes. The talk also focuses on the generalization to multiderivative methods such as the Obreshkov method. The two-derivative multistep method is introduced and a comparison with other traditional methods, including stability analysis, is presented.


Josef Šlapal

Convenient adjacencies on the digital plane

We study graphs with the vertex set Z^2 which are subgraphs of the 8-adjacency graph and have the property that certain natural cycles in these graphs are Jordan curves, i.e., separate Z^2 into exactly two connected components. For the minimal graphs with this property, we discuss their quotient graphs, too.


Radim Blaheta, Owe Axelsson, Petr Byczanski, Rostislav Hrtus, Stanislav Sysala

Preconditioners for Saddle Point Systems in Poroelasticity

Processes in porous media involving mechanical deformation and fluid flow in the pore space have many applications. Linear poroelasticity and nonlinear poromechanics are applied in soil and rock mechanics, but also in biomechanics, filtration technologies, and other fields. All these fields demand computations of coupled multiphysics problems, which require space and time discretization and efficient, parallelizable iterative methods with suitable preconditioners for the solution of the arising systems.

For the iterative solution of such systems, we use Krylov-type iterative methods with block preconditioners, which exploit the natural block decomposition provided by the physics. In this contribution, we review such preconditioners and introduce some new variants based on the Schur complement with respect to the pressure block. We also provide an analysis and comparison of these diagonal and block triangular preconditioners, see [1], [2].

Further, we discuss the efficient implementation of selected preconditioners, including the efficient solution of subsystems with the diagonal blocks of the whole finite element matrix or the blocks arising in Schur complements, with or without simplifications. By using inner iterations, we get variable-step preconditioning of the outer iterations. We are especially interested in implementations which allow us to exploit parallel computing, see [3].

Finally, we shall discuss the experience obtained as well as the possibility of extending the results to more complicated models which couple multiphase flow and mechanical behaviour.

References
[1] O. Axelsson, R. Blaheta, P. Byczanski, An efficient preconditioner for saddle point type matrices arising in poroelasticity problems. Submitted.
[2] R. Blaheta, O. Axelsson, P. Byczanski, Solving poroelasticity problems with block type preconditioners. In progress.
[3] R. Blaheta, R. Hrtus, E. Turan, A Trilinos implementation of poroelasticity solvers. In progress.


Aleš Ronovský

Providing an Efficient Multilevel Mesh Multiplication Capability to Code_Saturne

The long-term objective of the current research is to develop algorithms able to handle extremely complex Computational Fluid Dynamics (CFD) simulations. This type of simulation requires very large computing resources and can only be performed on massively parallel computers. Moreover, the requirement for a very accurate solution is crucial in many applications. This implies solving the equations on very fine meshes of up to several billion cells to compute accurately at realistic Reynolds numbers, which further increases computational costs.

Creating meshes of several billion cells from scratch is a challenge, as few parallel mesh generators exist, and certainly even fewer open-source ones. Moreover, even if created, such meshes cannot easily be stored, transferred or even read by the CFD solver. A good alternative resides in using mesh multiplication to generate a very large mesh from an initially coarser mesh. This strategy preserves the skewness and stretching between cells of the initial mesh for the new refined one, as well as for the intermediate ones in the case of hexahedral cells.

The poster describes the Mesh Multiplication algorithm developed for the Code_Saturne software, its performance, and the results achieved on a test case of Large-Eddy Simulation (LES) in staggered distributed bundles. This case is refined up to 26 billion cells. The test case is connected to the real problem of LES of a full nuclear power plant reactor.


Jan Valdman

Vectorized MATLAB assembly of FEM matrices

We present an efficient vectorized MATLAB technique for assembling nodal and edge finite element matrices and discuss one application in solving a continuum model with a plastic spin. This is joint work with T. Rahman (Bergen), I. Anjam (Jyväskylä), D. Pauly and P. Neff (Essen-Duisburg).
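
The same vectorization idea, sketched here in Python/NumPy rather than MATLAB (a generic P1 stiffness assembly on a triangular mesh; `nodes` and `elems` are assumed coordinate and connectivity arrays, not names from the talk):

import numpy as np
from scipy.sparse import coo_matrix

def stiffness_p1(nodes, elems):
    """Vectorized assembly of the P1 stiffness matrix: all element matrices
    are computed at once, then scattered with a single sparse constructor."""
    p = nodes[elems]                                  # (ne, 3, 2) vertex coords
    v1 = p[:, 1] - p[:, 0]
    v2 = p[:, 2] - p[:, 0]
    area = 0.5 * (v1[:, 0] * v2[:, 1] - v1[:, 1] * v2[:, 0])
    # gradients of the barycentric basis functions: rotated opposite edges
    edges = p[:, [2, 0, 1], :] - p[:, [1, 2, 0], :]   # (ne, 3, 2)
    rot = np.stack([-edges[..., 1], edges[..., 0]], axis=-1)
    grads = rot / (2.0 * area)[:, None, None]
    Kloc = np.einsum('eid,ejd,e->eij', grads, grads, area)
    rows = np.repeat(elems, 3, axis=1).ravel()
    cols = np.tile(elems, (1, 3)).ravel()
    return coo_matrix((Kloc.ravel(), (rows, cols))).tocsr()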


Karel Tůma

Simulation of viscoelastic Burgers' like model in deforming domains

We present a new non-linear thermodynamically compatible rate-type fluid model that can be linearized to the standard Burgers model. This new model is used for the simulation of problems in deforming domains using the arbitrary Lagrangian-Eulerian method, which transforms the moving domain to a fixed computational domain. The domain is discretized by regular quadrilaterals. The finite element method with a fully coupled monolithic solver is used in the numerical computation. The obtained set of non-linear equations is solved by Newton's method, and the set of linear equations is solved by a direct solver. The pressure p is approximated by piecewise discontinuous linear elements, and the velocity v and the parts of the stress B_1 and B_2 are approximated by piecewise bi-quadratic continuous elements. In particular, we show a simulation of the rolling of asphalt.


Lubomír Říha

A method for communication efficient work distributions in stencil operation based applications on heterogeneous clusters

In recent years, the use of accelerators in conjunction with CPUs, known as heterogeneous computing, has brought significant performance increases for scientific applications. One of the best examples of this is Lattice Quantum Chromodynamics (QCD), a stencil-operation-based simulation. These simulations have a large memory footprint, necessitating the use of many graphics processing units (GPUs) in parallel. This requires the use of a heterogeneous cluster with one or more GPUs per node. In order to obtain optimal performance, it is necessary to determine an efficient communication pattern between GPUs on the same node and between nodes. In this paper we present a performance-model-based method for minimizing the communication time of applications with stencil operations, such as Lattice QCD, on heterogeneous computing systems with a non-blocking InfiniBand interconnection network. The proposed method is able to increase the performance of the most computationally intensive kernel of Lattice QCD by 25 percent due to improved overlapping of communication and computation.
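
The underlying overlap pattern, sketched for a generic 1D three-point stencil with mpi4py (illustrative only; the presented method additionally derives the work distribution from a performance model):

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n = 1024                                  # interior points per process
u = np.random.rand(n + 2)                 # one ghost cell on each side
unew = np.empty_like(u)

left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

# post the halo exchange first ...
reqs = [comm.Isend(u[1:2], dest=left), comm.Irecv(u[0:1], source=left),
        comm.Isend(u[n:n+1], dest=right), comm.Irecv(u[n+1:], source=right)]

# ... then update the interior, which needs no ghost data (the overlap)
unew[2:n] = 0.5 * (u[1:n-1] + u[3:n+1])

MPI.Request.Waitall(reqs)                 # halos have arrived
unew[1] = 0.5 * (u[0] + u[2])             # border points use the ghost cells
unew[n] = 0.5 * (u[n-1] + u[n+1])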


Martin Čermák, Michal Merta

Parallel solution of elasto-plastic problems

In this work we present the parallel solution of elasto-plastic problems. We assume the von Mises plastic criterion with kinematic hardening and the associated plastic flow rule. For the time discretization we use the implicit Euler method, and the corresponding one-time-step problem is formulated with respect to the unknown displacement. For the space discretization we use the finite element method, and we parallelize the resulting problem using the Total-FETI method.

Our parallel implementations are based on the Trilinos and PETSc software frameworks. Their performance and scalability are compared on 2D and 3D benchmarks. The scalability tests were carried out on the HECToR supercomputer at EPCC, UK.


Martin Stachoň

Modeling of non-adiabatic dynamics of rare-gas cluster cations on supercomputers

We present the software package MULTIDYN for the non-adiabatic dynamics of rare-gas (He, Ar, Kr, Xe) cluster cations. The software is currently used for large-scale simulations on major European supercomputers.


Martin Palkovič

Goals and vision of National supercomputing center IT4Innovations

Less than one week ago, the National Supercomputing Center IT4Innovations inaugurated its first cluster. With an Rmax of 80 TFLOPS, it is the most powerful supercomputer in academia in the Czech Republic. In this talk we will speak not only about this supercomputer but also about the plans of IT4Innovations for the coming years. We will give an overview of both the infrastructure and the research directions of the center. The talk will conclude with ongoing and future national and international collaborations of the center.



The conference is supported by Cooperation for future project CZ.1.07/2.4.00/31.0035