Dense linear algebra represents fundamental building blocks in many
computational science and engineering applications. The dense linear
algebra algorithms must be numerically stable, robust, and reliable in
order to be usable as black-box solvers by expert as well as
non-expert users. The algorithms also need to scale and run
efficiently on massively parallel computers with multi-core nodes.
Developing high-performance algorithms for dense matrix computations
is a challenging task, especially since the widespread adoption of
multi-core architectures. Cache reuse is an even more critical issue
on multi-core processors than on uni-core processors due to their
larger computational power and more complex memory
hierarchies. Blocked matrix storage formats, in which blocks of the
matrix are stored contiguously, and blocked algorithms, in which the
algorithms exhibit large amounts of cache reuse, remain key techniques
in the effort to approach the theoretical peak performance.
In Paper I, we present a packed and distributed Cholesky factorization
algorithm based on a new blocked and packed matrix storage
format. High performance node computations are obtained as a result of
the blocked storage format, and the use of look-ahead leads to
improved parallel efficiency. In Paper II and Paper III, we study the
problem of in-place matrix transposition in general and in-place
matrix storage format conversion in particular. We present and
evaluate new high-performance parallel algorithms for in-place
conversion between the standard column-major and row-major formats and
the four standard blocked matrix storage formats.
Another critical issue, besides cache reuse, is that of efficient
scheduling of computational tasks. Many weakly scalable parallel
algorithms are efficient only when the problem size per processor is
relatively large. A current research trend focuses on developing
parallel algorithms which are more strongly scalable and hence more
efficient also for smaller problems.
In Paper IV, we present a framework for dynamic node-scheduling of
two-sided matrix computations and demonstrate that by using
priority-based scheduling one can obtain an efficient scheduling of a
QR sweep. In Paper V and Paper VI, we present a blocked
implementation of two-stage Hessenberg reduction targeting multi-core
architectures. The main contributions of Paper V are in the blocking
and scheduling of the second stage. Specifically, we show that the
concept of look-ahead can be applied also to this two-sided
factorization, and we propose an adaptive load-balancing technique
that allow us to schedule the operations effectively.