Dense linear algebra algorithms are often regular and have a high arithmetic intensity.
This allows them to achieve performance close to the peak performance even on the largest computing systems in the world.
Factors that affect the performance of an implementation include the presence of a deep memory hierarchy, pipelines, SIMD units, and multiple processing cores.
On a larger scale, several processors and memories are physically distributed and communicates over a high speed network.
Techniques such as blocking, vectorization, and asynchronous communication are used in state-of-the-art implementations.
One aim of the research in this area is to develop portable software libraries with high performance implementations of scalable algorithms.
High level linear algebra algorithms can be built on top of a relatively small set of fundamental operations, such as matrix multiplication, that perform the majority of the floating point operations.
These operations, once tuned to a particular machine, provide portable performance to the higher level algorithms.
This observation led to the standardization of an interface to three levels of operations known as the Basic Linear Algebra Subprograms (BLAS) and eventually to software libraries such as LAPACK and ScaLAPACK.
One topic studied in this thesis is alternative schemes for packed storage of symmetric or triangular matrices in a distributed memory environment.
In particular, several implementations of the Cholesky factorization of dense matrices were designed and evaluated.
Related to this topic is how to convert a matrix stored in a canonical row- or column-major storage format to a block format.
A prototype library for conversion based on in-place matrix transposition is developed in connection with an evaluation of several known algorithms for the closely connected problem of in-place transposition.
This thesis also investigates ways of moving beyond the limitations of extracting parallelism at the BLAS levels of abstraction.
One of our Cholesky algorithms interleaves computation with a complex communication algorithm and thereby eliminates the distinct boundaries between computation and communication that is common in synchronous parallel algorithms.
Two-sided transformations offer further challenges due to their often much smaller degree of concurrency.
This thesis explores how priority-based dynamic scheduling on each node of a distributed memory system can be used to achieve a faster execution by eliminating synchronization overhead.
The advantages of being able to express diverse and complex schedules within a sequential node program are clear, but the inherent overheads and limitations are topics for further research.