We consider parallel reduction of a real matrix to Hessenberg form using orthogonal transformations. Standard Hessenberg reduction algorithms reduce the columns of the matrix from left to right in either a blocked or unblocked fashion. The standard blocked variant, however, still performs 20% of the computations as large, memory-bound matrix-vector multiplications.
We show that a two-stage approach, which first reduces the matrix to r-Hessenberg (block Hessenberg) form and then to true Hessenberg form, speeds up the reduction by avoiding matrix-vector multiplications.
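For intuition, the following minimal NumPy sketch shows an unblocked Householder-based reduction that sweeps columns left to right. The function name and the unblocked formulation are our illustration, not the paper's blocked parallel implementation. With r = 1 it produces standard upper Hessenberg form; with r > 1 it produces the intermediate r-Hessenberg form targeted by the first stage.

```python
import numpy as np

def reduce_to_r_hessenberg(A, r=1):
    """Reduce A to r-Hessenberg form (all entries below the r-th
    subdiagonal are zero) via Householder similarity transformations,
    sweeping columns left to right.  r = 1 yields standard upper
    Hessenberg form.  Illustrative unblocked sketch only."""
    H = np.array(A, dtype=float)
    n = H.shape[0]
    for k in range(n - r - 1):
        x = H[k + r:, k]
        nx = np.linalg.norm(x)
        if nx == 0.0:
            continue  # column already in the desired shape
        # Householder vector v mapping x onto a multiple of e1.
        v = x.copy()
        v[0] += np.copysign(nx, x[0])
        v /= np.linalg.norm(v)
        # Two-sided (similarity) update: reflector applied to
        # rows k+r: from the left, then columns k+r: from the right.
        H[k + r:, k:] -= 2.0 * np.outer(v, v @ H[k + r:, k:])
        H[:, k + r:] -= 2.0 * np.outer(H[:, k + r:] @ v, v)
    return H

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
H = reduce_to_r_hessenberg(A, r=2)   # first-stage target form
```

Because every update is an orthogonal similarity, the eigenvalues and the Frobenius norm of the matrix are preserved; the second stage then chases the remaining r - 1 subdiagonals away to reach true Hessenberg form.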
We describe and evaluate a new high-performance implementation of the two-stage approach that attains significant speedups over the one-stage approach on a dual quad-core machine. The key components are a dynamically scheduled implementation of the first stage and a blocked, adaptively load-balanced implementation of the second stage.