We are interested in solving linear, or matrix, system of arbitrary shape and size,

(1) A*x= b

Where the solution obtained from naïve method is badly behaved. This happens for various reasons, such as duplicated data, or other reasons for singularity. But the interesting case is when it happens because the right hand side has more data error than the system can tolerate and get a good answer. This comes up commonly in “inverse problems”, meaning problem in which we are trying to back up diffusive process in time. For example, in solving the “inverse heat problem” we are trying to determine what the heat profile of some linear arrangement was at an earlier time. Here is a graph of that ideal original hear profile in this problem:

Here is an example “solution” obtained by naively solving the 13 row by 17 column problem (see data files accompanying this document):

SVD

The natural way of solving any hard linear system is to start with the widely available Singular Value Decomposition:

(2)
A = U*S*V^{T}

Where U*U^{T} = I, U^{T}*U=I, V*V^{T}
=I, V^{T}*V=I, and S is a diagonal matrix of decreasing “singular
values”. We will ignore the minor aggravation of whether U and V and the full,
square matrices of theory, or a reduced shape which takes advantage of the
chance to save space by reducing the sizes of U and V when there are singular
values which are zero. Those matters are only of concern to the code developer.

We solve the system (1) as follows:

(3) A * x = b

(4)
U * S * V^{T
}* x = b

(5)
S * V^{T }* x = U^{T }* b

(6)
V^{T} *^{ }x = S^{+} * U^{T }*
b = PCV

(7)
x = V * S^{+} * U^{T }* b

Here S^{+} is the pseudo-inverse of S, meaning that
0 or negligible values of S[i] are treated instead as being infinity (∞).

This “pseudoinverse” approach only helps when the difficulty
is the presence of one or more “simple” singularities due to such problems as
duplicated rows. Sometimes that is enough. In statistical contexts the matrix V
* S^{+} * U^{T}_{ }is referred to as the
“pseudo-inverse of A and seems to be important.

The next step in trying to solve (1) is to set MORE elements
of S to zero (or ∞, whichever way you want to think of
it). For example, if the average accuracy of the elements of A is, say, 4
significant digits, then one might set all the singular values at the end of S
to zero that add together to less than 0.0001 times the average element of A,
or 0.0001 times S[1]. But this doesn’t really help in general because the
problem is usually with the error in b, not the error in A.

For our purposes zero singular values are not a difficulty… we always treat them as the pseudoinverse
does. The more important case is SMALL singular values.

Papers have been written for over thirty years on how to
automatically resolve the problem of excessively small singular values, or
“ill-conditioning” in linear equation systems. These problems are characterized
as having wildly unacceptable solutions, especially when the solution is
expected to be non-negative. And, even irrelevantly small changes to the
observed data, b, can cause the solution to be wildly different yet, and still
unacceptable.

I studied this problem many years ago as part of my Ph.D. under
the well-known Dr. Cleve Moler, the inventor of MATLAB. Many papers have been
published since then, with various approaches and degrees of success, and
various kinds of constraints or conditions. I have presented later versions of
my algorithm in C++ at appropriate conferences, and I also conferred with a
Russian mathematician on one of these competitive algorithm. My present
algorithm is not very “theoretical”, and not a very good candidate for
publication, but it is practical, efficient, and adaptable. For example, in my C++
package I allow any number of both equality constraints and inequality
constraints to be added to the problem. However, there is little general use
for such matters. In the current package we offer only optional one extra
constraint: that the solution be non-negative. With this perspective, let’s go
on:

Our interest in dealing with this ill-conditioning is in the
right hand side of (6), which we have labelled “PCV” for Picard Condition
Vector. Picard’s condition states that the elements of the PCV must
converge toward zero for the system to
be solvable stably. When the system (1) is ill-conditioned, that fact will show
here by the PCV growing inappropriately in its latter elements. This happens
because the elements of U^{T }*
b drop to zero FASTER that the singular values, S[i], drop to zero. In fact,
what happens is that the S[i] march steadily toward zero, as that is the nature
of the singular values in these case, whereas the latter elements of U^{T }*
b remain at a significantly high level due to noise in A and/or b. Here is an
example PCV for the small inverse heat problem we mentioned.

Obviously there is a real problem here! Next we zoom in to see the first few values:

All appears to behave well with the PCV for the first half
dozen elements. Then something wrong happens. Below is a spread sheet of the
singular values and the elements of U^{T }* b. The PCV is computed by
dividing the elements of U^{T }* b
in the second column by the singular values in the first column. That
quotient is in the third column. You can see that the singular values are
marching nicely to zero. But the U^{T }* b values seem to plateau at
random plus and minus values a little over 0.0001 in magnitude. Since the
latter values in U^{T }* b should – by Picard – have been two or three
orders of magnitude smaller than they are,
we can attribute maybe 99% of their magnitude to noise in A and b. In
fact, since U (and U^{T}) have
all ones for their singular values, the latter element of U^{T }* b are
irrelevantly transformed noise samples of b. So we can take the “latter values”
of U^{T }* b to be of the same magnitude as the original error in b.

NOW… the crux of our situation: if we knew an algorithm to
automatically tell WHERE to start in the second column we could automatically
estimate the root-mean-square, of the error in b – it is the same as the RMS of
these latter values in U^{T }* b. That would be very helpful because it
would let us then determine how much to regularize the solution process (more
on this later). You will notice that the choice of where to start counting the
second column estimates as error samples is not particularly critical on the
high side: we consider rows 6 to 13 to be error samples; or with little change,
consider rows 7 to 13 to be error samples; and in this problem we could also
use 8 to 13, and even further. But often we will not have to luxury of so many
rows of this table to work with. Many cases of small systems will have only a
couple of error samples: so they are precious and must not be wasted. On the
other hand, estimating the usable rows as 4 or 5 instead of 6 or 7 could be
catastrophic to the solution as that choice would unjustifiably increase the
RMS of the “error samples”. This would cause us to over-regularize the problem.

So you may be asking, if the “usable rank” of this problem is clearly around 5 to 7, why not just try setting all singular values past 5, 6, and 7 to zero and select one of those answers? For two reasons: we need an AUTOMATIC process. How would we choose between those three automatically? That is a difficult matter. And second, NONE of those three may be very good. We can do better than that, and automatically. (However, for the record, in this particular example zeroing singular value past rank 5 produces a useful answer almost as good as autosol’s result. This is equivalent to using a Generalized Inverse with a Singular Value cutoff of 0.004. But using a cutoff of 0.005 would product a bad solution.)

Before talking about our algorithm for determining “usable rank”, I want to emphasize that the above example is unusually clear. The PCV can vary drastically. I have seen it be roughly level for a few values, then steadily rise. It can oscillate between near zero and very positive and negative large values. It is very unpredictable. So our methods must be aware of the complexity and pitfalls.

We have developed for autosol two heuristic algorithms for
choosing that point in the PCV at which noise starts dominating the elements of
U^{T}*b. We have tuned the parameters with many problems for many
years. The first algorithm, called splita in the code, uses a pair of weighted
moving averages. One lazily follows successive values of the PCV which are **larger**
than its current value. The second lazily follows successive values of the PCV
which are **smaller** than its current value, but also VERY lazily follows
larger values of the PCV. When the first average becomes substantially larger
than the second average then autosol declares that point to be the end of any
possible useful data.

The second algorithm, called splitb in the code, re-analyses
the PCV data, up to the upper rank limit determined by splita, using a fixed
width, unweighted average of the SQUARES of the elements of the PCV. This
allows a closer look at when the values get too large. The fixed width that is
chosen depends on problem size but is never greater than 6. We look carefully
at these sums and stop when one successive moving sum is substantially larger
than the previous sum. This result from splitb becomes what we call the “usable
rank”, and values beyond that in U^{T}*b are taken to be samples of the
error in the original RHS, b.

So how do we use these noise values to find a solution? Simply: the RMS (root-mean-square) of them becomes the parameter SIGMA that we use to proceed with the solution. This process of ”Tikhonov regularization” can be presented several ways, but this straightforward method seems adequate:

To the original A*x=b, we add equations asking the x values to be zero. That is, we add

(8)
x_{i} = 0.0 , for all indices i

But we weight these extra equations by a small value which is usually called lambda, or λ.

(9)
λ x_{i} = 0.0 , for all indices i

We write this enhanced system in matrix form like this:

(10) A * x = b

(11) λ * I * x = 0 ( I is an appropriately sized identity matrix)

If λ is zero, there is no change to x. But the larger lambda
is, the more we will force the computed vector, x, from its nominal
pseudoinverse result toward a solution of all zeros. We choose which x we want with the
“discrepancy method”: that is, we search for a value of λ for which the
resulting solution, x(λ), satisfies this (where x(0) is the pseudoinverse
solution, for which λ is not present):

(12)
RMS( x(λ) -x(0) ) = sigma

That is, we, or rather, autosol asks that the new,
auto-regularized solution deviate from the unregularized solution by exactly
SIGMA on average (in RMS sense).

And that does it. The “Gordian knot” has been untied.

Finally, if the user so asks, we append a non-negativity algorithm
to the above solution process. This is as in the classic book, “Solving Least
Squares Problems” by Lawson & Hanson
(Hanson being another member of my dissertation committee) . Basically, we pick a negative result, and
zero its column. Then we resolve. That value will now be zero, and the solution
will have adapted to part to that change. Then you pick another to zero out.
The question is, what ORDER should these be picked? Our choice is to force the **most
negative** x_{i }to become zero. We iterate on this until there are
no non-roundoff negative solution values.

This is the resulting auto-regularized solution for the inverse
heat problem above:

And this is the result with the nonnegative option:

Note that the negative values didn’t just disappear: the solution
adjusted better represent the flat region in the first 10 samples. This may not
look like a great solution, but in most applications involving these ill-conditioned
systems, this is a fabulous automatic solution.

Return to Our Main Page.