We are interested in solving linear, or matrix, system of arbitrary shape and size,

(1) A*x= b

Where the solution obtained from naïve method is badly behaved. This happens for various reasons, such as duplicated data, or other reasons for singularity. But the interesting case is when it happens because the right hand side has more data error than the system can tolerate and get a good answer. This comes up commonly in “inverse problems”, meaning problem in which we are trying to back up diffusive process in time. For example, in solving the “inverse heat problem” we are trying to determine what the heat profile of some linear arrangement was at an earlier time. Here is a graph of that ideal original hear profile in this problem:

Here is an example “solution” obtained by naively solving the 13 row by 17 column problem (see data files accompanying this document):

SVD

The natural way of solving any hard linear system is to start with the widely available Singular Value Decomposition:

(2)
A = U*S*V^{T}

Where U*U^{T} = I, U^{T}*U=I, V*V^{T}
=I, V^{T}*V=I, and S is a diagonal matrix of decreasing “singular
values”. We will ignore the minor aggravation of whether U and V and the full,
square matrices of theory, or a reduced shape which takes advantage of the
chance to save space by reducing the sizes of U and V when there are singular
values which are zero. Those matters are only of concern to the code developer.

We solve the system (1) as follows:

(3) A * x = b

(4)
U * S * V^{T
}* x = b

(5)
S * V^{T }* x = U^{T }* b

(6)
V^{T} *^{ }x = S^{+} * U^{T }*
b = PCV

(7)
x = V * S^{+} * U^{T }* b

Here S^{+} is the pseudo-inverse of S, meaning that
0 or negligible values of S[i] are treated instead as
being infinity (∞).

This “pseudoinverse” approach only helps when the difficulty
is the presence of one or more “simple” singularities due to such problems as
duplicated rows. Sometimes that is enough. In statistical contexts the matrix V
* S^{+} * U^{T}_{ }is referred to as the
“pseudo-inverse of A and seems to be important.

The next step in trying to solve (1) is to set MORE elements
of S to zero (or ∞, whichever way you want to think of
it). For example, if the average accuracy of the elements of A is, say, 4
significant digits, then one might set all the singular values at the end of S
to zero that add together to less than 0.0001 times the average element of A,
or 0.0001 times S[1]. But this doesn’t really help in
general because the problem is usually with the error in b, not the error in A.

For our purposes zero singular values are not a difficulty… we always treat them as the pseudoinverse
does. The more important case is SMALL singular values.

Papers have been written for over thirty years on how to
automatically resolve the problem of excessively small singular values, or
“ill-conditioning” in linear equation systems. These problems are characterized
as having wildly unacceptable solutions, especially when the solution is
expected to be non-negative. And, even irrelevantly small changes to the
observed data, b, can cause the solution to be wildly different yet, and still
unacceptable.

I studied this problem many years ago as part of my Ph.D. under
the well-known Dr. Cleve Moler, the inventor of
MATLAB. Many papers have been published since then, with various approaches and
degrees of success, and various kinds of constraints or conditions. I have
presented later versions of my algorithm in C++ at appropriate conferences, and
I also conferred with a Russian mathematician on one of these competitive algorithm. My present algorithm is not very “theoretical”,
and not a very good candidate for publication, but it
is practical, efficient, and adaptable. For example, in my C++ package I allow
any number of both equality constraints and inequality constraints to be added
to the problem. However, there is little general use for such matters. In the
current package we offer only optional one extra constraint: that the solution be
non-negative. With this perspective, let’s go on:

Our interest in dealing with this ill-conditioning is in the
right hand side of (6), which we have labelled “PCV”
for Picard Condition Vector. Picard’s condition states that the elements of the
PCV must converge toward zero for the
system to be solvable stably. When the system (1) is ill-conditioned, that fact
will show here by the PCV growing inappropriately in its latter elements. This
happens because the elements of U^{T }*
b drop to zero FASTER that the singular values, S[i],
drop to zero. In fact, what happens is that the S[i]
march steadily toward zero, as that is the nature of the singular values in
these case, whereas the latter elements of U^{T }* b remain at a
significantly high level due to noise in A and/or b. Here is an example PCV for
the small inverse heat problem we mentioned.

Obviously there is a real problem here! Next we zoom in to see the first few values:

All appears to behave well with the PCV for the first half
dozen elements. Then something wrong happens. Below is a spread sheet of the
singular values and the elements of U^{T }* b. The PCV is computed by
dividing the elements of U^{T }* b
in the second column by the singular values in the first column. That
quotient is in the third column. You can see that the singular values are
marching nicely to zero. But the U^{T }* b values seem to plateau at
random plus and minus values a little over 0.0001 in magnitude. Since the
latter values in U^{T }* b should – by Picard – have been two or three
orders of magnitude smaller than they are,
we can attribute maybe 99% of their magnitude to noise in A and b. In
fact, since U (and U^{T}) have
all ones for their singular values, the latter element of U^{T }* b are
irrelevantly transformed noise samples of b. So we can
take the “latter values” of U^{T }* b to be of the same magnitude as
the original error in b.

NOW… the crux of our situation: if we knew an algorithm to
automatically tell WHERE to start in the second column we could automatically
estimate the root-mean-square, of the error in b – it is the same as the RMS of
these latter values in U^{T }* b. That would be very
helpful because it would let us then determine how much to regularize
the solution process (more on this later). You will notice that the choice of
where to start counting the second column estimates as error samples is not
particularly critical on the high side: we consider rows 6 to 13 to be error
samples; or with little change, consider rows 7 to 13 to be error samples; and
in this problem we could also use 8 to 13, and even further. But often we will
not have to luxury of so many rows of this table to work with. Many cases of
small systems will have only a couple of error samples: so
they are precious and must not be wasted. On the other hand, estimating the
usable rows as 4 or 5 instead of 6 or 7 could be catastrophic to the solution
as that choice would unjustifiably increase the RMS of the “error samples”. This
would cause us to over-regularize the problem.

So you may be asking, if the “usable rank” of this problem is clearly around 5 to 7, why not just try setting all singular values past 5, 6, and 7 to zero and select one of those answers? For two reasons: we need an AUTOMATIC process. How would we choose between those three automatically? That is a difficult matter. And second, NONE of those three may be very good. We can do better than that, and automatically. (However, for the record, in this particular example zeroing singular value past rank 5 produces a useful answer almost as good as autorej’s result. This is equivalent to using a Generalized Inverse with a Singular Value cutoff of 0.004. But using a cutoff of 0.005 would product a bad solution.)

Before talking about our algorithm for determining “usable rank”, I want to emphasize that the above example is unusually clear. The PCV can vary drastically. I have seen it be roughly level for a few values, then steadily rise. It can oscillate between near zero and very positive and negative large values. It is very unpredictable. So our methods must be aware of the complexity and pitfalls.

We have developed for autorej two
heuristic algorithms for choosing that point in the PCV at which noise starts
dominating the elements of U^{T}*b. We have tuned the parameters with
many problems for many years. The first algorithm, called splita
in the code, uses a pair of weighted moving averages. One lazily follows
successive values of the PCV which are **larger** than its current value.
The second lazily follows successive values of the PCV which are **smaller**
than its current value, but also VERY lazily follows larger values of the PCV.
When the first average becomes substantially larger than the second average
then autorej declares that point to be the end of any
possible useful data.

The second algorithm, called splitb
in the code, re-analyses the PCV data, up to the upper rank limit determined by
splita, using a fixed width, unweighted average of
the SQUARES of the elements of the PCV. This allows a closer look at when the
values get too large. The fixed width that is chosen depends on problem size
but is never greater than 6. We look carefully at these sums and stop when one
successive moving sum is substantially larger than the previous sum. This
result from splitb becomes what we call the “usable
rank”, and values beyond that in U^{T}*b are taken to be samples of the
error in the original RHS, b.

So how do we use these noise values to find a solution? Simply: the RMS (root-mean-square) of them becomes the parameter SIGMA that we use to proceed with the solution. This process of ”Tikhonov regularization” can be presented several ways, but this straightforward method seems adequate:

To the original A*x=b, we add equations asking the x values to be zero. That is, we add

(8)
x_{i} = 0.0 , for all indices i

But we weight these extra equations by a small value which is usually called lambda, or λ.

(9)
λ x_{i} = 0.0 , for all indices i

We write this enhanced system in matrix form like this:

(10) A * x = b

(11) λ * I * x = 0 ( I is an appropriately sized identity matrix)

If λ is zero, there is no change to x. But the larger lambda
is, the more we will force the computed vector, x, from its nominal
pseudoinverse result toward a solution of all zeros. We choose which x we want with the
“discrepancy method”: that is, we search for a value of λ for which the
resulting solution, x(λ), satisfies this (where x(0) is the pseudoinverse
solution, for which λ is not present):

(12)
RMS( x(λ) -x(0) ) = sigma

That is, we, or rather, autorej asks
that the new, auto-regularized solution deviate from the unregularized solution
by exactly SIGMA on average (in RMS sense).

And that does it. The “Gordian knot” has been untied.

Finally, if the user so asks, we append a non-negativity algorithm
to the above solution process. This is as in the classic book, “Solving Least
Squares Problems” by Lawson & Hanson
(Hanson being another member of my dissertation committee) . Basically, we pick a negative result, and
zero its column. Then we resolve. That value will now be zero, and the solution
will have adapted to part to that change. Then you pick another to zero out.
The question is, what ORDER should these be picked? Our choice is to force the **most
negative** x_{i }to become zero. We iterate on this until there are
no non-roundoff negative solution values.

This is the resulting auto-regularized solution for the inverse
heat problem above:

And this is the result with the nonnegative option:

Note that the negative values didn’t just
disappear: the solution adjusted better represent the flat region in the first
10 samples. This may not look like a great solution, but in most applications
involving these ill-conditioned systems, this is a fabulous automatic solution.

Return to main autorej
page.