# NavList

## A Community Devoted to the Preservation and Practice of Celestial Navigation and Other Methods of Traditional Wayfinding

**Re: Rejecting outliers**

**From:** Peter Hakel

**Date:** 2011 Jan 2, 17:17 -0800

George,

I forgot to mention that you can unlock the spreadsheet by turning off its protection; there is no password. I attach the unlocked version, so you can skip that step. The PNG file is an image, a screenshot of the color-coded portion of the spreadsheet where the input and output data are concentrated. This is the part with which a user would interact; I attached it for those readers who may be interested in the main points but don't want to bother with Excel in detail. My Excel is Office 2004 for Mac, so hopefully compatibility will not be a problem.

I gave the details of the procedure in:

http://www.fer3.com/arc/m2.aspx?i=115086&y=201012

Step 1: Calculate the standard (non-weighted) least-squares linear fit through the data.

Now iterate:

Step 2: Calculate altitude differences "diff" between the data and the latest available linear fit.

Step 3: Calculate new weights as 1 / diff^2 for each data point.

Step 4: Calculate a new linear fit using weights from Step 3.

Repeat until convergence.

"diff" could turn out small, or even zero, which would cause numerical problems; that is why the weights have a ceiling, controlled by the "Scatter" parameter. A weight of 1.000 means that the data point has hit this ceiling and contributes to the result with the maximum influence allowed by the procedure. For Scatter = 7.0', all nine data points reach this ceiling and we are stuck at Step 1.
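The steps above can be sketched in Python. This is a minimal illustration of the reweighting loop; the function names, and the normalization of the weights so that the ceiling is 1.000, are one reading of the description, not the spreadsheet's exact formulas:

```python
# Sketch of the iterative reweighting procedure (Steps 1-4).
# Times ts and altitudes hs are parallel lists; "scatter" is the
# Scatter parameter, in the same units as the altitudes.

def weighted_line_fit(ts, hs, ws):
    """Weighted least-squares fit h = a + b*t; returns (a, b)."""
    total = sum(ws)
    tbar = sum(w * t for w, t in zip(ws, ts)) / total
    hbar = sum(w * h for w, h in zip(ws, hs)) / total
    stt = sum(w * (t - tbar) ** 2 for w, t in zip(ws, ts))
    sth = sum(w * (t - tbar) * (h - hbar) for w, t, h in zip(ws, ts, hs))
    b = sth / stt
    return hbar - b * tbar, b

def irls_fit(ts, hs, scatter, iterations=30):
    # Step 1: standard (unit-weight) least-squares fit.
    a, b = weighted_line_fit(ts, hs, [1.0] * len(ts))
    for _ in range(iterations):
        # Step 2: residuals "diff" against the latest fit.
        diffs = [h - (a + b * t) for t, h in zip(ts, hs)]
        # Step 3: weights 1/diff^2, capped so that |diff| <= scatter
        # gives the maximum (normalized) weight of 1.000.
        ws = [min(1.0, (scatter / d) ** 2) if d != 0 else 1.0
              for d in diffs]
        # Step 4: refit with the new weights; repeat until convergence.
        a, b = weighted_line_fit(ts, hs, ws)
    return a, b
```

Note that with a large Scatter every weight hits the ceiling of 1.000, so the loop reproduces the plain unweighted fit of Step 1, exactly as described for Scatter = 7.0'.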

I have not gotten around to fitting Gaussian-scattered data, as you suggested. I may do so in the future, time permitting.

The procedure of weighted least squares has a very solid theoretical background; see, e.g.,

http://en.wikipedia.org/wiki/Least_squares#Weighted_least_squares

http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares

The one admittedly heuristic detail is Step 3: the weight should be 1/variance, but the variance is unknown in this case. Step 3 seems like a reasonable replacement, effectively substituting |diff| for the standard deviation of altitudes at the given UT.

This procedure is capable of eliminating lopsidedness to a certain extent, as I have shown previously. However, if there are too many "lopsided" data points, the result will follow them to the new "middle" defined by them. I don't know how we would weed that out without additional information about where the correct "middle" really is. As I said earlier, in the absence of an independent check, we must rely on the assumption that a sufficient majority of data points are "good."

My motivation was to see what information can be extracted from the data set alone, without any additional information such as DR. I think that Peter Fogg's approach of precomputing the slope is fine, and would most likely give a better practical result. After all, "position tracking" is preferable to "position establishing from scratch" and it is indeed what happens in real life. But I think you will agree that academic curiosity has its benefits, too. :-)

Peter Hakel


**From:** George Huxtable <george{at}hux.me.uk>

**To:** NavList@fer3.com

**Sent:** Sun, January 2, 2011 3:12:28 PM

**Subject:** [NavList] Re: Rejecting outliers

I'm failing to understand some aspects of Peter Hakel's approach, though at least it seems numerical, definable, and repeatable.

First, though, I should say that problems arise when viewing his Excel output. It looks as though some columns need to be widened to see their contents, but it seems that the file is deliberately crippled to prevent me taking many of the actions I am familiar with in Excel. But I should add that mine is an old 2000 version, which may be part of the problem. The alternative .png version opens with "paint" but only allows me to see the top-left area of the sheet. Is that all I need?

Now to more substantive matters:

I don't understand how the various weighting factors have been derived, and why many of them are exactly 1.0000, when others are much less. Presumably, they depend on the divergence of each data point from some calculated line, which is then readjusted by iteration, but I have failed to follow how that initial straight-line norm was assessed, or what algorithm was used to obtain the weights. Answers in words rather than in statistical symbols would be most helpful to my simple mind.

You seem to have ended up with a best-fit slope of about 24' in a 5-minute period, as I did when allowing Excel to make a best fit with freedom to alter the slope as it thought fit. But the slope can be pre-assessed with sufficient accuracy from known information, and unless there is some error in the information given, such as an (unlikely) star mis-identification, we can be sure that the actual slope is nearer to 32', and the apparent lower figure is no more than a product of scatter in the data. This is a point that Peter Fogg keeps reiterating, perhaps the only valid point in all he has written on this topic.

As a result, we could, if we wished, subtract off that line of known constant slope from all the data, and end up with a set of numbers, all of which should be roughly equal, simply scattering around some mean value that we wish to discover. Then the statistical task of weeding outliers becomes somewhat simpler.
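The detrending idea can be sketched as follows. This is a minimal illustration: the 32'-per-5-minutes slope is taken from the discussion above, while the helper names, the sample numbers, and the 2.0' rejection limit are invented for the example.

```python
# Subtract the known, pre-assessed slope from each sight, so the
# corrected values should all scatter about one mean altitude.
# Times in minutes from the first sight; altitudes in arcminutes.

SLOPE = 32.0 / 5.0  # assumed known slope: 32' over a 5-minute period

def detrend(times, alts, slope=SLOPE):
    """Remove the body's known motion, leaving values near one mean."""
    t0 = times[0]
    return [h - slope * (t - t0) for t, h in zip(times, alts)]

def weed_and_average(values, limit=2.0):
    """Drop points more than `limit` arcminutes from the median,
    then take the simple mean of what remains."""
    mid = sorted(values)[len(values) // 2]
    kept = [v for v in values if abs(v - mid) <= limit]
    return sum(kept) / len(kept)
```

Once the known slope is removed, the outlier question reduces to the familiar one-dimensional problem of rejecting points that stray too far from a common mean.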

=================

If you apply your procedure to an artificially generated data set, scattering in a known Gaussian manner about a known mean (which could be zero), and known to contain no non-Gaussian outliers, what is the resulting scatter in the answer? How does it compare with the predicted scatter from simple averaging? I suspect (predict) that it can only be worse, though perhaps not by much.

This is the way I picture it. If there is any lopsidedness in the distribution, then each added observation that is in the direction of the imbalance will be given the greatest weight, whereas any observation that would act to rebalance it, on the other side, will be attenuated, being further from the trend. So there will be some effect, however small, that acts to enhance any imbalance, though probably not to the extent of causing instability. Does that argument make sense to you? It could be checked out by some Monte Carlo procedure.
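One way to run such a Monte Carlo check is sketched below, using a one-dimensional iteratively reweighted mean (the detrended case) as a stand-in for the full line fit. The trial count, sample size, sigma, and Scatter value are arbitrary choices for illustration, not anything from the thread:

```python
import random

def reweighted_mean(xs, scatter, iterations=20):
    """Iteratively reweighted mean, weights min(1, (scatter/diff)^2)."""
    m = sum(xs) / len(xs)  # start from the simple average
    for _ in range(iterations):
        ws = [min(1.0, (scatter / (x - m)) ** 2) if x != m else 1.0
              for x in xs]
        m = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
    return m

def monte_carlo(trials=2000, n=9, sigma=1.0, scatter=2.0, seed=0):
    """RMS error of the simple mean vs the reweighted mean, on pure
    Gaussian scatter about zero with no genuine outliers present."""
    rng = random.Random(seed)
    se_mean = se_rw = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        se_mean += (sum(xs) / n) ** 2
        se_rw += reweighted_mean(xs, scatter) ** 2
    return (se_mean / trials) ** 0.5, (se_rw / trials) ** 0.5
```

On pure Gaussian data the simple mean is the statistically efficient estimator, so the prediction here is that the reweighted figure should come out slightly worse; running the sketch lets one measure by how much, for any chosen Scatter.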

I presume that the proposed procedure is entirely empirical, and has no theoretical backing, though there may not be anything wrong with that, if it works.

George.

contact George Huxtable, at george{at}hux.me.uk

or at +44 1865 820222 (from UK, 01865 820222)

or at 1 Sandy Lane, Southmoor, Abingdon, Oxon OX13 5HX, UK.