# NavList

## A Community Devoted to the Preservation and Practice of Celestial Navigation and Other Methods of Traditional Wayfinding

**Re: Rejecting outliers**

**From:** Peter Hakel

**Date:** 2011 Jan 2, 17:17 -0800

George,

I forgot to mention that you can unlock the spreadsheet by turning off its protection; there is no password. I attach the unlocked version, so you can skip that step. The PNG file is an image, a screenshot of the color-coded portion of the spreadsheet where the input and output data are concentrated. This is the part with which a user would interact; I attached it for those readers who may be interested in the main points but don't want to bother with Excel in detail. My Excel is Office 2004 for Mac, so hopefully compatibility will not be a problem.

I gave the details of the procedure in:

http://www.fer3.com/arc/m2.aspx?i=115086&y=201012

Step 1: Calculate the standard (non-weighted) least-squares linear fit through the data.

Now iterate:

Step 2: Calculate altitude differences "diff" between the data and the latest available linear fit.

Step 3: Calculate new weights as 1 / diff^2 for each data point.

Step 4: Calculate a new linear fit using weights from Step 3.

Repeat until convergence.

"diff" could turn out small, or even zero, which would cause numerical problems; that is why the weights have a ceiling, controlled by the "Scatter" parameter. A weight of 1.000 means that the data point has hit this ceiling and contributes to the result with the maximum influence allowed by the procedure. For Scatter = 7.0', all nine data points reach this ceiling and we are stuck at Step 1.
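The steps above can be sketched in Python. This is a minimal illustration of the reweighting loop; the function names, and the normalization of the weights so that the ceiling is 1.000, are one reading of the description, not the spreadsheet's exact formulas:

```python
# Sketch of the iterative reweighting procedure (Steps 1-4).
# Times ts and altitudes hs are parallel lists; "scatter" is the
# Scatter parameter, in the same units as the altitudes.

def weighted_line_fit(ts, hs, ws):
    """Weighted least-squares fit h = a + b*t; returns (a, b)."""
    total = sum(ws)
    tbar = sum(w * t for w, t in zip(ws, ts)) / total
    hbar = sum(w * h for w, h in zip(ws, hs)) / total
    stt = sum(w * (t - tbar) ** 2 for w, t in zip(ws, ts))
    sth = sum(w * (t - tbar) * (h - hbar) for w, t, h in zip(ws, ts, hs))
    b = sth / stt
    return hbar - b * tbar, b

def irls_fit(ts, hs, scatter, iterations=30):
    # Step 1: standard (unit-weight) least-squares fit.
    a, b = weighted_line_fit(ts, hs, [1.0] * len(ts))
    for _ in range(iterations):
        # Step 2: residuals "diff" against the latest fit.
        diffs = [h - (a + b * t) for t, h in zip(ts, hs)]
        # Step 3: weights 1/diff^2, capped so that |diff| <= scatter
        # gives the maximum (normalized) weight of 1.000.
        ws = [min(1.0, (scatter / d) ** 2) if d != 0 else 1.0
              for d in diffs]
        # Step 4: refit with the new weights; repeat until convergence.
        a, b = weighted_line_fit(ts, hs, ws)
    return a, b
```

Note that with a large Scatter every weight hits the ceiling of 1.000, so the loop reproduces the plain unweighted fit of Step 1, exactly as described for Scatter = 7.0'.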

I have not gotten around to fitting Gaussian-scattered data, as you suggested. I may do so in the future, time permitting.

The procedure of weighted least squares has a very solid theoretical background; see, e.g.,

http://en.wikipedia.org/wiki/Least_squares#Weighted_least_squares

http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares

The one admittedly heuristic detail is Step 3: the weight should be 1/variance, but the variance is unknown in this case. Step 3 seems like a reasonable replacement, effectively substituting |diff| for the standard deviation of altitudes at the given UT.

This procedure is capable of eliminating lopsidedness to a certain extent, as I have shown previously. However, if there are too many "lopsided" data points, the result will follow them to the new "middle" defined by them. I don't know how we would weed that out without additional information about where the correct "middle" really is. As I said earlier, in the absence of an independent check, we must rely on the assumption that a sufficient majority of data points are "good."

My motivation was to see what information can be extracted from the data set alone, without any additional information such as DR. I think that Peter Fogg's approach of precomputing the slope is fine, and would most likely give a better practical result. After all, "position tracking" is preferable to "position establishing from scratch" and it is indeed what happens in real life. But I think you will agree that academic curiosity has its benefits, too. :-)

Peter Hakel


**From:** George Huxtable <george{at}hux.me.uk>

**To:** NavList@fer3.com

**Sent:** Sun, January 2, 2011 3:12:28 PM

**Subject:** [NavList] Re: Rejecting outliers

I'm failing to understand some aspects of Peter Hakel's approach, though at least it seems numerical, definable, and repeatable.

First, though, I should say that problems arise when viewing his Excel output. It looks as though some columns need to be widened to see their contents, but it seems that the file is deliberately crippled to prevent me taking many of the actions I am familiar with in Excel. But I should add that mine is an old 2000 version, which may be part of the problem. The alternative .png version opens with "paint" but only allows me to see the top-left area of the sheet. Is that all I need?

Now to more substantive matters:

I don't understand how the various weighting factors have been derived, and why many of them are exactly 1.0000, when others are much less. Presumably, they depend on the divergence of each data point from some calculated line, which is then readjusted by iteration, but I have failed to follow how that initial straight-line norm was assessed, or what algorithm was used to obtain the weights. Answers in words rather than in statistical symbols would be most helpful to my simple mind.

You seem to have ended up with a best-fit slope of about 24' in a 5-minute period, as I did when allowing Excel to make a best fit with freedom to alter the slope as it thought fit. But the slope can be pre-assessed with sufficient accuracy from known information, and unless there is some error in the information given, such as an (unlikely) star mis-identification, we can be sure that the actual slope is nearer to 32', and the apparent lower figure is no more than a product of scatter in the data. This is a point that Peter Fogg keeps reiterating, perhaps the only valid point in all he has written on this topic.

As a result, we could, if we wished, subtract off that line of known constant slope from all the data, and end up with a set of numbers, all of which should be roughly equal, simply scattering around some mean value that we wish to discover. Then the statistical task of weeding outliers becomes somewhat simpler.
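The detrending idea can be sketched as follows. This is a minimal illustration: the 32'-per-5-minutes slope is taken from the discussion above, while the helper names, the sample numbers, and the 2.0' rejection limit are invented for the example.

```python
# Subtract the known, pre-assessed slope from each sight, so the
# corrected values should all scatter about one mean altitude.
# Times in minutes from the first sight; altitudes in arcminutes.

SLOPE = 32.0 / 5.0  # assumed known slope: 32' over a 5-minute period

def detrend(times, alts, slope=SLOPE):
    """Remove the body's known motion, leaving values near one mean."""
    t0 = times[0]
    return [h - slope * (t - t0) for t, h in zip(times, alts)]

def weed_and_average(values, limit=2.0):
    """Drop points more than `limit` arcminutes from the median,
    then take the simple mean of what remains."""
    mid = sorted(values)[len(values) // 2]
    kept = [v for v in values if abs(v - mid) <= limit]
    return sum(kept) / len(kept)
```

Once the known slope is removed, the outlier question reduces to the familiar one-dimensional problem of rejecting points that stray too far from a common mean.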

=================

If you apply your procedure to an artificially generated data set, scattering in a known Gaussian manner about a known mean (which could be zero), and known to contain no non-Gaussian outliers, what is the resulting scatter in the answer? How does it compare with the predicted scatter from simple averaging? I suspect (predict) that it can only be worse, though perhaps not by much.

This is the way I picture it. If there is any lopsidedness in the distribution, then each added observation that is in the direction of the imbalance will be given the greatest weight, whereas any observation that would act to rebalance it, on the other side, will be attenuated, being further from the trend. So there will be some effect, however small, that acts to enhance any imbalance, though probably not to the extent of causing instability. Does that argument make sense to you? It could be checked out by some Monte Carlo procedure.
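One way to run such a Monte Carlo check is sketched below, using a one-dimensional iteratively reweighted mean (the detrended case) as a stand-in for the full line fit. The trial count, sample size, sigma, and Scatter value are arbitrary choices for illustration, not anything from the thread:

```python
import random

def reweighted_mean(xs, scatter, iterations=20):
    """Iteratively reweighted mean, weights min(1, (scatter/diff)^2)."""
    m = sum(xs) / len(xs)  # start from the simple average
    for _ in range(iterations):
        ws = [min(1.0, (scatter / (x - m)) ** 2) if x != m else 1.0
              for x in xs]
        m = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
    return m

def monte_carlo(trials=2000, n=9, sigma=1.0, scatter=2.0, seed=0):
    """RMS error of the simple mean vs the reweighted mean, on pure
    Gaussian scatter about zero with no genuine outliers present."""
    rng = random.Random(seed)
    se_mean = se_rw = 0.0
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        se_mean += (sum(xs) / n) ** 2
        se_rw += reweighted_mean(xs, scatter) ** 2
    return (se_mean / trials) ** 0.5, (se_rw / trials) ** 0.5
```

On pure Gaussian data the simple mean is the statistically efficient estimator, so the prediction here is that the reweighted figure should come out slightly worse; running the sketch lets one measure by how much, for any chosen Scatter.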

I presume that the proposed procedure is entirely empirical, and has no theoretical backing, though there may not be anything wrong with that, if it works.

George.

contact George Huxtable, at george{at}hux.me.uk

or at +44 1865 820222 (from UK, 01865 820222)

or at 1 Sandy Lane, Southmoor, Abingdon, Oxon OX13 5HX, UK.