    Re: Rejecting outliers
    From: Peter Hakel
    Date: 2011 Jan 2, 17:17 -0800
    George,

    I forgot to mention that you can unlock the spreadsheet by turning off its protection; there is no password.  I attach the unlocked version, so you can skip that step.  The PNG file is an image, a screenshot of the color-coded portion of the spreadsheet where the input and output data are concentrated.  This is the part with which a user would interact; I attached it for those readers who may be interested in the main points but don't want to bother with Excel in detail.  My Excel is Office 2004 for Mac, so hopefully compatibility will not be a problem.

    I gave the details of the procedure in:

    http://www.fer3.com/arc/m2.aspx?i=115086&y=201012

    Step 1: Calculate the standard (non-weighted) least-squares linear fit through the data.

    Now iterate:
    Step 2: Calculate altitude differences "diff" between the data and the latest available linear fit.
    Step 3: Calculate new weights as 1 / diff^2 for each data point.
    Step 4: Calculate a new linear fit using weights from Step 3.
    Repeat until convergence.

    "diff" could turn up small, or even zero, which would cause numerical problems; that is why the weights have a ceiling controlled by the "Scatter" parameter.  The weight=1.000 means that the data point has hit this ceiling and contributes to the result with maximum influence allowed by the procedure.  For Scatter=7.0', all nine data points reach this ceiling and we are stuck at Step 1.

    I have not gotten around to fitting Gaussian-scattered data, as you suggested.  I may do so in the future, time permitting.

    The procedure of weighted least squares has a very solid theoretical background; see, e.g.,

    http://en.wikipedia.org/wiki/Least_squares#Weighted_least_squares
    http://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares

    The one admittedly heuristic detail is Step 3: the weight should be 1/variance, but the variance is unknown in this case.  Step 3 seems like a reasonable replacement, effectively substituting |diff| for the standard deviation of altitudes at the given UT.
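
    The 1/variance rule is easiest to see in its simplest setting, a weighted mean, i.e. a weighted least-squares fit of a constant.  The numbers below are made up purely for illustration:

        import numpy as np

        # Inverse-variance weighting: with known uncertainties the optimal weight
        # of each measurement is 1/sigma^2.  In the procedure above the sigmas are
        # unknown, so Step 3 substitutes |diff| from the current fit.
        h     = np.array([10.2, 10.5,  9.9])   # hypothetical measurements
        sigma = np.array([ 0.5,  1.0,  0.5])   # hypothetical known uncertainties
        w = 1.0 / sigma ** 2
        print((w * h).sum() / w.sum())         # the less certain 10.5 counts for less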

    This procedure is capable of eliminating lopsidedness to a certain extent, as I have shown previously.  However, if there are too many "lopsided" data points, the result will follow them to the new "middle" defined by them.  I don't know how we would weed that out without additional information about where the correct "middle" really is.  As I said earlier, in the absence of an independent check, we must rely on the assumption that a sufficient majority of data points are "good."

    My motivation was to see what information can be extracted from the data set alone, without any additional information such as DR.  I think that Peter Fogg's approach of precomputing the slope is fine, and would most likely give a better practical result.  After all, "position tracking" is preferable to "establishing a position from scratch," and it is indeed what happens in real life.  But I think you will agree that academic curiosity has its benefits, too. :-)


    Peter Hakel



    From: George Huxtable <george@hux.me.uk>
    To: NavList@fer3.com
    Sent: Sun, January 2, 2011 3:12:28 PM
    Subject: [NavList] Re: Rejecting outliers

    I'm failing to understand some aspects of Peter Hakel's approach, though at
    least it seems numerical, definable, and repeatable.

    First, though, I should say that problems arise when viewing his Excel
    output. It looks as though some columns need to be widened to see their
    contents, but it seems that the file is deliberately crippled to prevent me
    taking many of the actions I am familiar with in Excel. But I should add
    that mine is an old 2000 version, which may be part of the problem. The
    alternative .png version opens with "paint" but only allows me to see the
    top-left area of the sheet. Is that all I need?

    Now to more substantive matters-

    I don't understand how the various weighting factors have been derived, and
    why many of them are exactly 1.0000, when others are much less.

    Presumably, they depend on the divergence of each data point from some
    calculated line, which is then readjusted by iteration, but I have failed
    to follow how that initial straight-line norm was assessed, or what
    algorithm was used to obtain the weights. Answers in words rather than in
    statistical symbols would be most helpful to my simple mind.

    You seem to have ended up with a best-fit slope of about 24' in a 5-minute
    period, as I did when allowing Excel to make a best fit, giving it freedom to
    alter the slope as it thought fit. But the slope can be
    pre-assessed with sufficient accuracy from known information, and unless
    there is some error in the information given, such as an (unlikely) star
    mis-identification, we can be sure that the actual slope is nearer to 32',
    and the apparent lower figure is no more than a product of scatter in the
    data. This is a point that Peter Fogg keeps reiterating, perhaps the only
    valid point in all he has written on this topic.

    As a result, we could, if we wished, subtract off that line of known
    constant slope from all the data, and end up with a set of numbers, all of
    which should be roughly equal, simply scattering around some mean value
    that we wish to discover. Then the statistical task of weeding outliers
    becomes somewhat simpler.
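
    A rough sketch of that idea in Python/NumPy, with placeholder numbers (nine sights over five minutes, the 32'-per-5-minutes slope, and an entirely arbitrary 3' rejection threshold) rather than the actual data set:

        import numpy as np

        rng = np.random.default_rng(0)
        t = np.linspace(0.0, 5.0, 9)                  # sight times, minutes
        h = 6.4 * t + rng.normal(0.0, 1.0, t.size)    # fake altitudes, arcminutes (32' per 5 min)

        known_slope = 32.0 / 5.0                      # pre-assessed slope, arcminutes per minute
        residual = h - known_slope * t                # subtract the line of known constant slope
        # What remains should simply scatter about one mean value, so weeding
        # outliers becomes a one-dimensional problem, e.g. reject anything far
        # from the median before averaging.
        center = np.median(residual)
        keep = np.abs(residual - center) < 3.0        # arbitrary 3' cut-off
        mean_value = residual[keep].mean()            # the mean value we wish to discover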

    =================

    If you apply your procedure to an artificially generated data set,
    scattering in a known Gaussian manner about a known mean (which could be
    zero), and known to contain no non-Gaussian outliers, what is the resulting
    scatter in the answer? How does it compare with the predicted scatter from
    simple averaging? I suspect (predict) that it can only be worse, though
    perhaps not by much.

    This is the way I picture it. If there is any lopsidedness in the
    distribution, then each added observation that is in the direction of the
    imbalance will be given greatest weight, whereas any observation that would
    act to rebalance it, on the other side, will be attenuated, being further
    from the trend. So there will be some effect, however small, that acts to
    enhance any imbalance, though probably not to the extent of causing
    instability. Does that argument make sense to you? It could be checked out
    by some Monte Carlo procedure.
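
    That Monte Carlo check could be set up along the following lines: pure Gaussian scatter about a known (zero) line, the same capped reweighting as in the sketch earlier in this message, and arbitrary choices for sigma, Scatter, and the trial count.

        import numpy as np

        def wfit(t, h, w):
            """Weighted least-squares line fit; returns (slope, intercept)."""
            tbar, hbar = (w * t).sum() / w.sum(), (w * h).sum() / w.sum()
            slope = (w * (t - tbar) * (h - hbar)).sum() / (w * (t - tbar) ** 2).sum()
            return slope, hbar - slope * tbar

        rng = np.random.default_rng(1)
        t = np.linspace(0.0, 5.0, 9)
        sigma, scatter, trials = 1.0, 0.5, 20000
        plain, reweighted = [], []
        for _ in range(trials):
            h = rng.normal(0.0, sigma, t.size)           # true line is h = 0: no outliers
            a, b = wfit(t, h, np.ones_like(h))
            plain.append(b + a * t.mean())               # unweighted fit at mid-time = simple average
            for _ in range(10):                          # iterative reweighting
                d = h - (a * t + b)
                w = scatter ** 2 / np.maximum(d ** 2, scatter ** 2)
                a, b = wfit(t, h, w)
            reweighted.append(b + a * t.mean())
        print(np.std(plain), np.std(reweighted))         # compare the scatter of the two answers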

    I presume that the proposed procedure is entirely empirical, and has no
    theoretical backing, though there may not be anything wrong with that, if
    it works.

    George.

    contact George Huxtable, at  george@hux.me.uk
    or at +44 1865 820222 (from UK, 01865 820222)
    or at 1 Sandy Lane, Southmoor, Abingdon, Oxon OX13 5HX, UK.





    File: 115122.average2.xls
       