Google Answers: Linear Regression C++ Code (incremental)

View Question

Q: Linear Regression C++ Code (incremental) ( No Answer, 4 Comments )

Question

Subject: Linear Regression C++ Code (incremental)
Category: Computers > Algorithms
Asked by: stevemadere2-ga
List Price: $100.00

Posted: 14 Dec 2005 10:52 PST
Expires: 13 Jan 2006 10:52 PST
Question ID: 605807

Provide C++ code for an incremental linear regression calculator class.

It will be used to estimate the trend of the tail end of a timeseries
of sample points.  It should calculate the LSF (least-squares-fit)
line through a set of points.



What do I mean by incremental?
      Let's say I just calculated a LSF on a set of 100 data points:

      Now I want to calculate a new LSF on a set of 100 points made
      up of the the latest 98 points (highest X values) from that 
      first set and two more points later than any of those 98.

      I would want this second LSF operation to be significantly less
      computationally expensive than simply running another linear 
      regression from scratch on the new set of 100 points 
      (made up of 98 old points and 2 new ones).

      I would want the ability to add more points to the end (latest time)
      and remove points from the back (earliest time) at will.
      If adding points out of order does not make it any more difficult,
      or CPU intensive, then I'd like that ability as well.


The code should be well formatted, and easily understandable.
(carefully chosen self-documenting variable and method names 
which convey the meaning of their purpose well)

Below is my stab at the header file:
You would provide the .cc  file.   ;)
(feel free to put trivial method bodies in the class definition)

(also, please try to follow the implied code format standards in
the code below)


class LinearRegressor {
    public:

    struct Point { double x, y};

    /** Add samples to the set of data points to be used in the regression.
        If it simplifies coding or makes it significantly more efficient,
        we can constrain args to x > max(x added so far).  You tell me.
    **/
    void add_sample_points(const std::vector<Point>& sp);

    /** Add a single sample to the set of data points to be used in 
        the regression.
        @sa add_sample_points()
    **/
    void add_sample_point(const Point &sp);


    /** Remove left-most samples from data set similarly to Perl's shift().
        @param num_to_remove number of samples to be removed.
        @return the sample points removed.
    **/
    std::vector<Point> shift_samples(unsigned num_to_remove =1);

   /** Remove samples from the back end of the data set whose 
       x value is less than the specified limit..
       All samples with x <= x_limit will be removed and a subsequent call
       to fit() will reflect their non-inclusion.
       @param x_limit the upper limit on the x value of points to remove.
       @return  the number of data points removed from the set.
   **/
    unsigned remove_samples_before(double x_limit);

    /// The number of sample points currently known to the LinearRegressor.
    unsigned num_samples();

    /// return type of fit(), describes a line.
    /// if you have more useful scalar return data to add to this, feel free.
    struct { double slope, intercept} FitParams;

    /** Calculate the Least-Squares-Fit line through the currently
active datapoints.
        The cost of this operation should be much lower after a previous
        call to fit() where most of the active datapoints were the
        same as they are now.
    **/
    FitParams fit();

    /** Convenience function to calculate y from x and current 
        cached FitParams.
    **/
    double estimated_y(double x) const;

};

Answer

There is no answer at this time.

Comments

Subject: Re: Linear Regression C++ Code (incremental)
From: quantris-ga on 24 Dec 2005 15:53 PST

Hi, I should be able to do this for you (I'm not a Google researcher,
however -- I'm an engineering physics undergrad in my 4th year).

I was just wanting a little clarification:

For removing points, when you say left-most you mean the ones with
least x-value, right?  How should we break ties for points with the
same x-value?

Also, I would presume that the input is not given in order of x-value?
 This is important because in this case sorting has to be done, which
would increase the complexity of adding/removing points.

However, I would like to point out that for the method I'm planning to
use, it does not matter where the additional/removed points are in
relation to the others (before, after, in the middle).  Therefore, it
would be a little simpler to not even worry about the x-coordinates,
and have the the addition/removal of points operate simply on the
order that the points are provided in.

Subject: Re: Linear Regression C++ Code (incremental)
From: wordless-ga on 24 Dec 2005 22:45 PST

How about the following straitforward code?

Header:

#include <vector>
#include <deque>

class LinearRegressor {
 private:
	struct _point {double x, y, xx, xy;};
	std::deque<_point> data;

	unsigned int num_sp;
	double X, Y, XX, XY;

 public:

	LinearRegressor(void) { num_sp = 0; X = Y = XX = XY = 0; data.clear(); };

	struct Point { double x, y;};

	/** Add samples to the set of data points to be used in the regression.
	    If it simplifies coding or makes it significantly more efficient,
	    we can constrain args to x > max(x added so far).  You tell me.
	**/
	void add_sample_points(const std::vector<Point>& sp){
		for(std::vector<Point>::const_iterator pos = sp.begin(); pos != sp.end(); pos ++)
			add_sample_point(*pos);
	};

	/** Add a single sample to the set of data points to be used in 
	    the regression.
	    @sa add_sample_points()
	**/
	void add_sample_point(const Point &sp){
		data.push_back((_point){sp.x, sp.y, sp.x * sp.x, sp.x * sp.y});
		X += sp.x;
		Y += sp.y;
		XX += sp.x * sp.x;
		XY += sp.x * sp.y;
		num_sp ++;
	};

	/** Remove left-most samples from data set similarly to Perl's shift().
	    @param num_to_remove number of samples to be removed.
	    @return the sample points removed.
	**/
	std::vector<Point> shift_samples(unsigned num_to_remove =1);

	/** Remove samples from the back end of the data set whose 
	    x value is less than the specified limit..
	    All samples with x <= x_limit will be removed and a subsequent call
	    to fit() will reflect their non-inclusion.
	    @param x_limit the upper limit on the x value of points to remove.
	    @return  the number of data points removed from the set.
	**/
	unsigned int remove_samples_before(double x_limit);

	/// The number of sample points currently known to the LinearRegressor.
	unsigned int num_samples() {return num_sp;};
	
	/// return type of fit(), describes a line.
	/// if you have more useful scalar return data to add to this, feel free.
	struct FitParams { double slope, intercept;};

	/** Calculate the Least-Squares-Fit line through the currently active datapoints.
	    The cost of this operation should be much lower after a previous
	    call to fit() where most of the active datapoints were the
	    same as they are now.
	**/
	FitParams fit(void) const {
		if(num_sp <= 1)
			throw "Not enough number of samples points";
		double slope = (num_sp * XY - X * Y) / (num_sp * XX - X * X);
		double intercept = (Y - slope * X) / num_sp;
		return (FitParams){slope, intercept};
	};

	/** Convenience function to calculate y from x and current 
	    cached FitParams.
	**/
	double estimated_y(double x) const { FitParams result = fit(); return
result.intercept + result.slope * x; };
	
};


CPP:

#include "linear_regressor.h"

std::vector<LinearRegressor::Point>
LinearRegressor::shift_samples(unsigned num_to_remove)
{
	std::vector<LinearRegressor::Point> points;

	if(num_to_remove > num_sp)
		throw "Invalid parameter";
	_point pt;
	unsigned int left = num_to_remove;
	while(left != 0)
	{
		pt = data.front();
		data.pop_front();
		X -= pt.x;
		Y -= pt.y;
		XX -= pt.xx;
		XY -= pt.xy;
		left --;
		points.push_back((Point){pt.x, pt.y});
	};
	num_sp -= num_to_remove;

	return points;
}

unsigned int
LinearRegressor::remove_samples_before(double x_limit)
{
	/*
	 * We take that the sample points are ordered by the x values. Or we
need sort the data.
	 */
	int num_removed = 0;
	_point pt = data.front();
	while(pt.x < x_limit)
	{
		data.pop_front();
		X -= pt.x;
		Y -= pt.y;
		XX -= pt.xx;
		XY -= pt.xy;
		num_removed ++;
		pt = data.front();
	};
	num_sp -= num_removed;

	return num_removed;
}

Subject: Re: Linear Regression C++ Code (incremental)
From: stevemadere2-ga on 28 Dec 2005 19:03 PST

Yes, left-most means least x value.

Breaking a tie could be done using the y value.  e.g.:
  bool operator < (const Point& p1,const Point& p2) {
     if (p1.x == p2.x) 
       return p1.y < p2.y;   
     else 
        return p1.x < p2.x;
  }

When faced with two data points which are entirely
identical, it does not matter which one is removed.
They are bosons. :)

Treating it like a queue of datapoints ordered only
by their insertion order would be fine with me too.
The caller can just be careful to add/remove points in increasing
x order.

In this case, the interface would be either
push() to add and shift() to remove
or
unshift() to add and  pop() to remove.
(See perl documentation of push, pop, shift, and unshift if the above
does not mean anything to you)


If it is easier to do with your method,
It would be fine with me to have an interface
which allows me to remove arbitrary points from
the data set. I don't even mind the caller
having the responsibility of keeping track of whether
a point has been added before attempting to remove it.

i.e:
    /** Adds a data point to the set of points on which
        the linear regression will be calculated.
    **/
    void add_data_point(const Point& p);

    /** Removes a previously added data point.
        warning:  does not check that the point was actually
        previously added.
    **/
    void removed_data_point(const Point &p);

Subject: Re: Linear Regression C++ Code (incremental)
From: welte-ga on 31 Dec 2005 08:45 PST

Have you looked at Numerical Recipes in C?

    -welte-ga

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.

Search Google Answers for

Google Home - Answers FAQ - Terms of Service - Privacy Policy