View Question
Q: Linear Regression C++ Code (incremental) ( No Answer,   4 Comments )
 Question
 Subject: Linear Regression C++ Code (incremental) Category: Computers > Algorithms Asked by: stevemadere2-ga List Price: \$100.00 Posted: 14 Dec 2005 10:52 PST Expires: 13 Jan 2006 10:52 PST Question ID: 605807
 ```Provide C++ code for an incremental linear regression calculator class. It will be used to estimate the trend of the tail end of a timeseries of sample points. It should calculate the LSF (least-squares-fit) line through a set of points. What do I mean by incremental? Let's say I just calculated a LSF on a set of 100 data points: Now I want to calculate a new LSF on a set of 100 points made up of the the latest 98 points (highest X values) from that first set and two more points later than any of those 98. I would want this second LSF operation to be significantly less computationally expensive than simply running another linear regression from scratch on the new set of 100 points (made up of 98 old points and 2 new ones). I would want the ability to add more points to the end (latest time) and remove points from the back (earliest time) at will. If adding points out of order does not make it any more difficult, or CPU intensive, then I'd like that ability as well. The code should be well formatted, and easily understandable. (carefully chosen self-documenting variable and method names which convey the meaning of their purpose well) Below is my stab at the header file: You would provide the .cc file. ;) (feel free to put trivial method bodies in the class definition) (also, please try to follow the implied code format standards in the code below) class LinearRegressor { public: struct Point { double x, y}; /** Add samples to the set of data points to be used in the regression. If it simplifies coding or makes it significantly more efficient, we can constrain args to x > max(x added so far). You tell me. **/ void add_sample_points(const std::vector& sp); /** Add a single sample to the set of data points to be used in the regression. @sa add_sample_points() **/ void add_sample_point(const Point &sp); /** Remove left-most samples from data set similarly to Perl's shift(). @param num_to_remove number of samples to be removed. @return the sample points removed. **/ std::vector shift_samples(unsigned num_to_remove =1); /** Remove samples from the back end of the data set whose x value is less than the specified limit.. All samples with x <= x_limit will be removed and a subsequent call to fit() will reflect their non-inclusion. @param x_limit the upper limit on the x value of points to remove. @return the number of data points removed from the set. **/ unsigned remove_samples_before(double x_limit); /// The number of sample points currently known to the LinearRegressor. unsigned num_samples(); /// return type of fit(), describes a line. /// if you have more useful scalar return data to add to this, feel free. struct { double slope, intercept} FitParams; /** Calculate the Least-Squares-Fit line through the currently active datapoints. The cost of this operation should be much lower after a previous call to fit() where most of the active datapoints were the same as they are now. **/ FitParams fit(); /** Convenience function to calculate y from x and current cached FitParams. **/ double estimated_y(double x) const; };```
 There is no answer at this time.

 ```Hi, I should be able to do this for you (I'm not a Google researcher, however -- I'm an engineering physics undergrad in my 4th year). I was just wanting a little clarification: For removing points, when you say left-most you mean the ones with least x-value, right? How should we break ties for points with the same x-value? Also, I would presume that the input is not given in order of x-value? This is important because in this case sorting has to be done, which would increase the complexity of adding/removing points. However, I would like to point out that for the method I'm planning to use, it does not matter where the additional/removed points are in relation to the others (before, after, in the middle). Therefore, it would be a little simpler to not even worry about the x-coordinates, and have the the addition/removal of points operate simply on the order that the points are provided in.```
 ```How about the following straitforward code? Header: #include #include class LinearRegressor { private: struct _point {double x, y, xx, xy;}; std::deque<_point> data; unsigned int num_sp; double X, Y, XX, XY; public: LinearRegressor(void) { num_sp = 0; X = Y = XX = XY = 0; data.clear(); }; struct Point { double x, y;}; /** Add samples to the set of data points to be used in the regression. If it simplifies coding or makes it significantly more efficient, we can constrain args to x > max(x added so far). You tell me. **/ void add_sample_points(const std::vector& sp){ for(std::vector::const_iterator pos = sp.begin(); pos != sp.end(); pos ++) add_sample_point(*pos); }; /** Add a single sample to the set of data points to be used in the regression. @sa add_sample_points() **/ void add_sample_point(const Point &sp){ data.push_back((_point){sp.x, sp.y, sp.x * sp.x, sp.x * sp.y}); X += sp.x; Y += sp.y; XX += sp.x * sp.x; XY += sp.x * sp.y; num_sp ++; }; /** Remove left-most samples from data set similarly to Perl's shift(). @param num_to_remove number of samples to be removed. @return the sample points removed. **/ std::vector shift_samples(unsigned num_to_remove =1); /** Remove samples from the back end of the data set whose x value is less than the specified limit.. All samples with x <= x_limit will be removed and a subsequent call to fit() will reflect their non-inclusion. @param x_limit the upper limit on the x value of points to remove. @return the number of data points removed from the set. **/ unsigned int remove_samples_before(double x_limit); /// The number of sample points currently known to the LinearRegressor. unsigned int num_samples() {return num_sp;}; /// return type of fit(), describes a line. /// if you have more useful scalar return data to add to this, feel free. struct FitParams { double slope, intercept;}; /** Calculate the Least-Squares-Fit line through the currently active datapoints. The cost of this operation should be much lower after a previous call to fit() where most of the active datapoints were the same as they are now. **/ FitParams fit(void) const { if(num_sp <= 1) throw "Not enough number of samples points"; double slope = (num_sp * XY - X * Y) / (num_sp * XX - X * X); double intercept = (Y - slope * X) / num_sp; return (FitParams){slope, intercept}; }; /** Convenience function to calculate y from x and current cached FitParams. **/ double estimated_y(double x) const { FitParams result = fit(); return result.intercept + result.slope * x; }; }; CPP: #include "linear_regressor.h" std::vector LinearRegressor::shift_samples(unsigned num_to_remove) { std::vector points; if(num_to_remove > num_sp) throw "Invalid parameter"; _point pt; unsigned int left = num_to_remove; while(left != 0) { pt = data.front(); data.pop_front(); X -= pt.x; Y -= pt.y; XX -= pt.xx; XY -= pt.xy; left --; points.push_back((Point){pt.x, pt.y}); }; num_sp -= num_to_remove; return points; } unsigned int LinearRegressor::remove_samples_before(double x_limit) { /* * We take that the sample points are ordered by the x values. Or we need sort the data. */ int num_removed = 0; _point pt = data.front(); while(pt.x < x_limit) { data.pop_front(); X -= pt.x; Y -= pt.y; XX -= pt.xx; XY -= pt.xy; num_removed ++; pt = data.front(); }; num_sp -= num_removed; return num_removed; }```
 ```Yes, left-most means least x value. Breaking a tie could be done using the y value. e.g.: bool operator < (const Point& p1,const Point& p2) { if (p1.x == p2.x) return p1.y < p2.y; else return p1.x < p2.x; } When faced with two data points which are entirely identical, it does not matter which one is removed. They are bosons. :) Treating it like a queue of datapoints ordered only by their insertion order would be fine with me too. The caller can just be careful to add/remove points in increasing x order. In this case, the interface would be either push() to add and shift() to remove or unshift() to add and pop() to remove. (See perl documentation of push, pop, shift, and unshift if the above does not mean anything to you) If it is easier to do with your method, It would be fine with me to have an interface which allows me to remove arbitrary points from the data set. I don't even mind the caller having the responsibility of keeping track of whether a point has been added before attempting to remove it. i.e: /** Adds a data point to the set of points on which the linear regression will be calculated. **/ void add_data_point(const Point& p); /** Removes a previously added data point. warning: does not check that the point was actually previously added. **/ void removed_data_point(const Point &p);```
 ```Have you looked at Numerical Recipes in C? -welte-ga```