Hi k9queen!
Here are the answers to your questions.
a) The idea of this regression procedure is to find the line that best
approximates the actual values. That is, find a line such that for
each value of t (i.e., at each period) the difference between the
value of the line and the actual value is as small as possible. Of
course, there are many data points that need to be fit, so the
regression line is such that the *sum* of these errors is as small as
possible. Furthermore, each "error" is usually defined to be the
square of the difference between the line value and the actual value.
The reason for the square is twofold. First, to make the errors always
positive. For example, if you defined the error to be
(line value - actual value)
then you could have that at data point 1 the error is 10 (line value
is greater than actual value), and at data point 2 the error is -10.
Well, when summing these errors you would get 0, implying that the
line perfectly fits the data. Of course, this is wrong: there are in
fact errors of 10. So the square would make the -10 positive,
eliminating this problem. The second reason for the square is to
exaggerate the importance of large errors. In this way, when
minimizing the square errors in order to find the regression line, we
are trying to avoid large differences between the line and the actual
data.
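To make this concrete, here is a minimal sketch in plain Python. The data values and the helper names (fit_line, sum_sq_errors) are my own, made up for illustration; they are not taken from your graphs. It shows the +10/-10 cancellation described above, and that the fitted line has a smaller sum of squared errors than a slightly steeper line.

```python
# Hypothetical data: four periods with made-up Y values.
t = [1, 2, 3, 4]
y = [2.0, 4.5, 5.5, 8.0]

def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def sum_sq_errors(xs, ys, intercept, slope):
    """Sum of squared differences between line values and actual values."""
    return sum((intercept + slope * x - v) ** 2 for x, v in zip(xs, ys))

# Raw errors of +10 and -10 cancel out; squared errors do not.
raw_errors = [10, -10]
raw_total = sum(raw_errors)                      # 0
squared_total = sum(e ** 2 for e in raw_errors)  # 200

intercept, slope = fit_line(t, y)
best_sse = sum_sq_errors(t, y, intercept, slope)
# A slightly steeper line through the same intercept fits worse.
other_sse = sum_sq_errors(t, y, intercept, slope + 0.5)
print(raw_total, squared_total, best_sse < other_sse)
```
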
Check the following links for more information on this subject and a
deeper insight on the following question.
Linear Relationships
http://illuminations.nctm.org/imath/912/LinearRelationships/
http://standards.nctm.org/document/eexamples/chap7/7.4/
Regarding the inclusion of point G in the regression of graph 2, it's
clear that the time parameter (the one that multiplies t) will be
substantially lower if we include G than if we don't. Recall that the
regression line tries to minimize the sum of errors between the line
and the actual data. If you draw the line that best fits the points,
ignoring point G, you'll notice that there will be a large difference
between G and the line. Thus the line will have to become flatter in
order to lower the error that comes from the difference of point G
with the line. Mathematically, the line becoming flatter means that
the slope of the line (the parameter that multiplies t) becomes closer
to zero.
You might want to try the applet from the first link above to check
the line that best fits the points without G, and what happens when you
add it. Notice that the inclusion of point G will move the line in a
way that the error for all the other values becomes larger, so that
the fit of the regression line becomes worse. Data points like G are
called "outliers": values that are too "different" from the rest of
the data.
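If you can't run the applet, the same effect can be sketched in a few lines of Python. I don't know the exact coordinates in your graph 2, so the numbers below are assumptions: six points lying on a line with slope 2, plus a hypothetical point G far below that trend at the last period.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Assumed data: six points on a perfect upward trend.
t = [1, 2, 3, 4, 5, 6]
y = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
# Hypothetical outlier G at period 7, far below the trend.
g_t, g_y = 7, 2.0

b0, slope_without_g = fit_line(t, y)
b0_g, slope_with_g = fit_line(t + [g_t], y + [g_y])

def sse_on_original_points(intercept, slope):
    """Squared error measured on the six non-outlier points only."""
    return sum((intercept + slope * x - v) ** 2 for x, v in zip(t, y))

print(slope_without_g)  # 2.0
print(slope_with_g)     # about 0.71: the line became flatter
print(sse_on_original_points(b0, slope_without_g))  # essentially 0
print(sse_on_original_points(b0_g, slope_with_g))   # clearly larger
```

With these made-up numbers, the single point G pulls the slope from 2 down to roughly 0.71, and the fit on the other six points gets worse.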
b) The parameter that multiplies t is the slope of the regression
line. Mathematically, the slope shows by how much Y changes when X
changes. In this case, X represents time. Therefore, this parameter
shows how Y (the dependent variable) changes as time passes. If
the parameter is positive, it means that Y is increasing through time;
if it's negative it means that it's decreasing; if it's close to zero,
it means that Y is not changing much through time.
c) The R-squared is a measure of the "goodness of the fit"; that is,
how well the regression line fits the actual data. The formula for it
(one minus the ratio of the variance of the errors to the variance of
the data) means that it measures how much of the data variance is
explained by the regression line. A value of 1 means a perfect fit; a
value of 0 means the line explains none of the variance.
Check the following link for more information on R-squared and
trendlines in general.
Introduction to Mathematical Models
http://qrc.depaul.edu/jcasey/Thursday704/Class4Notes.htm
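As a rough illustration of the R-squared calculation, here is a small Python sketch. It uses the common definition R-squared = 1 - (sum of squared errors) / (total sum of squares around the mean of Y); the data and the function names are hypothetical, chosen only to show a case where the line fits well.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(xs, ys):
    """Share of the variance of ys explained by the fitted line."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    # Variation left unexplained by the line.
    sse = sum((v - (intercept + slope * x)) ** 2 for x, v in zip(xs, ys))
    # Total variation of the data around its mean.
    sst = sum((v - mean_y) ** 2 for v in ys)
    return 1 - sse / sst

# Hypothetical data with a clear upward trend.
t = [1, 2, 3, 4]
y = [2.0, 4.5, 5.5, 8.0]
r2 = r_squared(t, y)
print(round(r2, 3))  # about 0.976: the line explains most of the variance
```
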
Graph 3
a) In this case, it's likely that the R-squared will be equal to or near
to zero. I assume here that the data points are actually something
like this (please request a clarification if I'm wrong in this
assumption):
|
|
|
| * * * * * *
|
|* * * * * *
|
|
---------------------
t
(I'm making this correction because according to the graph in the
question, it would appear that each time t has two different values
for Y, which is impossible). It's clear that the trend line will be
something like this:
|
|
|
| * * * * * *
|---------------
|* * * * * *
|
|
---------------------
t
However, how much of the data variance is being explained by this
line? Nothing! The line actually predicts that the values of Y never
change, therefore it has zero variance. However, we can also see that
the actual values of Y do have positive variance (Y is not constant).
Therefore, the variance that is explained by this best fit trend line
will be zero. This implies that the value of the R-squared of this
regression line is also 0 (or very close to it).
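Here is a quick numerical check of this reasoning in Python, under my assumption about the data: twelve periods in which Y alternates between a low and a high value. The values 1 and 2 are made up. Because the series starts low and ends high, the fitted slope comes out slightly above zero rather than exactly zero, which is why the R-squared is near zero rather than exactly 0.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (v - mean_y) for x, v in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def r_squared(xs, ys):
    """R-squared = 1 - (squared errors) / (total variation around the mean)."""
    intercept, slope = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    sse = sum((v - (intercept + slope * x)) ** 2 for x, v in zip(xs, ys))
    sst = sum((v - mean_y) ** 2 for v in ys)
    return 1 - sse / sst

# Assumed data: Y alternates low, high, low, high over twelve periods.
t = list(range(1, 13))
y = [1.0 if period % 2 == 1 else 2.0 for period in t]

intercept, slope = fit_line(t, y)
r2 = r_squared(t, y)
print(round(slope, 3))  # about 0.021: an almost flat line
print(round(r2, 3))     # about 0.021: almost none of the variance explained
```
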
Google search strategy
R-squared "regression line"
://www.google.com.ar/search?q=R-squared+%22regression+line%22&ie=UTF-8&oe=UTF-8&hl=es&meta=
"regression line" outliers
://www.google.com.ar/search?hl=es&ie=UTF-8&oe=UTF-8&q=%22regression+line%22+outliers&btnG=B%C3%BAsqueda+en+Google&meta=
I hope this helps! If you have any doubt regarding my answer, please
don't hesitate to request a clarification.
Best wishes!
elmarto