Whosher,
OK. Let's see if I can give you your $30's worth on this question. My
goal in answering the question is going to be twofold - to give you
the techniques needed to answer the question (using Microsoft Excel as
the weapon of choice) and to understand the statistical theory behind
it. When I'm not being a google researcher, I do quite a bit of
college level business statistics tutoring, so consider this answer as
an online tutorial. Feel free to come back for clarification if
anything doesn't make sense - a good tutorial is a dynamic one!
I should also say that I will try not to be too technical in my
answers - I know that this might ruffle the feathers of some of the
stats experts who are reading the answer, especially if I fudge the
explanation - my aim is to get the general concepts across in one
shot.
Step 1 - Data Entry
-------------------
I won't spend too long on this, but I want to kind of set the question
up so that we are on the same page when talking about cell references.
I took your text and pasted it into excel.
I then used the Data, Text to Columns menu option to covert the data
into columns. The default (fixed width) settings work fine. You might
need to cut and paste the text to get it in the same cells as mine.
The following link shows how the data looks :
http://www2.bc.edu/~lawrenst/whosher/stats1.gif
In column A, we have the bank names (I shortened some). Column B has
the Deposits (numeric values in B2:B7). Column C has the Number of
Customers (Numeric values in C2:C7).
Question A - Independent and Dependent Variables
------------------------------------------------
a. Which variable is the independent variable? Which is the Dependent
variable?
Before I get to the answer, a brief explanation of the two different
types of variable in a regression and how to identify them.
Any regression has to have a dependent variable and at least one
independent variable. The simplest way of thinking about the variables
is to say that the dependent variable *depends* on the independent
variables. Said another way, the dependent variable can be partially
explained or described by the independent variables.
When answering this question, you want to think to yourself about
cause and effect - does a larger number of deposits lead to a larger
number of customers? or does a larger number of customers lead to a
larger number of deposits? Without getting too philosophical, I'm
going to argue that it is the number of customers that causes or
describes the deposits. You need customers to have deposits.
*** Hence the answer to this question is that Customers is the
INDEPENDENT variable and deposits is the DEPENDENT variable. ***
Microsoft Excel and the theory of statistics do not tell you which
variable is the dependent variable - you get results (statistical
output) no matter which way round you treat variables - it is your
common sense that guides you as to which way round to write a
regression. If you get it round the wrong way, the interpretation of
your regression is potentially meaningless or incorrect.
Question B - Scatter Plots and Interpretation
---------------------------------------------
b. what is a scatter plot of the data? Does it appear that the amount
of deposits is related to the number of customers?
There are two parts to this question - creating a scatter plot and
interpreting it. Without interpretation (as is always the case in
statistics), all you have is a nice pattern of dots.
To plot a scatter plot in Excel, follow these steps :
1) Select Insert, Chart from the Menu or click on the Chart Wizard
button
2) Select XY Scatter in the dialog box
(http://www2.bc.edu/~lawrenst/whosher/stats2.gif)
3) Click on "Next" and select the Series Tab. Click "Add".
4) Enter your independent variable (Customers) as the X variable. ie.
enter C2:C7 in the X values box.
5) Enter your dependent variable (Deposits) as the Y variable. ie.
enter B2:B7 in the Y values box.
(http://www2.bc.edu/~lawrenst/whosher/stats3.gif)
6) Click "Next". You can add a title, axis titles and change the
formatting if you want to, but it isn't really important for learning
statistics, so I'll leave it as it is.
7) Click finish
You should have a sheet that looks as follows:
http://www2.bc.edu/~lawrenst/whosher/stats4.gif
This is our scatter plot. Now we have to interpret it. Our six points
are arranged in a weird T shape, but there is generally an upward
slope to the data.
There are numerous possible scatterplots you could get. I'm going to
classify a few of them and explain what they mean before putting an
interpretation to our bank scatterplot.
Perfect Positive Correlation
http://www2.bc.edu/~lawrenst/whosher/stats5.gif
In this case, the data points fall in a straight upwards sloping line.
An increase in the x variable is reflected by a proportional increase
in the y variable. There is no deviation from this line. Perfectly
correlated variables mean that one variable completely describes the
other. If you know the value of x, you can estimate exactly the value
of y.
Perfect Negative Correlation
http://www2.bc.edu/~lawrenst/whosher/stats6.gif
In this case, the data points fall in a straight downwards sloping
line. An increase in x is reflected by a proportional decrease in the
y variable.
Positive Correlation
http://www2.bc.edu/~lawrenst/whosher/stats7.gif
In this case, the data points fall about an upwards sloping line, but
do not fall exactly on it. An increase in x is generally reflected by
an increase in y, but not exactly. If you know the value of x, you can
make a rough guess of what y should be but there is a margin of error.
Negative Correlation
http://www2.bc.edu/~lawrenst/whosher/stats8.gif
In this case, the data points fall about an downwards sloping line,
but do not fall exactly on it. An increase in x is related to a
decrease in y, but the relationship is not exact.
No Noticeable Correlation
http://www2.bc.edu/~lawrenst/whosher/stats9.gif
In this case, the data points fall randomly and show no particular
trend. An increase in x is not related to any particular change in y.
Given these 5 possibilities, it is easy to see that our question falls
into the situation of positive (not perfect) correlation. Much of the
variation in the y variable (deposits) can be explained by the
variation in the x variable (number of customers). The more customers,
the more deposits (makes sense, right?)
Question C - Regression Equations
---------------------------------
c. Find the equation of the regression line for the data
Two parts to this question - doing it in excel, and understanding what
you are doing.
When we run a regression on data, we are finding the line that fits
best through the data. Technically, by "best" we mean the line which
minimizes the sum of the squared vertical distances between each point
and the line.
This line tries to use as much of the variation in x to explain the
variation in y. If the variables are perfectly correlated, the
regression equation describes the line that joins the points.
To calculate the regression equation in excel we can do it in two
ways. One way is to plot a trendline and get the regresion equation,
but I will leave that for the next question.
Here's how we run the regression analysis:
1) Select Tools, Data Analysis from the menu. If it is not available,
you need to install the analysis toolpak (see Tools, Add-ins on the
menu).
2) Select Regression.
(http://www2.bc.edu/~lawrenst/whosher/stats10.gif)
3) Enter the y values (B2:B7) and the x values (C2:C7)
(http://www2.bc.edu/~lawrenst/whosher/stats11.gif)
4) Click "OK"
You should get the following output:
http://www2.bc.edu/~lawrenst/whosher/stats12.gif
A brief explanation of the important parts of the output (If you want
to know the definitions/explanations of the other statistics, i'd be
happy to answer that in a separate question) :
* First Panel - Overall regression statistics
R-Squared : How much of the variance in Y is described by X? In this
example 40% of the variation in the deposits can be explained by the
number of customers. The other 60% we presume are bank specific issues
(perhaps people who bank with Natwest are richer in general)
Observations : How many observations are there in your data set. In
this case we have data for 6 banks
* Second Panel - Anova Section :
This part tells us how much significance can be drawn from the
regression as a whole - ie. how much sense does it make to talk about
the best fit line in general?
- Significance F : This number tells us the overall significance of
the regression. It can be between 0 and 1 - the closer to zero, the
more significant it is. In this case, it is 0.17 which is not that
significant. But then again we only have 6 observations and it is
tough to say much about that few observations.
* Third Panel - Coefficient statistics
Intercept : These statistics tell us information about the point at
which our best fit line crosses the Y axis (x = 0).
X variable 1 : These statistics tell us information about the slope of
the best fit line. The steeper the slope, the more Y (deposits)
changes when X (customers) changes.
Coefficients : This tells us the actual value of the intercept (value
of Y when X = 0 on the best fit line) and the slope (how much Y
increases by when a X is increased by 1 on the best fit line). The
larger the slope coefficient, the steeper the slope.
Standard Error : How much uncertainty there is about the coefficient.
Unless the points are in a perfect straight line, any best fit line is
only an approximation. The more variation there is from the straight
line, the larger the standard error of the slope coefficient. (More on
this in part E)
P-value : How significantly different is the coefficient from zero?
The smaller the number, the more significant it is. (More on this in
part F) We normally say that a p-value less than 0.05 is significant.
If it is less than 0.01, it is highly significant.
Question D - Plotting the Best Fit Line (Trendline)
---------------------------------------------------
d. Plot the regression line on the same plot as the data.
To do this:
1) click on the chart. Click on one of the data points so that all of
the points are selected
(http://www2.bc.edu/~lawrenst/whosher/stats13.gif)
2) right click and select "Add Trendline" (or go to Chart, Add
Trendline)
3) Choose a linear line and go to the options tab
4) Check the "Display Equation on Chart box"
(http://www2.bc.edu/~lawrenst/whosher/stats14.gif)
5) Click OK.
You should see the following :
http://www2.bc.edu/~lawrenst/whosher/stats15.gif
This trendline reflects everything we have described so far. It is
represented by the equation shown (notice how the coefficients appear
in the equation).
Questions E and F - Using and interpreting the regression output
----------------------------------------------------------------
e. Calculate the standard error of the estimate, [sy/x,]for the
regression line
f. At the 0.05 level, is the model significant.
The standard error of the estimate, sy/x is our uncertainty about the
slope of the regression line. *** From part C, we see that it is
3.003688135 *** (approximately 3). This means that we are about 68%
sure that the slope of the line is 4.90 +/- 3.00. The larger this
number, the less certain we are about whether the slope is what we
estimated it to be.
*** The p-value of this coefficient is 0.177768364. This is not
significant at the 0.05 level *** (the number is bigger than 0.05).
This means that we are not sure whether the slope is significant - we
see that it is positive in our diagram, but this could be an accident.
The p-value tells us how likely it is that this happened 'by
accident'.
That's all that the question asks for, so I'll leave my answer at that
for now - if you want me to go into more detail on any section, feel
free to ask for additional clarification. I will try to answer those
questions where appropriate. If you want to go into more detail or
look at different questions, you can reopen some of your other
questions or ask new ones.
Glad I could help you and hopefully I've given you a clear and
understandable guide to statistics and the use of Microsoft Excel.
Enjoy
calebu2-ga |