Google Answers Logo
View Question
 
Q: regression analysis/statistics ( Answered,   4 Comments )
Question  
Subject: regression analysis/statistics
Category: Business and Money > Finance
Asked by: whosher-ga
List Price: $30.00
Posted: 30 Jul 2002 07:08 PDT
Expires: 29 Aug 2002 07:08 PDT
Question ID: 46840
The British Banker's Association wanted to look a the relationship
between the amount of deposits made (in Billions of [f]) and the
number of customers that a bank had.  Analysts collected dataon six
different large banks and found the following information:

Bank Name          Deposits (F billion)          Customers (million)

Abbey National             101.7                       13.6
Barclays                   108.2                       10.0
Lloyds                      96.9                       15.0
National Westminster       113.8                       7.5
Woolrich                    27.5                       4.0
Halifax                     77.1                        7.6  

a. Which variable is the independent variable? Which is the Dependent
variable?
b. what is a scatter plot of the data? Does it appear that the amount
of deposits is related to the number of customers?
c. Find the equation of the regression line for the data 
d. Plot the regression line on the same plot as the data. 
e. Calculate the standard error of the estimate, [sy/x,]for the
regression line
f. At the 0.05 level, is the model significant.

Request for Question Clarification by netcrazy-ga on 30 Jul 2002 07:13 PDT
Hello,

Is this a homework question??

Thanks
netcrazy

Clarification of Question by whosher-ga on 30 Jul 2002 07:32 PDT
no this is not home for the 4th time i am not in school is learning to
do statistic  by  a book that was brought out of a bookstore and is
not sure how they came up with the answer so i figure you can show me
the steps you are taking to arrive at the answer you are getting ...I
want to compare the tools being use

Request for Question Clarification by calebu2-ga on 30 Jul 2002 12:13 PDT
See my comments/clarification on question 46845.

If you cut and paste the data you have into excel, follow the steps to
do regression analysis, you can get back the results.

Admittedly there are a lot of different statistics produced, but if
they are unclear, you would probably be better suited asking a generic
question about interpreting the output of an excel regression analysis
rather than posting it in the current form.

I'd be happy to give you a non-specific guide to using excel to do
regression analysis - without rewording your questions (so that they
are not straight from a text) I can't really do much more to help you.

A good question for you to ask the researchers would be "Could you
come up with a question similar to the one listed, that isn't taken
from a textbook (so that we don't infringe copyright) and a detailed
solution so that I can get a feel for how questions of this type are
done". That would require extra work on the part of the researcher,
but it would be far more readily answered.

Clarification of Question by whosher-ga on 31 Jul 2002 07:38 PDT
Hello

In relation to comments how will I know which formula to use? I guess
if I can separate the formula on when and how to use the problem will
make more since, but the book does not go into details  in reference
to regression analysis.  So what ever info you can provide will be
greatly appreciated thanks.

Request for Question Clarification by calebu2-ga on 31 Jul 2002 18:02 PDT
I;ve created the following question, which I have no problems
answering if you think it will help you understand the banker's
question :

Consider the following hypothetical standings of a completely
imaginary baseball league - it is believed that there might be a
relationship with runs scored and games won :

Team	Wins	Runs
Georgia Bravehearts	68	464
Steinbrenner's Moneymachine	67	628
Tacoma Seamen	65	533
Phoenix Snakes	64	536
Twin City… Twins	64	559
Anaheim Angles	63	548
Buckner's Recurring Nightmare	62	571
Big Arch Birdies	59	489
Oakland Sportspeople	61	510
Bay Area Big guys	59	509
Holywood Avoiders	59	449
Cincinatti Rouges	55	471
Enron Astros	54	501
Piazza's Pals	54	459
Canadian Contractions	53	466
Florida Fishies	51	463
Windy city plain socks	51	565
Phillie Philadelphias	49	473
Maryland Birdies	48	452
Pittsburgh Napsters	48	408
Boulder Boulders	48	496
Ohio Natives	47	460
Canadian Birdies	45	508
SoCal Padres	45	420
Windy City Young Bears	44	448
W's Winners	44	518
Royal Blues	44	471
Missing Pistons	40	408
Wisconsin Beermen	38	442
Tampa Bay nohopers	35	430

Answer the following questions:
a. Which story makes most sense - "More runs give you more wins" or
"More wins give you more runs"? Hence, what are the dependent and
independent variables?
b. Use the graph function on excel to plot these data points. Just by
looking at them do you see a relationship?
c. Add a trendline to the graph
d. By using the Tools-Data Analysis run a regression on the data (make
sure to get the dependent and independent variables the right way
round). Explain the numbers you get as a result.
e. What is the p-value for the x-coefficient. Is there a statistically
significant relationship between runs and wins?

question copyright calebu2-ga and google answers (did not come from a
textbook)

As I said, if you are ok with it, I am more than willing to work
through that question as a detailed example, explaining each step
(with clarification if needed) to provide an understanding of the
steps that would be needed to answer the Bankers question. This way I
am able to help you with your study of business statistics without
either infringing the copyright of the question authors or adding to
the ethical dilemmas which have prevented your question being answered
to this point.

Request for Question Clarification by tehuti-ga on 03 Aug 2002 18:42 PDT
Re rbnn-ga's comment, I would say that the deposits are the dependent
variable, because if there are zero customers, there will definitely
be zero deposits. However, this then raises the question to what
extent does the amount of deposits depend on the number of customers. 
This is what the regression analysis will tell you.

A scatter plot involves plotting the number of customers on the
horizontal axis, and the respective deposit amount on the vertical
axis.  If the points follow more or less a straight line, then you can
say that the amount of deposits is directly related to the number of
customers in a linear manner.  You can already see just by looking at
the figures that this will not be the case, eg Halifax and Nat West
have almost the same number of customers, but Nat West has 50% more
deposits; Nat West has the highest level of deposits, but Lloyds has
the highest number of customers, etc, etc.

The regression coefficient is a way of attempting to put a straight
line over the more scattered data points that you have in your plot. 
The line should reflect an underlying relationship.

I used an online regression equation calculator at
http://people.hofstra.edu/faculty/Stefan_Waner/RealWorld/newgraph/regressionframes.html
 First, I entered customer numbers in the x column and deposits in the
y column.  I selected the equation for linear regression, y = mx + b,
because I want to know if there is a simple linear relationship
between number of customers and amount of deposits.  This gave a
regression coefficient of r = 0.632499.  If there was a direct
positive relationship between the two variables, the coefficient would
be 1, while if there was a direct negative relationship it would be
-1.  Zero correlation would give a coefficient of 0.

You can press "graph" to obtain an illustration of the calculated
regression line versus the data.  It is clear that the regression line
only has a couple of data points within a reasonable distance from it.

You can also see that figures now appear in the third column of data. 
These are the values of the deposits as predicted by the regression
equation.  You can see straight off that in only two cases do they
bear any relationship to the real figures.  This web site does not
have the facility to calculate the standard error and the
significance.  However, given the poor fit between the regression line
and the data, I would hazard a guess that the model is not
statistically significant.  The significance is tested either using a
t test or ANOVA.  Basically, what it does is to see how useful the
regression coefficient would be to predict the amount of deposits for
any specific number of customers.

In real life, the hypothesis you are testing is assuming that if a
bank has more customers, it will have more money placed on deposit
than a bank with fewer customers.  Of course, that need not be the
case, because such an assumption leaves out demographics.  If bank A
attracts many millionaires and other rich people, while bank B
attracts students and other poorer people, it is reasonable to assume
that even with fewer clients, bank A will have more money on deposit
than bank B.  Something of this sort is operating here.  Lloyds
customers are putting away less than half that put away by Nat West
customers.

This is a fairly simplistic explanation, and it does not deal with all
the points in your question, so I am posting it as a clarification
rather than as an answer.  However, sometimes it helps to think in a
simplistic way about a situation before applying full statistical
analysis.
Answer  
Subject: Re: regression analysis/statistics
Answered By: calebu2-ga on 04 Aug 2002 20:22 PDT
 
Whosher,

OK. Let's see if I can give you your $30's worth on this question. My
goal in answering the question is going to be twofold - to give you
the techniques needed to answer the question (using Microsoft Excel as
the weapon of choice) and to understand the statistical theory behind
it. When I'm not being a google researcher, I do quite a bit of
college level business statistics tutoring, so consider this answer as
an online tutorial. Feel free to come back for clarification if
anything doesn't make sense - a good tutorial is a dynamic one!

I should also say that I will try not to be too technical in my
answers - I know that this might ruffle the feathers of some of the
stats experts who are reading the answer, especially if I fudge the
explanation - my aim is to get the general concepts across in one
shot.

Step 1 - Data Entry
-------------------
I won't spend too long on this, but I want to kind of set the question
up so that we are on the same page when talking about cell references.
I took your text and pasted it into excel.

I then used the Data, Text to Columns menu option to covert the data
into columns. The default (fixed width) settings work fine. You might
need to cut and paste the text to get it in the same cells as mine.
The following link shows how the data looks :

http://www2.bc.edu/~lawrenst/whosher/stats1.gif

In column A, we have the bank names (I shortened some). Column B has
the Deposits (numeric values in B2:B7). Column C has the Number of
Customers (Numeric values in C2:C7).

Question A - Independent and Dependent Variables
------------------------------------------------
a. Which variable is the independent variable? Which is the Dependent
variable?

Before I get to the answer, a brief explanation of the two different
types of variable in a regression and how to identify them.

Any regression has to have a dependent variable and at least one
independent variable. The simplest way of thinking about the variables
is to say that the dependent variable *depends* on the independent
variables. Said another way, the dependent variable can be partially
explained or described by the independent variables.

When answering this question, you want to think to yourself about
cause and effect - does a larger number of deposits lead to a larger
number of customers? or does a larger number of customers lead to a
larger number of deposits? Without getting too philosophical, I'm
going to argue that it is the number of customers that causes or
describes the deposits. You need customers to have deposits.

*** Hence the answer to this question is that Customers is the
INDEPENDENT variable and deposits is the DEPENDENT variable. ***

Microsoft Excel and the theory of statistics do not tell you which
variable is the dependent variable - you get results (statistical
output) no matter which way round you treat variables - it is your
common sense that guides you as to which way round to write a
regression. If you get it round the wrong way, the interpretation of
your regression is potentially meaningless or incorrect.

Question B - Scatter Plots and Interpretation
---------------------------------------------
b. what is a scatter plot of the data? Does it appear that the amount
of deposits is related to the number of customers?

There are two parts to this question - creating a scatter plot and
interpreting it. Without interpretation (as is always the case in
statistics), all you have is a nice pattern of dots.

To plot a scatter plot in Excel, follow these steps :
1) Select Insert, Chart from the Menu or click on the Chart Wizard
button
2) Select XY Scatter in the dialog box
(http://www2.bc.edu/~lawrenst/whosher/stats2.gif)
3) Click on "Next" and select the Series Tab. Click "Add".
4) Enter your independent variable (Customers) as the X variable. ie.
enter C2:C7 in the X values box.
5) Enter your dependent variable (Deposits) as the Y variable. ie.
enter B2:B7 in the Y values box.
(http://www2.bc.edu/~lawrenst/whosher/stats3.gif)
6) Click "Next". You can add a title, axis titles and change the
formatting if you want to, but it isn't really important for learning
statistics, so I'll leave it as it is.
7) Click finish

You should have a sheet that looks as follows:
http://www2.bc.edu/~lawrenst/whosher/stats4.gif

This is our scatter plot. Now we have to interpret it. Our six points
are arranged in a weird T shape, but there is generally an upward
slope to the data.

There are numerous possible scatterplots you could get. I'm going to
classify a few of them and explain what they mean before putting an
interpretation to our bank scatterplot.

Perfect Positive Correlation
http://www2.bc.edu/~lawrenst/whosher/stats5.gif
In this case, the data points fall in a straight upwards sloping line.
An increase in the x variable is reflected by a proportional increase
in the y variable. There is no deviation from this line. Perfectly
correlated variables mean that one variable completely describes the
other. If you know the value of x, you can estimate exactly the value
of y.

Perfect Negative Correlation
http://www2.bc.edu/~lawrenst/whosher/stats6.gif
In this case, the data points fall in a straight downwards sloping
line. An increase in x is reflected by a proportional decrease in the
y variable.

Positive Correlation
http://www2.bc.edu/~lawrenst/whosher/stats7.gif
In this case, the data points fall about an upwards sloping line, but
do not fall exactly on it. An increase in x is generally reflected by
an increase in y, but not exactly. If you know the value of x, you can
make a rough guess of what y should be but there is a margin of error.

Negative Correlation
http://www2.bc.edu/~lawrenst/whosher/stats8.gif
In this case, the data points fall about an downwards sloping line,
but do not fall exactly on it. An increase in x is related to a
decrease in y, but the relationship is not exact.

No Noticeable Correlation
http://www2.bc.edu/~lawrenst/whosher/stats9.gif
In this case, the data points fall randomly and show no particular
trend. An increase in x is not related to any particular change in y.

Given these 5 possibilities, it is easy to see that our question falls
into the situation of positive (not perfect) correlation. Much of the
variation in the y variable (deposits) can be explained by the
variation in the x variable (number of customers). The more customers,
the more deposits (makes sense, right?)

Question C - Regression Equations
---------------------------------
c. Find the equation of the regression line for the data

Two parts to this question - doing it in excel, and understanding what
you are doing.

When we run a regression on data, we are finding the line that fits
best through the data. Technically, by "best" we mean the line which
minimizes the sum of the squared vertical distances between each point
and the line.

This line tries to use as much of the variation in x to explain the
variation in y. If the variables are perfectly correlated, the
regression equation describes the line that joins the points.

To calculate the regression equation in excel we can do it in two
ways. One way is to plot a trendline and get the regresion equation,
but I will leave that for the next question.

Here's how we run the regression analysis:
1) Select Tools, Data Analysis from the menu. If it is not available,
you need to install the analysis toolpak (see Tools, Add-ins on the
menu).
2) Select Regression.
(http://www2.bc.edu/~lawrenst/whosher/stats10.gif)
3) Enter the y values (B2:B7) and the x values (C2:C7)
(http://www2.bc.edu/~lawrenst/whosher/stats11.gif)
4) Click "OK"

You should get the following output:
http://www2.bc.edu/~lawrenst/whosher/stats12.gif

A brief explanation of the important parts of the output (If you want
to know the definitions/explanations of the other statistics, i'd be
happy to answer that in a separate question) :

* First Panel - Overall regression statistics
R-Squared : How much of the variance in Y is described by X? In this
example 40% of the variation in the deposits can be explained by the
number of customers. The other 60% we presume are bank specific issues
(perhaps people who bank with Natwest are richer in general)

Observations : How many observations are there in your data set. In
this case we have data for 6 banks

* Second Panel - Anova Section :
This part tells us how much significance can be drawn from the
regression as a whole - ie. how much sense does it make to talk about
the best fit line in general?
- Significance F : This number tells us the overall significance of
the regression. It can be between 0 and 1 - the closer to zero, the
more significant it is. In this case, it is 0.17 which is not that
significant. But then again we only have 6 observations and it is
tough to say much about that few observations.

* Third Panel - Coefficient statistics
Intercept : These statistics tell us information about the point at
which our best fit line crosses the Y axis (x = 0).

X variable 1 : These statistics tell us information about the slope of
the best fit line. The steeper the slope, the more Y (deposits)
changes when X (customers) changes.

Coefficients : This tells us the actual value of the intercept (value
of Y when X = 0 on the best fit line) and the slope (how much Y
increases by when a X is increased by 1 on the best fit line). The
larger the slope coefficient, the steeper the slope.

Standard Error : How much uncertainty there is about the coefficient.
Unless the points are in a perfect straight line, any best fit line is
only an approximation. The more variation there is from the straight
line, the larger the standard error of the slope coefficient. (More on
this in part E)

P-value : How significantly different is the coefficient from zero?
The smaller the number, the more significant it is. (More on this in
part F) We normally say that a p-value less than 0.05 is significant.
If it is less than 0.01, it is highly significant.

Question D - Plotting the Best Fit Line (Trendline)
---------------------------------------------------
d. Plot the regression line on the same plot as the data.

To do this:
1) click on the chart. Click on one of the data points so that all of
the points are selected
(http://www2.bc.edu/~lawrenst/whosher/stats13.gif)
2) right click and select "Add Trendline" (or go to Chart, Add
Trendline)
3) Choose a linear line and go to the options tab
4) Check the "Display Equation on Chart box"
(http://www2.bc.edu/~lawrenst/whosher/stats14.gif)
5) Click OK.

You should see the following :
http://www2.bc.edu/~lawrenst/whosher/stats15.gif

This trendline reflects everything we have described so far. It is
represented by the equation shown (notice how the coefficients appear
in the equation).

Questions E and F - Using and interpreting the regression output
----------------------------------------------------------------
e. Calculate the standard error of the estimate, [sy/x,]for the
regression line
f. At the 0.05 level, is the model significant.

The standard error of the estimate, sy/x is our uncertainty about the
slope of the regression line. *** From part C, we see that it is
3.003688135 *** (approximately 3). This means that we are about 68%
sure that the slope of the line is 4.90 +/- 3.00. The larger this
number, the less certain we are about whether the slope is what we
estimated it to be.

*** The p-value of this coefficient is 0.177768364. This is not
significant at the 0.05 level *** (the number is bigger than 0.05).
This means that we are not sure whether the slope is significant - we
see that it is positive in our diagram, but this could be an accident.
The p-value tells us how likely it is that this happened 'by
accident'.

That's all that the question asks for, so I'll leave my answer at that
for now - if you want me to go into more detail on any section, feel
free to ask for additional clarification. I will try to answer those
questions where appropriate. If you want to go into more detail or
look at different questions, you can reopen some of your other
questions or ask new ones.

Glad I could help you and hopefully I've given you a clear and
understandable guide to statistics and the use of Microsoft Excel.

Enjoy

calebu2-ga
Comments  
Subject: Re: regression analysis/statistics
From: rbnn-ga on 30 Jul 2002 12:16 PDT
 
I'm not sure how to tell which is the independent and which the
dependent variable actually. It seems to me technically one could want
to predict number of customers from deposits, (in which case
"#customers" would be the dependent variable") or one could want to
predict deposits from #customers, in which case the dependent variable
would be deposits. (Recall, the "dependent variable" is the quantity
we are trying to predict, and the "independent variable" is the
quantity that we know already.

Mind if I make another comment? 

Generally in statistics there are two ways to do things, from "first
principles" or using an existing software toolkit. In the olden days,
to "know regression" really meant being able, with pencil and paper 
to solve small regressions, or at least know the algorithms.

Nowadays, most of the time people just use some existing software
package toolkit to solve regressions. For instance, Excel, or Matlab,
or S, or SPSS and so on (there are lots of them out there).

Anyway, if you want, you might want to specify a bit more just what
kind of explanation you are looking for. Do you want to understand the
actual math? Or how to use a toolkit (which one?) Or exactly what the
meaning of the output of such a toolkit is?
Subject: Re: regression analysis/statistics
From: rbnn-ga on 30 Jul 2002 13:59 PDT
 
Oh, a more fundamental problem here: I'm not sure how to send a plot! Oh well.
Subject: Re: regression analysis/statistics
From: jeremymiles-ga on 03 Aug 2002 17:56 PDT
 
Hi Whosher;
This isn't a hard question, if you know what you are doing.  However,
I suspect that you want to know more than the answer - you want some
understanding of where the answer came from.  I don't want to seem
over-blunt, but your anser suggests to me that you are trying to learn
too much about statistics too fast.  The answer might tell you the SE
of the regression line, and the probability value associated with it,
but interpreting these is another matter.   I would humbly suggest
that you went for a simpler book or web page - I would be happy to
make some recommendations, although my area of interest is not in the
area of business statistics, and others could perhaps provide better
recommendations than me.
 
If you have Excel, and really just want the answers, then have a look
at:

http://www.jeremymiles.co.uk/regressionbook/extras/appendix2/excel/index.html
Subject: Re: regression analysis/statistics
From: jeremymiles-ga on 03 Aug 2002 18:04 PDT
 
One more thing.   I have just looked at the data more closely - there
is quite an interesting relationship going on - this might be
developed later in the book. Without giving anything away, you might
have to have an understanding of the banking system in the UK to
really understand these data.

I realise that this diversion isn't necessarily what you are after,
but it is an interesting question, that can be answered on a number of
levels.  I think that the researchers are interested in answering your
question in a way that satisfies the ethical problems, but also
provides you with a satisfactory answer.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
Search Google Answers for
Google Answers  


Google Home - Answers FAQ - Terms of Service - Privacy Policy