Google Answers: Alternatives to IMDB's formula

Q: Alternatives to IMDB's formula ( No Answer, 7 Comments )

Question

Subject: Alternatives to IMDB's formula
Category: Science
Asked by: tiiba-ga
List Price: $3.14

Posted: 10 Apr 2005 11:12 PDT
Expires: 10 May 2005 11:12 PDT
Question ID: 507508

IMDB uses this famous formula:

weighted rank (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

 where:
  R = average for the movie (mean) = (Rating)
  v = number of votes for the movie = (votes)
  m = minimum votes required to be listed in the Top 250 (currently 1250)
  C = the mean vote across the whole report (currently 6.8)
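As a quick illustration of the formula above, here is a minimal Python sketch. The function name `weighted_rank` and its argument names are made up for this example, and the defaults 1250 and 6.8 are simply the figures quoted above, not constants taken from IMDB's actual code.

```python
def weighted_rank(R, v, m=1250, C=6.8):
    """Weighted rank WR = (v / (v + m)) * R + (m / (v + m)) * C.

    R: mean rating of the movie, v: number of votes for the movie,
    m: vote threshold, C: mean vote across all movies.
    The defaults are the figures quoted in the question (illustrative only).
    """
    if v + m <= 0:
        raise ValueError("v + m must be positive")
    return (v / (v + m)) * R + (m / (v + m)) * C


# A movie with a 9.0 average is pulled toward C while it has few votes.
for votes in (10, 1250, 100000):
    print(votes, round(weighted_rank(9.0, votes), 3))
```

With v equal to m the two weights are both one half, so a 9.0 average comes out halfway between 9.0 and 6.8.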

This formula is exceedingly useful, but I have beef with the "m"
variable, because it's arbitrary. As far as I can tell, the other
three variables should be enough to calculate what score a movie would
have if it had a quadrillion votes.

So why is this  "m" nonsense thrown in, and is there any formula that avoids it?

Answer

There is no answer at this time.

Comments

Subject: Re: Alternatives to IMDB's formula
From: denco-ga on 10 Apr 2005 11:58 PDT

Howdy tiiba-ga,

For whatever it is worth, I think the formula needs to be modified
in more than one way.  In order for the formula to have real meaning
to me, the number of movies the viewer has seen as well as the age
of the viewer needs to be taken into account.

If the voter is 13 and has only seen 6 movies in their life (the
three newer "Lord of the Rings" movies and the 3 most recent "Star
Wars" movies) their opinion is skewed a bit.

I also wouldn't mind that a person would have to take a test (on 
film/movies in general) before they could rate a movie as well.

Not allowing a "counted" rating until the movie has been around
for a number of years wouldn't bother me either.

Looking Forward, denco-ga - Google Answers Researcher

Subject: Re: Alternatives to IMDB's formula
From: tiiba-ga on 10 Apr 2005 17:24 PDT

Well, they already have a power search, which lets you filter votes
based on age and sex, as well as only get the opinions of the top 1000
voters. And for my purposes, no information about who voted how is
available.

Subject: Re: Alternatives to IMDB's formula
From: volterwd-ga on 10 Apr 2005 21:18 PDT

I had a thorough response and it was removed... I'm too lazy to retype it...

The equation makes complete sense.  It's a Bayesian estimator, and the
value m is not arbitrary.

Subject: Re: Alternatives to IMDB's formula
From: tiiba-ga on 11 Apr 2005 21:03 PDT

And what makes it not arbitrary? How is it determined? Is there a
single perfect m, or should it be calculated for each individual
situation? I read your earlier comment, and it didn't tell me
anything.

Subject: Re: Alternatives to IMDB's formula
From: anonymous3141-ga on 11 Apr 2005 22:19 PDT

There is an excellent discussion in the following thread:
http://groups-beta.google.com/group/rec.puzzles/browse_frm/thread/1bc83090d74d5643/e29f26e647c4303e?q=imdb+formula+minimum+votes+rating&rnum=4#e29f26e647c4303e
-------------------------------------
Determining weighted rankings 

 
Carlos Moreno   Nov 29 2002, 8:08 am

Newsgroups: rec.puzzles 
From: Carlos Moreno <moreno_at_mochima_dot_...@xx.xxx>
Date: Fri, 29 Nov 2002 11:07:29 -0500 
Local: Fri,Nov 29 2002 8:07 am  
Subject: Determining weighted rankings 


Here's an interesting puzzle -- at least I'm puzzled with this! :-) 


I see the Top 250 movies in the IMDB (Internet Movie DataBase), 
and I'm curious about the formula for the weighted ranking, which 
they call a "true Bayesian estimate". 


People vote for a movie, giving 1 to 10, and then they rank the 
movies with this formula, that considers the average of votes 
and also the number of votes: 


    v               m 
------- * R  +  ------- * c 
  v + m           v + m 


Where: 


v = number of votes 
m = minimum number of votes to be considered (currently 1250) 
R = average "vote" given to the movie 
c = average "vote" for all the movies (currently 6.9) 


(when I put "currently", I mean that that is what they list in 
their page, where they describe the formula) 


I understand part of the "intent" of that formula, but I'm having 
trouble understanding another part... 


I see how it conveniently gives better ranking to a movie with 
higher number of votes for the same average vote  (if 10 people 
say that a movie is 8/10, that's not as good as 1000 people 
saying it is 8/10).  So, the estimate gives certain weight to 
the movie's ranking, but then it also considers the overall 
average -- kind of an "in case of doubt, I'll assume you are 
in the overall average" -- which translates to  "in case of 
insufficient information, I'll assume that probabilistically 
speaking, you'll approach the overall average". 


Obviously, the fraction that considers the movie's average vote 
approaches 1 when the number of votes approaches infinity, 
and the other fraction (the one that considers the overall 
average) approaches zero when the number of votes approaches 
infinity. 
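Written out, with w_R and w_c as labels introduced here for the two fractions (they are not IMDB's notation), and holding m, R and c fixed, the limiting behaviour described above is:

```latex
w_R(v) = \frac{v}{v+m}, \qquad w_c(v) = \frac{m}{v+m}, \qquad w_R(v) + w_c(v) = 1

\lim_{v \to \infty} w_R(v) = 1, \qquad \lim_{v \to \infty} w_c(v) = 0
\quad\Longrightarrow\quad
\lim_{v \to \infty} \bigl( w_R(v)\,R + w_c(v)\,c \bigr) = R
```

This holds for any fixed m > 0, and at v = m both weights equal 1/2.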


My doubt is:  what is that magical number 1250?  Is it arbitrary? 
If it is, then how can they call this a "true Bayesian estimate"? 
A true estimate tells me that they are estimating what is the 
value with maximum likelihood to be the true value of the movie's 
average vote, given that we don't count on an infinite number 
of votes (which is what we would require to obtain the true 
average vote).  But this value, this formula is, the way I see 
it, "tainted" by this mysterious 1250... 


Let me put my doubt this way:  We have to determine things like 
the following:  we have two movies:  movie 1 has 2000 votes and 
they average 9/10;  movie 2 has 4000 votes and they average 8/10; 
which movie is better??  (defining "better" as the movie for 
which an infinite number of voters would give a higher average). 
Movie 2 may have a lower average, but it carries more weight, as I 
have higher "confidence" in that value, since a lot of people 
have voted, whereas movie 1 has higher average, but it is 
"doubtful" -- not a lot of people have voted... 


So you see that if I make m = 10, I obtain one value, but if I 
make m = 2000, I obtain something very different.  So, is that 
m number really arbitrary?  Or does it obey some mathematical 
formula depending on the probability distribution of the votes? 
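To see how much the choice of m matters for the two movies in this example, the following Python sketch evaluates the formula for several values of m. The helper name `weighted_rank` and the list of m values are chosen for illustration only; c = 6.9 is the figure quoted above.

```python
def weighted_rank(R, v, m, c):
    """(v*R + m*c) / (v + m), algebraically the same as
    (v / (v + m)) * R + (m / (v + m)) * c from the post."""
    return (v * R + m * c) / (v + m)


C = 6.9  # overall average quoted in the post
for m in (10, 1250, 2000, 100000):  # illustrative values of m
    wr1 = weighted_rank(9, 2000, m, C)  # movie 1: 2000 votes, average 9
    wr2 = weighted_rank(8, 4000, m, C)  # movie 2: 4000 votes, average 8
    leader = "movie 1" if wr1 > wr2 else "movie 2"
    print(f"m={m:>6}: movie 1 = {wr1:.3f}, movie 2 = {wr2:.3f} -> {leader}")
```

For these particular numbers, setting the two expressions equal reduces to 8,000,000 = 200 m, so the order flips at m = 40000: movie 1 ranks higher for any smaller m and movie 2 for any larger m.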


Thanks, 


Carlos 
-- 

 
Mensanator   Nov 29 2002, 1:00 pm

Newsgroups: rec.puzzles 
From: mensana...@aol.com (Mensanator)
Date: 29 Nov 2002 21:00:19 GMT 
Local: Fri,Nov 29 2002 1:00 pm  
Subject: Re: Determining weighted rankings 

>Subject: Determining weighted rankings 
>From: Carlos Moreno moreno_at_mochima_dot_...@xx.xxx 
>Date: 11/29/2002 10:07 AM Central Standard Time 
>Message-id: <3DE790C1.2030...@xx.xxx> 

>Here's an interesting puzzle -- at least I'm puzzled with this! :-) 


>I see the Top 250 movies in the IMDB (Internet Movie DataBase), 
>and I'm curious about the formula for the weighted ranking, which 
>they call a "true Bayesian estimate". 


>People vote for a movie, giving 1 to 10, and then they rank the 
>movies with this formula, that considers the average of votes 
>and also the number of votes: 


>    v               m 
>------- * R  +  ------- * c 
>  v + m           v + m 


>Where: 


>v = number of votes 
>m = minimum number of votes to be considered (currently 1250) 
>R = average "vote" given to the movie 
>c = average "vote" for all the movies (currently 6.9) 


>(when I put "currently", I mean that that is what they list in 
>their page, where they describe the formula) 


>I understand part of the "intent" of that formula, but I'm having 
>trouble understanding another part... 


>I see how it conveniently gives better ranking to a movie with 
>higher number of votes for the same average vote  (if 10 people 
>say that a movie is 8/10, that's not as good as 1000 people 
>saying it is 8/10).  So, the estimate gives certain weight to 
>the movie's ranking, but then it also considers the overall 
>average -- kind of an "in case of doubt, I'll assume you are 
>in the overall average" -- which translates to  "in case of 
>insufficient information, I'll assume that probabilistically 
>speaking, you'll approach the overall average". 


>Obviously, the fraction that considers the movie's average vote 
>approaches 1 when the number of votes approaches infinity, 
>and the other fraction (the one that considers the overall 
>average) approaches zero when the number of votes approaches 
>infinity. 


>My doubt is:  what is that magical number 1250?  Is it arbitrary? 
>If it is, then how can they call this a "true Bayesian estimate"? 
>A true estimate tells me that they are estimating what is the 
>value with maximum likelihood to be the true value of the movie's 
>average vote, given that we don't count on an infinite number 
>of votes (which is what we would require to obtain the true 
>average vote).  But this value, this formula is, the way I see 
>it, "tainted" by this mysterious 1250... 


>Let me put my doubt this way:  We have to determine things like 
>the following:  we have two movies:  move 1 has 2000 votes and 
>they average 9/10;  movie 2 has 4000 votes and they average 8/10; 
>which movie is better??  (defining "better" as the movie for 
>which an infinite number of voters would give a higher average). 
>Movie 2 may have less average, but it carries more weight, as I 
>have higher "confidence" in that value, since a lot of people 
>have voted, whereas movie 1 has higher average, but it is 
>"doubtful" -- not a lot of people have voted... 



By what criteria do you claim that not a "lot" of people voted for movie 1? It 
seems to me that the function of  m is to determine what constitutes a "lot". 
In your example, 2000  > 1250, so why should it be doubtful? 

Plugging in the numbers gives rankings of 


v=2000 R=9 m=1250 
8.192307692 for movie 1 


and 


v=4000 R=8 m=1250 
7.738095238 for movie 2 


so yes, movie 1 is indeed "better" than movie 2. 


One way to look at m is to ask: how many votes of 9 does a movie need to be 
considered "better" than movie 2? 


v=831 R=9 m=1250 
7.738587218 


Thus, with only 831 votes required when we actually have 2000, there is no 
reason to be doubtful that movie 1 is better. 


Another question that can be asked is what if m were smaller, say 625 instead 
of 1250? The ranking formula will result in a value that falls between c and R. 
Higher values of v bring the result closer to R. The parameter m determines how 
quickly it approaches R. Lower values of m mean greater significance to each 
vote. 


So we can ask: if m were 625, how many votes of 9 would a movie need to equal 
4000 votes of 8? 


v=4000 R=8 m=625 
7.851351351 


v=518 R=9 m=625 
7.851706037 


Note that with a lower value of m, we only need 518 votes instead of 831 to 
equal movie 2. Thus, each vote of 9 carries more weight when m is lower. 
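The vote counts 831 and 518 above can be reproduced with a short Python sketch. Solving (v*R + m*c) / (v + m) > T for v gives v > m*(T - c) / (R - T) whenever c < T < R. The function names are invented here for illustration.

```python
import math


def weighted_rank(R, v, m, c):
    """Same formula as in the thread, written as (v*R + m*c) / (v + m)."""
    return (v * R + m * c) / (v + m)


def min_votes_to_beat(target, R, m, c):
    """Smallest whole number of votes at average R whose weighted rank
    strictly exceeds `target`. Assumes c < target < R."""
    if not (c < target < R):
        raise ValueError("requires c < target < R")
    bound = m * (target - c) / (R - target)  # weighted rank equals target here
    return math.floor(bound) + 1


C = 6.9  # overall average used in the thread
for m in (1250, 625):
    movie2 = weighted_rank(8, 4000, m, C)    # 4000 votes averaging 8
    v = min_votes_to_beat(movie2, 9, m, C)   # votes of 9 needed to pass it
    print(m, v, weighted_rank(9, v, m, C))
```

Halving m from 1250 to 625 lowers the required number of votes of 9 from 831 to 518, which is the sense in which each vote counts for more when m is smaller.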



>So you see that if I make m = 10, I obtain one value, but if I 
>make m = 2000, I obtain something very different.  So, is that 
>m number really arbitrary?  Or does it obey some mathematical 
>formula depending on the probability distribution of the votes? 



I doubt that m is arbitrary, but I do not know how they arrive at it. 



>Thanks, 


>Carlos 
>-- 


 
Eb Oesch   Dec 2 2002, 12:20 pm

Newsgroups: rec.puzzles 
From: ericboe...@hotmail.com (Eb Oesch)
Date: 2 Dec 2002 12:20:55 -0800 
Local: Mon,Dec 2 2002 12:20 pm  
Subject: Re: Determining weighted rankings 

Carlos Moreno <moreno_at_mochima_dot_...@xx.xxx> wrote in message
<news:3DE790C1.2030700@xx.xxx>...
> Here's an interesting puzzle -- at least I'm puzzled with this! :-) 

> I see the Top 250 movies in the IMDB (Internet Movie DataBase), 
> and I'm curious about the formula for the weighted ranking, which 
> they call a "true Bayesian estimate". 


> People vote for a movie, giving 1 to 10, and then they rank the 
> movies with this formula, that considers the average of votes 
> and also the number of votes: 


>     v               m 
> ------- * R  +  ------- * c 
>   v + m           v + m 


> Where: 


> v = number of votes 
> m = minimum number of votes to be considered (currently 1250) 
> R = average "vote" given to the movie 
> c = average "vote" for all the movies (currently 6.9) 
> (when I put "currently", I mean that that is what they list in 
> their page, where they describe the formula) 


> I understand part of the "intent" of that formula, but I'm having 
> trouble understanding another part... 


> I see how it conveniently gives better ranking to a movie with 
> higher number of votes for the same average vote  (if 10 people 
> say that a movie is 8/10, that's not as good as 1000 people 
> saying it is 8/10).  So, the estimate gives certain weight to 
> the movie's ranking, but then it also considers the overall 
> average -- kind of an "in case of doubt, I'll assume you are 
> in the overall average" -- which translates to  "in case of 
> insufficient information, I'll assume that probabilistically 
> speaking, you'll approach the overall average". 


> Obviously, the fraction that considers the movie's average vote 
> approaches 1 when the number of votes approaches infinity, 
> and the other fraction (the one that considers the overall 
> average) approaches zero when the number of votes approaches 
> infinity. 


> My doubt is:  what is that magical number 1250?  Is it arbitrary? 
> If it is, then how can they call this a "true Bayesian estimate"? 



I agree with you.  "A true Bayesian estimate" implies to me that the 
estimate assumes a plausible model of the population of voters and 
uses that model to derive the best possible estimate.  I can't guess 
what model that would be. 




> A true estimate tells me that they are estimating what is the 
> value with maximum likelihood to be the true value of the movie's 
> average vote, given that we don't count on an infinite number 
> of votes (which is what we would require to obtain the true 
> average vote).  But this value, this formula is, the way I see 
> it, "tainted" by this mysterious 1250... 

> Let me put my doubt this way:  We have to determine things like 
> the following:  we have two movies:  move 1 has 2000 votes and 
> they average 9/10;  movie 2 has 4000 votes and they average 8/10; 
> which movie is better??  (defining "better" as the movie for 
> which an infinite number of voters would give a higher average). 
> Movie 2 may have less average, but it carries more weight, as I 
> have higher "confidence" in that value, since a lot of people 
> have voted, whereas movie 1 has higher average, but it is 
> "doubtful" -- not a lot of people have voted... 



The 1250 value is unrelated to the usual statistical confidence 
levels.  With sample sizes of 2000 and 4000 and all ratings 
limited to the range 1/10..10/10, we can be very confident in the 
significance of the difference between the two means of 9/10 and 8/10 
(we're talking on the order of 10 standard errors of the difference). 
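A rough check of that claim in Python, under assumptions that are not in the thread: each vote is treated as an independent draw, and since the actual spread of votes is unknown, the sketch uses the largest variance any distribution on [1, 10] can have for a given mean, (10 - mean) * (mean - 1) (the Bhatia-Davis bound). That yields a lower bound on how far apart the two means are, measured in standard errors of their difference. The function names are invented for this example.

```python
import math


def max_variance(mean, low=1.0, high=10.0):
    """Largest possible variance for values in [low, high] with this mean
    (Bhatia-Davis inequality)."""
    return (high - mean) * (mean - low)


def min_z_score(mean1, n1, mean2, n2):
    """Lower bound on |mean1 - mean2| in units of the standard error of the
    difference, assuming independent votes and worst-case variances."""
    se = math.sqrt(max_variance(mean1) / n1 + max_variance(mean2) / n2)
    return abs(mean1 - mean2) / se


# Movie 1: 2000 votes averaging 9; movie 2: 4000 votes averaging 8.
print(round(min_z_score(9, 2000, 8, 4000), 1))  # about 11.5
```

Even in the worst case the gap is more than 11 standard errors, so sampling noise alone cannot explain the difference between the two averages.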

Maybe the rating formula is based on the assumption that the 
moviegoers are voting with their feet.  You could consider the 
decision not to see a movie as a tentative thumbs-down on that movie 
based on its reputation, or lack of same.  But the formula isn't 
directly based on this consideration, either, since the formula is 
nonlinear in the number of voters -- it treats the difference between 
2000 and 10000 votes as being much more important than the difference 
between 10000 and 80000 votes. 


You could rightly object that a linear model based on the 
voting-with-their-feet assumption is too simplistic.  A person might 
prefer to see movie 1 but ends up seeing movie 2 instead just because 
movie 2 happened to be the one showing at the local theater.  Any 
model you choose will be incomplete, since you can't model all the 
elements that determine how many people will end up rating a given 
movie on IMDB. 


So the actual formula used by IMDB seems like a reasonable hack.  Such 
hacks would probably offend an anti-Bayesian, and if the enemy of your 
enemy is your friend, maybe that makes it a Bayesian hack.  Or maybe 
the writer just thought that "a true Bayesian estimate" sounded more 
scientificky than "a weighted average".

Subject: Re: Alternatives to IMDB's formula
From: volterwd-ga on 11 Apr 2005 22:20 PDT

I'm not going to do the math since I can't get paid ;)

It's not arbitrary.

Here's the lowdown:

You have a true mean for the movie, which is unknown because not everyone has voted.

So Bayesian analysis picks the value that maximizes the posterior
distribution of this movie's true mean rating, given the current ratings.

m is chosen to provide a balance between the prior information (the universal
average 6.8) and the data for the movie.  If m is too small then the
ratings will have extreme variability, but if it's too large then it
will give too much weight to the prior distribution.

This method is very common... 

A simpler version is called maximum likelihood, which is equivalent to using
a flat prior distribution.  In that case our value of m would be 0.

This all makes a lot of sense from a statistical point of view.
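One standard model that produces exactly this form, offered here as an assumption since nobody in this thread knows IMDB's actual derivation: suppose the movie's true mean mu has a normal prior centered on C with variance tau^2, and each of the v votes is normal around mu with known variance sigma^2. Votes on a 1 to 10 scale are not really normal, so this is only a sketch. The posterior mean (which for a normal posterior is also its maximum) is then:

```latex
\mu \sim \mathcal{N}(C, \tau^2), \qquad
x_i \mid \mu \sim \mathcal{N}(\mu, \sigma^2), \quad i = 1, \dots, v, \qquad
R = \frac{1}{v} \sum_{i=1}^{v} x_i

\mathbb{E}[\mu \mid x_1, \dots, x_v]
  = \frac{\frac{v}{\sigma^2} R + \frac{1}{\tau^2} C}{\frac{v}{\sigma^2} + \frac{1}{\tau^2}}
  = \frac{v}{v+m} R + \frac{m}{v+m} C,
  \qquad m = \frac{\sigma^2}{\tau^2}
```

Under this model m is the ratio of the vote-to-vote variance within a movie to the variance of true means across movies, and a flat prior (tau^2 going to infinity) gives m = 0, matching the maximum likelihood remark above. Whether IMDB picks 1250 this way is not stated anywhere in this thread.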

Subject: Re: Alternatives to IMDB's formula
From: volterwd-ga on 11 Apr 2005 22:26 PDT

That guy above my last comment is talking out of his butt... disregard it.

The fact that you get different values with different m is
immaterial, since as a particular movie gets more votes its
rating will eventually come to depend solely on R (the
unweighted rating for the movie).  Read my above comment about
Bayesian analysis.

Important Disclaimer: Answers and comments provided on Google Answers are general information, and are not intended to substitute for informed professional medical, psychiatric, psychological, tax, legal, investment, accounting, or other professional advice. Google does not endorse, and expressly disclaims liability for any product, manufacturer, distributor, service or service provider mentioned or any opinion expressed in answers or comments. Please read carefully the Google Answers Terms of Service.

If you feel that you have found inappropriate content, please let us know by emailing us at answers-support@google.com with the question ID listed above. Thank you.
