 Question
 Subject: Alternatives to IMDB's formula Category: Science Asked by: tiiba-ga List Price: \$3.14 Posted: 10 Apr 2005 11:12 PDT Expires: 10 May 2005 11:12 PDT Question ID: 507508
```
IMDB uses this famous formula:

weighted rank (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

where:
R = average for the movie (mean) = (Rating)
v = number of votes for the movie = (votes)
m = minimum votes required to be listed in the Top 250 (currently 1250)
C = the mean vote across the whole report (currently 6.8)

This formula is exceedingly useful, but I have beef with the "m" variable, because it's arbitrary. As far as I can tell, the other three variables should be enough to calculate what score a movie would have if it had a quadrillion votes. So why is this "m" nonsense thrown in, and is there any formula that avoids it?
```
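For readers who want to experiment with the formula in the question, here is a minimal Python sketch. The function name `weighted_rank` and the sample inputs are illustrative choices, not anything IMDB publishes; the defaults m = 1250 and C = 6.8 are simply the values quoted in the question.

```python
def weighted_rank(R, v, m=1250, C=6.8):
    """Weighted rank WR = (v/(v+m))*R + (m/(v+m))*C.

    R: the movie's mean rating, v: its number of votes,
    m: the vote threshold, C: the mean vote across all movies.
    The name and the defaults are illustrative (values quoted in the question).
    """
    if v < 0 or m < 0 or v + m == 0:
        raise ValueError("need v >= 0, m >= 0 and v + m > 0")
    return (v * R + m * C) / (v + m)


if __name__ == "__main__":
    # No votes at all: the score is just the global mean C.
    print(weighted_rank(R=9.0, v=0))
    # Exactly m votes: the score sits halfway between R and C.
    print(weighted_rank(R=9.0, v=1250))
    # Very many votes: the score is close to the movie's own mean R.
    print(weighted_rank(R=9.0, v=1_000_000))
```

The three calls show the roles of the variables: m sets how many votes it takes before the movie's own average counts as much as the global average.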
 There is no answer at this time.

 Subject: Re: Alternatives to IMDB's formula From: denco-ga on 10 Apr 2005 11:58 PDT
 ```Howdy tiiba-ga, For whatever it is worth, I think the formula needs to be modified in more than one way. In order for the formula to have real meaning to me, the number of movies the viewer has seen, as well as the age of the viewer, needs to be taken into account. If the voter is 13 and has only seen 6 movies in their life (the three newer "Lord of the Rings" movies and the three most recent "Star Wars" movies), their opinion is skewed a bit. I also wouldn't mind if a person had to take a test (on film/movies in general) before they could rate a movie. Not allowing a "counted" rating until the movie has been around for a number of years wouldn't bother me either. Looking Forward, denco-ga - Google Answers Researcher```
 Subject: Re: Alternatives to IMDB's formula From: tiiba-ga on 10 Apr 2005 17:24 PDT
 ```Well, they already have a power search, which lets you filter votes based on age and sex, as well as only get the opinions of the top 1000 voters. And for my purposes, no information about who voted how is available.```
 Subject: Re: Alternatives to IMDB's formula From: volterwd-ga on 10 Apr 2005 21:18 PDT
 ```I had a thorough response and it was removed... I'm too lazy to retype it... The equation makes complete sense. It's a Bayesian estimator and makes a lot of sense. The value m is not arbitrary.```
 Subject: Re: Alternatives to IMDB's formula From: tiiba-ga on 11 Apr 2005 21:03 PDT
 ```And what makes it not arbitrary? How is it determined? Is there a single perfect m, or should it be calculated for each individual situation? I read your earlier comment, and it didn't tell me anything.```
 Subject: Re: Alternatives to IMDB's formula From: anonymous3141-ga on 11 Apr 2005 22:19 PDT
 ```There is an excellent discussion in the following thread:

http://groups-beta.google.com/group/rec.puzzles/browse_frm/thread/1bc83090d74d5643/e29f26e647c4303e?q=imdb+formula+minimum+votes+rating&rnum=4#e29f26e647c4303e

-------------------------------------

Newsgroups: rec.puzzles
From: Carlos Moreno
Date: Fri, 29 Nov 2002 11:07:29 -0500
Subject: Determining weighted rankings

Here's an interesting puzzle -- at least I'm puzzled with this! :-)

I see the Top 250 movies in the IMDB (Internet Movie DataBase), and I'm curious about the formula for the weighted ranking, which they call a "true Bayesian estimate".

People vote for a movie, giving 1 to 10, and then they rank the movies with this formula, that considers the average of votes and also the number of votes:

   v               m
------- * R  +  ------- * c
 v + m           v + m

Where:

v = number of votes
m = minimum number of votes to be considered (currently 1250)
R = average "vote" given to the movie
c = average "vote" for all the movies (currently 6.9)

(when I put "currently", I mean that that is what they list in their page, where they describe the formula)

I understand part of the "intent" of that formula, but I'm having trouble understanding another part...

I see how it conveniently gives better ranking to a movie with higher number of votes for the same average vote (if 10 people say that a movie is 8/10, that's not as good as 1000 people saying it is 8/10). So, the estimate gives certain weight to the movie's ranking, but then it also considers the overall average -- kind of an "in case of doubt, I'll assume you are in the overall average" -- which translates to "in case of insufficient information, I'll assume that probabilistically speaking, you'll approach the overall average".

Obviously, the fraction that considers the movie's average vote approaches 1 when the number of votes approaches infinity, and the other fraction (the one that considers the overall average) approaches zero when the number of votes approaches infinity.

My doubt is: what is that magical number 1250? Is it arbitrary? If it is, then how can they call this a "true Bayesian estimate"?

A true estimate tells me that they are estimating what is the value with maximum likelihood to be the true value of the movie's average vote, given that we don't count on an infinite number of votes (which is what we would require to obtain the true average vote). But this value, this formula is, the way I see it, "tainted" by this mysterious 1250...

Let me put my doubt this way: We have to determine things like the following: we have two movies: movie 1 has 2000 votes and they average 9/10; movie 2 has 4000 votes and they average 8/10; which movie is better?? (defining "better" as the movie for which an infinite number of voters would give a higher average). Movie 2 may have less average, but it carries more weight, as I have higher "confidence" in that value, since a lot of people have voted, whereas movie 1 has higher average, but it is "doubtful" -- not a lot of people have voted...

So you see that if I make m = 10, I obtain one value, but if I make m = 2000, I obtain something very different. So, is that m number really arbitrary? Or does it obey some mathematical formula depending on the probability distribution of the votes?

Thanks,
Carlos
--

Newsgroups: rec.puzzles
From: mensana...@aol.com (Mensanator)
Date: 29 Nov 2002 21:00:19 GMT
Subject: Re: Determining weighted rankings

>Subject: Determining weighted rankings
>From: Carlos Moreno moreno_at_mochima_dot_...@xx.xxx
>Date: 11/29/2002 10:07 AM Central Standard Time
>Message-id: <3DE790C1.2030...@xx.xxx>

>Let me put my doubt this way: We have to determine things like
>the following: we have two movies: movie 1 has 2000 votes and
>they average 9/10; movie 2 has 4000 votes and they average 8/10;
>which movie is better?? (defining "better" as the movie for
>which an infinite number of voters would give a higher average).
>Movie 2 may have less average, but it carries more weight, as I
>have higher "confidence" in that value, since a lot of people
>have voted, whereas movie 1 has higher average, but it is
>"doubtful" -- not a lot of people have voted...

By what criteria do you claim that not a "lot" of people voted for movie 1? It seems to me that the function of m is to determine what constitutes a "lot". In your example, 2000 > 1250, so why should it be doubtful? Plugging in the numbers gives rankings of

v=2000 R=9 m=1250 8.192307692

for movie 1 and

v=4000 R=8 m=1250 7.738095238

for movie 2, so yes, movie 1 is indeed "better" than movie 2.

One way to look at m is to ask: how many votes of 9 does a movie need to be considered "better" than movie 2?

v=831 R=9 m=1250 7.738587218

Thus, with only 831 votes required when we actually have 2000, there is no reason to be doubtful that movie 1 is better.

Another question that can be asked is what if m were smaller, say 625 instead of 1250? The ranking formula will result in a value that falls between c and R. Higher values of v bring the result closer to R. The parameter m determines how quickly it approaches R. Lower values of m mean greater significance to each vote. So we can ask: if m were 625, how many votes of 9 would a movie need to equal 4000 votes of 8?

v=4000 R=8 m=625 7.851351351
v=518 R=9 m=625 7.851706037

Note that with a lower value of m, we only need 518 votes instead of 831 to equal movie 2. Thus, each vote of 9 carries more weight when m is lower.

>So you see that if I make m = 10, I obtain one value, but if I
>make m = 2000, I obtain something very different. So, is that
>m number really arbitrary? Or does it obey some mathematical
>formula depending on the probability distribution of the votes?

I doubt that m is arbitrary, but I do not know how they arrive at it.

Newsgroups: rec.puzzles
From: ericboe...@hotmail.com (Eb Oesch)
Date: 2 Dec 2002 12:20:55 -0800
Subject: Re: Determining weighted rankings

Carlos Moreno wrote in message ...

> My doubt is: what is that magical number 1250? Is it arbitrary?
> If it is, then how can they call this a "true Bayesian estimate"?

I agree with you. "A true Bayesian estimate" implies to me that the estimate assumes a plausible model of the population of voters and uses that model to derive the best possible estimate. I can't guess what model that would be.

> A true estimate tells me that they are estimating what is the
> value with maximum likelihood to be the true value of the movie's
> average vote, given that we don't count on an infinite number
> of votes (which is what we would require to obtain the true
> average vote). But this value, this formula is, the way I see
> it, "tainted" by this mysterious 1250...

> Let me put my doubt this way: We have to determine things like
> the following: we have two movies: movie 1 has 2000 votes and
> they average 9/10; movie 2 has 4000 votes and they average 8/10;
> which movie is better?? (defining "better" as the movie for
> which an infinite number of voters would give a higher average).
> Movie 2 may have less average, but it carries more weight, as I
> have higher "confidence" in that value, since a lot of people
> have voted, whereas movie 1 has higher average, but it is
> "doubtful" -- not a lot of people have voted...

The 1250 value is unrelated to the usual statistical confidence levels. With population sizes of 2000 and 4000 and all ratings limited to the range 1/10..10/10, we can be very confident in the significance of the difference between the two means of 9/10 and 8/10 (we're talking on the order of 10 standard deviations).

Maybe the rating formula is based on the assumption that the moviegoers are voting with their feet. You could consider the decision not to see a movie as a tentative thumbs-down on that movie based on its reputation, or lack of same. But the formula isn't directly based on this consideration, either, since the formula is nonlinear in the number of voters -- it treats the difference between 2000 and 10000 votes as being much more important than the difference between 10000 and 80000 votes.

You could rightly object that a linear model based on the voting-with-their-feet assumption is too simplistic.
A person might prefer to see movie 1 but ends up seeing movie 2 instead just because movie 2 happened to be the one showing at the local theater. Any model you choose will be incomplete, since you can't model all the elements that determine how many people will end up rating a given movie on IMDB. So the actual formula used by IMDB seems like a reasonable hack. Such hacks would probably offend an anti-Bayesian, and if the enemy of your enemy is your friend, maybe that makes it a Bayesian hack. Or maybe the writer just thought that "a true Bayesian estimate" sounded more scientificky than "a weighted average".```
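The arithmetic in the quoted thread can be checked with a few lines of Python. This is only a sketch: `weighted_rank` and `votes_needed` are names made up here, and c = 6.9 is the global mean quoted in the newsgroup post (the question at the top of this page uses 6.8). `votes_needed` rearranges WR >= target into v >= m(target - c)/(R - target), which only makes sense when R is above the target score.

```python
import math


def weighted_rank(R, v, m, c=6.9):
    """IMDB-style weighted rank; c = 6.9 is the mean quoted in the thread."""
    return (v * R + m * c) / (v + m)


def votes_needed(R, target, m, c=6.9):
    """Smallest whole number of votes v, all averaging R, with WR >= target.

    Hypothetical helper: rearranges (v*R + m*c)/(v + m) >= target into
    v >= m*(target - c)/(R - target), which requires R > target.
    """
    if R <= target:
        raise ValueError("R must exceed the target score")
    return max(0, math.ceil(m * (target - c) / (R - target)))


if __name__ == "__main__":
    for m in (1250, 625):
        movie1 = weighted_rank(R=9, v=2000, m=m)
        movie2 = weighted_rank(R=8, v=4000, m=m)
        needed = votes_needed(9, movie2, m)
        print(f"m={m}: movie 1 -> {movie1:.9f}, movie 2 -> {movie2:.9f}, "
              f"votes of 9 needed to match movie 2: {needed}")
```

With m = 1250 the helper returns 831, and with m = 625 it returns 518, the same vote counts Mensanator reports.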
 Subject: Re: Alternatives to IMDB's formula From: volterwd-ga on 11 Apr 2005 22:20 PDT
 ```I'm not going to do the math since I can't get paid ;) It's not arbitrary. Here's the lowdown: you have a true mean for the movie, which is unknown because not everyone has voted. So Bayesian analysis maximizes the posterior distribution for the rating distribution for this movie given the current ratings. m is chosen to provide a balance between the prior data (the universal average, 6.8) and the data for the movie. If m is too small, the ratings will have extreme variability, but if it's too large, it will give too much weight to the prior distribution. This method is very common... a simpler version is called maximum likelihood, which uses a flat prior distribution. In that case our value of m would be 0. This all makes a lot of sense from a statistical point of view.```
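One textbook model under which the IMDB formula comes out exactly, and under which m gets a concrete meaning, is the normal-normal model. It is an assumption made here for illustration; nothing in this thread says IMDB actually uses it. The symbols θ (the movie's true mean), σ² (the variance of individual votes) and τ² (the variance of true means across movies) are introduced for this sketch only.

```latex
% Assumed model (not stated by IMDB): votes x_1, ..., x_v for one movie
\begin{align*}
x_i \mid \theta &\sim \mathcal{N}(\theta, \sigma^2), \qquad i = 1, \dots, v, \\
\theta &\sim \mathcal{N}(C, \tau^2).
\end{align*}
% With R = \frac{1}{v} \sum_i x_i, the posterior mean of \theta is
\begin{align*}
\mathbb{E}[\theta \mid x_1, \dots, x_v]
  &= \frac{\dfrac{v}{\sigma^2} R + \dfrac{1}{\tau^2} C}
          {\dfrac{v}{\sigma^2} + \dfrac{1}{\tau^2}} \\
  &= \frac{v}{v + m} R + \frac{m}{v + m} C,
  \qquad m = \frac{\sigma^2}{\tau^2}.
\end{align*}
```

Read this way, m is the ratio of vote-to-vote variance to movie-to-movie variance, and the prior behaves like m pseudo-votes cast at the global mean C. A flat prior (τ² growing without bound) gives m = 0 and the estimate reduces to R, which matches the maximum-likelihood remark in the comment.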
 Subject: Re: Alternatives to IMDB's formula From: volterwd-ga on 11 Apr 2005 22:26 PDT
 ```That guy above my last comment is talking out of his butt... disregard it. The fact that you get different values with different m is immaterial, since as you get more normal votes for a particular movie you will eventually get a rating that is solely R (the unweighted rating for the movie). Read my comment above about Bayesian analysis.```
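The claim that m matters less and less as votes accumulate can be illustrated numerically. A small sketch with made-up inputs (R = 9, C = 6.8 and three arbitrary values of m); `gap_to_R` is a hypothetical helper name.

```python
def gap_to_R(R, v, m, C=6.8):
    """How far the weighted rank sits from the movie's own mean R.

    R - WR simplifies to m*(R - C)/(v + m), so the gap shrinks like 1/v.
    """
    return m * (R - C) / (v + m)


if __name__ == "__main__":
    R = 9.0  # illustrative movie mean, not real data
    for v in (100, 10_000, 1_000_000):
        gaps = ", ".join(f"m={m}: {gap_to_R(R, v, m):.4f}" for m in (10, 1250, 2000))
        print(f"v={v:>9,}  R - WR ->  {gaps}")
```

For small v the choice of m changes the score a great deal; for very large v every choice of m gives nearly R, so the disagreement in this thread is really about movies with few votes.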