Hello Friends,
This time I thought I would add a little more spice and focus on movies. I don't know about you, but I am a movie buff. Often on a weekend, when I am trying to pick a movie from my collection, which now runs to a few TBs, I feel a little lost. Apart from a general rating or perception, the length of the movie plays a role in the choice, for a simple reason: the movie needs to be squeezed in between other demanding priorities.
So last Saturday, when I was in the middle of this process and searching for a movie shorter than 1 hour 30 minutes (there was a hard stop on that), my wife commented, “But the short movies are generally not so good.” I did not pay much heed to that then (please don't conclude anything from this), but later I thought: hold on, is that a hypothesis? Can I do something statistical here? And here we are. We will talk a little bit about correlation, the normal distribution, etc. I use R, but it is so simple that we could even do the same in an Excel sheet.
Correlation:
This is an indicator whose value lies between -1 and 1, and it indicates the strength of the linear relationship between two variables. Leaving the jargon aside, in many cases we want to relate features. Typical laws of physics, like the one relating speed and displacement, may show a perfect correlation, but those are not the point of interest. A more interesting question may be whether there is a relation between, say:
a) IQ Score of a person and Salary drawn
b) No. of obese people in an area vis-à-vis no. of fast-food centers in the locality
c) No. of Facebook friends and relationship shelf life
d) No. of hours spent in office and attrition rate for an organization
An underlying technicality I must point out here: the usual significance test for this (Pearson) correlation assumes that both variables roughly follow a normal distribution.
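To make the definition concrete, here is a minimal sketch that computes the correlation coefficient by hand and compares it with R's built-in cor(). The two vectors (hours and scores) are toy numbers invented purely for illustration; they are not from the movie list.

```r
# Toy data, invented purely for illustration (not from the movie list)
hours  <- c(1, 2, 3, 4, 5, 6)
scores <- c(52, 55, 61, 60, 68, 72)

# Pearson correlation: co-variation of the two variables,
# scaled by the spread of each one
r_manual <- sum((hours - mean(hours)) * (scores - mean(scores))) /
  sqrt(sum((hours - mean(hours))^2) * sum((scores - mean(scores))^2))

# The same quantity from base R
r_builtin <- cor(hours, scores)

print(c(manual = r_manual, builtin = r_builtin))
```

Both numbers should agree, and whatever data you feed in, the result always stays between -1 and 1.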
Normal Distribution:
This is the most commonly used probability distribution: a bell-shaped curve with equal spread on both sides of the mean. Associates and managers alike, you must have heard about normalization and the bell curve when you face or do appraisals. Many random phenomena across disciplines approximately follow a normal distribution. The image below is taken from the internet.
So I picked up movie information, and like any one of us I picked it up from IMDB (http://www.imdb.com/). I put it in a structured form like the one below. The highlighted columns may not be required at this point; I kept them with some future work in mind. The list was prepared manually; I will keep hunting for an API and will keep you posted.
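If you want to see the bell shape for yourself, the sketch below draws a standard normal curve (mean 0, standard deviation 1, chosen only for illustration) and checks the "equal spread on both sides" property with base R's dnorm() and pnorm().

```r
# Standard normal: mean 0, standard deviation 1 (illustrative choice)
x <- seq(-4, 4, by = 0.1)
dens <- dnorm(x, mean = 0, sd = 1)

# Symmetry about the mean: the curve is equally high at -1.5 and +1.5
is_symmetric <- isTRUE(all.equal(dnorm(-1.5), dnorm(1.5)))

# Share of values within 1 and 2 standard deviations of the mean
within_1sd <- pnorm(1) - pnorm(-1)   # about 0.68
within_2sd <- pnorm(2) - pnorm(-2)   # about 0.95

print(c(within_1sd = within_1sd, within_2sd = within_2sd))

# Draw the bell curve
plot(x, dens, type = "l", xlab = "Value", ylab = "Density",
     main = "Standard normal density")
```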
| Name | Year of Release | Rating | Duration (min) | Small Desc |
| Skyfall | 2012 | 8.1 | 143 | Bond's loyalty to M is tested as her past comes back to haunt her. As MI6 comes under attack, 007 must track down and destroy the threat, no matter how personal the cost. |
At this point I have taken 183 movies and stored them as a CSV file.
First things first: there are various formal ways to test whether a variable follows a normal distribution, but I just plotted histograms to see how they look. Both variables seem to follow a normal distribution closely.
Below are the commands for quick reference. What I just adore about R is its simplicity; with just a few commands we are done.
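For anyone who wants one of the formal checks, here is a hedged sketch using the Shapiro-Wilk test and a Q-Q plot from base R. It runs on simulated data, not on the movie file: the sample of 183 values and its mean and standard deviation (110 and 20) are assumptions made up for illustration.

```r
set.seed(42)  # make the simulated sample reproducible

# Simulated stand-in for 183 movie durations (hypothetical mean and sd)
duration_sim <- rnorm(183, mean = 110, sd = 20)

# Shapiro-Wilk test: the null hypothesis is that the sample is normal,
# so a small p-value (say below 0.05) is evidence against normality
sw <- shapiro.test(duration_sim)
print(sw$p.value)

# Q-Q plot: points lying close to the line suggest approximate normality
qqnorm(duration_sim)
qqline(duration_sim)
```

With the real data, the same two calls could be applied to the rating vector and the duration vector in place of duration_sim.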
film<-read.csv("film.csv",header=TRUE) # Reading the file into a data frame
x<-as.matrix(film) # Converting the data frame to a matrix, for histogram plotting
y<-as.numeric(x[,3]) # Converting the movie rating (column 3) to a numeric vector
z<-as.numeric(x[,4]) # Converting the movie duration (column 4) to a numeric vector
hist(z,col="green",border="black",xlab="Duration",ylab="mvfreq",main="Mv Duration Distribution",breaks=7)
hist(y,col="blue",border="black",xlab="mvRtng",ylab="mvfreq",main="Mv Rtng Distribution",breaks=9)
cor(y,z) # Calculate Correlation Coefficient between rating and duration
Interestingly, the correlation turns out to be 0.48 in this case, which says there is a positive correlation between these two variables, and the correlation is not small. We can set up a null hypothesis, “There is no correlation”, choose a level of significance, and test the hypothesis. However, 0.48 is a fairly high value for 183 observations, and I am sure we would reject the hypothesis that there is no correlation.
So, one way or another, the rating tends to go up with the duration of the movie.
I leave the interpretation to you, but next time you might look at the movie duration before taking a call! Mr. Directors, it might be a tip for you, who knows, and for me maybe wifey is always right. Maybe all that is short is not that sweet.
With that I will call it a day; I hope you enjoyed reading. I will be coming back with more such posts. Looking forward to your feedback and comments.
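The test can be sketched with nothing more than the two numbers reported here, the correlation of 0.48 and the sample size of 183. This uses the standard t statistic for a correlation coefficient; it is an illustration of the idea, not output from the actual movie file.

```r
r <- 0.48   # correlation reported above
n <- 183    # number of movies in the sample

# Test statistic for the null hypothesis "there is no correlation"
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t = t_stat, p = p_value))
```

The statistic comes out at roughly 7.36 and the p-value is far below the usual 0.05 level, so the null hypothesis would be rejected. With the raw rating and duration vectors at hand, base R's cor.test() does the same calculation in one call.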
Interesting observation and inference drawn.
It would help if you could elaborate on your sampling technique for the selection of the 183 movies, as there is a fair probability of bias getting introduced, and a larger set could bring out a different inference.
Keep writing...
Thanks, it was manual sampling. But 183 is a fairly good sample size. Also, we cannot rule out the possibility of different means for different samples; that is the reason I ended with the null hypothesis rejection.
First, you should increase the width of your plots to fill the width of the page. As it is now, they are too small to view properly without clicking on each plot.
Second, I assume that the correlation of 0.48 is for ALL movies in the database, as I did not see the code which restricts the data to movies that are shorter than N minutes.
Third, you will need the package XML and its function readHTMLTable() to scrape data from the www.imdb.com web site. I could probably write a function to scrape the data from this website into a data frame, relatively easily.
Thanks, yes, your assumption is right, and I don't think we should restrict by N minutes. If we want to bin by movie duration and show a trend, maybe a different correlation for different durations, that would make sense.
Let me play with readHTMLTable() a little bit; thank you so much.
Have you checked this page at IMDB?
http://www.imdb.com/plugins
Thanks
That seemed to me a way to display IMDB ratings and the like in my blog. It does not look like it will allow pulling bulk data.
Saptarsi da.. some thoughts on what might bias the correlation outcome:
1. There is a tendency on IMDB to give a higher rating (7 and above) to long-duration movies (2 hrs and more), no matter how 'popular' or 'hit' the movie was. A weighted average of the IMDB and Rotten Tomatoes ratings could have been better.
2. The samples are Hollywood movies. A sample drawn from Hollywood, Bollywood and Bengali movies (the whole population of movies we mostly watch) would give a more unbiased result.
3. Average movie duration has changed (shortened) over time. As a result, since IMDB has a tendency to give higher ratings to long-duration movies, the 60s and 70s movies (the longer-duration movies) generally get a higher rating on average, which can bias the correlation between duration and rating (as a measurement of the goodness of a movie).
I hope the thoughts here will give you lots of irritation!!! ;-)
Thanks Saikat, no reason for irritation :). I have looked into the first point, and there are movies longer than 120 minutes that got a rating of 5 or so. Bengali and Hindi movies will surely have a different normal distribution, so I think they should be dealt with separately. The opinions that you mention would be good to set up as hypotheses and test :)