meet Saptarsi: Opting for shorter movies, be aware u might be cutting the entertainment too!

Friday, 23 November 2012

Opting for shorter movies, be aware u might be cutting the entertainment too!

Hello Friends,

This time I thought I would add a little more spice and focus on movies. I don’t know about you, but I am a movie buff. Often on a weekend, when I am trying to pick a movie from my repository, which now spans a few TBs, I feel a little lost. Apart from a general rating or perception, the length of the movie plays a role in the choice, for a simple reason: the movie needs to be squeezed in between other demanding priorities.

So last Saturday, when I was in the middle of this process and searching for a movie shorter than 1 hour 30 minutes (there was a hard stop on that), my wife commented, “But the short movies are generally not so good.” I did not pay much heed to that then (please don’t conclude anything from this), but later I thought: hold on, is that a hypothesis? Can I do something statistical here? And here we are. We will talk a little bit about correlation, the normal distribution and so on. I use R, but it is so simple that we could even use an Excel sheet to do the same.

Correlation:

This is an indicator whose value lies between -1 and 1, and it indicates the strength of the linear relationship between two variables. Leaving the jargon aside, in many cases we simply want to relate two features. Typical laws of physics, such as the one relating speed and displacement, may show a perfect correlation, but those are not the point of interest. A more interesting question might be whether there is a relation between, say:

a) IQ Score of a person and Salary drawn

b) No. of obese people in an area vis-à-vis no. of fast-food centers in the locality

c) No. of Facebook friends and relationship shelf life

d) No. of hours spent in the office and attrition rate for an organization

An underlying technicality I must point out here: the usual significance test for this correlation coefficient assumes that both variables follow a normal distribution.
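To make the definition concrete, here is a minimal sketch in R that computes the coefficient by hand and compares it with the built-in cor(). The two vectors are made-up numbers purely for illustration (they are not from the movie data), and the names hours and score are my own placeholders.

```r
# Made-up illustrative data, NOT taken from the movie list
hours <- c(2, 4, 5, 7, 9, 10)
score <- c(50, 55, 65, 70, 78, 90)

# Pearson correlation: covariance divided by the product of the standard deviations
# (the 1/(n-1) factors cancel, so plain sums are enough)
r_manual <- sum((hours - mean(hours)) * (score - mean(score))) /
  sqrt(sum((hours - mean(hours))^2) * sum((score - mean(score))^2))

r_builtin <- cor(hours, score)  # same quantity, computed by R

print(c(manual = r_manual, builtin = r_builtin))
```

Both numbers should agree, and whatever data you feed in, the result always stays between -1 and 1.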

Normal Distribution:

This is the most common probability distribution: a bell-shaped curve with equal spread on both sides of the mean. Associate and manager alike, you must have heard about normalization and the bell curve when you face or do an appraisal. Many random phenomena across disciplines follow a normal distribution, at least approximately.

[Image: bell curve of a normal distribution, taken from the internet]
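For a feel of what "equal spread on both sides of the mean" means in numbers, the small sketch below uses R's built-in pnorm() on a standard normal distribution (mean 0 and standard deviation 1, chosen here only for illustration).

```r
# Probability of falling within k standard deviations of the mean
within_1sd <- pnorm(1) - pnorm(-1)   # about 68%
within_2sd <- pnorm(2) - pnorm(-2)   # about 95%

# Symmetry: the area below -1 equals the area above +1
left_tail  <- pnorm(-1)
right_tail <- 1 - pnorm(1)

print(round(c(within_1sd, within_2sd, left_tail, right_tail), 4))
```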

So I picked up movie information, like any one of us would, from IMDB (http://www.imdb.com/) and put it in a structured form like the one below. Not all of the columns may be required at this point; I kept them with some future work in mind. The list was prepared manually; I will keep hunting for an API and will keep you posted.

Name	Year of Release	Rating	Duration	Small Desc
Skyfall	2012	8.1	143	Bond's loyalty to M is tested as her past comes back to haunt her. As MI6 comes under attack, 007 must track down and destroy the threat, no matter how personal the cost.

At this point I have taken 183 movies and stored them as a CSV file.

First things first: there are various formal ways to test whether a variable follows a normal distribution, but I will just plot histograms and see how they look. Both variables seem to follow a normal distribution closely.
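One such formal check is the Shapiro-Wilk test, available in base R as shapiro.test(). The sketch below runs it on simulated numbers standing in for the duration column (the real film.csv is not reproduced here, and the mean of 120 and standard deviation of 20 are assumptions of mine), so treat it as a template rather than a result.

```r
set.seed(1)
# Simulated stand-in for 183 movie durations in minutes, NOT the real data
duration_sim <- rnorm(183, mean = 120, sd = 20)

# Null hypothesis of the test: the sample comes from a normal distribution
sw <- shapiro.test(duration_sim)
print(sw$p.value)  # a small p-value (say below 0.05) would suggest non-normality
```

On the real data the same call would simply take the rating or duration vector in place of duration_sim.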

Below are the commands for quick reference. What I just adore about R is its simplicity; with just a few commands we are done.

film<-read.csv("film.csv",header=TRUE) # Reading the file into a data frame

x<-as.matrix(film) # Converting the data frame to a matrix, for histogram plotting

y<-as.numeric(x[,3]) # Converting the movie rating (column 3) to a numeric vector

z<-as.numeric(x[,4]) # Converting the movie duration (column 4) to a numeric vector

hist(z,col="green",border="black",xlab="Duration",ylab="mvfreq",main="Mv Duration Distribution",breaks=7)

hist(y,col="blue",border="black",xlab="mvRtng",ylab="mvfreq",main="Mv Rtng Distribution",breaks=9)

cor(y,z) # Calculate Correlation Coefficient between rating and duration

Interestingly, the correlation turns out to be 0.48 in this case, which says there is a positive correlation between these two phenomena, and the correlation is not small. We can set up the hypothesis “there is no correlation”, choose a level of significance, and test it. However, 0.48 is a high value, and I am sure we would reject the hypothesis that there is no correlation.
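That claim can be checked with only the two numbers reported here, r = 0.48 and n = 183. The sketch below applies the standard t-test for a correlation coefficient, assuming the normality condition mentioned earlier holds; with the raw vectors at hand, cor.test(y, z) would perform the same test directly.

```r
r <- 0.48   # correlation reported in the post
n <- 183    # number of movies in the sample

# Test statistic for H0: "there is no correlation", with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t = t_stat, p = p_value))
```

With these inputs the statistic comes out at roughly 7.4 on 181 degrees of freedom, far beyond the usual cut-offs, so the "no correlation" hypothesis would be rejected at any common level of significance.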

So, one way or another, the rating tends to go up with the duration of the movie.

I leave the interpretation to you, but next time you might look at the movie duration before taking a call! Mr. Directors, this might be a tip for you, who knows. And maybe, for me, the wife is always right. Maybe all that is short is not that sweet.

With that I will call it a day; I hope you enjoyed reading. I will be coming back with more such posts. Looking forward to your feedback and comments.

8 comments:

Unknown, 23 November 2012 at 11:33
Interesting observation and inference drawn.
It would help if you could elaborate on your sampling technique for selecting the 183 movies, as there is a fair probability of bias being introduced, and a larger set could lead to a different inference.
Keep writing...
6y588, 23 November 2012 at 16:19
First, you should increase the width of your plots to fill the width of the page. As they are now, they are too small to view properly without clicking on each plot.

Second, I assume that the correlation of 0.48 is for ALL movies in the database, as I did not see the code which restricts the data to movies that are shorter than N minutes.

Third, you will need the package XML and its function readHTMLTable() to scrape data from the www.imdb.com web site. I could probably write a function to scrape the data from this website into a data frame relatively easily.
Matti Z, 24 November 2012 at 01:09
Have you checked this page at IMDB?
http://www.imdb.com/plugins
Saikat, 24 November 2012 at 02:57
Saptarsi da.. some thoughts on what might bias the correlation outcome:

1. There is a tendency on IMDB to give a higher rating (7 and above) to long movies (2 hours and more), no matter how 'popular' or 'hit' the movie was. A weighted average of the IMDB and Rotten Tomatoes ratings could have been better.

2. The samples are Hollywood movies. A sample drawn from Hollywood, Bollywood and Bengali movies (the whole population of movies we mostly watch) would give a more unbiased result.

3. Average movie duration has changed (shortened) over time. As a result, since IMDB has a tendency to give higher ratings to long movies, the movies of the 60s and 70s (the longer ones) generally get a higher rating on average, which can bias the correlation between duration and rating (as a measurement of the goodness of a movie).

I hope the thoughts here will give you lots of irritation!!! ;-)
Unknown, 24 November 2012 at 04:39
Thanks Saikat, no reason for irritation :). I have looked at the first point, and there are movies longer than 120 minutes that got a rating of 5 or so. Bengali and Hindi movies will surely have a different normal distribution, so I think they should be dealt with separately. The opinions you describe would be good to set up as hypotheses and test :)