Friday, 23 November 2012

Opting for shorter movies? Be aware you might be cutting the entertainment too!

Hello Friends,
This time I thought I would bring in a little more spice and focus on movies. I don’t know about you, but I am a movie buff. Often on a weekend, when I am trying to pick a movie from my repository, which now spans a few TBs, I feel a little lost. Apart from a general rating or perception, the length of the movie plays a role in the choice, for a simple reason: the movie needs to be crammed in between other demanding priorities.
So last Saturday, when I was in the middle of this process and searching for a movie shorter than 1 hour 30 minutes (there was a hard stop on that), my wife commented, “But the short movies are generally not so good.” I did not pay much heed to that then (don’t conclude anything from this, please), but later on I thought: hold on, is that a hypothesis? Can I do something statistical here? And here we are. We will talk a little bit about correlation, the normal distribution, etc. I use R, but it is so simple that we could even use an Excel sheet to do the same.
Correlation:
This is an indicator whose value lies between -1 and 1 and which indicates the strength of the linear relationship between two variables. Jargon aside, in many cases we simply want to relate two features. A typical law of physics, like the one linking speed and displacement, may give a perfect correlation, but such cases are not the point of interest. A point of interest may instead be whether there is a relation between, say:
a)      IQ score of a person and salary drawn
b)      No. of obese people in an area and no. of fast-food centers in the locality
c)      No. of Facebook friends and relationship shelf life
d)      No. of hours spent in the office and attrition rate for an organization
An underlying technicality I must point out here: the usual significance test for this (Pearson) correlation assumes that both variables follow a normal distribution, at least approximately.
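To make the definition a little more concrete, here is a minimal sketch in R that computes the correlation coefficient by hand (covariance scaled by the spread of both variables) and compares it with the built-in cor(). The two small vectors are hypothetical numbers I made up for illustration; they are not from any real data set.

```r
# Hypothetical toy data, invented purely for illustration
hours <- c(1, 2, 3, 4, 5, 6)
score <- c(52, 55, 61, 60, 68, 71)

# Pearson's r: sum of cross-deviations divided by the
# square root of the product of the summed squared deviations
r_manual <- sum((hours - mean(hours)) * (score - mean(score))) /
  sqrt(sum((hours - mean(hours))^2) * sum((score - mean(score))^2))

# Built-in version; the default method is "pearson"
r_builtin <- cor(hours, score)

print(c(manual = r_manual, builtin = r_builtin))
```

Both numbers should agree, and whatever data you feed in, the value always stays between -1 and 1.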
Normal Distribution:
This is the most common probability distribution, a bell-shaped curve with equal spread on both sides of the mean. Associates and managers alike, you must have heard about normalization and the bell curve while facing or doing appraisals. Many random phenomena across disciplines follow a normal distribution, at least approximately. Below is an image from the internet.

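If you would like to see the "equal spread on both sides" idea in numbers, here is a small sketch in R. It simulates normally distributed values (the mean of 120 and standard deviation of 20 are arbitrary choices of mine, not taken from the movie data) and checks how many fall within one and two standard deviations of the mean.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Simulated values; mean and sd are assumptions for illustration
sim <- rnorm(10000, mean = 120, sd = 20)

# Share of values within one and two standard deviations of the mean
within_1sd <- mean(abs(sim - mean(sim)) <= sd(sim))
within_2sd <- mean(abs(sim - mean(sim)) <= 2 * sd(sim))
print(c(within_1sd = within_1sd, within_2sd = within_2sd))

# Theoretical shares from the normal cumulative distribution function
print(c(pnorm(1) - pnorm(-1), pnorm(2) - pnorm(-2)))
```

The simulated shares should land close to the theoretical ones, roughly 68% and 95%.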
So I picked up movie information, like any one of us, from IMDB (http://www.imdb.com/) and put it in a structured form like the one below. The highlighted columns may not be required at this point; I kept them with some future work in mind. The list was prepared manually; I will keep hunting for an API and will keep you posted on the same.

Name | Year of Release | Rating | Duration (min) | Small Desc
Skyfall | 2012 | 8.1 | 143 | Bond's loyalty to M is tested as her past comes back to haunt her. As MI6 comes under attack, 007 must track down and destroy the threat, no matter how personal the cost.


At this point I have taken 183 movies and stored them as a csv file.
First things first: there are various formal ways to test whether a variable follows a normal distribution, but I will just plot histograms and see how they look. Both variables seem to follow a normal distribution closely.
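For the curious, here is a rough sketch of what the more formal route could look like in R, using a Q-Q plot and the Shapiro-Wilk test. I have not run this on the movie file; the vector below is simulated stand-in data (183 values with a mean and spread I picked arbitrarily), so treat it only as a template.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Simulated stand-in for a rating column; NOT the real movie data
rating_sim <- rnorm(183, mean = 7, sd = 0.8)

# Visual check: points close to the line suggest normality
qqnorm(rating_sim)
qqline(rating_sim)

# Formal check: Shapiro-Wilk test (null hypothesis: data are normal)
sw <- shapiro.test(rating_sim)
print(sw)
```

A small p-value would be evidence against normality; a large one only means the test found no clear departure from it.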
Below are the commands for quick reference. What I just adore about R is its simplicity; with just a few commands we are done.
film <- read.csv("film.csv", header = TRUE)  # Read the file into a data frame
x <- as.matrix(film)  # Convert the data frame to a (character) matrix
y <- as.numeric(x[, 3])  # Movie rating as a numeric vector
z <- as.numeric(x[, 4])  # Movie duration as a numeric vector
hist(z, col = "green", border = "black", xlab = "Duration", ylab = "mvfreq", main = "Mv Duration Distribution", breaks = 7)
hist(y, col = "blue", border = "black", xlab = "mvRtng", ylab = "mvfreq", main = "Mv Rtng Distribution", breaks = 9)
cor(y, z)  # Correlation coefficient between rating and duration
Interestingly, the correlation turns out to be 0.48 in this case, which says there is a positive correlation between these two variables, and it is not small. We can set up the hypothesis “there is no correlation”, choose a level of significance and test it. With a value as high as 0.48 on 183 movies, I am sure we would reject the hypothesis that there is no correlation.
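To back up that hunch, here is a minimal sketch of the standard t-test for a correlation coefficient, using only the two numbers quoted in this post (a correlation of 0.48 and 183 movies). With the actual rating and duration vectors at hand, R's cor.test() would do the same job in one call; the manual version below just shows the arithmetic.

```r
r <- 0.48  # correlation reported in the post
n <- 183   # number of movies in the sample

# Test statistic for H0: "no correlation", with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t = t_stat, p = p_value))
```

The t statistic comes out a little above 7, so the p-value is far below the usual 0.05 or 0.01 levels and the "no correlation" hypothesis is rejected.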
So, one way or another, the rating tends to go up with the duration of the movie.
I leave the interpretation to you, but next time you might look at the movie duration before taking a call! Dear directors, this might be a tip for you, who knows. And maybe, for me, wifey is always right. Maybe all that is short is not that sweet.
With that I will call it a day; I hope you enjoyed reading. I will be coming back with more such posts. Looking forward to your feedback and comments.






Saturday, 10 November 2012

SPSS Modeler: a quick intro

Hello Friends
It’s time I talked to you again. I happened to get some exposure to SPSS Modeler and wanted to share some of the information with you. There are several articles, blogs, demos and product manuals on it to guide (or confuse) you; I will just share the gist. Here’s how I will go: I will start by exploring the IBM stack in this space and then give you a few tidbits on the Modeler. When I explore the stack, it is based on my individual perception and might not align with the formal positioning. Most of the products are individually licensed, so we have to mix and match when making an offer to a customer, based on their need for operational and analytical decision management. For now I am limiting myself to the structured data space and am not touching upon social network analysis, text analytics, or products like BigInsights or Watson, which envisage taking AI to a crescendo and have the capability to make many highly paid business consultants redundant.
Product Portfolio:
·         SPSS Modeler desktop: It’s a cool GUI-based tool with blocks available for different tasks, where you can define your data mining and predictive analytics tasks. With its drag-and-drop features, the desktop version can really commoditize data mining. Suited for data scientists as well as business users who have a flair for the algorithms.
·         SPSS Modeler Server edition: Required for scheduling of jobs, batch-mode execution, etc.
·         CADS (Collaboration and Deployment Services): Required for collaboration, deployment and scoring. Collaboration allows multiple people to participate in model building, and scoring allows real-time integration by exposing the models as web services. So for a real-time analytical application like fraud detection, or day-to-day web experiences like association for cross-sell and up-sell, it is surely needed. Its champion-challenger model sounds fresh: a challenger model becomes the new champion when it gives better results.
·         SPSS Statistics: Offers much more flexibility with scripting and customization. More suited for techies and statisticians. Can complement Modeler for advanced statistical tasks.
·         Analytical Decision Management: This is for the business users. They can lay out the skeleton of the models here, and the tech team can work in the background to add the flesh. This allows combining the business rules defined in Analytical Decision Management with the rules coming out of SPSS Modeler. It uses CPLEX, which is a constraint-based optimizer.
·         Entity Analytics: As the name suggests, this is aimed at identifying logical duplicates and at de-duplication. Its results can significantly improve the Modeler’s accuracy.
SPSS Modeler:
Again, in this section, after going through the nodes, I only talk about some of the features I liked.
-          It provides an interface for following the CRISP-DM methodology, a cross-industry standard consisting of the stages Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.
-          The data mining workflow is defined in terms of various nodes, like:
o   Source node: For reading the data.
o   Record node: Affects the no. of records. It can be as simple as filtering or aggregation.
o   Field node: Used for data transformation, cleaning and preparation. Automated Data Preparation (ADP) is a very handy field node, allowing many easy and custom transformations with a significant probability of improving the accuracy. Anonymize allows you to suppress or mask private information, which is very relevant given the many prevalent compliance norms.
o   Graph node: Allows many types of visualization as well as evaluation of the models.
o   Modeling node: This is the cream, holding the data mining models. Within the modeling nodes there is another group, the statistical nodes, which offers useful functionality like PCA, factor analysis, discriminant analysis, etc.
o   Output node: Required for analyzing the results.
o   Export node: Allows data to be transported to other software tools like Excel, SAS, etc.
o   Super node: Allows grouping of multiple nodes in a more reusable and modular fashion.
-          Offers quite a few standard algorithms for common tasks like classification, regression, clustering, time series and association.
-          Auto Classifier and Auto Cluster make the evaluation of models really easy. To clarify a little more: if we want a classifier to detect a risky loan, we can be in two minds about which algorithm to pick. Is it a neural net, a decision tree or maybe logistic regression? Auto Classifier can evaluate all of them and pick the best.
-          SQL Pushback: Allows some of the computations to be pushed back to the database.
-          In-Database Mining: Allows SPSS Modeler to leverage the native algorithms of database vendors offering data mining capabilities, like IBM Netezza, IBM DB2 InfoSphere Warehouse, Oracle Data Miner and Microsoft Analysis Services.
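To give a feel for the "try several classifiers and pick the best" idea mentioned in the list above, here is a small sketch in R. This is only my own illustration of the concept, not how SPSS Modeler implements its Auto Classifier. The loan data is simulated, the column names (income, debt, risky) and the rule generating the risk are assumptions I invented, and I compare just two candidates: logistic regression and a decision tree from the rpart package.

```r
library(rpart)  # decision trees; a recommended package shipped with standard R installs

set.seed(7)  # arbitrary seed so the sketch is repeatable
n <- 400

# Simulated loan data; columns and the risk rule are hypothetical
loans <- data.frame(income = rnorm(n, 50, 15), debt = rnorm(n, 20, 8))
p_risky <- plogis(-1 + 0.15 * loans$debt - 0.05 * loans$income)
loans$risky <- factor(rbinom(n, 1, p_risky), levels = c(0, 1),
                      labels = c("no", "yes"))

# Simple holdout split: first 300 rows for training, the rest for testing
train <- loans[1:300, ]
test  <- loans[301:400, ]

# Candidate 1: logistic regression; candidate 2: classification tree
logit_fit <- glm(risky ~ income + debt, data = train, family = binomial)
tree_fit  <- rpart(risky ~ income + debt, data = train, method = "class")

# Predicted classes on the holdout rows
logit_pred <- ifelse(predict(logit_fit, test, type = "response") > 0.5,
                     "yes", "no")
tree_pred  <- as.character(predict(tree_fit, test, type = "class"))

# Holdout accuracy per candidate, then pick the winner
accuracy <- c(logistic = mean(logit_pred == test$risky),
              tree     = mean(tree_pred == test$risky))
best <- names(which.max(accuracy))

print(accuracy)
print(best)
```

A real tool would typically compare more candidates and more measures than plain accuracy, but the pick-the-winner loop is the same in spirit.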
Overall, Modeler is a great tool that is easy to use and intuitive, and IBM has a rich portfolio of advanced analytics and decision management products. However, such a wide range may be confusing to the end customer; industry-specific packaged solutions that combine products could demystify it. Also, packaging and readily available blocks are all well and good, but the need for deep domain knowledge and statistical understanding is here to stay for superior results.
I intentionally wanted to keep it short and just tickle your curiosity. The festival of lights is nearing. Wish you and your family a safe and joyous time!
Will meet you again real soon! I hope you enjoyed it.