Friday 21 December 2012

How ACIDic are Transactions really!

I got a little distracted while hearing about ‘NoSQL databases’: one of their key characteristics is that they typically do not support full ACID transactions. This brings back some memories, not all of them pleasant; ACID was a sure-shot item on the interviewer's menu, be it a job interview or a college viva. Honestly, I now think the entire ‘ACID’ thing was a little overdone. I have no illusion of novelty here: it is one of the most widely written-about topics, and a Google search brings up some 1.5 billion results. Still, I thought of making a point, maybe at the cost of being repetitive.

I will talk very briefly, highlighting the key concepts.
ACID (Atomicity, Consistency, Isolation, Durability)
Atomicity:  Either all or none. The system reverts to the earlier state in case of a failure (system, hardware, software). Handled by the recovery-management component.
Consistency:  When the transaction is complete, the database is in a consistent state. Responsibility of the application programmer.
Durability: Once a transaction is completed, its effects are permanent. No failure can undo them. This is the responsibility of the recovery-management component.
Isolation:  The concept of serializability; the responsibility of concurrency control.

Read it again carefully, please: are concurrency and recovery not all that we are really talking about? Keeping consistency is like having common sense, which is uncommon :).

I think we should talk about CR properties (concurrency and recovery) rather than ACID.
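To make the ‘all or none’ idea concrete, here is a minimal sketch in R. It is only a toy in-memory analogy, not how any database actually implements transactions. The transfer function, the account names and the no-negative-balance rule are all made up for illustration; the rule stands in for the consistency check that is the application programmer's job, while the snapshot plays the role of recovery.

```r
# Toy "transaction": move money between two accounts, all or nothing.
# Everything here (function name, accounts, rule) is hypothetical.
transfer <- function(accounts, from, to, amount) {
  snapshot <- accounts                         # state before the transaction
  tryCatch({
    accounts[from] <- accounts[from] - amount  # step 1: debit
    accounts[to]   <- accounts[to] + amount    # step 2: credit
    # Consistency rule owned by the application: no negative balances
    if (any(accounts < 0)) stop("balance would go negative")
    accounts                                   # "commit": return the new state
  }, error = function(e) snapshot)             # "rollback": return the old state
}

accounts  <- c(alice = 100, bob = 50)
after_ok  <- transfer(accounts, "alice", "bob", 30)   # succeeds as a whole
after_bad <- transfer(accounts, "alice", "bob", 500)  # fails, nothing changes
print(after_ok)
print(after_bad)
```

The failed transfer leaves both balances untouched, even though the debit and credit steps had already run before the rule was checked.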

Friday 23 November 2012

Opting for shorter movies? Be aware, you might be cutting the entertainment too!

Hello Friends,
This time I thought of bringing in a little more spice by focusing on movies. I don’t know about you, but I am a movie buff. Often on a weekend, when I am trying to pick a movie from my repository, which now spans some TBs, I feel a little lost. Apart from a general rating or perception, the length of the movie plays a role in the choice, for a simple reason: the movie needs to be squeezed in between other demanding priorities.
So last Saturday, when I was in the middle of this process and searching for a movie shorter than 1 hour 30 minutes (there was a hard stop on that), my wife commented, “But the short movies are generally not so good.” I did not pay much heed to that then (please don’t conclude anything from this), but later I thought: hold on, is that a hypothesis? Can I do something statistical here? And here we are. We will talk a little bit about correlation, the normal distribution, etc. I use ‘R’, but it is so simple that we can even use an Excel sheet to do the same.
Correlation:
This is an indicator whose value lies between -1 and 1, and it indicates the strength of the linear relationship between two variables. Jargon aside, in many cases we simply want to relate two features. Typical laws of physics, like the one linking speed and displacement, may show a perfect correlation, but those are not the point of interest here. A more interesting question is whether there is a relation between, say,
a)      IQ Score of a person and Salary drawn
b)      No. of obese people in an area vis-à-vis no. of fast-food centers in the locality
c)       No. of Facebook friends , with relationship shelf life
d)      No. of hours spent in office and attrition rate for an organization
An underlying technicality I must point out here: the usual significance test for this correlation assumes that both variables follow a normal distribution.
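For the curious, the coefficient is easy to compute by hand: it is the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations. Below is a minimal sketch comparing that with R's built-in cor. The numbers (hours and scores) are made up purely for illustration and have nothing to do with the movie data; pearson is just a helper name I picked.

```r
# Made-up illustrative numbers, not the movie data
hours  <- c(1, 2, 3, 4, 5, 6)
scores <- c(52, 55, 61, 60, 68, 71)

# Pearson correlation from its definition
pearson <- function(x, y) {
  dx <- x - mean(x)   # deviations of x from its mean
  dy <- y - mean(y)   # deviations of y from its mean
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

r_manual  <- pearson(hours, scores)
r_builtin <- cor(hours, scores)   # R's default method is Pearson
print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree; a variable paired with its own negative gives exactly -1, the other end of the scale.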
Normal Distribution:
This is the most common probability distribution: a bell-shaped curve with equal spread on both sides of the mean. Associates and managers alike, you must have heard about normalization and the bell curve while facing or doing appraisals. Many random phenomena across disciplines follow an approximately normal distribution.

So I picked up movie information, like any one of us, from IMDB, and I put it in a structured form like the one below. Some of the fields may not be required at this point of time; I kept them with some future work in mind. The list was prepared manually; I will keep hunting for an API and will keep you posted on the same.

Year of Release
Small Desc
Bond's loyalty to M is tested as her past comes back to haunt her. As MI6 comes under attack, 007 must track down and destroy the threat, no matter how personal the cost.

At this point I have taken 183 movies and stored them as a CSV file.
First things first: there are various formal ways to test whether a variable follows a normal distribution, but I will just plot histograms and see how they look. Both variables seem to follow normal distributions closely.
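One of the formal ways hinted at above is the Shapiro-Wilk test, available in base R as shapiro.test. Here is a minimal sketch on simulated data. The durations are random numbers, not my movie list, and the mean of 110 minutes and spread of 20 are assumptions chosen only to make the example look plausible. A skewed sample is included for contrast.

```r
set.seed(42)  # fixed seed so the sketch is repeatable

# Simulated stand-ins, NOT the real movie data
durations <- rnorm(183, mean = 110, sd = 20)   # roughly bell shaped
skewed    <- rexp(183, rate = 1 / 110)         # clearly not bell shaped

normal_check <- shapiro.test(durations)
skewed_check <- shapiro.test(skewed)

# A small p-value is evidence against normality
print(normal_check$p.value)
print(skewed_check$p.value)
```

On the real data one would pass the rating and duration vectors instead; a large p-value does not prove normality, it only fails to reject it.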

Below are the commands for quick reference. What I just adore about R is its simplicity; with just a few commands we are done.
film<-read.csv("film.csv",header=TRUE) # Reading the file into a data frame
x<-as.matrix(film) # Converting the data frame to a matrix, for histogram plotting
y<-as.numeric(x[,3]) # Converting the movie rating to a numeric vector
z<-as.numeric(x[,4]) # Converting the movie duration to a numeric vector
hist(z,col="green",border="black",xlab="Duration",ylab="mvfreq",main="Mv Duration Distribution",breaks=7)
hist(y,col="blue",border="black",xlab="mvRtng",ylab="mvfreq",main="Mv Rtng Distribution",breaks=9)
cor(y,z) # Calculate Correlation Coefficient between rating and duration
Interestingly, the correlation turns out to be 0.48 in this case, which says there is a positive correlation between these two variables, and the correlation is not small. We can set up a null hypothesis, “there is no correlation”, choose a level of significance, and test it. With a value of 0.48 over 183 movies, I am confident we would reject the hypothesis that there is no correlation.
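The hypothesis test mentioned above can be sketched from just the two numbers reported here, the correlation of 0.48 and the 183 movies. Under the null hypothesis of no correlation, r * sqrt(n - 2) / sqrt(1 - r^2) follows a t distribution with n - 2 degrees of freedom (this is the standard test and relies on the normality assumption). The variable names are my own.

```r
r <- 0.48   # correlation reported in the post
n <- 183    # number of movies in the post

# Test statistic for H0: "there is no correlation"
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(t_stat)
print(p_value)
```

With the raw rating and duration vectors at hand, base R's cor.test performs the same test in one call and also reports a confidence interval.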
So, one way or another, the rating goes up with the duration of the movie.
I leave the interpretation to you, but next time you might look at the movie duration before taking a call! Mr. Directors, this might be a tip for you, who knows; and as for me, maybe wifey is always right. Maybe all that is short is not that sweet.
With that I will call it a day; I hope you enjoyed reading. I will be coming back with more such posts. Looking forward to your feedback and comments.

Saturday 10 November 2012

SPSS Modeler a quick intro

Hello Friends
I thought it was time to talk to you again. I happened to have some exposure to SPSS Modeler and wanted to share some of the information with you. There are several articles, blogs, demos and product manuals to guide/confuse you; I will just share the gist. Here’s how I will go: I will start by exploring the IBM stack in this space and then give you tiny tidbits on the Modeler. When I explore the stack, it is based on my individual perception and might not align with the formal positioning. Most of the products are individually licensed, so we have to mix and match when putting together an offering for a customer, based on their need for operational and analytical decision management. Currently I am limiting myself to the structured data space and am not touching upon social network analysis, text analytics, or products like BigInsights or Watson, which envisage taking AI to a crescendo and have the capability to make many highly paid business consultants redundant.
Product Portfolio:
·         SPSS Modeler desktop: It’s a cool GUI-based tool with blocks available for different tasks, where you can define your data mining and predictive analytics tasks. The desktop version can really commoditize data mining with its drag-and-drop features. Suited for data scientists as well as business users who have a flair for the algorithms.
·         SPSS Modeler Server edition: Required for scheduling of jobs, batch-mode execution, etc.
·         CADS (Collaboration and Deployment Services): Required for collaboration, deployment and scoring. Collaboration allows multiple people to participate in model building, and scoring allows real-time integration by exposing the models as web services. So for a real-time analytical application like fraud detection, or day-to-day web experiences like association for cross-sell and up-sell, it is surely needed. Its champion-challenger approach sounds fresh: a challenger model becomes the new champion when it produces better results.
·         SPSS Statistics: Offers much more flexibility with scripting and customization. More suited for techies and statisticians. Can complement Modeler for advanced statistical tasks.
·         Analytical Decision Management: This is for the business users. They can lay out the skeleton of the models here, and the tech team can work in the background to put flesh on the bones. It allows combining business rules defined in Analytical Decision Management with rules coming out of SPSS Modeler. It uses CPLEX, which is a constraint-based optimizer.
·         Entity Analytics: As the name suggests, this is aimed at identifying logical duplicates and de-duplication. Its results can significantly improve the accuracy of the models.
SPSS Modeler:
In this section, after talking about the nodes, I only talk about some of the features I liked.
-          It provides an interface to follow the CRISP-DM methodology, a cross-industry standard consisting of the stages Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment.
-          The data mining workflow will be defined in terms of various nodes like
o   Source node : For reading the data
o   Record node: It affects the no. of records. It can be as simple as filtering or aggregation.
o   Field node: Used for data transformations, cleaning and preparation. Automated Data Preparation (ADP) is a very handy node, allowing many easy and custom transformations with a significant probability of improving accuracy. Anonymize allows suppressing/masking private information, which is very relevant given so many prevalent compliance norms.
o   Graph Node: Allows many types of visualization as well as evaluation of the models.
o   Modeling Node: This is the cream, which has the data mining models. Within the modeling nodes there is another group, the statistical nodes, which offers useful functionality like PCA, factor analysis, discriminant analysis, etc.
o   Output Node :  This will be required for analyzing the results
o   Export Node: Allows data to be transported to other software tools like excel, SAS etc.
o   Super Node: Allows grouping of multiple nodes in more reusable and modular fashion.
-          Offers quite a few standard algorithms for common tasks like classification, regression, clustering, time series & association.
-          Auto Classifier and Auto Cluster make the evaluation of models really easy. To clarify a little more: say we want a classifier to detect risky loans, and we are in two minds as to which algorithm to pick: a neural net, a decision tree, or maybe logistic regression. Auto Classifier can evaluate all of them and pick the best.
-          SQL Pushback: Allows pushing some of the computations back to the database.
-          In-Database Mining: Allows SPSS Modeler to leverage the native algorithms of database vendors offering data mining capabilities, like IBM Netezza, IBM DB2 InfoSphere Warehouse, Oracle Data Miner, and Microsoft Analysis Services.
Overall, Modeler is a great tool that is easy to use and intuitive, and IBM has a rich portfolio of advanced analytics and decision management products. However, such a wide range may be confusing to the end customer, and industry-specific packaged solutions combining these products could demystify it. Also, packaging and readily available blocks are all well and good, but the need for deep domain knowledge and statistical understanding is here to stay for superior results.
I intentionally wanted to keep it short and just tickle your curiosity. The festival of lights is nearing. Wish you and your family a safe and joyous time!
Will meet you again real soon! I hope you enjoyed it.

Friday 19 October 2012

Venturing in text mining with 'R'

Hello Friends, hope all of you are doing just great. I decided to create my footprint in the blog space; it comes from my desire to share a few very basic steps of text mining with all of you. I am neither a nerd, nor a statistician, nor an established data scientist, and if you are one of them, well, this blog is surely not for you.
I really struggled while experimenting with this simple stuff and thought of sharing it with each one of you. I have spent the last 6-7 years as a DW & BI professional. I have seen the full cycle from data extraction to information delivery, using various tools and technologies, and when we talk about advanced analytics it includes advanced data mining/machine learning techniques apart from traditional OLAP. Predictive analytics will become more relevant with the surge of data. However, I see that a lot of my colleagues, across organizations, find themselves a little awkward and out of place when there is talk about dimensionality reduction, customer segmentation, tag clouds or anomaly detection. It is an acknowledged fact that data mining, for both structured and unstructured data, needs to be much more commoditized in the DWBI community. I write to address this gap! Instead of a bookish bottom-up approach, I go top-down, with a focus on a small yet intuitive task. Again, it is not a ‘how to’, so I mostly omit obvious screenshots and try to bring an interactive feel to the narrative. As I said, in case you are an expert in this field, venturing further is at your own will and risk. If you are interested in ‘R’ and how it can be used for text mining, I am trying to write as lucidly as possible; let’s take the plunge together. So below is the flow: I will describe a few terms in my own way and then talk about a simple text mining task.
R:
Well, it’s an open source tool for statistical & visualization tasks. Formally it is positioned as an environment for statistical computing and graphics; it is a successor of S, which was developed at Bell Labs, and it is part of the GNU project. My inquisitive friends can look it up and be further enlightened. Informally, it is a lightweight, convenient tool which is in fact free and is robust, with a lot of features. There are lots of people active in different communities, discussion forums and mailing lists. You don’t need Unix and all; it runs smoothly on our own “Windows”. So we can get started with ‘R’ for basic data mining and text mining jobs without any further ado.
Stemming:
I will start with an example. Let me put it this way: when we are doing text analysis, mining, natural language processing or any other kind of task where we deal with words, a basic thing would surely be looking at the distinct words and their counts. Ideally we want to count ‘run’, ‘ran’, ‘running’ and ‘runner’ as one word rather than four. So we count the stem of the word rather than its various forms (how many of these forms a given stemmer actually merges depends on the algorithm).
Stop words:
These are frequently used words, such as ‘the’, ‘is’ and ‘at’, which are generally filtered out before a processing task. We encounter them in every sentence, every day. The list of stop words can change depending on the nature of the text mining task.
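Filtering stop words is simple enough to sketch in plain R, without any package. The tiny stop-word list below is my own made-up example for this one sentence; real lists are much longer and, as said, depend on the task.

```r
# One lower-cased sentence, punctuation already removed
sentence <- "this is the second text document with almost no text"

# Hypothetical, deliberately tiny stop-word list for illustration
stop_words <- c("this", "is", "the", "with", "almost", "no")

words <- strsplit(sentence, " ", fixed = TRUE)[[1]]  # split into words
kept  <- words[!words %in% stop_words]               # drop the stop words

print(kept)
```

What survives is the part of the sentence that carries the content: "second", "text", "document" and "text" again.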

Term Document Matrix:
Don’t scoff at me if I say that a text document is nothing but a very high-dimensional vector in which a lot of the dimensions are sparsely populated. Texts are high-dimensional for a simple reason: the words work as dimensions, and we know how large the number of words can be. The number of dimensions can be as large as the number of words in a dictionary. There is a popular abstraction of a text document, the bag of words: we start by extracting words one by one from a document and dumping them in a bag. We lose the sense of grammar and the order of the words, but this is still a fairly useful abstraction. Loosely, ‘word’ and ‘term’ can be used interchangeably; the only difference is that we might have removed stop words and applied stemming to prune the word list, so the final list of terms can be significantly shorter than the original word list.
Coming back to the term document matrix: in its simplest form, it holds the terms and their frequency in each document. I have taken two very short documents below to illustrate the term document matrix. The weights in this case are simple term frequencies. However, a popular weighting method combines the term frequency with the inverse document frequency (IDF), which also takes the rarity of the word into consideration. I will keep this for subsequent articles, maybe.
Document1:  Text mining is cool.
Document2:  This is the second text document, with almost no text.
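For these two documents the matrix can be built in a few lines of plain R. This is a minimal sketch that does not use the tm package: tokenize is a helper I made up, it only lower-cases, strips punctuation and splits on whitespace, and no stop-word removal or stemming is applied, so every word counts as a term.

```r
docs <- c(Document1 = "Text mining is cool.",
          Document2 = "This is the second text document, with almost no text.")

# Hypothetical helper: lower-case, drop punctuation, split on whitespace
tokenize <- function(txt) {
  txt <- tolower(txt)
  txt <- gsub("[[:punct:]]", "", txt)
  strsplit(txt, "[[:space:]]+")[[1]]
}

tokens <- lapply(docs, tokenize)          # one word vector per document
terms  <- sort(unique(unlist(tokens)))    # the vocabulary (the "dimensions")

# Count every vocabulary term in every document: terms in rows, documents in columns
tdm <- vapply(tokens,
              function(tok) as.integer(table(factor(tok, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms

print(tdm)
```

Each column is the bag of words of one document. The row for "text" holds 1 for Document1 and 2 for Document2, and most other cells hold 0 for one of the two documents, which is the sparsity mentioned earlier.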


R Installation and getting started:
Well, R can be installed from the link I shared earlier. Follow the version for your OS. I will presume it is Windows and continue; with the default settings you get a shortcut on the desktop. Click on it and R gets launched.
Text Mining Specific Installation:
You need to install two packages, tm and Snowball, which do not come with the default installation. Go to Packages and click on Install package(s). Select a CRAN mirror, preferably one that is geographically close. Select the packages one by one. Installation happens automatically, but we still need to load the packages before using them, for example with library(tm). (The same installation can be done from the prompt with install.packages("tm").)
About the task:
You can pick any task you want, or use the default one explained in the text mining documents “Introduction to the tm Package” or “Text Mining Infrastructure in R”. The second one is very detailed; for interested folks it is definitely a must-read. However, I thought of making it a little different and maybe more interesting. I thought of identifying the buzzwords/trends from the ‘C’ level of offshore-based IT services companies, and I chose N. Chandrasekaran (CEO, TCS), Francisco D’Souza (CEO, Cognizant) and S.D. Shibulal (CEO, Infosys). I collected a total of 5 interviews and saved them as 5 text files. For a broad-based trend we would surely need many more documents, but this can still give a decent start.
Step 1: I saved the text files in C:<>\Documents\R\win-library\2.15\tm\texts. The path will depend on the installation options, though. I created a folder specifically for the interviews and saved the files there. I set the path for tm here:
Intrvw <- system.file("texts", "Interview", package = "tm")
Step 2: Create a corpus named “IntrvwC” from these documents
IntrvwC<-Corpus(DirSource(Intrvw), readerControl = list(reader = readPlain, language = "eng"))
Step 3: Strip extra whitespace from the corpus
IntrvwC <- tm_map(IntrvwC, stripWhitespace)
Step 4: Convert all the words to lower case
IntrvwC <- tm_map(IntrvwC, tolower) # on tm 0.6 or later: tm_map(IntrvwC, content_transformer(tolower))
Step 5: Remove stop words
IntrvwC <- tm_map(IntrvwC, removeWords, stopwords("english"))
Step 6: Remove punctuation
IntrvwC <- tm_map(IntrvwC, removePunctuation)
Step 7: Stem the words
IntrvwC <- tm_map(IntrvwC, stemDocument)
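Now we are done with the required preprocessing, and we can look at the document term matrix for this corpus.

The document term matrix
The matrix comes from DocumentTermMatrix(IntrvwC) and is stored here as dtmIntrvw, the name used in the commands below. If we just type the name of the document term matrix at the default R prompt, it gives the result below.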
A document-term matrix (5 documents, 602 terms)

Non-/sparse entries: 751/2259
Sparsity           : 75%
Maximal term length: 41
Weighting          : term frequency (tf)

Finding frequent terms:
We use the below command
findFreqTerms(dtmIntrvw, 5)
This identifies all the terms that occur at least 5 times in the corpus.
[1] "business"             "cash"                 "cent"                 "chandrasekaran"       "clients"              "companies"            "company"            
 [8] "customers"            "discretionary"        "don<U+393C><U+3E32>t" "europe"               "financial"            "growth"               "industry"            
[15] "infosys"              "insurance"            "look"                 "margins"              "opportunities"        "quarter"              "services"           
[22] "shibulal"             "spend"                "spending"             "strategic"            "tcs"                  "technology"           "time"
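What findFreqTerms does can be imitated in plain R: add up each term's counts over all the documents and keep the terms whose total reaches the lower bound. The sketch below only shows the idea and is not tm's actual code; the small matrix and its counts are made up, not taken from the interviews.

```r
# Made-up document-term counts: documents in rows, terms in columns
dtm <- matrix(c(3, 0, 1,
                2, 4, 0,
                1, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("doc1", "doc2", "doc3"),
                              c("growth", "europe", "cash")))

totals   <- colSums(dtm)                 # total count of each term in the corpus
frequent <- names(totals[totals >= 5])   # keep terms with a total of 5 or more

print(totals)
print(frequent)
```

Here "growth" (6) and "europe" (5) make the cut, while "cash" (2) does not.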

It would be audacious to conclude anything from a corpus of five documents. Nevertheless, Europe seems to be on every leader’s mind.
With that I will sign off; thanks for bearing with me. I look forward to your comments. Wish you a happy festive time ahead with your family and friends.