
Friday 19 October 2012

Venturing in text mining with 'R'

Background:
Hello friends, I hope all of you are doing just great.  I decided to create my footprint in the blog space out of a desire to share a few very basic steps of text mining with all of you.  I am neither a nerd, nor a statistician, nor an established data scientist, and if you are one of them, well, this blog is surely not for you.
I really struggled while I was experimenting with this simple stuff and thought of sharing it with each one of you. I have spent the last 6-7 years as a DW & BI professional. I have seen the full cycle from data extraction to information delivery, using various tools and technologies, and when we talk about advanced analytics it includes advanced data mining/machine learning techniques apart from traditional OLAP.  Predictive analytics will only become more relevant with the splurge of data. However, I see a lot of my colleagues, across organizations, finding themselves a little awkward and out of place when the talk turns to dimensionality reduction, customer segmentation, tag clouding or anomaly detection.  It is an acknowledged fact that data mining, for both structured and unstructured data, needs to be much more commoditized in the DWBI community.  I write to address this gap!

Instead of a bookish bottom-up approach, I go top-down, with focus on a small yet intuitive task. Again, this is not a "How to", so I mostly omit the obvious screenshots and try to bring an interactive feel to the narrative. As I said, in case you are an erudite in this field, venturing further is at your own will and risk. If you are interested in 'R' and how it can be used for text mining, I am trying to write as lucidly as possible, so let's take the plunge together.  Below is the flow: I will describe a few terms in my own way and then talk about a simple text mining task.
R:
Well, it is an open source tool for statistical and visualization tasks.  Formally it is positioned as an environment for statistical computing and graphics; it is a successor of S, which was developed at Bell Labs, and it is part of the GNU project.  My inquisitive friends can look at http://cran.r-project.org/ and be further enlightened.  Informally, it is a lightweight, convenient tool which is in fact free, robust and rich in features.  There are lots of people to help you out in different communities, discussion forums and mailing lists.  You don't need Unix and all; it runs smoothly on our own "Windows". So we can get started with 'R' for basic data mining and text mining jobs without any further ado.
Stemming:
I will start with an example. Put it this way: when we are doing text analysis, text mining, natural language processing or any other kind of task where we deal with words, a basic step is to look at the distinct words and their counts.  We want to count 'run', 'ran', 'running' and 'runner' as one word rather than four, so we count the stem of the word rather than the inflected form.
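A minimal illustration, assuming the SnowballC package (the current home of the Snowball stemmers used by tm); note that a suffix-stripping stemmer such as Porter's handles regular forms well but leaves irregular ones like 'ran' untouched:

library(SnowballC)   # provides wordStem(), the Snowball/Porter stemmer

# Several inflected forms collapse to one common stem
wordStem(c("connect", "connected", "connection", "connections"))
# [1] "connect" "connect" "connect" "connect"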
Stop words:
These are very frequently used words, which are generally filtered out before a processing task; we encounter them constantly in everyday text. The stop word list can change depending on the nature of the text mining task.
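The tm package ships with a ready-made English stop word list; a quick peek, assuming tm is installed and loaded:

library(tm)

# First few entries of the built-in English stop word list
head(stopwords("english"))
# e.g. "i" "me" "my" "myself" "we" "our"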

Term Document Matrix:
Don't scoff at me if I say text documents are nothing but very high dimensional vectors, with a lot of the dimensions sparsely populated. Texts are high dimensional for the simple reason that the words act as the dimensions, and the number of dimensions can be as large as the number of words in a dictionary. There is a popular abstraction of a text document: we extract words one by one from the document, as if dumping them in a bag. We lose the sense of grammar and the order of words, but this is still a fairly useful abstraction.  Loosely, 'word' and 'term' can be used interchangeably; it is only that we might have removed stop words and applied stemming to prune the word list, so the final term list can be significantly shorter than the original list of words.
Coming back to the term-document matrix: in its simplest form, it holds the terms and their frequency against each document. I have taken two very short documents below and then illustrated the matrix.  The weights in this case are simple term frequencies. However, a popular weighting method combines term frequency with inverse document frequency (IDF), which also takes the rarity of a word into consideration. I will keep this for subsequent articles, maybe.
Document1:  Text mining is cool.
Document2:  This is the second text document, with almost no text.

Docs   almost  cool  document  mining  second  text  the  this  with
1      0       1     0         1       0       1     0    0     0
2      1       0     1         0       1       2     1    1     1
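If you want to reproduce this small matrix yourself, here is a minimal sketch with tm (the object names are mine; the exact terms kept depend on tm's tokenisation options, so punctuation is removed explicitly):

library(tm)

docs <- c("Text mining is cool.",
          "This is the second text document, with almost no text.")

# Build a corpus from the two in-memory documents
toyCorpus <- Corpus(VectorSource(docs))

# Rows are documents, columns are terms, cells are term frequencies
toyDtm <- DocumentTermMatrix(toyCorpus,
                             control = list(removePunctuation = TRUE))
inspect(toyDtm)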


R Installation and getting started:
Well, R can be installed from the link I shared earlier.  Follow the version for your OS; I will presume it is Windows and continue. With the default settings you get a shortcut on the desktop; click on it and R gets launched.
Text Mining Specific Installation:
You would need to install two packages, tm and Snowball, which do not come with the default setting.  Go to Packages and click on Install package(s). Select a CRAN mirror, preferably one which is geographically closer, then select the packages one by one. The installation happens automatically, though we might still need to load the packages afterwards.
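If you prefer the console to the menus, the same installation can be scripted (a minimal sketch; on recent versions of R the stemming package is published as SnowballC rather than Snowball):

# Install the text mining infrastructure and the Snowball stemmers
install.packages("tm")
install.packages("SnowballC")   # called "Snowball" in older versions

# Load them into the current session
library(tm)
library(SnowballC)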
About the task:
You can pick any task you want, or use the default one as explained in the tm documents "Introduction to the tm Package" or "Text Mining Infrastructure in R". The second one is very detailed and, for interested folks, definitely a must read.  However, I thought of making it a little different and maybe more interesting: identifying the buzz words/trends from the 'C' levels of IT offshore-based service companies. I chose N. Chandrasekaran (CEO, TCS), Francisco D'Souza (CEO, Cognizant) and S. D. Shibulal (CEO, Infosys).  I collected a total of 5 interviews and saved them as 5 text files.  For a broad-based trend we would surely need many more documents, but this can still give us a decent start.
Step 1: Save the text files in C:<>\Documents\R\win-library\2.15\tm\texts (the exact path also depends on your installation options). I created a folder called Interview specifically for this, saved the files there, and set the path for tm:
Intrvw <- system.file("texts", "Interview", package = "tm")
Step 2: Create a corpus named IntrvwC from these documents:
IntrvwC <- Corpus(DirSource(Intrvw), readerControl = list(reader = readPlain, language = "eng"))
Step 3: Strip extra whitespace from the corpus:
IntrvwC <- tm_map(IntrvwC, stripWhitespace)
Step 4: Convert all words to lower case:
IntrvwC <- tm_map(IntrvwC, tolower)
Step 5: Remove stop words:
IntrvwC <- tm_map(IntrvwC, removeWords, stopwords("english"))
Step 6: Remove punctuation:
IntrvwC <- tm_map(IntrvwC, removePunctuation)
Step 7: Stem the words:
IntrvwC <- tm_map(IntrvwC, stemDocument)
Now we are done with the required pre-processing and can look at the document-term matrix for this corpus.
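Before inspecting it, the document-term matrix has to be built from the cleaned corpus; the object name dtmIntrvw below is my own choice, picked so that it matches the findFreqTerms call further down.

# Build the document-term matrix from the pre-processed corpus
dtmIntrvw <- DocumentTermMatrix(IntrvwC)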

The document-term matrix:
If we just type the name of the document-term matrix at the default R prompt, it gives the result below:
A document-term matrix (5 documents, 602 terms)

Non-/sparse entries: 751/2259
Sparsity           : 75%
Maximal term length: 41
Weighting          : term frequency (tf)

Finding frequent terms:
We use the following command:
findFreqTerms(dtmIntrvw, 5)
This will identify all the terms that have occurred at least 5 times in the corpus:
[1] "business"             "cash"                 "cent"                 "chandrasekaran"       "clients"              "companies"            "company"            
 [8] "customers"            "discretionary"        "don<U+393C><U+3E32>t" "europe"               "financial"            "growth"               "industry"            
[15] "infosys"              "insurance"            "look"                 "margins"              "opportunities"        "quarter"              "services"           
[22] "shibulal"             "spend"                "spending"             "strategic"            "tcs"                  "technology"           "time"

It would be audacious to conclude anything from a corpus of five documents. Nevertheless, Europe seems to be on every leadership's mind.
With that I will sign off. Thanks for bearing with me; I am very much looking forward to your comments. Wish you a happy festive time ahead with your family and friends.

13 comments:

  1. Finding the starting point for any new concept is often a challenge, thanks for making the starting point of 'R' so interesting and lucid, will wait for more on this in this space from you..

  2. Thanks for introducing us to 'R'...Will be hooked on to this page for more updates...Keep them coming..Welcome to blogosphere!!!

  3. That was seriously a lucid way of introducing "R" to us. Keep us updating more on this!

  4. Thank you for this introduction ... it's helpful ...
    A question:
    How would you modify this intro/procedure to work with languages other than English? I mean unicode and/or ISO Latin 2 based languages...
    I would appreciate any suggestion ... :)

    Replies
    1. Hello Crisiian,

      You can change the language argument in the command below; for example, for Latin you can specify the language like this:

      IntrvwC<-Corpus(DirSource(Intrvw), readerControl = list(reader = readPlain, language = "la"))

  5. Awesome man, keep it up.

    Dana Cepat

  6. It's cool... Venturing in text mining with 'R'.
    Keep posting, bro.
    bali luxury villas

  7. Hello,
    I have this question: given a set of information on users (age, education, location, job..), I want to predict whether a user will come to an event (yes or no).
    I should be using binomial classification, right?
    Now how do I use a neural net to predict future event attendance?

    Thanks,
    Regards,
    vishal

  8. The dataset seems to be small, so you can use a decision tree, as it is more intuitive and you can explain the results better to business users. The process of classification is simple: you need a historical dataset, which will be used to build the model. The historical dataset will contain the user attributes and past information about their attendance at events. You may have to split the historical dataset into two parts (training and test) to avoid overfitting. If you are contemplating choosing a tool, you may use Weka, which is open source and has a UI; R has a steeper learning curve.
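     A minimal sketch of that flow in R using the rpart package (the data frame events and its columns, such as attended, are hypothetical names for illustration):

     library(rpart)

     # events: historical data with user attributes plus a yes/no "attended" column
     # Split it into training and test sets to guard against overfitting
     set.seed(42)
     idx   <- sample(nrow(events), 0.7 * nrow(events))
     train <- events[idx, ]
     test  <- events[-idx, ]

     # Fit a classification tree predicting attendance from all other attributes
     fit <- rpart(attended ~ ., data = train, method = "class")

     # Predict on the held-out data and compare with the actual outcomes
     pred <- predict(fit, test, type = "class")
     table(predicted = pred, actual = test$attended)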
