Monday, 28 July 2014

Beyond bars, pie charts and histograms …………

(Written for the Institute of Engineering and Management 2014 magazine; put online for easier access, especially for my students.)

The popular saying “A picture is worth a thousand words” goes back to the early part of the 20th century. If we take graphs as a rough equivalent of pictures, then apart from quick interpretability, their other benefit is a summarized view of the data. As a case in point, let’s look at the US crime data below [U.S. Department of Justice]. I am taking property crime data subdivided into Burglary, Larceny-theft and Motor Vehicle Theft (for those of you who, like me, have been wondering what larceny is, it’s just a legal term for stealing). The numbers indicate rates per 100,000 population.


States        Burglary  Larceny-theft  Motor Vehicle Theft   Total
Alabama          953.8           2650                288.3  3892.1
Alaska           622.5         2599.1                  391  3612.5
Arizona          948.4         2965.2                924.4    4838
Arkansas        1084.6         2711.2                262.1  4057.9
California       693.3         1916.5                712.8  3322.6
Colorado         744.8         2735.2                559.5  4039.5
Connecticut      437.1         1824.1                296.8    2558
Delaware         688.9           2144                278.5  3111.4
Florida          926.3         2658.3                423.3  4007.9
Georgia            931         2751.1                490.2  4172.3

Many of you would apply your ‘CAT-cracking’ DI (data interpretation) skills and do some quick mental math far better than I can, but there is no denying that the chart below reveals much more to mere mortals. We are very quick to spot the two most and the two least crime-prone states.





Let’s look at another view of the same data, this time as a stacked bar chart.



A 100% stacked bar chart takes the total for each state as 100% and then shows the contribution of each component as a differently colored portion. It will be difficult to miss that
·  Alabama and Arkansas have a relatively small proportion of motor vehicle theft
·  whereas Alaska and Colorado have a relatively low proportion of burglary
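For readers who want to reproduce such a chart, here is a minimal sketch in R, assuming base graphics only. The numbers are copied from the table above; the variable names and colors are my own choices and not necessarily what was used for the original figure.

```r
# Property crime rates per 100,000 population, copied from the table above.
crime <- data.frame(
  state    = c("Alabama", "Alaska", "Arizona", "Arkansas", "California",
               "Colorado", "Connecticut", "Delaware", "Florida", "Georgia"),
  burglary = c(953.8, 622.5, 948.4, 1084.6, 693.3, 744.8, 437.1, 688.9, 926.3, 931),
  larceny  = c(2650, 2599.1, 2965.2, 2711.2, 1916.5, 2735.2, 1824.1, 2144, 2658.3, 2751.1),
  motor    = c(288.3, 391, 924.4, 262.1, 712.8, 559.5, 296.8, 278.5, 423.3, 490.2)
)

# One column per state, one row per crime type.
counts <- t(as.matrix(crime[, c("burglary", "larceny", "motor")]))
colnames(counts) <- crime$state

# Convert each state's column to percentages so that every bar sums to 100.
shares <- prop.table(counts, margin = 2) * 100
print(round(shares, 1))

# Draw the 100% stacked bar chart (colors are arbitrary).
barplot(shares, col = c("steelblue", "orange", "grey50"), las = 2,
        ylab = "Share of property crime (%)",
        legend.text = c("Burglary", "Larceny-theft", "Motor vehicle theft"),
        args.legend = list(x = "topright", cex = 0.7))
```

Printing the shares alongside the chart makes it easy to check a visual impression against the actual percentages.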
I hope some of you are by now convinced of the usefulness of traditional charts. So this is an opportune time for a “kahani mein twist” (a twist in the tale). There is a trite piece of ‘gyan’ (wisdom) available, especially for negotiating on price while shopping from hawkers. You should not start negotiating on the thing you liked and want to purchase; rather, start with something else nearby (for lack of an original example, let me settle for this: if you are trying to buy a রুমাল (handkerchief), ask the price of a বেড়াল (cat)), bargain hard, and then casually bring him round to the piece you actually want to buy. If the hawker understands that you have really fallen for it, the chance of negotiation is bleak. With this entire prologue, let me just say that gone are the days when we could impress people with graphical reports built from bar charts, histograms, pie charts, line charts, etc. This is felt even more with the advent of “Big Data”.
Commercially, “Big Data” is characterized by three Vs: Volume, Variety and Velocity. Without going into too much detail, I guess it will be sufficient to draw your attention to the number of tweets per day, which is 500 million! This gave birth to a new term, tweets per second (TPS). To understand variety, we need not look beyond the photos, blogs, chats, emails, medical images, sensor logs and RFID logs generated every second. When we talk about velocity, we are talking not only about the rate at which data is generated, but also about the rate at which it becomes obsolete. One practical example is that Gurgaon police nowadays have added tweets to their dashboard. So if people are tweeting about a quarrel breaking out, bedlam, or pandemonium for some other reason, the police can track the exact area in real time via the tweets and send the nearest patrol car within minutes. So, contrary to the “filmy” style of arriving well after the crime is done and the criminals have fled, the cops now arrive well in time.
A whole toolbox of visualization techniques has grown up around this new “avatar” of data, and a lot can be written about them. I will choose just one as a class representative (of the class of new visualization techniques).
So meet “Mr Word Cloud”. You might not have known his formal name, but you have surely had a chance or deliberate tryst with him somewhere in your internet life. He is not a toddler; he is at least in your age group (born in 1992). Here is a word cloud used by a leading data analyst in his blog to emphasize the importance of data. Word clouds are very handy for summarizing large bodies of text, blogs and tweets. Oh, by the way, I was forgetting: he does not mind being called a ‘tag cloud’ either.


It’s easy to note that the words have different font sizes, proportional to their frequency (a proxy for importance); the vertical or horizontal orientation does not mean much.
Roughly, there are two ways you can start using word clouds.
1.       Use a site like http://www.wordle.net/ . Either copy and paste your text or give the URL of a site that has an Atom or RSS feed (very crudely, a mechanism for exchanging dynamically updated information).
2.       Use a computing environment like R, one of the most widely used open-source statistical and data analysis computing environments (http://cran.r-project.org/).
The one below is freshly baked for you, using no more than 10 lines of R code, from a few of the reviews (NDTV, Hindustan Times, TOI, etc.) of a movie, and there is no prize for guessing the name of the movie. (I filtered to keep only words that appeared at least 3 times; some further work can be done in removing film-industry-specific ‘stop words’, basically words such as ‘movie’ and ‘film’ that are likely to be present in any movie review irrespective of the movie.) Without watching the movie, I would probably not be wrong in concluding that if the movie is to be watched, it’s for Aamir.
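The filtering described above can be sketched roughly as follows. This is not the code used for the figure: the review snippets and the stop-word list are made up for illustration, and the drawing step assumes the optional wordcloud package is installed.

```r
# Hypothetical review snippets; the actual reviews used for the post are not reproduced here.
reviews <- c(
  "The movie rides on its lead actor and the lead actor delivers",
  "A film that works because the lead actor carries every scene",
  "Watch the film for the actor, not for the plot of the movie"
)

# Lower-case the text, split on anything that is not a letter, drop empty tokens.
words <- unlist(strsplit(tolower(paste(reviews, collapse = " ")), "[^a-z]+"))
words <- words[nzchar(words)]

# Remove general stop words plus review-specific ones such as "movie" and "film".
stop_words <- c("the", "a", "on", "its", "and", "that", "for", "not", "of",
                "because", "every", "movie", "film")
words <- words[!(words %in% stop_words)]

# Count occurrences and keep only words seen at least 3 times.
freq <- sort(table(words), decreasing = TRUE)
freq <- freq[freq >= 3]
print(freq)

# If the 'wordcloud' package is available, draw the cloud; font size follows frequency.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(freq), as.numeric(freq), min.freq = 3)
}
```

With real reviews the stop-word list would need to be much longer; the point here is only the order of the steps: tokenize, filter, count, threshold, draw.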

I hope I have made a case for visualization techniques. And did you guys think they can help hide the author’s excellent writing prowess, or the lack of it? Well, none of your provocations will elicit an answer from me.

Monday, 31 March 2014

Exploratory data analysis on P/E ratio of Indian Stocks

The Price-Earnings ratio (P/E) is one of the most popular ratios reported for all stocks. Very simply, it is computed as Current Market Price / Earnings per Share. An operational definition of Earnings per Share is total profit divided by the number of shares. For further reading, I will redirect interested readers to
In this post, I would just like to show how we can grab P/E data from the web and create some visualizations from it. My focus right now is Indian stocks, and I intend to use the below website
So my first step is gearing up for the data extraction, and that is essentially the most non-trivial task. As shown in the figure below, there are separate pages for each sector, and we need to click on the individual links to go to each page and get the P/E ratios.
Here is something I did outside R: I created a CSV file with the sector names, using delimiters while importing the text and then Paste Special with Transpose. Here is how my CSV file looks. I would never discourage using multiple tools, as this is often required to solve real-world problems.


So now I can import this into a dataset, read one row at a time and go to the necessary URLs. But God has different plans :) and it’s not that straightforward.
Case 1: Single-word sector names
We have the sector ‘Banks’, and the sector link is as below
Again it is a no-brainer: we can pick up the base URL, append the sector name after a forward slash and then append the string ‘-Sector’. This is true for most single-word sector names like ‘FMCG’, ‘Tyres’, ‘Heathcare’, etc.
Case 2: Multiple words without ‘-’, ‘&’ and ‘/’
We have the sector ‘Tobacco Products’, and the sector link is as below
This is also not that difficult: apart from adding ‘-Sector’, we need to replace the spaces with a ‘-’.
Case 3: Multiple words with a ‘-’
We have the sector name ‘IT-Software’, where we have to remove any other spaces, if existing. There can be several other cases, but for discussion’s sake I will limit myself here.
Case 4: Multiple words with a ‘/’
We have the sector name ‘Stock/ Commodity Brokers’, so the ‘/’ needs to be removed.
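The cases above can be tried out in isolation with a small helper before touching the network. This is only a sketch: sector_slug is a hypothetical name, it collapses any run of hyphens instead of just a triple one, and whether the resulting names match the live site is not verified here.

```r
# Hypothetical helper: turns a sector name into the last part of its URL.
sector_slug <- function(sector) {
  slug <- gsub(" ", "-", sector, fixed = TRUE)    # spaces become hyphens
  slug <- gsub("&", "and", slug, fixed = TRUE)    # '&' is spelled out
  slug <- gsub("[/,]", "", slug)                  # drop '/' and ','
  slug <- gsub("-+", "-", slug)                   # collapse runs of hyphens
  paste0(slug, "-Sector")
}

# One example per case; 'IT - Software' is an assumed spelling with extra spaces.
sectors <- c("Banks", "Tobacco Products", "IT - Software", "Stock/ Commodity Brokers")
slugs <- sector_slug(sectors)
print(slugs)
```

Because gsub is vectorised, the helper handles the whole column of sector names in one call, which makes it easy to eyeball the generated names before requesting any page.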
# XML provides readHTMLTable and RCurl provides url.exists
library(XML)
library(RCurl)
# Reading in dataset
sectorsv1 <- read.csv("C:/Users/user/Desktop/Datasets/sectorsv1.csv")
# Converting to a matrix , this is a practice generally I follow
sectorvm <- as.matrix(sectorsv1)
# We can access individual sectors by sectorvm[rowno, colno]
pe <- c()
cname <- c()
cnt <- 0
baseurl <- 'http://www.indiainfoline.com/MarketStatistics/PE-Ratios/'
for (i in 1:nrow(sectorvm))
{
  securl <- sectorvm[i, 1]
  # fixed = TRUE indicates the string is to be matched as is and is not a regular expression
  # Substitution of the different cases as we explained; note that we use gsub instead of sub,
  # else only the first instance will be replaced
  if (length(grep(' ', securl, fixed = TRUE)) != 1)
  {
    securl <- paste(securl, '-Sector', sep = "")
  }
  else
  {
    securl <- gsub(' ', '-', securl, ignore.case = FALSE, fixed = TRUE)
    # A name like 'IT - Software' is now 'IT---Software'; collapse it to a single '-'
    if (length(grep('---', securl, fixed = TRUE)) == 1)
    {
      securl <- gsub('---', '-', securl, ignore.case = FALSE, fixed = TRUE)
    }
    if (length(grep('&', securl, fixed = TRUE)) == 1)
    {
      securl <- gsub('&', 'and', securl, ignore.case = FALSE, fixed = TRUE)
    }
    if (length(grep('/', securl, fixed = TRUE)) == 1)
    {
      securl <- gsub('/', '', securl, ignore.case = FALSE, fixed = TRUE)
    }
    if (length(grep(',', securl, fixed = TRUE)) == 1)
    {
      securl <- gsub(',', '', securl, ignore.case = FALSE, fixed = TRUE)
    }
    securl <- paste(securl, '-Sector', sep = "")
  }
  fullurl <- paste(baseurl, securl, sep = "")
  print(fullurl)
  if (url.exists(fullurl))
  {
    petbls <- readHTMLTable(fullurl)
    # Exploring the tables we found out relevant information on table 2
    # Also the data is getting stored as factor , just doing an as.numeric will not suffice
    # we need to do an as.character and then an as.numeric
    pe <- c(pe, as.numeric(as.character(petbls[[2]]$PE)))
    cname <- c(cname, as.character(petbls[[2]]$Company))
    cnt <- cnt + 1
  }
}
The different functions that we have used are explained below.
readHTMLTable -> Given a URL, this function retrieves the contents of the <table> tags of an HTML page as a list of tables. We need to pick the appropriate table number; in this case we have used table no. 2.
grep, paste and gsub are normal string functions: grep finds occurrences of a string in another, paste concatenates, and gsub does the replacing.
as.numeric(as.character()) left a lasting impression on my mind, as an innocuous and intuitive as.numeric alone would have left me with only the internal integer codes of the factor levels (the ranks) instead of the actual values.
url.exists -> It is a good idea to check that the URL exists, given that we are forming the URLs dynamically.
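The factor pitfall mentioned above is easy to reproduce on a toy vector. The values below are made up; only the behaviour of as.numeric on a factor is the point.

```r
# Made-up P/E values stored as a factor, as happened with the scraped table column.
pe_factor <- factor(c("20.5", "587.5", "3.1", "20.5"))

codes  <- as.numeric(pe_factor)                # integer codes of the sorted levels
values <- as.numeric(as.character(pe_factor))  # the numbers we actually want

print(levels(pe_factor))  # levels are sorted as text: "20.5" "3.1" "587.5"
print(codes)              # positions within those levels, not P/E values
print(values)
```

The codes look like perfectly valid numbers, which is what makes this mistake so easy to miss.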
Now playing with summary statistics:
We use the describe function from psych package
   n   mean     sd  median  trimmed    mad  min    max  range  skew  kurtosis    se
1797  59.71  76.92   20.09    46.64  29.79    0  587.5  587.5  2.15      7.25  1.81
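To see what most of these columns mean, here is a rough base-R equivalent on a made-up sample. It assumes the usual defaults of a 10% trim for the trimmed mean and the 1.4826 scaling for mad; skew and kurtosis are left out because their exact definitions vary between packages.

```r
# Made-up, right-skewed sample standing in for the scraped P/E values.
x <- c(0, 5.2, 8.9, 12.4, 15, 18.3, 20.1, 25.7, 40.2, 95.6, 327.9)

stats <- c(
  n       = length(x),
  mean    = mean(x),
  sd      = sd(x),
  median  = median(x),
  trimmed = mean(x, trim = 0.1),   # mean after dropping 10% from each tail
  mad     = mad(x),                # median absolute deviation, scaled by 1.4826
  min     = min(x),
  max     = max(x),
  range   = max(x) - min(x),
  se      = sd(x) / sqrt(length(x))
)
print(round(stats, 2))

# The standard error reported in the table above follows from its own sd and n.
se_post <- 76.92 / sqrt(1797)
print(round(se_post, 2))
```

As in the P/E data, the mean of a right-skewed sample sits well above its median, which is an early hint that a normal distribution will not fit.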

hist(pe,col='blue',main='P/E Distribution')

We get the below histogram for the P/E ratio, which shows it is nowhere near a normal distribution, with its peakedness and skew, as the summary statistics confirm as well.
We will nevertheless do a normality test
shapiro.test(pe)
 
        Shapiro-Wilk normality test
 
data:  pe 
W = 0.7496, p-value < 2.2e-16
 
Basically, the null hypothesis is that the values come from a normal distribution. We see that the p-value is extremely small (highly significant), and hence we can easily reject the null.
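To get a feel for how the test behaves, it can be run on simulated data where the answer is known. The samples below are synthetic and are not the P/E data; the seed and parameters are arbitrary.

```r
set.seed(42)                              # reproducible draws
skewed <- rexp(500, rate = 1 / 60)        # made-up right-skewed data, mean near 60
normal <- rnorm(500, mean = 60, sd = 10)  # made-up normal data for comparison

res_skewed <- shapiro.test(skewed)
res_normal <- shapiro.test(normal)

print(res_skewed$statistic)  # W well below 1 for the skewed sample
print(res_skewed$p.value)    # compare with the p-value of the normal sample below
print(res_normal$p.value)
```

Note that shapiro.test accepts between 3 and 5000 values, so the 1797 P/E ratios in this post are within its limits.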
Drawing a box plot on the P/E ratios
boxplot(pe,col='blue')

Finding the outliers
boxplot.stats(pe)$out
 
 
484.33 327.91 587.50
 
cname[which(pe %in% boxplot.stats(pe)$out)]

[1] "Bajaj Electrical" "BF Utilities"     "Ruchi Infrastr." 
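How boxplot.stats decides what counts as an outlier is easier to see on a tiny example. The values and names below are made up; by default a point is flagged when it lies more than 1.5 times the hinge spread beyond a hinge.

```r
# Made-up values; the last one is clearly out of line with the rest.
pe_demo    <- c(10, 11, 11, 12, 12, 13, 100)
cname_demo <- c("A", "B", "C", "D", "E", "F", "G")   # hypothetical company names

bs <- boxplot.stats(pe_demo)   # default coef = 1.5
print(bs$stats)                # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)                  # points beyond the whiskers

# Map the outlying values back to their names, as done above for the stocks.
outlier_names <- cname_demo[pe_demo %in% bs$out]
print(outlier_names)
```

Raising coef in boxplot.stats makes the rule stricter, which is worth trying when a heavily skewed variable flags too many points.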

Of course, there is no prize for guessing that we should stay away from these stocks.
To summarize, this was a kind of exploratory data analysis on the P/E ratios of Indian stocks:

·     We saw that we can get content out of URLs and HTML tables
·     We collected the P/E values and company names in vectors
·     We looked at summary statistics and a histogram, and did a normality test
·     We plotted a box plot and found the outliers

Thursday, 19 December 2013

happiness equation !!!!




Who knew that happiness could have an equation? I was glancing through one of the lectures by Swami Sarvapriyananda, where he introduced something called the happiness equation, proposed by Martin Seligman, the father of positive psychology. The version he shared, which is in perfect agreement with the Purusarthas of Vedanta, is something like this

H = P + E + M

  • P is for pleasure: good food, movies. These are short-lived.
  • E comes from engagement: our profession, creativity, research.
  • M is for meaning, for doing things for others: first beyond yourself and then beyond your family.

Happiness from E and M should make up most of it. Say you dine at a good restaurant and you spend an hour teaching a poor student. After some months or years, the latter activity will give you much more happiness.

Very profound!!!


More can be watched at goo.gl/n5TlWx

Monday, 11 November 2013

Statistica, very few points

I don’t think I am qualified, but I got a chance to interact with a professional from Statistica and thought of sharing a few things about the tool

·         It has amazing integration with the MS suite; maybe something is on the cards
·         Import is easy and wizard-driven, supports a multitude of data file types, and allows connecting to a file as well as to a database
·         It also allows working on data from multiple sources
·         It has good statistical capabilities
·         I did not get a chance to look at all the functionality
·         Correlation and regression looked good, with linear, multiple and factor regression
·         The help files are good, and it comes with quite a few sample datasets
·         To me, a winner is its VBA coding interface, which will make life simpler
·         You do not necessarily need to bring all the data locally to process it; there is a technique available for that as well


I know this is just the tip of the iceberg, but I thought of sharing whatever I gathered. Will keep you posted.

Wednesday, 9 October 2013

Classification using a neural net in R

This is mostly for my students and myself for future reference.

Classification is a supervised task: we need pre-classified data to learn from, and then we can predict on new data.
Generally we hold out a percentage of the available data for testing; we call the two parts training and testing data respectively. So it's like this: only if we already know which emails are spam can we use classification to predict whether new emails are spam.

I used the dataset http://archive.ics.uci.edu/ml/datasets/seeds# . The data set has 7 real-valued attributes and 1 class attribute to predict. http://www.jeffheaton.com/2013/06/basic-classification-in-r-neural-networks-and-support-vector-machines/ has influenced much of this write-up; I am probably just making it more obvious.

The package to be used is nnet, loaded with library(nnet). Below is the list of commands for your reference



1.       Read from dataset

seeds<-read.csv('seeds.csv',header=T)

2.       Setting the training set index; 210 is the dataset size and 147 is 70% of that

   seedstrain<- sample(1:210,147)

3.       Setting test set index

   seedstest <- setdiff(1:210,seedstrain)
 
4.       Encode the value to be predicted as a class indicator (one-hot) matrix with class.ind; use the attribute of the dataset that you want to predict

   ideal <- class.ind(seeds$Class)

5.       Train the model; -8 because you want to leave out the class attribute (the dataset has a total of 8 attributes, with the last one being the one to predict)

   seedsANN = nnet(seeds[seedstrain,-8], ideal[seedstrain,], size=10, softmax=TRUE)

6.       Predict on training set

   predict(seedsANN, seeds[seedstrain,-8], type="class")

7.       Build the confusion table on the test set, from which the classification accuracy can be read off


   table(predict(seedsANN, seeds[seedstest,-8], type="class"),seeds[seedstest,]$Class)
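The table from the last step can be turned into a single accuracy figure. A minimal sketch, using made-up predicted and actual labels instead of the real model output:

```r
# Hypothetical predicted and actual classes for nine held-out seeds.
actual    <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
predicted <- c(1, 1, 2, 2, 2, 2, 3, 1, 3)

# Fix the levels so the table is square even if a class is never predicted.
classes   <- sort(unique(actual))
confusion <- table(factor(predicted, levels = classes),
                   factor(actual, levels = classes),
                   dnn = c("predicted", "actual"))
print(confusion)

# Correct predictions sit on the diagonal of the confusion table.
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

Because the training indices come from sample(), calling set.seed() with any fixed number before the split makes the accuracy reproducible from run to run.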

Happy Coding !

Wednesday, 4 September 2013

DBMS : few questions for freshers

Hello! Once you are ready with the 'HR' type of questions (you can take a look at some of the questions and answers at HR questions), it is obviously important to get ready for the technical subjects, and DBMS is one of the leading ones for both parties. So I pen down a few DBMS questions. These are all indicative, just to give you an idea of the depth and breadth. Normalization, Transactions, SQL and Indexing I have seen to be all-time favorites.

Feel free to give your comments, post answers, more questions, or any other feedback.

Imagining you guys are reading this because of your impending campus recruitment: read smartly, balance it finely with enjoying life, work to a plan, and you will be partying in time :)

DBMS :
General


  1.    What are OLAP and OLTP?
  2.    What are the advantages of a DBMS over a file system? (Do not forget key ones like transactions, normalization, indexing, locking, logging, etc.)
  3.    Why is an RDBMS called an RDBMS? Stress the relational part.
  4.    Draw an ER diagram for the project you did.
  5.    What is the degree of a relationship? Does a relationship always need a table?
  6.    How do you achieve generalization and specialization in an ER diagram?

Normalization

  1.      Why do we normalize?
  2.      What are the different anomalies?
  3.      Why do we denormalize?
  4.      Give an example of a table that is not in 3NF, explain the different anomalies in that context, and tell how you will normalize it



Transaction

  1. What is a transaction?
  2. What are the ACID properties? Explain each one with an example.
  3. How is the durability property implemented?
  4. How do you implement a transaction from a programming language like C# or Java?
  5. Can a transaction be partially committed?
  6. What is locking? What is two-phase locking?
  7. What is serializability?

Integrity 


  1. What are the different types of constraints?
  2. What is a primary key, a foreign key, a candidate key?
  3. What are unique and check constraints?
  4. What is the difference between a primary key and a unique constraint?
  5. Is it possible to create a table without a primary key?

Indexing:

  1. Why do we use an index?
  2. What are the overheads of an index?
  3. What is the difference between a clustered index and a non-clustered index?
  4. What is the difference between a B Tree and a B+ Tree?
  5. Another way of asking the same question: what are the different data structures that are used for indexing?

SQL:
  1. What is the difference between a function and a stored procedure?
  2. What is a cursor? What are the different types of cursor in Oracle?
  3. What are the different indices that Oracle supports? (Special focus on bitmap indices)
  4. What is the difference between the WHERE and HAVING clauses?
  5. Questions on NULL
  6. What is the difference between the char data type and the varchar data type?
  7. What is a view? What is a materialised view? Why are they used? What are the different types of joins, and what is the difference between a Cartesian product and a full outer join? (Practise this with a few examples)
  8. What is the function of UNION? Is it different from UNION ALL?
  9. What are DDL and DML? How is TRUNCATE different from DELETE?
  10. Where will you use a trigger? What are the different types of trigger?
  11. Why do we create packages?
  12. How do we handle exceptions?
  13. What is the role of dual?
  14. If two tables have a PK-FK relationship and you want the FK entries to be deleted when the PK row is deleted, how do you do that?
  15. Why and how do you create sequences?
  16. How can you delete duplicate data from a table?


