
Monday, 31 March 2014

Exploratory data analysis on P/E ratio of Indian Stocks

The Price/Earnings ratio (P/E) is one of the most popular ratios reported for all stocks. Very simply, it is Current Market Price / Earnings per Share, where an operational definition of Earnings per Share is total profit divided by the number of shares. I will leave further reading to the interested reader.
In this post, I would just like to show how we can grab P/E data from the web and create some visualizations on it. My focus right now is Indian stocks, and I intend to use the website below.
So my first step is gearing up for the data extraction, and that is essentially the most non-trivial task. As shown in the figure below, there is a separate page for each sector, and we need to click on the individual links to go to each page and get the P/E ratios.
Here is something I did outside R: creating a CSV file with the sector names, using delimiters while importing text and Paste Special as Transpose. Here is how my CSV file looks. I would never discourage using multiple tools, as that is often required to solve real-world issues.


So now I can import this into a dataset, read one row at a time and go to the necessary URLs. But God had different plans :) , it's not that straightforward.
Case 1: Single-word sector names
We have the sector 'Banks' and the sector link is as below.
Again it is a no-brainer: we can pick up the base URL, append the sector name after a forward slash and then append the string '-Sector'. This is true for most single-word sector names like 'FMCG', 'Tyres', 'Healthcare' etc.
Case 2: Multiple words without '-', '&' and '/'
We have the sector 'Tobacco Products' and the sector link is as below.
This is also not difficult: apart from adding '-Sector', we need to replace the spaces with '-'.
Case 3: Multiple words with a '-'
We have the sector name 'IT-Software', where we have to remove other spaces if existing. There can be several other cases, but for discussion's sake I will limit myself here.
Case 4: Multiple words with a '/'
We have the sector name 'Stock/ Commodity Brokers', so the '/' needs to be removed.
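The four cases above can be collected into one small helper. This is only a sketch of the rules as I have described them; the site's exact conventions are an assumption based on the examples:

```r
# A sketch of the slug-building rules from cases 1-4 above.
# The site's exact conventions are an assumption based on the examples given.
make_sector_slug <- function(sector) {
  slug <- gsub('/', '', sector, fixed = TRUE)   # case 4: drop '/'
  slug <- gsub(',', '', slug, fixed = TRUE)     # commas are dropped too
  slug <- gsub('&', 'and', slug, fixed = TRUE)  # '&' becomes 'and'
  slug <- gsub(' ', '-', slug, fixed = TRUE)    # case 2: spaces become '-'
  slug <- gsub('---', '-', slug, fixed = TRUE)  # case 3: collapse runs around an existing '-'
  paste(slug, '-Sector', sep = '')              # case 1: append the suffix
}

make_sector_slug('Banks')             # "Banks-Sector"
make_sector_slug('Tobacco Products')  # "Tobacco-Products-Sector"
```

The same transformations appear inline in the loop below; a helper like this just keeps them in one place.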
# Reading in the dataset
sectorsv1 <- read.csv("C:/Users/user/Desktop/Datasets/sectorsv1.csv")
# Converting to a matrix, a practice I generally follow
sectorvm<-as.matrix(sectorsv1)
We can access individual sectors by sectorvm[rowno, colno]
library(RCurl)  # provides url.exists
library(XML)    # provides readHTMLTable
pe<-c()
cname<-c()
cnt<-0
baseurl<-'http://www.indiainfoline.com/MarketStatistics/PE-Ratios/'
for(i in 1:nrow(sectorvm))
{
securl<-sectorvm[i,1]
# fixed=TRUE indicates the string is to be matched as-is, not as a regular expression
# Substitution for the different cases we explained; note we use gsub instead of sub,
# else only the first instance would be replaced
if(length(grep(' ',securl,fixed=TRUE))!=1)
{
securl<-paste(securl,'-Sector', sep="")
}
else
{
securl<-gsub(' ', '-', securl, ignore.case =FALSE, fixed=TRUE)
if(length(grep('---',securl,fixed=TRUE))==1)
{
securl<-gsub('---', '-', securl, ignore.case =FALSE, fixed=TRUE)
 }
if(length(grep('&',securl,fixed=TRUE))==1)
{
                securl<-gsub('&', 'and', securl, ignore.case =FALSE, fixed=TRUE)
}
if(length(grep('/',securl,fixed=TRUE))==1)
{
                securl<-gsub('/', '', securl, ignore.case =FALSE, fixed=TRUE)
}
if(length(grep(',',securl,fixed=TRUE))==1)
{
                securl<-gsub(',', '', securl, ignore.case =FALSE, fixed=TRUE)
}
securl<-paste(securl,'-Sector', sep="")
}
fullurl<-paste(baseurl,securl, sep="")
print(fullurl)
if (url.exists(fullurl))
{
petbls<-readHTMLTable(fullurl)
# Exploring the tables, we found the relevant information in table 2
# Also, the data gets stored as a factor, so just doing as.numeric will not suffice;
# we need as.character first and then as.numeric
pe<-c(pe,as.numeric(as.character(petbls[[2]]$PE)))
cname<-c(cname, as.character (petbls[[2]]$Company))
cnt = cnt + 1
}
}
The different functions we have used are explained below.
readHTMLTable: given a URL, this function retrieves the contents of the <table> tags from the HTML page. We need to pick the appropriate table number; in this case we used table no. 2.
grep, paste and gsub are normal string functions: grep finds occurrences of one string in another, paste concatenates, and gsub does the replacing.
as.numeric(as.character()) left a lasting impression on my mind, as an innocuous and intuitive as.numeric would have left me only with the factor level codes.
url.exists: it is a good idea to check the existence of the URL, given we are forming the URLs dynamically.
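To see why the double conversion is needed, here is a tiny illustration (synthetic values, not the scraped data):

```r
# as.numeric() on a factor returns the internal level codes,
# not the numbers the factor prints.
f <- factor(c('20', '5', '100'))
as.numeric(f)                # 2 3 1 -- level codes in sorted-string order
as.numeric(as.character(f))  # 20 5 100 -- the actual values
```

The levels sort as strings ('100' < '20' < '5'), so the codes bear no relation to the numeric values.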
Now playing with summary statistics:
We use the describe function from the psych package:
   n  mean    sd median trimmed   mad min   max range skew kurtosis   se
1797 59.71 76.92  20.09   46.64 29.79   0 587.5 587.5 2.15     7.25 1.81
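For readers without the psych package, most of these columns can be reproduced with base R alone; a quick sketch on synthetic data:

```r
# Base R equivalents of a few describe() columns (synthetic data, not pe).
x <- c(2, 4, 4, 4, 5, 5, 7, 9)
c(n = length(x), mean = mean(x), sd = sd(x), median = median(x),
  mad = mad(x), min = min(x), max = max(x), range = diff(range(x)))
# se is sd(x) / sqrt(length(x))
```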

hist(pe,col='blue',main='P/E Distribution')

We get the histogram below for the P/E ratio, which shows it is nowhere near a normal distribution, with its peakedness and skew, as confirmed by the summary statistics as well.
We will nevertheless do a normality test.
shapiro.test(pe)
 
        Shapiro-Wilk normality test
 
data:  pe 
W = 0.7496, p-value < 2.2e-16
 
Basically the null hypothesis is that the values come from a normal distribution; the p-value is vanishingly small, hence we can easily reject the null.
Drawing a box plot on the P/E ratios
boxplot(pe,col='blue')

Finding the outliers
boxplot.stats(pe)$out
 
 
484.33 327.91 587.50
 
cname[which(pe %in% boxplot.stats(pe)$out)]

[1] "Bajaj Electrical" "BF Utilities"     "Ruchi Infrastr." 
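As an aside, boxplot.stats flags points lying beyond roughly 1.5 times the inter-quartile spread from the hinges; a tiny synthetic illustration:

```r
# boxplot.stats()$out returns the points outside the whiskers
# (beyond 1.5 * IQR from the hinges). Synthetic data, not the P/E series.
x <- c(1:10, 100)
boxplot.stats(x)$out   # 100
```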

Of course, no prizes for guessing: we should stay out of these stocks.
So, to summarize this exploratory data analysis on the P/E ratio of Indian stocks:

·       We saw we can get content out of URLs and HTML tables
·       We added them to a data frame
·       Looked at summary statistics and a histogram, and did a normality test
·       Plotted a box plot and found the outliers

Thursday, 19 December 2013

happiness equation !!!!




Who knew that happiness can have an equation? I was glancing through one of the lectures by Swami Sarvapriyananda, where he introduced something called the happiness equation, proposed by Martin Seligman, the father of positive psychology. The version he shared, which is in perfect agreement with the Purusarthas of Vedanta, is something like this:

H = P + E + M

  • P is for pleasure: good food, movies. These are short-lived.
  • E comes from engagement: our profession, creativity, research.
  • M is for meaningful: for others. First beyond you, and then beyond your family.

Happiness from E and M should make up the most. You dine at a good restaurant; you spend an hour teaching a poor student. After some months or years, the latter activity will give you much more happiness.

Very profound!!!


More can be watched  @ goo.gl/n5TlWx

Monday, 11 November 2013

Statistica, a few points

I don't think I am qualified, but I got a chance to interact with a professional from Statistica and thought of sharing a few things about the tool.

·         It has amazing integration with the MS Office suite; maybe something is in the cards
·         Import is easy and wizard-driven; it supports a multitude of data files and allows connections to files as well as databases
·         It also allows working on data from multiple sources
·         It has good statistical capabilities
·         I did not get a chance to look at all the functionality
·         Correlation and regression looked good, with linear, multiple and factor regression
·         Help files are good and come with quite a few sample datasets
·         To me the winner is its VBA coding interface, which will make life simpler
·         You do not necessarily need to bring all the data locally to process it; there is a technique available for that as well


I know this is the tip of the iceberg, but I just thought of sharing whatever I gathered. Will keep you posted.

Wednesday, 9 October 2013

Classification using neural net in r

This is mostly for my students and myself for future reference.

Classification is a supervised task: we need pre-classified data, and then we can predict on new data.
Generally we hold out a percentage of the available data for testing, and we call the two parts training and testing data respectively. So it's like this: only if we know which emails are spam can we use classification to predict new emails as spam.

I used the dataset http://archive.ics.uci.edu/ml/datasets/seeds# . The dataset has 7 real-valued attributes and 1 to predict. http://www.jeffheaton.com/2013/06/basic-classification-in-r-neural-networks-and-support-vector-machines/ has influenced much of this writing; I am probably just making it more obvious.

The library to be used is nnet, loaded with library(nnet); below is the list of commands for your reference.



1.       Read from dataset

seeds<-read.csv('seeds.csv',header=T)

2.       Setting training set index ,  210 is the dataset size, 147 is 70 % of that

   seedstrain<- sample(1:210,147)

3.       Setting test set index

   seedstest <- setdiff(1:210,seedstrain)
 
4.       Encode the value to be predicted as indicator (one-hot) columns; use the attribute of the dataset that you want to predict

   ideal <- class.ind(seeds$Class)

5.       Train the model; -8 leaves out the class attribute (the dataset has 8 attributes in total, with the last one being the one predicted)

   seedsANN = nnet(seeds[seedstrain,-8], ideal[seedstrain,], size=10, softmax=TRUE)

6.       Predict on training set

   predict(seedsANN, seeds[seedstrain,-8], type="class")

7.       Calculate Classification accuracy


   table(predict(seedsANN, seeds[seedstest,-8], type="class"),seeds[seedstest,]$Class)
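The table in step 7 is a confusion matrix; the classification accuracy is the share of counts on its diagonal. A minimal sketch with made-up labels (not the seeds predictions):

```r
# Accuracy from a confusion table: correct predictions sit on the diagonal.
pred   <- c('A', 'A', 'B', 'B', 'B', 'C')
actual <- c('A', 'B', 'B', 'B', 'C', 'C')
tab <- table(pred, actual)
accuracy <- sum(diag(tab)) / sum(tab)
accuracy   # 4 of 6 correct
```

The same sum(diag(tab)) / sum(tab) expression applied to the table from step 7 gives the test-set accuracy.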

Happy Coding !

Wednesday, 4 September 2013

DBMS : few questions for freshers

Hello! Once you are ready with the 'HR' type of questions (you can take a look at some questions and answers at HR questions), it is obviously important to get ready for the technical round, and DBMS is one of the leading subjects for both parties. So I pen down a few DBMS questions; these are all indicative ones, just to give you an idea of the depth and breadth. Normalization, transactions, SQL and indexing I have seen to be all-time favourites.

Feel free to give your comments , post answers , more questions , any other feedback.

Imagining you guys are reading this because of your impending campus interviews: read smartly, finely balancing it with enjoying life, work on a plan and you will be partying in time :)

DBMS :
General

1.       What is OLAP and OLTP?
2.       What are the advantages of a DBMS over a file system?
3.       Why is an RDBMS called an RDBMS? Stress on the relational part.
4.       Draw an ER diagram for a project you did.

Normalization
5.       Why do we normalize?
6.       What are the different anomalies?
7.       Why do we denormalize?

Transaction
8.       What is a transaction?
9.       What is the ACID property?
10.   How do you implement a transaction from a programming language like C# or Java?
11.   Can a transaction be partially committed?
12.   What is locking? What is two-phase locking?
13.   What is serializability?

Integrity :
14.   What are the different types of constraints?
15.   What are primary keys, foreign keys and candidate keys?
16.   What are unique and check constraints?
17.   What is the difference between a primary key and a unique constraint?

Indexing:
18.   Why do we use indexes?
19.   What are the overheads of an index?
20.   What is the difference between a clustered index and a non-clustered index?
21.   What is the degree of a relationship? Do relationships always need a table?

SQL:
22.   What is the difference between a function and a stored procedure?
23.   What is a cursor? What are the different types of cursors in Oracle?
24.   What are the different indices that Oracle supports?
25.   What is the difference between the where and having clauses?
26.   Questions on NULL
27.   What is the difference between the char data type and the varchar data type?
28.   What is a view? What is a materialized view? Why are they used?
29.   What are the different types of joins? What is the difference between a Cartesian product and a full outer join? (Practice this with a few examples)
30.   What is the function of UNION? Is it different from UNION ALL?
31.   What are DDL and DML? How is truncate different from Delete?

Sunday, 18 August 2013

Preparing for Campus ...

The most important thing in a campus interview is to be yourself. That is easier said than done: in all probability it is your first job interview, you are in clothes you generally do not wear, and there is a tie hanging awkwardly. To top it off, the procedure has been on since morning, so it is quite exhausting mentally and physically. So the best thing is to be prepared. I will give you simple dos and don'ts from the experience I have gathered during my years in Cognizant and various student interactions at Calcutta University and the Institute of Engineering and Management (the views are completely personal).

·         Read each and every word of your resume; don't keep anything you are not comfortable with.
·         'Introduce yourself' is a sure question; don't fumble on this. Set things up here: talk about your subject interests and hobbies, whether you are a student partner, a class representative, have some uncanny hobbies, or are part of your college computer society. If you have done something academically very good, don't forget to mention it.
·         Strengths and weaknesses are optional, and if you say something, please be prepared to back it up. So if you say you have an analytic mind or you are hard-working, you should be able to give a few examples.
·         You need not be too forthcoming on your weaknesses, and please do not mention things like 'I am too emotional' or 'I am short-tempered'.
·         Have a professional-looking email id, not something like itwasnoteasy, walkthetalk, or futufutejyosthna. Don't laugh, I have faced it.
·         I would say write it down and practice; record and listen to yourself if required.
·         Don't over-commit on your subjects: two theoretical subjects, one or two programming languages and one database should be sufficient.
·         Prepare well on your final year / internship project; the architecture and the business sense / unique proposition should come out very clearly.
·         People have told me their hobby is reading; when I asked what book they last read, they drew a blank.
·         Be prepared for questions like "Why do you want to join TCS?", "Why should we hire you?". I would be honest with these questions, something like "I have heard very positive things from my seniors, and the Tata name has huge brand value. The training programs are excellent ……..". Not "this is my dream company and I have always thought of joining this company" and all that.
·         If you have a year's lag or a considerable drop in percentage, please prepare for that question.
·         On the technical part, please understand that the people who are coming will have 10+ years of experience and mostly will not ask you definitions. I will ask you to focus from an interview perspective, not an examination one: know the examples, understand the concepts, discuss with friends to gain confidence.

·         Pseudo code will suffice in most cases, so do not spend a lot of time on syntax

For few technical questions on DBMS , you may visit DBMS Questions

Wish you All the very best. 

Friday, 21 December 2012

How ACIDic are Transactions really!

I got a little distracted while hearing about 'NoSQL databases'. One of their key characteristics is that they do not support full ACID transactions. This brings back some memories, not all of them pleasant: this is/was one of the sure-shot menus served by interviewers, be it job interviews or college vivas. Honestly, now I think the entire 'ACID' thing was a little overdone. I have no illusion that it is one of the most widely written-about topics (a Google search brings up 1.5 billion pages), but I thought of making a point, maybe at the cost of being repetitive.

I will talk very briefly, highlighting the key concepts.
ACID (Atomicity, Consistency, Isolation, Durability)
Atomicity: either all or none. Reverts to the earlier state in case of a failure (system, hardware, software). Handled by the recovery-management component.
Consistency: when the transaction is complete, the database is in a consistent state. Responsibility of the application programmer.
Durability: once a transaction is completed, the effects are permanent; no failure can change that. Also the responsibility of the recovery-management component.
Isolation: the concept of serializability; the responsibility of concurrency control.

Read it again carefully, please: isn't concurrency and recovery all that we are talking about? Keeping consistency is like having common sense, which is uncommon :)

I think we should talk about CR Properties rather than ACID.

Follow by Email