The Price Earnings ratio (P/E) is one of the most popular ratios reported for all stocks. Very simply, it is Current Market Price / Earnings per Share. An operational definition of Earnings per Share (EPS) would be total profit divided by the number of shares. I will redirect interested readers elsewhere for further reading.
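To make the definition concrete, here is a quick back-of-the-envelope computation in R (all the numbers are made up purely for illustration):

price  <- 250      # current market price per share (illustrative)
profit <- 50e7     # total profit (illustrative)
shares <- 10e7     # number of shares outstanding (illustrative)
eps <- profit / shares   # Earnings per Share = total profit / # of shares
pe  <- price / eps       # P/E = current market price / EPS
pe                       # 50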
In this post, I would just like to show how we can grab P/E data from the web and create some visualizations on it. My focus right now is Indian stocks, and I intend to use www.indiainfoline.com (the base URL appears in the code below).
So my first step is gearing up for the data extraction, and essentially that is the most non-trivial task. As shown in the figure below, there are separate pages for each sector, and we need to click on the individual links to go to each page and get the P/E ratios.
Here is something I did outside R: creating a CSV file with the sector names, using delimiters while importing the text and paste-special with transpose. Here is how my CSV file would look. I would never discourage using multiple tools, as this is often required to solve real-world problems.
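For reference, a minimal sketch of what such a CSV could look like, one sector name per row (these are just the sector names discussed in this post; the actual file would carry the full list from the site):

sectorsv1.csv
-------------
Banks
FMCG
Tyres
Healthcare
Tobacco Products
IT-Software
Stock/ Commodity Brokers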
So now I can import this into a dataset, read one row at a time, and go to the necessary URLs. But God had different plans; it's not that straightforward.
Case 1: Single-word sector names
We have the sector 'Banks', and the sector link is as below. Again, it is a no-brainer: we can pick up the base URL, append the sector name after a forward slash, and then append the string '-Sector'. This is true for most single-word sector names like 'FMCG', 'Tyres', 'Healthcare', etc.
Case 2: Multiple words without '-', '&' and '/'
We have the sector 'Tobacco Products', and the sector link is as below. This is also not that difficult: apart from adding '-Sector', we need to replace the spaces with '-'.
Case 3: Multiple words with a '-'
We have the sector name 'IT-Software', where we also have to remove any surrounding spaces if they exist. There can be several other cases, but for discussion's sake I will limit myself here.
Case 4: Multiple words with a '/'
We have the sector name 'Stock/ Commodity Brokers', so the '/' needs to be removed.
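Putting the four cases together, here is a small sketch of the transformation (the helper name build_sector_url is mine, and the resulting slugs follow the rules above rather than being verified against the site):

baseurl <- 'http://www.indiainfoline.com/MarketStatistics/PE-Ratios/'

build_sector_url <- function(sector)
{
    slug <- gsub(' ', '-', sector, fixed = TRUE)  # Case 2: spaces become '-'
    slug <- gsub('---', '-', slug, fixed = TRUE)  # Case 3: collapse runs around an existing '-'
    slug <- gsub('/', '', slug, fixed = TRUE)     # Case 4: drop '/'
    paste(baseurl, slug, '-Sector', sep = "")     # Case 1: append '-Sector'
}

build_sector_url('Banks')                    # .../Banks-Sector
build_sector_url('Tobacco Products')         # .../Tobacco-Products-Sector
build_sector_url('Stock/ Commodity Brokers') # .../Stock-Commodity-Brokers-Sector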
# Reading in the dataset
sectorsv1 <- read.csv("C:/Users/user/Desktop/Datasets/sectorsv1.csv")
# Converting to a matrix; this is a practice I generally follow
sectorvm <- as.matrix(sectorsv1)
We can access individual sectors by sectorvm[rowno, colno].
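For instance, assuming 'Banks' happens to sit in the first row (purely for illustration):

sectorvm[1, 1]   # "Banks"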
library(XML)    # provides readHTMLTable
library(RCurl)  # provides url.exists

pe <- c()
cname <- c()
cnt <- 0
baseurl <- 'http://www.indiainfoline.com/MarketStatistics/PE-Ratios/'
for (i in 1:nrow(sectorvm))
{
    securl <- sectorvm[i, 1]
    # fixed = TRUE indicates the string is matched as-is, not as a regular expression.
    # For the substitutions of the different cases explained above, note the use of
    # gsub instead of sub, else only the first instance would be replaced.
    if (length(grep(' ', securl, fixed = TRUE)) != 1)
    {
        # Case 1: single-word sector name, just append '-Sector'
        securl <- paste(securl, '-Sector', sep = "")
    }
    else
    {
        # Case 2: replace the spaces with '-'
        securl <- gsub(' ', '-', securl, ignore.case = FALSE, fixed = TRUE)
        # Case 3: collapse the '---' runs produced when the name already had a '-'
        if (length(grep('---', securl, fixed = TRUE)) == 1)
        {
            securl <- gsub('---', '-', securl, ignore.case = FALSE, fixed = TRUE)
        }
        if (length(grep('&', securl, fixed = TRUE)) == 1)
        {
            securl <- gsub('&', 'and', securl, ignore.case = FALSE, fixed = TRUE)
        }
        # Case 4: remove the '/'
        if (length(grep('/', securl, fixed = TRUE)) == 1)
        {
            securl <- gsub('/', '', securl, ignore.case = FALSE, fixed = TRUE)
        }
        if (length(grep(',', securl, fixed = TRUE)) == 1)
        {
            securl <- gsub(',', '', securl, ignore.case = FALSE, fixed = TRUE)
        }
        securl <- paste(securl, '-Sector', sep = "")
    }
    fullurl <- paste(baseurl, securl, sep = "")
    print(fullurl)
    # Check the URL exists, since we are forming it dynamically
    if (url.exists(fullurl))
    {
        petbls <- readHTMLTable(fullurl)
        # Exploring the tables, the relevant information is in table 2.
        # Also, the data is stored as a factor, so as.numeric alone will not
        # suffice; we need an as.character first and then an as.numeric.
        pe <- c(pe, as.numeric(as.character(petbls[[2]]$PE)))
        cname <- c(cname, as.character(petbls[[2]]$Company))
        cnt <- cnt + 1
    }
}
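With the scraping done, the two vectors can be combined into a single data frame (the name pedf is mine; the rest of the post works with the pe and cname vectors directly):

pedf <- data.frame(Company = cname, PE = pe)
head(pedf)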
The different functions we have used are explained below.
readHTMLTable: given a URL, this function retrieves the contents of the <table> tags from the HTML page, returned as a list of tables. We need to pick the appropriate table number; in this case we have used table no. 2.
grep, paste and gsub are normal string functions: grep finds the occurrence of one string in another, paste concatenates, and gsub does the replacing.
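A few one-liners to illustrate them on the sector names from this post:

grep('-', 'IT-Software', fixed = TRUE)   # 1 : the pattern was found
paste('Banks', '-Sector', sep = "")      # "Banks-Sector"
gsub(' ', '-', 'Tobacco Products')       # "Tobacco-Products"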
as.numeric(as.character()) left a lasting impression on my mind, as the innocuous and intuitive as.numeric alone would have left me only with the factor level codes (essentially ranks), not the actual values.
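A quick demonstration of the trap:

f <- factor(c("23.5", "7.2", "23.5"))
as.numeric(f)                 # 1 2 1 : the internal level codes, not the values
as.numeric(as.character(f))   # 23.5 7.2 23.5 : the actual numbers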
url.exists: it is a good idea to check the existence of the URL, given that we are forming the URLs dynamically.
Now, playing with summary statistics: we use the describe function from the psych package.
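Loading the package and running it on our P/E vector produces the row shown below:

library(psych)
describe(pe)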
n    | mean  | sd    | median | trimmed | mad   | min | max   | range | skew | kurtosis | se
-----|-------|-------|--------|---------|-------|-----|-------|-------|------|----------|-----
1797 | 59.71 | 76.92 | 20.09  | 46.64   | 29.79 | 0   | 587.5 | 587.5 | 2.15 | 7.25     | 1.81
hist(pe,col='blue',main='P/E Distribution')
We get the below histogram for the P/E ratio, which shows it is nowhere near a normal distribution, with its peakedness and skew confirmed by the summary statistics as well. We will nevertheless do a normality test:
shapiro.test(pe)
Shapiro-Wilk normality test
data: pe
W = 0.7496, p-value < 2.2e-16
Basically, the null hypothesis is that the values come from a normal distribution. The p-value is vanishingly small, far below any usual significance level, and hence we can easily reject the null.
Drawing a box plot of the P/E ratios:
boxplot(pe,col='blue')
Finding the outliers
boxplot.stats(pe)$out
[1] 484.33 327.91 587.50
cname[which(pe %in% boxplot.stats(pe)$out)]
[1] "Bajaj Electrical" "BF Utilities" "Ruchi Infrastr."
Of course, no prizes for guessing: we should stay away from these stocks.
So if we summarize, this was a kind of exploratory data analysis on the P/E ratios of Indian stocks:
· We saw we can get content out of URLs and HTML tables
· We added the results to a data frame
· We looked at summary statistics and a histogram, and did a normality test
· We plotted a box plot and found the outliers