meet Saptarsi

Tuesday, 19 August 2014

Questions on Java

Friends, here is my first attempt. The SCJP book by Kathy Sierra has given me a few of these questions. I would like to enrich the list further, but it is time for a first cut.

OOPS, Java

· Is Java compiled or interpreted?

· How can we declare two public classes in a Java source file?

· Is there a restriction on naming the source file?

· Give an example of a code statement where do-while and while give different output.

· What is the default access if we do not specify any access modifier?

· How are method overloading and method overriding different?

· Give an example of a non-access modifier.

· Why should some classes or methods be marked as final?

· How is an abstract class different from an interface as far as abstract methods are concerned?

· How does garbage collection work in C# and Java?

· What are boxing and unboxing?

· How can we implement threads?

· What are the different states of a thread?

· What is a web service? How is it related to SOAP, UDDI and WSDL?

· What are the different ways to maintain state in a web application, and what are their relative advantages and disadvantages? (Hint: session variables, cookies, hidden fields, query strings.) For .NET, view state is important.

· What is MVC architecture?

· Explain 3-tier architecture.

· Explain service-oriented architecture.

· What is the difference between the default and package access modifiers?

· Let's say class A has a private method called fun(), and class B extends class A and defines fun(). What is this phenomenon called?

· What is the use of the final keyword? Can we declare a variable as final?

· Is it possible to define an abstract class with non-abstract members?

· When should we use an ArrayList in place of an array?

· What method can be used to traverse an ArrayList?

· What are checked and unchecked exceptions? Give an example of each type.

· How do you use your own exception? Say you are taking age as input and want to throw an exception when it is less than zero; how will you achieve it?

· Give some examples of statements that should be put in finally.

· Does Java allow multiple inheritance? What is the "Deadly Diamond of Death"?

· class A
{
    public void fun()
    {
    }
}

class B extends A
{
    protected void fun()
    {
    }
}

Is there a problem in the above code?

· Give an example of how polymorphism can be achieved using method overriding.

· public int foo()
{
    char c = 'c';
    return c;
}

Is this code legal?

· What is the default constructor supplied by the compiler? Is there any caution you need to follow when you create your own constructor? Will the default constructor be supplied in the following case?

class Horse
{
    void Horse()
    {
    }
}

· Which variables are stored on the stack, and which are stored on the heap?

· What do we mean by saying strings are immutable objects? What is the role of the "String Constant Pool"? Which classes do we use to overcome the limitations of String?

Monday, 28 July 2014

Beyond bars, pie charts and histograms …………

(Written for the Institute of Engineering and Management 2014 magazine; put online for better access, especially for my students.)

The popular saying "A picture is worth a thousand words" goes back to the early part of the 20th century. If we take graphs as a rough equivalent of pictures, then apart from quick interpretability, their other benefit is a summarized view of the data. As a case in point, let's look at the US crime data below [U.S. Department of Justice]. I am taking property crime subdivided into burglary, larceny-theft and motor vehicle theft (for those of you who, like me, have been wondering what larceny is, it's just a legal term for stealing). The numbers indicate rates per 100,000 population.

States	Burglary	Larceny-theft	Motor Vehicle Theft	Total
Alabama	953.8	2650	288.3	3892.1
Alaska	622.5	2599.1	391	3612.5
Arizona	948.4	2965.2	924.4	4838
Arkansas	1084.6	2711.2	262.1	4057.9
California	693.3	1916.5	712.8	3322.6
Colorado	744.8	2735.2	559.5	4039.5
Connecticut	437.1	1824.1	296.8	2558
Delaware	688.9	2144	278.5	3111.4
Florida	926.3	2658.3	423.3	4007.9
Georgia	931	2751.1	490.2	4172.3

Many of you would apply your CAT-cracking DI (data interpretation) skills and do some quick mental math far better than I can, but there is no denying that the chart below reveals much more to mere mortals. We are very quick to spot the two most and the two least crime-prone states.

Let's look at another view of the same data, this time a stacked bar chart.

A 100% stacked bar chart takes the total for each state as 100% and depicts the contribution of each component as a differently colored portion of the bar. It is difficult to miss that

Ø Alabama and Arkansas have a relatively low proportion of motor vehicle theft

Ø Alaska and Colorado have a relatively low proportion of burglary
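For readers who want to reproduce this view, here is a minimal sketch in base R, assuming the table above is typed in by hand. The names states, crime and shares are illustrative, and the chart styling is only one of many reasonable choices.

```r
# Property crime rates per 100,000 population, copied from the table above
states <- c("Alabama", "Alaska", "Arizona", "Arkansas", "California",
            "Colorado", "Connecticut", "Delaware", "Florida", "Georgia")
crime <- rbind(
  Burglary = c(953.8, 622.5, 948.4, 1084.6, 693.3, 744.8, 437.1, 688.9, 926.3, 931),
  `Larceny-theft` = c(2650, 2599.1, 2965.2, 2711.2, 1916.5, 2735.2, 1824.1, 2144, 2658.3, 2751.1),
  `Motor vehicle theft` = c(288.3, 391, 924.4, 262.1, 712.8, 559.5, 296.8, 278.5, 423.3, 490.2)
)
colnames(crime) <- states

# Convert each state's column to percentages of that state's total
shares <- prop.table(crime, margin = 2) * 100

# A matrix passed to barplot is stacked by default: one bar per state
barplot(shares, las = 2, ylab = "Share of property crime (%)",
        legend.text = TRUE, args.legend = list(x = "topright", cex = 0.7))
```

Printing shares["Motor vehicle theft", ] or shares["Burglary", ] gives the numbers behind the picture, which is handy when two bars look too close to call by eye.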

I hope that by now some of you are convinced of the usefulness of traditional charts. So this is an opportune time for a "kahani mein twist" (a twist in the tale). There is a trite piece of 'gyan' (wisdom) for negotiating prices with hawkers: do not start negotiating on the thing you like and want to purchase; start on something else nearby (for lack of an original example, let me settle for this: if you are trying to buy a রুমাল (handkerchief), ask the price of a বেড়াল (cat)), bargain hard, and then casually bring him to the piece you actually want to buy. If the hawker understands that you have really fallen for it, the chance of a bargain is bleak. With this entire prologue, let me just say that gone are the days when we could impress people with graphical reports built from bar charts, histograms, pie charts, line charts and the like. This is felt even more with the advent of "Big Data".

Commercially, "Big Data" is differentiated by three Vs: Volume, Variety and Velocity. Without going into too much detail, I guess it is sufficient to draw your attention to the number of tweets per day, which is 500 million! This gave birth to a new term, tweets per second (TPS). To understand variety we need not look beyond the photos, blogs, chats, emails, medical images, sensor logs and RFID logs generated every second. When we talk about velocity we mean not only the rate at which data is generated, but also the rate at which it becomes obsolete. One practical example: the Gurgaon police have nowadays added tweets to their dashboard. So if people are tweeting about a quarrel breaking out, or about bedlam and pandemonium for some other reason, the police can track the exact area in real time via the tweets and send the nearest patrol car within minutes. So, contrary to the "filmy" style of arriving well after the crime is done and the criminals have fled, the cops now arrive well in time.

A toolbox of visualization techniques has grown up around this new "avatar" of data, and a lot can be written about them. I will choose just one of them as a class representative (of the class of new visualization techniques).

So meet "Mr Word Cloud". You might not have known his formal name, but surely you have had a chance or deliberate tryst with him somewhere in your internet life. He is no toddler; he is at least in your age group (born in 1992). Here is a word cloud used by a leading data analyst in his blog to emphasize the importance of data. It is very handy for summarizing large bodies of text, blogs and tweets. Oh, by the way, I was forgetting: he does not mind being called a 'tag cloud' as well.

It's easy to note that the words have different font sizes, proportional to their frequency (a proxy for importance); the vertical or horizontal orientation does not mean much.

Roughly, there are two ways you can start using word clouds.

1. Use a site like http://www.wordle.net/ . Either copy and paste your text, or give the URL of a site which has an Atom or RSS feed (very crudely, a mechanism to exchange dynamically updated information).

2. Use a computing environment like R, one of the most used open-source environments for statistics and data analysis (http://cran.r-project.org/).

The one below is freshly baked for you, using no more than 10 lines of R code, from a few reviews (NDTV, Hindustan Times, TOI etc.) of a movie, and there is no prize for guessing the name of the movie. (I filtered to keep only words which appeared at least 3 times; some further work can be done in removing film-industry-specific 'stop words', basically words like 'movie' and 'film' which are likely to be present in any review irrespective of the movie.) Without watching the movie, I would probably not be wrong in concluding that if the movie is to be watched, it is for Aamir.
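The original script is not reproduced here, so what follows is only a rough sketch of the idea, assuming the CRAN package wordcloud is installed for the drawing step. The three review snippets are made-up placeholders, not the real reviews, and the stop-word list is a tiny illustrative one.

```r
# Placeholder text standing in for the real reviews (invented for illustration)
reviews <- c("aamir carries the film",
             "the film is long but aamir shines",
             "watch the movie for aamir")

# Split into lower-case words and drop empty tokens
words <- unlist(strsplit(tolower(reviews), "[^a-z]+"))
words <- words[nzchar(words)]

# Remove general and film-specific stop words (a hand-picked, incomplete list)
stop_words <- c("the", "is", "but", "for", "film", "movie")
words <- words[!words %in% stop_words]

# Count occurrences and keep only words seen at least 3 times
freq <- sort(table(words), decreasing = TRUE)
freq <- freq[freq >= 3]

# Draw the cloud only if the wordcloud package is available
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(names(freq), as.numeric(freq), min.freq = 3)
}
```

With real reviews, the stop-word list is where most of the tuning effort goes, since words common to every review would otherwise dominate the picture.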

I hope I have made a case for visualization techniques. And did you guys think they can help hide the author's excellent writing prowess, or lack of it? Well, none of your provocations will elicit an answer from me.

Monday, 31 March 2014

Exploratory data analysis on P/E ratio of Indian Stocks

The price-earnings ratio (P/E) is one of the most popular ratios reported for stocks. Very simply, it is defined as current market price / earnings per share. An operational definition of earnings per share is total profit divided by the number of shares. I will direct interested readers to the following for further reading:

www.investopedia.com/terms/p/price-earningsratio.asp
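As a quick illustration of the arithmetic, here is a small sketch in R. The function name pe_ratio and the numbers are hypothetical, chosen only to make the calculation easy to follow.

```r
# P/E = current market price / earnings per share (EPS),
# where EPS = total profit / number of shares
pe_ratio <- function(price, total_profit, n_shares) {
  eps <- total_profit / n_shares
  price / eps
}

# Hypothetical company: price 200, profit 5,000,000, 1,000,000 shares
# EPS = 5, so P/E = 200 / 5 = 40
pe_ratio(price = 200, total_profit = 5e6, n_shares = 1e6)
```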

In this post, I would just like to show how we can grab P/E data from the web and create some visualizations from it. My focus right now is Indian stocks, and I intend to use the website below.

http://www.indiainfoline.com/MarketStatistics/PE-Ratios/

So my first step is gearing up for the data extraction, and that is essentially the most non-trivial task. As shown in the figure below, there are separate pages for each sector, and we need to click on the individual links to go to each page and get the P/E ratios.

Here is something I did outside R: I created a CSV file with the sector names, using delimiters while importing the text and 'paste special' with transpose. Here is how my CSV file looks. I would never discourage using multiple tools, as this is often required to solve real-world issues.

So now I can import this into a dataset, read one row at a time and go to the necessary URLs. But God has different plans :) and it's not that straightforward.

Case 1: Single-word sector names

We have the sector 'Banks', and the sector link is as below

http://www.indiainfoline.com/MarketStatistics/PE-Ratios/Banks-Sector

Again it is a no-brainer: we can pick up the base URL, append the sector name after a forward slash and then append the string '-Sector'. This is true for most single-word sector names like 'FMCG', 'Tyres', 'Healthcare' etc.

Case 2: Multiple words without '-', '&' and '/'

We have the sector 'Tobacco Products', and the sector link is as below

http://www.indiainfoline.com/MarketStatistics/PE-Ratios/Tobacco-Products-Sector

This is also not that difficult: apart from adding '-Sector', we need to replace the spaces with a '-'.

Case 3: Multiple words with a '-'

We have the sector name 'IT-Software', where we have to remove any other spaces around the hyphen if they exist.

Case 4: Multiple words with a '/'

We have the sector name 'Stock/ Commodity Brokers', so the '/' needs to be removed.

There can be several other cases, but for the sake of discussion I will limit myself here.
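The four cases can be sketched as one small helper. This is only an illustration: sector_slug is a hypothetical name, the input 'IT - Software' assumes the CSV stores that sector with spaces around the hyphen, and whether the resulting slugs for cases 3 and 4 match the live site is not verified here.

```r
# Build the URL slug for a sector name, following the cases described above
sector_slug <- function(name) {
  slug <- gsub(" ", "-", name, fixed = TRUE)    # Case 2: spaces become hyphens
  slug <- gsub("---", "-", slug, fixed = TRUE)  # Case 3: " - " has become "---"
  slug <- gsub("/", "", slug, fixed = TRUE)     # Case 4: drop the slash
  paste0(slug, "-Sector")                       # Case 1: suffix for every sector
}

sector_slug(c("Banks", "Tobacco Products", "IT - Software", "Stock/ Commodity Brokers"))
```

Because gsub and paste0 are vectorized, the helper can be applied to the whole column of sector names in one call instead of row by row.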

# Reading in the dataset
sectorsv1 <- read.csv("C:/Users/user/Desktop/Datasets/sectorsv1.csv")

# Converting to a matrix; this is a practice I generally follow
sectorvm <- as.matrix(sectorsv1)

We can access individual sectors with sectorvm[rowno, colno].

# url.exists() comes from the RCurl package and readHTMLTable() from the XML package
library(RCurl)
library(XML)

pe <- c()
cname <- c()
cnt <- 0
baseurl <- 'http://www.indiainfoline.com/MarketStatistics/PE-Ratios/'

for (i in 1:nrow(sectorvm))
{
  securl <- sectorvm[i, 1]

  # fixed=TRUE indicates the string is to be matched as is and is not a regular expression.
  # The substitutions follow the cases explained above; note that we use gsub instead of sub,
  # else only the first instance would be replaced.
  if (length(grep(' ', securl, fixed=TRUE)) != 1)
  {
    # Single-word sector name: only the suffix is needed
    securl <- paste(securl, '-Sector', sep="")
  }
  else
  {
    securl <- gsub(' ', '-', securl, ignore.case=FALSE, fixed=TRUE)
    # A name with spaces around a hyphen now contains '---'; collapse it to a single '-'
    if (length(grep('---', securl, fixed=TRUE)) == 1)
    {
      securl <- gsub('---', '-', securl, ignore.case=FALSE, fixed=TRUE)
    }
    if (length(grep('&', securl, fixed=TRUE)) == 1)
    {
      securl <- gsub('&', 'and', securl, ignore.case=FALSE, fixed=TRUE)
    }
    if (length(grep('/', securl, fixed=TRUE)) == 1)
    {
      securl <- gsub('/', '', securl, ignore.case=FALSE, fixed=TRUE)
    }
    if (length(grep(',', securl, fixed=TRUE)) == 1)
    {
      securl <- gsub(',', '', securl, ignore.case=FALSE, fixed=TRUE)
    }
    securl <- paste(securl, '-Sector', sep="")
  }

  fullurl <- paste(baseurl, securl, sep="")
  print(fullurl)

  if (url.exists(fullurl))
  {
    petbls <- readHTMLTable(fullurl)
    # Exploring the tables, we found the relevant information in table 2.
    # Also, the data is stored as a factor, so just doing an as.numeric will not suffice;
    # we need to do an as.character and then an as.numeric.
    pe <- c(pe, as.numeric(as.character(petbls[[2]]$PE)))
    cname <- c(cname, as.character(petbls[[2]]$Company))
    cnt <- cnt + 1
  }
}

The different functions that we have used are explained below.

readHTMLTable -> Given a URL, this function retrieves the contents of the <table> tags from an HTML page as a list of tables. We need to pick the appropriate table number; in this case we used table no. 2.

grep, paste and gsub are ordinary string functions: grep finds occurrences of a string in another, paste concatenates and gsub does the replacing.

as.numeric(as.character()) left a lasting impression on my mind, as an innocuous and intuitive as.numeric alone would have left me with only the internal integer codes of the factor levels (effectively ranks), not the actual values.

url.exists -> It is a good idea to check the existence of the URL, given that we are forming the URLs dynamically.
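The factor pitfall is easy to see on a tiny example. The three values below are arbitrary, and pe_factor is an illustrative name.

```r
# A factor built from numbers stored as text, as happens with scraped tables
pe_factor <- factor(c("20.09", "587.5", "3.2"))

# Levels are sorted as strings: "20.09", "3.2", "587.5"
levels(pe_factor)

# as.numeric alone returns the position of each value among the levels
codes <- as.numeric(pe_factor)

# Going through as.character first recovers the actual numbers
values <- as.numeric(as.character(pe_factor))

print(codes)
print(values)
```

Here codes comes out as 1 3 2, which looks plausible enough to slip through unnoticed, while values holds the real figures 20.09, 587.5 and 3.2.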

Now playing with summary statistics:

We use the describe function from the psych package, i.e. describe(pe)

n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
1797	59.71	76.92	20.09	46.64	29.79	0	587.5	587.5	2.15	7.25	1.81

hist(pe,col='blue',main='P/E Distribution')

We get the histogram below for the P/E ratio, which shows it is nowhere near a normal distribution, with its peakedness and skew, as confirmed by the summary statistics as well.

We will nevertheless do a normality test

shapiro.test(pe)

        Shapiro-Wilk normality test

data:  pe

W = 0.7496, p-value < 2.2e-16

Basically the null hypothesis is that the values come from a normal distribution. We see that the p-value is extremely small (far below 0.05), and hence we can easily reject the null.
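To see how the test behaves on data of a similar shape, here is a small sketch on simulated right-skewed values. The data are generated with rexp and are not real P/E figures; the seed and the rate are arbitrary choices.

```r
set.seed(42)                         # arbitrary seed, only for reproducibility
skewed <- rexp(500, rate = 1 / 60)   # simulated right-skewed values

result <- shapiro.test(skewed)

# W is well below 1 for strongly skewed data
print(result$statistic)

# A p-value below 0.05 means we reject normality at the 5% level
reject_normality <- result$p.value < 0.05
print(reject_normality)
```

Note that shapiro.test accepts between 3 and 5000 observations, so the 1797 P/E values used in this post are within its limits.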

Drawing a box plot on the P/E ratios

boxplot(pe,col='blue')

Finding the outliers

boxplot.stats(pe)$out

484.33 327.91 587.50

cname[which(pe %in% boxplot.stats(pe)$out)]

[1] "Bajaj Electrical" "BF Utilities"     "Ruchi Infrastr."

Of course, there is no prize for guessing that we should stay out of these stocks.
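For anyone wondering what counts as an outlier here: by default boxplot.stats flags points lying more than 1.5 times the hinge spread beyond the hinges. A small sketch on made-up numbers (pe_demo and names_demo are illustrative placeholders, not real stocks):

```r
# Ten made-up P/E values with one obviously extreme entry
pe_demo <- c(8, 12, 15, 18, 20, 22, 25, 30, 35, 400)
names_demo <- paste("Company", LETTERS[1:10])

# Hinges are 15 and 30, so the upper fence is 30 + 1.5 * 15 = 52.5
out <- boxplot.stats(pe_demo)$out

# Look up which companies the flagged values belong to
flagged <- names_demo[pe_demo %in% out]

print(out)
print(flagged)
```

Only the value 400 lies beyond the fence, so the single flagged name is "Company J".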

So, to summarize, this is a kind of exploratory data analysis on the P/E ratio of Indian stocks

· We saw that we can get content out of URLs and HTML tables

· We collected the values in R vectors

· We looked at summary statistics and a histogram, and did a normality test

· We plotted a box plot and found the outliers

Thursday, 19 December 2013

happiness equation !!!!

Who knew that happiness can have an equation? I was glancing through one of the lectures by Swami Sarvapriyananda, and he introduced something called the happiness equation, proposed by Martin Seligman, the father of positive psychology. The version he shared, which is in perfect agreement with the Purusharthas of Vedanta, goes something like this:

H = P + E + M

P is for pleasure: good food, movies. These are short-lived.
E comes from engagement: our profession, creativity, research.
M is for meaning, for others: first beyond yourself and then beyond your family.

Happiness from E and M should make up most of the total. Say you dine at a good restaurant, and you spend one hour teaching a poor student. After some months or years, the latter activity will give you much more happiness.

Very profound!!!

More can be watched @ goo.gl/n5TlWx

Monday, 11 November 2013

Statistica , very few points

I don't think I am qualified, but I got a chance to interact with a professional from Statistica and thought of sharing a few things about the tool.

· It has amazing integration with the MS suite; maybe something is on the cards

· Import is easy and wizard-driven, supports a multitude of data file types, and allows connection to a file as well as a database

· It also allows working on data from multiple sources

· It has good statistical capabilities

· I did not get a chance to look at all the functionality

· Correlation and regression looked good, with linear, multiple and factor regression

· Help files are good, and it comes with quite a few sample datasets

· To me the winner is its VBA coding interface, which will make life simpler

· You do not necessarily need to bring all the data locally to process it; there is a technique available for that as well

I know this is the tip of the iceberg, but I just thought of sharing whatever I gathered. Will keep you posted.
