
Monday 28 July 2014

Beyond bars, pie charts and histograms …

(Written for the Institute of Engineering and Management 2014 magazine; put online for easier access, especially for my students.)

The popular saying “A picture is worth a thousand words” goes back to the early part of the 20th century. If we take graphs as a rough equivalent of pictures, then apart from being quick to interpret, their other benefit is a summarized view of the data. As a case in point, let’s look at the US crime data below [U.S. Department of Justice]. I am taking property crime data subdivided into burglary, larceny-theft and motor vehicle theft (for those of you who, like me, have been wondering what larceny is, it’s just a legal term for stealing). The numbers indicate rates per 100,000 population.


State         Burglary  Larceny-theft  Motor vehicle theft    Total
Alabama          953.8           2650                288.3   3892.1
Alaska           622.5         2599.1                  391   3612.5
Arizona          948.4         2965.2                924.4     4838
Arkansas        1084.6         2711.2                262.1   4057.9
California       693.3         1916.5                712.8   3322.6
Colorado         744.8         2735.2                559.5   4039.5
Connecticut      437.1         1824.1                296.8     2558
Delaware         688.9           2144                278.5   3111.4
Florida          926.3         2658.3                423.3   4007.9
Georgia            931         2751.1                490.2   4172.3

Many of you would apply your CAT-cracking DI (data interpretation) skills and do some quick mental math far better than I can, but there is no denying that the chart below reveals much more to mere mortals. We are very quick to spot the two most and the two least crime-prone states.
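The same ranking can also be checked without a chart. What follows is a minimal Python sketch, assuming the table above is simply typed in by hand; the names `rates`, `totals` and `ranked` are illustrative only.

```python
# Property crime rates per 100,000 population, copied from the table above:
# (burglary, larceny-theft, motor vehicle theft)
rates = {
    "Alabama":     (953.8, 2650.0, 288.3),
    "Alaska":      (622.5, 2599.1, 391.0),
    "Arizona":     (948.4, 2965.2, 924.4),
    "Arkansas":    (1084.6, 2711.2, 262.1),
    "California":  (693.3, 1916.5, 712.8),
    "Colorado":    (744.8, 2735.2, 559.5),
    "Connecticut": (437.1, 1824.1, 296.8),
    "Delaware":    (688.9, 2144.0, 278.5),
    "Florida":     (926.3, 2658.3, 423.3),
    "Georgia":     (931.0, 2751.1, 490.2),
}

# Total rate per state; a recomputed sum can differ from the published
# Total column in the last digit because of rounding in the source.
totals = {state: round(sum(parts), 1) for state, parts in rates.items()}

# States ordered from highest to lowest total rate.
ranked = sorted(totals, key=totals.get, reverse=True)
top_two = ranked[:2]
bottom_two = ranked[-2:]

print("Highest totals:", top_two)
print("Lowest totals:", bottom_two)
```

With only ten rows the sort is trivial, but the same few lines work unchanged for all fifty states.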





Let’s look at another view of the same data, this time a stacked bar chart.



A stacked bar chart of this kind (a 100% stacked bar chart) takes each state’s total as 100% and then shows the contribution of each component as a differently colored portion of the bar. It is difficult to miss that:
- Alabama and Arkansas have a relatively small proportion of motor vehicle theft
- whereas Alaska and Colorado have a relatively low proportion of burglary
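The percentages behind such a chart are easy to compute. Here is a small Python sketch under the same assumption as before (figures typed in from the table; `rates`, `shares` and `pct` are illustrative names) that converts each state’s three rates into shares of its own total.

```python
# Property crime rates per 100,000 population, copied from the table above:
# (burglary, larceny-theft, motor vehicle theft)
rates = {
    "Alabama":     (953.8, 2650.0, 288.3),
    "Alaska":      (622.5, 2599.1, 391.0),
    "Arizona":     (948.4, 2965.2, 924.4),
    "Arkansas":    (1084.6, 2711.2, 262.1),
    "California":  (693.3, 1916.5, 712.8),
    "Colorado":    (744.8, 2735.2, 559.5),
    "Connecticut": (437.1, 1824.1, 296.8),
    "Delaware":    (688.9, 2144.0, 278.5),
    "Florida":     (926.3, 2658.3, 423.3),
    "Georgia":     (931.0, 2751.1, 490.2),
}


def shares(parts):
    """Return each component as a percentage of the row total."""
    total = sum(parts)
    return tuple(100 * p / total for p in parts)


# Each state's bar is normalized to 100%.
pct = {state: shares(parts) for state, parts in rates.items()}

# The two states with the smallest motor vehicle theft share.
lowest_motor = sorted(pct, key=lambda s: pct[s][2])[:2]

for state, (burglary, larceny, motor) in pct.items():
    print(f"{state:<12} burglary {burglary:5.1f}%  "
          f"larceny-theft {larceny:5.1f}%  motor vehicle {motor:5.1f}%")
```

Normalizing each row is what lets states with very different totals be compared on composition alone.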
I hope some of you are now convinced of the usefulness of traditional charts. So this is an opportune time for a “kahani mein twist” (a twist in the tale). There is a trite piece of ‘gyan’ (wisdom) about negotiating on price while shopping from hawkers: do not start negotiating on the thing you like and want to purchase; start on something else nearby (for lack of an original example, let me settle for this: if you are trying to buy a রুমাল (handkerchief), ask the price of a বেড়াল (cat)), bargain hard, and then casually bring the hawker round to the piece you actually want to buy. If the hawker understands that you have really fallen for it, the chance of a bargain is bleak. So, with this entire prologue, let me just say: gone are the days when we could impress people with graphical reports made of bar charts, histograms, pie charts, line charts and so on. This is felt all the more with the advent of “Big Data”.
Commercially, “Big Data” is characterized by three Vs: Volume, Variety and Velocity. Without going into too much detail, I guess it will be sufficient to draw your attention to the number of tweets per day, which is 500 million! This gave birth to a new term, tweets per second (TPS). To understand variety, we need look no further than the photos, blogs, chats, emails, medical images, sensor logs and RFID logs generated every second. When we talk about velocity, we are talking not only about the rate at which data is generated, but also about the rate at which it becomes obsolete. One practical example: the Gurgaon police have nowadays added tweets to their dashboard. So if people are tweeting about a quarrel breaking out, or about bedlam or pandemonium of some other kind, the police can track the exact area in real time via the tweets and send the nearest patrol car within minutes. So, contrary to the “filmy” style of arriving well after the crime is done and the criminals have fled, the cops now arrive well in time.
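To put the volume figure in perspective, the average TPS implied by 500 million tweets a day is simple arithmetic. A tiny Python sketch, assuming tweets are spread evenly over the day (real traffic is burstier, so peaks will be higher):

```python
# Figure quoted in the text: tweets posted per day.
TWEETS_PER_DAY = 500_000_000

# Seconds in one day: 24 hours x 60 minutes x 60 seconds.
SECONDS_PER_DAY = 24 * 60 * 60

# Average tweets per second under the even-spread assumption.
avg_tps = TWEETS_PER_DAY / SECONDS_PER_DAY

print(f"Average tweets per second: {avg_tps:,.0f}")
```

That works out to a little under six thousand tweets every second, on average.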
A whole toolbox of visualization techniques has grown up around this new “avatar” of data, and a lot can be written about them. I will choose just one of them as a class representative (of the class of new visualization techniques).
So meet “Mr Word Cloud”. You might not have known his formal name, but you have surely had a chance or deliberate tryst with him somewhere in your internet life. He is no toddler; he is at least in your age group (born in 1992). Here is a word cloud used by a leading data analyst in his blog to emphasize the importance of data. Word clouds are very handy for summarizing large bodies of text, blogs and tweets. Oh, by the way, I was forgetting: he does not mind being called a ‘tag cloud’ either.


It’s easy to note that the words have different font sizes, which are proportional to their frequency (a stand-in for importance); the vertical or horizontal orientation does not mean much.
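One simple way to turn “more frequent means bigger” into numbers is to scale word counts linearly between a smallest and a largest font size. The Python sketch below is only an illustration of that idea: the function name `font_sizes`, the point sizes and the sample text are all made up here, and real tools use their own scaling rules.

```python
from collections import Counter


def font_sizes(text, min_pt=10, max_pt=48):
    """Map each word to a font size that grows linearly with its count."""
    counts = Counter(text.lower().split())
    lo, hi = min(counts.values()), max(counts.values())
    span = (hi - lo) or 1  # avoid dividing by zero when all counts are equal
    return {
        word: min_pt + (max_pt - min_pt) * (count - lo) / span
        for word, count in counts.items()
    }


# Made-up sample text: "data" appears 3 times, "insight" twice, "chart" once.
sample = "data data data insight insight chart"
sizes = font_sizes(sample)

for word, size in sorted(sizes.items(), key=lambda item: -item[1]):
    print(f"{word:<8} {size:.0f} pt")
```

The most frequent word gets the largest size, the rarest gets the smallest, and everything else falls in between.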
Roughly, there are two ways you can start using word clouds.
1.       Use a site like http://www.wordle.net/. Either copy and paste your text, or give the URL of a site that has an Atom or RSS feed (very crudely, a mechanism for exchanging dynamically updated information).
2.       Use a computing environment like R, one of the most widely used open-source environments for statistical computing and data analysis (http://cran.r-project.org/).
The one below is freshly baked for you, using no more than 10 lines of R code, from a few reviews (NDTV, Hindustan Times, TOI etc.) of a movie, and there is no prize for guessing the name of the movie. (I kept only words that appear at least 3 times. Some further work could be done to remove film-industry-specific ‘stop words’: words such as “movie” and “film” that are likely to be present in any movie review, irrespective of the movie.) Without watching the movie, I would probably not be wrong in concluding that if the movie is to be watched, it’s for Aamir.
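The filtering step described above can be sketched in a few lines. This is not the R code behind the picture; it is a Python stand-in using only the standard library, and the stop-word lists, the function name `cloud_words` and the three sample reviews are invented placeholders. The resulting word counts are what a word cloud renderer would then draw.

```python
import re
from collections import Counter

# Hypothetical stop-word lists: a few general English words, plus words
# that turn up in almost any movie review regardless of the movie.
GENERAL_STOP_WORDS = {"the", "a", "an", "is", "and", "of", "in", "it",
                      "to", "for", "this"}
MOVIE_STOP_WORDS = {"movie", "film"}


def cloud_words(reviews, min_count=3,
                stop_words=GENERAL_STOP_WORDS | MOVIE_STOP_WORDS):
    """Count words across reviews, dropping stop words and rare words."""
    counts = Counter()
    for review in reviews:
        tokens = re.findall(r"[a-z']+", review.lower())
        counts.update(t for t in tokens if t not in stop_words)
    # Keep only words that appear at least min_count times overall.
    return {word: n for word, n in counts.items() if n >= min_count}


# Invented placeholder text, not the actual reviews used for the picture.
reviews = [
    "The movie is fun and the lead actor is superb",
    "A fun film, the actor carries the movie",
    "Fun in parts, this movie belongs to its lead actor",
]
kept = cloud_words(reviews)
print(kept)
```

Without the movie-specific stop words, “movie” itself would survive the frequency cut and crowd the picture while telling the reader nothing.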

I hope I have made a case for visualization techniques. And if you are wondering whether they can help hide the author’s excellent writing prowess, or the lack of it: well, none of your provocations will elicit an answer from me.
