(Written for the Institute of Engineering and Management 2014 magazine; put online for easier access, especially for my students.)
The popular saying “A picture is worth a thousand words” goes back to the
early part of the 20th century. If we take graphs as a rough equivalent of
pictures, then apart from quick interpretability, their other benefit is a
summarized view of the data. As a case in point, let’s look at the US crime
data below [U.S. Department of Justice]. I am taking property crime data,
subdivided into burglary, larceny-theft and motor vehicle theft. (For those
of you who, like me, have been wondering what larceny is, it’s just a legal
term for stealing.) The numbers indicate rates per 100,000 population.
States | Burglary | Larceny-theft | Motor Vehicle Theft | Total
Alabama | 953.8 | 2650 | 288.3 | 3892.1
Alaska | 622.5 | 2599.1 | 391 | 3612.5
Arizona | 948.4 | 2965.2 | 924.4 | 4838
Arkansas | 1084.6 | 2711.2 | 262.1 | 4057.9
California | 693.3 | 1916.5 | 712.8 | 3322.6
Colorado | 744.8 | 2735.2 | 559.5 | 4039.5
Connecticut | 437.1 | 1824.1 | 296.8 | 2558
Delaware | 688.9 | 2144 | 278.5 | 3111.4
Florida | 926.3 | 2658.3 | 423.3 | 4007.9
Georgia | 931 | 2751.1 | 490.2 | 4172.3
Many of you would apply your ‘CAT-cracking’ DI (data interpretation) skills
and do some quick mental math far better than I can, but there is no denying
that the chart below reveals much more to mere mortals. We can quickly spot
the two most and the two least crime-prone states.
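For readers who would rather let a script do the mental math, here is a minimal Python sketch. It is not part of the original article, and the names `crime_rates` and `rank_by_total` are made up for illustration. It adds up the three columns of the table above and picks out the two highest and the two lowest states.

```python
# Property crime rates per 100,000 population, copied from the table above.
# Tuple order: (burglary, larceny-theft, motor vehicle theft).
crime_rates = {
    "Alabama": (953.8, 2650.0, 288.3),
    "Alaska": (622.5, 2599.1, 391.0),
    "Arizona": (948.4, 2965.2, 924.4),
    "Arkansas": (1084.6, 2711.2, 262.1),
    "California": (693.3, 1916.5, 712.8),
    "Colorado": (744.8, 2735.2, 559.5),
    "Connecticut": (437.1, 1824.1, 296.8),
    "Delaware": (688.9, 2144.0, 278.5),
    "Florida": (926.3, 2658.3, 423.3),
    "Georgia": (931.0, 2751.1, 490.2),
}


def rank_by_total(rates):
    """Return (state, total) pairs sorted from highest to lowest total rate."""
    totals = {state: round(sum(parts), 1) for state, parts in rates.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


ranked = rank_by_total(crime_rates)
top_two = [state for state, _ in ranked[:2]]
bottom_two = [state for state, _ in ranked[-2:]]
print("Highest totals:", top_two)
print("Lowest totals:", bottom_two)
```

The totals here are recomputed from the three components rather than read from the Total column, so a figure may differ from the printed one in the last digit.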
Let’s look at another view of the same data, this time a stacked bar chart.
A stacked bar chart (of the 100% variety) takes the total for each state as
100% and then depicts the contribution of each component as a differently
colored portion. It is difficult to miss that:
- Alabama and Arkansas have a relatively small proportion of motor vehicle theft
- Alaska and Colorado have a relatively low proportion of burglary
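The arithmetic behind such a chart is simple enough to sketch in Python. This is an illustration only, not the code behind the original chart; `crime_rates` and `percentage_shares` are hypothetical names, and the figures are copied from the table.

```python
# Property crime rates per 100,000 population, copied from the table.
# Tuple order: (burglary, larceny-theft, motor vehicle theft).
crime_rates = {
    "Alabama": (953.8, 2650.0, 288.3),
    "Alaska": (622.5, 2599.1, 391.0),
    "Arizona": (948.4, 2965.2, 924.4),
    "Arkansas": (1084.6, 2711.2, 262.1),
    "California": (693.3, 1916.5, 712.8),
    "Colorado": (744.8, 2735.2, 559.5),
    "Connecticut": (437.1, 1824.1, 296.8),
    "Delaware": (688.9, 2144.0, 278.5),
    "Florida": (926.3, 2658.3, 423.3),
    "Georgia": (931.0, 2751.1, 490.2),
}


def percentage_shares(parts):
    """Express each component as a percentage of the state's own total."""
    total = sum(parts)
    return tuple(100 * part / total for part in parts)


shares = {state: percentage_shares(parts) for state, parts in crime_rates.items()}

# States sorted by share, smallest first (index 2 = vehicle theft, 0 = burglary).
lowest_vehicle_theft = sorted(shares, key=lambda s: shares[s][2])[:2]
lowest_burglary = sorted(shares, key=lambda s: shares[s][0])[:3]

for state, (burglary, larceny, vehicle) in shares.items():
    print(f"{state:12s} {burglary:5.1f}% {larceny:5.1f}% {vehicle:5.1f}%")
print("Smallest motor vehicle theft shares:", lowest_vehicle_theft)
print("Smallest burglary shares:", lowest_burglary)
```

On this table, Alabama and Arkansas come out with the two smallest motor vehicle theft shares, while Alaska and Colorado sit among the three smallest burglary shares, alongside Connecticut.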
I hope that by now some of you are convinced of the usefulness of
traditional charts. So this is an opportune time for a “kahani mein twist”
(a twist in the tale). There is a trite piece of ‘gyan’ (wisdom) about
negotiating prices while shopping from hawkers: do not start negotiating on
the thing you like and want to purchase; start with something else nearby.
(For lack of an original example, let me settle for this one: if you are
trying to buy a রুমাল (handkerchief), ask the price of a বেড়াল (cat).) Bargain
hard, and then casually steer him to the piece you actually want to buy,
because if the hawker understands that you have really fallen for it, the
chance of a bargain is bleak. So, after this entire prologue, let me just
say that gone are the days when we could impress people with graphical
reports full of bar charts, histograms, pie charts, line charts and so on.
This is felt even more with the advent of “Big Data”.
Commercially, “Big Data” is characterized by three Vs: Volume, Variety and
Velocity. Without going into too much detail, I guess it will be sufficient
to draw your attention to the number of tweets per day, which is 500
million! This gave birth to a new term, tweets per second (TPS). To
understand variety, we need not look beyond the photos, blogs, chats,
emails, medical images, sensor logs and RFID logs generated every second.
When we talk about velocity, we are talking not only about the rate at which
data is generated, but also about the rate at which it becomes obsolete. One
practical example: the Gurgaon police have nowadays added tweets to their
dashboard. So if people are tweeting about a quarrel breaking out, or about
bedlam or pandemonium for some other reason, the police can track the exact
area in real time via the tweets and send the nearest patrol car within
minutes. So, contrary to the “filmy” style of arriving well after the crime
is done and the criminals have fled, the cops now arrive well in time.
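To get a feel for what 500 million tweets a day means in TPS terms, a back-of-the-envelope conversion helps. The sketch below is only an average over a 24-hour day (real traffic peaks and dips), and the function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_per_second(events_per_day):
    """Convert a daily count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


tweets_per_day = 500_000_000  # the figure quoted in the text
print(f"{average_per_second(tweets_per_day):,.0f} tweets per second on average")
```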
A whole toolbox of visualization techniques has grown up around this new “avatar” of data, and a lot can be written about them. I will choose just one as a class representative (of the class of new visualization techniques).
So meet “Mr Word Cloud”. You might not have known his formal name, but surely you have had a chance or deliberate tryst with him somewhere in your internet life. Though he is no toddler, he is at least in your age group (born in 1992). So here is a word cloud used by a leading data analyst in his blog, emphasizing the importance of data. It is very handy for summarizing large bodies of text, blogs and tweets. Oh, by the way, I was forgetting: he does not mind being called ‘tag cloud’ as well.
It’s easy to note that the words have different font sizes, which are proportional to their frequency, aka importance; the vertical or horizontal orientation does not mean much.
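How a frequency turns into a font size can be sketched in a few lines of Python. This is one plausible scheme (a straight-line scaling between a smallest and a largest size), not necessarily what any particular word cloud tool does; the function name, the point sizes and the example counts are all assumptions made up for illustration.

```python
def font_sizes(frequencies, min_pt=10, max_pt=48):
    """Map word counts to font sizes by linear interpolation.

    The rarest word gets min_pt, the most frequent gets max_pt
    (both sizes are arbitrary choices for this sketch).
    """
    low = min(frequencies.values())
    high = max(frequencies.values())
    if high == low:
        # Every word is equally frequent, so give them all the same size.
        return {word: max_pt for word in frequencies}
    scale = (max_pt - min_pt) / (high - low)
    return {
        word: round(min_pt + (count - low) * scale, 1)
        for word, count in frequencies.items()
    }


# Hypothetical word counts, purely for illustration.
example_counts = {"data": 12, "analysis": 6, "insight": 3}
sizes = font_sizes(example_counts)
for word, size in sizes.items():
    print(f"{word}: {size} pt")
```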
Roughly, there are two ways you can start using word clouds.
1. Use a site like http://www.wordle.net/. Either copy and paste your text, or give the URL of a site that has an Atom or RSS feed (very crudely, a mechanism for exchanging dynamically updated information).
2. Use a computing environment like R, one of the most widely used open-source statistical and data analysis computing environments (http://cran.r-project.org/).
The one below is freshly baked for you, using no more than 10 lines of R code, from a few of the reviews (NDTV, Hindustan Times, TOI etc.) of a movie, and no prizes for guessing its name. (I filtered for words that appear at least 3 times; some further work could be done to remove ‘stop words’ specific to the film industry, basically words like “movie” and “film” that are likely to be present in any movie review, irrespective of the movie.) Without watching the movie, I would probably not be wrong in concluding that if the movie is to be watched, it’s for Aamir.
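The filtering described above (keep words seen at least 3 times, drop stop words) can be sketched as follows. This is a Python stand-in, not the R code used for the cloud; the stop-word lists are tiny illustrative samples, and the review text is a made-up placeholder rather than a real review.

```python
import re
from collections import Counter

# Tiny illustrative lists; a real run would use a much fuller stop-word list.
GENERIC_STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "it", "for"}
MOVIE_STOP_WORDS = {"movie", "film"}  # the domain-specific examples from the text


def cloud_frequencies(text, min_count=3, extra_stop_words=frozenset()):
    """Count words, drop stop words, keep words seen at least min_count times."""
    words = re.findall(r"[a-z']+", text.lower())
    stop_words = GENERIC_STOP_WORDS | set(extra_stop_words)
    counts = Counter(word for word in words if word not in stop_words)
    return {word: count for word, count in counts.items() if count >= min_count}


# Made-up placeholder text, not taken from any actual review.
sample_reviews = (
    "The film is fun. Aamir carries the film. "
    "Aamir is the film. Watch the movie for Aamir."
)

print(cloud_frequencies(sample_reviews))
print(cloud_frequencies(sample_reviews, extra_stop_words=MOVIE_STOP_WORDS))
```

The second call shows the effect of the extra, film-specific stop words: a generic word like “film” drops out and only the words that say something about this particular movie remain.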
I hope I have made a case for visualization techniques. And if you are wondering whether they can help hide the author’s excellent writing prowess, or the lack of it: well, none of your provocations will elicit an answer from me.