Friday, September 2, 2011

Word Cloud from Blog RSS



Crazy busy  - no time to blog recently. Time enough for pretty pictures based upon previous words though...(thanks http://www.wordle.net).







Saturday, January 1, 2011

Ten News Stories of 2010 - and the Statistics that Made Them

Significance Magazine
According to Significance Magazine (jointly published by Royal Statistical Society and the American Statistical Association) the following are the top ten stories of 2010.

1. Progress in the prevention of HIV:    Public health studies result in HIV treatment advancements.
2. Drug regulation: restrictions and retractions:   Related to breast cancer and type 2 diabetes.
3. Measuring a teacher's value:  LA Times graded teachers based on standardized test results.
4. Political rhetoric finds a helpmeet in statistics:   "a statistical recovery and a human recession."
5. Census of Marine Life: The first census of the world's seas completed in 2010.
6. Death of Frederick Jelinek:  a pioneer in speech recognition and statistical methods of NLP.
7. The genetic key to Shangri-La:  Dr. Paola Sebastiani's genetics advancements related to longevity.
8. Screening saves: CT Scanning definitively associated with a reduced risk of lung cancer mortality.
9. Fat kills: Quantitative reviews in various areas of health and nutrition.
10. Words, words, words: Culturomics project produces the Google Ngram Viewer.


The details are parceled out in 5 articles: Part 1 | Part 2  | Part 3 | Part 4 | Part 5



Other Stories - or my $0.02.
The following are not exactly in the same category as the listings in Significance Magazine - but they involve personalities and events that affect many members of the R community and have some sort of analytical/statistical significance.

World Statistics Day
I mean, I missed picking up a greeting card - but the objective of the celebration is pretty worthwhile:

building support and better understanding for official statistics among the general public and the policy-makers worldwide.

R-Bloggers
For the R community, R-Bloggers has had a banner year and provided a great deal of visibility for the R community.  They are looking for sponsorship - so please consider supporting them.


U.S. Economic News
News involved the use of additional zeros tacked on to the end of numbers.  The recovery.org web site has been somewhat underwhelming.  Edward Tufte's nomination to serve on the Recovery Independent Advisory Panel was a fascinating development.  His emphasis on clear and truthful presentation of information could be a Good Thing.


New Era of Data Journalism
The World Bank has continued to provide more data on economic and social topics.  
A couple of blog posts covered this, and an R package is also available to access the World Bank Data API.  There has been an increased refinement in data journalism as well as controversy surrounding WikiLeaks during 2010.  


Data Marketplace
InfoChimps is pioneering an online marketplace for buying and selling data.  Seems that they have a plausible idea - they recently landed 1.2 million dollars in funding.


Benoit Mandelbrot
Another noteworthy death this year that was not mentioned was the loss of the "Father of Fractals" - Benoit Mandelbrot.  




Friday, December 31, 2010

R-Chart: Year End Wrap Up



Thanks to everyone who visited and commented here at R-Chart over the last year!  Blogging has forced me to crystallize my thoughts and I hope others have benefited a bit from these meanderings.  It is great to interact with the knowledgeable, educated and friendly folks in the R community.

I make no claims to be an expert or authority on statistics, visualization, design or any of the myriad of other topics touched on over the past year.  I appreciate all who have provided encouragement, suggestions and corrections.  Unlike many of you more scientifically minded types who meticulously verify all conclusions before speaking, I tend to throw ideas out in the blog and make adjustments and corrections based upon feedback.  This is really one of the great values of blogging - and so again, thank you for your responsiveness.  It was unexpected and very helpful.

Lessons Learned
In case you blog or are thinking of blogging, I thought you might be interested in how things have worked here at R-Chart to this point.

Make Good Titles
It was interesting to find out which items were of most interest (based upon the number of hits per page).  A great deal seems to be based upon the headline of the post - never underestimate the value of a well-constructed sound bite of a title.  This often dictates the future of a posting.  Bad title = no response.  I really never gave much thought to how important it is to construct a meaningful, attention-grabbing title.

Blog Promotion
Promotion of each article also took more time than I expected.  Tal over at R-Bloggers really does the R community a service - bloggers who sign up have content aggregated automatically.  If you want to draw additional readers you have to do a certain amount of footwork yourself.  I get about 15% of total traffic to the site from search engines - which is kind of low.  Most of the generic sites that I submitted the blog to didn't send any traffic.  Content that was of specific interest to a given community ended up resulting in the most traffic.

The top sites that have sent traffic this way are shown below.


www.reddit.com        15,218
www.google.com         8,932
news.ycombinator.com   7,211
www.r-bloggers.com     4,885
www.dzone.com          3,682
habrahabr.ru           1,167  (Hi to friends in Russia for this - the highest ranking non-English site)
twitter.com              689
www.google.co.in         565
www.google.co.uk         531
www.rubyflow.com         470


R is International
I was really amazed at the international response - folks from 164 countries around the world have hit the blog since its inception.  Germany was the top non-English-speaking country in total visits and France was also well represented.

This probably is of no surprise to many - R has been widely used in academic research and there are a relatively small number of highly specialized professionals around the world using R.  It's obvious that the web reaches everywhere - it is not obvious who will end up visiting a given site.

Interest as Indicated by Traffic
A few other numbers of note:
96,928  R-Chart pageviews (all-time history) as of 12/31/2010
620     Downloads of the free R-Chart iPhone application
237     Total days blogging at blogspot (as.Date('2010-12-31') - as.Date('2010-05-08'))
195     Days the blog has lived at r-chart.com (as.Date('2010-12-31') - as.Date('2010-06-19'))
158     Comments on this blog
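
The two day counts above come straight from date subtraction in R. A minimal sketch that reproduces them is shown here; the variable names are made up for illustration, and the pageviews-per-day figure is a derived number that does not appear in the post itself.

```r
# Dates taken from the figures above; variable names are illustrative only
wrap_up     <- as.Date("2010-12-31")
first_post  <- as.Date("2010-05-08")  # start of blogging at blogspot
domain_move <- as.Date("2010-06-19")  # move to r-chart.com

# Subtracting Date objects gives a difftime; as.numeric() extracts the days
days_blogging  <- as.numeric(wrap_up - first_post)
days_on_domain <- as.numeric(wrap_up - domain_move)

# Derived figure (an assumption-free division, but not reported in the post)
pageviews_per_day <- 96928 / days_blogging

print(days_blogging)
print(days_on_domain)
print(round(pageviews_per_day))
```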

Advertising
Apologies to folks who are put off by the advertising.  I had a goal to dip into this area a bit to offset costs and maybe buy a book or two.  This may happen eventually...

$ 42.89 AdSense Revenue
$ 13.46 Advertising Revenue through Amazon affiliates


Again - thanks to all - and have a Happy New Year


Wednesday, December 8, 2010

Google AI Challenge: Scores/Rank by Language

A quick follow up to the previous post about the scores in the 2010 Google AI competition relative to programming language.  The chart above makes each language visible and discrete - and the scales are the same.

library(ggplot2)
df<- read.csv('googleAI2010.csv',sep=';',header=FALSE)
df$V7 <- NULL
names(df)<- c('rank', 'username','country','organization','language','elo_score')


# ggtitle() replaces opts(title=...), which newer ggplot2 releases no longer provide
ggplot(data=df, aes(x=rank, y=elo_score, color=language)) +
  geom_point(size=1) +
  facet_wrap(~ language) + ggtitle('Google AI 2010: Score by Rank for each Language')

It is based upon a simple comparison of rank and score.




df<- read.csv('googleAI2010.csv',sep=';',header=FALSE)
df$V7 <- NULL
names(df)<- c('rank', 'username','country','organization','language','elo_score')

ggplot(data=df, aes(x=rank, y=elo_score)) + geom_point(size=1) + ggtitle('Google AI Score by Rank')


Another approach to viewing this information is a histogram by score (which ignores rank).  With a binwidth of 100 (and ignoring low scores of people who signed up but who dropped out relatively early) a (nearly) bimodal distribution appears.

qplot(data=df, x=elo_score, geom='histogram', binwidth=100)


Any ideas about why this is not normal?  Is there some aspect of ELO scoring that leads to this shape?  Or are there different types of programmers represented?

This can be broken down by language.  To avoid difficulty distinguishing colors, the rainbow palette is used and a few languages are not reported (since they were not highly represented in the competition).

library(sqldf)

df2=sqldf("select * from df where language not in ('Groovy','Scala','Go','OCaml')")
df2$language=factor(df2$language)
qplot(data=df2, x=elo_score, fill=language, geom='histogram', binwidth=100) + scale_fill_manual(values=rainbow(12)) 



As mentioned in the previous post, the data is available at GitHub - feel free to post some of your own visualizations of this data.

Thursday, December 2, 2010

Google AI Challenge: Languages Used by the Best Programmers





The Google AI Challenge recently wrapped up with a Lisp developer from Hungary as the winner.  The competition challenges contestants to create bots that push the limits of AI and game theory.  These bots compete against one another, and a complete ranking of competitors is available.  The big story today is that the winner (Gábor Melis) used Lisp to beat out over 4000 other contestants around the world using a host of different programming languages.   




Paul Graham has stated that Java was designed for "average" programmers while other languages (like Lisp) are for good programmers.  The fact that the winner of the competition wrote in Lisp seems to support this assertion.  Or should we see Mr. Melis as an anomaly who happened to use Lisp for this task?



Programming Languages Usage


Java, C++, Python and C# were heavily used overall.

     language count(*)

1        Java     1634
2         C++     1232
3      Python      948
4          C#      485
5         PHP       80
6        Ruby       55
7     Haskell       51
8        Perl       42
9        Lisp       33
10 Javascript       19
11          C       18
12      OCaml       12
13         Go        6
14      Scala        4
15     Groovy        1

In the Top 200
     language count(*)
1        Java       70
2         C++       64
3      Python       34
4          C#       17
5           C        4
6     Haskell        3
7         PHP        3
8        Ruby        2
9  Javascript        1
10       Lisp        1
11      OCaml        1


Top 100
  language count(*)
1     Java       33
2      C++       32
3   Python       20
4       C#        9
5        C        3
6  Haskell        1
7     Lisp        1
8    OCaml        1

Top 10
  language count(*)
1     Java        4
2      C++        3
3       C#        2
4     Lisp        1


The plot above is a bit difficult to discern due to the number of languages represented (and similarity in colors).  So here is a breakdown by language.

Lisp does appear to be skewed towards higher ranking.  But even more striking are the C hippies:

The functional crowd represented with Haskell also ranked on the higher end:


How about Java?  There is a trend towards the average - but a significantly larger number of entrants used Java.  The language is also taught in many colleges, so this might reflect greater student participation (although MIT did focus on Lisp back in the day...).
How about representatives from Microsoft?  Einstein and Elvis showed up - Mort was not interested.

I can post charts of other languages if anyone asks - otherwise, download the files for yourself and draw your own conclusions.  And congratulations to Gábor Melis - I am again feeling the inspiration to delve into the mysteries of Lisp and meander among mountains of parentheses...




Methodology Used
No need to proceed further unless you are interested in how the results listed above were derived.

Basically, I used Ruby to scrape the results from the Google AI Rankings site.  The results were read into R, and the ggplot2 and sqldf libraries were used to analyze them.

Get the Data into R
So to find out more...I whipped up a ruby script to create a delimited file from the 47 page listing online.  (Feel free to get these from their GitHub location and do some additional validation/analysis of your own).   Read this file into R:


df<- read.csv('googleAI2010.csv',sep=';',header=FALSE)
df$V7 <- NULL
names(df)<- c('rank', 'username','country','organization','language','elo_score')


Sanity Check
Most of this work can be done in idiomatic R (which has some significant Lisp influences) - which might be a better way to honor the winner.  However, I find myself using sqlite more and more these days - particularly in mobile development.  So I used the sqldf library which uses this database behind the scenes.

Country rankings are available online, and the following emulates these results.  Specifically, the number of entrants in the top 200 ranked contestants from each country can be derived as follows:




library('sqldf')


top200=df[df$rank <= 200,]


sqldf('select country, count(*) from top200 group by country order by 2 desc')
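
For anyone who prefers to skip the database layer, the same kind of count can be sketched in base R with table(). The rows below are a made-up stand-in for the real contest file (hypothetical ranks and country codes, not actual results), and the cutoff of 5 stands in for the 200 used above.

```r
# Toy stand-in for the contest data: made-up rows, not real results
df <- data.frame(
  rank    = 1:6,
  country = c("HU", "US", "US", "DE", "US", "DE"),
  stringsAsFactors = FALSE
)

# Keep the top-ranked rows (5 here, 200 in the post)
top <- df[df$rank <= 5, ]

# table() counts rows per country; sort() puts the largest count first,
# mirroring "group by country order by 2 desc" in the SQL version
counts <- sort(table(top$country), decreasing = TRUE)
print(counts)
```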


Organization rankings are similar, representing the top organizations within the top 100.  There are some anomalies here: the highest ranking "Other" is not shown in the online version for obvious reasons, and most of these organizations have only one entrant in the top 100 and are listed in an arbitrary order.  However, the results are otherwise the same in R.




top100=df[df$rank <= 100,]
sqldf('select organization, count(*) from top100 group by organization order by 2 desc')




R Code
The following are additional snippets of R code used to generate the results above.


# Language Usage

sqldf('select language, count(*) from df group by language order by 2 desc')


sqldf('select language, count(*) from top200 group by language order by 2 desc')
sqldf('select language, count(*) from top100 group by language order by 2 desc')



top10=df[df$rank <= 10,]
sqldf('select language, count(*) from top10 group by language order by 2 desc')



 If you fiddle enough with the bucket size for histograms, you might be able to draw some conclusions... but the density plot seemed like a nicer option.  


library('ggplot2')

# Substitute your favorite language of those available for Lisp below
qplot(data=df[df$language=='Lisp',], x=rank, geom='histogram', binwidth=1000) + ggtitle('Lisp')





# The density plot at the top of this posting:

ggplot(data=df, aes(rank, fill=language)) +
  geom_density(alpha = 0.2) +
  xlim(0,5000) +
  ggtitle('2010 Google AI Challenge Rankings')  # replaces opts(title=...) from older ggplot2


ggsave('program_language_density_plot.png')


# Breakdown by language:

ggplot(data=df[df$language=='Scala',], aes(rank, fill=language)) + geom_density(alpha = 0.2) + xlim(0,5000) + ggtitle('Scala')


Update:  I have been keeping up with the comments - and sketched out some other ways of looking at the data in another post.


Wednesday, November 10, 2010

Mortgage Calculator (and Amortization Charts) with R


Mortgage rates have been at historic lows recently.  The rates are posted various places online along with simple mortgage calculators.  Such calculators illustrate the payment schedule for a mortgage based upon selected terms. But with less than a dozen lines of R code, you can do a far more sophisticated analysis.

Mortgage Calculation Function
Rather than reinvent the wheel, you can work with this nice R function by Thomas Girke (Associate Professor of Bioinformatics over at UC Riverside).  At the R prompt, you can grab it from its home online by calling source:

source("http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/My_R_Scripts/mortgage.R")




This loads the function and outputs a helpful description of the function:



The monthly mortgage payments and amortization rates can be calculted with the mortgage() function like this: 

        mortgage(P=500000, I=6, L=30, amort=T, plotData=T)
                P = principal (loan amount)
                I = annual interest rate
                L = length of the loan in years 

So keep in mind that there is a huge amount of R code available online: 
are just the beginning.  


Instant R Graphical User Interfaces

Rather than simply calling the function directly, you can quickly construct a GUI input widget using the fgui library.

library(fgui)
gui(mortgage)


With this trivial invocation, a window pops up.






Not terribly fancy, but about the simplest way you will ever be able to construct a GUI!  In this case a mortgage amount of $90,000 for 10 years at 3.75% is illustrated.  


After entering these values, click OK to actually call the function.  This results in a  good deal of interesting output.  Close the pop up window and look at the R Console:



Monthly payment: $900.5512 (stored in monthPay)


                        Total cost: $108066.1



As indicated in this message, an R object named monthPay contains the amount of the monthly payment and can be used in subsequent R commands and calculations.  You also are greeted with a graph illustrating annual interest and payments as a stacked bar chart.
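
The monthly payment shown above can be checked by hand. The sketch below assumes the standard fixed-rate annuity formula with monthly compounding; that is an assumption about what mortgage() does internally, and monthly_payment() is a hypothetical helper written for this illustration rather than part of the downloaded script.

```r
# Hypothetical helper: standard fixed-rate annuity formula, monthly compounding
monthly_payment <- function(principal, annual_rate_pct, years) {
  r <- annual_rate_pct / 100 / 12   # monthly interest rate
  n <- years * 12                   # number of monthly payments
  principal * r / (1 - (1 + r)^(-n))
}

# Same inputs as the GUI example: $90,000 for 10 years at 3.75%
pay        <- monthly_payment(90000, 3.75, 10)
total_cost <- pay * 10 * 12

print(pay)
print(total_cost)
```

If the assumption holds, both numbers should agree with the console output quoted above.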

Plenty of useful information!  But that's not all...


Beyond the Basics
You might have noticed a number of messages regarding data stored in R objects.  This is where the power of R exceeds that of any standard mortgage calculator.  These objects can serve as input to other calculations or charting operations.


The aDFmonth object contains amortization data for each month, while aDFyear contains annual information. In the following example, a new data frame that excludes the amortization column is created from the monthly data and plotted using ggplot2.  (The amortization data is on a significantly different scale and is better viewed independently.)

library(ggplot2)
library(reshape2)  # provides melt(); newer ggplot2 versions no longer load it for you
DF=melt(aDFmonth[-1], id.vars='Year')

ggplot(DF, aes(x=Year,y=value, group=variable)) + geom_line() + facet_wrap(~ variable, ncol=1)

You can quickly manipulate the data frame to view amortization information instead.  Use the exact same ggplot call (though the facet_wrap is removed below as unnecessary for a single variable)  to create a chart scaled to fit the values relevant to the amortization.

DF=melt(aDFmonth[c(1,5)], id.vars='Year')
ggplot(DF, aes(x=Year,y=value, group=variable))+ geom_line() 
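
For readers curious about what sits behind a table like aDFmonth, a month-by-month schedule can be sketched in a short loop. This is an illustration under the usual fixed-rate assumptions, not the code from mortgage.R, and the column names (Month, Interest, Principal, Balance) are made up here and may not match the columns of aDFmonth.

```r
# Same loan as the GUI example: $90,000 for 10 years at 3.75%
P <- 90000
r <- 3.75 / 100 / 12        # monthly interest rate
n <- 10 * 12                # number of payments
pay <- P * r / (1 - (1 + r)^(-n))

interest  <- numeric(n)
principal <- numeric(n)
balance   <- numeric(n)

bal <- P
for (m in seq_len(n)) {
  interest[m]  <- bal * r              # interest accrued on what is still owed
  principal[m] <- pay - interest[m]    # remainder of the payment reduces the loan
  bal          <- bal - principal[m]
  balance[m]   <- bal
}

# Hypothetical column names, chosen for this sketch only
schedule <- data.frame(Month = seq_len(n), Interest = interest,
                       Principal = principal, Balance = balance)
print(head(schedule, 3))
```

Early payments are mostly interest and late payments mostly principal, which is the pattern the charts above make visible.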


The limits of calculations and visualizations available in a web calculator or Excel are reached pretty quickly.  R provides the means to create relatively full featured solutions in only a few lines of code. 


Tuesday, November 9, 2010

Don't be a Turkey


'Indeed, I am moving on: my new project is about methods on how to domesticate the unknown, exploit randomness, figure out how to live in a world we don't understand very well. While most human thought (particularly since the enlightenment) has focused us on how to turn knowledge into decisions, my new mission is to build methods to turn lack of information, lack of understanding, and lack of "knowledge" into decisions—how, as we will see, not to be a "turkey".'


With Thanksgiving on the way, an economic lesson provided by a turkey's statistical department seems appropriate.    Our turkey - let's call him auRthur - like most turkeys has a statistical department at his disposal.  His department is in fact tracking an index - the Turkey Welfare Index, which is a reflection of how much the human race cares about auRthur.  Notice the relatively positive trend... until Thanksgiving Day...

Evidently, our auRthur's statistical department utilized a model that had some flaws - "past performance is not necessarily a predictor of future returns".   This is because the harvesting of the turkey is a "rare event."  Rare (unprecedented) events are difficult to predict.  The story is not terribly amusing to turkeys to begin with - but becomes less amusing to humans when understood as a metaphor of the financial meltdown and statistical modeling in use by banking institutions.  Essentially, banking institutions assumed a huge amount of risk because a catastrophic meltdown was simply outside the realm of consideration.  It was not represented in most of the models in use.
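
To make the flaw concrete, here is a small sketch of what such a department might do: fit a straight line to the first seven days and extrapolate to day eight. The index values are the ones listed in the DF table later in this post; the choice of a linear model is an assumption made for illustration and is not taken from Taleb's essay.

```r
# First seven days of the Turkey Welfare Index (values from the DF listing)
history <- data.frame(Day = 1:7, TWI = 14:20)

# A naive model: welfare rises by a constant amount each day
fit <- lm(TWI ~ Day, data = history)

# Extrapolate to day 8 and compare with what actually happened
forecast <- unname(predict(fit, newdata = data.frame(Day = 8)))
actual   <- -100

print(forecast)
print(actual - forecast)
```

The fit to the history is flawless, and that is exactly the problem: nothing in seven days of data hints at day eight.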

A great and vivid illustration.  See Nassim Nicholas Taleb's essay where this chart and illustration originally appeared at edge.org.  This article discusses the limits of statistical thinking and is a good springboard to other writings by Taleb - who was a practitioner of risk as he ran a hedge fund for a number of years and saw many of the practices in the financial industry up close and personal.


The chart above was created using R and ggplot2.  The data frame named DF was populated with data related to the Turkey Welfare Index.

> DF
   TWI Day color
1   14   1 black
2   15   2 black
3   16   3 black
4   17   4 black
5   18   5 black
6   19   6 black
7   20   7 black
8 -100   8   red


UPDATE:  This can be entered in a few different ways.  One is through a grid (which requires that you specify the Day as a factor).

  DF=edit(data.frame())
  DF$Day=factor(DF$Day)
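
Another way, sketched here as an alternative to the grid editor, is to build the data frame directly in code. The values are the ones printed in the DF listing above; stringsAsFactors = FALSE is set only so the color column stays plain text regardless of R version.

```r
# Build the Turkey Welfare Index data without the interactive editor
DF <- data.frame(
  TWI   = c(14:20, -100),              # seven good days, then Thanksgiving
  Day   = factor(1:8),                 # a factor, so the x axis is discrete
  color = c(rep("black", 7), "red"),   # highlight the final day
  stringsAsFactors = FALSE
)

print(DF)
```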


Plotted using ggplot2:


library(ggplot2)


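# Written for newer ggplot2: theme()/element_blank()/ggtitle() replace opts()/theme_blank()
ggplot(data=DF, aes(x=Day, y=TWI, fill=color)) +
  geom_bar(stat='identity') +                      # bar heights come straight from TWI
  scale_fill_manual(values=c("black", "red")) +
  theme_bw() + scale_x_discrete(breaks = NULL) +   # no ticks or labels on the x axis
  ggtitle('Turkey Welfare Index') +
  theme(legend.position='none', axis.title.x=element_blank(),
        axis.title.y=element_blank())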

This included a couple of somewhat unusual settings to shut off labels and axes, which results in the simple "plain" appearance you see above.

So - Happy Thanksgiving - understand statistics and don't be a turkey...
