The Chart's Meaning
Note that the chart does not represent the number of posts by a given user; it is just a list of distinct users with their start dates, grouped into monthly buckets. I suppose the shape of the graph makes sense: folks sign up so that they can post, and older users drift away and cease posting at some point. The chart does indicate that, as of a few days ago, a given user who posted was more likely to be someone who signed up in the last year or two than a veteran.
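To make the "distinct users, not posts" point concrete, here is a minimal Ruby sketch of the bucketing. The sample rows, the field names (`:user`, `:signed_up`) and the helper `users_per_signup_month` are all made up for illustration; this is not the scraper I actually ran.

```ruby
require 'date'

# Hypothetical sample: one row per recent post, carrying the poster's
# name and signup date. alice posted twice but should count only once.
posts = [
  { user: 'alice', signed_up: Date.new(2007, 3, 14) },
  { user: 'alice', signed_up: Date.new(2007, 3, 14) },
  { user: 'bob',   signed_up: Date.new(2007, 3, 2)  },
  { user: 'carol', signed_up: Date.new(2010, 1, 20) }
]

# Count each distinct user once, keyed by the first day of the signup month.
def users_per_signup_month(posts)
  posts
    .uniq { |p| p[:user] }
    .group_by { |p| Date.new(p[:signed_up].year, p[:signed_up].month, 1) }
    .transform_values(&:size)
end

users_per_signup_month(posts).sort.each do |month, count|
  puts "#{month.strftime('%Y-%m')}: #{count}"
end
```

With this sample, March 2007 gets a count of 2 (alice and bob) even though there are three posts from that cohort, which is the same thing the chart is counting.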
Scraping the Data
I used Ruby and Hpricot (still missing you, _why) to parse the site and Active Record to store the list of users in a MySQL database. I use Active Record outside of Rails rather frequently. It does great, straightforward object-relational mapping, and even an arbitrary query is returned as an object that can be manipulated.
I noticed a few differences using MySQL vs Oracle/RODBC with R.
1) Oracle/RODBC capitalizes column names in the result set.
2) Using RMySQL, there is no need to set up an ODBC connection.
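3) With RMySQL, sending the query (dbSendQuery) and fetching the results (fetch) are separate steps, though there is also a function dbGetQuery that allows both actions to be taken in a single step.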
4) In Oracle I use TRUNC to bucket dates by month, but with MySQL I ended up using the EXTRACT function and tagging on a 01 for the first day of the month.
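The trick in item 4 can be sketched in a few lines of Ruby. EXTRACT(YEAR_MONTH FROM ...) yields a number like 201003, and tagging 01 onto it gives something that parses as the first of the month, which matches what Oracle's TRUNC to the month would return. The helper names `year_month` and `first_of_month` are hypothetical, chosen just for this illustration.

```ruby
require 'date'

# Mimics MySQL's EXTRACT(YEAR_MONTH FROM d): 2010-03-17 becomes 201003.
def year_month(date)
  date.year * 100 + date.month
end

# Tags "01" onto the YEAR_MONTH value and parses it as the first of the month.
def first_of_month(year_month)
  Date.strptime("#{year_month}01", '%Y%m%d')
end

signup = Date.new(2010, 3, 17)
puts first_of_month(year_month(signup))  # prints 2010-03-01
```

The R code does the same conversion with paste and as.Date once the data is in a data frame.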
Creating the Chart
R speaks for itself:
library(RMySQL)
library(ggplot2)
drv <- dbDriver("MySQL")
con <- dbConnect(drv, username='xxxx', password='xxxx', dbname='xxxx')
# Buckets by month; the count is aliased so the column has a usable name in R
sql <- 'select extract(YEAR_MONTH from hn_created_date) hn_created_date, count(*) user_count from users group by extract(YEAR_MONTH from hn_created_date);'
# Execute the Query and Fetch the Data (n=-1 fetches all rows)
rs <- dbSendQuery(con, sql)
df <- fetch(rs, n=-1)
# Release the result set and close the connection
dbClearResult(rs)
dbDisconnect(con)
# Set the date to the first of the month (buckets of user by start month)
df$hn_created_date <- as.Date(paste(df$hn_created_date, '01', sep=''), format='%Y%m%d')
# The Actual Plot
p <- ggplot(data=df, aes(hn_created_date, user_count)) + geom_line() + xlab('User Start Date') + ylab('Number of Users Who Posted Recently')
p + stat_smooth()
