February 2012
7 posts
The Big Data Evolution: The Creep Factor →
bigdataathawkeye: Why Some Companies Get It, and Others Don’t. If my pharmacy used my prescription records to send me targeted catalogs, I would be upset because they have violated a confidence I had in them. However, if they used that same information to warn me of possible drug interactions, or suggest a generic product during a store visit, I would be very grateful. This is a great little...
Feb 22nd
2 notes
1 tag
Feb 20th
15 notes
Feb 16th
215 notes
4 tags
Finding an optimal seating chart for a wedding →
This is such a great optimization problem. I wonder what the best way to fill in the relation matrix is. Besides setting a few key entries to the maximum value as they did in the paper (couples, immediate family of the couple being married, etc.), I’m guessing some good features would be: Number of overlapping Facebook friends. Number of group emails that the attendees appear on. Total rounds...
Feb 14th
5 notes
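A rough sketch of how the relation matrix from the seating chart post could be filled in from pairwise features. Everything here is an assumption for illustration: the function name `relation_matrix`, the two feature tables, the weights, and the `max_score` of 50 are hypothetical and are not taken from the paper.

```python
def relation_matrix(guests, shared_friends, shared_emails, must_sit_together,
                    max_score=50, w_friends=1.0, w_emails=2.0):
    """Build a symmetric guest-by-guest affinity matrix (hypothetical scoring).

    shared_friends / shared_emails map a (guest, guest) pair to a count.
    must_sit_together lists pairs that are pinned to max_score.
    """
    index = {g: i for i, g in enumerate(guests)}
    n = len(guests)
    matrix = [[0.0] * n for _ in range(n)]

    def pair_count(table, a, b):
        # Pairs may be stored in either order.
        return table.get((a, b), table.get((b, a), 0))

    for i, a in enumerate(guests):
        for j in range(i + 1, n):
            b = guests[j]
            score = (w_friends * pair_count(shared_friends, a, b)
                     + w_emails * pair_count(shared_emails, a, b))
            # Cap the feature-based score so it never exceeds a pinned pair.
            matrix[i][j] = matrix[j][i] = min(score, max_score)

    # Key relationships (couples, immediate family) get the maximum value.
    for a, b in must_sit_together:
        i, j = index[a], index[b]
        matrix[i][j] = matrix[j][i] = max_score
    return matrix


guests = ["Ann", "Bob", "Cat"]
m = relation_matrix(
    guests,
    shared_friends={("Ann", "Bob"): 12, ("Cat", "Bob"): 3},
    shared_emails={("Ann", "Cat"): 4},
    must_sit_together=[("Bob", "Cat")],
)
```

The weights would need tuning against a few seating decisions you already know are right; the sketch only shows the shape of the data.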
3 tags
CONE Welder: Collaborative Observatory for Natural... →
This could be the basis of a really fun image recognition project. With over 38,849 bird photos, this collection can serve as a training set for visual bird detection and image processing research. Via @siah
Feb 14th
1 note
High Scalability: Tumblr Architecture - 15 Billion... →
buzz: If you’ve ever wondered what’s involved in keeping Tumblr humming along, here’s a great overview courtesy of Blake Matheny.
Feb 13th
78 notes
6 tags
Feb 5th
3 notes
January 2012
7 posts
2 tags
Jan 30th
4,834 notes
3 tags
Jan 19th
5 notes
bitly blog: SOPA and PIPA on the social web -... →
bitly: The social web is exploding with SOPA and PIPA related content today! We’re seeing nearly ten clicks per second on the Electronic Frontier Foundation’s “Stop the Internet Blacklist Legislation”, over two clicks per second on SOPA related web pages, and almost 1 click a second on PIPA related…
Jan 18th
13 notes
4 tags
Jan 16th
19 notes
Jan 13th
7 notes
2 tags
Setting up Python on a fresh OS X install.
I have seen WAY TOO MANY blog posts about this, but none of them seem to get it right. Python comes pre-installed, but if you use Homebrew as a package manager, you should be using the version of Python that it provides. To install Homebrew, first make sure you have installed Xcode through the App Store, then run this command:
/usr/bin/ruby -e "$(curl -fsSL https://raw.github.com/gist/323731)"
...
Jan 10th
21 notes
R has an interesting list of built-in constants
Of all of the mathematical and scientific constants, they decided to go with lower/upper case letters, month names/abbreviations, and pi.
Constants {base}, R Documentation
Built-in Constants
Description: Constants built into R.
Usage: LETTERS, letters, month.abb, month.name, pi
Details: R has a small number of built-in constants. The following constants are available: LETTERS: the 26...
Jan 9th
1 note
December 2011
11 posts
4 tags
Python function for sampling from an arbitrary...
Sampling is fun, right? Here’s a simple implementation of a slice sampler for discrete probability distributions. And here’s how to call it:
>>> px=[.2, .4, .1, .3]
>>> slice_sampler(px, N=5)
array([2, 3, 3, 3, 3])
>>> slice_sampler(px, N=5, x=[100, 200, 300, 400])
array([200, 200, 400, 200, 200])
Set N to something high and take a histogram and...
Dec 29th
8 notes
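The implementation itself is cut off in the sampling post's excerpt, so here is a minimal sketch of what a discrete slice sampler with that call signature could look like. It is not the post's code: it uses only the standard library and returns a plain list rather than a NumPy array, and the `seed` parameter is an addition for reproducibility.

```python
import random


def slice_sampler(px, N=1, x=None, seed=None):
    """Draw N samples from the discrete distribution px by slice sampling.

    px   : probabilities (or unnormalized weights), one per state
    x    : optional labels to return instead of the indices 0..len(px)-1
    seed : optional seed (an assumption of this sketch, not in the post)
    """
    rng = random.Random(seed)
    values = list(range(len(px))) if x is None else list(x)
    if len(values) != len(px):
        raise ValueError("x must have one label per entry of px")

    support = [i for i, p in enumerate(px) if p > 0]
    if not support:
        raise ValueError("px must contain at least one positive entry")

    i = rng.choice(support)  # arbitrary starting state with nonzero mass
    samples = []
    for _ in range(N):
        # Vertical step: pick a height uniformly in [0, px[i]).
        u = rng.random() * px[i]
        # Horizontal step: pick uniformly among states that rise above u.
        # The current state always qualifies, so the slice is never empty.
        slice_states = [j for j, p in enumerate(px) if p > u]
        i = rng.choice(slice_states)
        samples.append(values[i])
    return samples


px = [.2, .4, .1, .3]
draws = slice_sampler(px, N=20000, seed=0)
freqs = [draws.count(k) / len(draws) for k in range(len(px))]
```

Consecutive draws are correlated, as with any Markov chain sampler, so `freqs` only approaches `px` once N is reasonably large.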
2 tags
Dec 28th
505 notes
Dec 26th
130 notes
Dec 24th
NYC OpenData: We invite you to create and submit... →
nycopendata: We invite you to create and submit data visualizations from now until January 3rd using the dataset below on SAT Scores in New York City Public Schools. You are welcome to use correlative data from other sources as well (e.g. the U.S. Census). We will post some of our favorites on the…
Dec 23rd
101 notes
4 tags
Dec 20th
1 tag
Dec 12th
169 notes
3 tags
Amanda Cox Talks about Developing Infographics at... →
The talk from the NYC R meetup was really great.
Dec 8th
7 notes
2 tags
Dec 8th
1 tag
Dec 6th
8 notes
2 tags
Dec 5th
1 note
November 2011
13 posts
3 tags
Colored plotmatrix in ggplot2
I feel like plotmatrix is a much-neglected ggplot2 function (it’s not even on the ggplot2 webpage). It’s the equivalent of plot(dataframe) from the core graphics package, with the added bonus of a kernel density plot along the diagonal. The one thing that seems to be missing is the ability to color the points by some factor. I modified the plotmatrix function to enable this....
Nov 29th
3 notes
3 tags
Nov 22nd
67 notes
Nov 18th
240 notes
1 tag
Nov 17th
12,461 notes
9 tags
Nov 16th
1,386 notes
3 tags
Nov 16th
5 notes
4 tags
Nov 16th
331 notes
2 tags
Machine Learning for Email →
Drew Conway and John Myles White posted the code used in their O’Reilly book Machine Learning for Email on GitHub. Check it out to see implementations (all in R) of priority inbox, spam classification, and other algorithms.
Nov 15th
3 notes
1 tag
Slides from our Velocity Europe talk on MySQL... →
engineering: Special thanks to Dallas and Matt for helping to formulate our shard-splitting algorithm, Zack for the slide deck visual excellence, and the fine folks at Velocity / O’Reilly Media for running a great conference.
Nov 11th
26 notes
5 tags
Nov 4th
10 notes
3 tags
Bring back Google Reader!
Yesterday, Google removed my single favorite thing about the internet. Google Reader was an above-average RSS reader with a functional layout and unmatched sharing/comments features. Now it’s a decent-at-best RSS reader thanks to the horrible new layout, and a few useless JSON files with some contacts and 4 years’ worth of shared items. I was wondering what is left of the Trends page that...
Nov 2nd
3 tags
Nov 1st
2 tags
“There’s this culture in the Valley of starting a company before they know what...”
– Mark Zuckerberg, via TechCrunch.
Nov 1st
2 notes
October 2011
13 posts
4 tags
Oct 27th
3 notes
2 tags
Oct 24th
2 tags
“I don’t really consider or use Google Reader as “social” product like Facebook,...”
– I don’t know how to add friends to Google Reader, or how I ended up with the friends I have on there, but it’s perfect and Google is going to ruin it by folding it into Google+.
Oct 22nd
3 tags
Oct 18th
11 notes
3 tags
Comparison of high level languages for mapreduce:... →
via the Revolution Analytics blog
Oct 14th
4 notes
Oct 13th
301 notes
2 tags
A quick outlier detector for streaming data
Batch processing is great and all, but nobody wants to wait until the next day to find out that their process is lagging. Here’s a quick-and-dirty script to detect outliers. It’s just a moving average filter where I calculate the variance of the window with each data point. I hard-coded the definition of “outlier” as being more than three standard deviations away from...
Oct 13th
7 notes
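The script from the outlier detector post is not shown in the excerpt, so the following is only a sketch of the idea it describes: a moving window, its mean and variance recomputed per data point, and a three-standard-deviation rule. The excerpt is truncated before it says what the deviation is measured from; comparing against the mean of the preceding window is an assumption here, as are the name `detect_outliers` and the default window size.

```python
from collections import deque
import math


def detect_outliers(stream, window=20, n_sigma=3.0):
    """Yield (value, is_outlier) for each value of an iterable stream.

    A value is flagged when it lies more than n_sigma standard deviations
    from the mean of the previous `window` values (assumed reference point).
    """
    recent = deque(maxlen=window)  # old values fall off automatically
    for value in stream:
        is_outlier = False
        if len(recent) >= 2:  # need two points for a sample variance
            mean = sum(recent) / len(recent)
            var = sum((v - mean) ** 2 for v in recent) / (len(recent) - 1)
            std = math.sqrt(var)
            is_outlier = std > 0 and abs(value - mean) > n_sigma * std
        yield value, is_outlier
        recent.append(value)  # the window includes flagged values too


readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50, 10]
flagged = [v for v, bad in detect_outliers(readings) if bad]
```

Because it is a generator, it can be wrapped around a log tail or a socket reader and will flag points as they arrive. Recomputing the mean and variance from scratch is fine for small windows; a running-sum update would be the next step for large ones.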
2 tags
Data Reveals That “Occupying” Twitter Trending... →
A great analysis of trending topics on Twitter, and why #OccupyWallStreet hasn’t shown up (and never will).
Oct 12th
12 notes
3 tags
Top 50 statistics blogs →
There’s some great stuff on this list. See anything that was missed? Some that I’d add: Journal of Statistical Software; John Myles White’s blog; R-Bloggers (an R blog aggregation site); @bigdata; anything with Hadley Wickham’s name on it; my own blog, obviously.
Oct 11th
50 notes
4 tags
Oct 11th
12 notes
8 tags
Tips for making a technical blog on tumblr
Technical blogs have a few special formatting requirements (code samples, equations, and graphs/figures), and anyone making a technical blog probably hates using WYSIWYG/rich text editors. Here are a few tips that will hopefully be useful. If you have any questions or better suggestions, hit up the comments section.
Default Post Editor
First of all, you can change the default editor mode in your...
Oct 10th
158 notes
2 tags
Oct 9th
4 notes