February 2012
7 posts
The Big Data Evolution: The Creep Factor →
bigdataathawkeye:
-Why Some Companies Get It, and Others Don’t
if my pharmacy used my prescription records to send me targeted catalogs, I would be upset because they have violated a confidence I had in them. However, if they used that same information to warn me of possible drug interactions, or suggest a generic product during a store visit, I would be very grateful.
This is a great little...
1 tag
4 tags
Finding an optimal seating chart for a wedding →
This is such a great optimization problem. I wonder what the best way to fill in the relation matrix is. Besides setting a few key entries to the maximum value as they did in the paper (couples, immediate family of the couple being married, etc.), I’m guessing some good features would be:
Number of overlapping Facebook friends.
Number of group emails that the attendees appear on.
Total rounds...
3 tags
CONE Welder: Collaborative Observatory for Natural... →
This could be the basis of a really fun image recognition project.
With over 38,849 bird photos this collection can serve as a training set for visual bird detection and image processing research.
Via @siah
High Scalability: Tumblr Architecture - 15 Billion... →
buzz:
If you’ve ever wondered what’s involved in keeping Tumblr humming along, here’s a great overview courtesy of Blake Matheny.
6 tags
January 2012
7 posts
2 tags
3 tags
bitly blog: SOPA and PIPA on the social web -... →
bitly:
The social web is exploding with SOPA and PIPA related content today! We’re seeing nearly ten clicks per second on the Electronic Frontier Foundation’s “Stop the Internet Blacklist Legislation”, over two clicks per second on SOPA related web pages, and almost 1 click a second on PIPA related…
4 tags
2 tags
Setting up Python on a fresh OS X install.
I have seen WAY TOO MANY blog posts about this, but none of them seem to get it right. Python comes pre-installed, but if you use Homebrew as a package manager, you should be using the version of Python that it provides.
To install Homebrew, first make sure you have installed Xcode through the App Store, then run this command:
/usr/bin/ruby -e "$(curl -fsSL https://raw.github.com/gist/323731)"
...
R has an interesting list of built-in constants
Of all of the mathematical and scientific constants, they decided to go with lower/upper case letters, month names/abbreviations, and pi.
Constants {base} R Documentation
Built-in Constants
Description
Constants built into R.
Usage
LETTERS
letters
month.abb
month.name
pi
Details
R has a small number of built-in constants.
The following constants are available:
LETTERS: the 26...
December 2011
11 posts
4 tags
Python function for sampling from an arbitrary...
Sampling is fun, right? Here’s a simple implementation of a slice sampler for discrete probability distributions.
And here’s how to call it:
>>> px=[.2, .4, .1, .3]
>>> slice_sampler(px, N=5)
array([2, 3, 3, 3, 3])
>>> slice_sampler(px, N=5, x=[100, 200, 300, 400])
array([200, 200, 400, 200, 200])
Set N to something high and take a histogram and...
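The implementation itself is not shown in this excerpt, so here is a minimal sketch of what a discrete slice sampler with the same call signature might look like. It is not the original code: the zero-based default labels, the `seed` parameter, and the choice to start the chain at the mode are all assumptions made for illustration.

```python
import numpy as np

def slice_sampler(px, N=1, x=None, seed=None):
    """Draw N samples from a discrete distribution px with a slice sampler.

    Sketch only: default labels 0..len(px)-1 and the `seed` argument are
    assumptions, not taken from the original post.
    """
    px = np.asarray(px, dtype=float)
    px = px / px.sum()                        # normalize in case px does not sum to 1
    values = np.arange(len(px)) if x is None else np.asarray(x)
    rng = np.random.default_rng(seed)

    idx = int(np.argmax(px))                  # start at the mode, which has p > 0
    out = np.empty(N, dtype=int)
    for n in range(N):
        u = rng.uniform(0.0, px[idx])         # vertical step: a height under the current bar
        candidates = np.flatnonzero(px > u)   # the slice: outcomes whose bar is taller than u
        idx = int(rng.choice(candidates))     # horizontal step: uniform draw from the slice
        out[n] = idx
    return values[out]
```

Because each draw depends on the previous one, consecutive samples are correlated; the empirical frequencies only approach px once N is reasonably large.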
2 tags
NYC OpenData: We invite you to create and submit... →
nycopendata:
We invite you to create and submit data visualizations from now until January 3rd using the dataset below on SAT Scores in New York City Public Schools. You are welcome to use correlative data from other sources as well (e.g. the U.S. Census). We will post some of our favorites on the…
4 tags
1 tag
3 tags
Amanda Cox Talks about Developing Infographics at... →
The talk from the NYC R meetup was really great.
2 tags
1 tag
2 tags
November 2011
13 posts
3 tags
Colored plotmatrix in ggplot2
I feel like plotmatrix is a much-neglected ggplot2 function (it’s not even on the ggplot2 webpage). It’s the equivalent of plot(dataframe) from the core graphics package, with the added bonus of a kernel density plot along the diagonal.
The one thing that seems to be missing is the ability to color the points by some factor. I modified the plotmatrix function to enable this....
3 tags
1 tag
9 tags
3 tags
4 tags
2 tags
Machine Learning for Email →
Drew Conway and John Myles White posted the code used in their O’Reilly book Machine Learning for Email on GitHub. Check it out to see implementations (all in R) of priority inbox, spam classification, and other algorithms.
1 tag
Slides from our Velocity Europe talk on MySQL... →
engineering:
Special thanks to Dallas and Matt for helping to formulate our shard-splitting algorithm, Zack for the slide deck visual excellence, and the fine folks at Velocity / O’Reilly Media for running a great conference.
5 tags
3 tags
Bring back Google Reader!
Yesterday, Google removed my single favorite thing about the internet. Google Reader was an above-average RSS reader with a functional layout and unmatched sharing/comments features. Now it’s a decent-at-best RSS reader thanks to the horrible new layout, and a few useless JSON files with some contacts and 4 years’ worth of shared items.
I was wondering what is left of the Trends page that...
3 tags
2 tags
There’s this culture in the Valley of starting a company before they know what...
– Mark Zuckerberg, via TechCrunch.
October 2011
13 posts
4 tags
2 tags
2 tags
I don’t really consider or use Google Reader as a “social” product like Facebook,...
– I don’t know how to add friends to Google Reader, or how I ended up with the friends I have on there, but it’s perfect and Google is going to ruin it by folding it into Google+.
3 tags
3 tags
Comparison of high level languages for mapreduce:... →
via the Revolution Analytics blog
2 tags
A quick outlier detector for streaming data
Batch processing is great and all, but nobody wants to wait until the next day to find out that their process is lagging. Here’s a quick-and-dirty script to detect outliers.
It’s just a moving average filter where I calculate the variance of the window with each data point. I hard-coded the definition of “outlier” as being more than three standard deviations away from...
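The script itself is cut off in this excerpt, so the following is only a rough sketch of the idea described above, not the original code. It assumes the distance is measured from the window mean (the sentence is truncated before saying so), and the names `detect_outliers`, `window`, and `n_sigmas` are hypothetical.

```python
from collections import deque
import math

def detect_outliers(stream, window=20, n_sigmas=3.0):
    """Yield (index, value) for points far from the moving average.

    Assumption: "far" means more than n_sigmas standard deviations from
    the mean of the previous `window` points.
    """
    recent = deque(maxlen=window)             # sliding window of the latest points
    for i, value in enumerate(stream):
        if len(recent) == window:             # only test once the window is full
            mean = sum(recent) / window
            var = sum((v - mean) ** 2 for v in recent) / window
            std = math.sqrt(var)
            if std > 0 and abs(value - mean) > n_sigmas * std:
                yield i, value
        recent.append(value)                  # outliers also enter the window


stream = [10, 11, 9, 10, 11, 10, 50, 10, 9, 11]
flagged = list(detect_outliers(stream, window=5))
```

Since it is a generator over any iterable, it can be fed a live stream one point at a time. Note that a flagged point still enters the window, which inflates the variance for the next few points; whether to exclude it is a design choice.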
2 tags
Data Reveals That “Occupying” Twitter Trending... →
A great analysis of trending topics on Twitter, and why #OccupyWallStreet hasn’t shown up (and never will).
3 tags
Top 50 statistics blogs →
There’s some great stuff on this list. See anything that was missed?
Some that I’d add:
Journal of Statistical Software
John Myles White’s blog
R-Bloggers (an R blog aggregation site)
@bigdata
Anything with Hadley Wickham’s name on it.
My own blog, obviously.
4 tags
8 tags
Tips for making a technical blog on tumblr
Technical blogs have a few special formatting requirements (code samples, equations, and graphs/figures), and anyone making a technical blog probably hates using WYSIWYG/rich text editors. Here are a few tips that will hopefully be useful. If you have any questions or better suggestions, hit up the comments section.
Default Post Editor
First of all, you can change the default editor mode in your...
2 tags