Who to follow at Strata 2012

(This is an update of a post I wrote last year, with all-new 2012 data. There are 90% more tweeters this year, and only 12% of tweeters were also at Strata 2011.)

Strata is an O’Reilly-run conference for people who analyze and visualize Big Data; if you are reading this, you probably already know that.

I was curious as to who I should follow on Twitter here, and recalled that the Attendee Directory asked people to fill in their Twitter usernames. What better way to find people who are actually here? (Obviously you can search #strataconf or look at the users @strataconf follows, but that won’t tell you who is popular, fast).

So I pulled together a python script to scrape the Attendee Directory for users that listed Twitter accounts, look those accounts up on Twitter to see how many followers they have, and get some related information. I wrapped up by linking each Twitter user back to the Attendee Directory, so you can view their profile, etc.

Perhaps most importantly, this list includes locations, so you can find other people from your area.

Here it is: Twitter users at Strata 2012, ordered by number of followers

The raw data is available here as a JSON dictionary. The keys are Twitter user names, the values are lists: (num followers, profile image, home page, full name, location, description, Strata Attendee ID).
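As a rough sketch of how one might work with that dictionary, the snippet below ranks users by follower count and filters by location. The sample entries are made up for illustration (the real file is much larger), and the index names and helper functions are hypothetical conveniences, not part of the original script.

```python
import json

# Made-up sample in the layout described above:
# username -> [num followers, profile image, home page, full name,
#              location, description, Strata Attendee ID]
SAMPLE = """{
  "alice_example": [1200, "http://example.com/a.png", "http://example.com/a",
                    "Alice Example", "Boston, MA", "Data person", 101],
  "bob_example": [45000, "http://example.com/b.png", "http://example.com/b",
                  "Bob Example", "San Francisco, CA", "Viz person", 102],
  "carol_example": [300, "http://example.com/c.png", null,
                    "Carol Example", "Boston, MA", "Stats person", 103]
}"""

# Hypothetical names for the positions in each value list
FOLLOWERS, IMAGE, HOME_PAGE, FULL_NAME, LOCATION, DESCRIPTION, ATTENDEE_ID = range(7)


def by_followers(data):
    """Usernames ordered from most to fewest followers."""
    return sorted(data, key=lambda user: data[user][FOLLOWERS], reverse=True)


def in_location(data, place):
    """Usernames whose location contains `place` (case-insensitive)."""
    return sorted(user for user, info in data.items()
                  if place.lower() in (info[LOCATION] or "").lower())


data = json.loads(SAMPLE)
ranked = by_followers(data)
bostonians = in_location(data, "boston")
```

Swapping `SAMPLE` for the contents of the downloaded file should be all that is needed, assuming the file matches the layout described above.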

If you liked this, follow me on Twitter (@jsundram). And feel free to say hi — here’s my Attendee Directory entry.

Update (3/2/2012):

I ran this code again after the conference was over to accommodate 15 people who linked their Twitter accounts to their Attendee Directory entries after the conference started. The updated list is here, and the backing JSON data is here.

On average, Twitter users gained 7.9 followers during the conference. Congrats to @haseebbudhani, who was the biggest mover and shaker, moving 22 places from #367 to #345.

Wednesday, February 29, 2012

Hurricane name hotness

With all the publicity surrounding Hurricane Irene, I began to wonder how hurricanes are named. Mostly, I wondered if there would be a Hurricane Jason. Vanity. But it turns out there won’t be. I’m not the only person to wonder. After figuring out that there are no hurricane names that start with ‘U’, I found that the rules are simple. Starting in 1953, hurricanes were given female names, ascending through the alphabet. In 1979, male names were added, alternating with female names (so the tropical storm that succeeded Irene was Jose). If there are more storms than names on the list, the extra storms are named after Greek letters (Alpha, Beta, Gamma, etc.). The World Meteorological Organization maintains a cycling 6-year list of names. If a storm causes loss of life or significant damage (e.g. Katrina), its name gets retired, and a new one is chosen to fill in.

Retirees

It turns out that names in the first half of the alphabet get retired more frequently. This is unsurprising: between 1950 and 2000, only about 10 storms a year received names, so it makes sense that most of the retired names fall between A and J. Interestingly, 2005 was a banner year for hurricanes: all the names and 6 of the Greek letter names (up to Zeta) were used; 5 names were retired that year (Dennis, Katrina, Rita, Stan, and Wilma); and a 6th, Tropical Storm Gamma, might well have been retired too, as it killed 39 people. But it seems that Greek letters don’t get retired.

Familiarity

The names on the list are chosen to be “familiar to the people in each region.” In the case of the Atlantic, I wondered why a name like Joaquin (slated for use in 2015) might be chosen while Jason was overlooked. Indeed, how many of the other chosen names are in fact not all that popular?

But why wonder? The Social Security Administration keeps records of baby names from every social security card application. You can get the top 1000 male and female names for a given year directly from their site. That list typically covers 73% of the US population. It covers 96% of the 126 names in the hurricane rotation. But for names like Fernand, Gert, Hermine, Shary, and Virginie, we’ll have to dig a bit deeper: beyond the top 1000 names, into a 7MB dump of all baby names that were used more than 5 times, by year (starting with 1880!), which has to be downloaded and parsed. I chose to use only the 2010 data, as a first approximation of the current popularity of various names.

Of course, Atlantic hurricanes don’t just affect the United States, and the World Meteorological Organization is likely to use names that may be popular in Canada, Mexico, and Cuba. So the popularity data is not perfect. Nonetheless, with this data, it is easy to see which hurricane names are likely to be unfamiliar to US citizens.
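Here is a minimal sketch of that lookup. It assumes each line of the yearly file has the form `name,sex,count`, which is how I understand the SSA dump to be laid out; the function names, the threshold of 100 births, and all the counts in the sample are made up for illustration.

```python
def load_name_counts(lines):
    """Parse 'name,sex,count' lines into {lowercased name: total births}."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, _sex, count = line.split(",")
        key = name.lower()
        # A name can appear once per sex; add the two rows together.
        counts[key] = counts.get(key, 0) + int(count)
    return counts


def unfamiliar(hurricane_names, counts, threshold=100):
    """Hurricane names given to fewer than `threshold` babies that year."""
    return [name for name in hurricane_names
            if counts.get(name.lower(), 0) < threshold]


# Made-up counts, standing in for the lines of the real 2010 file.
sample_lines = [
    "Jason,M,5000",
    "Irene,F,400",
    "Joaquin,M,900",
    "Jordan,M,7000",
    "Jordan,F,1500",
]
counts = load_name_counts(sample_lines)
rare = unfamiliar(["Irene", "Joaquin", "Gert"], counts)
```

Names that do not appear in the file at all are treated as having zero births, which is a simplification: the dump only lists names used more than 5 times.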


For a great roundup of other hurricane resources, check out the post on Irene at Microbes to Manatees.

Sunday, September 11, 2011

‘W’ Considered Harmful

Not the magazine and not even the former president. But the letter ‘W’ itself. The letter ‘W’, 23rd in the English alphabet, is unique in two ways: it is the only letter whose name is more than one syllable, and also the only letter whose name doesn’t include the sound it makes. 

The fact that ‘W’ takes 3 syllables to say bothers me. Even Wikipedia’s entry on ‘W’ points out, twice, that the abbreviation www requires nine syllables to say. Crazy. So I wondered: how often is it the case that words that start with W (hereafter W-words) have fewer syllables than the letter W (double-yew) itself?

Syllabification in general is a hard problem in English, but fortunately I don’t have to solve it. The Carnegie Mellon University (CMU) Pronouncing Dictionary provides the pronunciations for over 125,000 words. I say pronunciations, plural, because words can be pronounced in a variety of different ways: fire, for example, can be pronounced to rhyme with higher, or in a single syllable. (Only 41 W-words in the CMU dict have pronunciations with different numbers of syllables.) Using the CMU Pronouncing Dictionary, it’s possible to count syllables in a word in a short (if cryptic) Python function, courtesy of Jordan Boyd-Graber (I found it on the nltk-users Google group):

from nltk.corpus import cmudict

d = cmudict.dict()  # the CMU Pronouncing Dict: word -> list of pronunciations

def nsyl(word):
    """return the max syllable count in the case of multiple pronunciations"""
    # Vowel phonemes end in a stress digit (e.g. 'AH0'), so counting them counts syllables.
    return max(sum(1 for phoneme in pron if phoneme[-1].isdigit())
               for pron in d[word.lower()])

So, now that we’ve got a syllable counter, let’s get all the W-words in the CMU dictionary, and see what the syllable distribution looks like.

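import pylab

# Syllable count for every W-word in the dictionary
w_words = {w: nsyl(w) for w in d if w.startswith('w')}
# W-words with more syllables than "double-yew" itself
worth_abbreviating = [(k, v) for (k, v) in w_words.items() if v > 3]
pylab.hist(list(w_words.values()))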

Only 101 W-words in the CMU dictionary (of 3805 total W-words) have more than 3 syllables. That’s about 2.7%. Here’s a sampling of the words where using W to abbreviate them actually saves syllables: wagnerian, wallpapering, washingtonians, weatherperson, workaholic. So, by all means, call a meeting of the Wagnerian Wallpapering Workaholic Weatherperson Washingtonians the WWWWW. It will save time. Otherwise, consider not using an abbreviation. Or looking for synonyms.

Suggestions for further work:

Monday, February 28, 2011

Who to follow at Strata 2011

Strata is an O’Reilly-run conference for people who analyze and visualize Big Data; if you are reading this, you probably already know that.

 I was curious as to who I should follow on Twitter here, and recalled that the Attendee Directory asked people to fill in their Twitter usernames. What better way to find people who are actually here? (Obviously you can search #strataconf or look at the users @strataconf follows, but that won’t tell you who is popular, fast).

So I pulled together a python script to scrape the Attendee Directory for users that listed Twitter accounts, look those accounts up on Twitter to see how many followers they have, and get some related information. I wrapped up by linking each Twitter user back to the Attendee Directory, so you can view their profile, etc.

I also created a Twitter list with all the attendees listed below: StrataConf2011-3.

If you liked this, follow me on Twitter (@jsundram). And feel free to say hi — here’s my Attendee Directory entry.

Here it is: Twitter users at Strata 2011, ordered by number of followers

Update: The raw data is available here as a JSON dictionary. The keys are Twitter user names, the values are lists: (num followers, profile image, home page, full name, location, description, Strata Attendee ID). Let me know if you do something cool with it!

Wednesday, February 2, 2011

Danceability and Energy: Introducing Echo Nest Attributes

Tristan and I have been working hard to add a new kind of information to the Echo Nest’s audio APIs. We’re calling these things attributes, and they are quantities that are calculated with data from our track analysis. Our attributes depend on ground truth data generated by The Echo Nest’s awesome Data QA Team, a passionate group of musicians and music lovers that includes several Berklee students. When they tell us a song is danceable, we believe it.
 

Danceability 

We each groove to different music; what constitutes dance music is inherently subjective. The Echo Nest defines danceability as the ease with which a person could dance to a song, over the course of the whole song. We use a mix of features to compute danceability, including beat strength, tempo stability, overall tempo, and more. One cool thing that I’ve noticed is that remixes of songs tend to have a higher danceability score than the originals.

Here’s the distribution of danceability over all the songs we have analyzed (over 14 million):

Danceability histogram (50,000 songs)

Energy

Energy is less subjective. How energetic is the music? Does it make you want to bop all over the room, or fall into a coma? The feature mix we use to compute energy includes loudness and segment durations.

Here’s the distribution of energy over all the songs we have analyzed (over 14 million):

Energy histogram (50,000 songs)

Where’s the dance(ability) at?

Paul posted a teaser yesterday that hints at what kinds of things you can do with energy and danceability, together.

Here are the various ways you can interact with danceability and energy through the API:

  • song.search - You can specify min/max danceability & energy when searching and sort your results by ascending or descending danceability/energy.
  • track.profile - You can retrieve danceability/energy for a track.
  • audio_summary -  danceability and energy are now part of a track’s audio summary.  Any place you can specify the audio_summary will give you danceability and energy (e.g. track.upload)
  • playlisting (both static and dynamic) - You can specify min/max danceability & energy when creating a playlist  and sort the output by ascending or descending danceability/energy.
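As a rough illustration of the song.search case, the sketch below builds a query URL that asks for danceable, energetic songs sorted by descending danceability. The endpoint URL and the parameter names (`min_danceability`, `min_energy`, `sort`, `results`) are assumptions made for this example, so check them against the API documentation before relying on them; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for song.search; verify against the API documentation.
BASE_URL = "http://developer.echonest.com/api/v4/song/search"


def danceable_search_url(api_key, min_danceability=0.8, min_energy=0.6, results=10):
    """Build a song.search URL for danceable, energetic songs.

    All parameter names here are assumptions, not confirmed API fields.
    """
    params = [
        ("api_key", api_key),
        ("format", "json"),
        ("min_danceability", min_danceability),
        ("min_energy", min_energy),
        ("sort", "danceability-desc"),  # most danceable first
        ("results", results),
    ]
    return BASE_URL + "?" + urlencode(params)


url = danceable_search_url("YOUR_API_KEY")
```

The function only builds the URL and makes no network request, so it can be inspected or adjusted before being handed to whatever HTTP client you prefer.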


Attributes show how powerful and complete The Echo Nest’s analysis data is. Armed with only those JSON documents, you could make your own attribute, too. Maybe you want to implement goodness? But seriously, what are you going to do with danceability & energy for Boston Music Hackday? I can’t wait to find out.

Friday, October 15, 2010

Songbird’s first music visualizer: SongbirdVis

 

For Music Hack Day San Francisco, I teamed up with Steven Lloyd to integrate a music visualizer into Songbird. As far as I know, this is the first music visualization for Songbird. Since we didn’t have access to the raw audio through Songbird, we used The Echo Nest’s analysis data, which gives high-level, musically meaningful information to display. SongbirdVis represents pitch, timbre, loudness, rhythmic, and structural information from the analysis data, and synchronizes it with the audio. In addition to showing information about the currently playing track, SongbirdVis is interactive. You can click on part of the visualization to hear the music at that point. You can also click and drag to select a part of the visualization to zoom in on.

Analyze
To get analysis data for a track, all you need is a free Echo Nest API key. Instead of uploading the track for analysis, we use the new v4 beta search API to see if The Echo Nest has already analyzed the currently selected track. When the API is updated sometime in the next month, we’ll update the code to upload the track, or search by the track’s MD5 checksum.


The analysis data contains a lot of deep and musically interesting information about a piece of music. Bars, beats, structural sections … there’s pitch and timbre data for every perceptual event in the track. The trick for a developer is to decide how to visually represent that. 


I’ve been thinking a lot about those issues lately as a part of visualizer.fm, a project to synchronize HTML 5 audio playback of music with visualizations of Echo Nest Analysis data using processing.js (more about visualizer.fm in a separate post). I decided to port the diagnostic visualizer to work in Songbird. I thought it was a good choice since it shows the whole song at once, so it is good for both viewing the analysis for a track, and for use during playback. 

The Visualization
Songbird displaying SongbirdVis for the track “Dancing Queen” by ABBA.

Zooming in on the first section.

The timbre, pitch, and loudness features are all in terms of segments. A segment corresponds to a perceptual event (e.g. guitar note, drum hit) in a song. 


Working from top to bottom, here’s what SongbirdVis displays:

timbre
The timbre display shows the 12-dimensional timbre vector for each segment. Longer segments take up more horizontal space. The timbre vector is colored by interpreting the first 3 dimensions as RGB values.
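To make the coloring idea concrete, here is a minimal sketch that turns the first three timbre coefficients of each segment into an RGB triple. Timbre coefficients are not naturally in the 0-255 range, so this version rescales each of the three dimensions across the whole track; that normalization, and the function name, are assumptions for illustration rather than a description of what SongbirdVis does.

```python
def timbre_to_rgb(timbres):
    """Map each segment's first three timbre coefficients to an (r, g, b) tuple.

    Each of the three dimensions is min-max scaled over all segments,
    which is an assumed normalization chosen for this sketch.
    """
    firsts = [t[:3] for t in timbres]
    lows = [min(column) for column in zip(*firsts)]
    highs = [max(column) for column in zip(*firsts)]
    colors = []
    for coefficients in firsts:
        rgb = []
        for value, low, high in zip(coefficients, lows, highs):
            span = high - low
            # A dimension that never varies maps to 0 rather than dividing by zero.
            scaled = (value - low) / span if span else 0.0
            rgb.append(int(scaled * 255))
        colors.append(tuple(rgb))
    return colors


# Three made-up segments (only the first three coefficients matter here).
segments = [
    [0.0, -10.0, 5.0, 1.0],
    [50.0, 10.0, 5.0, 2.0],
    [100.0, 0.0, 5.0, 3.0],
]
colors = timbre_to_rgb(segments)
```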


pitch
The pitch display shows the 12-dimensional pitch vector for each segment. Each value corresponds to the strength of a pitch at that point. So, if a guitar plays the note G, and there is no other sound, only the bin for G would be colored in. Because of percussive and other non-pitched sounds, we see a lot of color on the pitch display. Some filtering or weighting might be in order to make this display a bit cleaner.


Pitch colors are chosen by taking a note frequency in hertz and finding the color of the corresponding wavelength of light.
pitch colors
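One plausible way to do this (an assumption for illustration, not necessarily the exact mapping used here) is to double the note’s frequency octave by octave until it reaches the frequency range of visible light, and then convert that frequency to a wavelength. The 400 THz lower bound and the function name below are choices made for this sketch.

```python
SPEED_OF_LIGHT = 299792458.0  # metres per second
VISIBLE_MIN_HZ = 400e12       # assumed lower edge of visible light (red end)


def pitch_to_wavelength_nm(frequency_hz):
    """Transpose a sound frequency up by octaves into the visible range
    and return the corresponding wavelength of light in nanometres."""
    if frequency_hz <= 0:
        raise ValueError("frequency must be positive")
    light_hz = frequency_hz
    while light_hz < VISIBLE_MIN_HZ:
        light_hz *= 2.0  # up one octave
    # The result lies in [400, 800) THz, i.e. roughly 375 to 750 nm.
    return SPEED_OF_LIGHT / light_hz * 1e9


a440 = pitch_to_wavelength_nm(440.0)  # concert A
a220 = pitch_to_wavelength_nm(220.0)  # the A one octave below
```

Because the transposition is by whole octaves, every A gets the same wavelength, which fits a display with one bin per pitch class. Turning the wavelength into an on-screen RGB color needs a further approximation that is not shown here.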

loudness
The loudness curve shows the perceptual loudness (in dB) over the course of the track. The thickness of the curve shows the difference between the loudness at the beginning of a segment (loudness start), and the maximum loudness for that segment. The white line shows the overall loudness for the track. The vertical white lines show where the analysis data has marked “end of fade in” and “start of fade out”. 

meter
The meter display shows Bars, Beats, and Tatums as blue, red, and white squares, respectively. Because the beats and tatums are so close to each other when fully zoomed out, it looks like they are thick lines.  The blue, red, and white curves show the confidence associated with each of the squares they correspond to. In Dancing Queen, you can see that the confidence associated with beats is a lot higher, on average, than the confidence associated with bars or tatums. The gray vertical bars correspond to sections, higher-level structures in the song, such as chorus and verse. This makes it easy to jump around between chorus and verse.

Songbird
The Songbird Developer Tools add-on makes it very easy to create an extension. There’s a wizard and everything. Because Songbird is based on Firefox, writing extensions is just a matter of editing some XUL and JavaScript. No C++ to compile or anything. SongbirdVis is a Media Page, another view of the current media. We’re working with XUL instead of HTML, so although things are familiar, some code changes are necessary to get things to work. The visualizer is tied in to Songbird in a couple of ways:

  • Sync - SongbirdVis receives an updated timestamp from Songbird as the track plays, allowing the visualizer to match what is playing.
  • Seek - Clicking on SongbirdVis allows you to hear the track at the place you clicked by setting the currently playing position in the track.
  • Track changed - When a new track starts playing, SongbirdVis queries the Echo Nest API to display the visualization for the new track.
  • Resizing - When the Songbird window is resized, SongbirdVis resizes itself.


Processing.js
Processing.js is an awesome port of the Processing.org language to JavaScript. It’s not complete, and it’s moving fast, but it’s a lot of fun to use, and works in the browser without any need for plugins. It’s a great framework for quickly developing Echo Nest visualizers, as I’ve found out in my work on visualizer.fm. Steve took on the heroic task of updating processing.js (which I use to do all of the drawing) to work inside XUL. This allowed me to avoid writing custom code for Songbird: the diagnostic visualizer used in SongbirdVis is written against processing.js, which means it can run anywhere processing.js can.


Future Work
SongbirdVis needs the Echo Nest to update the v4 search API and track upload API before it can be released for general use. There are also a couple of things that remain to be fixed in our hacked-up copy of processing.js. In the meantime, the code is available on GitHub for your forking pleasure.
SF Music Hack Day was a great experience, with loads of smart developers doing cool things. You should definitely check out the full list of other hacks, here. My favorite was Leonard’s Set Summary. The best part about it, for me, was that it uses Capsule. It’s a real joy to see people make cool stuff with code I’ve developed.

Wednesday, June 2, 2010

Earworm and Capsule

Over the past month, Tristan and I have been hard at work adding some new features to The Echo Nest Remix API, which is now at version 1.3. Here’s what’s new:

  • cloud.py - functionality to search for analyzed tracks to remix. No need to have your own audio.
  • pydirac - a new, great-sounding time-stretcher, which is stereo, and sample accurate. It’s a Python wrapper around a C module that makes use of the Dirac LE library.
  • action.py and cAction - crossfade, crossmatch, fadein, fadeout, and jump, with C implementations of core functionality for speed.

We’ve developed two cool new examples to show off the new functionality: Earworm and Capsule.

Earworm

Earworm (thanks to Jonathan Feinberg for the name) makes it possible to extend or shrink a song to any length you might desire, without changing the tempo. It does this by constructing a network graph of the piece, using The Echo Nest’s analysis data. Each node in the graph is a beat in the song, and an edge exists between two nodes if the two beats, and the several beats that follow them, sound similar (close in timbre and pitch).

Song Graph

    One possible rendering of a song graph.

The graph shows us where we can make seamless transitions between different parts of the song. Stretching (or shrinking) the song is then just a matter of minimizing the number of “loop” points to reach a requested duration.
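The edge rule described above can be sketched in a few lines. This is a toy version under stated assumptions: each beat is represented by a plain list of numbers (standing in for its timbre and pitch features), similarity is Euclidean distance below a threshold, and the names `similar_beat_edges`, `lookahead`, and `threshold` are hypothetical. The real earworm example will differ in its details.

```python
import math


def beat_distance(a, b):
    """Euclidean distance between two beat feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similar_beat_edges(features, lookahead=4, threshold=0.5):
    """Return (i, j) pairs, i < j, where beat i and beat j sound alike
    and so do the `lookahead` beats that follow each of them."""
    n = len(features)
    edges = []
    for i in range(n - lookahead):
        for j in range(i + 1, n - lookahead):
            if all(beat_distance(features[i + k], features[j + k]) <= threshold
                   for k in range(lookahead + 1)):
                edges.append((i, j))
    return edges


# A made-up 12-beat "song" whose features repeat every 4 beats.
song = [[float(beat % 4)] for beat in range(12)]
edges = similar_beat_edges(song, lookahead=2, threshold=0.1)
```

Every edge is a place where playback could jump from beat i to beat j (or back) without an audible seam, so lengthening or shortening the song becomes a path-finding problem over these edges.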

Tristan took If I Ever Feel Better by Phoenix and made If I Ever Feel Longer, seamless 10 and 60-minute renditions of that track using earworm. He also made a track I call I Can’t Get Any Shorter, the shortest path through the song with reasonable transitions (about 25% as long as the original). And if you just can’t get enough of If I Ever Feel Better, you can have If I Ever Feel Better Forever, a 200-second version of the song that ends where it starts. If you loop it, you can listen to the track nonstop, seamlessly, forever.

It’s easy to make your own earworm, even without audio. Install beta pyechonest, install remix, and cd to the earworm example:

    > python earworm.py INXS 'Need you Tonight'

Wait a moment for the audio and analysis to download, and before you know it, you’ll have a 10-minute version of ‘Need you Tonight’ by INXS. What you do next is up to you …

Capsule

Capsule takes a list of tracks, and automatically arranges and mixes them together. You can specify the transition time, and how long to hear each song. First, it searches for the best place to make a transition and aligns beats. Because the songs may have different tempos, it uses one of the new actions, Crossmatch, which simultaneously cross-fades and beat-matches audio in the transitions. Crossmatch uses pydirac, the new time stretcher, to smoothly speed up and slow down the two songs’ tempos. Crossmatch is an incredibly powerful tool for making great-sounding transitions. So if you pass in two songs, with the default parameters, you’ll get 8 seconds of the first one, an 8-second crossmatch, 8 seconds of the second song, and a 6-second fade-out.
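For a back-of-the-envelope check of how long a capsule will run, the two-song example adds up to 8 + 8 + 8 + 6 = 30 seconds. The sketch below generalizes that arithmetic to any number of tracks; the generalization (one transition between each pair of neighbors, one fade-out at the end) and the function and parameter names are assumptions for illustration, not Capsule’s actual code.

```python
def capsule_duration(num_tracks, inter=8.0, transition=8.0, fade_out=6.0):
    """Estimated capsule length in seconds.

    inter      -- seconds of each song heard on its own
    transition -- seconds of each crossmatch between neighboring songs
    fade_out   -- seconds of fade-out after the last song
    """
    if num_tracks < 1:
        raise ValueError("need at least one track")
    return num_tracks * inter + (num_tracks - 1) * transition + fade_out


two_songs = capsule_duration(2)   # 8 + 8 + 8 + 6
five_songs = capsule_duration(5)  # 5 * 8 + 4 * 8 + 6
```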

Hear it transition between two Bob Marley songs (Jamming -> Everything is Gonna Be Alright).

The capsules render pretty quickly, due to cAction, the Python C Extension that handles some of the most computationally intensive code. So why stop at two songs? You can pass as many as you like. 

Try it out, and post your results. I can’t wait to hear them. I’ll leave you with some examples to get those creative juices flowing:

Bonus Earworms:

Bonus Capsules:

Friday, May 14, 2010

streamgraph.js

Streamgraphs are cool. They’re great at displaying trends in data over time, similar to a stacked graph, but much prettier. The first one I saw was Lee Byron’s Last.fm listening history graphic, a beautiful poster showing trends in the music he had listened to over the course of two years. The New York Times used an interactive streamgraph (created by Matthew Bloch and Shawn Carter) to great effect to show box office receipts over 22 years in The Ebb and Flow of Movies.

When Lee Byron & Martin Wattenberg open-sourced their streamgraph implementation in Java and Processing, I was pretty excited. I’ve been playing around with processing.js a lot lately, so I decided to port their code. 

I’ve posted the code on github. The algorithms are much the same as Byron & Wattenberg’s, but I’ve added code to make the graphs more interactive and easily configurable. 

The major things I added are:

  • Dynamic sizing (sized to fit the Canvas element it’s on)
  • Settings. These can be dynamically changed, and easily set outside of the streamgraph code.
  • Zoom. You can click and drag to zoom in, and press Escape to zoom out.
  • Hover. The selected layer changes color, and the active layer name and data point is displayed at the top of the screen.
  • Another color option that offers a bit more variation.
  • Examples (including the original beautiful examples).

Streamgraph.js doesn’t quite have feature parity with the interactive Flash streamgraph shown in The Ebb and Flow of Movies, but I would like it to get there. I’m waiting for the text handling in processing.js to catch up to what’s available in Processing. The processing.js team is moving fast, so I think it will happen soon.

Here are some screenshots of streamgraph.js in action. Full source for these examples is on GitHub.

Late Onset with NYT colors

Byron & Wattenberg’s Late Onset example.

 

Byron & Wattenberg’s Believable example, using a more varied color set.

last.fm listening history, top 15 artists.

@reiddraper’s last.fm listening data for top 20 artists over 2 years.

Pitch data from part of the Aria from Bach's Goldberg Variations

Pitch data from part of the Goldberg Variations, obtained using The Echo Nest’s Song API. Note the legend and the custom coloring: a mapping of pitch frequencies onto wavelengths of light.

Get the code here: streamgraph.js on github


Sunday, May 2, 2010