Running with Data

fun with data, music, and visualization

This project is maintained by jsundram

‘W’ Considered Harmful

Not the magazine and not even the former president. But the letter ‘W’ itself. The letter ‘W’, 23rd in the English alphabet, is unique in two ways: it is the only letter whose name is more than one syllable, and also the only letter whose name doesn’t include the sound it makes.

The fact that ‘W’ takes 3 syllables to say bothers me. Even Wikipedia’s entry on ‘W’ points out, twice, that the abbreviation www requires nine syllables to say. Crazy. So I wondered, how often is it the case that words that start with W (hereafter W-words) have fewer syllables than the letter W (double-yew)?

Syllabification in general is a hard problem in English, but fortunately I don’t have to solve it. The Carnegie Mellon University (CMU) Pronouncing Dictionary provides the pronunciations for over 125,000 words. I say pronunciations, plural, because words can be pronounced in a variety of different ways (e.g. fire can be pronounced in two syllables, to rhyme with higher, or in a single syllable). Only 41 W-words in the CMU dict have pronunciations with different numbers of syllables (e.g. warrior). Using the CMU Pronouncing Dictionary, it’s possible to count syllables in a word with a short (if cryptic) Python function, courtesy of Jordan Boyd-Graber (I found it on the nltk-users Google group):

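from nltk.corpus import cmudict  # needs a one-time nltk.download('cmudict')

dictionary = cmudict.dict()  # Get the CMU Pronouncing Dictionary

def nsyl(entry):
    """Return the max syllable count in the case of multiple pronunciations."""
    # Each pronunciation is a list of phonemes; vowel phonemes end in a
    # stress digit (0, 1, or 2), so counting those counts syllables.
    return max(len([s for s in p if s[-1].isdigit()]) for p in entry)

word = "Warrior"
entry = dictionary[word.lower()]  # keys are lowercase; value is a list of pronunciations
print(nsyl(entry))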
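
To make the stress-digit trick concrete, here is a minimal sketch that runs without NLTK. The two entries are hand-typed in the dictionary's ARPAbet style (an assumption; the real entries may differ), and `syllable_counts` is a hypothetical helper, not part of the original snippet. It returns every distinct syllable count for a word, which is one way to spot words like fire whose pronunciations disagree.

```python
# Hand-typed entries in the CMU dictionary's format (assumed, not copied from
# the dictionary): each word maps to a list of pronunciations, and each
# pronunciation is a list of ARPAbet phonemes. Vowels carry a stress digit.
toy_dictionary = {
    "fire": [["F", "AY1", "ER0"], ["F", "AY1", "R"]],
    "wind": [["W", "IH1", "N", "D"], ["W", "AY1", "N", "D"]],
}


def syllable_counts(entry):
    """Return the sorted distinct syllable counts across all pronunciations."""
    return sorted({sum(1 for s in p if s[-1].isdigit()) for p in entry})


def nsyl(entry):
    """Same rule as the function above: take the largest count."""
    return max(syllable_counts(entry))


# Words whose pronunciations disagree on the number of syllables.
ambiguous = [w for w, e in toy_dictionary.items() if len(syllable_counts(e)) > 1]

for word, entry in toy_dictionary.items():
    print(word, syllable_counts(entry), nsyl(entry))
print("ambiguous:", ambiguous)
```

Taking the max, as `nsyl` does, is the conservative choice here: it gives each word its longest pronunciation, and only the handful of words with disagreeing pronunciations are affected.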

So, now that we’ve got a syllable counter, let’s get all the W-words in the CMU dictionary, and see what the syllable distribution looks like.

import pylab 

src = "The Carnegie Mellon Pronouncing Dictionary [cmudict.0.6]"
w_words = {w: nsyl(entry) for (w, entry) in dictionary.items() if w[0] == 'w'}

# Pass a list: some matplotlib versions do not accept a dict view directly.
pylab.hist(list(w_words.values()), align='left')
fig = pylab.gcf()
pylab.xlabel("Number of Syllables")
pylab.ylabel("Count")
pylab.title("Number of Syllables in {:,} words starting with 'W'".format(len(w_words)))
pylab.figtext(0.99, 0.01, 'Data Source: ' + src, ha='right', c=fig.get_edgecolor())
pylab.savefig('w.png', facecolor=fig.get_facecolor())

# Abbreviating to 'W' (3 syllables) only saves syllables for words longer than that.
worth_abbreviating = [(w, n) for (w, n) in w_words.items() if n > 3]

[Figure: histogram of the number of syllables in words starting with 'W']
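
For exact numbers rather than bar heights, the same distribution can be tallied with `collections.Counter`. This is only a sketch on a made-up stand-in for `w_words` (the words and counts below are illustrative assumptions, not dictionary output), and `tally` and `share_above` are hypothetical helper names.

```python
from collections import Counter

# Made-up stand-in for the w_words dict built above: word -> syllable count.
toy_w_words = {"we": 1, "water": 2, "window": 2, "wonderful": 3, "workaholic": 4}


def tally(words):
    """Count how many words have each syllable count, ordered by syllable count."""
    return dict(sorted(Counter(words.values()).items()))


def share_above(words, threshold=3):
    """Fraction of words with more syllables than `threshold`."""
    return sum(1 for n in words.values() if n > threshold) / len(words)


print(tally(toy_w_words))
print("{:.1%}".format(share_above(toy_w_words)))
```

Swapping `toy_w_words` for the real `w_words` should give the per-syllable counts behind the histogram.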

Only 101 W-words in the CMU dictionary (of 3,805 total W-words) have more than 3 syllables. That’s about 2.7%. Here’s a sampling of the words where using W to abbreviate them actually saves syllables: wagnerian, wallpapering, washingtonians, weatherperson, workaholic. So, by all means, call a meeting of the Wagnerian Wallpapering Workaholic Weatherperson Washingtonians the WWWWW. It will save time. Otherwise, consider not using an abbreviation. Or looking for synonyms.
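
The arithmetic behind that advice is simple enough to sketch. The per-word syllable counts below are hand counts (an assumption, not values read from the dictionary), and `syllables_saved` is a hypothetical helper. Each spoken 'W' costs 3 syllables, so an initialism only helps when the words it replaces average more than 3.

```python
W_SYLLABLES = 3  # "dou-ble-you"

# Hand-counted syllables for the sample words (assumed values).
phrase = {
    "wagnerian": 4,
    "wallpapering": 4,
    "workaholic": 4,
    "weatherperson": 4,
    "washingtonians": 5,
}


def syllables_saved(counts):
    """Syllables saved by saying one 'W' per word instead of the full words."""
    counts = list(counts)
    return sum(counts) - W_SYLLABLES * len(counts)


print(syllables_saved(phrase.values()))  # full phrase vs. "WWWWW"
print(syllables_saved([1, 1, 1]))        # three one-syllable words, like "world wide web"
```

A negative result means the abbreviation takes longer to say than the words it stands for.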

Suggestions for further work:


Originally published: Mon, 28 Feb 2011 to https://runningwithdata.tumblr.com/post/3576752158