November 12, 2014
This post is part of a series from Vlad Seghete about his onboarding project, titled Honking Vocabulary.
I have been working on a project here at Datascope where I record sounds from outside our office to detect, analyze, and categorize car honks. It all started with an initial brainstorm and, in this post, I show how I began to gain some insight into the “vocabulary” (verbalized through their horns) used by motorists in downtown Chicago.
Before we jump in and blindly use black box algorithms, we should first conduct some exploratory analysis on the data. How can we visualize sound over time? Wouldn’t it be nice to “write down” sound in a way that we can look at and quickly interpret? It turns out musicians have been doing just this for a very, very long time. It is called musical notation, something you’ve likely seen before.
For example, take this Minuet in G Major by Bach and its corresponding musical notation:
The position of the symbols on the scale represents the note being played, while the type of symbol represents the length of that note. Another way of representing the same piece of music would be to linearly map the x-axis to time and explicitly show the duration of the notes, like so:
The above is a screenshot from GarageBand of what it calls the “piano roll”. Most other music editing software offers a similar view alongside standard musical notation.
Standard musical notation splits an octave into twelve half steps, which results in the representations shown above. Unlike notes on a piano, where each key generates a distinct frequency, sound in general is a mish-mash of many frequencies. Like notes on the piano, some of these frequencies are played louder (forte) than others (piano); that is, they have a larger amplitude. We can further improve on the piano roll view by visually representing the amplitude with color. This is what the Minuet looks like:
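The twelve-half-step split follows a simple rule: in equal temperament, each half step multiplies the frequency by the twelfth root of two, with A4 conventionally pinned at 440 Hz. A minimal sketch of that mapping (the function name is ours, not from the project code):

```python
def note_frequency(semitones_from_a4):
    """Frequency of a note n half steps away from A4 (440 Hz),
    using twelve-tone equal temperament: f = 440 * 2**(n/12)."""
    return 440.0 * 2.0 ** (semitones_from_a4 / 12.0)

print(note_frequency(0))    # A4: 440.0
print(note_frequency(12))   # one octave up: 880.0
print(note_frequency(-2))   # ~392.0, the G below A4
```

Going up an octave (twelve half steps) doubles the frequency, which is why the piano roll's vertical axis is effectively logarithmic in frequency.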
This is known as a spectrogram, or the short-time Fourier transform of the sound signal. I’ll let Wikipedia describe the mathematics behind it and how it can be computed automatically for any piece of sound. The main point is that the x-axis is time, the y-axis corresponds to frequency (or note), and the brightness encodes the amplitude (roughly, the volume) at each of those frequencies. Can you tell where the individual notes start and end? Isn’t this cool? We can also change the instrument that plays the above score, so as to make it sound more or less like a real piano. And guess what, we can do this for any sound. You can try it for yourself!
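For a rough sketch of how such a spectrogram can be computed, here is a toy example using `scipy.signal.spectrogram` on a synthetic two-note signal (the sample rate and note choices are illustrative, not taken from our recordings):

```python
import numpy as np
from scipy.signal import spectrogram

fs = 8000  # sample rate in Hz (an assumption for this toy example)
t = np.arange(0, 1.0, 1.0 / fs)

# One second of G4 (392 Hz) followed by one second of A4 (440 Hz)
tone = np.concatenate([np.sin(2 * np.pi * 392.0 * t),
                       np.sin(2 * np.pi * 440.0 * t)])

# Short-time Fourier transform: slice the signal into windows and
# take an FFT of each, yielding a (frequency x time) grid of power
freqs, times, Sxx = spectrogram(tone, fs=fs, nperseg=1024)

# The brightest row in each column is the dominant note at that time
dominant = freqs[Sxx.argmax(axis=0)]
print(dominant[0], dominant[-1])  # near 392 Hz, then near 440 Hz
```

The `nperseg` window length trades off time resolution against frequency resolution, the same trade-off that determines how crisp the note boundaries look in the images above.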
Enough with the minuets already (although the interested reader might like to know that musicians like Venetian Snares put images of kittens in their songs’ spectrograms!). The noise we are really interested in comes from car horns on the street. That’s a pretty different environment from the digitally generated version of Bach’s minuet. Will this technique work in the messy real world? Here are two audio clips recorded with a cheap microphone hanging out of our office window:
Wait a second! Honk B has way more lines than honk A! What’s the deal? The two definitely sound similar enough, so why don’t they look the same? The answer lies in harmonics, which is something we will investigate next time, when we will also be introducing the chromagram! (Hint: compare the two versions of the minuet in this post, in particular when G is played.) In the meantime, we encourage you to check out the code on nbviewer or to fork your own copy of our project on github.
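As a small preview of why a richer sound draws more lines, here is a toy comparison (the fundamental frequency and harmonic mix are invented for illustration): a pure sine tone produces a single spectral line, while a tone built from integer multiples of a fundamental, as real horns tend to be, stacks several.

```python
import numpy as np

fs = 8000                        # sample rate in Hz (assumed)
t = np.arange(0, 1.0, 1.0 / fs)
f0 = 400.0                       # made-up, roughly horn-like fundamental

# A pure sine vs. a tone with harmonics at 2x, 3x, 4x the fundamental
pure = np.sin(2 * np.pi * f0 * t)
rich = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 5))

def strong_lines(x):
    """Frequencies whose FFT magnitude is at least 10% of the peak."""
    mags = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return freqs[mags >= 0.1 * mags.max()]

print(strong_lines(pure))  # one line: [400.]
print(strong_lines(rich))  # four lines: [400. 800. 1200. 1600.]
```

Both tones share the same fundamental, so they sound like the “same note”, yet their spectrograms show a different number of horizontal lines, just like our two honks.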