Kiko Android is an app that classifies sounds and localizes their sources. It is a riff on Kiko Glasses, my 2015 engineering capstone team’s project.
This app localizes sounds by measuring the delay between a sound's arrival at two microphones. That delay narrows the source's location down to a surface in 3D space (one sheet of a hyperboloid with the microphones as its foci). Unfortunately, that surface contains infinitely many points: pinpointing a single location in 3D space requires more microphones (Kiko Glasses had four, whereas phones tend to have at most two). Luckily, we can work around this limitation by making some assumptions:
- Assume the user only cares about the direction (a unit vector) the sound is coming from, not its distance.
- Assume all sound originates in the phone screen's plane.
- Assume all sound will come from somewhere in front of the phone, as opposed to behind it.
These assumptions allow us to usefully tell the user "where" a sound is coming from: a vector lying in the phone screen's plane, pointing away from the user, and tangential to the surface of possible source locations.
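Under those assumptions, the direction estimate reduces to a classic two-microphone time-difference-of-arrival calculation: cross-correlate the two channels, convert the best-aligning lag into a time delay, and map the delay to an angle with the far-field approximation sin(θ) = c·Δt / d. The sketch below is a minimal illustration of that idea, not the app's actual code; the sample rate, microphone spacing, and speed of sound are assumed values.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature (assumption)
MIC_SPACING = 0.14      # m between the two microphones (hypothetical value)
SAMPLE_RATE = 48000     # Hz (assumption)

def estimate_direction(left, right):
    """Estimate the source angle (radians, 0 = broadside) from two mic signals.

    Cross-correlates the channels to find the lag (in samples) at which they
    best align, converts it to a time delay, then applies the far-field
    approximation sin(theta) = c * delay / d.
    """
    corr = np.correlate(left, right, mode="full")
    # Positive lag means `left` is a delayed copy of `right`,
    # i.e. the sound reached the right microphone first.
    lag = np.argmax(corr) - (len(right) - 1)
    delay = lag / SAMPLE_RATE
    # Clamp before arcsin: delays longer than d / c are physically impossible
    # for a single source and indicate noise.
    s = np.clip(SPEED_OF_SOUND * delay / MIC_SPACING, -1.0, 1.0)
    return np.arcsin(s)
```

In practice a phase-weighted correlation such as GCC-PHAT is more robust to reverberation than the plain cross-correlation shown here, but the geometry is the same.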
Classification and the app itself are based on TensorFlow's Speech Commands Demo. It uses a simple spectrogram classifier based on this paper. I trained the network on the UrbanSound8K dataset plus speech files sampled from the Speech Commands dataset.
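The front end of such a classifier turns raw audio into a log-magnitude spectrogram before the network ever sees it. A minimal NumPy sketch of that step is below; the frame length, hop, and log floor are hypothetical values for illustration, not the demo's actual parameters.

```python
import numpy as np

def log_spectrogram(samples, frame_len=480, hop=160):
    """Compute a log-magnitude spectrogram from a 1-D audio signal.

    Hypothetical setup: 30 ms frames with a 10 ms hop at 16 kHz,
    Hann-windowed, log-compressed to stabilize the dynamic range.
    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(samples) - frame_len) // hop
    frames = np.stack(
        [samples[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitudes + 1e-6)  # small floor avoids log(0)
```

The resulting 2-D array is treated like an image, which is why a small convolutional network works well as the classifier on top of it.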