In this project, Dan Sola and I explored applications in speaker diarisation and speaker recognition, and looked into which methods are currently popular for solving these problems. We found that diarisation is still quite vulnerable to noise in the data and often fails to identify the correct number of speakers. We therefore focused on speaker recognition methods and dug into the work done by the VGG group at Oxford. We picked their state-of-the-art CNN training code, which learns an embedding space whose metric encodes perceptual voice similarity. We then modified the code slightly to fit our purpose: we kept the pretrained weights for the convolutional layers frozen and trained a couple of outer layers for a voice-to-person classification problem. We used the VGGVox v1 dataset, which contains audio recordings for over 1,000 people, as well as links to their headshot images. That allowed us to turn our little project into a small Flask app that connects to your microphone, records your speech, and shows you who you sounded like over time. If you don't have immediate access to a microphone, you can explore examples run on out-of-sample recordings downloaded from YouTube.
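The transfer-learning setup (frozen pretrained feature extractor, small trainable classification head) can be sketched in pure numpy. This is not the actual VGGVox code: the "frozen conv stack" below is a fixed random projection, and all shapes, speaker counts, and the learning rate are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes only: 100-d "spectrogram features", a 32-d embedding,
# 5 speakers, 200 utterances. The real project used the VGGVox conv stack.
n_speakers, dim_in, dim_emb, n_utts = 5, 100, 32, 200

# Stand-in for the frozen pretrained conv layers: a fixed random projection
# followed by a ReLU. Its weights are never updated during training.
W_frozen = rng.normal(size=(dim_in, dim_emb)) / np.sqrt(dim_in)

def frozen_embedding(x):
    return np.maximum(x @ W_frozen, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy dataset: each speaker gets a distinct mean feature vector.
means = 2.0 * rng.normal(size=(n_speakers, dim_in))
y = rng.integers(0, n_speakers, size=n_utts)
X = means[y] + 0.5 * rng.normal(size=(n_utts, dim_in))

# Train only the classification head (a single softmax layer) with plain
# gradient descent; the embedding layer stays frozen throughout.
E = frozen_embedding(X)
W_head = np.zeros((dim_emb, n_speakers))
for _ in range(500):
    P = softmax(E @ W_head)
    P[np.arange(n_utts), y] -= 1.0   # gradient of cross-entropy wrt logits
    W_head -= 0.1 * (E.T @ P) / n_utts

acc = float((softmax(E @ W_head).argmax(axis=1) == y).mean())
print(f"training accuracy of the head: {acc:.2f}")
```

In the real app the head's top prediction per recorded window is what drives the "who you sounded like over time" display.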
The project was quite a challenging one. Here’s a non-exhaustive list of issues we had to resolve:
- Dependencies: getting pyannote and pyBK to work together. Not a big deal, you'd think, until you actually start digging through GitHub issues on how to set up those dependencies under Windows.
- Diarisation hyperparameters: the set of parameters (window size, maximum number of clusters, initial number of clusters) makes diarisation quite sensitive to the dataset, which is why we did not focus on this part.

Here's a quick demo video:
Project link: https://github.com/AntonBiryukovUofC/diurisation-pet-proj
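As a postscript, the hyperparameter sensitivity we ran into in the diarisation step is easy to reproduce with a toy experiment. The sketch below is not pyBK's actual algorithm: it uses a simple single-linkage clustering over made-up 2-D "segment embeddings" to show how the estimated number of speakers swings with a single distance threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "segment embeddings": three speakers, ten segments each, in 2-D.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + 0.4 * rng.normal(size=(10, 2)) for c in centers])

def count_clusters(points, threshold):
    """Single-linkage clustering: link any two segments closer than
    `threshold`, then count connected components."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adj = d < threshold
    labels = np.full(len(points), -1)
    k = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        stack = [i]
        while stack:  # flood-fill one connected component
            j = stack.pop()
            if labels[j] != -1:
                continue
            labels[j] = k
            stack.extend(np.flatnonzero(adj[j] & (labels == -1)).tolist())
        k += 1
    return k

# The estimated speaker count depends heavily on one threshold parameter:
counts = {t: count_clusters(X, t) for t in (0.5, 2.0, 6.0)}
print(counts)
```

A too-tight threshold splits one speaker into many, a too-loose one merges everyone into a single cluster, which mirrors why our diarisation often failed to find the correct number of speakers on noisy data.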