In this project, Dan Sola and I explored applications in speaker diarisation and speaker recognition, and looked into which methods are currently popular for solving these problems. We found that diarisation is still quite vulnerable to noise in the data and often fails to identify the correct number of speakers. We therefore focused on speaker recognition methods and dug into the work done by the VGG group at Oxford. We picked their state-of-the-art CNN training code, which learns an embedding space whose metric encodes perceptual voice similarity. We then modified the code slightly to fit our purpose: we kept the pretrained weights for the convolutional layers frozen and trained a couple of outer layers for a voice-to-person classification problem. We used the VGGVox v1 dataset, which contains audio recordings for over 1,000 people, as well as links to their headshot images. That allowed us to turn our little project into a small Flask app that connects to your microphone, records your speech, and shows you who you sounded like over time. If you don't have immediate access to a microphone, you can explore examples run on out-of-sample recordings downloaded from YouTube.
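The transfer-learning setup (frozen pretrained feature extractor, small trainable classification head) can be sketched in pure numpy. This is not the actual VGGVox code: the "frozen conv stack" below is a fixed random projection, and all shapes, speaker counts, and the learning rate are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes only: 100-d "spectrogram features", a 32-d embedding,
# 5 speakers, 200 utterances. The real project used the VGGVox conv stack.
n_speakers, dim_in, dim_emb, n_utts = 5, 100, 32, 200

# Stand-in for the frozen pretrained conv layers: a fixed random projection
# followed by a ReLU. Its weights are never updated during training.
W_frozen = rng.normal(size=(dim_in, dim_emb)) / np.sqrt(dim_in)

def frozen_embedding(x):
    return np.maximum(x @ W_frozen, 0.0)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Toy dataset: each speaker gets a distinct mean feature vector.
means = 2.0 * rng.normal(size=(n_speakers, dim_in))
y = rng.integers(0, n_speakers, size=n_utts)
X = means[y] + 0.5 * rng.normal(size=(n_utts, dim_in))

# Train only the classification head (a single softmax layer) with plain
# gradient descent; the embedding layer stays frozen throughout.
E = frozen_embedding(X)
W_head = np.zeros((dim_emb, n_speakers))
for _ in range(500):
    P = softmax(E @ W_head)
    P[np.arange(n_utts), y] -= 1.0   # gradient of cross-entropy wrt logits
    W_head -= 0.1 * (E.T @ P) / n_utts

acc = float((softmax(E @ W_head).argmax(axis=1) == y).mean())
print(f"training accuracy of the head: {acc:.2f}")
```

In the real app the head's top prediction per recorded window is what drives the "who you sounded like over time" display.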
The project was quite a challenging one. Here’s a non-exhaustive list of issues we had to resolve:
- Dependencies: getting pyannote and pyBK to work together. Not a big deal, you'd think, until you actually start digging through GitHub issues on how to set up those dependencies under Windows.
- Diarisation hyperparameters: the set of parameters (window size, maximum number of clusters, initial number of clusters) makes diarisation quite sensitive to the dataset, which is why we did not focus on this part.

Here's a quick demo video:
Project link: https://github.com/AntonBiryukovUofC/diurisation-pet-proj
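As a postscript, the hyperparameter sensitivity we ran into in the diarisation step is easy to reproduce with a toy experiment. The sketch below is not pyBK's actual algorithm: it uses a simple single-linkage clustering over made-up 2-D "segment embeddings" to show how the estimated number of speakers swings with a single distance threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "segment embeddings": three speakers, ten segments each, in 2-D.
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X = np.vstack([c + 0.4 * rng.normal(size=(10, 2)) for c in centers])

def count_clusters(points, threshold):
    """Single-linkage clustering: link any two segments closer than
    `threshold`, then count connected components."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    adj = d < threshold
    labels = np.full(len(points), -1)
    k = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue
        stack = [i]
        while stack:  # flood-fill one connected component
            j = stack.pop()
            if labels[j] != -1:
                continue
            labels[j] = k
            stack.extend(np.flatnonzero(adj[j] & (labels == -1)).tolist())
        k += 1
    return k

# The estimated speaker count depends heavily on one threshold parameter:
counts = {t: count_clusters(X, t) for t in (0.5, 2.0, 6.0)}
print(counts)
```

A too-tight threshold splits one speaker into many, a too-loose one merges everyone into a single cluster, which mirrors why our diarisation often failed to find the correct number of speakers on noisy data.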