Compiling a British Accent Speech Dataset for Machine Learning

  • OscarVanL
  • Nov 13, 2020
  • 2 min read

Recently I've been training Machine Learning models centring around human speech; this area of Computer Science encompasses Automatic Speech Recognition (ASR), Text-to-Speech (TTS), Natural Language Processing (NLP), and Natural Language Understanding (NLU). Each of these frontiers will be necessary to, one day, create a general artificial intelligence.


These technologies have become a part of everyone's life, from the personal assistants in their car, phone, and home, to automated transcriptions on YouTube videos, to the deep fakes that threaten to spawn a new era of fake news.


I have a Southern British English accent, and one of my biggest frustrations has been the difficulty of finding a good British speaker dataset. This scarcity leads to worse ASR accuracy, less personal TTS voices, and models that only generalise to the (usually American) accents they are trained on. Even my Google Home can struggle to differentiate between me saying "Light On" and "Light Off".

Photograph: channel5

As we know from the biological neural networks inside American toddlers watching Peppa Pig, using British speakers in our dataset will help a model generalise to British speakers!


For any effective Machine Learning model, you must start with a good dataset that resembles the data the model will see at inference.

The Dataset


One of the most popular English language corpuses (corpora?, corpi?) used in speech processing is LibriTTS.

LibriTTS is a multi-speaker English corpus of approximately 585 hours of read English speech at a 24 kHz sampling rate, designed for TTS research.

LibriTTS is derived, via the LibriSpeech corpus, from the LibriVox project, a volunteer effort that records public-domain books as freely available audiobooks.

For a long time I thought LibriTTS included only American speakers, but I came across the LibriVox accents table, and the blog of Ruth Golding, a LibriVox volunteer who compiled a list of British readers within LibriVox!


With the help of these resources, I compiled a list of 85 British English speakers contained within LibriTTS, totalling 23 hours, 33 minutes of transcribed speech!
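As a rough illustration of that compilation step, here is a minimal Python sketch of filtering a speaker table down to a hand-picked list of IDs and totalling their speech. The `Speaker` record, the helper names, and every ID and duration below are made up for the example; they are not the real LibriTTS metadata format or my actual speaker list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    """Hypothetical per-speaker record, not the real LibriTTS schema."""
    speaker_id: int
    gender: str     # "M" or "F"
    minutes: float  # total transcribed speech for this speaker


def select_speakers(speakers, wanted_ids):
    """Keep only the speakers whose ID appears in wanted_ids."""
    wanted = set(wanted_ids)
    return [s for s in speakers if s.speaker_id in wanted]


def format_duration(total_minutes):
    """Render a number of minutes as 'H hours, M minutes'."""
    hours, minutes = divmod(round(total_minutes), 60)
    return f"{hours} hours, {minutes} minutes"


# Made-up rows for illustration only.
all_speakers = [
    Speaker(101, "F", 25.5),
    Speaker(102, "M", 40.0),
    Speaker(103, "M", 12.5),
    Speaker(104, "F", 30.0),
]

# Hypothetical IDs, standing in for a list built from the accents table.
british_ids = [101, 103, 104]

british = select_speakers(all_speakers, british_ids)
total = sum(s.minutes for s in british)
print(len(british), "speakers,", format_duration(total))
```

In practice the slow part is building the ID list by cross-referencing readers by hand; the filtering itself is trivial once that list exists.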

Unfortunately, this dataset is not perfect: 59 of these speakers are male and just 26 are female. A large proportion have Southern English accents.


Scottish, Welsh, and Irish people have less luck, with 2, 2, and 3 speakers respectively. This is because there are fewer LibriVox readers with these accents (I'm afraid you're still going to get stuck in voice-activated elevators).
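It is worth checking this kind of imbalance before training on any speaker subset. The sketch below tallies a list of labels and reports each label's share. The `tally` helper is a name invented for this example, and the only real numbers used are the 59 male and 26 female speakers counted in this dataset.

```python
from collections import Counter


def tally(labels):
    """Map each label to (count, share of the total)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}


# Gender labels for the 85 British speakers in this dataset.
genders = ["M"] * 59 + ["F"] * 26

for label, (n, share) in tally(genders).items():
    print(f"{label}: {n} speakers ({share:.0%})")
```

The same function works for accent labels, which makes it easy to see which groups would need more speakers or some resampling.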


The dataset can be found on my GitHub Repository here.


This problem raises an interesting question: As speech recognition becomes more ingrained in our lives, will people with regional accents be left behind?

 
 
 
