Google Releases WAXAL, an African Language Speech Dataset: Supports Training Automatic Speech Recognition and Text-to-Speech Models
Automatic speech recognition (ASR) and text-to-speech (TTS) technologies have advanced rapidly for high-resource languages, but low-resource languages, including many African languages, have lagged behind because of data scarcity. To address this, a collaborative research team working in partnership with Google has released WAXAL, a multilingual speech dataset covering 24 African languages. WAXAL consists of an ASR component built from natural speech and a TTS component built from high-quality studio recordings. The release is expected to contribute significantly to African language speech technology.
Conventional dataset construction methods often overlook the differing requirements of ASR and TTS. For example, a dataset designed for robust speech recognition in noisy environments may be unsuitable for building high-quality TTS models. The WAXAL team took this into account and collected the two kinds of data separately: ASR data from many speakers in natural environments, and TTS data from single speakers recorded in a clean, consistent environment.
ASR Data Collection Method: Utilizing Image Prompts for Natural Expression
The ASR component of WAXAL was built using an image prompt-based collection method. Speakers were asked to describe a presented image in their native language, which is a far more natural setting than reading a script aloud. Recordings were made in the speakers’ own environments so that the data reflects a range of acoustic conditions. Metadata such as the speaker’s age, gender, native language, and recording environment was captured alongside each recording. Local language experts transcribed approximately 10% of the collected audio. Local scripts were used when available; otherwise, the English alphabet was used for transcription. Together, these steps were intended to yield accurate and representative speech data for African languages.
Compared with read speech, image-prompted speech captures more natural variation in vocabulary and grammar. However, it also makes transcription harder and increases variability across speakers, domains, and acoustic conditions. The WAXAL team weighed these trade-offs when constructing the dataset, and the result is a multilingual ASR dataset that carries the kind of variability found in data collected from real-world environments.
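The collection scheme described above (per-recording metadata, with only about 10% of the audio transcribed) can be pictured as a simple record type plus a check on how much of the audio is labeled. The following is a minimal sketch, not the dataset's actual schema: the class name, field names, and helper functions are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class AsrRecording:
    """Hypothetical record for one image-prompted recording (not WAXAL's real schema)."""
    audio_path: str
    duration_s: float
    speaker_age: int
    speaker_gender: str
    native_language: str
    environment: str                  # e.g. "indoor", "street" (illustrative values)
    transcript: Optional[str] = None  # None for audio that was never transcribed


def split_by_transcription(
    recordings: List[AsrRecording],
) -> Tuple[List[AsrRecording], List[AsrRecording]]:
    """Separate transcribed recordings from untranscribed ones."""
    labeled = [r for r in recordings if r.transcript is not None]
    unlabeled = [r for r in recordings if r.transcript is None]
    return labeled, unlabeled


def transcribed_fraction(recordings: List[AsrRecording]) -> float:
    """Fraction of total audio duration that has a transcript."""
    total = sum(r.duration_s for r in recordings)
    if total == 0:
        return 0.0
    labeled = sum(r.duration_s for r in recordings if r.transcript is not None)
    return labeled / total


def make_demo_recordings() -> List[AsrRecording]:
    """Ten made-up 30-second recordings, one of which is transcribed."""
    demo = []
    for i in range(10):
        demo.append(
            AsrRecording(
                audio_path=f"clip_{i}.wav",
                duration_s=30.0,
                speaker_age=30,
                speaker_gender="female",
                native_language="xx",  # placeholder language code
                environment="indoor",
                transcript="placeholder text" if i == 0 else None,
            )
        )
    return demo


if __name__ == "__main__":
    recs = make_demo_recordings()
    labeled, unlabeled = split_by_transcription(recs)
    print(len(labeled), len(unlabeled), transcribed_fraction(recs))
```

In a layout like this, the transcribed subset would serve supervised ASR training and evaluation, while the untranscribed remainder stays available for approaches that do not need labels. Measuring the fraction by duration rather than by clip count is a design choice made here, since the article describes the 10% in terms of collected audio.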
TTS Data Collection Method: Building a Studio Environment for High-Quality Synthesis
Unlike the ASR component, the TTS component of WAXAL focuses on creating high-quality single-speaker synthetic voices. For each target language, a phonetically balanced script of approximately 108,500 words was created, and 72 voice actors (36 male and 36 female) were recruited and recorded in a professional studio environment. The studio setting was crucial for reducing background noise and preserving audio quality. The goal was to secure approximately 16 hours of clean, edited audio per voice actor. Pronunciation consistency, recording environment, microphone quality, and speaker identity matter more for TTS models than for ASR models, and the TTS component was collected with these factors in mind. As a result, WAXAL provides a dataset well suited to training African language TTS models.
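A per-speaker target such as "approximately 16 hours of clean audio" is easy to track during collection. The sketch below is an assumption about how one might do that, not a description of the WAXAL team's tooling: the input format (pairs of speaker ID and clip duration in seconds) and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Approximate per-voice-actor target stated in the article.
TARGET_HOURS = 16.0


def hours_per_speaker(clips: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum clip durations (given in seconds) into hours for each speaker ID."""
    totals: Dict[str, float] = defaultdict(float)
    for speaker_id, duration_s in clips:
        totals[speaker_id] += duration_s / 3600.0
    return dict(totals)


def speakers_below_target(
    clips: Iterable[Tuple[str, float]],
    target_hours: float = TARGET_HOURS,
) -> List[str]:
    """Return the sorted IDs of speakers whose total audio is under the target."""
    totals = hours_per_speaker(clips)
    return sorted(s for s, hours in totals.items() if hours < target_hours)


if __name__ == "__main__":
    # Made-up durations: speaker "a" has 16 hours, speaker "b" has 15 hours.
    demo_clips = [("a", 16 * 3600.0), ("b", 10 * 3600.0), ("b", 5 * 3600.0)]
    print(hours_per_speaker(demo_clips))
    print(speakers_below_target(demo_clips))
```

Because the article gives the target as approximate, a real check would likely allow some tolerance around the 16-hour figure rather than a strict cutoff.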
The Significance and Future Prospects of WAXAL
The release of WAXAL is expected to advance research and development in African language speech technology. Developers who previously struggled with data scarcity can now build more accurate and natural speech recognition and synthesis models. This should improve digital accessibility in Africa and enable a wider range of language services. WAXAL may also serve as a model for building datasets for other low-resource languages, which could lead to further advances in speech technology across many languages. Google’s involvement contributes to the democratization of artificial intelligence technology and to a more inclusive technological environment worldwide.
However, WAXAL should not be treated as a perfect benchmark dataset. Because the data was collected in real-world environments, it contains considerable variability, and model developers need to account for this. That same variability can help predict how models will perform in real-world services and can support the development of more robust models. A range of research built on WAXAL is expected to follow, with broad benefits as African language speech technology advances.