In this guest blog, Sasha Wanasky and Myfyr Prys from the translation company Cymen discuss their project to improve the resources available to train speech recognition systems to understand Welsh. The project was funded through a grant awarded by the Cronfa Her ARFOR scheme. Over the coming months we will be giving others who have received funding through the scheme the opportunity to share what they have learnt in realising their projects.
Why do we need Welsh speech recognition systems?
Speech recognition systems enable computers to transform human speech into text. This technology can be used to produce automatic subtitles in YouTube videos or Teams meetings, or to allow Alexa and similar programs to understand commands such as “what is the weather forecast for tomorrow” and then respond accordingly.
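As a rough illustration of what such a system does under the hood, here is a minimal sketch using the open-source Hugging Face transformers library. The model name and audio file below are placeholders, not references to any specific Welsh system:

```python
# A minimal sketch of automatic speech recognition (speech-to-text)
# using the Hugging Face "transformers" library.
from transformers import pipeline

# Load a pretrained speech-to-text model.
# "example-org/welsh-asr-model" is a placeholder model ID; substitute
# whichever Welsh checkpoint you have access to.
transcriber = pipeline(
    "automatic-speech-recognition",
    model="example-org/welsh-asr-model",
)

# Transcribe a local audio file; the pipeline returns {"text": "..."}.
result = transcriber("recording.wav")
print(result["text"])
```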
Producing speech recognition technologies for the Welsh language would allow Welsh speakers to use Welsh in aspects of their lives in which they are currently still forced to use English. One such example is Alexa’s inability to understand and speak Welsh. While the Language Technologies Unit at Bangor University has been working on a similar system called Macsen, there is still a lot of work that needs to be done to get these technologies on a par with their English counterparts. When presenting the project at events, one statement we heard a lot was that Alexa is the only member of the household who doesn’t speak Welsh. Improving Welsh speech recognition technology would allow organisations and companies to create Welsh virtual assistants, which would have a positive impact on the amount of Welsh that people speak in their everyday lives.
On top of that, Welsh speech recognition systems would allow a larger number of people to access online meetings and media content through the medium of Welsh. It is fairly common practice for S4C to produce English subtitles for their short- and long-form content on social media, but this forces deaf people or people with an auditory processing disorder to interact with this content only through English. Short-form content is also increasingly watched without sound, and offering Welsh subtitles for these videos could increase their popularity as well as improve the Welsh reading skills of young people who would otherwise only read Welsh in school. So, having freely available Welsh transcription services would allow anyone to create content with Welsh subtitles without needing a lot of transcription experience, and would increase the number of people who can interact with Welsh content through the medium of Welsh.
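To make the subtitling idea concrete, the sketch below shows how timestamped transcription output could be turned into a standard SRT subtitle file that most video platforms accept. The segments and their timings are made-up examples:

```python
# A minimal sketch of turning timestamped transcription segments into an
# SRT subtitle file. The segments below are made-up examples.

segments = [
    (0.0, 2.5, "Croeso i'r rhaglen."),                     # start, end (seconds), text
    (2.5, 5.0, "Heddiw rydyn ni'n siarad am dechnoleg."),
]

def fmt(t: float) -> str:
    # SRT timestamps look like 00:00:02,500
    h, rem = divmod(int(t * 1000), 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

with open("subtitles.srt", "w", encoding="utf-8") as f:
    for i, (start, end, text) in enumerate(segments, 1):
        f.write(f"{i}\n{fmt(start)} --> {fmt(end)}\n{text}\n\n")
```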
Why has Welsh fallen behind other languages and especially English?
There are two main reasons why this technology is not currently as widely used in Welsh. Firstly, while these technologies do exist, many of the publicly and commercially available systems do not produce satisfactory output. Anyone who has tried to use the automatic subtitle function in Microsoft Teams in Welsh knows that it is almost impossible to understand what the speaker is saying by reading the subtitles alone. This performance can only be improved by adding more training data to these speech recognition models. In comparison to English models, which have been trained on hundreds of thousands of hours of audio recordings and their corresponding transcriptions, Welsh models have been trained on only about 200 hours of recordings, which is all the data that is currently publicly available. Of these 200 hours, about 37 are recordings that have been transcribed by hand, while the other 163 are recordings of people reading sentences. This enormous gap in available data is the main reason for the poor performance of Welsh speech recognition technology.
Secondly, many large tech companies have not adopted Welsh into their available language repertoire. Microsoft is currently one of the only tech giants that provides most of its software and automatic subtitles in Welsh, while Apple, Amazon and Google have done very little or nothing at all to offer these technologies in Welsh. This is because there is little incentive for these large companies to spend time and resources on improving Welsh speech recognition, since these efforts are unlikely to produce a profit. Therefore, it is up to organisations, governments and individuals in Wales to ensure that the data needed to improve those systems is produced and made publicly available for both small and large companies to use for free.
Goal of the project
The goal of this project is to collect and transcribe by hand as many hours of audio recordings as possible. Specifically, the project addresses additional challenges such as the lack of informal and conversational data; the lack of data from mid and south-west Wales (the Arfor area); the underrepresentation of accents and dialects of people who are not public figures, such as radio presenters, actors and musicians; and the need for a data set that is completely open source, so that any developer can include Welsh speech recognition in their own products or programs.
Main methods of collecting data
Throughout the project we collected data from three main sources: podcasts; slightly more formal online meetings, talks and presentations; and very informal conversations between volunteers who were recruited as part of the project. We found most of the podcasts through the Welsh podcast website ypod.cymru. From there we began to contact the owners of podcasts that fit certain criteria, such as not having too many guests, or podcasts that could help address the challenges mentioned above. The podcasts that were used range from discussions of the latest episodes of reality TV shows to Welsh literature and comedy programs.
The talks and conversations were partially recorded through Microsoft Teams and partially in person, at events such as Gŵyl Ddewi Arall and the Eisteddfod, and by visiting volunteers who lived in the area. This opened up the opportunity to talk to people from all sorts of backgrounds and walks of life about the project, their experience with Welsh technologies and their wishes for the future. Most volunteers were recruited by posting flyers in local Welsh language clubs and societies, through being invited to speak about the project on Radio Cymru, and by asking friends and family to spread the word. A majority of them were from the south west, and recording the conversations through Teams proved a very successful method of efficiently collecting this type of data. Another very successful way of gathering a large amount of data in a short period of time was the National Eisteddfod in Pontypridd. As the project officer, I was able to travel there and record a wide variety of talks and conversations, mostly from smaller events in the Paned o Gê, Cymdeithas yr Iaith and University tents.
Overall outcomes and effect on the Welsh language
After working on this project for almost a year, we successfully transcribed 50 hours of audio data. Considering the challenging nature of this type of work, the fact that we were able to achieve this in less than a year is proof of the overall success of the project. In comparison, the total amount of conversational speech data that had been collected before this project was 35 hours. This amount of data would of course not have been collected without the efforts of the 13 freelance transcribers who were recruited and trained as part of this project, and the 5 students who were employed by Cymen over the summer to gain work experience in an entirely Welsh workplace. About 5 of these freelance transcribers have expressed interest in continuing to offer their transcription services to Cymen and other companies. Finding skilled transcribers was one of the biggest challenges at the beginning, but through this project we were able to make a significant contribution to this previously underdeveloped field. This project also enabled Cymen to employ one full-time project officer and, through that, support the local economy by creating jobs. Throughout the project, Cymen was also able to establish and strengthen ties with other companies and organisations in the field, such as the Uned Technolegau Iaith at Bangor University, Bangor AI, and independent researchers in Brittany. The latter could lead to further projects focussing on knowledge exchange between the researchers and hopefully result in new developments and improvements in speech recognition in both countries.
Finally, the first speech recognition models have been trained on a subset of the data that we collected, and the results look very promising. The graph below shows the word error rate (WER) before using our data to train the model and after training the model with 5, 10 and 15 hours of our data. The WER is a standard measure in speech recognition research, which gives the percentage of errors in the machine-produced transcription compared to the manually produced transcription. Since it counts errors in the automatic transcription, the lower the WER, the better the model. After training the model with only 5 hours of data, the WER is reduced by 24%; after a further 5 hours it decreases by another 4%, and after another 5 hours it improves by only 0.4%, reaching a kind of plateau. When training the model with both our data and data collected by the Language Technologies Unit at Bangor University, the model achieved a WER as low as 28%, 7% lower than our best model. A model trained with only the Language Technologies Unit’s data achieves a WER of 45%, 10% higher than our model. This shows that combining the two data sets results in a model that is familiar with a larger variety of voices, accents, topics and registers, and that further projects should focus on collecting all the variation Wales has to offer, to ensure that Welsh speech recognition technologies work for every Welsh speaker.
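For readers who want to see exactly how the WER is calculated, here is a small self-contained sketch: the minimum number of word substitutions, deletions and insertions needed to turn the automatic transcription into the manual one, divided by the number of words in the manual (reference) transcription. The example sentence is made up for illustration:

```python
# A self-contained sketch of the word error rate (WER) calculation.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
    return d[len(ref)][len(hyp)] / len(ref)

# Made-up example: one wrong word out of five gives a WER of 0.2 (20%).
print(wer("mae hi yn braf heddiw", "mae hi yn braf heddi"))
```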