Improved VITS-Based Multilingual AI Speech Synthesis Model with Domain Adaptors and Acoustic Feature Optimization
Abstract
Speech synthesis technology plays an important role in global economic and cultural exchange, yet current multilingual speech synthesis still falls short of the needs of the global market. This study improves VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) for multilingual speech synthesis by combining acoustic feature conversion, a procedure for decoupling multilingual information, and domain adaptor modules. In the evaluation of speech synthesis metrics, after the regularization term was removed, the model's average mean opinion score (MOS) for similarity across languages was 4.93. Removing the domain adaptors lowered the naturalness MOS by 0.8 relative to the full multilingual speech synthesis model, indicating that the domain adaptors contribute substantially to the naturalness of the synthesized speech. In the cross-lingual evaluation, the proposed model achieved the highest naturalness, with average MOS values of 4.26 for naturalness and 3.96 for similarity in Chinese-to-English synthesis. In Chinese-to-Japanese synthesis with a data volume of 200, the proposed model reached the highest accuracy, 94.58%, which was 16.53% higher than traditional speech synthesis frameworks, with a synthesis time of 3.12 seconds. These results demonstrate the feasibility and advantages of the domain-adaptor-based multilingual speech synthesis model, which broadens the multilingual applications of speech synthesis in artificial intelligence and supports the industrial development and intelligent services of speech synthesis technology.
DOI: https://doi.org/10.31449/inf.v49i19.7622
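The abstract names domain adaptors without describing their internals. As a point of reference only, the sketch below shows one widely used adapter design, a residual bottleneck module (y = x + W_up·ReLU(W_down·x)), with a separate adapter per language selected by a language ID. This is an assumption, not the authors' architecture: the names `LanguageAdapter`, `DomainAdaptedLayer` and `matvec`, the dimensions, the language codes, and the zero initialization of the up-projection are hypothetical choices made for illustration, and plain Python lists stand in for the tensors a real VITS implementation would use.

```python
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LanguageAdapter:
    """Hypothetical residual bottleneck adapter: y = x + W_up . relu(W_down . x)."""

    def __init__(self, dim: int, bottleneck: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        # Down-projection (dim -> bottleneck) with small random weights.
        self.w_down: Matrix = [
            [rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(bottleneck)
        ]
        # Up-projection (bottleneck -> dim), zero-initialized so the
        # adapter starts out as an identity map (an assumed design choice).
        self.w_up: Matrix = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, x: Vector) -> Vector:
        hidden = [max(0.0, h) for h in matvec(self.w_down, x)]  # ReLU
        delta = matvec(self.w_up, hidden)
        return [xi + di for xi, di in zip(x, delta)]  # residual connection


class DomainAdaptedLayer:
    """Hypothetical wrapper: one adapter per language, selected by language ID."""

    def __init__(self, dim: int, bottleneck: int, languages: List[str]) -> None:
        self.adapters: Dict[str, LanguageAdapter] = {
            lang: LanguageAdapter(dim, bottleneck, seed=i)
            for i, lang in enumerate(languages)
        }

    def __call__(self, x: Vector, lang: str) -> Vector:
        # Route the hidden frame through the adapter of its language.
        return self.adapters[lang](x)


if __name__ == "__main__":
    layer = DomainAdaptedLayer(dim=4, bottleneck=2, languages=["zh", "en", "ja"])
    frame = [0.5, -1.0, 0.25, 2.0]
    print(layer(frame, "ja"))  # prints the unchanged frame at initialization
```

Zero-initializing the up-projection makes each adapter an identity map before training, so, under this assumption, inserting it would not disturb a pretrained model's outputs, and only the small per-language matrices would then need to be trained.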
License
I assign to Informatica, An International Journal of Computing and Informatics ("Journal") the copyright in the manuscript identified above and any additional material (figures, tables, illustrations, software or other information intended for publication) submitted as part of or as a supplement to the manuscript ("Paper") in all forms and media throughout the world, in all languages, for the full term of copyright, effective when and if the article is accepted for publication. This transfer includes the right to reproduce and/or to distribute the Paper to other journals or digital libraries in electronic and online forms and systems.
I understand that I retain the rights to use the pre-prints, off-prints, accepted manuscript and published journal Paper for personal use, scholarly purposes and internal institutional use.
In certain cases, I can ask for retaining the publishing rights of the Paper. The Journal can permit or deny the request for publishing rights, to which I fully agree.
I declare that the submitted Paper is original, has been written by the stated authors and has not been published elsewhere nor is currently being considered for publication by any other journal and will not be submitted for such review while under review by this Journal. The Paper contains no material that violates proprietary rights of any other person or entity. I have obtained written permission from copyright owners for any excerpts from copyrighted works that are included and have credited the sources in my article. I have informed the co-author(s) of the terms of this publishing agreement.
Copyright © Slovenian Society Informatika