As the basis of intelligent interaction, speech synthesis has advanced rapidly in recent years, significantly boosting the adoption of artificial intelligence technology in daily life. Speech synthesis models based on deep neural networks can now produce high-quality speech comparable to that of real humans. However, as the parameter counts and complexity of neural networks grow, speech synthesis models face many efficiency problems. The efficiency of acoustic modeling deserves particular attention, as it forms the basis of speech synthesis. Building efficient neural acoustic models is therefore important both for neural speech synthesis and for neural network technology itself. To this end, this research focuses on the two mainstream acoustic modeling strategies and on the front-end processing for Mandarin Chinese acoustic modeling, aiming to remove existing barriers to more efficient speech synthesis systems.

(1) An autoregressive acoustic model mixing self-attention and lightweight convolution is proposed to alleviate the large parameter count and poor computational efficiency of purely self-attention-based autoregressive acoustic models. Specifically, our model replaces part of the self-attention operations in the original model with more efficient lightweight convolutions, which may reduce potential computational redundancy and thus improve training and inference efficiency. The baseline and the proposed model are validated on Chinese and English datasets, respectively. Experimental results demonstrate that our model can match the performance of the baseline while improving training efficiency by 36% and inference efficiency by 95%. Although self-attention has strong modeling capabilities, convolution is often more computationally efficient at capturing local features. Therefore, reasonably organizing and combining different network structures may bring more benefits than simply increasing the number of
network parameters when building neural models.

(2) Unlike in training, the self-attention-based autoregressive acoustic model cannot run in parallel during inference, leading to unsatisfactory inference speed. To mitigate this problem, we replace the original self-attention network with an efficient decoding self-attention network that has only linear computational complexity during inference, which effectively improves the inference efficiency of the model. Moreover, since our model adopts a decoding algorithm based on dynamic programming, it can easily be combined with constraint algorithms for attention alignment to improve the model's stability when synthesizing speech from long input text. Experimental results show that the proposed method performs on par with the original baseline while achieving an inference speedup of 450% to 720% on the CPU and 20% to 50% on the GPU. This method validates the possibility of replacing quadratic-complexity self-attention in speech processing with a novel linear-complexity operator, serving as an example for accelerating self-attention in other tasks.

(3) A cooperative learning strategy is proposed to simplify the complex pipeline of traditional non-autoregressive acoustic modeling, in which an external alignment tool is usually needed to obtain the pronunciation duration of input tokens. The introduced method also addresses the one-to-many mapping problem in non-autoregressive acoustic modeling. With the proposed cooperative learning strategy, the non-autoregressive model can learn duration information from scratch during training. Meanwhile, the non-autoregressive model may also obtain extra prosody information from its collaborator to further enhance performance. The proposed approach is simpler and more flexible than conventional non-autoregressive acoustic modeling, which significantly boosts the efficiency of model construction. Experimental results show that the proposed method can
perform comparably to the widely used FastSpeech 2 model while significantly reducing the workload. The suggested cooperative learning approach may establish a pathway for sharing knowledge between autoregressive and non-autoregressive acoustic modeling, which helps transfer prior research results based on autoregressive models to the non-autoregressive setting.

(4) For the Chinese acoustic modeling front-end, which relies on large-scale pre-trained language models, two model compression methods are introduced to improve its computational efficiency. Traditional model compression with knowledge distillation directly reuses the teacher's self-attention module to build the student rather than seeking a more efficient structure. This paper therefore proposes a heterogeneous knowledge distillation method in which a lightweight convolutional network is designed as the student model, achieving 11.8 times the running speed of the teacher model. In addition, a hybrid model compression strategy combining knowledge distillation and network pruning is presented. Compared with the previous compression method using pure knowledge distillation, it reduces training time by more than 50%. Experimental results show that both methods produce compressed front-ends with performance comparable to those obtained via traditional knowledge distillation. The results also demonstrate that traditional large-scale pre-trained language models often introduce computational redundancy when applied to specific tasks, and the suggested methods may serve as examples for addressing efficiency issues with pre-trained language models in other applications.
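The lightweight convolution central to contribution (1) is only named above, not defined. As a rough illustration of why it is cheaper than self-attention, a minimal NumPy sketch of the operator in its standard formulation (a depthwise convolution whose kernel weights are softmax-normalized and shared across channel groups) is given below; the function name, shapes, and centered padding are illustrative assumptions, not the thesis implementation.

```python
import numpy as np

def lightweight_conv(x, kernels, heads):
    """Illustrative lightweight convolution (assumed, simplified formulation).

    x       : (T, d) input sequence (T timesteps, d channels)
    kernels : (heads, k) raw kernel weights, one kernel per channel group
    """
    T, d = x.shape
    k = kernels.shape[1]
    assert d % heads == 0, "channels must split evenly into head groups"

    # Softmax over the kernel axis: each head's weights are positive and
    # sum to 1, a defining ingredient of lightweight convolution.
    w = np.exp(kernels - kernels.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)

    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad for same-length output
    out = np.zeros_like(x, dtype=float)
    group = d // heads                    # channels sharing one kernel

    for h in range(heads):
        cols = slice(h * group, (h + 1) * group)
        for t in range(T):
            # Weighted sum over a local window: O(T * k * d) overall,
            # versus O(T^2 * d) for full self-attention.
            out[t, cols] = w[h] @ xp[t:t + k, cols]
    return out
```

Because each position only attends to a fixed window of size k with shared, normalized weights, the cost grows linearly in sequence length, which is the efficiency argument behind mixing this operator with self-attention.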