Acoustic-to-articulatory inversion (AAI) is the task of recovering the movements of the articulators from speech signals. It has great application value in language learning and rehabilitation guidance. Most current work uses only speech features as input, which causes an inevitable performance bottleneck, and since the framework based on bidirectional recurrent neural networks was proposed, there has been little progress in the development of inversion frameworks. To address these problems, we propose a novel method called the auxiliary feature fusion network (AFFN). This paper focuses on feature processing and the inversion network in acoustic-to-articulatory inversion. In terms of input features, we use the trajectories of the non-tongue articulators in the EMA dataset as auxiliary features to increase the diversity of the input together with the speech features. Then, keeping speech as the only input, an extraction unit is used to predict the auxiliary features and thereby enhance the performance of the inversion network. Meanwhile, inspired by the idea of feature fusion based on canonical correlation analysis, we propose a feature transformation module that generates a joint feature with higher correlation as the input of the articulatory inversion module. Finally, an encoder-decoder network with an attention mechanism is used in place of the common multi-layer LSTM network to capture more contextual relations. Experiments are conducted on two public datasets, MNGU0 and MOCHA, and verify the effectiveness of the proposed methods. Experimental results show that the proposed acoustic-to-articulatory inversion model with feature transformation fusion and the attention mechanism greatly improves performance with the same input speech features, reducing the average RMSE by more than 15% compared with the state-of-the-art method that uses audio speech features only.