With the rapid development of speech recognition technology, the end-to-end framework has become mainstream. However, this framework requires a large amount of labeled data for training, which is difficult to obtain for low-resource languages. To address the performance degradation caused by insufficient training data in low-resource languages, this paper studies semi-supervised speech recognition methods on Tibetan data. The main work is as follows:

(1) Tibetan speech recognition based on semi-supervised learning. A baseline model is first trained on labeled Tibetan data to ensure the reliability of the semi-supervised model. Then, a semi-supervised learning method makes full use of the speech- and text-related information in the unlabeled data to improve the model's learning ability. Specifically, this paper designs a shared encoder that can encode both speech and text, and optimizes the model by exploiting the differences between the high-level features of unlabeled speech and text. To better study the effect of the semi-supervised method on recognition performance, experiments were conducted on 25.4 hours of labeled data and 139.5 hours of unlabeled data, with Transformer and Conformer architectures as the shared encoder and Tibetan characters and Tibetan subwords as modeling units. The experimental results show that the character-based semi-supervised model performs well: its character error rate is 8.28%, a 30.36% relative improvement over the baseline model with characters as the modeling unit.

(2) Tibetan speech recognition based on self-training and SpecAugment. With limited resources, acquiring high-quality labeled data incurs substantial expense and labor cost, so this paper proposes a self-training method combined with SpecAugment to expand the labeled data. A converged baseline or semi-supervised model generates pseudo-labels, and a confidence strategy keeps only the pseudo-labels with high confidence. The high-confidence unlabeled utterances and their pseudo-labels are combined into new data pairs and added to the labeled set to enlarge the training data. Finally, spectral perturbation is applied for data augmentation. On the Tibetan recognition task, the experimental results show that self-training and SpecAugment significantly improve recognition performance. With characters as modeling units, the model after self-training and data augmentation achieves a character error rate of 7.35%, a 17.97% relative improvement over the semi-supervised model.
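The self-training pipeline described above (generate pseudo-labels, keep only high-confidence ones, then augment the spectra) can be sketched as follows. This is a minimal NumPy illustration under assumed details, not the thesis's implementation: the exact confidence measure and masking parameters are not specified in the text, so mean token posterior and single-band time/frequency masking are used as placeholders, and all function names and thresholds are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def utterance_confidence(token_probs):
    """Confidence of one pseudo-label, here the mean posterior of the
    decoded tokens (one common choice; the thesis's measure may differ)."""
    return float(np.mean(token_probs))

def filter_pseudo_labels(hypotheses, threshold=0.9):
    """Keep only (audio_id, pseudo_label) pairs whose confidence clears
    the threshold; these pairs are added to the labeled training set."""
    kept = []
    for audio_id, pseudo_label, token_probs in hypotheses:
        if utterance_confidence(token_probs) >= threshold:
            kept.append((audio_id, pseudo_label))
    return kept

def spec_augment(spectrogram, max_f=10, max_t=20):
    """SpecAugment-style perturbation: zero one random frequency band and
    one random time span of a (time, freq) spectrogram."""
    spec = spectrogram.copy()
    T, F = spec.shape
    f0 = rng.integers(0, F - max_f)
    t0 = rng.integers(0, T - max_t)
    spec[:, f0:f0 + rng.integers(1, max_f + 1)] = 0.0   # frequency mask
    spec[t0:t0 + rng.integers(1, max_t + 1), :] = 0.0   # time mask
    return spec

# Toy decoding hypotheses: (audio id, pseudo-label, per-token posteriors).
hyps = [
    ("utt1", "ka kha ga", np.array([0.97, 0.95, 0.99])),  # confident, kept
    ("utt2", "nga ca", np.array([0.60, 0.55])),           # uncertain, discarded
]
print(filter_pseudo_labels(hyps))  # -> [('utt1', 'ka kha ga')]

# Augment a toy 100-frame, 80-bin spectrogram of a kept utterance.
augmented = spec_augment(rng.normal(size=(100, 80)))
print(augmented.shape)  # (100, 80)
```

The threshold trades pseudo-label quantity against quality: a higher value admits fewer but cleaner pairs into the expanded training set, which is why confidence filtering precedes augmentation in the pipeline.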