With the increasing trend of globalization and of cultural and trade exchanges between countries, multilingualism has become a common phenomenon in daily life. Speech recognition is a gateway to human-computer interaction, yet most existing state-of-the-art systems are monolingual: they can handle only one language at a time and cannot recognize code-switching speech. It is therefore important to build automatic speech recognition systems for code-switching speech.

The DNN-HMM framework has become the mainstream approach to acoustic modeling for speech recognition in recent years, but it has clear limitations for code-switching recognition tasks. First, a conventional DNN-HMM system is built on acoustic units such as pinyin syllables or phonemes. These units are defined independently for each language and have different acoustic properties, so the connections between the acoustic attributes of the two languages cannot be captured well by separate, language-specific pronunciation dictionaries. Second, because of the particular nature of code-switching speech and the sparsity of training data at switching points, the DNN-HMM model cannot effectively model the acoustic properties at the junction between the two languages (the code-switching point).

This paper therefore adopts an end-to-end (E2E) strategy and builds and studies a Chinese-English code-switching speech recognition system based on the Transformer framework with joint CTC training (a minimal sketch of this joint objective is given at the end of this section). The E2E model is built entirely on a unified neural network, eliminating the separate pronunciation dictionary, acoustic model, and language model of the DNN-HMM pipeline, so the entire mapping from input to output can be optimized jointly. Moreover, E2E models are usually based on character-level modeling units, which no longer correspond one-to-one to acoustic units; this blurs the association between modeling units and acoustic attributes and lets the network automatically balance the similarities and distinctions between the speech of different languages. In addition, because the E2E model is free from the conditional-independence assumption, it can learn the acoustic properties at code-switching points.

Furthermore, this paper proposes two novel Transformer-based structures: (1) an acoustic modeling algorithm based on the Transformer framework with a self-and-mixed attention mechanism, and (2) a "multi-encoder-decoder Transformer" structure designed to better explore the acoustic commonalities and distinctions between Chinese and English (an illustrative sketch of this idea also follows below). Experimental results on the SEAME dataset demonstrate that both proposed acoustic modeling algorithms significantly improve recognition performance over the baseline standard Transformer model and the baseline DNN-HMM model.
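The joint CTC training mentioned above refers to the widely used hybrid CTC/attention objective, which interpolates a CTC loss on the encoder output with a cross-entropy (attention) loss on the decoder output. The following PyTorch sketch illustrates that objective only; the module names, the interpolation weight of 0.3, and the reuse of a single target tensor for both branches are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a joint CTC/attention loss: L = w * L_ctc + (1 - w) * L_att.
# Names and hyperparameters are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointCTCAttentionLoss(nn.Module):
    def __init__(self, ctc_weight=0.3, blank_id=0, pad_id=-1):
        super().__init__()
        self.ctc_weight = ctc_weight
        self.pad_id = pad_id
        self.ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)

    def forward(self, enc_logits, enc_lens, dec_logits, targets, target_lens):
        # CTC branch: nn.CTCLoss expects (T, N, C) log-probabilities.
        log_probs = F.log_softmax(enc_logits, dim=-1).transpose(0, 1)
        ctc_loss = self.ctc(log_probs, targets, enc_lens, target_lens)
        # Attention branch: per-token cross-entropy on the decoder outputs.
        # (For brevity the same target tensor is reused; a real system would
        # shift targets and add <sos>/<eos> for the decoder.)
        att_loss = F.cross_entropy(
            dec_logits.reshape(-1, dec_logits.size(-1)),
            targets.reshape(-1),
            ignore_index=self.pad_id,
        )
        return self.ctc_weight * ctc_loss + (1.0 - self.ctc_weight) * att_loss

# Toy check: batch of 2, 50 encoder frames, 8 target tokens, 3000-char vocab.
enc = torch.randn(2, 50, 3000)
dec = torch.randn(2, 8, 3000)
tgt = torch.randint(1, 3000, (2, 8))  # avoid the CTC blank id 0
lens, tlens = torch.tensor([50, 50]), torch.tensor([8, 8])
loss = JointCTCAttentionLoss(ctc_weight=0.3)(enc, lens, dec, tgt, tlens)
```

In this formulation the CTC branch acts as a regularizer that encourages monotonic input-output alignment, while the attention branch retains the decoder's full modeling flexibility.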
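The "multi-encoder-decoder Transformer" is one of the paper's own contributions, and its fusion details are not specified in this section. Purely as an illustration of the general idea, the sketch below assumes two language-specific encoders over shared acoustic features, with the decoder cross-attending to the concatenation of both encoder memories; the actual architecture in the paper may differ.

```python
# Illustrative two-encoder Transformer for code-switching ASR. This is NOT the
# paper's exact design; it only shows one plausible shape of the idea.
import torch
import torch.nn as nn

class TwoEncoderTransformer(nn.Module):
    def __init__(self, d_model=256, nhead=4, vocab=5000, layers=6):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, nhead, batch_first=True)
        self.enc_zh = nn.TransformerEncoder(make_layer(), layers)  # Mandarin-biased
        self.enc_en = nn.TransformerEncoder(make_layer(), layers)  # English-biased
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.dec = nn.TransformerDecoder(dec_layer, layers)
        self.embed = nn.Embedding(vocab, d_model)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, feats, tokens):
        # feats: (N, T, d_model) acoustic features (a convolutional front end
        # is omitted); tokens: (N, S) target ids. Causal masking for the
        # decoder is also omitted for brevity.
        memory = torch.cat([self.enc_zh(feats), self.enc_en(feats)], dim=1)
        dec_out = self.dec(self.embed(tokens), memory)
        return self.out(dec_out)
```

The intent of such a layout is that each encoder can specialize in the acoustics of one language while the shared decoder learns how to combine them, including at code-switching points.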