The language model, which is responsible for converting Pinyin into Chinese words, plays a key role in speech recognition, and its performance has a direct impact on recognition results. The most widely used language model today is the statistical language model, and data sparseness is one of the main problems it must face. Moreover, a statistical language model takes only local information into account, so it is worthwhile to add global information to it. Many smoothing techniques are applied in statistical language models, including Katz smoothing and Church-Gale smoothing, which are widely used in speech recognition.

In this thesis, we adopt Bellegarda's latent semantic analysis (LSA) language model to incorporate global information into the statistical language model. The latent semantic language model predicts the probability of word occurrence from the perspective of content, so it is a good complement to the statistical language model. Through singular value decomposition (SVD) of the word-document matrix, all documents and words are represented as vectors of the same dimension, and the similarity of the corresponding vectors measures how strongly a document predicts a word's occurrence. Combining the statistical language model with the latent semantic language model yields a new mixed language model that considers both local and global information. Perplexity, an important measure of language-model performance, can be used to compare the mixed language model with the statistical language model.

In the experiment, a bigram language model with Katz smoothing and a directly modeled latent semantic language model are constructed and combined into a mixed language model. The experimental results show that, compared with the bigram language model, the perplexity of the mixed model declines and performance improves.
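The experiment's first component is a bigram model with Katz smoothing. The sketch below illustrates the back-off idea on a toy corpus; for brevity it replaces the Good-Turing discount of full Katz smoothing with a fixed discount d, and the corpus and d = 0.5 are illustrative assumptions, not details from the thesis.

```python
# Minimal Katz back-off bigram sketch (fixed discount stands in for
# the Good-Turing discounting of full Katz smoothing).
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()  # toy data
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)
d = 0.5  # fixed discount; an assumption for illustration

def p_unigram(w):
    return unigrams[w] / N

def p_katz(w_prev, w):
    """Discounted bigram probability, backing off to the unigram."""
    if bigrams[(w_prev, w)] > 0:
        return (bigrams[(w_prev, w)] - d) / unigrams[w_prev]
    # Back-off: redistribute the mass freed by discounting the seen
    # bigrams of w_prev over the unseen words, weighted by unigrams.
    seen = [v for (u, v) in bigrams if u == w_prev]
    left_over = d * len(seen) / unigrams[w_prev]
    denom = 1.0 - sum(p_unigram(v) for v in set(seen))
    return left_over * p_unigram(w) / denom

print(p_katz("the", "cat"))  # seen bigram: discounted estimate
print(p_katz("the", "ran"))  # unseen bigram: backed-off estimate
```

For each context, the discounted probabilities of seen bigrams and the backed-off probabilities of unseen ones sum to one, which is the property smoothing is meant to guarantee.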
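The LSA step described above, SVD of the word-document matrix, equal-dimension vectors for words and documents, and vector similarity as a predictor of word occurrence, can be sketched as follows. The toy matrix and the truncation rank k are assumptions for illustration; Bellegarda's model also typically applies an entropy-based weighting to the counts before the decomposition.

```python
# LSA sketch: SVD of a word-document matrix, then cosine similarity
# between a word vector and a document vector in the latent space.
import numpy as np

# Rows = words, columns = documents; entries are (weighted) counts.
W = np.array([
    [2.0, 0.0, 1.0],
    [1.0, 3.0, 0.0],
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
])

k = 2  # truncation rank of the latent space (illustrative)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_k, S_k, V_k = U[:, :k], np.diag(s[:k]), Vt[:k, :].T

word_vecs = U_k @ S_k  # one row per word, all in the same k-dim space
doc_vecs = V_k @ S_k   # one row per document, same space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Similarity between word 0 and document 2: a proxy for how strongly
# that document context predicts the word's occurrence.
print(cosine(word_vecs[0], doc_vecs[2]))
```

Because words and documents land in the same k-dimensional space, a document (or a recognition history treated as a pseudo-document) can be compared directly against every vocabulary word, which is what gives the model its global, content-based view.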
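The abstract does not state the exact formula used to combine the two models. The sketch below uses linear interpolation, one common way to merge a local n-gram estimate with a global LSA estimate; the weight lam is hypothetical and would be tuned on held-out data. Bellegarda's own integration is multiplicative with renormalization, which the thesis may follow instead.

```python
# Hedged sketch of mixing a local (bigram) probability with a global
# (LSA) probability by linear interpolation; an assumed scheme, not
# necessarily the thesis's combination formula.
def p_mixed(p_bigram, p_lsa, lam=0.7):
    """lam weights local vs. global evidence; 0.7 is illustrative."""
    return lam * p_bigram + (1.0 - lam) * p_lsa

print(p_mixed(0.12, 0.03))  # -> 0.093
```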
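Perplexity, used to compare the models, is the exponentiated average negative log-probability the model assigns to a test sequence: PP = exp(-(1/N) * sum_i log p(w_i | h_i)). A minimal, model-agnostic sketch, where prob_fn stands for any of the models above:

```python
# Perplexity of a conditional model over a test token sequence.
import math

def perplexity(prob_fn, tokens):
    """prob_fn(prev, w) -> P(w | prev); tokens is the test sequence."""
    log_sum = sum(math.log(prob_fn(prev, w))
                  for prev, w in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return math.exp(-log_sum / n)

# Sanity check: a uniform model over a 5000-word vocabulary has
# perplexity exactly 5000.
uniform = lambda prev, w: 1.0 / 5000
print(perplexity(uniform, "we test the model here".split()))  # 5000.0
```

A lower perplexity means the model spreads its probability mass less thinly over the test data, which is the sense in which the mixed model's decline in perplexity indicates improved performance.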