Research On Confusion Network And Side Information For Speech Recognition

Posted on: 2008-06-19
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H L Wang
Full Text: PDF
GTID: 1118360245996610
Subject: Computer application technology
Abstract/Summary:
Communicating freely with computers through speech has long been a dream. Although decades of sustained effort have brought considerable progress in speech recognition, the technology is still far from practical application. Improving performance and robustness further has become the bottleneck of the field.

It is well known that automatic speech recognition systems use very limited acoustic and linguistic knowledge, namely spectral features of the speech signal and N-gram statistical language models. This is far from sufficient for a task as complex as speech recognition, since humans implicitly draw on a large amount of additional information when perceiving speech.

Recognition performance can be improved by modeling and applying such side information more effectively. A confusion network is a compact representation of multiple recognition candidates, and word error rate can be minimized by performing second-pass decoding on it. More importantly for recognition performance, a confusion network can serve as a decoding platform on which various kinds of side information are integrated.

Accordingly, this thesis studies two subjects: confusion networks and side information. The aim is to reduce character error rate by decoding confusion networks with various kinds of side information. For confusion networks, efficient approaches to generation and decoding are studied. For side information, effective methods of modeling and applying it are investigated. The major original contributions are as follows:

1. Two approaches to generating confusion networks efficiently are proposed. In the first, the lattice is reduced in scale by segmenting the original lattice into multiple sublattices, which speeds up generation at the cost of a slight decline in quality.
In the second, the construction of confusion sets is guided by the arc with the maximum posterior probability, which reduces the complexity of the generation algorithm to linear. In addition, K-L divergence is introduced to measure the similarity between two arcs, which improves the quality of the confusion network. Finally, two new confusion network structures are introduced for Chinese speech recognition: the character-based confusion network and the logical confusion network.

2. Decoding methods that integrate two types of side information on the confusion network are studied. A trigger language model based on semantic class pairs is proposed to model long-span dependencies between words, and the model is integrated into the confusion network decoding process. Because different speech recognition systems use different knowledge sources and modeling methods, their error patterns also differ; a decoding method is therefore proposed to combine the results of multiple recognition systems on the confusion network. Experimental results show that the two methods reduce character error rate by 7.9% and 10.7% relative, respectively.

3. The use of tone information to improve Chinese speech recognition is investigated. In the acoustic decoding stage, an HMM based on multi-space probability distributions (MSD-HMM) is adopted to model tone patterns, which resolves the problem that the tone feature is discontinuous over the utterance. Within a two-stream HMM framework, spectral and pitch features are decoded synchronously. In the second pass, tone information over a horizontal, longer time span is used to build explicit tone models, which are applied to decoding on the confusion network generated in the first pass.
Experimental results show a 15.9% relative error reduction in character recognition from first-pass decoding and an additional 8.0% relative error reduction from second-pass decoding.

4. A reliable speech input system with fast error correction is developed. A character-based confusion network is used to decompose sentence-level hypotheses into character-level ones, which lets the user correct about half of the recognition errors quickly and conveniently. To speed up the input of new characters, a speech recognition method assisted by handwriting information is proposed. It is faster than handwriting input alone and more reliable than speech recognition alone.

In summary, this thesis investigates methods for generating confusion networks, decoding methods that integrate side information, and the modeling and application of side information, and achieves performance improvements in speech recognition. Efficient construction of high-quality confusion networks is the basis of decoding, and it matters not only for speech recognition but also for other confusion-network-based tasks such as spoken document retrieval. The study of confusion network decoding methods that integrate the trigger language model based on semantic class pairs and multi-system combination results also provides a useful reference for exploiting other types of side information. The use of tone information markedly improves recognition performance and is a promising start toward better use of other acoustic side information such as stress and intonation. With confusion networks and handwriting information, the speech input system becomes more reliable and its error correction more convenient and efficient, which is a successful application of side information and confusion networks in speech recognition.
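
As a rough illustration of what decoding a confusion network with side information can look like, the Python sketch below treats the network as a list of confusion sets and runs a Viterbi search that adds a weighted side-information score to the log posteriors. It is not the thesis's algorithm: the toy network, the `pair_bonus` table, the `side_score` function, and the `weight` parameter are all hypothetical stand-ins for a real model such as a trigger language model or a tone model.

```python
import math

# Toy confusion network: one dict per confusion set, mapping each
# candidate to its posterior probability (values are made up).
network = [
    {"I": 0.6, "eye": 0.4},
    {"sea": 0.55, "see": 0.45},
    {"you": 0.9, "ewe": 0.1},
]

# Hypothetical side-information scores for adjacent candidate pairs.
pair_bonus = {("I", "see"): 1.0, ("see", "you"): 0.5}


def side_score(prev, cur):
    """Return the assumed side-information score for a candidate pair."""
    return pair_bonus.get((prev, cur), 0.0)


def decode(network, weight=1.0):
    """Viterbi search maximizing log posteriors plus weighted side scores."""
    # best[c] = (score, path) of the best partial path ending in candidate c
    best = {c: (math.log(p), [c]) for c, p in network[0].items()}
    for slot in network[1:]:
        new_best = {}
        for cur, p in slot.items():
            # Pick the predecessor that gives the highest combined score.
            score, path = max(
                (
                    (s + weight * side_score(prev, cur), pth)
                    for prev, (s, pth) in best.items()
                ),
                key=lambda item: item[0],
            )
            new_best[cur] = (score + math.log(p), path + [cur])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


baseline = decode(network, weight=0.0)  # posterior-only decoding
rescored = decode(network, weight=1.0)  # decoding with side information
```

With `weight=0.0` the search reduces to picking the highest-posterior candidate in each confusion set; with a positive weight the side scores can overturn a locally preferred candidate, which is the effect the second-pass decoding described above relies on.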
Keywords/Search Tags: speech recognition, confusion network, side information, multi-system fusion, tone modeling