
Research On Extracting Speech Topic Based On Topic Model

Posted on: 2016-12-20
Degree: Master
Type: Thesis
Country: China
Candidate: Q Tang
Full Text: PDF
GTID: 2308330461955989
Subject: Control Science and Engineering
Abstract/Summary:
This thesis studies the process of speech topic extraction, which consists of speech data preprocessing, text representation, feature extraction, parameter estimation, model training, and topic classification. The model is implemented and simulated on the Gibbs-LDA++ and libsvm platforms.

Speech data preprocessing mainly includes speech-to-text conversion, word segmentation, stop-word removal, and word frequency statistics. Speech conversion yields the text data. ICTCLAS is then used to segment the words and remove stop words, which reduces interference from meaningless words and reduces the amount of data. After segmentation and stop-word removal, word frequencies are counted to simplify later processing and to assign weights to the words.

Text representation and feature extraction affect how well the computer processes the data and extracts information from it. We use the vector space model to represent the text; it is a commonly used model in natural language processing and has reliable theoretical support. Feature extraction uses an improved χ2 statistic method, which mainly relies on the relationship between features and categories to select features and to avoid losing important information.

After feature extraction, parameter estimation and model training are performed on the feature set. Parameter estimation provides the three parameters that the LDA model requires: φ, β and T. φ and β cannot be obtained directly in LDA; they can only be obtained through an approximate algorithm, so we use Gibbs sampling, an MCMC method. T is the number of topics and has to be set in advance, but which value is best? Using an optimized DBSCAN algorithm, we use sample density to determine the relationship between different topics and choose the optimal number of topics. This improves performance and reduces the number of iterations.
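The Gibbs sampling step can be illustrated with a minimal sketch of collapsed Gibbs sampling for LDA. This is not the thesis's implementation (which uses Gibbs-LDA++); it is the textbook formulation, assuming symmetric priors. The function name `lda_gibbs`, the names `alpha`, `beta`, `theta` and `phi`, and the toy corpus are assumptions that follow common LDA convention and may not match the thesis's notation; `num_topics` plays the role of T.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # words in doc d assigned to topic t
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # times word w is assigned to topic t
    topic_total = [0] * num_topics                               # total words assigned to topic t
    assignments = []

    # Random initial topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic for this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Point estimates of the document-topic and topic-word distributions.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha) for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta) for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi


# Toy corpus (illustrative only): word ids 0-2 and 3-5 tend to occur together.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 3, 4, 5]]
theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=6)
```

Each row of `theta` is one document's topic mixture, so `theta` as a whole is a document-by-topic matrix of the kind a downstream classifier could consume.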
Once the parameters have been obtained, we train the LDA model so that it generates a hidden topic-text matrix for the SVM.

Finally, using the Gibbs-LDA++ and libsvm platforms, we carry out extraction experiments on Chinese and English speech data. The experimental results and the performance evaluation demonstrate the superiority and effectiveness of speech topic extraction based on the topic model.
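The abstract does not say how the topic-text matrix is handed to libsvm. One plausible route, sketched here under that assumption, is to write each document's topic proportions in libsvm's sparse text format (`label index:value ...`, with 1-based ascending indices). The helper name `to_libsvm_lines`, the class labels, and the sample rows are hypothetical.

```python
def to_libsvm_lines(labels, doc_topic_rows):
    """Format (label, topic-proportion row) pairs as libsvm sparse-format lines."""
    lines = []
    for label, row in zip(labels, doc_topic_rows):
        # libsvm feature indices start at 1; zero-valued features may be omitted.
        features = " ".join(
            f"{index}:{value:.6g}"
            for index, value in enumerate(row, start=1)
            if value != 0
        )
        lines.append(f"{label} {features}".rstrip())
    return lines


# Illustrative data: two documents, three topics, class labels 1 and 2.
rows = [[0.9, 0.1, 0.0], [0.2, 0.0, 0.8]]
labels = [1, 2]
lines = to_libsvm_lines(labels, rows)
for line in lines:
    print(line)
```

Lines in this format, saved to a file, are what libsvm's `svm-train` tool reads as training input.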
Keywords/Search Tags: LDA model, topic extraction, Gibbs sampling, topic