
Topic Classification Of Spoken Document Based On LSH

Posted on: 2013-04-25
Degree: Master
Type: Thesis
Country: China
Candidate: X W He
GTID: 2248330377959172
Subject: Signal and Information Processing
Abstract/Summary:
How to manage speech information rationally and effectively has become a current research problem in speech signal processing, now that speech information is so widely used. Within this area, topic classification of spoken documents is a research hotspot. In this thesis, Locality-Sensitive Hashing (LSH) is applied to spoken document topic classification for the first time, with the aim of reducing the high time consumption of existing algorithms. Unlike current classification algorithms, LSH can operate directly on high-dimensional sparse matrices, and its time cost grows sub-linearly with both the data dimension and the number of data points, which keeps time consumption low and makes the classification system more practical. The LSH algorithm is studied in detail here. Based on this analysis, its key parameter is adjusted and improved, which increases classification accuracy. Finally, the implementation of the algorithm is improved to reduce its time cost.

First, a Vector Space Model of the recognized speech documents is built with TF-IDF weights and posterior-probability TF-IDF weights, so that the documents can be recognized and processed by a computer. Second, the document vectors are hashed by locality-sensitive hash functions based on p-stable distributions, which preserve the positional relationships of the data in Euclidean space. Third, the key parameters of LSH are analysed in depth. After the optimal parameter is determined by experiment, the system classifies the documents with the LSH algorithm under two judgement rules. Finally, the LSH algorithm is improved, which further reduces time consumption. In addition, a KD tree is also applied to spoken document topic classification.

At the end of the thesis, the results for all spoken documents, 7041 documents in 4 categories, are listed. Analysis of the experimental results leads to the conclusion that, compared with the KD tree, Locality-Sensitive Hashing classifies spoken documents accurately with lower time consumption.
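The hashing step described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used 2-stable (Gaussian) hash family for Euclidean distance, h(v) = floor((a·v + b) / w), with k such functions concatenated into one bucket key. The class name `PStableLSH`, the parameters `k` and `w`, and the majority-vote rule in `classify` are illustrative assumptions, not the thesis's actual parameters or judgement rules.

```python
import math
import random
from collections import Counter, defaultdict


class PStableLSH:
    """One hash table keyed by k concatenated p-stable hash values.

    Hypothetical sketch: names and defaults are assumptions for illustration.
    """

    def __init__(self, dim, k=4, w=4.0, seed=0):
        rng = random.Random(seed)
        self.w = w
        # Gaussian entries are 2-stable, which corresponds to Euclidean distance.
        self.a = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
        # One random offset per hash function, drawn uniformly from [0, w].
        self.b = [rng.uniform(0.0, w) for _ in range(k)]
        self.buckets = defaultdict(list)

    def key(self, v):
        """Return the bucket key: a tuple of floor((a.v + b) / w) values."""
        return tuple(
            math.floor((sum(ai * vi for ai, vi in zip(a, v)) + b) / self.w)
            for a, b in zip(self.a, self.b)
        )

    def add(self, v, label):
        """Store a labelled training vector in its bucket."""
        self.buckets[self.key(v)].append(label)

    def classify(self, v):
        """Assumed rule: majority label in the query's bucket, or None if empty."""
        labels = self.buckets.get(self.key(v), [])
        if not labels:
            return None
        return Counter(labels).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy document vectors (e.g. TF-IDF weights); the values are made up.
    lsh = PStableLSH(dim=3, k=4, w=4.0, seed=1)
    lsh.add([0.5, 0.0, 1.2], "sports")
    lsh.add([0.0, 2.0, 0.1], "finance")
    print(lsh.classify([0.5, 0.0, 1.2]))
```

Only the vectors that share the query's bucket are inspected, which is where the time saving over an exhaustive comparison comes from. Practical systems usually build several such tables with independent random projections to reduce the chance of missing a near neighbour.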
Keywords/Search Tags: Speech topic classification, Locality-Sensitive Hashing, KD tree, Vector Space Model, Stable Distributions