Research on Chinese Text Classification Based on an Improved Vector Space Model

Posted on: 2016-06-23
Degree: Master
Type: Thesis
Country: China
Candidate: K Zhou
GTID: 2298330452965404
Subject: Control Science and Engineering
Abstract/Summary:
With the development and maturing of information technology, especially Internet technology, open and free-form data sharing has led to a huge accumulation of information. On the one hand, people are eager to obtain adequate information; on the other hand, quickly and efficiently retrieving the needed content from large-scale information is becoming increasingly difficult, which is the so-called 'information misleading' phenomenon. Large-scale text processing has become a difficult problem, so tools for processing textual information are urgently needed. Automatic text classification technology has emerged in response to this need.

After comparing domestic and international developments in text classification, this paper gives a detailed introduction to several key technologies in text classification based on the Vector Space Model and analyzes several key factors that affect classification results. To address the high dimensionality and sparsity of the feature space after word segmentation, we propose a four-dimensional vector space model and design experiments using the support vector machine (SVM) algorithm to verify the validity of the model. Meanwhile, we design a method for self-constructing a Chinese category dictionary (SCC-Dict) by improving the traditional feature word weighting formula. This solves the problem that dictionary-based classification cannot be carried out in the absence of expert knowledge.

On the basis of SCC-Dict and the four-dimensional vector space model, we design a Chinese text classification system that focuses mainly on news. The system is composed of a storage module, a text preprocessing module, a segmentation module, an SCC-Dict building module, a vector mapping module and a classification module. It is a dynamic system; that is to say, every time it handles a task, it uses real-time information to construct the classifier instead of relying on a previous model and samples.

Finally, experimental results show that the classification method adopted in this paper achieves a certain improvement in classification accuracy and speed. The method has been applied to actual projects.
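
For readers unfamiliar with the conventional vector space model that the thesis sets out to improve, the following is a minimal sketch of how segmented documents are usually mapped to weighted term vectors. It assumes the "traditional" feature word weight is TF-IDF, which the abstract does not state, and it does not reproduce the four-dimensional model or SCC-Dict, whose definitions are not given here. The function names and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map pre-segmented documents (lists of tokens) to sparse TF-IDF vectors.

    Assumption: plain TF-IDF stands in for the 'traditional' weighting;
    the thesis's improved formula is not given in the abstract.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # One dimension per distinct term, so vectors are high-dimensional
        # and sparse once the vocabulary grows.
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, already-segmented news snippets (finance, sports, finance).
docs = [
    ["股市", "上涨", "投资"],
    ["球队", "比赛", "胜利"],
    ["股市", "下跌", "投资", "风险"],
]
vecs = tfidf_vectors(docs)
```

In this representation every distinct term becomes its own dimension, which is the source of the high dimensionality and sparsity that the abstract says the four-dimensional model is meant to address. In a setup like the one described, such vectors would then be passed to an SVM classifier.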
Keywords: Chinese Text Classification, SCC-Dict, SVM, Four-Dimensional Vector Space Model