
Research Of Text Information Distinguishing Based On Classifying

Posted on: 2014-02-18
Degree: Master
Type: Thesis
Country: China
Candidate: K X Wu
Full Text: PDF
GTID: 2248330398982993
Subject: Computer technology
Abstract/Summary:
With the rapid development of the Internet in recent years, the information people receive every day has grown richer. Most of this information is healthy and positive, but harmful content such as pornographic webpages, reactionary remarks, and violent or terrorist material is still easy to find. According to statistics, such content accounts for 12% of all Chinese webpages and exerts a negative influence on the development of network culture. Filtering harmful webpages quickly and effectively is therefore an important task in the development of network culture.

After years of development, technical means combined with regulation have been widely used to combat harmful information: technical means are first used to find the webpages and websites that contain harmful information, and corresponding penalties are then imposed according to the law. Although this approach is effective, technical limitations mean that a large share of harmful webpages remains difficult to identify and block. In recent years, advances in Chinese text processing have created new opportunities for harmful-webpage recognition. Among these technologies, Chinese text classification stands out: compared with keyword-based methods, it offers a lower misjudgment rate and a higher recognition rate. However, it still cannot meet the speed and accuracy requirements of processing webpages at massive scale.

We observe that most harmful webpages are concentrated in a small number of categories. This thesis therefore proposes a new recognition method called classify-distinguish. First, a KNN classifier, a multi-class classifier with a high recall rate and high speed, picks out the webpages that belong to "high-risk" categories. Then an SVM classifier, a fast and accurate binary classifier, determines whether each of those webpages contains harmful information.
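The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system built in the thesis: the function names, the toy training pages, the `high_risk` set, and the linear weights standing in for a trained SVM are all hypothetical, and pages are assumed to be already segmented and represented as sparse term-weight dictionaries.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_category(vec, labelled, k=3):
    """Stage 1: majority vote among the k most similar labelled pages."""
    nearest = sorted(labelled, key=lambda item: cosine(vec, item[0]),
                     reverse=True)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def svm_is_harmful(vec, weights, bias):
    """Stage 2: linear SVM decision rule, w . x + b > 0.
    The weights are assumed to come from a previously trained model."""
    score = sum(weights.get(t, 0.0) * w for t, w in vec.items()) + bias
    return score > 0

def classify_distinguish(vec, labelled, high_risk, weights, bias, k=3):
    """Return (category, is_harmful); only high-risk pages reach stage 2."""
    category = knn_category(vec, labelled, k)
    if category not in high_risk:
        return category, False  # normal page: the SVM is never consulted
    return category, svm_is_harmful(vec, weights, bias)

# Hypothetical toy data, for illustration only.
labelled = [
    ({"match": 1, "goal": 1}, "sports"),
    ({"team": 1, "goal": 1}, "sports"),
    ({"stock": 1, "market": 1}, "finance"),
    ({"forum": 1, "post": 1}, "forum"),
    ({"forum": 1, "reply": 1}, "forum"),
    ({"post": 1, "reply": 1}, "forum"),
]
high_risk = {"forum"}
weights, bias = {"banned_term": 2.0, "reply": -0.5}, -1.0

sports_page = classify_distinguish({"goal": 2, "team": 1},
                                   labelled, high_risk, weights, bias)
flagged_page = classify_distinguish({"forum": 1, "post": 1, "banned_term": 1},
                                    labelled, high_risk, weights, bias)
clean_forum_page = classify_distinguish({"forum": 1, "reply": 1},
                                        labelled, high_risk, weights, bias)
```

In this sketch the expensive second stage runs only for pages that stage 1 places in a high-risk category, which is the source of the speed-up the abstract claims.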
The advantage of this method is that it filters out most normal webpages and keeps only those belonging to "high-risk" categories, which reduces the load on the subsequent discrimination stage and improves processing speed while preserving overall discrimination accuracy.

To this end, the thesis covers the following research contents and innovations:
a) Studying Chinese word segmentation and comparing common word segmentation systems; mastering TF-IDF weighting, the Vector Space Model, and dimensionality reduction methods.
b) Analyzing the theory and characteristics of the NB, KNN, and SVM classifiers, and testing and verifying their performance.
c) Designing the classify-distinguish system based on the distribution of harmful webpages on the Internet, including the structure of the system model, the improvement of KNN, the modification of SVM, system implementation and testing, and the collection and organization of the related text corpus.
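As a small illustration of the TF-IDF weighting and Vector Space Model mentioned in item a), the sketch below turns already-segmented documents into sparse weight vectors. It assumes one common variant, tf = count / document length and idf = ln(N / df); the abstract does not say which variant the thesis uses, and the function name and sample tokens are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Map each document (a list of tokens) to a dict of TF-IDF weights.

    Assumed variant: tf = count / len(doc), idf = ln(N / df).
    Word segmentation is assumed to have been done beforehand.
    """
    n = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (c / len(doc)) * math.log(n / df[term])
            for term, c in counts.items()
        })
    return vectors

# Hypothetical segmented documents, for illustration only.
docs = [
    ["network", "culture", "network"],
    ["network", "filter"],
    ["culture", "policy"],
]
vecs = tfidf_vectors(docs)
```

Under this variant a term that occurs in every document receives weight 0, so it contributes nothing to the similarity between pages.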
Keywords/Search Tags: Chinese word segmentation, KNN, SVM, classify-distinguish cascading text classifiers