Research Of Text Information Distinguishing Based On Classifying

Posted on:2014-02-18

Degree:Master

Type:Thesis

Country:China

Candidate:K X Wu

Full Text:PDF

GTID:2248330398982993

Subject:Computer technology

Abstract/Summary:

With the rapid development of the Internet in recent years, the information people receive daily has grown ever richer. Most of this information is healthy and positive, but harmful content such as pornographic webpages, reactionary remarks, and violent or terrorist material remains easy to find. According to statistics, such content accounts for 12% of all Chinese webpages and exerts a negative influence on the development of network culture. Filtering bad webpages quickly and effectively is therefore an important task in the development of network culture.

After years of development, technical means combined with regulation have been widely used to prevent the spread of harmful information: technical means are first used to find webpages and websites that contain harmful information, and the corresponding penalties are then imposed according to the law. Although this approach is effective, technical limitations mean that a large share of bad webpages remains difficult to identify and block. In recent years, advances in Chinese text processing have given bad-webpage recognition new opportunities. Among these technologies, Chinese text classification stands out, offering a lower misjudgment rate and a higher recognition rate than keyword-based methods. However, it still cannot meet the speed and accuracy requirements posed by massive numbers of webpages.

We find that most bad webpages appear in a small number of categories. This thesis therefore puts forward a new recognition method called classify-distinguish. First, a KNN classifier, a multi-class classifier with high recall and speed, picks out the webpages that belong to "high-risk" categories. Then an SVM classifier, a fast and accurate binary classifier, determines whether each of those webpages contains bad information.
The advantage of this method is that it discards most normal webpages and keeps only those in "high-risk" categories, which reduces the load on the subsequent discrimination step and improves processing speed while preserving overall discrimination accuracy.

To this end, this thesis covers the following research content and innovations:

a) Studying Chinese word segmentation and comparing common word segmentation systems; mastering TF-IDF weighting, the Vector Space Model, and dimension reduction methods.
b) Analyzing the theory and characteristics of the NB, KNN, and SVM classifiers, and testing and verifying how the classifiers perform.
c) Designing the classify-distinguish system based on the distribution of bad webpages on the Internet, including the structure of the system model, the improvement of KNN, the modification of SVM, system implementation and testing, and the collection and organization of related text.
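
The two-stage idea described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the improved KNN and modified SVM are not described in this abstract, so the sketch assumes a plain cosine-similarity KNN and a plain linear SVM trained by subgradient descent on the hinge loss. All function names, the toy English tokens (standing in for segmented Chinese words), the "forum" high-risk category, and the parameters k, lam, eta and epochs are hypothetical choices made for illustration.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency over tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def vectorize(doc, idf):
    """L2-normalized sparse TF-IDF vector (Vector Space Model); unseen terms are dropped."""
    tf = Counter(t for t in doc if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {t: v / norm for t, v in vec.items()}

def dot(a, b):
    """Sparse dot product; equals cosine similarity for unit vectors."""
    return sum(v * b.get(t, 0.0) for t, v in a.items())

def knn_category(vec, train_vecs, train_cats, k=3):
    """Stage 1: majority vote among the k most similar training pages."""
    ranked = sorted(((dot(vec, tv), cat) for tv, cat in zip(train_vecs, train_cats)),
                    reverse=True)
    return Counter(cat for _, cat in ranked[:k]).most_common(1)[0][0]

def train_linear_svm(vecs, labels, lam=0.01, eta=0.1, epochs=100):
    """Stage 2 training: hinge-loss subgradient descent. Labels: +1 harmful, -1 normal."""
    w, b = {}, 0.0
    for _ in range(epochs):
        for x, y in zip(vecs, labels):
            violated = y * (dot(x, w) + b) < 1.0
            for t in w:                      # L2 regularization shrinks every weight
                w[t] *= 1.0 - eta * lam
            if violated:                     # margin violated: move toward the example
                for t, v in x.items():
                    w[t] = w.get(t, 0.0) + eta * y * v
                b += eta * y
    return w, b

def classify_distinguish(doc, idf, train_vecs, train_cats, high_risk, svm):
    """KNN picks the category; only high-risk pages reach the SVM."""
    vec = vectorize(doc, idf)
    category = knn_category(vec, train_vecs, train_cats)
    if category not in high_risk:
        return category, "normal"
    w, b = svm
    return category, ("harmful" if dot(vec, w) + b > 0 else "normal")

# Toy training set (hypothetical): (tokens, category, harmful?)
corpus = [
    (["football", "match", "goal"], "sports", False),
    (["team", "match", "score"], "sports", False),
    (["stock", "market", "price"], "finance", False),
    (["bank", "market", "loan"], "finance", False),
    (["forum", "post", "attack", "threat"], "forum", True),
    (["forum", "post", "recipe", "cooking"], "forum", False),
    (["forum", "thread", "weapon", "threat"], "forum", True),
    (["forum", "thread", "travel", "photo"], "forum", False),
]
high_risk = {"forum"}
idf = fit_idf([tokens for tokens, _, _ in corpus])
train_vecs = [vectorize(tokens, idf) for tokens, _, _ in corpus]
train_cats = [cat for _, cat, _ in corpus]

# The SVM is trained only on pages from high-risk categories.
risky = [(v, 1 if bad else -1)
         for v, (_, cat, bad) in zip(train_vecs, corpus) if cat in high_risk]
svm = train_linear_svm([v for v, _ in risky], [y for _, y in risky])

pages = [
    ["football", "goal", "score"],
    ["forum", "post", "threat", "weapon"],
    ["forum", "thread", "recipe", "photo"],
]
results = [classify_distinguish(p, idf, train_vecs, train_cats, high_risk, svm)
           for p in pages]
print(results)
```

In this sketch the first page never reaches the SVM, which mirrors the claimed benefit: the more expensive binary decision is made only for pages that the first stage assigns to a high-risk category.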

Keywords/Search Tags:

Chinese word segmentation, KNN, SVM, classify-distinguish cascading text classifiers

Related items

1	Chinese Word Auto-segmentation Design And Algorithm Realization For Chinese Network Information Retrieval
2	Research And Implementation Of Chinese Word Segmentation Algorithm
3	Comparative Research On Open-Source Chinese Word Segmentation Machines
4	The Research And Implemenation Of The Chinese Word Segmentation System Combining Omini-segmentation With Statistic
5	The Research And Implemenation Of The Chinese Word Segmentation System Combining Omini-Segmentation With Statistic
6	Research On Overlapping Ambiguity Treatment For Chinese Word Segmentation
7	Based On The Understanding Of The Chinese Word System Design And Realization
8	Research On Cross-domain Chinese Word Segmentation Method Based On New Word Discovery
9	Research Of Chinese Word Segmentation In BERSE
10	Research Of Combined Chinese Word Segmentation Method