Research On Multi-strategy Keywords Extraction And Quick Text Classification

Posted on:2013-04-16

Degree:Master

Type:Thesis

Country:China

Candidate:H Y Wang

Full Text:PDF

GTID:2298330467478323

Subject:Computer system architecture

Abstract/Summary:

In recent years, as the volume of information on the network has grown, efficient retrieval and classification technologies have become increasingly important, and natural language processing has attracted wide attention. Two of its major application areas are keyword extraction and text classification. After studying several common methods, this paper proposes some improvements.

Traditional keyword extraction methods usually rely on statistical information such as word frequency, word position, and N-grams. These methods are simple, but their accuracy is unsatisfactory. For example, in texts about politics the word "claim" appears frequently whenever officials such as state leaders or government spokesmen are quoted. Such a word is obviously not a keyword, yet the TF-IDF method tends to select it because it does not appear widely in other classes. Adding semantic analysis is a good remedy: by identifying the class of the test text and weighting words by their semantic relevance to the topic, many improper keywords can be removed.

Second, this paper proposes a keyword extraction method based on word groups and word co-occurrence information. Word groups represent a text better than single words, and the extracted keywords carry more information, which improves the accuracy of keyword extraction. Precision and recall both improve substantially, which demonstrates the effectiveness of the method.

Finally, this paper proposes a quick text classification method based on two-stage classification and CHI statistical information. Traditional methods use complex models that require heavy vector-distance computation. In practical applications, however, especially for the mobile Internet on personal mobile devices, it is increasingly necessary for the algorithm to be fast.
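The "claim" problem described earlier in this abstract can be reproduced on a toy corpus. The following Python sketch is not the thesis's implementation: the three one-document classes, the `tf_idf` helper, and the `topic_weight` table are all invented for illustration, and the fixed weight merely stands in for whatever semantic analysis the thesis actually performs.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n / df[t])
                       for t, c in tf.items()})
    return scores

# Toy corpus (invented): one concatenated document per class.
docs = [
    "spokesman claim policy claim summit claim".split(),  # politics
    "team match goal match score".split(),                # sports
    "stock market price market bank".split(),             # finance
]

politics = tf_idf(docs)[0]
# "claim" is frequent in politics and absent elsewhere, so it ranks first.
top_before = max(politics, key=politics.get)

# Hypothetical semantic weights: a low value marks a word as weakly
# related to the topic of the class. These numbers are assumptions.
topic_weight = {"claim": 0.1}
reweighted = {t: s * topic_weight.get(t, 1.0) for t, s in politics.items()}
top_after = max(reweighted, key=reweighted.get)

print(top_before, top_after)
```

With plain TF-IDF the top-ranked word is "claim"; once the hypothetical topic weight is applied, a content word takes its place.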
This paper therefore seeks a new method with fast performance. It is a positive classification process, and it classifies in two stages. In the first stage, a simple method classifies quickly and removes many irrelevant classes. In the second stage, a more complex, high-performance method classifies accurately. The CHI statistical information is computed during training and then used directly during testing, which is what makes this a positive classification method. In experiments, the average F1 value of this method was 86.38%, and the average F1 value of two-stage classification was 90.32%. The new method greatly improved time performance, which makes it useful for practical applications. The results show that the new method is satisfactory.
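The two-pass idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's algorithm: the abstract does not say which simple and complex classifiers are used, so the first pass here sums precomputed CHI weights and the second pass uses cosine similarity to class centroids. The function names, the cutoff `k`, and the toy training set are all hypothetical. The CHI score is the standard chi-square statistic N(AD - BC)^2 / ((A+C)(B+D)(A+B)(C+D)), where A and B count documents containing the term inside and outside the class, and C and D count documents lacking it inside and outside the class.

```python
import math
from collections import Counter, defaultdict

def chi_table(train):
    """Precompute {label: {term: chi2}} once, at training time."""
    n = len(train)
    class_size = Counter(label for _, label in train)
    term_docs = Counter()                    # documents containing the term
    term_class_docs = defaultdict(Counter)   # same count, per class
    for tokens, label in train:
        for t in set(tokens):
            term_docs[t] += 1
            term_class_docs[label][t] += 1
    table = {}
    for label, size in class_size.items():
        table[label] = {}
        for t, a in term_class_docs[label].items():
            b = term_docs[t] - a      # term present, other classes
            c = size - a              # term absent, this class
            d = n - size - b          # term absent, other classes
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            # Keep only terms positively associated with the class.
            if denom and a * d > b * c:
                table[label][t] = n * (a * d - b * c) ** 2 / denom
    return table

def centroids(train):
    """Sum term counts per class (used by the slower second pass)."""
    sums = defaultdict(Counter)
    for tokens, label in train:
        sums[label].update(tokens)
    return sums

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def classify(tokens, table, cents, k=2):
    # Pass 1: cheap table lookups, keep only the k best classes.
    coarse = {label: sum(w.get(t, 0.0) for t in tokens)
              for label, w in table.items()}
    candidates = sorted(coarse, key=coarse.get, reverse=True)[:k]
    # Pass 2: costlier vector comparison, restricted to the candidates.
    vec = Counter(tokens)
    return max(candidates, key=lambda label: cosine(vec, cents[label]))

# Toy training set (invented for illustration).
train = [
    ("election vote government policy".split(), "politics"),
    ("government minister vote law".split(), "politics"),
    ("match goal team score".split(), "sports"),
    ("team coach match win".split(), "sports"),
    ("stock market bank price".split(), "finance"),
    ("bank market profit stock".split(), "finance"),
]
table = chi_table(train)
cents = centroids(train)
print(classify("government vote debate".split(), table, cents))
```

Because the CHI table is built during training, the first pass at test time is only dictionary lookups and additions. The vector comparison then runs on `k` classes instead of all of them, which is where a time saving of the kind the abstract reports could come from.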

Keywords/Search Tags:

keywords extraction, semantic analysis, text classification

Related items

1	The Research Of Keywords Extraction Algorithm In Text Mining
2	Automatic Extraction Of Keywords And Text Summarization In Text Mining
3	Research On Keywords Extraction From Weibos Based On Semantic Association Between Image And Text
4	Research Of Text Mining Based On Semantic Analysis
5	Study On Extraction Of Uygur Keywords In Public Opinion Analysis
6	Semantic Feature Extraction Algorithm, The Contents Of Text Classification
7	Illegal Experimental Application Classifier Based On Keywords
8	Based On A Summary Of The Semantic Relation Extraction
9	Narrowing down the semantic gap between content and context using multimodal keywords
10	User Web Information Collection And Analysis System Based On The Smart Router