Research On Multi-strategy Keywords Extraction And Quick Text Classification

Posted on: 2013-04-16
Degree: Master
Type: Thesis
Country: China
Candidate: H Y Wang
GTID: 2298330467478323
Subject: Computer system architecture
Abstract/Summary:
In recent years, as the amount of information on the network has grown, efficient retrieval and classification technologies have become increasingly important, and natural language processing has attracted wide attention. Two of its major application fields are keyword extraction and text classification. After studying several common methods, this paper proposes some improvements.

Traditional keyword extraction methods usually rely on statistical information such as word frequency, word location, and N-grams. These methods are simple, but their accuracy is not satisfactory. For example, in texts about politics, the word "claim" appears frequently whenever a speech by officials such as state leaders or government spokesmen is reported. Obviously, a word like that cannot be a keyword. However, the TF-IDF method tends to select it as a keyword because it does not appear widely in other classes. Adding semantic analysis is a good solution to this problem: by finding the class of the test text and weighting words by their semantic relevance to the topic, we can remove many irrelevant words and thus many improper keywords.

Secondly, this paper proposes a keyword extraction method based on word groups and word co-occurrence information. Word groups represent a text better than single words, and the extracted keywords carry more information, which improves the accuracy of keyword extraction. Both precision and recall improve considerably, which demonstrates the effectiveness of this method.

Finally, this paper proposes a quick text classification method based on two-pass classification and CHI statistical information. Traditional methods use complex models and require a large amount of vector distance computation. In practical applications, however, especially on the mobile Internet and personal mobile devices, it is increasingly necessary for the algorithm to be fast. This paper therefore seeks a new method with fast performance. It is a positive classification process that uses two passes. In the first pass, a simple method classifies quickly and removes many irrelevant classes. In the second pass, a more complex, higher-performance method classifies accurately. The CHI statistical information is computed in the training stage and then used directly in the testing stage, which makes this a positive classification method. In the experiments, the average F1 value of this method was 86.38%, and the average F1 value of the two-pass classification was 90.32%. The new method also greatly improved time performance. These results show that the new method is satisfactory and useful for practical applications.
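
The two-pass idea described in the abstract can be illustrated with a small sketch. The abstract does not specify the concrete methods used in either pass, so everything below is an assumption made for illustration: the toy corpus, the function names, the use of summed CHI weights as the cheap first pass, the cosine similarity against class centroids as the more expensive second pass, and the choice to keep only positively associated terms. The CHI score itself is the standard chi-square statistic for a term and a class, N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), computed once at training time.

```python
import math
from collections import Counter, defaultdict


def chi_square_table(docs):
    """Precompute chi-square scores per (class, term) from labeled token lists."""
    n = len(docs)
    class_count = Counter(label for label, _ in docs)
    term_count = Counter()
    term_class_count = defaultdict(Counter)
    for label, tokens in docs:
        for term in set(tokens):  # document frequency, not raw frequency
            term_count[term] += 1
            term_class_count[label][term] += 1

    table = {}
    for label, n_c in class_count.items():
        scores = {}
        for term, n_t in term_count.items():
            a = term_class_count[label][term]  # in class, contains term
            b = n_t - a                        # other classes, contains term
            c = n_c - a                        # in class, lacks term
            d = n - n_c - b                    # other classes, lacks term
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            # Assumption: keep only terms positively associated with the class.
            if denom == 0 or a * d <= b * c:
                continue
            scores[term] = n * (a * d - b * c) ** 2 / denom
        table[label] = scores
    return table


def coarse_candidates(table, tokens, keep=2):
    """First pass: cheap lookup of precomputed CHI weights, keep the top classes."""
    terms = set(tokens)
    scores = {label: sum(w.get(t, 0.0) for t in terms) for label, w in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:keep]


def class_centroids(docs):
    """Term-frequency centroid per class, used only by the second pass."""
    cents = defaultdict(Counter)
    for label, tokens in docs:
        cents[label].update(tokens)
    return cents


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify(table, cents, tokens, keep=2):
    """Second pass: run the costlier similarity only on the surviving classes."""
    candidates = coarse_candidates(table, tokens, keep)
    vec = Counter(tokens)
    return max(candidates, key=lambda label: cosine(vec, cents[label]))


# Hypothetical toy corpus, for illustration only.
docs = [
    ("sports", ["match", "goal", "team"]),
    ("sports", ["team", "coach", "goal"]),
    ("politics", ["minister", "claim", "vote"]),
    ("politics", ["vote", "election", "minister"]),
    ("tech", ["phone", "chip", "software"]),
    ("tech", ["software", "phone", "network"]),
]
table = chi_square_table(docs)
cents = class_centroids(docs)
print(classify(table, cents, ["vote", "minister", "speech"]))
```

Because the CHI table is built during training, the first pass at test time reduces to dictionary lookups and additions, and the vector computation in the second pass is restricted to the few classes that survive pruning.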
Keywords/Search Tags: keywords extraction, semantic analysis, text classification