
Keyword Extraction Method Research Based On Improved TFIDF And Spectral Division

Posted on: 2013-02-15  Degree: Master  Type: Thesis
Country: China  Candidate: G S Xiao  Full Text: PDF
GTID: 2218330371991474  Subject: Computer software and theory
Abstract/Summary:
Keywords are the words or phrases in a document that are specific to it and reflect its theme. Extracting them from a document by automated means is called automatic keyword extraction. It is the basis of, and one of the key technologies for, classification, retrieval and summarization in automatic text processing.

By theoretical basis, keyword extraction methods can be divided into statistical analysis, linguistic analysis, artificial intelligence and others. Statistical methods compute a weight for each word or phrase from statistical information and take the highest-weighted ones as keywords. TFIDF (Term Frequency-Inverse Document Frequency) is a widely used weighting method of this kind: it represents the feature weight as the product of TF (term frequency) and IDF (inverse document frequency). Because the traditional TFIDF algorithm is based solely on word frequency, it can go wrong in two ways: some low-frequency words that are not very representative of the document's theme receive very high IDF values, while some high-frequency words that reflect the theme well receive very low IDF values. This thesis designs an improved TFIDF algorithm that considers the frequency, property (part of speech), length and position in the document of each word or phrase.

Word co-occurrence frequency is also important information for statistical keyword extraction. Simply weighting the co-occurrence frequency into the weight of the candidate keywords, however, does not extract keywords well. To improve extraction accuracy, this thesis designs a method that applies Spectral Division, a graph-based classification technique, to keyword extraction.
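To make the traditional TFIDF weighting described above concrete, here is a minimal Python sketch. It is not the thesis's implementation: the function names `tfidf_weights` and `top_keywords`, the toy documents, and the specific choices of relative term frequency for TF and an unsmoothed natural logarithm for IDF are all assumptions made for illustration. The improved weighting (property, length, position) is not reproduced because the abstract does not give its formula.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per document, with weight = TF * IDF.

    Assumed definitions (not taken from the thesis):
      TF  = count of the term in the document / document length
      IDF = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

def top_keywords(weight_map, k=3):
    """Pick the k highest-weighted terms; ties are broken alphabetically."""
    ranked = sorted(weight_map.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

# Toy, already-tokenized documents (hypothetical data).
docs = [
    ["contract", "law", "contract", "court"],
    ["market", "price", "law", "market"],
    ["network", "protocol", "network", "law"],
]
weights = tfidf_weights(docs)
print(top_keywords(weights[0], k=2))
```

In this toy corpus "law" occurs in every document, so its IDF is ln(3/3) = 0 and its weight vanishes no matter how well it reflects a document's theme. That is the second failure mode described above, and it is the kind of case the improved algorithm is meant to address.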
The main idea of the algorithm is as follows. First, build a similarity graph over the words of the text based on word co-occurrence. Second, classify the candidate words with the Spectral Division algorithm and count the number of words in the class each word belongs to. Then compute a weighted combination of the amended weight value and this count. Finally, sort the words by their weighted value and extract the keywords.

The experimental data set consists of 100 papers from each of three categories (civil law, science and technology, and economy) selected from "China Paper Download Center" (http://www.studa.net/). In tests that extract keywords with traditional TFIDF, improved TFIDF and the method based on Spectral Division, the precision, recall and F1 value of improved TFIDF are clearly higher than those of traditional TFIDF, and the method based on Spectral Division performs best of the three.
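The graph-based steps above can be sketched in Python with NumPy. This is an illustration under stated assumptions, not the thesis's algorithm: two words are assumed to co-occur when they share a sentence, the partition is a single two-way split by the sign of the Fiedler vector of the unnormalized graph Laplacian (a standard spectral bisection, swapped in because the abstract does not specify its Spectral Division procedure), and the combination rule with the mixing parameter `alpha` is hypothetical. The names `cooccurrence_graph`, `spectral_bisection`, `rank_keywords` and `base_weight` are likewise made up for this example.

```python
from itertools import combinations

import numpy as np

def cooccurrence_graph(sentences):
    """Build a symmetric co-occurrence matrix over the vocabulary.

    Assumption: two words co-occur once for every sentence they share.
    """
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    W = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for u, v in combinations(sorted(set(s)), 2):
            W[index[u], index[v]] += 1.0
            W[index[v], index[u]] += 1.0
    return vocab, W

def spectral_bisection(W):
    """Split the graph in two by the sign of the Fiedler vector of L = D - W.

    Assumes at least two words and a connected graph.
    """
    L = np.diag(W.sum(axis=1)) - W
    _, eigvecs = np.linalg.eigh(L)  # eigenvalues come back in ascending order
    fiedler = eigvecs[:, 1]         # eigenvector of the second-smallest eigenvalue
    return (fiedler >= 0).astype(int)

def rank_keywords(sentences, base_weight, alpha=0.5):
    """Combine a per-word base weight with the relative size of the word's class.

    The linear mix below is a hypothetical stand-in for the thesis's
    weighted calculation, which the abstract does not spell out.
    """
    vocab, W = cooccurrence_graph(sentences)
    labels = spectral_bisection(W)
    sizes = np.bincount(labels, minlength=2)
    scores = {}
    for i, w in enumerate(vocab):
        class_share = sizes[labels[i]] / len(vocab)
        scores[w] = alpha * base_weight.get(w, 0.0) + (1 - alpha) * class_share
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Toy tokenized sentences and base weights (hypothetical data).
sentences = [
    ["contract", "court", "law"],
    ["contract", "court", "law"],
    ["market", "price"],
    ["market", "price"],
    ["law", "market"],
]
base = {"contract": 0.5, "court": 0.3, "law": 0.1, "market": 0.5, "price": 0.3}
for word, score in rank_keywords(sentences, base):
    print(word, round(score, 3))
```

On this toy input the split separates the legal terms from the economic ones, so a word in the larger class gets a boost on top of its base weight. A real pipeline would need word segmentation, a multi-way partition, and the thesis's own weighting formula.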
Keywords/Search Tags: Improved TFIDF, Spectral Division, Keyword extraction