
Research On Keyword Extraction And Improved LSA Based On Co-occurrence Word

Posted on: 2018-09-07
Degree: Master
Type: Thesis
Country: China
Candidate: Y X Y Gong
GTID: 2428330512493969
Subject: Computer application technology
Abstract/Summary:
Extracting the topics of a text is a fundamental task for quickly locating the information users need. This paper studies algorithms for extracting subject words (keywords) from text. Computing word weights is the most basic and most critical problem in topic extraction. We compute word weights from co-occurrence words constructed by mutual information, together with word frequency, part of speech, and word position, and use these weights to build a co-occurrence matrix. Applying Singular Value Decomposition (SVD), the core of latent semantic analysis (LSA), yields a similarity matrix in the latent semantic space. After k-means clustering, the few co-occurrence words with the largest mutual information values are selected as the keywords of the article. The main contents and innovations of this paper are as follows.

For word weighting, the traditional TF-IDF (Term Frequency–Inverse Document Frequency) algorithm is easy to understand and to compute. However, because it ignores part of speech, word position, word length, and other word characteristics, its weights cannot accurately measure how much a word contributes to the text. This paper therefore first takes part of speech into account: statistics over a large corpus identify four main parts of speech (nouns, verbs, adjectives, and adverbs) with relative proportions of 61.98%, 29.19%, 3.82%, and 5.01%, respectively. Based on these proportions, we improve the traditional TF-IDF algorithm to obtain POS_TF-IDF, a TF-IDF algorithm based on part of speech.

The BOW (Bag of Words) model assumes that words are independent of one another and therefore ignores the correlation between them. To make up for this defect, this paper proposes a word-weighting method in which the contribution of co-occurring words is calculated by mutual information, and demonstrates the correctness and rationality of this calculation. In addition, the text paragraph is chosen as the window, and the weight varies with the word's position in the paragraph: the first sentence, the last sentence, or a middle sentence. Taking part of speech, co-occurrence words, and word position into consideration, this paper proposes the COVSM model. Its word weights not only make up for the traditional TF-IDF algorithm's treatment of words in isolation, but also incorporate word position.

The LSA model is used to extract the keywords, and the key step in LSA is SVD. This paper explains the mathematical theory of SVD and its physical meaning in text analysis, illustrating how the left and right singular matrices relate to document and vocabulary correlation. After clustering on the left singular matrix of the documents, the three words with the highest weights are taken as the extracted keywords. The experimental results verify the correctness of the keywords extracted by this algorithm.
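
The two word-weighting ideas in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the thesis's implementation. The abstract does not say how the part-of-speech proportions enter the formula, so the sketch simply multiplies a standard TF-IDF score by the proportion for the word's part of speech. It also uses pointwise mutual information over paragraph-level co-occurrence as one plausible reading of "mutual information". The toy corpus, the part-of-speech tags, and the function names `pos_tf_idf` and `pmi` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Part-of-speech proportions reported in the abstract.
# Using them directly as multiplicative weights is an assumption of this sketch.
POS_WEIGHT = {"noun": 0.6198, "verb": 0.2919, "adj": 0.0382, "adv": 0.0501}


def pos_tf_idf(docs, pos_tags):
    """Return one {word: weight} dict per document.

    docs: list of non-empty token lists.
    pos_tags: hypothetical {word: part of speech} map; untagged words get weight 0.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        scores = {}
        for word, count in counts.items():
            tf = count / len(doc)
            idf = math.log(n_docs / doc_freq[word])
            scores[word] = tf * idf * POS_WEIGHT.get(pos_tags.get(word), 0.0)
        weights.append(scores)
    return weights


def pmi(paragraphs):
    """Pointwise mutual information of word pairs that share a paragraph."""
    n = len(paragraphs)
    word_count = Counter(w for p in paragraphs for w in set(p))
    # Pairs are stored in alphabetical order so (a, b) and (b, a) coincide.
    pair_count = Counter(
        pair for p in paragraphs for pair in combinations(sorted(set(p)), 2)
    )
    return {
        (a, b): math.log((c / n) / ((word_count[a] / n) * (word_count[b] / n)))
        for (a, b), c in pair_count.items()
    }


# Toy data, invented for illustration only; each list stands for one paragraph.
docs = [
    ["keyword", "extraction", "uses", "weight"],
    ["weight", "depends", "quickly", "keyword"],
    ["semantic", "analysis", "uses", "matrix"],
]
pos_tags = {
    "keyword": "noun", "extraction": "noun", "weight": "noun",
    "semantic": "adj", "analysis": "noun", "matrix": "noun",
    "uses": "verb", "depends": "verb", "quickly": "adv",
}

weights = pos_tf_idf(docs, pos_tags)
pmi_scores = pmi(docs)
```

In this toy corpus a noun that appears in only one paragraph outweighs both a noun shared by two paragraphs and a verb, which is the qualitative effect the part-of-speech weighting is meant to have. A fuller pipeline would feed such weights into a term-document matrix before the SVD and clustering steps, which are not shown here.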
Keywords/Search Tags: TF-IDF, Co-occurrence Word, LSA, SVD
Related items