Research On Keyword Extraction And Improved LSA Based On Co-occurrence Word

Posted on:2018-09-07

Degree:Master

Type:Thesis

Country:China

Candidate:Y X Y Gong

Full Text:PDF

GTID:2428330512493969

Subject:Computer application technology

Abstract/Summary:

Extracting information topics is a fundamental task for quickly locating what users need. This paper studies algorithms for extracting the subject words (keywords) of a text. Computing lexical weight is the most basic and most critical problem in topic extraction. We build a co-occurrence matrix by computing lexical weights from co-occurrence words constructed with mutual information, together with word frequency, part of speech, and word position. Applying Singular Value Decomposition (SVD), as used in latent semantic analysis (LSA), yields the similarity matrix of the latent space. After k-means clustering, the few co-occurrence words with the largest mutual information values are selected as the keywords of the article. The main contents and innovations of this paper are as follows.

For word weighting, the traditional TF-IDF (Term Frequency-Inverse Document Frequency) algorithm is easy to understand and easy to compute. However, because it ignores part of speech, word position, word length, and other word characteristics, its weights cannot accurately measure a word's contribution to the text. This paper therefore first takes part of speech into account. Statistics over a large corpus give the relative proportions of the four main parts of speech, nouns, verbs, adjectives, and adverbs: 61.98%, 29.19%, 3.82%, and 5.01%, respectively. Based on these proportions, we improve the traditional TF-IDF algorithm to obtain the POS_TF-IDF algorithm proposed in this paper, that is, a TF-IDF algorithm based on part of speech.

The BOW (Bag of Words) model assumes that words are independent of each other and so ignores the correlation between them. To make up for this defect, this paper proposes a method for calculating word weights that accounts for co-occurrence. The contribution of co-occurring words is calculated with mutual information, and the correctness and rationality of this calculation are demonstrated. In addition, the text paragraph is chosen as the window size, and the weight varies with the position within the paragraph: the first sentence, the last sentence, or a middle sentence. Taking part of speech, co-occurrence words, and word position into consideration, this paper proposes the COVSM model. The word weight calculation in this model not only makes up for the traditional TF-IDF algorithm's treatment of words in isolation, but also adds word position as an influencing factor.

The LSA model is used to extract the keywords, and the key step in LSA is SVD. This paper explains the mathematical theory of SVD and demonstrates its physical meaning in text analysis, illustrating how the left and right singular matrices relate to document and vocabulary correlation. After clustering the left singular matrix of the documents, the three words with the highest weights are taken as the extracted keywords. The experimental results verify the correctness of the keywords extracted by this algorithm.
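
The abstract does not give the exact formulas, so the following is only a minimal Python sketch of two of the ideas it describes: a part-of-speech-weighted TF-IDF and a mutual-information score for co-occurring words. Several things here are assumptions rather than the thesis's method: the function names `pos_tf_idf` and `pmi`, the choice to multiply TF-IDF directly by the reported part-of-speech proportion, the smoothed IDF formula, and the use of pointwise mutual information over paragraph-sized windows. Sentence-position weighting, SVD, and k-means clustering are left out.

```python
import math
from collections import Counter
from itertools import combinations

# Part-of-speech proportions reported in the abstract, used here as multipliers.
# ASSUMPTION: the abstract does not say how these ratios enter the weight.
POS_WEIGHT = {"noun": 0.6198, "verb": 0.2919, "adj": 0.0382, "adv": 0.0501}


def pos_tf_idf(docs):
    """Return one {word: weight} dict per document.

    docs: list of documents, each a list of (word, pos) pairs.
    weight = tf * idf * POS_WEIGHT[pos], with a smoothed
    idf = log((1 + N) / (1 + df)) + 1 (an assumed variant).
    """
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update({word for word, _ in doc})  # count each word once per doc

    weights = []
    for doc in docs:
        counts = Counter(word for word, _ in doc)
        pos_of = {word: pos for word, pos in doc}  # last tag seen wins
        total = len(doc)
        scores = {}
        for word, count in counts.items():
            tf = count / total
            idf = math.log((1 + n_docs) / (1 + df[word])) + 1
            # Words with an unlisted part of speech get weight 0.
            scores[word] = tf * idf * POS_WEIGHT.get(pos_of[word], 0.0)
        weights.append(scores)
    return weights


def pmi(windows):
    """Pointwise mutual information for word pairs sharing a window.

    windows: list of word lists, e.g. one list per paragraph.
    PMI(a, b) = log(p(a, b) / (p(a) * p(b))), with probabilities
    estimated as the fraction of windows containing the word(s).
    """
    n = len(windows)
    word_count = Counter()
    pair_count = Counter()
    for window in windows:
        vocab = sorted(set(window))  # sorted so each pair has one key order
        word_count.update(vocab)
        pair_count.update(combinations(vocab, 2))
    return {
        (a, b): math.log(c * n / (word_count[a] * word_count[b]))
        for (a, b), c in pair_count.items()
    }
```

Under these assumptions, a noun outranks an equally frequent verb in the same document, and a word pair that shares windows more often than chance would predict gets a positive score. In a fuller pipeline these scores would feed the co-occurrence matrix that is then decomposed with SVD.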

Keywords/Search Tags:

TF-IDF, Co-occurrence Word, LSA, SVD

Related items

1	Research On Keyword Extraction And Improved LSA Based On Co-occurrence Word
2	Research About Micro-blog Hot Topics Discovery Based On Optimized TF-TDF And Word Co-occurrence Model
3	Scientific Paper Discrimination Method Research Based-on Word Co-Occurrence Network And Support Vector Machine
4	Research On The Language Model Information Retrieval Method Based On Word Co-occurrence
5	A Novel Chinese Subjective Sentences Recognition Method Based On Word Co-occurrence Relationship Graphic Model
6	The Research Of Micro-Blog New Emotion Words Recognition And Orientation Judgment Based On Word2Vec
7	The Description Of Text's Feature Based On Semanteme Concept
8	The Application Of Co-occurrence Analysis In The Recognition Of The Features Of Discipline Intersecting
9	Research On Relocation Algorithms In Multiple Scenes Based On Image Features
10	Research On The Model Of Word Embedding Based On Word2Vec