
Automatic Keyword Extraction Algorithms Based On Word Embedding And Multiple Features Fusion

Posted on: 2020-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: Y Z Dai
Full Text: PDF
GTID: 2428330575494688
Subject: System theory
Abstract/Summary:
Owing to the rapid development of information technology, the volume of information in every field has grown explosively. To retrieve the desired information from massive amounts of text quickly and effectively, people usually rely on a few important words that capture the main idea of a text, namely keywords. However, most texts do not come with keywords, which hinders the acquisition and processing of textual information. To address this problem, automatic keyword extraction was proposed, i.e., letting a computer extract keywords from a text automatically according to some method, and a large number of keyword extraction algorithms have been designed. However, most existing algorithms rely on a text corpus and suffer from problems such as high computational complexity, weak applicability, and low extraction precision. Research on automatic keyword extraction has therefore remained active and is attracting increasing attention in the current era of big data.

Against this background, this thesis focuses on automatic keyword extraction and proposes two algorithms: a keyword extraction algorithm based on position-weighted word frequency statistics (PW_TF) and a keyword extraction algorithm based on multiple features fusion and a graph model (MF_Rank).

The main idea of PW_TF is to represent keywords using the statistical and structural features of words. The statistical feature describes the frequency of a word in the text, and the structural feature reflects its position in the text; words in different positions have different importance. PW_TF is simple and easy to implement, but it considers only a word's position and frequency and ignores semantic information.

MF_Rank builds on classical graph model algorithms (e.g., TextRank). Its main idea is to fuse the semantic features extracted with word embedding technology with the statistical and structural features of words (multiple features fusion) to obtain the importance weight of each word (a node of the graph) and the attractiveness weight between words (an edge of the graph). The keywords are then determined from the final weight of each word, which is computed by iterating the graph model algorithm.

To verify the performance of the proposed algorithms, a large number of simulation experiments were conducted on three different corpus datasets. The experimental results show that, compared with the existing word frequency statistics method and graph model method, the keyword extraction algorithms proposed in this thesis improve performance by up to 6.45% and 20.36%, respectively. Moreover, compared with PW_TF, MF_Rank achieves a maximum performance increase of 1.76%. Both PW_TF and MF_Rank adapt well without relying on a corpus and can be applied directly to keyword extraction from a single text; the experimental results also show that both algorithms are feasible and effective.
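
The two ideas described above can be illustrated with a minimal Python sketch. The abstract does not give the actual formulas, so everything specific here is an assumption made for illustration: the function names (`pw_tf_scores`, `mf_rank_scores`, `top_k`), the position weighting (words in the first 20% of the text count double), the edge weight (co-occurrence within a window, scaled by embedding cosine similarity), and the damping factor are not taken from the thesis. The second function is a TextRank-style personalized ranking, not the author's MF_Rank.

```python
import math
from collections import defaultdict


def pw_tf_scores(tokens, head_frac=0.2, head_weight=2.0):
    """Position-weighted term frequency (assumed weighting, not the thesis formula)."""
    head = max(1, int(len(tokens) * head_frac))
    scores = defaultdict(float)
    for i, tok in enumerate(tokens):
        # An occurrence in the leading part of the text counts more than later ones.
        scores[tok] += head_weight if i < head else 1.0
    return dict(scores)


def top_k(scores, k):
    """Return the k highest-scoring words, ties broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mf_rank_scores(tokens, vectors, window=2, damping=0.85, iters=50):
    """Graph ranking with node priors and weighted edges (illustrative only)."""
    if not tokens:
        return {}
    # Node importance: normalized position-weighted frequency (assumption).
    raw = pw_tf_scores(tokens)
    total = sum(raw.values())
    prior = {w: s / total for w, s in raw.items()}

    # Edge weight: co-occurrence within `window` tokens, scaled by embedding
    # similarity mapped to [0, 1]; words without a vector get similarity 0.
    weight = defaultdict(lambda: defaultdict(float))
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if u == v:
                continue
            sim = cosine(vectors[u], vectors[v]) if u in vectors and v in vectors else 0.0
            w = (1.0 + sim) / 2.0
            weight[u][v] += w
            weight[v][u] += w

    out_sum = {u: sum(nbrs.values()) for u, nbrs in weight.items()}
    score = dict(prior)
    for _ in range(iters):
        # Each word keeps a share of its prior and receives score from neighbors.
        new = {w: (1 - damping) * prior[w] for w in prior}
        for u, nbrs in weight.items():
            if out_sum[u] == 0:
                continue
            for v, w_uv in nbrs.items():
                new[v] += damping * score[u] * w_uv / out_sum[u]
        score = new
    return score


if __name__ == "__main__":
    tokens = "graph model keyword extraction graph keyword text keyword".split()
    toy_vectors = {"graph": [1.0, 0.0], "model": [0.8, 0.2], "keyword": [0.1, 1.0],
                   "extraction": [0.2, 0.9], "text": [0.5, 0.5]}  # made-up embeddings
    print(top_k(pw_tf_scores(tokens), 2))
    print(top_k(mf_rank_scores(tokens, toy_vectors), 2))
```

Both functions work on a single tokenized text and need no background corpus, which mirrors the single-text setting the abstract emphasizes; in practice the toy vectors would be replaced by pretrained word embeddings.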
Keywords/Search Tags:Automatic Keyword Extraction, Multiple Features Fusion, Graph Model, Single Text, Word Embedding