Keyphrase Extraction Algorithm Integrating Semantic Features And Learning To Rank

Posted on: 2022-10-03, Degree: Master, Type: Thesis
Country: China, Candidate: Q Hu
GTID: 2518306539492074, Subject: Computer Science and Technology
Abstract/Summary:
In the era of big data, information is growing explosively, and this rapid growth poses great challenges for today's information society. How to accurately identify the data that is needed and make effective use of it has become one of the urgent problems to be solved. Automatic keyphrase extraction emerged in response to this problem: by using keyphrases to reflect the main content of a text, it eases the difficulty of locating the required information. Keyphrase extraction has long been an important task in natural language processing (NLP) and plays an important role in information retrieval, text classification, question answering, and other fields. Its most common application is in search engines, which retrieve related items from the keywords a user enters. However, a large amount of text is not labeled with keyphrases, and manual annotation is time-consuming, laborious, and inefficient, so research on automatic keyphrase extraction methods is necessary.

Supervised keyphrase extraction methods based on classification or sequence labeling cannot capture the internal associations among keyphrases and thus deviate from the essence of keyphrase judgement. To address this problem, this thesis proposes a keyphrase extraction method integrating semantic features and learning to rank (SF-L2R-KPE). The algorithm uses Doc2Vec, an embedding technique, to compute the cosine similarity between each candidate phrase and the document as a measure of their semantic relatedness. This semantic feature is combined with other statistical features, and a trained learning to rank model computes a score for each candidate keyphrase. The candidates are sorted by score in descending order, and the top N are selected as keyphrases. The effectiveness of the proposed algorithm is verified by experiments on multiple data sets and by comparison with other models.

The main innovations of this algorithm are as follows: (1) Doc2Vec trains a document vector that effectively represents the document while also training the word vectors, so the semantic relationship between a word and the document is well reflected by the cosine similarity between the two vectors. (2) This is the first time a learning to rank model based on a neural network has been applied to the task of keyphrase extraction, and the ranking formulation better captures the internal relationships between keyphrases. (3) The pairwise method introduces competition between candidate keyphrases, which is more in line with the essence of keyphrase judgement and yields better extraction results.
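
The pipeline described in the abstract (a cosine-similarity semantic feature, a pairwise ranking model, and top-N selection) can be illustrated with a small sketch. This is not the thesis's implementation. The vectors below are made-up stand-ins for Doc2Vec output, the single statistical feature is a hypothetical normalized frequency, and a linear scorer trained with a pairwise logistic loss is swapped in for the neural ranking model the abstract mentions. All function and variable names are assumptions chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def score(weights, features):
    """Linear ranking score; a stand-in for the neural scorer in the abstract."""
    return sum(w * x for w, x in zip(weights, features))

def train_pairwise(pairs, n_features, lr=0.1, epochs=200):
    """Fit weights so each 'better' candidate outscores its paired 'worse' one.

    Minimizes the pairwise logistic loss log(1 + exp(-(s_better - s_worse)))
    with plain stochastic gradient descent.
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = score(weights, diff)
            # Negative gradient of the loss w.r.t. weights is sigmoid(-margin) * diff.
            step = 1.0 / (1.0 + math.exp(margin))
            weights = [w + lr * step * d for w, d in zip(weights, diff)]
    return weights

def top_n(candidates, weights, n):
    """Return the n phrases with the highest ranking score, best first."""
    ranked = sorted(candidates, key=lambda c: score(weights, c[1]), reverse=True)
    return [phrase for phrase, _ in ranked[:n]]

# Toy data (assumed): phrase -> (embedding, normalized frequency, is_keyphrase).
doc_vec = [0.9, 0.1, 0.3]
raw = {
    "learning to rank": ([0.8, 0.2, 0.4], 0.6, 1),
    "keyphrase extraction": ([0.9, 0.0, 0.2], 0.8, 1),
    "big data": ([0.1, 0.9, 0.1], 0.3, 0),
    "information society": ([0.2, 0.7, 0.5], 0.2, 0),
}

# Feature vector per candidate: [semantic similarity to document, frequency].
candidates = [(p, [cosine_similarity(vec, doc_vec), freq]) for p, (vec, freq, _) in raw.items()]
labels = {p: label for p, (_, _, label) in raw.items()}

# Pairwise training set: every (keyphrase, non-keyphrase) combination.
pairs = [(fa, fb) for pa, fa in candidates for pb, fb in candidates
         if labels[pa] == 1 and labels[pb] == 0]

weights = train_pairwise(pairs, n_features=2)
selected = top_n(candidates, weights, n=2)
print(selected)
```

The pairwise construction is what puts candidates in competition with each other: the model never sees an isolated "keyphrase or not" label, only which of two candidates from the same document should rank higher.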
Keywords/Search Tags: keyphrase extraction, Doc2Vec, semantic features, learning to rank