| The rapid development of information technology has driven exponential growth in network data, making it increasingly difficult to search and use text information effectively. Faced with this explosive growth of text, efficiently capturing useful information has become an urgent problem. One solution is to extract from a text the central words that reflect its theme; these words are called keywords. Keywords reflect the author's ideas and the theme of the article, allowing readers to grasp its main content quickly, so an effective automatic keyword extraction method is of great value. As the core content of a text, keywords should reflect not only the importance of words but also their relevance to the text's theme. However, little research addresses the topic relevance of keywords; most work relies on probabilistic language models of words or on word-graph-based methods, so the implicit semantic characteristics of words are not mined. In addition, most real-world text does not come with tagged keywords, and manual labeling is inefficient, time-consuming, and laborious, with results strongly influenced by the annotator's subjective judgment. Based on these factors, this paper studies the topic relevance of keywords and the scarcity of labeled corpora. The main contents of this paper are as follows: (1) This paper proposes a method to calculate the relevance between words and text topics. The text is first preprocessed to obtain a sequence of candidate keywords; word vectors are trained on a text corpus combined with domain knowledge to obtain a word vector table; the table is then used to map the words of each text to a sequence of vectors, which are clustered to obtain the text's cluster center; finally, the similarity between each candidate keyword and the cluster center is computed and taken as the semantic relevance between the word and the text topic. (2) To address the weak topic relevance of extracted keywords, this paper proposes a keyword extraction method that integrates semantic features. The work focuses on feature extraction for candidate keywords. Building on previous studies, it extracts four kinds of features of candidate keywords (word frequency, length, position, and linguistic information) together with the similarity between words and the text topic, and uses them as training samples for a keyword classification model. Experimental results show that the keyword extraction method with semantic feature fusion improves accuracy by 16.2% and F-score by 20.5% compared with the traditional TF-IDF method. The extracted keywords reflect both the importance of words and their relevance to the theme. (3) To address the scarcity of labeled corpora, this paper combines the multi-feature keyword extraction method with semi-supervised learning and proposes an improved semi-supervised keyword extraction method. The algorithm improves the selection of initial training samples and uses cross-validation to extract high-confidence training samples, improving the accuracy of the model. Experiments show that, for a given set of experimental data, the supervised algorithm can only learn rules from labeled samples, while the semi-supervised algorithm learns the rules of labeled samples and also uncovers the internal regularities of unlabeled samples. |
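
The topic-relevance computation described in contribution (1) can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a single cluster center, approximated by the mean of the text's word vectors in place of a full clustering step, and cosine similarity as the similarity measure. The function names and the two-dimensional toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def topic_relevance(candidates, text_words, word_vectors):
    """Score each candidate keyword by its similarity to the text's center.

    Assumption: the mean of all word vectors in the text stands in for
    the cluster center obtained by clustering in the paper.
    """
    vectors = [word_vectors[w] for w in text_words if w in word_vectors]
    if not vectors:
        return {}
    center = centroid(vectors)
    return {
        w: cosine(word_vectors[w], center)
        for w in candidates
        if w in word_vectors
    }


# Hypothetical hand-made vectors; real ones would come from a trained model.
toy_vectors = {
    "keyword": [1.0, 0.1],
    "extraction": [0.9, 0.2],
    "semantic": [0.8, 0.3],
    "weather": [0.0, 1.0],
}
text = ["keyword", "extraction", "semantic", "weather"]
scores = topic_relevance(["keyword", "weather"], text, toy_vectors)

# Candidates are ranked by their relevance to the text's center.
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {score:.3f}")
```

In this toy text, "keyword" lies close to the center formed mostly by on-topic words and scores higher than the off-topic "weather". In a feature-based classifier such as the one in contribution (2), a score of this kind would be one input feature alongside frequency, length, position, and linguistic information.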