
Research On Semantic Based Document Keyword Extraction Technology

Posted on: 2021-04-10
Degree: Master
Type: Thesis
Country: China
Candidate: H Su
GTID: 2428330626458938
Subject: Software engineering
Abstract/Summary:
With the continuous development of information technology, rapidly growing volumes of data are coming into view, and information of all kinds fills every corner of daily life. How to accurately extract the truly meaningful key parts from this mass of data has therefore become a research hotspot. Most current keyword extraction algorithms are based either on word frequency and word length or on semantics and word chains. Extraction based on word frequency and word length depends on the length of the word: longer segmented words are more likely to become final keywords, which is unsuitable in some special cases, and the approach uses no semantic information about the words in the full text. Semantics-based approaches ignore basic attributes of the word itself, such as word length, and their time cost is high. Neither can accurately extract the keyword information of a document, and their accuracy and efficiency struggle to meet users' needs. This thesis addresses these problems.

Keywords reflect the main information and core concepts of an article. Accurate keyword information makes reading and searching much more convenient for readers, so keyword extraction technology continues to be improved and optimized. A semantics-based extraction algorithm can recover more of the real meaning and intent of the words in an article through disambiguation and semantic analysis, and the keywords it extracts are more readily accepted by users. In this thesis, after comparison with the traditional TF-IDF and KEA algorithms, a semantics-based algorithm named GSW is proposed, built on the group character tree, semantic similarity, and the word-length priority ratio. The algorithm is mainly intended for keyword extraction from natural-language text. It combines semantic analysis with statistics on basic word information to resolve the tension between accuracy and timeliness in keyword extraction. The main contributions are as follows:

(1) A data structure named the group character tree is defined and used to load the word information in the thesaurus. Compared with the original high-performance storage structure, the character tree, its time complexity is the same, but its average search length for word strings and its memory footprint are slightly better. The structure is applied during word segmentation to optimize how the thesaurus is stored.

(2) A disambiguation algorithm based on semantic similarity and a B+ tree is proposed. The algorithm computes the semantic similarity between the words in a unit group in order to disambiguate them, and uses a B+ tree to store the intermediate results, which improves query and sorting performance. This makes the disambiguation more reliable.

(3) The concept of the word-length priority ratio is defined. When the naive Bayes classification algorithm is used to extract the final keywords, the word-length priority ratio is applied in calculating the word-length weight. It moderates the word-length feature value so that shorter words also have a chance of being selected as final keywords, which reduces the semantic one-sidedness of long words and makes the extracted keywords more accurate and reliable.

Finally, to verify the feasibility and accuracy of the improved algorithm, a keyword extraction platform was built, following requirements analysis and process design, to run the algorithm. Using 600 articles from the platform, the experiments are divided into two groups, single-document classification and multi-document classification, and the keywords produced by the extraction algorithm are checked. The keywords expected by users all appear among them, and the algorithm performs well on precision, recall, and the harmonic mean of the two. This demonstrates the usability and accuracy of the algorithm.
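As an illustration of the baseline that point (1) of the abstract compares against, the following is a minimal sketch of a plain character tree (a trie) used for dictionary-based word segmentation. It is not the group character tree, whose layout the abstract does not describe. The names `CharTree`, `longest_match` and `segment`, the sample lexicon, and the choice of forward maximum matching as the segmentation strategy are all assumptions made for this example.

```python
class CharTreeNode:
    """One node of a plain character tree (trie)."""

    def __init__(self):
        self.children = {}    # maps a character to the child node
        self.is_word = False  # True if the path from the root spells a lexicon word


class CharTree:
    """Hypothetical baseline character tree holding a word lexicon."""

    def __init__(self, words=()):
        self.root = CharTreeNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            if ch not in node.children:
                node.children[ch] = CharTreeNode()
            node = node.children[ch]
        node.is_word = True

    def longest_match(self, text, start):
        """Return the end index of the longest lexicon word beginning at
        `start`, or `start` itself if no lexicon word begins there."""
        node = self.root
        end = start
        for i in range(start, len(text)):
            node = node.children.get(text[i])
            if node is None:
                break
            if node.is_word:
                end = i + 1
        return end


def segment(text, tree):
    """Forward maximum matching (an assumed strategy): repeatedly take the
    longest lexicon word at the current position."""
    tokens = []
    i = 0
    while i < len(text):
        end = tree.longest_match(text, i)
        if end == i:
            end = i + 1  # unknown character: emit it as a one-character token
        tokens.append(text[i:end])
        i = end
    return tokens


# Sample lexicon invented for this example.
tree = CharTree(["关键", "关键词", "提取", "算法"])
tokens = segment("关键词提取算法", tree)
```

Lookup cost in this sketch grows with the length of the word being matched rather than with the size of the lexicon, which is the usual reason for storing a segmentation dictionary in a tree of characters.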
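The evaluation described at the end of the abstract reports an accuracy rate, a recall rate, and a combined average of the two. Assuming these correspond to the standard precision, recall, and their harmonic mean (F1), they could be computed for one document as sketched below. The function name `precision_recall_f1` and the sample keyword lists are hypothetical and are not taken from the thesis.

```python
def precision_recall_f1(extracted, expected):
    """Compare extracted keywords with the keywords users expect.

    precision = correct extracted keywords / all extracted keywords
    recall    = correct extracted keywords / all expected keywords
    f1        = harmonic mean of precision and recall
    """
    extracted_set = set(extracted)
    expected_set = set(expected)
    hits = len(extracted_set & expected_set)

    precision = hits / len(extracted_set) if extracted_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented sample data: four extracted keywords, three expected ones.
p, r, f1 = precision_recall_f1(
    ["extraction", "semantic similarity", "word length", "platform"],
    ["extraction", "semantic similarity", "disambiguation"],
)
```

The harmonic mean is low whenever either precision or recall is low, so an algorithm cannot score well simply by returning very many or very few keywords.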
Keywords/Search Tags: extraction, word-length priority ratio, Disambiguate