Research On Semantic Based Document Keyword Extraction Technology

Posted on:2021-04-10

Degree:Master

Type:Thesis

Country:China

Candidate:H Su

Full Text:PDF

GTID:2428330626458938

Subject:Software engineering

Abstract/Summary:


With the continuous development of information technology, the volume of data is growing rapidly and information of every kind fills every corner of daily life. How to accurately extract the genuinely meaningful key parts from large amounts of data has therefore become a research hotspot. Most current keyword extraction algorithms are based either on word frequency and word length or on semantics and word chains. Extraction based on word frequency and word length depends on the length of each word: longer segmented words are more likely to become final keywords. This is unsuitable in some special cases, and it takes no account of the semantic information of words in the full text. Semantics-based approaches, in turn, ignore basic attributes of the word itself, such as word length, and have a high time cost. As a result, neither approach extracts the keyword information of a document accurately, and their accuracy and efficiency struggle to meet users' needs. This thesis addresses these problems.

Keywords reflect the main information and core concepts of an article. Accurate keywords bring great convenience to readers in reading and searching, so keyword extraction technology has been continuously improved and optimized. Through disambiguation and semantic analysis, semantics-based extraction algorithms can capture more of the real meaning and intent of the words in an article, and the extracted keywords are more readily accepted by users. After comparison with the traditional TF-IDF and KEA algorithms, this thesis proposes a semantics-based algorithm named GSW, which is built on the group character tree, semantic similarity and the word-length priority ratio. The algorithm is mainly applied to keyword extraction from natural-language text. It combines semantic analysis with statistics on basic word information to resolve the conflict between accuracy and timeliness in keyword extraction.

The main contributions are as follows:

(1) A data structure named the group character tree is defined and used to load the word information in the thesaurus. Compared with the original high-performance storage structure, the character tree, it has the same time complexity, but its average search length for word strings and its memory usage are slightly better. The structure is applied during word segmentation to optimize how the thesaurus is stored.

(2) A disambiguation algorithm based on semantic similarity and the B+ tree is proposed. The algorithm computes the semantic similarity between words within a unit group in order to disambiguate, and uses a B+ tree to store intermediate results, which improves query and sorting performance. This makes the disambiguation result more reliable.

(3) The concept of the word-length priority ratio is defined. When the naive Bayes classification algorithm is used to extract the final keywords, the word-length priority ratio is applied to compute the word-length weight. It moderates the word-length feature so that shorter words also have some chance of being selected as final keywords, which reduces the semantic one-sidedness of long words and makes the extracted keywords more accurate and reliable.

Finally, to verify the feasibility and accuracy of the improved algorithm in keyword extraction, a keyword extraction platform was built, after requirements analysis and process design, to run the algorithm. Using 600 articles from the platform, the experiments were divided into two groups, single-document classification and multi-document classification, and the keywords produced by the extraction algorithm were verified. The keywords expected by users were all among them, and the algorithm performed well in accuracy rate (precision), recall rate, and the harmonic mean of the two. This demonstrates the usability and accuracy of the algorithm.
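
The evaluation measures named above can be illustrated with a minimal sketch. It assumes that the combined value of accuracy rate and recall rate is the usual F-measure (their harmonic mean) and that a keyword counts as correct when it matches an expected keyword exactly after lowercasing. The function name `evaluate_keywords` and the sample keyword lists are hypothetical and are not taken from the thesis.

```python
def evaluate_keywords(extracted, expected):
    """Return (precision, recall, f_measure) for one document.

    Assumption: exact string match after trimming and lowercasing;
    the thesis may use a different matching rule.
    """
    extracted_set = {k.strip().lower() for k in extracted}
    expected_set = {k.strip().lower() for k in expected}

    # Keywords that the algorithm produced and the user also expected.
    hits = len(extracted_set & expected_set)

    # Precision: share of extracted keywords that are correct.
    precision = hits / len(extracted_set) if extracted_set else 0.0
    # Recall: share of expected keywords that were found.
    recall = hits / len(expected_set) if expected_set else 0.0

    # F-measure: harmonic mean of precision and recall.
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical sample data, for illustration only.
extracted = ["keyword extraction", "semantic similarity", "word length", "B+ tree"]
expected = ["Keyword Extraction", "semantic similarity", "disambiguation"]

precision, recall, f_measure = evaluate_keywords(extracted, expected)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

In this sample, two of the four extracted keywords are expected and two of the three expected keywords are found, so extracting more candidates raises recall only at the cost of precision. The harmonic mean summarizes that trade-off in one number.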

Keywords/Search Tags:

extraction, word-length priority ratio, Disambiguate


Related items

1	Study On Extraction Algorithm Of Palm - To - Length Ratio
2	Research On Chinese Word Segmentation Technology With Word Length And Rule Algorithm
3	Optimization Structure For Finite Word Length FIR Digital Filters
4	The Research Of United Controlling Mechanism Of Auto-adapted Frame Length And The Priority Queue
5	Structure Optimization And Performance Research Of Finite Word Length FIR Digital Filter
6	The Improved Extraction Word Model And Its Implementation Based On Word Boundary Characteristics
7	Detection Method Of Assembly Coaxality For Parts With High Length-diameter Ratio In Local Imaging
8	Research Of Combined Chinese Word Segmentation Method
9	Research On The Protocol Technology In Delay Tolerant Network Based On Service Priority Strategy
10	A Contour Curve Matching Method