Research On Text Abstract Extraction Technology Based On Keywords And Topic Sentences

Posted on: 2021-02-16 | Degree: Master | Type: Thesis
Country: China | Candidate: Y Q Jiang
GTID: 2438330611992476 | Subject: Software engineering
Abstract/Summary:
With the rapid development of network information technology, the quantity and scale of document information keep expanding. Quickly acquiring the information users need from such massive data, and summarizing it accordingly, has therefore become an urgent problem. Document abstract extraction technology aims to automatically summarize the main content of complex document information, which addresses the difficulty of extracting abstracts manually from massive data and reduces the user's workload in interpreting information. With the advent of big data, document abstract extraction has become a hot topic in academic research in China and abroad. The main work of this thesis is as follows:

(1) A keyword optimization algorithm based on the combination of the TF-IDF algorithm and word similarity.
Keyword extraction with the traditional TF-IDF algorithm cannot truly reflect the distribution and importance of feature words, so a keyword optimization algorithm combining TF-IDF with word similarity is proposed. During preprocessing for keyword extraction, the algorithm introduces the concept of word similarity and uses a section-marking method to improve the segmentation quality of words with high similarity. The inverse document frequency in the TF-IDF algorithm is replaced with an inverse word frequency, which better expresses the importance of each word's weight in the corpus, and the keywords are then extracted. As a result, the algorithm improves the accuracy of document keyword extraction and achieves higher precision and recall than the traditional TF-IDF algorithm.

(2) An improved Chinese abstract extraction algorithm based on TextRank.
The traditional TextRank algorithm assumes by default that all sentences have the same initial importance and does not consider that sentences differ in importance. To solve this problem, an improved Chinese abstract extraction algorithm based on TextRank is proposed. First, the keywords extracted in (1) with the combined TF-IDF and word similarity algorithm are retained as factors for adjusting the weights of topic sentences. Sentence clusters are then formed by combining the Doc2Vec model with a mean shift algorithm that improves the selection of initial points. Taking into account factors such as sentence position, sentence pattern characteristics and keyword importance, a new topic sentence weight is formed and applied to the TextRank algorithm to improve the accuracy of the abstract. Results on the data set used in this thesis show that this algorithm performs better in automatic extraction of Chinese abstracts than the TF-IDF algorithm, which considers only word frequency, and the traditional TextRank algorithm, in which every sentence has a default weight of 1.
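The idea in (2), replacing the uniform initial sentence weight with a weight that reflects sentence position and keyword importance, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names `sentence_prior` and `weighted_textrank`, the `position_bonus` parameter, the bag-of-words cosine similarity and the simple additive weighting are all assumptions made for illustration. The Doc2Vec embedding and mean shift clustering steps, and the sentence pattern feature, are left out.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sentence_prior(sentences, keywords, position_bonus=0.5):
    """Hypothetical prior: 1 + keyword hits, plus a bonus for the first sentence.

    The additive form and the bonus value are illustrative assumptions.
    """
    priors = []
    for idx, tokens in enumerate(sentences):
        keyword_hits = sum(1 for t in tokens if t in keywords)
        bonus = position_bonus if idx == 0 else 0.0
        priors.append(1.0 + keyword_hits + bonus)
    return priors


def weighted_textrank(sentences, prior, damping=0.85, iterations=50):
    """TextRank-style iteration over tokenized sentences with a non-uniform prior.

    Passing a prior of all ones gives the uniform-weight baseline.
    """
    n = len(sentences)
    bags = [Counter(s) for s in sentences]
    # Similarity graph between sentences, with no self-loops.
    sim = [[cosine_similarity(bags[i], bags[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    out_sum = [sum(row) for row in sim]
    total = sum(prior)
    p = [x / total for x in prior]  # normalized prior
    scores = p[:]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Score flowing into sentence i from its neighbours;
            # sentences with no similar neighbour pass nothing on.
            incoming = sum(sim[j][i] / out_sum[j] * scores[j]
                           for j in range(n) if out_sum[j] > 0)
            new_scores.append((1 - damping) * p[i] + damping * incoming)
        scores = new_scores
    return scores


if __name__ == "__main__":
    doc = [
        ["keyword", "extraction", "uses", "tfidf"],
        ["tfidf", "ignores", "word", "similarity"],
        ["sentence", "ranking", "uses", "similarity"],
    ]
    prior = sentence_prior(doc, keywords={"tfidf", "similarity"})
    for tokens, score in zip(doc, weighted_textrank(doc, prior)):
        print(round(score, 3), " ".join(tokens))
```

Calling `weighted_textrank` with a prior of all ones corresponds to the traditional setting in which every sentence has a default weight of 1, so the two rankings can be compared directly on the same document.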
Keywords/Search Tags: TF-IDF algorithm, keyword extraction, TextRank algorithm, topic sentences, document abstract