Design And Implementation Of Information Extraction System Based On Improved TF-IDF Algorithm

Posted on: 2020-09-18
Degree: Master
Type: Thesis
Country: China
Candidate: L Cheng
Full Text: PDF
GTID: 2428330572473696
Subject: Electronics and Communications Engineering
Abstract/Summary:
The rapid development of Internet technology has brought explosive growth in information. At the same time, much of the information on the network is redundant or distracting. How to filter large volumes of network noise quickly and extract the target information effectively has become a focus of research. Natural Language Processing (NLP) can extract the subject and semantics of text, identify semantically similar information, and eliminate noise such as antonyms by processing text, paragraphs, sentences and words on the basis of word vectors and sentence vectors, thereby achieving information extraction for specific documents.

As one of the classical text keyword extraction algorithms, term frequency-inverse document frequency (TF-IDF) has been widely used. It obtains document keywords by counting Term Frequency (TF): the more often a word appears, the more likely the article is related to that word. Meanwhile, the weight of common words is reduced by Inverse Document Frequency (IDF). However, the traditional TF-IDF algorithm still has many shortcomings in practical application, such as not considering the incomplete classification of words in documents and ignoring the distribution information between feature words. Although some scholars have improved the traditional TF-IDF algorithm, their methods still simply link word frequency to weight, without considering how the distribution of a word across different documents affects its weight and without using the position of words within a document, which results in inaccurate extraction of document topics.

Therefore, this thesis introduces information entropy and relative entropy from information theory and proposes an improved TF-IDF algorithm for document keyword extraction. Because the traditional algorithm relies only on word frequency to calculate word weight and does not consider how the distribution of a word across different documents affects its weight, the information entropy and relative entropy of words are included in the word weight. Because the traditional algorithm neglects the importance of the first and last sentences of a text as summary sentences, this thesis introduces a weight factor based on word position together with the relative entropy of words, so that words in the first and last sentences are given higher weights. Formulas for a document-length correction value, a word-frequency correction value and word-frequency control are put forward to solve the problem of inflated word frequency in long documents. The experimental results show that the improved TF-IDF algorithm achieves better accuracy and recall than the traditional algorithm.

To meet the demand for information extraction from massive network text, this thesis designs and implements an information extraction system based on the improved TF-IDF algorithm, using word segmentation, part-of-speech tagging, keyword extraction and word vector processing techniques from natural language processing. The thesis describes in detail the requirement analysis, basic structure, processing flow and functional modules of the information extraction system. Finally, the information extraction system is tested. The test results show that the system implements the functions identified in the requirement analysis: text preprocessing, noise text filtering, target sentence location, semantic similarity calculation and information extraction. It can complete the information extraction task efficiently and accurately.
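The abstract does not give the thesis's formulas, so what follows is only a minimal sketch of the two ingredients it names: the classical TF-IDF weight, and the Shannon entropy of a word's distribution across documents. The function names `tf_idf` and `distribution_entropy`, the toy corpus, and the choice of natural logarithm for IDF are all assumptions made for illustration. The sketch does not show how the improved algorithm combines entropy, relative entropy, or position factors into the final weight.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Classical TF-IDF. docs is a list of token lists.

    Returns one {term: weight} dict per document, where
    weight = (count / document length) * ln(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def distribution_entropy(term, docs):
    """Shannon entropy (in bits) of how a term's occurrences spread over docs.

    Illustrative only: the abstract does not state how the thesis
    folds entropy into the word weight.
    """
    counts = [doc.count(term) for doc in docs]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


# Hypothetical toy corpus, already tokenized.
docs = [
    ["entropy", "weights", "the", "keyword"],
    ["the", "keyword", "the", "document"],
    ["the", "noise", "the", "text"],
]
weights = tf_idf(docs)
```

In this toy corpus "the" occurs in every document, so its IDF factor is ln(3/3) = 0 and its TF-IDF weight vanishes, which is the down-weighting of common words that the abstract describes. A word confined to a single document has distribution entropy 0, while a word spread evenly over several documents has higher entropy; classical TF-IDF does not capture this distribution signal.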
Keywords/Search Tags: Information extraction, keyword extraction, semantic similarity