The Research And Application Of News Extraction Technology Based On Domain Lexicon

Posted on: 2019-10-27 | Degree: Master | Type: Thesis
Country: China | Candidate: J Yang
GTID: 2428330545957138 | Subject: Systems analysis and integration
Abstract/Summary:
With the spread of the Internet, online news has become the main way for people to follow current events and political news in real time. To pick out the news columns that interest users from the many news websites, news extraction needs to be focused. Selecting and classifying news features lets users search for news accurately and browse it conveniently.

In this thesis, news extraction is divided into two stages: word library construction and short-text similarity calculation. The word library construction stage mainly uses a THULAC-based word segmentation model for word segmentation and part-of-speech tagging, and a convolutional neural network model to calculate word similarity. The short-text similarity calculation stage mainly uses an improved TF-IDF algorithm. Compared with the traditional TF-IDF algorithm, the feature distribution of the improved algorithm reaches 9, and the IDF values are improved.

The main work of this thesis is as follows:
1) The word library construction stage mainly uses the THULAC-based word segmentation model to carry out word segmentation and part-of-speech tagging, and uses a skip-gram model based on the convolutional neural network to calculate word vectors and word similarity in order to classify words. The classified words are stored in the lexicon as the domain word library.
2) The short-text similarity calculation stage mainly uses the improved TF-IDF algorithm to calculate similarity for the short text entered by the user, and extracts the domain news with the highest sentence similarity.
3) Building on this work, the experiments use Python to perform word segmentation and similarity calculation on news collected by a web crawler. The experimental results show that the proposed combination of the convolutional neural network and the improved TF-IDF algorithm improves the speed and accuracy of text classification.
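The second stage, matching a short user query against news by TF-IDF similarity, can be illustrated with a minimal sketch. It uses the traditional TF-IDF weighting with cosine similarity, not the improved variant developed in the thesis, whose details the abstract does not give. The function names, the smoothed IDF formula, and the English sample headlines are all assumptions made for illustration, and the inputs are assumed to be already segmented into tokens (the thesis uses THULAC for that step).

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption of this sketch) keeps every weight positive.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors, idf


def query_vector(tokens, idf):
    """Weight a tokenized query with the corpus IDF; unseen terms are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    total = sum(tf.values())
    return {t: (c / total) * idf[t] for t, c in tf.items()} if total else {}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query_tokens, docs):
    """Return the index of the best-matching document and all scores."""
    vectors, idf = tfidf_vectors(docs)
    q = query_vector(query_tokens, idf)
    scores = [cosine(q, v) for v in vectors]
    best = max(range(len(docs)), key=scores.__getitem__)
    return best, scores


# Hypothetical, pre-segmented sample headlines (not data from the thesis).
news = [
    ["central", "bank", "cuts", "interest", "rates"],
    ["football", "team", "wins", "league", "final"],
    ["parliament", "passes", "new", "budget", "bill"],
]
best_index, scores = most_similar(["interest", "rates", "bank"], news)
print(best_index, scores)
```

In this sketch the query shares terms only with the first headline, so that headline is returned as the closest match. A domain lexicon such as the one built in the first stage could be used to restrict or reweight the vocabulary before scoring, but how the thesis does this is not stated in the abstract.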
Keywords/Search Tags: word segmentation, text classification, feature extraction, similarity calculation