
Design And Implementation Of Topic Recognition And Automatic Classification System For News Web Pages

Posted on:2020-01-09
Degree:Master
Type:Thesis
Country:China
Candidate:H X Li
Full Text:PDF
GTID:2428330611499670
Subject:Computer technology
Abstract/Summary:
With the rapid development and popularization of Internet technology, network resources are growing explosively, and they exist mainly in the form of web pages. Although the rich information resources of the Internet bring convenience, users cannot quickly locate the information they need within such a large volume of resources, which gives rise to the problem of web page classification. As classification technology develops, web page classification faces higher requirements for accuracy and efficiency. In addition, news web pages serve as a medium for daily information acquisition, yet the classification standards of most news websites are not uniform. This paper uses classification technology to classify and manage the web pages of different news websites in a unified way.

Firstly, to clarify the requirements of the system, this paper analyzes the overall task, the functional requirements and the overall business process of the system, and establishes the necessity of designing a topic recognition and automatic classification system for news web pages.

Secondly, to address the characteristics of news web pages, the paper analyzes the classification characteristics of web pages and completes the selection of web page content features. It analyzes the LDA topic model and uses it to complete topic recognition on the news web page text. The topic words and the content features are then used together as the text features. A triplet of feature item, feature vector and label structure position is used to represent the web page features. Feature vector construction is completed by introducing the corresponding structural features for each text feature and transforming them into structure vectors.

Thirdly, for the web page classification model, traditional machine learning classification models and convolutional neural network classification models are analyzed and studied. This paper studies and improves a convolutional neural network classification model based on the joint features of web page content and structure, and uses this model to classify a news web corpus. At the same time, a keyword extraction algorithm improved from the semantic space is used to obtain the web page text keywords and the web page text summary.

Based on the above research and analysis, this paper defines the overall architecture and logical function modules of the classification system, including core modules such as data acquisition, news web page classification feature analysis, and classification model construction. According to the working requirements of the system, the paper analyzes the classification characteristics of web pages and constructs the joint features, then designs the convolutional neural network classification model based on these joint features and tests its classification results on the web page dataset. Compared with the machine learning classification model, accuracy improves by 3% to 4%, which further verifies the performance of the model.

Finally, the topic recognition and automatic classification system for news web pages is designed and implemented on the basis of this design. The system can be applied to unified classification management of news web pages and has wide application value.
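
The triplet representation described in the abstract (feature item, feature vector, label structure position) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the tag list `TAG_POSITIONS`, the one-hot structure encoding, the concatenation used to form the joint vector, and the names `WebFeature`, `structure_vector`, `joint_vector` and `page_matrix`.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical set of HTML tag positions; the abstract does not list them.
TAG_POSITIONS = ["title", "h1", "meta", "p"]


@dataclass
class WebFeature:
    """One triplet: feature item, feature vector, label structure position."""
    item: str            # content word or topic word
    vector: List[float]  # text feature vector for the item (e.g. an embedding)
    tag: str             # HTML tag the item was found in


def structure_vector(tag: str) -> List[float]:
    """One-hot encode the tag position (an assumed encoding, not the thesis's)."""
    return [1.0 if tag == t else 0.0 for t in TAG_POSITIONS]


def joint_vector(feature: WebFeature) -> List[float]:
    """Concatenate the text vector and the structure vector into a joint feature."""
    return feature.vector + structure_vector(feature.tag)


def page_matrix(features: List[WebFeature]) -> List[List[float]]:
    """Stack the joint vectors of a page row by row, e.g. as input to a CNN."""
    return [joint_vector(f) for f in features]


# Toy usage with made-up two-dimensional text vectors.
features = [
    WebFeature("election", [0.2, 0.7], "title"),
    WebFeature("vote", [0.1, 0.4], "p"),
]
matrix = page_matrix(features)
```

In this sketch each row of `matrix` has length `len(vector) + len(TAG_POSITIONS)`, so a word appearing in the title and the same word appearing in a body paragraph produce different rows. That is one simple way to let a classifier see both the text and its structural position.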
Keywords/Search Tags:news web page classification, joint feature, convolutional neural network, topic model