
Design And Implementation Of Content-based Webpage Collection And Classification System

Posted on: 2019-08-12 | Degree: Master | Type: Thesis
Country: China | Candidate: M J Li | Full Text: PDF
GTID: 2428330590475433 | Subject: Software engineering
Abstract/Summary:
With the growing prosperity of the Internet, online information resources have multiplied. While this makes knowledge easier to acquire, the sheer volume of content and noise can hinder users searching for useful information. Internet news, as a major source of online information, has greater research value than many other sources, so collecting and classifying it accurately and efficiently matters for information retrieval and data mining. Classifying webpages by their content takes the semantics of the text fully into account, avoids collection errors caused by misclassified or unclassified pages, and yields better classification results.

This thesis studies webpage text extraction and, drawing on the characteristics of news websites, develops more effective collection and update strategies that keep news gathering efficient. Because news sites come from diverse sources and their page designs change frequently, template-based text extraction cannot guarantee accuracy. The thesis therefore analyzes existing web text extraction techniques, proposes a general extraction algorithm based on the distribution of text within a page, and determines the algorithm's optimal parameter values experimentally, reducing the cost of hand-written extraction rules.

For text classification, the thesis examines the overall text categorization pipeline and adopts the Labeled-LDA topic model for feature representation. Compared with the traditional vector space model, Labeled-LDA reduces the feature dimension, avoids the loss of semantic information, and extends LDA into a supervised classification model. After comparing text classification methods, the support vector machine is chosen as the classifier over these topic features; comparison with other methods verifies the validity of the classification approach, and the trained model is then applied to classify new text.

Finally, based on the B/S architecture, the thesis implements the webpage collection and classification system, gives the detailed design and implementation of each module, and evaluates the system in terms of collection performance and classification accuracy, verifying its feasibility.
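As a rough illustration of distribution-based extraction, the sketch below scores each line of an HTML page by its text density (the fraction of characters outside tags) and keeps the longest contiguous run of dense, sufficiently long lines. The function names, density threshold, and minimum line length are illustrative assumptions, not the thesis's actual algorithm or tuned parameters.

```python
import re

def text_density(line: str) -> float:
    """Fraction of characters that are plain text rather than HTML tags."""
    stripped = re.sub(r"<[^>]*>", "", line)
    return len(stripped) / len(line) if line else 0.0

def extract_main_text(html: str, threshold: float = 0.6, min_len: int = 20) -> str:
    """Return the longest contiguous run of text-dense lines.

    A stand-in for the thesis's distribution-based extraction: navigation
    bars and footers are tag-heavy and short, so they fall below the
    density threshold, while body paragraphs stay above it.
    """
    lines = [l.strip() for l in html.splitlines() if l.strip()]
    best, current = [], []
    for line in lines:
        text = re.sub(r"<[^>]*>", "", line)
        if text_density(line) > threshold and len(text) > min_len:
            current.append(text)
        else:
            if len("".join(current)) > len("".join(best)):
                best = current
            current = []
    if len("".join(current)) > len("".join(best)):
        best = current
    return "\n".join(best)
```

A real implementation would smooth densities over a window and handle scripts and inline styles, but the core idea of locating the densest text region is the same.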
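For the classification step, a linear SVM can be trained directly on topic-distribution features such as those Labeled-LDA produces for each document. The minimal Pegasos-style sketch below is illustrative only, not the thesis's implementation; the hyperparameters and the hand-made topic vectors are assumptions.

```python
import random

def train_linear_svm(samples, labels, lam=0.01, epochs=200):
    """Train a linear SVM by stochastic subgradient descent (Pegasos style).

    samples: list of feature vectors (e.g. per-document topic distributions),
    labels: +1 / -1. Returns the weight vector and bias.
    """
    dim = len(samples[0])
    w, b, t = [0.0] * dim, 0.0, 0
    rng = random.Random(0)          # fixed seed for reproducibility
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)   # decaying learning rate
            x, y = samples[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            w = [(1 - eta * lam) * wj for wj in w]   # regularization shrink
            if margin < 1:                           # hinge-loss update
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

In practice one would use a mature SVM library and the real Labeled-LDA topic vectors; the point here is only that low-dimensional topic features plug straight into a linear classifier.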
Keywords/Search Tags:Webpage Collection, Text Extraction, Labeled Latent Dirichlet Allocation, Support Vector Machine, B/S architecture