
Design And Implementation Of Content-based Webpage Collection And Classification System

Posted on: 2019-08-12 | Degree: Master | Type: Thesis
Country: China | Candidate: M J Li | Full Text: PDF
GTID: 2428330590475433 | Subject: Software engineering
Abstract/Summary:
With the growing prosperity of the Internet, online information resources have multiplied. While this makes knowledge easier to acquire, the sheer volume of content and noise can hinder users searching for useful information. Internet news, as a major source of online information, has greater research value than many other sources, so collecting and classifying it accurately and efficiently matters for information retrieval and data mining. Classifying webpages by their content takes the semantics of the text fully into account, avoids collection errors caused by misclassified or unclassified pages, and yields better classification results.

This thesis studies webpage text extraction and, drawing on the characteristics of news websites, develops more effective collection and update strategies that keep news gathering efficient. Because news sites come from diverse sources and their page designs change frequently, template-based text extraction cannot guarantee accuracy. The thesis therefore analyzes existing web text extraction techniques, proposes a general extraction algorithm based on the distribution of text within a page, and determines the algorithm's optimal parameter values experimentally, reducing the cost of hand-written extraction rules.

For text classification, the thesis examines the overall text categorization pipeline and adopts the Labeled-LDA topic model for feature representation. Compared with the traditional vector space model, Labeled-LDA reduces the feature dimension, avoids the loss of semantic information, and extends LDA into a supervised classification model. After comparing text classification methods, the support vector machine is chosen as the classifier over these topic features; comparison with other methods verifies the validity of the classification approach, and the trained model is then applied to classify new text.

Finally, based on the B/S architecture, the thesis implements the webpage collection and classification system, gives the detailed design and implementation of each module, and evaluates the system in terms of collection performance and classification accuracy, verifying its feasibility.
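As a rough illustration of distribution-based extraction, the sketch below scores each line of an HTML page by its text density (the fraction of characters outside tags) and keeps the longest contiguous run of dense, sufficiently long lines. The function names, density threshold, and minimum line length are illustrative assumptions, not the thesis's actual algorithm or tuned parameters.

```python
import re

def text_density(line: str) -> float:
    """Fraction of characters that are plain text rather than HTML tags."""
    stripped = re.sub(r"<[^>]*>", "", line)
    return len(stripped) / len(line) if line else 0.0

def extract_main_text(html: str, threshold: float = 0.6, min_len: int = 20) -> str:
    """Return the longest contiguous run of text-dense lines.

    A stand-in for the thesis's distribution-based extraction: navigation
    bars and footers are tag-heavy and short, so they fall below the
    density threshold, while body paragraphs stay above it.
    """
    lines = [l.strip() for l in html.splitlines() if l.strip()]
    best, current = [], []
    for line in lines:
        text = re.sub(r"<[^>]*>", "", line)
        if text_density(line) > threshold and len(text) > min_len:
            current.append(text)
        else:
            if len("".join(current)) > len("".join(best)):
                best = current
            current = []
    if len("".join(current)) > len("".join(best)):
        best = current
    return "\n".join(best)
```

A real implementation would smooth densities over a window and handle scripts and inline styles, but the core idea of locating the densest text region is the same.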
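For the classification step, a linear SVM can be trained directly on topic-distribution features such as those Labeled-LDA produces for each document. The minimal Pegasos-style sketch below is illustrative only, not the thesis's implementation; the hyperparameters and the hand-made topic vectors are assumptions.

```python
import random

def train_linear_svm(samples, labels, lam=0.01, epochs=200):
    """Train a linear SVM by stochastic subgradient descent (Pegasos style).

    samples: list of feature vectors (e.g. per-document topic distributions),
    labels: +1 / -1. Returns the weight vector and bias.
    """
    dim = len(samples[0])
    w, b, t = [0.0] * dim, 0.0, 0
    rng = random.Random(0)          # fixed seed for reproducibility
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)   # decaying learning rate
            x, y = samples[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            w = [(1 - eta * lam) * wj for wj in w]   # regularization shrink
            if margin < 1:                           # hinge-loss update
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                b += eta * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

In practice one would use a mature SVM library and the real Labeled-LDA topic vectors; the point here is only that low-dimensional topic features plug straight into a linear classifier.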
Keywords/Search Tags:Webpage Collection, Text Extraction, Labeled Latent Dirichlet Allocation, Support Vector Machine, B/S architecture