
Research And Application Of Random Forest Technology In News Page Classification Systems

Posted on: 2014-12-13
Degree: Master
Type: Thesis
Country: China
Candidate: C Liu
Full Text: PDF
GTID: 2268330425464395
Subject: Computer application technology
Abstract/Summary:
News is an important part of people's daily life and can be found everywhere. Collecting news from the network, which includes discovering news sites, is therefore a significant task. Web spiders crawl the Internet to collect news pages and to find new news sites, and the collected news is then stored in a database. As the amount of news grows, the news pages need to be classified: according to the content of each report, the system must determine which category it belongs to and where it should be stored.

News pages differ greatly from other media and have many advantages. Traditional news uses paper to disseminate information; with the rapid development of the Internet, this mode of transmission wastes paper resources, and more and more information is instead presented to users as web pages. If the system offered every category of news to every user, it would send excessive data, and since a user may not be interested in all categories, resources would be wasted. Before users read the news, the system must therefore apply some technique to classify news into different categories. Because a web page cannot be processed directly by a computer, news classification is a complicated problem: pages presented as text must first be translated into a form the computer can understand before the emerging algorithms can be applied. With the development of machine learning, artificial intelligence, and other advanced technologies, more and more algorithms are used not only in their original fields but also for web page classification, so web page classification has attracted increasing attention.

This thesis focuses on the identification and classification of news. The news classification process usually consists of collecting a training set, vectorizing web pages, selecting features, classifying, training models, evaluating models, and developing a prototype system.
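The spider's breadth-first crawling can be sketched as a standard queue-based traversal. This is a minimal illustration rather than the thesis's actual spider: the link graph below is a hard-coded stand-in for real HTTP fetching and link extraction, and all URLs are hypothetical.

```python
from collections import deque

# Hypothetical link graph standing in for the live web. In a real spider,
# fetch_links(url) would download the page and extract its outgoing links.
LINK_GRAPH = {
    "http://news.example.com/": ["http://news.example.com/china",
                                 "http://news.example.com/sports"],
    "http://news.example.com/china": ["http://news.example.com/china/1.html"],
    "http://news.example.com/sports": ["http://news.example.com/sports/1.html"],
    "http://news.example.com/china/1.html": [],
    "http://news.example.com/sports/1.html": [],
}

def fetch_links(url):
    return LINK_GRAPH.get(url, [])

def bfs_crawl(seed, max_pages=100):
    """Visit pages level by level (breadth-first), starting from a seed URL."""
    visited = []
    seen = {seed}            # URLs already queued, to avoid re-crawling
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real system would store/parse the page here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

pages = bfs_crawl("http://news.example.com/")
```

Breadth-first order means site front pages and section indexes are reached before deep article pages, which is one reason such a strategy yields a cleaner candidate set for manual labeling.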
To obtain a more precise training set, we make the spider adopt a breadth-first crawling strategy and then label the collected pages ourselves; the result is cleaner and more accurate than training sets available online. To vectorize a web page, we use the N-gram method to cut the page text and its URL into N-gram words. All N-gram words together constitute the feature space; for each word in each page we compute a weight and fill this information into a vector, so that all vectors form the initial matrix. A computer can handle such a matrix easily, but its dimensionality is too large, so noisy and irrelevant features must be removed. This thesis uses the information gain ratio to select features from the matrix.

Model selection and training are the focus of this study. We train models with a variety of algorithms and compare their respective advantages and disadvantages. Based on the test results and the precision and recall values, we choose and improve the random forest algorithm for the prototype's classification module, obtaining higher accuracy. We then develop a prototype system for web page classification in which the various algorithms are implemented as modules, which makes the software easy to develop. Lastly, we propose the idea of aggregating URL prefixes; with this approach the system's efficiency increases dramatically and the amount of computation drops. All of this reflects the considerable value of news classification.
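The N-gram cutting and information-gain-ratio scoring described above can be sketched in plain Python. This is a toy illustration, not the thesis's implementation: the sample documents, the choice of character 2-grams, and the binary present/absent weighting are all illustrative assumptions (the abstract does not specify the actual N or weighting scheme).

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Cut a string into overlapping character N-grams. This works for URLs
    and for Chinese text alike, since no word segmentation is required."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(docs, labels, feature):
    """Information gain ratio of the binary split 'feature present in doc'."""
    n = len(labels)
    present = [lab for d, lab in zip(docs, labels) if feature in d]
    absent  = [lab for d, lab in zip(docs, labels) if feature not in d]
    # Split information: entropy of the presence/absence distribution itself.
    split_info = entropy(["p"] * len(present) + ["a"] * len(absent))
    if split_info == 0:          # feature appears in all documents or in none
        return 0.0
    cond = sum((len(part) / n) * entropy(part)
               for part in (present, absent) if part)
    gain = entropy(labels) - cond
    return gain / split_info     # the ratio penalizes uninformative splits

# Toy labeled corpus (hypothetical page texts and categories).
corpus = [
    ("sports news: the match result", "sports"),
    ("football match tonight",        "sports"),
    ("stock market falls",            "finance"),
    ("bank and market report",        "finance"),
]
docs   = [set(char_ngrams(text)) for text, _ in corpus]
labels = [lab for _, lab in corpus]

# Score every N-gram feature and keep the highest-ranked ones.
scores = {f: gain_ratio(docs, labels, f) for f in set().union(*docs)}
top = sorted(scores, key=scores.get, reverse=True)[:10]
```

In this toy corpus the 2-gram "tc" (from "match") occurs in both sports pages and in neither finance page, so it scores a perfect 1.0, while "ma" (shared by "match" and "market") occurs everywhere and scores 0.0; the real system ranks the full N-gram feature space this way and keeps only the top features, shrinking the matrix before model training.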
Keywords/Search Tags: Webpage Vectorization, Information Gain, Bayes, SVM, Random Forest, News Classification