
Research And Application Of Random Forest Technology In News Page Classification Systems

Posted on: 2014-12-13
Degree: Master
Type: Thesis
Country: China
Candidate: C Liu
Full Text: PDF
GTID: 2268330425464395
Subject: Computer application technology
Abstract/Summary:
News is an important part of people's daily life and can be found everywhere. Collecting news from the network, which includes discovering news sites, is therefore a significant task. Web spiders crawl the Internet to collect news pages and to find new news sites, and the collected news is then stored in a database. As the amount of news grows, the news pages need to be classified: according to the content of each report, the system must determine which category it belongs to and where it should be stored.

News pages differ greatly from other media and have many advantages. Traditional news uses paper to disseminate information; with the rapid development of the Internet, this mode of transmission wastes paper resources, and more and more information is instead presented to users as web pages. If the system offered every category of news to every user, it would send excessive data, and since a user may not be interested in all categories, resources would be wasted. Before users read the news, the system must therefore apply some technique to classify news into different categories. Because a web page cannot be processed directly by a computer, news classification is a complicated problem: pages presented as text must first be translated into a form the computer can understand before the emerging algorithms can be applied. With the development of machine learning, artificial intelligence, and other advanced technologies, more and more algorithms are used not only in their original fields but also for web page classification, so web page classification has attracted increasing attention.

This thesis focuses on the identification and classification of news. The news classification process usually consists of collecting a training set, vectorizing web pages, selecting features, classifying, training models, evaluating models, and developing a prototype system.
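The spider's breadth-first crawling can be sketched as a standard queue-based traversal. This is a minimal illustration rather than the thesis's actual spider: the link graph below is a hard-coded stand-in for real HTTP fetching and link extraction, and all URLs are hypothetical.

```python
from collections import deque

# Hypothetical link graph standing in for the live web. In a real spider,
# fetch_links(url) would download the page and extract its outgoing links.
LINK_GRAPH = {
    "http://news.example.com/": ["http://news.example.com/china",
                                 "http://news.example.com/sports"],
    "http://news.example.com/china": ["http://news.example.com/china/1.html"],
    "http://news.example.com/sports": ["http://news.example.com/sports/1.html"],
    "http://news.example.com/china/1.html": [],
    "http://news.example.com/sports/1.html": [],
}

def fetch_links(url):
    return LINK_GRAPH.get(url, [])

def bfs_crawl(seed, max_pages=100):
    """Visit pages level by level (breadth-first), starting from a seed URL."""
    visited = []
    seen = {seed}            # URLs already queued, to avoid re-crawling
    queue = deque([seed])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real system would store/parse the page here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

pages = bfs_crawl("http://news.example.com/")
```

Breadth-first order means site front pages and section indexes are reached before deep article pages, which is one reason such a strategy yields a cleaner candidate set for manual labeling.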
To obtain a more precise training set, we make the spider adopt a breadth-first crawling strategy and then label the collected pages ourselves; the result is cleaner and more accurate than training sets available online. To vectorize a web page, we use the N-gram method to cut the page text and its URL into N-gram words. All N-gram words together constitute the feature space; for each word in each page we compute a weight and fill this information into a vector, so that all vectors form the initial matrix. A computer can handle such a matrix easily, but its dimensionality is too large, so noisy and irrelevant features must be removed. This thesis uses the information gain ratio to select features from the matrix.

Model selection and training are the focus of this study. We train models with a variety of algorithms and compare their respective advantages and disadvantages. Based on the test results and the precision and recall values, we choose and improve the random forest algorithm for the prototype's classification module, obtaining higher accuracy. We then develop a prototype system for web page classification in which the various algorithms are implemented as modules, which makes the software easy to develop. Lastly, we propose the idea of aggregating URL prefixes; with this approach the system's efficiency increases dramatically and the amount of computation drops. All of this reflects the considerable value of news classification.
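The N-gram cutting and information-gain-ratio scoring described above can be sketched in plain Python. This is a toy illustration, not the thesis's implementation: the sample documents, the choice of character 2-grams, and the binary present/absent weighting are all illustrative assumptions (the abstract does not specify the actual N or weighting scheme).

```python
import math
from collections import Counter

def char_ngrams(text, n=2):
    """Cut a string into overlapping character N-grams. This works for URLs
    and for Chinese text alike, since no word segmentation is required."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(docs, labels, feature):
    """Information gain ratio of the binary split 'feature present in doc'."""
    n = len(labels)
    present = [lab for d, lab in zip(docs, labels) if feature in d]
    absent  = [lab for d, lab in zip(docs, labels) if feature not in d]
    # Split information: entropy of the presence/absence distribution itself.
    split_info = entropy(["p"] * len(present) + ["a"] * len(absent))
    if split_info == 0:          # feature appears in all documents or in none
        return 0.0
    cond = sum((len(part) / n) * entropy(part)
               for part in (present, absent) if part)
    gain = entropy(labels) - cond
    return gain / split_info     # the ratio penalizes uninformative splits

# Toy labeled corpus (hypothetical page texts and categories).
corpus = [
    ("sports news: the match result", "sports"),
    ("football match tonight",        "sports"),
    ("stock market falls",            "finance"),
    ("bank and market report",        "finance"),
]
docs   = [set(char_ngrams(text)) for text, _ in corpus]
labels = [lab for _, lab in corpus]

# Score every N-gram feature and keep the highest-ranked ones.
scores = {f: gain_ratio(docs, labels, f) for f in set().union(*docs)}
top = sorted(scores, key=scores.get, reverse=True)[:10]
```

In this toy corpus the 2-gram "tc" (from "match") occurs in both sports pages and in neither finance page, so it scores a perfect 1.0, while "ma" (shared by "match" and "market") occurs everywhere and scores 0.0; the real system ranks the full N-gram feature space this way and keeps only the top features, shrinking the matrix before model training.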
Keywords/Search Tags: Webpage Vectorization, Information Gain, Bayes, SVM, Random Forest, News Classification