
Design and Implementation of a News Web Page Classification System Based on Spark

Posted on: 2018-08-06
Degree: Master
Type: Thesis
Country: China
Candidate: H J Wang
GTID: 2348330518494407
Subject: Computer technology
Abstract/Summary:
The Internet has developed rapidly into a vast, mature system that offers both an enormous volume of information and good timeliness. These advantages make us increasingly dependent on the Internet for information about the outside world. However, because the Internet is open and heterogeneous, the information on it is complex and disorganized, and it is difficult to locate the information one needs accurately within such a large and irregular body of content. In addition, users often want to filter out certain categories of pages. Web page classification is an effective way to address these problems: it organizes and processes web pages so that users can access resources conveniently and efficiently.

This thesis studies the whole process of traditional web page classification in depth and analyzes information extraction, feature selection, weight calculation, and classification. On this basis, the main contributions are:

1) To address the shortcoming that earlier web page classification methods ignore semantic-level information in the text, a topic model is introduced. The thesis proposes a classification method that combines the vector space model with the LDA topic model. Experimental results show that introducing the LDA model improves classification performance in all categories.
2) To address the problem that conventional web page classification methods ignore the structural information of web pages, TF-IDF is improved using web page structure information. The same dataset is vectorized with both the traditional TF-IDF and the improved TF-IDF, and the same SVM classifier is used for a comparison experiment. The results show that classification improves when web page structure information is taken into account.
3) To address the problem that web pages are treated as isolated objects, without regard to the relationships between them, web page information is used to improve the random forest method used in earlier web page classification. Experiments show that the improved random forest classifies better than the original random forest method.
4) Building on this theoretical research, a Spark-based web page classification system is implemented. Its main functions are crawling the dataset, extracting the effective information from web pages, text segmentation, feature selection, feature weight calculation, and classification of text of unknown class.
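The idea in item 2), letting web page structure influence TF-IDF, can be illustrated with a minimal sketch. The abstract does not give the actual weighting formula, so everything below is an assumption made for illustration: the field names (`title`, `heading`, `body`), the weights in `FIELD_WEIGHTS`, the smoothed IDF, and the helper names `weighted_term_counts` and `structure_tfidf` are all hypothetical and are not taken from the thesis.

```python
import math
from collections import Counter

# Hypothetical field weights (an assumption, not the thesis's values):
# a term in the title counts more than the same term in the body.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_term_counts(page):
    """Sum structure-weighted occurrences of each term in one page.

    `page` maps a field name to the list of tokens found in that field.
    Fields without a configured weight default to 1.0.
    """
    counts = Counter()
    for field, tokens in page.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for token in tokens:
            counts[token] += weight
    return counts


def structure_tfidf(pages):
    """Return one {term: score} dict per page.

    TF is the structure-weighted count normalized by the page total;
    IDF is a smoothed log ratio (one common variant, chosen here for
    simplicity).
    """
    n_pages = len(pages)
    per_page = [weighted_term_counts(p) for p in pages]

    # Document frequency: number of pages that contain each term.
    doc_freq = Counter()
    for counts in per_page:
        doc_freq.update(counts.keys())

    vectors = []
    for counts in per_page:
        total = sum(counts.values())
        vector = {}
        for term, count in counts.items():
            tf = count / total
            idf = math.log((1 + n_pages) / (1 + doc_freq[term])) + 1.0
            vector[term] = tf * idf
        vectors.append(vector)
    return vectors


if __name__ == "__main__":
    pages = [
        {"title": ["spark"], "body": ["news", "spark", "data"]},
        {"title": ["sports"], "body": ["news", "match"]},
    ]
    for vector in structure_tfidf(pages):
        print(sorted(vector.items(), key=lambda kv: -kv[1]))
```

With these assumed weights, a term that appears in the title of a page scores higher than an equally rare term that appears only in the body, which is the kind of effect a structure-aware weighting aims for. The resulting vectors could then be fed to any classifier, such as the SVM mentioned in item 2).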
Keywords/Search Tags:web page classification, web structure information, LDA, Spark