Research And Implementation Of Focused Crawler Oriented To Engineering Technology

Posted on: 2017-03-15
Degree: Master
Type: Thesis
Country: China
Candidate: H Li
Full Text: PDF
GTID: 2348330509959843
Subject: Industrial Engineering
Abstract/Summary:
With the development of the Internet, more and more professional information is accumulating online. However, because Web information resources are growing so rapidly, traditional search engines fail to meet people's demand for customized information retrieval. A focused crawler can collect more detailed and specialized data for a particular domain. In order to gather professional information, a focused crawler oriented to the robot industry is designed and implemented on the basis of the WebMagic framework, with an added theme identification module in which the naive Bayes classification algorithm is improved. The achievements of this thesis are as follows. Firstly, after comparing crawler frameworks, WebMagic is selected as the base crawler framework, and its secondary development is discussed. To turn it into a focused crawler, a theme identification module is added. This module includes Web information extraction, Chinese word segmentation, stop word removal, and feature selection, among other steps. Secondly, after comparing different text classification algorithms, naive Bayes is chosen as the classification algorithm of the theme identification module, and its advantages and disadvantages are analyzed. Thirdly, to improve the performance of naive Bayes, three parameters are added: the magnification factor, the attribute weights, and the constraint factor of the theme. Finally, a comparison of the experimental results obtained from the focused crawler and a general crawler shows that the focused crawler is more accurate. A comparison of the results classified by naive Bayes and by the improved naive Bayes shows that the improved version achieves better accuracy, recall, and precision.
Keywords/Search Tags: Focused crawler, Text categorization, Robot, Naive Bayes
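
The theme identification step described in the abstract can be illustrated with a small naive Bayes topic filter. The sketch below is not the thesis's implementation (which builds on WebMagic, a Java framework) and does not reproduce its magnification factor or theme constraint factor, whose definitions the abstract does not give. It is a standard multinomial naive Bayes with Laplace smoothing, written in Python and applied to pages that are assumed to be already segmented into tokens. The class name `NaiveBayesTopicFilter`, the labels `on_topic` and `off_topic`, the toy training data, and the optional `term_weights` multiplier are all hypothetical; `term_weights` shows only one plausible way a per-attribute weight could enter the score.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesTopicFilter:
    """Multinomial naive Bayes over pre-segmented tokens, with Laplace smoothing."""

    def __init__(self, term_weights=None):
        # Hypothetical per-term multiplier on the log-likelihood (default 1.0).
        # This is an illustrative assumption, not the thesis's weighting scheme.
        self.term_weights = term_weights or {}
        self.class_docs = Counter()              # label -> number of training pages
        self.term_counts = defaultdict(Counter)  # label -> term -> count
        self.vocab = set()

    def fit(self, docs, labels):
        for tokens, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        total_docs = sum(self.class_docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                if tok not in self.vocab:
                    continue  # skip terms never seen in training
                # Laplace-smoothed estimate of P(term | label)
                likelihood = (counts[tok] + 1) / (total_terms + vocab_size)
                score += self.term_weights.get(tok, 1.0) * math.log(likelihood)
            scores[label] = score
        return scores

    def predict(self, tokens):
        scores = self.log_scores(tokens)
        return max(scores, key=scores.get)


# Toy training pages (hypothetical), already segmented and stop-word filtered.
train_docs = [
    ["robot", "arm", "servo", "control"],
    ["industrial", "robot", "welding"],
    ["football", "match", "score"],
    ["movie", "ticket", "score"],
]
train_labels = ["on_topic", "on_topic", "off_topic", "off_topic"]

clf = NaiveBayesTopicFilter().fit(train_docs, train_labels)
print(clf.predict(["robot", "control", "welding"]))
print(clf.predict(["football", "ticket"]))
```

In a focused crawler, a filter like this would sit between page download and link scheduling: pages predicted as off-topic are dropped and their outgoing links are not followed. Passing, for example, `term_weights={"robot": 3.0}` makes a domain term count more heavily toward the decision, which is the general idea behind weighting attributes.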