
Design And Implementation Of An Automated Information Collection System

Posted on: 2019-03-04
Degree: Master
Type: Thesis
Country: China
Candidate: S Li
Full Text: PDF
GTID: 2348330545958504
Subject: Computer technology
Abstract/Summary:
In today's era of explosive growth in information and data, mining data from the Internet makes it possible to extract valuable information and to predict the occurrence of certain events. Mainstream search engines such as Google and Baidu deploy their own information collection systems (crawler systems) around the world. The most important part of an information collection system is parsing web pages and extracting the data of interest. In a general-purpose information collection system, extraction rules must be formulated by hand for each website, and a great deal of human effort is consumed even when many of the page structures encountered are similar. Automated information collection can solve this problem.

Existing automated page parsing algorithms generally rely on template generation or machine learning for automatic information extraction. The most common approaches include heuristic algorithms, tree alignment, and template generation methods such as RoadRunner. The problems with these algorithms include noise in the extracted information and long extraction times. To address these problems, this thesis makes contributions in three areas.

First, to reduce manual intervention and the high proportion of noisy information in web crawler page extraction, this thesis proposes a trinary tree information extraction algorithm based on web label (tag) content blocks. The labels and thresholds are determined through extensive analysis, and content block extraction is then combined with Trinity to form a new algorithm. Experiments show that this algorithm outperforms comparable algorithms in both running time and proportion of noisy information.

Second, to match each page with a suitable information extraction algorithm, web page structures must be classified. The most common approach is the edit distance of the DOM tree, whose main disadvantage is that determining whether two web pages share the same structure takes a long time. This thesis proposes a new method that judges the similarity of web page structures by the edit distance of the web page label attribute string. The method rests on the observation that popular websites are unlikely to share page templates with one another, whereas different forums within the same site are likely to share a template. Experiments show that the time needed to determine whether two page structures are similar is about 3/4 of that of the DOM tree edit distance method.

Third, the thesis designs an automated information collection system. A distributed architecture is used to speed up collection, ZooKeeper provides dynamic configuration, and MySQL stores the data. The system avoids the manual formulation of information extraction rules.
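The second contribution, comparing page structures through the edit distance of a label attribute string rather than of the full DOM tree, can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not say how the label attribute string is built, so the token format (tag name plus `class` attribute), the helper names `tag_tokens` and `structure_similarity`, and the normalization by the longer sequence length are all assumptions made for illustration.

```python
from html.parser import HTMLParser


class TagSequenceParser(HTMLParser):
    """Collects one token per opening tag, e.g. 'div.post' (assumed format)."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        cls = dict(attrs).get("class") or ""
        self.tokens.append(f"{tag}.{cls}" if cls else tag)


def tag_tokens(html):
    """Flatten a page into its sequence of tag/class tokens (hypothetical helper)."""
    parser = TagSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.tokens


def levenshtein(a, b):
    """Dynamic-programming edit distance over two sequences, row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def structure_similarity(html_a, html_b):
    """Similarity in [0, 1]; 1.0 means identical token sequences (assumed normalization)."""
    a, b = tag_tokens(html_a), tag_tokens(html_b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

Two pages rendered from the same template differ mainly in their text, so their token sequences match and the similarity is close to 1. A cutoff for treating two pages as structurally similar would have to be chosen empirically; the abstract does not state one.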
Keywords/Search Tags: Information collection system, Web page block, Trinary tree, Levenshtein