Design And Implementation Of Web Information Extraction System For Evaluation Of Search Engine

Posted on: 2013-09-01  Degree: Master  Type: Thesis
Country: China  Candidate: B Liu  Full Text: PDF
GTID: 2268330392469552  Subject: Software engineering
Abstract/Summary:
With the rapid growth of information on the Internet, the web has gradually become the main platform through which people obtain information. The major search engines have emerged accordingly and compete intensely. Search results and user experience are the two factors by which the quality of a search engine is judged. Because a good user experience brings traffic to the engine, evaluation that focuses on users' satisfaction with the search engine has become more and more important, and companies that run large search engines have set up dedicated evaluation teams or departments and use the evaluation data to guide the direction of the search engine.

The principle of the evaluation is to collect the search results of the search engines for users to score, and to compute statistics on the indicators so that they can be compared. Obtaining the evaluation data successfully is the key to the evaluation task.

This thesis stresses the importance of information extraction, because the experimental data show that the accuracy of data extraction directly affects the evaluation results. The author compares existing Web information extraction technologies, analyzes them against the requirements of this system, and points out their advantages and disadvantages in light of the characteristics of the source code of search engine results pages. The author also proposes a new method that combines regular-expression matching with DOM parsing to extract and process the evaluation data.
Based on this idea, the author aims to build a Web information extraction system with strong applicability and a high degree of automation, to solve the problem of collecting evaluation data for the evaluation system.

The system mainly consists of page download, page filtering, extraction rule generation, information extraction, data storage, etc. The thesis describes these components in detail; among them, extraction rule generation is especially important to the realization of the system. Using the DOM structure and learning from samples, the system can generate extraction rules automatically: it finds the maximum common path of the sample nodes and records the characteristics of the sample nodes. At the same time, the system filters out irrelevant nodes with a node similarity matching algorithm, which achieves highly automated information extraction for some products. The extraction rules can also be amended manually. To improve accuracy, some products use regular-expression matching to extract information: the templates are coded manually and entered into the rule base in advance, and the system calls the template matching module to assign a template for the extraction.

Finally, the thesis introduces two indicators for evaluating information extraction, precision and recall, tests data download, and analyzes the results of information extraction. According to these indicators, the author concludes that the system works well on result pages generated by search engines and solves the problem of obtaining evaluation data efficiently and accurately for reviewers.
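
The rule-generation step mentioned in the abstract (finding the maximum common path of the sample nodes) can be illustrated with a minimal sketch. This is not the thesis's implementation: the page markup, the function names `node_path`, `max_common_path`, and `extract_records`, and the use of Python's standard `xml.etree.ElementTree` on a well-formed snippet are all assumptions made for illustration, and the node similarity filtering is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed result page used only for illustration.
PAGE = """<html><body>
<div id="results">
<div class="result"><h3><a href="http://example.com/1">First</a></h3><p>Snippet one</p></div>
<div class="result"><h3><a href="http://example.com/2">Second</a></h3><p>Snippet two</p></div>
</div>
<div id="ads"><p>Sponsored</p></div>
</body></html>"""


def node_path(root, node):
    """Return the list of tags from the root element down to `node`."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    path = [node.tag]
    while node in parent_of:
        node = parent_of[node]
        path.append(node.tag)
    return path[::-1]


def max_common_path(paths):
    """Return the longest prefix shared by all tag paths."""
    common = []
    for tags in zip(*paths):
        if len(set(tags)) != 1:
            break
        common.append(tags[0])
    return common


def extract_records(root, common):
    """Apply the common path as a rule; return the text of each matching node."""
    if len(common) <= 1:
        nodes = [root]
    else:
        # The first tag is the root itself, so the query starts below it.
        nodes = root.findall("./" + "/".join(common[1:]))
    return [" ".join(t.strip() for t in n.itertext() if t.strip()) for n in nodes]


root = ET.fromstring(PAGE)

# Two sample nodes a user might mark on one result: its title link and its snippet.
samples = [
    root.find(".//div[@class='result']/h3/a"),
    root.find(".//div[@class='result']/p"),
]

common = max_common_path([node_path(root, s) for s in samples])
records = extract_records(root, common)
print(common)
print(records)
```

In this sketch the common path of the two samples identifies the container of one search result, and applying it to the whole page returns every result while skipping the `ads` block, which has a different structure. Real result pages are rarely well-formed XML, so a tolerant HTML parser would be needed in practice.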
Keywords/Search Tags: Web information extraction, DOM, search engine, evaluation on satisfaction of users, data of evaluation