
Research And Implementation Of Chemical Web Information Acquisition Method

Posted on: 2017-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: J R Wan
GTID: 2308330485480611
Subject: Agricultural informatization
Abstract/Summary:
With the rapid development of Internet technology and the spread of computers, the Internet has become the world's largest information repository and the main place where chemical companies, organizations, and individuals publish chemical information. However, because of the sheer volume of information online, general-purpose search engines such as Baidu and Google retrieve chemical information with low accuracy and low efficiency, and cannot meet chemists' need for efficient retrieval. To address this problem, we took commonly used chemical websites as the research object and studied methods for acquiring chemical substance information from web pages, so that the information can be extracted from the pages and stored in a chemical database. The main contents and conclusions are as follows:

(1) Research on the method of collecting chemical substance pages. Collecting chemical substance pages is the prerequisite for information extraction. However, chemical websites also contain pages that are unrelated or only weakly related to chemical substance information (off-topic pages). To deal with this, we combined URL topical relevance prediction based on regular expressions with a page topical relevance judgment based on text heuristics, so that only on-topic pages are crawled, and implemented a topic web crawler on this basis. Experimental results show that the topic web crawler crawls chemical substance information pages effectively and meets the requirements of the research.

(2) Research on the method of extracting chemical substance information from web pages. The format and content of the crawled pages are not directly suitable for use as training pages or as pages to extract from. We therefore first repaired the format of each crawled page and removed its "noise information". Then, based on an analysis of the page structure, we designed a tree-structure-based extraction rule generation algorithm that automatically locates the iteration structure in a single web page and describes it as an extraction rule in the form of a regular expression. Finally, we used the extraction rules to extract the chemical information from the pages and saved it to the database. The experimental results show that the designed method extracts chemical information from web pages accurately, with recall staying above 95.2%.

(3) Design and implementation of a chemical web page information extraction system. Combining the page collection method with the information extraction method, we designed and implemented an extraction system in B/S (browser/server) mode that integrates page crawling, page tidying, extraction rule generation, and chemical information extraction. Testing and analysis showed that the system is usable in practice.
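The two-stage relevance check described in (1) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the URL patterns, the cue terms, the threshold, and the function names `url_looks_topical`, `page_relevance`, and `should_keep` are all assumptions chosen for illustration, and a real crawler would tune them per site.

```python
import re

# Hypothetical URL patterns for chemical substance pages (assumed, not from the thesis).
SUBSTANCE_URL_PATTERNS = [
    re.compile(r"/(compound|substance|chemical)/[\w-]+", re.IGNORECASE),
    re.compile(r"[?&]cas=\d{2,7}-\d{2}-\d\b", re.IGNORECASE),
]

# Hypothetical cue terms that tend to appear on substance pages (assumed).
TOPIC_TERMS = (
    "cas no",
    "molecular formula",
    "molecular weight",
    "melting point",
    "boiling point",
)


def url_looks_topical(url: str) -> bool:
    """Stage 1: predict relevance from the URL alone, before downloading."""
    return any(pattern.search(url) for pattern in SUBSTANCE_URL_PATTERNS)


def page_relevance(text: str) -> float:
    """Stage 2: fraction of cue terms found in the page text (0.0 to 1.0)."""
    lowered = text.lower()
    hits = sum(1 for term in TOPIC_TERMS if term in lowered)
    return hits / len(TOPIC_TERMS)


def should_keep(url: str, text: str, threshold: float = 0.4) -> bool:
    """Keep a page only if both the URL and the text look on-topic."""
    return url_looks_topical(url) and page_relevance(text) >= threshold


if __name__ == "__main__":
    sample = "CAS No. 64-17-5 Molecular Formula C2H6O Molecular Weight 46.07"
    print(should_keep("http://example.com/compound/ethanol", sample))
    print(should_keep("http://example.com/about-us", sample))
```

Checking the URL first lets a crawler skip downloading pages that are unlikely to be relevant; the text check then filters out pages whose URL matched but whose content is off-topic.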
Keywords/Search Tags:chemical substance information, web information extraction, topical relevance, topic web crawler, iteration structure