
Research On Chemical Information Acquisition Method Based On Web Data Mining

Posted on: 2013-01-11   Degree: Master   Type: Thesis
Country: China   Candidate: S Feng   Full Text: PDF
GTID: 2218330374468367   Subject: Computer application technology
Abstract/Summary:
With the rapid development of the Internet, online information resources are growing day by day, and conventional means of retrieving information suffer from low accuracy and low efficiency. This thesis therefore takes commonly used chemical websites as its object of study and investigates how to obtain information from web pages quickly and efficiently, so that a chemical environmental safety database can be updated automatically. First, vertical search engine technology is used to retrieve pages related to chemicals, and the page structure is then analyzed; appropriate techniques and methods are chosen according to how structured the pages are. Second, methods such as a sorting algorithm and a global model are used to integrate heterogeneous data from chemical substance websites. In addition, task segmentation and dynamic update checking methods are presented so that information from dynamically updated source websites is extracted continuously and in a timely manner. The main contents are as follows:

(1) Research on the dynamic acquisition of online information on chemical substances. The main task of online acquisition is to obtain the CasNo (chemical registration number), name, physicochemical properties, and other information. Depending on the type of site page, either focused crawler technology or a simulation of manual web browsing is used to obtain the web pages. The tree structure of each page is then analyzed, the wrapper technique is used to extract chemical-related attribute information, and regular expressions are applied to extract structured information from unstructured data. Monitoring technology is also adopted to schedule visits to the chemical substance websites, ensuring that online chemical substance information is acquired automatically and that the data are updated in a timely manner.

(2) Research on methods for integrating heterogeneous chemical data. To address the heterogeneity of chemical substance data across web pages, this thesis proceeds as follows. First, according to the properties relevant to chemical environmental safety, the integration scope is determined and the public data model CompoundsDTO is designed as the global schema. Second, a sorting algorithm is used to analyze the dynamically acquired data. Finally, the processed data are mapped to the global model. These steps integrate the heterogeneous data and effectively eliminate structural and semantic conflicts among the heterogeneous data sources.

(3) Design and development of a chemical and environmental safety data management system. Building on the chemical environmental safety database, the online dynamic acquisition and heterogeneous data integration techniques are applied to design a data management system for the environmental safety of chemicals. The system extracts chemical information from the Internet automatically and in a timely manner. In addition, data with a unified structure are saved in the database, and the new dynamic detection technique is used to query the database continuously.
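
As an illustration of items (1) and (2), the sketch below extracts CAS numbers from unstructured text with a regular expression and maps records from two differently structured sources onto one common model. It is a minimal sketch under stated assumptions, not the thesis implementation: the abstract names CompoundsDTO but does not list its fields, so the fields shown here, the source names `site_a` and `site_b`, and their column labels are all hypothetical. The check-digit rule is the standard one for CAS Registry Numbers. The sorting step mentioned in item (2) is not shown.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Optional

# CAS Registry Number layout: 2-7 digits, 2 digits, 1 check digit, hyphen-separated.
CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")


def is_valid_cas(cas: str) -> bool:
    """Verify the CAS check digit (weighted digit sum modulo 10)."""
    match = CAS_PATTERN.fullmatch(cas)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    # Weights count up from 1, starting at the rightmost digit of the body.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == int(match.group(3))


def extract_cas_numbers(text: str) -> List[str]:
    """Return the valid CAS numbers found in free text, in order of appearance."""
    return [m.group(0) for m in CAS_PATTERN.finditer(text) if is_valid_cas(m.group(0))]


@dataclass
class CompoundsDTO:
    """Global model. The field set is hypothetical; the abstract does not specify it."""
    cas_no: str
    name: Optional[str] = None
    melting_point_c: Optional[float] = None


# Hypothetical per-source mappings from local column labels to global field names.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "site_a": {"CAS No.": "cas_no", "Product Name": "name", "mp_celsius": "melting_point_c"},
    "site_b": {"cas": "cas_no", "chemical_name": "name", "melting_point": "melting_point_c"},
}


def to_global_model(source: str, record: Dict[str, str]) -> CompoundsDTO:
    """Rename source-specific fields and normalize value types."""
    mapping = FIELD_MAPS[source]
    fields = {mapping[key]: value for key, value in record.items() if key in mapping}
    if not is_valid_cas(fields.get("cas_no", "")):
        raise ValueError(f"missing or invalid CAS number in record from {source}")
    if fields.get("melting_point_c") is not None:
        fields["melting_point_c"] = float(fields["melting_point_c"])
    return CompoundsDTO(**fields)
```

Validating the check digit filters out strings that merely look like CAS numbers, and keeping the per-source mappings in a table means that supporting another site only requires a new entry rather than new code.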
Keywords/Search Tags: Web mining, focused crawler, chemical substances, dynamic update inspection, timely update