
Research And Application Of Web Information Extraction Technology Oriented To Coal Mine Safety Incidents

Posted on: 2016-01-30 | Degree: Master | Type: Thesis
Country: China | Candidate: Z W Liu | Full Text: PDF
GTID: 2308330503950758 | Subject: Software engineering
Abstract/Summary:
The Internet offers a wealth of resources in which users can find all kinds of information. However, extracting the required information from a cluttered web page remains a problem that many researchers hope to solve. Web data is large in volume, heterogeneous in structure, and variable in content, so web information extraction differs from traditional extraction methods. With growing user demand, a wide variety of information extraction methods have appeared in recent years, both in China and abroad. Based on the characteristics of coal mine safety event information on the Internet, this thesis presents a web information extraction method oriented to coal mine safety incidents, which provides users with a faster and more accurate mine safety event information service.

First, web page cleaning technology is studied in depth. Page cleaning tidies the page source code and removes some of the noise data in the page. After analyzing the characteristics of the noise data, JTidy is used to normalize the page formatting. HTMLParser is then used to parse the page's HTML tags, build the page tag tree, and remove some of the noise tags.

Second, page topic extraction is studied, and a theme extraction algorithm based on the Measure of Medium Truth Degree is proposed. The algorithm derives the relevant features of the tag tree by analyzing the topic features of the page, and extracts the page topic by combining them with the theory of the Measure of Medium Truth Degree. To some extent, this approach improves the accuracy of theme information extraction.

Third, extraction rule generation is studied. After analyzing the methods of building extraction rules based on absolute and relative paths, an improvement on path-based rule generation is proposed: building extraction rules based on feature comparison. The method builds a feature class by selecting node feature items and then generates the corresponding extraction rules. Because it adds node features other than the path, it helps improve the robustness of the extraction rules and, in turn, the accuracy of information extraction.

Fourth, the web information extraction technology itself is studied. For mine safety event pages that contain multiple records, a DOM-based extraction algorithm is used, which can also extract the text of each record; text information is then extracted using template-based and inductive statistical methods. Once extraction is complete, the results are presented to the user and the extracted data is stored in a relational database.

Finally, based on the research above, a web information extraction system oriented to coal mine safety incidents is designed and implemented. The system was tested on a number of websites related to coal mine safety incidents. The tests demonstrate the feasibility of the system, and the results indicate that it achieves relatively high Precision and Recall for mine safety event information extraction.
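
The abstract reports Precision and Recall without defining them. The sketch below assumes the standard definitions (precision is the share of extracted records that are correct; recall is the share of expected records that were extracted) and assumes records are compared by exact match against a hand-labelled set. The thesis's actual evaluation procedure is not described here, and the function name and record identifiers are hypothetical.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for extracted records versus a labelled set.

    Assumption: a record counts as correct only if it exactly matches
    a record in the hand-labelled `expected` collection.
    """
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)  # records that are both extracted and expected
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Made-up identifiers for illustration only; not data from the thesis.
expected_records = {"r1", "r2", "r3", "r4", "r5"}
extracted_records = {"r1", "r2", "r3", "x9"}

p, r = precision_recall(extracted_records, expected_records)
print(f"precision={p:.2f} recall={r:.2f}")
```

With these made-up sets, three of the four extracted records are correct and three of the five expected records are found, so one spurious record lowers precision while two missed records lower recall.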
Keywords/Search Tags: coal mine safety incidents, page cleaning, information extraction, Measure of Medium Truth Degree, feature comparison