| With the rapid development of the Internet, the Web has become one of the most important knowledge repositories, and extracting and using this knowledge quickly and efficiently has good application prospects and value. The inherent characteristics of the Internet (large volume, semi-structured content, and dynamic change) make information extraction complex and place demands on its scalability and adaptability. The emergence of XML technology, however, provides an opportunity to solve these problems in web information extraction. After analyzing XML and information extraction technologies, the paper finds that the main difficulty in current web information extraction is how to define extraction rules efficiently. To address this, the paper proposes a web information extraction solution based on common path learning and studies the related technologies in depth. The key problem in information extraction is how to generate accurate, general, and robust extraction rules. The paper uses the strengths of the XSL and XPath standards in locating and transforming data to solve this problem. In the proposed method, inductive learning is carried out automatically from training samples with heuristic processing, and the information blocks that users are interested in are located accurately from the patterns that recur across the samples. From these, XSLT-based extraction rules are generated, and information is then extracted automatically by applying the rules. Finally, in the context of an actual project in which the author participated, a prototype information extraction system with good human-computer interaction was built in C++ on the Windows platform. Experimental results show that the system can extract information of interest from web pages in the target domain, while offering a good user experience, scalability, and adaptability. |
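
The abstract does not spell out the learning procedure, so the following is only a minimal sketch of one plausible reading of learning a common path from samples. It assumes that each training sample yields a simple absolute XPath (element names with optional positional predicates) to the block a user marked, and that generalization means keeping the steps on which all samples agree and dropping the positional predicates on which they differ. The names `Step`, `parsePath`, and `generalizePaths` are hypothetical and are not taken from the paper.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical representation of one location step, e.g. "tr[2]".
struct Step {
    std::string name;  // element name, e.g. "tr"
    int index;         // positional predicate; 0 means "no predicate"
};

// Split a simple absolute path such as "/html/body/table[2]" into steps.
// Assumption: only element names and numeric positional predicates occur.
std::vector<Step> parsePath(const std::string& path) {
    std::vector<Step> steps;
    std::istringstream in(path);
    std::string token;
    while (std::getline(in, token, '/')) {
        if (token.empty()) continue;  // empty token before the leading '/'
        Step s{token, 0};
        std::size_t open = token.find('[');
        if (open != std::string::npos && token.back() == ']') {
            s.name = token.substr(0, open);
            s.index = std::stoi(token.substr(open + 1, token.size() - open - 2));
        }
        steps.push_back(s);
    }
    return steps;
}

// Generalize sample paths into one path: steps must share depth and names;
// a positional predicate is kept only if every sample agrees on it.
// Returns an empty string when the samples cannot be aligned this way.
std::string generalizePaths(const std::vector<std::string>& samples) {
    if (samples.empty()) return "";
    std::vector<Step> common = parsePath(samples[0]);
    for (std::size_t i = 1; i < samples.size(); ++i) {
        std::vector<Step> other = parsePath(samples[i]);
        if (other.size() != common.size()) return "";
        for (std::size_t j = 0; j < common.size(); ++j) {
            if (other[j].name != common[j].name) return "";
            if (other[j].index != common[j].index) common[j].index = 0;
        }
    }
    std::string result;
    for (const Step& s : common) {
        result += "/" + s.name;
        if (s.index > 0) result += "[" + std::to_string(s.index) + "]";
    }
    return result;
}
```

Under these assumptions, the two sample paths `/html/body/table[2]/tr[1]/td[3]` and `/html/body/table[2]/tr[2]/td[3]` generalize to `/html/body/table[2]/tr/td[3]`, which selects the third cell of every row. A path of this kind could then serve as the `select` expression of a generated XSLT template; how the paper actually builds its rules is not stated in the abstract.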