The Internet plays an increasingly significant role in people's lives, and the World Wide Web has become its most important service. Online public opinion analysis depends on extracting suitable information from the vast Web effectively. Although many methods have been proposed for this problem, existing Web information extraction methods have limitations. This thesis studies data acquisition techniques based on Web crawlers and automated Web information extraction.

First, this thesis introduces the background and development history of Web crawlers and Web information extraction, and analyzes the characteristics and disadvantages of existing methods.

Second, this thesis analyzes the special requirements that online public opinion analysis places on source data, and proposes two specialized crawlers, one for forum websites and one for blog websites. For forum websites, a crawling strategy based on three levels is proposed; the crawler can extract information effectively with only the template of one topic, and a reverse crawling strategy solves the problem of topic updating. The blog crawler uses the user as its crawling unit: it maps user IDs into memory to check for existing users, extracts information from pages on the first visit and from RSS for subsequent updates, and obtains information in real time by focusing on active users.

Third, this thesis proposes an automated Web information extraction and classification method for list pages, based on the DOM tree structure and text features of Web pages. The method first preprocesses the page and constructs its DOM tree. It then extracts the informative data record set by mining similar subtrees from the DOM tree, and extracts detailed information from each data record in that set based on the structure of the subtree. Finally, the detailed information is classified by extracting template text, matching CSS class values against a classification feature library, and analyzing tag names and text features.

Finally, experiments show that the proposed method can extract and classify information effectively.
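To make the idea of mining similar subtrees from a DOM tree concrete, the following is a minimal sketch and not the method of this thesis. It assumes that "similar" can be approximated by exact equality of a tag-structure signature among sibling subtrees; the abstract does not specify the actual similarity measure. The names `Node`, `TreeBuilder`, `signature`, `find_record_groups`, and `extract_records`, as well as the `min_repeat` threshold, are hypothetical and introduced only for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have children (a small, non-exhaustive set).
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A very small DOM-like node: tag name, parent, children, direct text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree from HTML (assumes roughly well-formed input)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, then close it.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def signature(node):
    """Tag-structure signature of a subtree, ignoring text and attributes."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def all_text(node):
    """Direct text of the node followed by the text of its descendants."""
    parts = list(node.text)
    for child in node.children:
        parts.extend(all_text(child))
    return parts


def find_record_groups(root, min_repeat=3):
    """Return groups of sibling subtrees that share the same signature.

    Assumption: exact signature equality stands in for subtree similarity.
    """
    groups = []
    stack = [root]
    while stack:
        node = stack.pop()
        by_signature = defaultdict(list)
        for child in node.children:
            if child.children:  # skip leaf elements, which carry no structure
                by_signature[signature(child)].append(child)
        groups.extend(g for g in by_signature.values() if len(g) >= min_repeat)
        stack.extend(node.children)
    return groups


def extract_records(html, min_repeat=3):
    """Parse HTML and return, per group, the text fields of each record."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    groups = find_record_groups(builder.root, min_repeat)
    return [[all_text(record) for record in group] for group in groups]
```

For a list page whose `<ul>` contains three `<li>` items, each holding an `<a>` title and a `<span>` author, `extract_records` returns one group with three records, each a list of the title and author strings. A real list page would need a tolerant similarity measure, since records often differ slightly in structure.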