| In recent years, BBS forums, blogs, and Twitter have gradually become the major tools people use to publish information on the Internet. People can freely share their knowledge, ideas, and opinions in forums. Forum content is created by users, which makes it very important for opinion analysis and advertising applications. The first step in analyzing such data is to obtain it from forums. Traditional crawlers download data page by page and analyze the pages only after crawling, which makes them a poor fit for forums. Forums have their own structure, and their useful data is hidden inside the web pages. Because crawlers work page by page, they ignore the internal structure of forums and lose the relationships among forum pages. Web pages also contain a lot of useless data, and crawlers store all of it without extracting the useful parts. Based on these observations, we propose a new method that integrates crawling with rule-based information extraction, and we introduce a system, InForCE, that implements it. InForCE analyzes the structure of a forum and then schedules the crawling of post pages using information from list pages. It extracts information from web pages according to extraction rules and organizes it by post. InForCE consists of a crawler, an HTML parser, a post pool, a rule learner, and a rule pool. The crawler downloads web pages. The HTML parser transforms HTML into XHTML for information extraction. The post pool decides the strategy for crawling post pages. The rule learner and the rule pool are used to extract information. Our main contributions are as follows. We integrate crawling with information extraction and task scheduling. Forums have list pages and post pages. List pages contain brief information about the posts in a forum and serve as the entry points to post pages. Post pages contain the full information of posts. 
According to the information on list pages, we schedule the crawling of post pages and organize all the data of a post into a single document. We also propose descriptive pattern mapping rules based on XML and XPath patterns. The XPath patterns describe the mapping from source data to target data. Based on the structure of forums, we define the model of the target data and the mapping rules from source data to target data. Finally, we simplify the process of information extraction. We obtain pattern mapping rules through machine learning and convert them into XSLT automatically. The XSLT is then used to extract information from the source data. Automatic rule conversion lets users with no knowledge of XSLT complete data extraction tasks quickly. In summary, we analyze the problems in acquiring data from forums, propose our approach, and develop a tool, InForCE, to support it. Currently, InForCE runs successfully on the liba and soufun forums. The web pages we crawled from liba exceed 80 GB, and the target data exceeds 40 GB. Experimental results show that InForCE works well. |
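
The idea of using list-page information to schedule post-page crawling can be illustrated with a minimal sketch. This is not the InForCE implementation: the XHTML layout, the field names, the `RULES` table, and the "crawl when new or when the reply count grew" policy are all assumptions made for illustration. The sketch also applies XPath patterns directly through Python's standard library rather than compiling them to XSLT as the abstract describes.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML list page; element and class names are illustrative only.
LIST_PAGE = """
<html><body>
  <table>
    <tr class="post"><td><a href="/post/1">First topic</a></td><td class="replies">3</td></tr>
    <tr class="post"><td><a href="/post/2">Second topic</a></td><td class="replies">7</td></tr>
  </table>
</body></html>
""".strip()

# Hypothetical mapping rules: target field -> XPath relative to one post row.
ROW_PATH = ".//tr[@class='post']"
RULES = {
    "title": "td/a",
    "replies": "td[@class='replies']",
}


def extract_posts(xhtml):
    """Apply the mapping rules to a list page and return one record per post."""
    root = ET.fromstring(xhtml)
    posts = []
    for row in root.iterfind(ROW_PATH):
        record = {field: row.find(path).text for field, path in RULES.items()}
        record["url"] = row.find("td/a").get("href")
        record["replies"] = int(record["replies"])
        posts.append(record)
    return posts


def schedule(posts, post_pool):
    """Return URLs of posts that are new or have more replies than last seen.

    post_pool maps a post URL to the reply count recorded at the last crawl
    (an assumed stand-in for the post pool component).
    """
    tasks = []
    for post in posts:
        seen = post_pool.get(post["url"])
        if seen is None or post["replies"] > seen:
            tasks.append(post["url"])
            post_pool[post["url"]] = post["replies"]
    return tasks
```

Calling `schedule(extract_posts(LIST_PAGE), pool)` with a pool that already records the current reply count of a post skips that post and queues only the posts that are new or have grown, which is one plausible way a post pool could avoid re-downloading unchanged post pages.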