Data Extraction From Web Forums

Posted on: 2013-07-30
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J W Zhang
GTID: 1228330395955785
Subject: Computer application technology
Abstract/Summary:
Web 2.0 provides a wealth of services, and its huge number of participants is turning it into an ecosystem. As well as presenting rich information to people, the Web harvests massive amounts of user-contributed content, which holds great value.

As a typical Web 2.0 service, Web forums give users a platform to publish and exchange information. For example, people release information or make comments by sharing product experience, exchanging life experience, discussing education, posting gossip, and so on. Such user-generated content reflects people's real needs and viewpoints, social phenomena, and more. Extracting data from Web forums is therefore a practical and meaningful problem, since it is critical for product recommendation, expert discovery, public opinion monitoring, and other analysis tasks.

Forum data contains not only a great deal of useful user-generated content but also noise, such as recommendations and advertisements. In addition, the large number of Web sites with different styles makes forum data extraction even more challenging. Traditional Web data extraction methods usually work on structured data, so it is necessary to revisit existing work and devise new, efficient extraction methods for Web forums. This dissertation makes the following contributions:

· A forum data extraction method with high precision and recall that integrates inductive logic programming and XPath pattern learning. The method fully considers the structural features of forum pages, introduces new predicates, unifies logic program expressions and XPath patterns, and learns XPath patterns in a divide-and-conquer way. XPath patterns express the structural features of the target data. Finally, the XPath patterns are automatically transformed into an XSLT file, which transfers the extracted data into a predefined storage model to complete forum data extraction.

· An unsupervised forum data extraction method based on both the structural features of Web pages and the relationships between Web pages, which makes the extraction process automatic. Exploiting the structural similarity among Web pages from the same Web site, the method applies comparison operations to multiple Web pages to divide them into stable parts and unstable parts, and introduces two filtering operations, page-level filtering and template-level filtering, to remove most noise from the pages. Finally, path accompanying distance and path similarity are defined and used to compute the dependency relationship between paths in the stable parts and paths in the unstable parts. This dependency relationship helps find the paths that locate the target data and so enables automatic extraction of forum posts.

· An unsupervised wrapper generation method for forum data extraction, which fully considers the features of Web page structure and content to improve the adaptability of the unsupervised method across different forums and to ensure the integrity of the extracted forum posts. The method has two stages, which exploit the features of Web page structure, user content, and some redundant information generated by the forums themselves. First, it locates user areas by using the redundant information; user information is then obtained by finding a maximum substructure in those user areas. Second, it distinguishes user-generated content from noise by loading all data into a table and applying an attribute dependency computation to this table to identify which items should be retained. All the paths locating the content discovered in these two stages are gathered and induced into a regular tree for future use.

In summary, this dissertation proposes three methods for extracting forum data. The first is a supervised extraction-rule learning method; it performs well on precision and recall and is suited to small-scale data sets. The second is an unsupervised method that extracts data simultaneously from multiple forum pages, has no explicit rules, and can handle a larger data set. The third is also unsupervised; it learns extraction rules and uses them to extract data, balancing extraction automation against performance, and it can handle a larger data set than the first two methods. Extensive experiments on real forum data sets show that these methods achieve good extraction performance.
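
The stable/unstable division used by the second method can be illustrated with a small sketch. This is not the dissertation's algorithm: it assumes, purely for illustration, that a "path" is a root-to-node tag sequence and that a path is stable when its text is identical on every compared page. The names `PathTextCollector`, `path_texts`, and `split_stable_unstable` are hypothetical, and the page-level and template-level filtering and the path accompanying distance are not reproduced here.

```python
from collections import defaultdict
from html.parser import HTMLParser


class PathTextCollector(HTMLParser):
    """Record the text found under each root-to-node tag path."""

    # Tags that never have a closing tag; they are not pushed on the stack.
    VOID = {"br", "img", "hr", "input", "meta", "link"}

    def __init__(self):
        super().__init__()
        self.stack = []
        self.texts = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.texts["/" + "/".join(self.stack)].append(text)


def path_texts(html):
    """Map each tag path in one page to the tuple of texts found there."""
    collector = PathTextCollector()
    collector.feed(html)
    collector.close()
    return {path: tuple(texts) for path, texts in collector.texts.items()}


def split_stable_unstable(pages):
    """Assumed criterion: a path is stable if its text is the same on all pages."""
    per_page = [path_texts(page) for page in pages]
    all_paths = set().union(*per_page)
    stable, unstable = set(), set()
    for path in all_paths:
        values = {page.get(path) for page in per_page}
        (stable if len(values) == 1 else unstable).add(path)
    return stable, unstable


# Two made-up pages that share a site banner but carry different posts.
PAGE_A = ("<html><body><div><h1>Example Forum</h1></div>"
          "<ul><li><span>alice</span><p>First post text</p></li></ul></body></html>")
PAGE_B = ("<html><body><div><h1>Example Forum</h1></div>"
          "<ul><li><span>bob</span><p>Another reply</p></li></ul></body></html>")

stable, unstable = split_stable_unstable([PAGE_A, PAGE_B])
print("stable:", sorted(stable))
print("unstable:", sorted(unstable))
```

In this toy input the banner path falls into the stable set, while the author and post-body paths fall into the unstable set, which is where user-generated content would be expected under the stated assumption.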
Keywords/Search Tags: Forum Data Extraction, User-Generated Contents, Wrapper, Inductive Logic Programming