Chinese BBS Information Extraction And Classification

Posted on: 2010-12-02
Degree: Master
Type: Thesis
Country: China
Candidate: J Han
Full Text: PDF
GTID: 2218360305998712
Subject: Communication and Information System
Abstract/Summary:
Network information resources are disordered, massive, and constantly changing. Information extraction and classification in cyberspace help users find the information they need quickly and yield structured data that other application systems can use directly, which promotes the use of network information. Among the various information sources, this dissertation focuses on methods for extracting and classifying BBS information and describes that information in structured form. A DOM tree is constructed by parsing the BBS page, and the rule governing the BBS floor units (the individual posts in a thread) is derived from the positional regularity of elements in the DOM tree. Three kinds of anchor information are proposed: the structured anchor, the individual anchor, and the JavaScript anchor. Building on the distinctive characteristics of anchor information, an anchor induction algorithm is introduced. The algorithm effectively obtains the anchor information from BBS pages by using its position, quantity, and relations in the DOM tree, and then deduces the floor-unit rule in reverse. Once a stable mapping between the anchor information and the floor units is established, the floor units are located on the DOM tree through the path of the anchor information and then split from the DOM tree accurately.
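
The idea of locating floor units through repeated anchors can be illustrated with a small sketch. It is not the dissertation's anchor induction algorithm, only a simplified stand-in under stated assumptions: the page is well-formed XHTML, every floor unit contains one `<a>` anchor at the same tag path, and the most frequently repeated tag path is taken to be the structured anchor. The sample page and the names `PAGE`, `index_chains`, and `split_floors` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical, well-formed sample page: three posts, each with an <a> anchor.
PAGE = """
<html><body>
  <div id="header"><a href="/home">Home</a></div>
  <div id="thread">
    <div class="post"><a name="floor1">#1</a><span>alice</span><p>Hello</p></div>
    <div class="post"><a name="floor2">#2</a><span>bob</span><p>Hi there</p></div>
    <div class="post"><a name="floor3">#3</a><span>carol</span><p>Welcome</p></div>
  </div>
</body></html>
"""

def index_chains(root):
    """Map each element to its ancestor chain (root first, element last)."""
    chains = {}

    def walk(node, chain):
        chain = chain + [node]
        chains[node] = chain
        for child in node:
            walk(child, chain)

    walk(root, [])
    return chains

def split_floors(root, anchor_tag="a"):
    """Return (anchor tag path, list of floor-unit elements)."""
    chains = index_chains(root)
    # Group candidate anchors by tag path, e.g. "html/body/div/div/a".
    groups = defaultdict(list)
    for node, chain in chains.items():
        if node.tag == anchor_tag:
            groups["/".join(n.tag for n in chain)].append(chain)
    if not groups:
        return "", []
    # Assumption: the most repeated path belongs to the structured anchor.
    path, anchor_chains = max(groups.items(), key=lambda kv: len(kv[1]))
    if len(anchor_chains) < 2:
        return path, []
    # Walk down from the root while all anchors share the same ancestor;
    # the first ancestors that differ are taken as the floor units.
    depth = 0
    while all(c[depth] is anchor_chains[0][depth] for c in anchor_chains):
        depth += 1
    return path, [c[depth] for c in anchor_chains]

root = ET.fromstring(PAGE.strip())
path, floors = split_floors(root)
print(path, len(floors))
```

Real BBS pages are rarely well-formed XML, so a tolerant HTML parser would be needed in practice; the sketch also ignores the individual and JavaScript anchors named in the abstract.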
Experimental analysis shows that the method correctly handles 87.39 percent of BBS pages. When information is extracted from BBS pages by floor unit, the floor units from the same BBS site share the same DOM sub-tree structure, so the position of the required information within the sub-tree is fixed. By comparing the DOM sub-trees of two floor units and extracting the content that differs at the same position, the collection of information items for each floor unit is obtained. In the classification step, the information items are sorted by their position in the DOM sub-tree. Each item is mapped to the semantic label of its category according to its underlying semantic features, which recovers 70 percent of the structured schema information of the BBS back-end database table. The method greatly reduces manual labor. The structured table data produced by BBS information extraction and classification is conducive to the design and management of BBS sites.
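
The sub-tree comparison described above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: it assumes two floor units with identical sub-tree structure, compares only each element's leading text, and uses hypothetical sample data and names (`FLOOR_A`, `FLOOR_B`, `text_slots`, `differing_items`).

```python
import xml.etree.ElementTree as ET

# Hypothetical floor units from the same site: same structure, different content.
FLOOR_A = ('<div class="post"><a name="floor1">#1</a><span>alice</span>'
           '<em>Posted:</em><p>Hello</p></div>')
FLOOR_B = ('<div class="post"><a name="floor2">#2</a><span>bob</span>'
           '<em>Posted:</em><p>Hi there</p></div>')

def text_slots(node, path=""):
    """Yield (position, text) for every element in document order.

    A position such as "/div[1]/span" means: child number 1 of the
    <div>, which is a <span>.
    """
    here = f"{path}/{node.tag}"
    yield here, (node.text or "").strip()
    for i, child in enumerate(node):
        yield from text_slots(child, f"{here}[{i}]")

def differing_items(floor_a, floor_b):
    """Return (position, text_a, text_b) where the two floor units differ."""
    items = []
    # zip stops at the shorter sequence; equal structure is assumed here.
    for (pos_a, text_a), (pos_b, text_b) in zip(text_slots(floor_a),
                                                text_slots(floor_b)):
        if pos_a != pos_b:
            raise ValueError("floor units do not share a sub-tree structure")
        if text_a != text_b:
            items.append((pos_a, text_a, text_b))
    return items

items = differing_items(ET.fromstring(FLOOR_A), ET.fromstring(FLOOR_B))
for position, text_a, text_b in items:
    print(position, repr(text_a), repr(text_b))
```

Content that is identical in both units, such as the `Posted:` caption, is treated as template text and dropped, while the varying positions become candidate information items, already ordered by their position in the sub-tree. Mapping those items to semantic labels is a separate step that this sketch does not attempt.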
Keywords/Search Tags: Information Extraction, Information Classification, Bulletin Board System, Floor Split, Anchor Information, Anchor Induction Algorithm, Semantic Label Discovery