
Organization Entity Information Extractor From Webpage Based On CRF

Posted on: 2012-06-02 | Degree: Master | Type: Thesis
Country: China | Candidate: J J Shi | Full Text: PDF
GTID: 2178330332499259 | Subject: Computer software and theory
Abstract/Summary:
Search engines are an important area of Internet technology, and their effectiveness depends largely on the quality of the index. Organization queries account for a large share of all search queries and are closely tied to everyday life: shopping, travel, and business trips all involve searching for organization information. Building a high-quality organization information index would therefore contribute substantially to search quality and user satisfaction. The Organization Entity Information Extractor described in this thesis helps to build an efficient geographical index for a search system.

A high-quality index needs both a high-quality webpage library and high-quality index entries. The Internet is a complex environment, and the information on it is not always appropriate or reliable, so authoritative organization entity pages must be found before information is extracted from them. Our system therefore contains two modules: webpage classification and a page extractor. The webpage classification module is responsible for quality control: it identifies organization entity pages among all web pages and builds the index warehouse from them. The page extractor extracts organization entity properties from each page and uses them to build the index entries. The two modules are introduced in turn below.

The organization information on an official website is the most current, accurate, and comprehensive. We call official pages that contain organization information official organization entity pages. Building the organization entity index from official organization entity pages keeps the index reliable and comprehensive. The webpage classification module has two main sub-modules: feature extraction and classification. In our setting, the feature extraction module covers not only traditional text features but also hyperlinks, anchor text, URLs, and other link features.
Web pages on the Internet are not independent but closely related to one another, and features that reflect the relationships between pages are more significant than text features. In the classification module, we propose an improved decision tree learning algorithm to build a more effective classifier. The algorithm returns a confidence value instead of a hard classification result, so the precision/recall (P/R) trade-off can be adjusted as needed by choosing a confidence threshold. Experiments show that the classification system performs well: its accuracy is 85.71%, far better than the 55.59% achieved by rule-based classification.

The structure of a web page is more complex than that of plain text. Three aspects need attention. First, web pages follow no fixed template. Second, web pages can be updated at any time. Third, much of the text on a web page consists of fragments that are not strictly grammatical, so traditional natural language processing techniques, which typically expect grammatical sentences, are no longer directly applicable. We therefore propose a joint model for webpage understanding, which segments and labels the page structure and the text content of a web page in a single discriminative probabilistic model. Integrating page structure understanding with text content understanding leads to an integrated solution to webpage understanding. Experimental results on research organization entity information extraction show the feasibility and promise of our approach.

Organization entity information includes address, phone, mail, fax, and so on, and it is too complicated to be extracted from the page directly with a single CRF model. This thesis therefore proposes a cascaded conditional random field (CCRF) model. The CCRF contains two sub-models: an outer model and an inner model. The outer model uses a CRF based on the visual DOM tree to identify organization entity blocks.
HTML source code is semi-structured and cannot be annotated directly by the model. The tag tree is the most natural markup structure of a page, but it tends to reveal the structure of the markup rather than the visual structure. The outer model therefore uses a DOM tree augmented with visual information, and the CRF model labels the leaf nodes of this visual DOM tree. The attribute values of an organization may be spread across several separately identified blocks, and the text of a single identified block may contain several attributes, so recognizing organization entities requires further annotation. The inner model receives the blocks and block features from the outer model. After preprocessing, it splits each entity block into a sequence of words, extracts word features, and labels the word sequence with a CRF. This completes the extraction of organization entity information from the web page.

In future research, the calculation of anchor text features for webpage classification could be improved further, and the page extractor could be extended to extract the description, category, and rating agencies of an organization for a better understanding of organizations.
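
The confidence-threshold idea from the classification module can be illustrated with a minimal sketch. It assumes the classifier has already assigned each page a confidence that it is an organization entity page; how the thesis computes that confidence is not stated in this abstract, so the scores, the `pages` data, and the `precision_recall` helper below are all hypothetical and only show how moving the threshold trades precision against recall.

```python
from typing import List, Tuple


def precision_recall(scored: List[Tuple[float, bool]], threshold: float) -> Tuple[float, float]:
    """Precision and recall when pages with confidence >= threshold are accepted.

    scored: (confidence, is_organization_page) pairs. The confidence values are
    assumed to come from some classifier; they are illustrative only.
    """
    tp = sum(1 for conf, is_org in scored if conf >= threshold and is_org)
    fp = sum(1 for conf, is_org in scored if conf >= threshold and not is_org)
    fn = sum(1 for conf, is_org in scored if conf < threshold and is_org)
    precision = tp / (tp + fp) if tp + fp else 1.0  # nothing accepted: no false positives
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical confidences and ground-truth labels, not data from the thesis.
pages = [(0.95, True), (0.80, True), (0.70, False),
         (0.60, True), (0.30, False), (0.20, True)]

for threshold in (0.10, 0.50, 0.75):
    p, r = precision_recall(pages, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, raising the threshold admits fewer pages, so precision rises while recall falls. This is the kind of adjustment a confidence-valued classifier allows and a hard yes/no classifier does not.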
Keywords/Search Tags: Index of Search Engine, Webpage Classifier, Feature Extractor, Webpage Understanding, CRF