Research And Application Of Self-Defined Topic Information Extraction

Posted on: 2009-12-07
Degree: Master
Type: Thesis
Country: China
Candidate: H Chen
GTID: 2178360272470271
Subject: Computer software and theory
Abstract/Summary:
With the rapid development of the information industry, many people benefit from information on the internet, but the information highway also carries content people do not need, such as pictures and advertisements. A number of commercial sites have recognized this drawback and use "push" technology, publishing their content as RSS feeds that users can customize, so that the required information is delivered accurately and in a timely manner. However, some non-commercial sites, such as investigation sites, do not yet support a unified information platform for customizing information on different topics. Therefore, to obtain information that is new, fast and accurate, we must extract it from other information sources ourselves.

Against the background of self-defined topic information, this thesis studies the structural features of web catalog pages and proposes a local diffusion algorithm that finds links relevant to the topic in order to determine the location of the block. In the information extraction step, the HTML document is divided according to its layout tags and a coarse-grained DOM tree model of the page is constructed. Semantic analysis of the semantic blocks then yields the features of the page as a quantified vector. Because this approach exploits the semantics of the page text itself, it avoids the large training set that template-based methods require, as well as the phases of generating templates from training data and comparing pages against them. In practice, combining the semantic analysis of individual pieces with the semantic blocks avoids collecting non-topic information.

Finally, based on the collected topic information and criminal investigation pages, an information extraction model based on block position is established. The model addresses the automatic, extensive and accurate collection of topic information from different sites and, for different self-defined topics, classifies the extracted information according to the defined categories, realizing automatic extraction of topic information. Experiments show that the model achieves very high accuracy and recall on websites with a simple structure, and also achieves very good results on some portals that carry topic information. A system built on this model is now used in a subsystem of Dalian City Criminal Investigation Brigade Combat On-line (DCIDCO), where it improves the online combat system and provides it with basic information.
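
The block-splitting and filtering steps described in the abstract might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the layout tag set, the `BlockSplitter` class, the `topic_score` and `extract_topic_blocks` functions, the threshold value, and the simple term-ratio score are all hypothetical stand-ins for the coarse-grained DOM tree model and the semantic feature vector, whose details the abstract does not give.

```python
from html.parser import HTMLParser

# Assumption: these tags mark layout boundaries. The thesis does not list its tag set.
LAYOUT_TAGS = {"div", "table", "tr", "td", "ul", "ol", "p"}


class BlockSplitter(HTMLParser):
    """Group the text inside each innermost layout-tag region into one coarse block."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished text blocks, in the order they close
        self._stack = [[]]  # one text buffer per open layout tag, plus the root

    def handle_starttag(self, tag, attrs):
        if tag in LAYOUT_TAGS:
            self._stack.append([])

    def handle_endtag(self, tag):
        # Close the innermost open region; the root buffer is kept until close().
        if tag in LAYOUT_TAGS and len(self._stack) > 1:
            self._flush(self._stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if text:
            self._stack[-1].append(text)

    def _flush(self, buf):
        if buf:
            self.blocks.append(" ".join(buf))

    def close(self):
        super().close()
        # Flush any regions left open by malformed markup, then the root.
        while self._stack:
            self._flush(self._stack.pop())


def topic_score(block, topic_terms):
    """Fraction of words in the block that are topic terms (hypothetical score)."""
    words = [w.strip(".,;:!?") for w in block.lower().split()]
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def extract_topic_blocks(html, topic_terms, threshold=0.1):
    """Return the blocks whose topic score reaches the (assumed) threshold."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    terms = {t.lower() for t in topic_terms}
    return [b for b in splitter.blocks if topic_score(b, terms) >= threshold]
```

Calling `extract_topic_blocks(page_html, {"burglary", "suspect"})` would keep only the blocks whose text mentions the self-defined topic often enough, which is one simple way to discard advertisement or navigation blocks without training a template.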
Keywords/Search Tags: Self-defined Topic, Information Extraction, Semantic Block, Block Position