With the rapid development of the information industry, many people benefit from information on the Internet, but the information highway also carries content they do not need, such as pictures and advertisements. A number of commercial sites have recognized this drawback and adopted "push" technology, publishing their content as RSS feeds so that users can subscribe directly to the information they need and receive accurate information in a timely manner. However, some non-commercial sites, such as investigation sites, do not yet offer a unified platform for customizing information from different sources by topic. To obtain information that is new, fast and accurate, we must therefore collect it from other sources ourselves.
Against the background of user-defined topic information, this article studies the structural features of web page directories and proposes a local proliferation algorithm that finds topic-relevant links in order to determine the location of the topic block. In the information extraction step, the HTML document is divided according to its layout tags and a coarse-grained DOM tree model of the page is constructed. Semantic analysis of the page blocks yields the features of the page, which are quantified as a vector. Because the method exploits the semantics of the page text itself, it avoids the need for a large training set of templates, as well as the phases of training to generate templates and comparing against them. In practice, combining block-level semantic analysis with the merging of semantic blocks avoids collecting non-topic information. Finally, based on the topic information and criminal investigation pages collected in the study, an information extraction model based on block location is established.
The model addresses the automatic, broad and accurate collection of topic information from different sites. For each user-defined topic, it classifies the extracted information according to the defined categories, realizing the automatic extraction of topic information. Experiments show that the model achieves very high accuracy and recall on simply structured websites, and also achieves very good results on some portals that carry topic information. A system built on this model is now used as a subsystem of the Dalian City Criminal Investigation Brigade Combat On-line (DCIDCO) system, where it improves the on-line combat system and provides its information base.
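
The block-based extraction idea can be illustrated with a minimal sketch. It is not the implementation described in this article. It assumes a particular set of layout tags, flattens the coarse-grained DOM tree into a list of text blocks, and substitutes a simple term-frequency vector with cosine similarity for the semantic analysis. The names `LAYOUT_TAGS`, `split_blocks`, `topic_blocks` and the `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Assumption: tags treated as layout boundaries (the article does not list them).
LAYOUT_TAGS = {"div", "table", "tr", "td", "ul", "ol", "p"}


class BlockSplitter(HTMLParser):
    """Collects the text between layout tags as coarse-grained blocks."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # finished text blocks
        self._buffer = []  # text fragments of the block being read

    def flush(self):
        text = " ".join(self._buffer).strip()
        if text:
            self.blocks.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in LAYOUT_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in LAYOUT_TAGS:
            self.flush()

    def handle_data(self, data):
        if data.strip():
            self._buffer.append(data.strip())


def split_blocks(html):
    """Divide an HTML document into text blocks at layout-tag boundaries."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()  # keep any text after the last layout tag
    return parser.blocks


def term_vector(text):
    # Assumption: whitespace-delimited text; other languages need a real tokenizer.
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def topic_blocks(html, topic_terms, threshold=0.2):
    """Keep only the blocks whose term vector is close to the topic vector."""
    topic = term_vector(" ".join(topic_terms))
    return [
        block
        for block in split_blocks(html)
        if cosine(term_vector(block), topic) >= threshold
    ]
```

Under these assumptions, calling `topic_blocks(page_html, ["burglary", "suspect"])` would drop blocks such as advertisements that share no terms with the topic, which mirrors the goal of avoiding non-topic information without a trained template set.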