
Research And Application Of Block-Based Topic Information Extraction

Posted on: 2010-12-03
Degree: Master
Type: Thesis
Country: China
Candidate: C Zhang
GTID: 2178360302960811
Subject: Computer application technology
Abstract/Summary:
Public security information networks and criminal investigation websites currently hold a large amount of online information about criminal cases. Browsing, collecting, and storing this information manually makes it difficult to organize comprehensively and quickly, which hinders the timely detection of cases. Based on the features of public security and criminal investigation web pages, this thesis addresses the problem by extracting the topic links in a topic block and the topic information in web pages, and applies the algorithms to the information extraction system of the Dalian public security organ.

First, a study of directory-type web pages shows that most links sharing the same topic sit within a single pair of HTML tags and share a common content layout. Based on these features and on web spider technology, the thesis presents a semantic-based algorithm for extracting topic links. The algorithm determines the topic block and extracts all links in it. Experimental results show that the method is effective: it extracts all links in a topic block and avoids collecting non-topic links.

Second, a web page is usually made up of several blocks delimited by HTML container tags. A study of the layout relationship between the page title and the page content shows that the two usually sit in two separate HTML tags. Based on this finding, the thesis presents an information extraction method based on visual block segmentation. The method uses the page title and anchor text to determine the information block and constructs an HTML layout tag tree; it then applies decision rules and regular expressions to remove non-topic links, non-topic information, and redundant HTML tags. Experimental results show that the method is effective. Because only the content block is processed during extraction, the workload is smaller.

The topic link extraction and information extraction methods are applied to the information extraction system of the Dalian public security organ. By extracting information from criminal investigation and public security web pages, the system can significantly improve the accuracy and efficiency of obtaining information. It enhances the online combat system and provides it with a foundation of information.
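
The two steps summarized in the abstract can be illustrated with a minimal Python sketch. This is not the thesis's algorithm, which the abstract does not specify in detail. Everything here is an assumption made for illustration: the set of container tags, scoring a block by keyword matches in anchor text as a stand-in for the "semantic" test, removing every link during cleanup, and the names `LinkBlockParser`, `extract_topic_links`, and `clean_block`.

```python
import re
from html.parser import HTMLParser

# Assumed container tags; the abstract does not say which tags delimit a block.
CONTAINERS = {"div", "table", "td", "ul"}


class LinkBlockParser(HTMLParser):
    """Groups links by their nearest enclosing container tag."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open containers as (tag, block_id)
        self.blocks = {}     # block_id -> list of [href, anchor_text]
        self.next_id = 0
        self.current = None  # link whose anchor text is being read

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.stack.append((tag, self.next_id))
            self.next_id += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href and self.stack:
                self.current = [href, ""]
                block_id = self.stack[-1][1]
                self.blocks.setdefault(block_id, []).append(self.current)

    def handle_data(self, data):
        if self.current is not None:
            self.current[1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self.current = None
        elif self.stack and self.stack[-1][0] == tag:
            self.stack.pop()


def extract_topic_links(html, keywords):
    """Return every link of the block whose anchor texts best match keywords."""
    parser = LinkBlockParser()
    parser.feed(html)
    parser.close()

    def score(links):
        # Number of links whose anchor text contains at least one keyword.
        return sum(any(k in text for k in keywords) for _, text in links)

    best = max(parser.blocks.values(), key=score, default=[])
    return [href for href, _ in best] if score(best) > 0 else []


def clean_block(block_html):
    """Strip links and remaining tags from a content block with regexes."""
    no_links = re.sub(r"<a\b[^>]*>.*?</a>", "", block_html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", no_links)
    return re.sub(r"\s+", " ", text).strip()


SAMPLE = """
<div id="nav"><a href="/home">Home</a><a href="/about">About</a></div>
<div id="cases">
  <ul>
    <li><a href="/case/1">Burglary case report</a></li>
    <li><a href="/case/2">Fraud case update</a></li>
    <li><a href="/case/3">Notice</a></li>
  </ul>
</div>
"""

if __name__ == "__main__":
    print(extract_topic_links(SAMPLE, ["case"]))
    print(clean_block('<div><h1>Title</h1><p>Body <a href="/x">more</a></p></div>'))
```

In this sketch, once a block is chosen as the topic block, all of its links are returned, including the one whose anchor text matches no keyword, while the navigation links in the other block are skipped. A production system would need a real HTML tree and a richer relevance test than substring matching.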
Keywords/Search Tags: Information extraction, semantic block, web page layout, content block