Application And Research Of Information Extraction And Topic Spider For Criminal Investigation Web Pages

Posted on: 2008-10-20; Degree: Master; Type: Thesis
Country: China; Candidate: D H Xie; Full Text: PDF
GTID: 2178360242967299; Subject: Computer application technology
Abstract/Summary:
Public security information websites currently hold a large amount of information, but it is not possible to visit and organize all of it by manual means alone. As a result, much important information is lost, which hinders the solving of criminal cases. Based on the features of criminal investigation web pages, this thesis addresses the problem by implementing a focused spider system that collects criminal investigation web pages using information extraction and focused spider technology.

First, the thesis partitions the HTML document by its container tags and from this constructs a coarse-grained DOM tree. By analyzing the semantics of the text, feature vectors of the page and of its semantic blocks are obtained as a quantitative representation of the page. Based on this representation, the thesis presents a block-based algorithm that automatically extracts the topical information of web pages. Experimental results show that this method is effective.

The PageRank algorithm is used to rank web pages. It estimates a page's authority by taking its links into account. However, it assigns every outlink the same weight and is independent of topic, which results in topic drift. The thesis proposes an improved PageRank algorithm based on topic correlation, but experiments show that this algorithm does not perform well. To address this and meet the demands of the project, the thesis then proposes a self-adaptive link importance evaluation algorithm. Experiments show that this algorithm performs better than PageRank combined with the link relation.

The system for extracting information from criminal investigation web pages could significantly improve the accuracy and efficiency of obtaining information. In addition, the method used to design and implement the focused spider is general and can guide the development of focused spiders in other fields.
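The difference between uniform outlink weights and topic-dependent weights can be illustrated with a small sketch. This is not the thesis's improved PageRank or its self-adaptive link importance algorithm, neither of which is specified in the abstract. It is a generic power-iteration PageRank with an optional, hypothetical `relevance` score per page, and the toy graph, page names, and scores are invented for illustration.

```python
from typing import Dict, List, Optional


def pagerank(
    links: Dict[str, List[str]],
    relevance: Optional[Dict[str, float]] = None,
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Power-iteration PageRank over a small link graph.

    With relevance=None every outlink of a page carries the same weight
    (classic PageRank). Otherwise each outlink is weighted in proportion
    to the (assumed, externally supplied) topic relevance of its target.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for src in pages:
            targets = links.get(src, [])
            if not targets:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[src] / n
                continue
            if relevance is None:
                weights = [1.0] * len(targets)
            else:
                weights = [relevance.get(t, 0.0) for t in targets]
            total = sum(weights)
            if total == 0.0:
                # No relevant target: fall back to uniform weights.
                weights = [1.0] * len(targets)
                total = float(len(targets))
            for target, weight in zip(targets, weights):
                new_rank[target] += damping * rank[src] * weight / total
        rank = new_rank
    return rank


# Toy graph (hypothetical): a hub page links to one on-topic page
# and one off-topic page, and both link back to the hub.
links = {
    "hub": ["case_report", "advert"],
    "case_report": ["hub"],
    "advert": ["hub"],
}
relevance = {"hub": 0.5, "case_report": 0.9, "advert": 0.1}

uniform = pagerank(links)
topical = pagerank(links, relevance=relevance)
print("uniform:", uniform)
print("topical:", topical)
```

With uniform weights the on-topic and off-topic pages receive the same score, which is the topic-independence the abstract criticizes. With relevance weights the on-topic page ranks higher, so a focused spider would visit it first.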
Keywords/Search Tags: Criminal Investigation Web Page, Topical Information Extraction, Semantic Block, Focused Spider