Research And Implementation Of Topic-based Document Data Collection System

Posted on: 2011-08-10
Degree: Master
Type: Thesis
Country: China
Candidate: D H Zhang
GTID: 2248330395457797
Subject: Computer software and theory
Abstract/Summary:
In recent years, enterprise competitive intelligence systems have developed rapidly. An enterprise can enhance its competitiveness only if it establishes an independent competitive intelligence analysis system of its own. Whether users can collect data quickly and accurately has become the most important problem to solve, and topic-based data collection has become a research hotspot. This thesis covers the design and implementation of a topic-based document data collection system, focusing on multi-document keyword extraction and document similarity calculation.

Users supply the system with documents on a single topic, and the system extracts keywords for that topic. The Web crawler uses these keywords to perform an initial filtering of irrelevant hyperlinks. Documents are then extracted from the crawled web pages, and document similarity calculation is used to filter out irrelevant documents. Finally, the system returns a large number of structured documents on the same topic.

For multi-document keyword extraction, this thesis implements four statistical keyword extraction methods. Experiments showed that the accuracy of the extracted keywords was not very high, and analysis of the results showed that a number of high-frequency words appeared among the extracted keywords. To solve this problem, we obtain the document frequency of each keyword in a categorized Chinese corpus. When the document frequency is above a certain threshold, the word is removed from the keyword list directly; otherwise, the document frequency is used to adjust the keyword's weight. Experiments showed that the improved system performs well.

The most frequently used text representation method in document similarity calculation is the vector space model based on TF-IDF weights.
In text representation, topic keywords should be given greater weight. This thesis first presents a text representation method that uses the topic keywords as text features and computes document similarity on that representation. The experimental results show that performance decreased, with low precision and low recall. The document vector is therefore revised before similarity calculation by multiplying each term weight by the weight obtained in topic keyword extraction, and the experimental results show that performance improved. The first three document similarity calculation methods rely on exact matches between document features, but features are often semantically related, and these relations are important for calculating document similarity. To address this, the thesis proposes a document similarity calculation method based on word similarity. The experimental results show a clear improvement in performance.
Keywords/Search Tags: document data collection, keyword extraction, document similarity computation, web crawlers
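
The two weighting steps described in the abstract (filtering and adjusting keywords by document frequency, then revising the document vector with the keyword weights before computing similarity) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names, the 0.5 document-frequency threshold, the logarithmic adjustment factor, the use of raw term frequency in place of TF-IDF, and the neutral weight of 1.0 for terms that are not topic keywords are all assumptions made for illustration.

```python
import math
from collections import Counter


def reweight_keywords(candidates, corpus_df, corpus_size, df_threshold=0.5):
    """Filter and adjust candidate keywords using background-corpus document frequency.

    candidates: dict mapping word -> weight from a statistical extractor.
    corpus_df: dict mapping word -> number of corpus documents containing it.
    df_threshold and the log factor below are illustrative assumptions.
    """
    result = {}
    for word, weight in candidates.items():
        df = corpus_df.get(word, 0)
        if df / corpus_size > df_threshold:
            continue  # too common across the corpus: drop the word outright
        # Assumed IDF-style factor: rarer words in the corpus get a larger boost.
        result[word] = weight * math.log(1 + corpus_size / (1 + df))
    return result


def weighted_vector(tokens, keyword_weights):
    """Build a term vector whose entries are multiplied by topic-keyword weights.

    Raw term frequency stands in for TF-IDF here (a simplification);
    non-keyword terms keep a neutral weight of 1.0 (an assumption).
    """
    tf = Counter(tokens)
    return {term: count * keyword_weights.get(term, 1.0) for term, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical data: "the" is frequent in the background corpus, "crawler" is not.
keywords = reweight_keywords(
    {"crawler": 2.0, "the": 5.0},
    corpus_df={"crawler": 9, "the": 95},
    corpus_size=100,
)
topic_doc = weighted_vector(["crawler", "crawler", "link"], keywords)
candidate_doc = weighted_vector(["crawler", "page"], keywords)
print(sorted(keywords), cosine(topic_doc, candidate_doc))
```

A crawled document would be kept when its similarity to the user-supplied topic documents exceeds some cutoff; the abstract does not state that cutoff, so none is fixed here.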