Research And Implementation Of Topic-based Document Data Collection System

Posted on:2011-08-10

Degree:Master

Type:Thesis

Country:China

Candidate:D H Zhang

Full Text:PDF

GTID:2248330395457797

Subject:Computer software and theory

Abstract/Summary:

In recent years, enterprise competitive intelligence systems have developed rapidly. An enterprise can enhance its competitiveness only if it has established an independent competitive intelligence analysis system of its own. Whether users can collect data quickly and accurately has become the most important problem to solve, and topic-based data collection has become a research hotspot. This paper takes the design and implementation of a topic-based document data collection system as its subject, focusing on multi-document keyword extraction and document similarity calculation.

Users supply the system with documents on the same topic, and the system extracts keywords for that topic. The Web crawler uses these keywords for an initial filtering of irrelevant hyperlinks. After documents are extracted from the crawled web pages, document similarity calculation is used to filter out irrelevant documents. Finally, the system returns a large number of structured documents on the same topic.

For multi-document keyword extraction, this paper implements four statistical keyword extraction methods. Experiments showed that the accuracy of the extracted keywords was not very high, and analysis of the results showed that a number of high-frequency words appeared among the extracted keywords. To solve this problem, we obtain the document frequency of each keyword in a Chinese category corpus. When the document frequency is above a certain threshold, we remove the word from the keyword list directly; otherwise, we use the document frequency to modify the keyword's weight. Experiments showed that the improved system performs well.

The most frequently used method of text representation in document similarity calculation is the vector space model based on TF-IDF weights.
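The document-frequency correction for keywords described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the threshold or the weighting formula, so the `max_df_ratio` cut-off, the IDF-style damping factor, and the function name `adjust_keyword_weights` are all assumptions made for illustration only.

```python
import math


def adjust_keyword_weights(weights, doc_freq, n_docs, max_df_ratio=0.5):
    """Filter and reweight candidate keywords using background document frequency.

    weights      -- candidate keyword -> weight from a statistical extractor
    doc_freq     -- word -> number of background-corpus documents containing it
    n_docs       -- number of documents in the background corpus
    max_df_ratio -- hypothetical cut-off; the abstract only says "a certain threshold"
    """
    adjusted = {}
    for word, weight in weights.items():
        df = doc_freq.get(word, 0)
        if df / n_docs > max_df_ratio:
            # The word appears in too many background documents: drop it outright.
            continue
        # Assumed IDF-style factor: rarer background words keep more weight.
        adjusted[word] = weight * math.log((n_docs + 1) / (df + 1))
    return adjusted


candidates = {"system": 5.0, "crawler": 3.0, "similarity": 2.0}
background_df = {"system": 800, "crawler": 10, "similarity": 50}
result = adjust_keyword_weights(candidates, background_df, n_docs=1000)
```

In this toy input, "system" occurs in 80% of the background documents and is removed, while "crawler" and "similarity" are kept with rescaled weights.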
In text representation, topic keywords should be given greater weights. In this paper, we present a text representation method that uses topic keywords as text features, and we compute document similarity on that representation. The experimental results show that performance decreased because of low precision and low recall. We therefore multiply each term weight by the weight obtained in topic keyword extraction to revise the document vector before similarity calculation, and the experimental results show that performance improved. The first three document similarity calculation methods rely on exact matches between document features, but features are often semantically related, and these relations are important for calculating document similarity. To address this, a document similarity calculation method based on word similarity is proposed in this paper. The experimental results show a clear improvement in performance.
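The revised document vector described above (TF-IDF term weights multiplied by topic keyword weights, followed by a similarity calculation) can be sketched as follows. This is not the thesis's implementation: the smoothed IDF formula, the use of cosine similarity, the choice to leave non-keyword terms at a factor of 1.0, and the names `tfidf_vector` and `cosine` are assumptions. Documents are assumed to be already segmented into tokens.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, n_docs, keyword_weights=None):
    """Build a sparse TF-IDF vector, optionally scaled by topic keyword weights."""
    vector = {}
    for term, tf in Counter(tokens).items():
        # Smoothed IDF (an assumed variant; the abstract does not specify one).
        idf = math.log((n_docs + 1) / (doc_freq.get(term, 0) + 1)) + 1.0
        weight = tf * idf
        if keyword_weights is not None:
            # Assumption: terms that are not topic keywords keep a factor of 1.0.
            weight *= keyword_weights.get(term, 1.0)
        vector[term] = weight
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = ["crawler", "topic", "page"]
doc_b = ["crawler", "topic", "news"]
topic_keywords = {"crawler": 3.0, "topic": 3.0}  # hypothetical keyword weights

plain = cosine(tfidf_vector(doc_a, {}, 10), tfidf_vector(doc_b, {}, 10))
weighted = cosine(
    tfidf_vector(doc_a, {}, 10, topic_keywords),
    tfidf_vector(doc_b, {}, 10, topic_keywords),
)
```

In this toy case the two documents share both topic keywords, so boosting those terms raises their similarity relative to the plain TF-IDF vectors. Like the first three methods in the abstract, this sketch matches terms exactly and ignores semantic relations between different words.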

Keywords/Search Tags:

document data collection, keywords extraction, similarity computation of document, web crawlers

Related items

1	Research On Semantic Similarity Computation And Applications
2	Research On XML Keywords Retrieval By Integrating Semantics Of Document And User Inquires
3	Web Document Automatic Classification Based On Keywords
4	User Web Information Collection And Analysis System Based On The Smart Router
5	The Research And Design Of Automobile After-Sale Service System Based On B2B2C
6	Study And Application On Chinese Sentence Similarity Computation
7	An Efficient Keywords Extraction Algorithm For Text Comprehension
8	Research On Text Abstract Extraction Technology Based On Keywords And Topic Sentences
9	Word Network-based Keywords, Automatic Extraction Methods, And In The Chinese Web Page Classification In The Study
10	Research On Text Similarity Calculation Method And Its Application In Financial Field