Automatic Indexing Technology Research And Improvement For Document Information

Posted on: 2014-06-15
Degree: Master
Type: Thesis
Country: China
Candidate: A Q Xu
Full Text: PDF
GTID: 2268330425982280
Subject: Applied Mathematics
Abstract/Summary:
Automatic indexing is the process by which a computer automatically assigns one or more keywords that express the content of a text. Manual indexing is costly, inefficient, and inconsistent, and it cannot keep pace with the rapid growth of information resources, so research on automatic indexing has become both necessary and significant. Depending on the source of the index terms, automatic indexing is divided mainly into automatic extraction indexing and automatic assignment indexing. Current research in China and abroad focuses mainly on extraction indexing, in which the computer extracts keywords from the text to serve as index terms that represent its core content. After reviewing, analyzing, and summarizing previous extraction indexing methods, this thesis takes automatic keyword extraction as its subject and completes the following work:

(1) The thesis briefly describes the significance of research on automatic indexing technology: it is the foundation of retrieval systems and a prerequisite for automatic summarization, automatic classification, automatic clustering, machine translation, and other natural language processing techniques. It introduces the related concepts, including indexing terms, keywords, key phrases, subject headings, terms, and controlled vocabularies, and takes keywords, key phrases, or subject headings as the objects of automatic indexing. It then outlines the steps of computer-based automatic indexing, with the requirements and corresponding methods of each step, and closes with a brief description of the principles of automatic Chinese word segmentation.

(2) The thesis studies candidate keyword extraction, one stage of an English automatic indexing system, and introduces the concept of the core word set.
Based on a study of the relationship between the core word set and the keyword set, and in combination with the n-gram method, this thesis proposes the following algorithm: first, locate potential candidate keywords by their core words; then, generate the candidate keywords from the expansion tree of each core word. Compared with the traditional n-gram method, this reduces the candidate keyword set to 2/7 of its original size without increasing the computational complexity.

(3) The thesis examines the inadequacies of the TF-IDF statistical weighting method for Chinese automatic indexing. Other information about a term (part of speech, position, and mutual information) also affects the weight of a candidate keyword, and that weight decides whether the candidate becomes a final indexing term. These features are therefore added to the TF-IDF algorithm, yielding an improved multi-feature fusion algorithm and formula. Finally, numerical experiments compare and analyze the precision, recall, F-measure, and other indicators of automatic extraction indexing. The results show that the improved multi-feature fusion algorithm outperforms the standard TF-IDF statistical weighting method, improving both recall and precision.
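
As a rough illustration of the weighting and evaluation steps described in (3), the following Python sketch computes a baseline TF-IDF weight, scales it by a per-term factor, and scores the selected keywords by precision, recall, and F-measure. It is a minimal sketch under stated assumptions, not the thesis's formula: the smoothed IDF variant, the function names (`tf_idf`, `fuse`, `top_k`, `precision_recall_f1`), and the multiplicative `boosts` factor standing in for part-of-speech, position, and mutual-information evidence are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Return a smoothed TF-IDF weight for each distinct term in doc_tokens.

    corpus is a list of tokenized documents (lists of strings).
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    doc_sets = [set(doc) for doc in corpus]
    weights = {}
    for term, count in counts.items():
        df = sum(1 for s in doc_sets if term in s)
        # Smoothed IDF: one common variant, chosen here as an assumption.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        weights[term] = (count / len(doc_tokens)) * idf
    return weights


def fuse(weights, boosts):
    """Hypothetical fusion: multiply each TF-IDF weight by a per-term factor.

    boosts maps a term to a factor (default 1.0) that stands in for extra
    evidence such as part of speech or position. The actual fusion formula
    of the thesis is not given in the abstract.
    """
    return {term: w * boosts.get(term, 1.0) for term, w in weights.items()}


def top_k(weights, k):
    """Return the k highest-weighted terms, breaking ties alphabetically."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def precision_recall_f1(predicted, gold):
    """Compare predicted index terms against a gold-standard keyword set."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    corpus = [
        ["automatic", "indexing", "keyword"],
        ["keyword", "extraction"],
        ["machine", "translation"],
    ]
    baseline = tf_idf(corpus[0], corpus)
    # Assume "indexing" appears in the title, so it receives a larger factor.
    fused = fuse(baseline, {"indexing": 2.0})
    selected = top_k(fused, 2)
    print(selected)
    print(precision_recall_f1(selected, ["automatic", "indexing"]))
```

A multiplicative factor is only one way to combine features; a weighted sum of normalized feature scores would be an equally plausible reading of "multi-feature fusion".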
Keywords/Search Tags: automatic indexing, candidate words, core word set, forward expansion tree, multi-feature fusion