Automatic Indexing Technology Research And Improvement For Document Information

Posted on:2014-06-15

Degree:Master

Type:Thesis

Country:China

Candidate:A Q Xu

Full Text:PDF

GTID:2268330425982280

Subject:Applied Mathematics

Abstract/Summary:

Automatic indexing is the process by which a computer automatically assigns one or more keywords that express the content of a text. Traditional manual indexing is costly, inefficient, and inconsistent, and it cannot keep pace with the rapid growth of information resources, so research on automatic indexing has become an inevitable trend of great significance. According to the source of the index terms, automatic indexing is mainly divided into automatic extraction indexing and automatic assignment indexing. Current research in China and abroad focuses mainly on automatic extraction indexing, in which the computer extracts keywords that represent the core content of a text from the text itself and uses them as its index terms. Building on a review, analysis, and summary of previous extraction indexing methods, this thesis studies automatic keyword extraction and completes the following work:

(1) It briefly describes the significance of research on automatic indexing technology: automatic indexing is the basis of retrieval systems and a prerequisite for automatic summarization, automatic classification, automatic clustering, machine translation, and other natural language processing techniques. It describes concepts related to automatic indexing, such as index terms, keywords, key phrases, subject headings, terms, and controlled vocabularies, and takes keywords, key phrases, or subject headings as the objects of automatic indexing. It briefly introduces the steps of computer-based automatic indexing, together with the requirements and corresponding methods of each step. Finally, it briefly describes the principles of automatic Chinese word segmentation.

(2) It studies candidate keyword extraction, one stage of an English automatic indexing system, and introduces the concept of the core word set. Based on a study of the relationship between the core word set and the keyword set, combined with the n-gram method, the thesis proposes the following algorithm: first, locate potential candidate keywords by means of the core words; then generate candidate keywords from the expansion tree of each core word. Compared with the traditional n-gram method, this approach reduces the candidate keyword set to 2/7 of its original size without increasing the computational complexity.

(3) It examines the inadequacies of the TF-IDF statistical weighting method for Chinese automatic indexing and takes into account other statistical information about a term (part of speech, position information, and mutual information) that affects the weights of the candidate keywords; these weights determine which candidate keywords become the final index terms. These statistics are added to the TF-IDF algorithm, yielding an improved multi-feature fusion algorithm and formula. Finally, numerical experiments compare and analyze the precision, recall, F-measure, and other indicators of automatic keyword extraction indexing. The results show that the improved multi-feature fusion algorithm outperforms the conventional TF-IDF statistical weighting method, improving both recall and precision.
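
The weighting idea in item (3) can be illustrated with a minimal Python sketch. It is not the thesis's formula, which the abstract does not give: the smoothed IDF variant, the multiplicative fusion, the boost values, and every function and parameter name (`tf_idf`, `fused_scores`, `title_boost`, `noun_boost`) are assumptions made for illustration, and mutual information is left out for brevity. The evaluation function uses the standard definitions of precision, recall, and F1.

```python
import math
from collections import Counter


def tf_idf(doc_tokens, corpus):
    """Baseline TF-IDF score for every distinct term in doc_tokens.

    corpus is a list of token lists. The smoothed IDF used here is one
    common variant, chosen as an assumption for this sketch.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        tf = count / len(doc_tokens)
        df = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        scores[term] = tf * idf
    return scores


def fused_scores(doc_tokens, corpus, title_terms, pos_tags,
                 title_boost=2.0, noun_boost=1.5):
    """Hypothetical multi-feature fusion: scale TF-IDF by simple factors.

    title_terms stands in for position information and pos_tags (term -> tag)
    for part of speech. The boost values are illustrative only.
    """
    fused = {}
    for term, score in tf_idf(doc_tokens, corpus).items():
        factor = 1.0
        if term in title_terms:
            factor *= title_boost
        if pos_tags.get(term) == "n":
            factor *= noun_boost
        fused[term] = score * factor
    return fused


def top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


def precision_recall_f1(predicted, reference):
    """Standard precision, recall, and F1 over sets of index terms."""
    predicted, reference = set(predicted), set(reference)
    hits = len(predicted & reference)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    corpus = [
        ["index", "term", "weight", "index", "text"],
        ["text", "retrieval", "system"],
        ["text", "system", "weight"],
    ]
    doc = corpus[0]
    scores = fused_scores(doc, corpus, title_terms={"index"},
                          pos_tags={"index": "n", "term": "n"})
    predicted = top_k(scores, 2)
    print(predicted, precision_recall_f1(predicted, ["index", "weight"]))
```

Multiplying the baseline score by feature factors keeps the ranking identical to plain TF-IDF whenever no feature fires, which makes the effect of each added feature easy to isolate in a comparison like the one the abstract reports.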

Keywords/Search Tags:

automatic indexing, candidate words, core word set, forward expansion tree, multi-feature merging

Related items

1	Studies On Text Content Indexing: Based On Key Phrase
2	Research On The Rule Excavation Method Based On Decision Tree In Automatic Identification Of Relation Words In Chinese Compound Sentences
3	Design Of Chinese Automatic Indexing Algorithm And Its Application In Network Public Opinion Monitoring
4	Automatic Recognition Of Relation Words In Chinese Complex Sentence Based On Decision Tree
5	Research On Feature Words Extraction And Emotional Tendency Analysis Of Video Commentary
6	Research On Automatic Indexing System Of Economic News
7	An Automatic Recognition Method Of Chinese Relation Words In Compound Sentences Based On Dependency Tree Similarity
8	Automatic Recognition And Rule Mining Of Chinese Relation Words In Compound Sentences Based On Dependencies
9	Research On Extraction Patterns Of Product Description Words And Sentiment Words
10	Design And Implementation Of Parallel Batch B+-tree Indexing On Multi-core Processors