
Studies On Text Content Indexing: Based On Key Phrase

Posted on: 2006-08-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Liu
Full Text: PDF
GTID: 1118360152488964
Subject: Linguistics and Applied Linguistics
Abstract/Summary:
Without structured content labels, much of the available information is retrieved inefficiently. How to organize this large and disordered body of information, and so make it more efficient to use, has therefore become a crucial task in informatics. Effective information organization and representation is the basis of information retrieval (IR). Labeling text content, especially inner features such as topic and key words, is the crux of information organization and representation. This paper clusters domain words in a classified large-scale corpus by means of the feature extraction used in text categorization. Based on this repository, we completed an automatic text content labeling system that integrates text categorization with key words indexing. We index text content with concise and comprehensive categories and key words, so that users can quickly grasp the essence of a text and browse and retrieve more efficiently.

This paper focuses on the implementation of integrated text categorization and key words indexing, as follows:

1. Proposed and verified that the Key Phrase is well suited as a feature for text representation. With its stable structure, complete meaning and statistical significance, the Key Phrase can overcome the limitations of VSM (Vector Space Model) and NB (Naive Bayes), and it helps improve the results of text categorization and key words indexing. The experiment supported this view (MicroF1 increased by 3.1 percent for parent categories and by 15 percent for sub-categories).

2. Built a layered, classified large-scale corpus. From downloaded web pages, we extracted content information such as title, key words, category, time and text by means of information extraction (IE). After analyzing the columns of 18 websites, we established a classification system for web pages that contains 4 levels and 229 categories. On this basis, we built a layered, classified large-scale corpus. This corpus is systematically classified and contains abundant information and structured content (labeled in XML). It is the knowledge source for Key Phrase extraction and word clustering, as well as the training and test corpus for text categorization and key words indexing.

3. Built a large word list of Key Phrases extracted from the corpus. We built a large word list (320,000 entries) of Key Phrases (220,000 entries) extracted from this corpus. Compared with a commonly used word list (80,000 entries), the proportion of new words among the Key Phrases is about 78% (in science and technology, for example).

4. Built a large-scale domain repository by feature extraction in text categorization. Domain words comprise both words of a general domain (such as "球 比赛", "ball" and "match", in sports) and words of a special domain (such as "跑垒", "base running", in softball). Taking this into account, we adjusted the degree of the root taken of the word frequency and clustered domain words in the classified large-scale corpus by means of the feature extraction used in text categorization. Based on these domain words and the differences in how they are distributed across categories, we built a large-scale domain repository for text categorization and key words indexing.

5. Completed a layered, multi-label text categorization system. When counting word frequency, we dynamically weighted the frequency of a word at different positions in the text according to the length of the text. We also improved the weighting of Pr(f1|ck) in NB, and the experiment supported this method (MicroF1 increased by 18.9 percent for parent categories and by 7.6 percent for sub-categories). When estimating the category, we improved the accuracy of the parent class by correcting it with the subclass result, considering that Key Phrases made up of special-domain words may improve the accuracy of the subclass. This paper also resolves the problem of adaptability in text categorization.

6. Implemented a dynamic key words indexing system. In this system, the number of key words to be labeled is adjusted dynamically according to the length of the text. Using the differences in how words are distributed across categories, we down-weight general words and give prominence to domain words, which prevents the key words indexing system from excessively...
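
The position-weighted Naive Bayes categorization described in item 5 can be illustrated with a minimal sketch. The abstract does not give the actual weighting scheme, so everything specific here is an assumption made for illustration: the function names, the rule that title tokens are weighted by `1 + log(1 + body length)`, the add-one smoothing, and the toy training data.

```python
import math
from collections import Counter, defaultdict


def weighted_counts(title_tokens, body_tokens):
    """Term frequencies in which title tokens count more than body tokens.

    Assumption: the title weight grows with the body length, so that a
    title term is not drowned out in a long text.
    """
    title_weight = 1.0 + math.log1p(len(body_tokens))
    counts = Counter()
    for tok in title_tokens:
        counts[tok] += title_weight
    for tok in body_tokens:
        counts[tok] += 1.0
    return counts


def train(docs):
    """docs: iterable of (title_tokens, body_tokens, label)."""
    class_docs = Counter()                 # number of documents per class
    feature_totals = defaultdict(Counter)  # weighted term counts per class
    vocab = set()
    for title, body, label in docs:
        class_docs[label] += 1
        counts = weighted_counts(title, body)
        feature_totals[label].update(counts)
        vocab.update(counts)
    return class_docs, feature_totals, vocab


def classify(model, title, body):
    """Return the class with the highest log posterior (add-one smoothing)."""
    class_docs, feature_totals, vocab = model
    n_docs = sum(class_docs.values())
    counts = weighted_counts(title, body)
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total = sum(feature_totals[label].values())
        score = math.log(class_docs[label] / n_docs)
        for tok, weight in counts.items():
            if tok not in vocab:
                continue  # ignore terms never seen in training
            p = (feature_totals[label][tok] + 1.0) / (total + len(vocab))
            score += weight * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

In a layered setting, the same classifier could be trained once for parent categories and once per parent for its sub-categories; how the dissertation combines the two levels is not specified in the abstract.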
Keywords/Search Tags: classified corpus, feature extraction, word clustering, text categorization, key words indexing
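
The dynamic key words indexing described in item 6 of the abstract can be sketched in a similar spirit. The abstract states only that the number of key words depends on text length and that general words are down-weighted relative to domain words. The concrete rules below are hypothetical: the logarithmic budget in `keyword_budget`, the `specificity` measure (the share of a word's occurrences that fall in its dominant category), and the `category_freq` table layout.

```python
import math
from collections import Counter


def keyword_budget(n_tokens, minimum=3, maximum=10):
    """Number of key words to label; grows with text length (assumed rule)."""
    return max(minimum, min(maximum, round(math.log2(n_tokens + 1))))


def specificity(word, category_freq):
    """Share of a word's corpus occurrences that fall in its dominant category.

    A word spread evenly over many categories (a general word) scores low;
    a word seen in a single category (a domain word) scores 1.0.
    Words absent from the table score 0.0.
    """
    freqs = category_freq.get(word)
    if not freqs:
        return 0.0
    return max(freqs.values()) / sum(freqs.values())


def index_keywords(tokens, category_freq):
    """Rank words by term frequency times specificity and keep the top k."""
    tf = Counter(tokens)
    scores = {w: c * specificity(w, category_freq) for w, c in tf.items()}
    k = keyword_budget(len(tokens))
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return [w for w in ranked[:k] if scores[w] > 0]
```

Here `category_freq` maps each word to its frequency per category, as might be read from a domain repository such as the one described in item 4.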