Studies On Text Content Indexing: Based On Key Phrase

Posted on:2006-08-24

Degree:Doctor

Type:Dissertation

Country:China

Candidate:H Liu

GTID:1118360152488964

Subject:Linguistics and Applied Linguistics

Abstract/Summary:

Much information lacks structured content labels, which makes information retrieval (IR) inefficient. How to organize this vast and disordered information and make it more efficient to use has therefore become a crucial task in informatics. Effective information organization and representation is the basis of IR, and labeling text content, especially intrinsic features such as topic and keywords, is the crux of information organization and representation. This dissertation clusters domain words in a large-scale classified corpus using the feature extraction of text categorization. Based on this repository, we completed an automatic text content labeling system that integrates text categorization with keyword indexing. We index text content with concise and comprehensive categories and keywords, so that users can quickly grasp the essence of a text, which raises browsing and retrieval efficiency.

This dissertation focuses on implementing integrated text categorization and keyword indexing, as follows:

1. Proposed and demonstrated that key phrases are well suited as features for text representation. With stable structure, complete meaning and statistical significance, key phrases can overcome the limitations of the VSM (Vector Space Model) and NB (Naive Bayes), serve well as text representation features, and help improve the results of text categorization and keyword indexing. The experiments supported this view (MicroF1 increased by 3.1 percent for parent categories and by 15 percent for sub-categories).
2. Built a hierarchically classified large-scale corpus. From downloaded web pages, we extracted content information such as title, keywords, category, time and text by means of information extraction (IE). After analyzing the columns of 18 websites, we established a classification system for web pages that contains 4 levels and 229 categories. On this basis we built a hierarchically classified large-scale corpus. The corpus is systematically classified and contains abundant information and structured content (labeled in XML). It is the knowledge source for key phrase extraction and word clustering, as well as the training and test corpus for text categorization and keyword indexing.
3. Built a large word list of key phrases extracted from the corpus. We built a large word list (320,000 entries) of key phrases (220,000 entries) extracted from this corpus. Compared with a commonly used word list (80,000 entries), the proportion of new words among the key phrases is about 78% (taking science and technology as an example).
4. Built a large-scale domain repository through the feature extraction of text categorization. Considering that domain words comprise general-domain words (such as "球比赛", ball game, in sports) and special-domain words (such as "跑垒", base running, in softball), we adjusted the degree of the root applied to word frequency and clustered domain words in the large-scale classified corpus using the feature extraction of text categorization. Based on these domain words and their differences across categories, we built a large-scale domain repository for text categorization and keyword indexing.
5. Implemented a hierarchical, multi-label text categorization system. When counting word frequency, we dynamically weight a word's frequency at different positions in the text according to the text's length. We also improved the weighting method for P_r(f₁/c_k) in NB, and the experiments verified this method (MicroF1 increased by 18.9 percent for parent categories and by 7.6 percent for sub-categories). When estimating the category, we improve the accuracy of the parent class by correcting it with the subclass, considering that key phrases from special-domain words may improve the accuracy of the subclass. The problem of adaptability in text categorization is also resolved in this dissertation.
6. Implemented a dynamic keyword indexing system. In this system, the number of keywords to be labeled is adjusted dynamically according to the text's length. Using the differences of words across categories, we down-weight general words and give prominence to domain words, which prevents the keyword indexing system from excessively...
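
The MicroF1 figures quoted in the abstract are micro-averaged F1 scores. The abstract does not say how they were computed. As a minimal sketch under the standard definition (true positives, false positives and false negatives are pooled over all documents and categories before precision and recall are computed), the metric could be calculated as below. The function name `micro_f1`, the label-set representation and the sample data are illustrative assumptions, not taken from the dissertation.

```python
from typing import Iterable, Set


def micro_f1(gold: Iterable[Set[str]], predicted: Iterable[Set[str]]) -> float:
    """Micro-averaged F1 for multi-label categorization.

    Each element of `gold` / `predicted` is the set of category labels
    for one document. Counts are pooled over all documents and labels.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)   # labels correctly assigned
        fp += len(p - g)   # labels assigned but absent from the gold set
        fn += len(g - p)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example data (not from the dissertation).
gold = [{"sports"}, {"sports", "softball"}, {"science"}]
pred = [{"sports"}, {"sports"}, {"finance"}]
score = micro_f1(gold, pred)
```

Because the counts are pooled, frequent categories influence the score more than rare ones, which is one reason parent-category and sub-category results are usually reported separately, as the abstract does.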

Keywords/Search Tags:

classified corpus, feature extraction, word clustering, text categorization, keyword indexing

Related items

1	Statistical Law Of The Same Frequency Words For Text Mining And Short Text Categorization
2	Research On The Algorithm For Text Clustering Based On Feature Words
3	The Study And Application Of Document Categorization
4	Automatic Indexing Technology Research And Improvement For Document Information
5	Research On Feature Words Extraction And Emotional Tendency Analysis Of Video Commentary
6	Research On Fast Retrieval Algorithm Chinese Expressions And Sentences Based On Chinese Corpus
7	Research On Keyword Extraction Algorithm For Chinese Texts And Cluster Center Point Selection Algorithm In Text Clustering
8	Research On Web Text Clustering And Classification Algorithm
9	Research On Image Categorization Based On Bag-of-words Model
10	The Text Categorization And Structure Of Theme Words Network Based On Topic Models