The Study Of Chinese Text Representation And Classification Based On Multi-Instance Learning

Posted on:2010-05-16

Degree:Master

Type:Thesis

Country:China

Candidate:W He

GTID:2178360302960756

Subject:Information management and e-government

Abstract/Summary:

With the widespread application of information technology and the rapid development of information infrastructure, information resources are growing explosively, and how to obtain useful information is attracting more and more attention. 80% of these information resources are natural language text. Text mining and indexing are key to solving the resulting information management problems. Research on knowledge indexing has already been carried out in China, and new requirements for text content mining have been put forward.

Because text must first be transformed into a form that existing algorithms can handle, text mining is much more difficult than knowledge discovery and data mining in structured databases. Vectorized representation, however, has inherent shortcomings. First, it almost entirely ignores the semantic information of the text. Second, it devotes much attention to mathematical problems, so that text content mining itself receives less study.

In light of the text representation problems mentioned above, and supported by the National Natural Science Foundation, this thesis focuses on semantic text representation, similarity computation, and text classification. Text mining relies on text representation, so the existing text representation models are studied first. Following the research results on knowledge indexing, the sentence is chosen as the unit of text segmentation in place of the word, which carries less semantic information. Second, the concept of the multi-instance bag is introduced to formalize the text as a bag of sentences, and a sentence similarity computation method is proposed to define the distance between bags. Third, the relationships among the sentences in a sentence bag are studied: a sentence relationship map is used to express these relationships, and a topic sentence extraction algorithm is designed on this map.

To validate the sentence bag representation, a text classifier was designed, and the statistical results it achieved are no worse than those of the vector space model.

The research in this thesis enriches the study of multi-instance learning and puts forward a new theory of text representation. Extracting topic sentences without relying on sentence position or other weighting information is a new approach in text content mining.
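
The abstract does not give the thesis's actual similarity measure, bag distance, or extraction algorithm, so the following is only a minimal Python sketch of the general idea under stated assumptions. It assumes sentence similarity is Jaccard overlap of character bigrams (chosen here so that Chinese text needs no word segmenter), bag distance is a symmetric average of nearest-sentence distances, and the topic sentence is the node with the largest total edge weight in a sentence similarity graph. All function names and the `threshold` parameter are hypothetical and are not taken from the thesis.

```python
import re


def split_sentences(text):
    """Turn a text into a bag (list) of sentences, splitting on end punctuation."""
    parts = re.split(r"[。！？!?.]+", text)
    return [p.strip() for p in parts if p.strip()]


def char_bigrams(sentence):
    """Set of character bigrams (assumption: stands in for real semantic features)."""
    s = re.sub(r"\s+", "", sentence)
    if len(s) < 2:
        return {s} if s else set()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def sentence_similarity(a, b):
    """Jaccard similarity of bigram sets, in [0, 1] (assumed measure)."""
    ga, gb = char_bigrams(a), char_bigrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def bag_distance(bag_a, bag_b):
    """Symmetric distance between two sentence bags (assumed definition).

    Each sentence is matched to its nearest sentence in the other bag;
    the directed distances are averaged and the larger direction is returned.
    """
    if not bag_a or not bag_b:
        raise ValueError("both bags must contain at least one sentence")

    def directed(src, dst):
        nearest = [min(1.0 - sentence_similarity(s, t) for t in dst) for s in src]
        return sum(nearest) / len(src)

    return max(directed(bag_a, bag_b), directed(bag_b, bag_a))


def topic_sentence(bag, threshold=0.1):
    """Pick the sentence most strongly connected to the others.

    Edges of the sentence relationship graph are pairs whose similarity
    reaches `threshold`; no position or weighting information is used.
    Ties are resolved in favour of the earlier sentence.
    """
    if not bag:
        raise ValueError("bag must contain at least one sentence")
    best_index, best_score = 0, -1.0
    for i, s in enumerate(bag):
        score = 0.0
        for j, t in enumerate(bag):
            if i == j:
                continue
            sim = sentence_similarity(s, t)
            if sim >= threshold:
                score += sim
        if score > best_score:
            best_index, best_score = i, score
    return bag[best_index]
```

Under these assumptions, a simple nearest-neighbour classifier could assign a new text the label of the training bag with the smallest `bag_distance`. That is one possible way to use such a distance and is not necessarily the classifier designed in the thesis.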

Keywords/Search Tags:

Multi-instance learning, Text representation, Text classification, Bag of Sentences, Topic sentence extraction

Related items

1	Research On Extracting Topic Sentences From News Based On Text Features And Correlation Analysis
2	The Research On Local Smooth Preserving Of Manifold Regularization Auto Encoder For Text Representation
3	Study And Application Of Deep Features Learning In Sentence-Level Text Classification
4	Research On Multi-instance Multi-label Learning Based On Feature Learning
5	Research On Web Text Mining Based For Multi-instance Multi-label Classification
6	Research On Text Abstract Extraction Technology Based On Keywords And Topic Sentences
7	Research On Key Techniques Of Short-text Representation And Classification Based On Hybrid Semantic
8	Research On Text Representation And Feature Extraction Methods Based On Conditional Co-occurrence Degree
9	Classification Of Chinese Text Subject Classification And Emotion Based On Machine Learning
10	Study On Topic Model Based Multi-label Text Classification And Stream Text Data Modeling