The Study Of Chinese Text Representation And Classification Based On Multi-Instance Learning

Posted on: 2010-05-16
Degree: Master
Type: Thesis
Country: China
Candidate: W He
Full Text: PDF
GTID: 2178360302960756
Subject: Information management and e-government
Abstract/Summary:
With the widespread application of information technology and the continued development of information infrastructure, information resources are growing explosively, and how to obtain useful information is drawing more and more attention. About 80% of these resources are natural language text. Text mining and indexing are key to solving information management problems. Research on knowledge indexing has been carried out in China, and new requirements for text content mining have been put forward.

Because text must first be transformed into a form that existing algorithms can handle, text mining is much more difficult than knowledge discovery and data mining in structured databases. Vectorized representation is the usual transformation, but it has inherent shortcomings. First, it almost entirely ignores the semantic information of the text. Second, it directs much of the attention to mathematical problems, so text content mining itself is studied less.

In light of the text representation problems mentioned above, and with the support of a National Natural Science Foundation project, this thesis focuses on semantic text representation, similarity computation, and text classification. Text mining relies on text representation, so existing text representation models are studied first. Following research results in knowledge indexing, the sentence is chosen as the text segmentation unit in place of the word, which carries less semantic information. Secondly, the concept of the multi-instance bag is introduced to formalize the text as a bag of sentences, and a sentence similarity computing method is proposed to define the distance between bags. Thirdly, the relationships among the sentences in a sentence bag are studied: a sentence relationship map is used to express these relationships, and a topic sentence extraction algorithm is designed on this map.

To validate the sentence bag, a text classifier was designed, and the statistical results it achieved are no worse than those of the vector space model. This work enriches the study of multi-instance learning and puts forward a new text representation theory. Extracting topic sentences without sentence position or other weighting information is a new approach in text content mining.
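
The abstract does not give the actual similarity measure, bag distance, or extraction algorithm, so the following Python sketch only illustrates the general shape of a bag-of-sentences pipeline. Everything in it is an assumption made for illustration: character-bigram Jaccard overlap stands in for the sentence similarity, an averaged closest-match distance stands in for the bag distance, the "relationship map" is reduced to pairwise similarities with the most central sentence taken as the topic sentence, and a 1-nearest-neighbour rule stands in for the classifier. The function names are hypothetical.

```python
import re


def split_sentences(text):
    """Turn a text into a bag of sentences (Chinese and Western punctuation)."""
    parts = re.split(r"[。！？.!?]+", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_similarity(a, b):
    """Jaccard overlap of character bigrams; a stand-in, not the thesis's measure."""
    grams_a = {a[i:i + 2] for i in range(len(a) - 1)}
    grams_b = {b[i:i + 2] for i in range(len(b) - 1)}
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


def bag_distance(bag_a, bag_b):
    """Symmetric average of each sentence's distance to its closest match.

    Assumes both bags are non-empty.
    """
    def directed(src, dst):
        total = sum(1.0 - max(sentence_similarity(s, t) for t in dst) for s in src)
        return total / len(src)

    return (directed(bag_a, bag_b) + directed(bag_b, bag_a)) / 2.0


def topic_sentence(bag):
    """Pick the sentence with the highest total similarity to the others.

    Uses no sentence position or external weights, only pairwise similarity.
    """
    def centrality(i):
        return sum(sentence_similarity(bag[i], bag[j])
                   for j in range(len(bag)) if j != i)

    return bag[max(range(len(bag)), key=centrality)]


def classify(bag, labelled_bags):
    """1-nearest-neighbour over bag distance; labelled_bags is [(bag, label), ...]."""
    return min(labelled_bags, key=lambda pair: bag_distance(bag, pair[0]))[1]
```

A document is classified by splitting it with `split_sentences` and passing the resulting bag to `classify` together with a list of labelled training bags. Swapping in a semantically informed `sentence_similarity` changes the bag distance and the topic sentence choice without touching the rest of the pipeline.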
Keywords/Search Tags: Multi-instance learning, Text representation, Text classification, Bag of Sentences, Topic sentence extraction