
The Research And Implementation Of Chinese Short-text Representation And Classification

Posted on: 2013-02-18; Degree: Master; Type: Thesis
Country: China; Candidate: J J Peng; Full Text: PDF
GTID: 2248330371989956; Subject: Computer application technology
Abstract/Summary:
Text classification automatically assigns a text to a category in a given taxonomy based on its content, and it is a basic, core task in text processing. A survey of domestic and international research shows that word segmentation and short-text classification have become the two biggest problems in this area. In addition, the text representation model is a hot topic in text classification research. This paper focuses on these three issues. Rather than confining itself to the traditional vector space model, it represents text with a bag-of-sentences (BoS) model, in order to address problems such as the ambiguity introduced by word segmentation and the difficulty of feature extraction in short texts.

The main work of this paper is as follows:

1. Improving and implementing a Chinese sentence-splitting algorithm that eliminates stop words and stop sentences and merges highly similar sentences. A stop-word list and a stop-sentence list are set up, and the system compares the text to be classified against both lists while splitting it into sentences: anything in the text that matches an entry in either list is removed, and every other sentence is preserved. The system then scans the sentences that remain after stop-word removal and calculates their morphological similarity; sentences found to be the same are treated as highly similar and are merged according to certain rules.

2. Improving the text similarity computation. The text is divided into several fragments, and the contribution of each fragment to text recognition and to distinguishing text categories is taken into account by giving fragments at different positions different weights. When the similarity between two texts is calculated, the fragment similarities are weighted accordingly. The improved method thus takes the position of a sentence in the text into account.

3. Summarizing the respective advantages and disadvantages of text representation models and typical text classification algorithms through a study of a large body of domestic and international literature and, according to the specific needs of this paper (short Chinese text classification), choosing the BoS model as the text representation of the classification system and kNN as its classification algorithm.

4. Implementing each module of the BoS-model-based Chinese short-text classification system.
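The abstract does not spell out the algorithms themselves, so the Python sketch below is only one possible reading of steps 1 to 3, not the thesis's implementation. Everything in it is an assumption made for illustration: the stop lists, the character-overlap (Dice) measure standing in for "morphological similarity", the merge rule (keep the first of any highly similar pair), the position weights, and all function names.

```python
import re
from collections import Counter

# Hypothetical stop lists; the thesis's actual lists are not given in the abstract.
STOP_WORDS = {"的", "了", "是"}
STOP_SENTENCES = {"你好", "谢谢"}


def split_sentences(text):
    """Split on sentence punctuation, drop stop sentences, strip stop words."""
    sentences = []
    for raw in re.split(r"[。！？；!?;]+", text):
        s = raw.strip()
        if not s or s in STOP_SENTENCES:
            continue
        for word in STOP_WORDS:
            s = s.replace(word, "")
        if s:
            sentences.append(s)
    return sentences


def char_similarity(a, b):
    """Dice coefficient over character multisets (assumed similarity measure)."""
    if not a or not b:
        return 0.0
    overlap = sum((Counter(a) & Counter(b)).values())
    return 2.0 * overlap / (len(a) + len(b))


def merge_similar(sentences, threshold=0.8):
    """Assumed merge rule: keep only the first of any highly similar sentences."""
    kept = []
    for s in sentences:
        if all(char_similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept


def to_bag(text):
    """Represent a text as a bag (list) of cleaned, merged sentences."""
    return merge_similar(split_sentences(text))


def position_weight(index, count):
    """Hypothetical weights: first and last sentences count more than the rest."""
    if index == 0:
        return 1.5
    if index == count - 1:
        return 1.2
    return 1.0


def text_similarity(bag_a, bag_b):
    """Position-weighted average of each sentence's best match in the other bag."""
    if not bag_a or not bag_b:
        return 0.0
    total = weight_sum = 0.0
    for i, s in enumerate(bag_a):
        w = position_weight(i, len(bag_a))
        total += w * max(char_similarity(s, t) for t in bag_b)
        weight_sum += w
    return total / weight_sum


def knn_classify(text, training, k=3):
    """Majority vote among the k training texts most similar to `text`."""
    bag = to_bag(text)
    scored = sorted(
        ((text_similarity(bag, to_bag(t)), label) for t, label in training),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    training = [
        ("球队赢得比赛。球迷庆祝胜利。", "sports"),
        ("球员进球。比赛很精彩。", "sports"),
        ("股票价格上涨。投资者获利。", "finance"),
        ("银行降低利率。股票市场波动。", "finance"),
    ]
    print(knn_classify("球队比赛胜利。", training))
```

Because sentences are compared character by character, this sketch needs no word segmenter, which matches the abstract's motivation for a sentence-level representation. Note that `text_similarity` as written is asymmetric (it weights by the positions of the first text only); averaging both directions would be one way to make it symmetric.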
Keywords/Search Tags: Chinese text classification, short-text representation, BoS model, classification algorithm