
Research on Filtration and Classification Methods of Large-Scale Short Text

Posted on: 2008-05-15  Degree: Master  Type: Thesis
Country: China  Candidate: W Wu  Full Text: PDF
GTID: 2178360215482484  Subject: Signal and Information Processing
Abstract/Summary:
Instant communication technology has advanced rapidly in today's information society. The Short Message Service (SMS) used on mobile telephones is considered another major information carrier besides the Internet. It is used in every aspect of society and daily life. As a communication tool, the short message also plays a critical role in guiding and spreading public opinion. Therefore, analyzing and researching the short message, which is a special kind of short text, building an effective and accurate classification system, and mining the information that interests users are especially important and urgent. Against this background, the thesis investigates filtration and classification methods for short text.

Traditional text processing methods are now mature and can filter and classify regular text. However, for the short message, which uses short text as its carrier, research is still at an early stage. Therefore, against the background of the project, the thesis studies the features of short text and the related processing methods, and then puts forward a rule-based filtering method and a classification method based on a statistical language model, which have both research and practical significance. The main contributions of the thesis are as follows.

First, based on an investigation of the language features and corpus structure, together with the project's background, the thesis proposes a rule-based method for filtering large-scale given short text. It uses regular expressions as the tool to create rules and perform matching. The aim is to guarantee fast and exact matching of meaningless short texts that have a fixed format and mode of expression, and then to filter them out.

Second, the thesis researches and establishes a classification system for short text.
After studying the principles and smoothing algorithms of statistical language models, the thesis proposes a language-model-based modeling method for short text. The classifier based on the statistical language model can process non-handwritten short text. To address the problem that a short text contains little information, topic features are combined with the language model, which yields a more accurate language model for short text.

This thesis systematically introduces the language features and classification characteristics of short text, and then proposes effective filtering and classification methods for processing large-scale short text. However, the technology for short text is still relatively immature compared with that for traditional long regular text, and further research on short text processing can still achieve considerable performance improvement.
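
The two-stage approach described in the abstract (rule-based filtering with regular expressions, followed by language-model classification) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the thesis: the filter patterns, the whitespace-style tokenizer, the choice of a unigram model with add-alpha (Laplace) smoothing, the uniform class priors, and all function and class names. The thesis's topic features are not modeled, and Chinese text would need a word segmenter in place of the simple tokenizer.

```python
import math
import re
from collections import Counter

# Hypothetical filter rules for fixed-format, low-content messages.
FILTER_RULES = [
    re.compile(r"\s*"),                                      # empty message
    re.compile(r"(ok|yes|no|hi|hello)[.!?]*", re.IGNORECASE),  # one-word reply
    re.compile(r"[\W\d_]+"),                                 # digits/punctuation only
]


def is_meaningless(text):
    """Return True if the whole message matches any filter rule."""
    stripped = text.strip()
    return any(rule.fullmatch(stripped) for rule in FILTER_RULES)


def tokenize(text):
    """Lowercase word tokenizer (an assumption; not suitable for Chinese)."""
    return re.findall(r"\w+", text.lower())


class UnigramLMClassifier:
    """One unigram language model per class, with add-alpha smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha
        self.counts = {}    # label -> Counter of token frequencies
        self.totals = {}    # label -> total number of tokens
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.counts.setdefault(label, Counter()).update(tokens)
            self.vocab.update(tokens)
        self.totals = {lab: sum(c.values()) for lab, c in self.counts.items()}
        return self

    def log_prob(self, text, label):
        # Vocabulary size plus one slot reserved for unseen words.
        v = len(self.vocab) + 1
        denom = self.totals[label] + self.alpha * v
        counts = self.counts[label]
        return sum(
            math.log((counts[tok] + self.alpha) / denom)
            for tok in tokenize(text)
        )

    def predict(self, text):
        # Uniform class priors: pick the class whose model scores highest.
        return max(self.counts, key=lambda lab: self.log_prob(text, lab))


def filter_and_classify(messages, classifier):
    """Drop messages caught by the rules, then label the remainder."""
    return [(m, classifier.predict(m)) for m in messages if not is_meaningless(m)]


train_texts = [
    "the team won the match",
    "great goal in the football match",
    "rain and wind expected tomorrow",
    "sunny weather tomorrow morning",
]
train_labels = ["sports", "sports", "weather", "weather"]
clf = UnigramLMClassifier().fit(train_texts, train_labels)
results = filter_and_classify(["ok", "who won the match", "rain tomorrow"], clf)
```

In this sketch, filtering runs first so that the classifier only scores messages that carry content, which mirrors the order of the two contributions in the abstract. Smoothing matters for short text in particular because a single unseen word would otherwise drive a class probability to zero.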
Keywords/Search Tags: short text, text filtration, Regular Expression, statistical language model, text classification