Microblogs, owing to their brevity, immediacy, and viral spread, have become one of the most important social media through which netizens follow news events, maintain interpersonal relationships, express themselves, and share socially. They have also become an important platform for disseminating public opinion, enterprise brands and products, and traditional media content. However, traditional tools perform poorly on microblog texts because of their informal nature. The solution is text normalization, which has become an important preprocessing step in microblog text analysis.

Unlike English informal words, which are largely out-of-vocabulary (OOV) words, Chinese informal words take diverse forms, such as phonetic substitutions, abbreviations, neologisms, and paraphrases. In this paper, we focus on text normalization for Chinese microblogs. Traditional approaches treat informal words as spelling errors and apply noise models or translation models to normalization. Other methods study normalization from a semantic point of view, but still face key challenges. Following the typical processing pipeline for microblog normalization, we explore three key issues: learning the senses of informal words, mining the relations between informal and formal words, and jointly performing word segmentation and normalization on Chinese microblogs. These contributions are described as follows.

1. A lexical-based hypergraph model for word sense induction

Most informal words in microblogs carry new senses, so informal word identification can be viewed as a disambiguation task that cannot be handled with a traditional dictionary. The key to this task is learning, or inducing, the senses of informal words from microblog text. Word sense induction is the task of automatically discovering word senses from large-scale text, and is generally treated as an unsupervised clustering problem.
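The idea of word sense induction as unsupervised clustering can be illustrated with a minimal sketch. This is not the hypergraph model of the paper; it simply clusters context instances of a target word greedily by Jaccard overlap of their context words, so that each resulting cluster approximates one induced sense. All data and thresholds are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of context words."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def induce_senses(contexts, threshold=0.2):
    """Greedily cluster context instances; each cluster approximates one sense.

    contexts: list of sets, each the bag of context words around one
    occurrence of the target word. Returns lists of instance indices.
    """
    clusters = []  # list of (centroid word set, member indices)
    for i, ctx in enumerate(contexts):
        best, best_sim = None, threshold
        for centroid, members in clusters:
            sim = jaccard(ctx, centroid)
            if sim >= best_sim:
                best, best_sim = (centroid, members), sim
        if best is None:
            clusters.append((set(ctx), [i]))
        else:
            best[0].update(ctx)   # grow the centroid
            best[1].append(i)
    return [members for _, members in clusters]
```

For an ambiguous target like "bank", contexts {money, loan}, {loan, deposit}, {river, shore}, {shore, water} fall into two clusters, i.e. two induced senses. The paper's model replaces this pairwise overlap with lexical-chain hyperedges capturing higher-order relatedness.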
In this paper we propose a hypergraph model in which nodes represent context instances where a target word occurs, and hyperedges represent higher-order semantic relatedness among those instances. A lexical chain-based method discovers the hyperedges, and hypergraph clustering methods group the context instances into word senses. Experiments show that this model outperforms other methods, and that performance is affected by the lexical chains, the number of senses, and the semantic granularity of the target word.

2. Mining relations between informal and formal words based on embedding learning

Assuming that the majority of informal words can be normalized into formal equivalents, we can construct an informal-word dictionary that is useful for text normalization. The key issue is how to mine the relations between informal and formal words from large volumes of microblog text. Since informal words and their formal equivalents should be semantically similar, we propose a multi-sense embedding model that learns global multi-sense representations and captures synonym relationships, overcoming the limitation that traditional methods do not directly reveal semantic relationships among words. Our model exploits the position information of context windows and effectively alleviates the semantic bias problem. After post-processing, which includes filtering and classification, we obtain the relations between informal and formal words. Experiments show the effectiveness of our method.

3. A joint model for segmentation, POS tagging, and normalization

Finally, we explore an application of normalization and propose a joint model for segmentation, POS tagging, and normalization to address the problem of segmenting Chinese microblogs. Our model normalizes text through extended actions. Words are segmented on the normalized text, and better segmentation in turn helps to identify informal words, thus facilitating normalization.
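The relation mining of point 2 rests on ranking formal words by embedding similarity to an informal word. A minimal cosine-similarity sketch with hypothetical toy vectors (not the paper's learned multi-sense embeddings) looks as follows; "酱紫" is a real phonetic substitution for "这样" ("this way"), but the vectors here are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def formal_candidates(informal, embeddings, formal_vocab, top_k=2):
    """Rank formal words by embedding similarity to an informal word."""
    query = embeddings[informal]
    ranked = sorted(formal_vocab,
                    key=lambda w: cosine(query, embeddings[w]),
                    reverse=True)
    return ranked[:top_k]

# Toy embeddings: the informal word should land near its formal equivalent.
toy = {"酱紫": [0.9, 0.1], "这样": [0.85, 0.15], "苹果": [0.1, 0.9]}
```

In the paper, such candidate lists are then filtered and classified to yield the final informal-to-formal dictionary entries.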
We can also train our model on standard corpora, overcoming the lack of annotated microblog corpora. The score of a tagged text is computed from two types of features: common features extracted from standard texts and domain features extracted from informal texts, which gives the model better domain adaptability. Experiments show that the three tasks help each other and that a language model further improves performance.
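The "extended actions" of the joint model can be sketched with a tiny transition system. The real model learns which action sequence to take; the sketch below only replays a given sequence, showing how a substitution action (SUB-) normalizes a character before the usual separate (SEP) or append (APP) segmentation step. The action names and the substitution table are illustrative assumptions; "木有" is a common informal form of "没有" ("don't have").

```python
def apply_actions(chars, actions, subst_table):
    """Replay a transition sequence over characters.

    SEP starts a new word, APP appends to the current word, and a
    SUB- prefix first replaces the character with its formal form
    (normalization folded into the segmentation actions).
    """
    words = []
    for ch, act in zip(chars, actions):
        if act.startswith("SUB-"):
            ch = subst_table.get(ch, ch)      # normalize the character
            act = act.split("-", 1)[1]        # remaining segmentation action
        if act == "SEP" or not words:
            words.append(ch)                  # start a new word
        else:
            words[-1] += ch                   # extend the current word
    return words
```

Replaying SEP, SUB-SEP, APP, SEP over "我木有钱" with the table {"木": "没"} yields the segmented, normalized words 我 / 没有 / 钱, illustrating how normalization and segmentation support each other in one action sequence.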