| With the development of Internet technology, the information on the network has grown explosively, and people have become accustomed to getting information from it. However, the massive network also contains much harmful information, which inevitably affects people. For example, web text containing vulgar language has a bad impact not only on network users, especially young people, but also on the wider community. The National Language Resources Monitoring and Research Center for Network Media undertook a project, supported by the National Social Science Foundation, on the Internet language environment and the construction of a harmonious network language; one task of this project is the automatic monitoring of vulgar language usage in network text. Based on research for this task, this paper proposes and implements an automatic judgment method for vulgar text and applies it to judging the degree of vulgarity of network text. The main work consists of the following three parts. First, a large-scale vulgar-text corpus is constructed to address the relative lack of relevant research data. Machine-learning-based analysis of natural language text needs a large-scale training corpus, but such corpora are scarce for this task. We therefore downloaded one million microblog posts for labeling and used a voting strategy to verify the consistency of the annotation, finally constructing a vulgar-text corpus of twenty thousand microblog texts. Second, a new method that combines mutual information with distance is proposed and applied to the construction of a vulgar-word dictionary. Different vulgar slang terms in the same text often co-occur and appear close to one another. This paper therefore introduces a correlation measure between words that uses mutual information and seven kinds of distance functions to express the degree of co-occurrence of two words and the effect of the distance between them. 
Words that have a vulgar meaning and are widely recognized are taken as seed words to build a seed dictionary; words related to these seeds are then extracted from the vulgar-text corpus with the proposed measure and used to extend the dictionary. Compared with word expansion based on mutual information alone, the proposed method reduces, to a certain extent, the interference caused by new words. Third, a model for judging text vulgarity is designed. Three factors together determine how vulgar a text is: the text length, the frequency with which vulgar slang is used, and the vulgarity level of that slang. These factors are quantified with a statistical model, on which the vulgarity judgment model is built, and experiments confirm its effectiveness. The experimental results show that the model has lower time consumption and better classification performance, and that it can also express the degree of vulgarity of a text. |
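
The dictionary-expansion idea in the second part can be illustrated with a minimal sketch. The abstract does not give the exact correlation measure or the seven distance functions, so everything below is an assumption made for illustration: the function names, the inverse-distance weighting, the way the weight enters the mutual-information formula, and the toy tokens. Here each co-occurrence of two words in a text is weighted by a decay of the shortest distance between them; with a constant decay of 1 the score reduces to plain document-level pointwise mutual information.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def inverse_distance(d):
    """Hypothetical distance function: closer word pairs get more weight."""
    return 1.0 / d


def distance_weighted_pmi(docs, decay=inverse_distance):
    """Score word pairs by PMI with distance-weighted co-occurrence counts.

    docs is a list of token lists. This is an illustrative variant, not
    the measure defined in the paper.
    """
    n_docs = len(docs)
    doc_freq = Counter()              # number of texts containing each word
    pair_weight = defaultdict(float)  # distance-weighted co-occurrence mass
    for tokens in docs:
        positions = defaultdict(list)
        for i, tok in enumerate(tokens):
            positions[tok].append(i)
        doc_freq.update(positions.keys())
        for a, b in combinations(sorted(positions), 2):
            # Shortest distance between any occurrence of a and of b.
            min_dist = min(abs(i - j) for i in positions[a] for j in positions[b])
            pair_weight[(a, b)] += decay(min_dist)
    return {
        (a, b): math.log(n_docs * w / (doc_freq[a] * doc_freq[b]))
        for (a, b), w in pair_weight.items()
    }


def expand_dictionary(scores, seeds, threshold):
    """Return non-seed words whose score with some seed reaches threshold."""
    found = set()
    for (a, b), score in scores.items():
        if score >= threshold:
            if a in seeds and b not in seeds:
                found.add(b)
            elif b in seeds and a not in seeds:
                found.add(a)
    return found


# Toy corpus with placeholder tokens: "seed" stands for a known vulgar
# seed word and "slang" for a related term that tends to appear next to it.
docs = [
    ["that", "seed", "slang", "again"],
    ["seed", "slang", "today"],
    ["nice", "weather", "today"],
    ["that", "film", "was", "long", "today", "seed"],
]
weighted = distance_weighted_pmi(docs)
flat = distance_weighted_pmi(docs, decay=lambda d: 1.0)  # plain PMI
new_words = expand_dictionary(weighted, seeds={"seed"}, threshold=0.0)
print(sorted(new_words))
```

In this toy corpus, plain PMI gives "slang" and "that" the same score with respect to "seed" because each co-occurs with it in two texts, whereas the distance-weighted variant ranks "slang" higher because it always appears adjacent to the seed. The threshold of 0.0 is likewise only an illustrative choice.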