Study on Uighur and Kazakh Illegal Web Page Recognition Method

Posted on: 2013-09-20
Degree: Master
Type: Thesis
Country: China
Candidate: Y K Li
GTID: 2248330395465768
Subject: Agricultural mechanization project
Abstract/Summary:
With the rapid development of information technology, the Internet has become an important tool for quickly publishing and accessing information. In recent years the number of Uighur and Kazakh websites has grown rapidly; according to incomplete statistics, there are more than two thousand Uighur and Kazakh websites in China, and the number is still increasing. Minority-language websites provide a wide range of cultural information about their ethnic groups, but at the same time some criminals use the Internet to spread illegal information that threatens social security and stability. Such information seriously distorts our party's principles and policies and distorts the truth. It can easily lead the public to irrational judgments and poses serious risks to social harmony and stability. How to monitor and filter such illegal information effectively has become a concern of the government, and the recognition of bad Uighur and Kazakh web pages has become a research focus for research institutions. The author designs a Uighur and Kazakh website recognition model, uses a search engine to find Uighur and Kazakh websites on the Internet, and collects data from those websites. This thesis studies the technologies involved in Uighur and Kazakh illegal web page recognition, mainly web page content extraction, Uighur and Kazakh word segmentation, feature word extraction, text classification algorithms, and classifier performance evaluation. Based on an analysis of the characteristics of bad Uighur and Kazakh web pages, feature words are extracted from the training set with the chi-square method. To assess how different text classification algorithms affect the performance of the recognition model, the author studies the support vector machine, the k-nearest neighbor algorithm, and the naive Bayes text classification algorithm, and designs a multiple linear regression model based on the principle of multiple linear regression.
In this thesis, the author tests and compares these four methods. The results show that when the text is represented as a weighted feature vector and the LIBSVM kernel function is RBF (radial basis function), the model's accuracy and recall can both exceed 95%; the model is also more stable and more computationally efficient. In practical applications, this method also achieves very good recognition results.
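The chi-square feature word extraction mentioned in the abstract can be illustrated with a minimal sketch. The thesis's own implementation is not shown here, so everything below is an assumption: the function names `chi_square_scores` and `top_k_features`, the class label `"illegal"`, and the placeholder tokens `w1` to `w4`, which stand in for words produced by a separate Uighur or Kazakh word segmentation step.

```python
from collections import Counter


def chi_square_scores(docs, labels, positive="illegal"):
    """Score each term with the chi-square statistic of its 2x2 table:
    term present/absent versus document in the positive class or not.

    docs   -- list of token lists (already segmented; assumed input)
    labels -- list of class labels, one per document
    """
    n = len(docs)
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = n - n_pos

    # Document frequency of each term inside and outside the positive class.
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        target = df_pos if y == positive else df_neg
        for term in set(tokens):
            target[term] += 1

    scores = {}
    for term in set(df_pos) | set(df_neg):
        a = df_pos[term]   # positive documents containing the term
        b = df_neg[term]   # other documents containing the term
        c = n_pos - a      # positive documents without the term
        d = n_neg - b      # other documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A zero denominator means the term carries no class information.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def top_k_features(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus: two "illegal" and two "normal" documents.
docs = [["w1", "w2"], ["w1", "w3"], ["w2", "w4"], ["w3", "w4"]]
labels = ["illegal", "illegal", "normal", "normal"]
scores = chi_square_scores(docs, labels)
features = top_k_features(scores, 2)
```

In this toy corpus, `w1` and `w4` each occur in only one class and receive the highest scores, while `w2` and `w3` are spread evenly across both classes and score zero. The selected feature words would then be weighted and passed to a classifier such as an SVM with an RBF kernel, a step this sketch does not cover.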
Keywords/Search Tags: illegal webpage classification, feature word extraction, multiple linear regression, support vector machine, KNN, Naive Bayes