Study On Uighur And Kazakh Illegal Web Page Recognition Method

Posted on:2013-09-20

Degree:Master

Type:Thesis

Country:China

Candidate:Y K Li

GTID:2248330395465768

Subject:Agricultural mechanization project

Abstract/Summary:

With the rapid development of information technology, the Internet has become an important tool for quickly publishing and accessing information. In recent years the number of Uighur and Kazakh websites has grown rapidly: according to incomplete statistics, there are more than two thousand such websites in China, and the number is still increasing. Minority-language websites provide a wide range of cultural information about their ethnic groups, but at the same time some criminals use the Internet to spread illegal information that threatens social security and stability. Such information seriously distorts the Party's principles and policies and distorts the truth. It can easily lead the public to irrational judgments and poses serious risks to social harmony and stability. How to effectively monitor and filter such illegal information has become a concern of the government, and Uighur and Kazakh illegal web page recognition technology has become a research focus for research institutions.

The author designs a Uighur and Kazakh website recognition model and uses a search engine to find Uighur and Kazakh websites on the Internet and collect their data. This thesis studies the technologies involved in Uighur and Kazakh illegal web page recognition. The study mainly covers Uighur and Kazakh web page content extraction, Uighur and Kazakh word segmentation, feature word extraction, text classification algorithms, and classifier performance evaluation.

Based on an analysis of the characteristics of Uighur and Kazakh illegal web pages, feature words are extracted from the training set using the chi-square method. To determine how different text classification algorithms affect the performance of the recognition model, the author studies the support vector machine (SVM), k-nearest neighbor (kNN), and Naive Bayes text classification algorithms, and designs a multiple linear regression model based on the principle of multiple linear regression.
The author tests and compares these four methods. The results show that when the text is represented as a weighted feature vector and the LIBSVM kernel function is RBF (Radial Basis Function), the model's accuracy and recall both exceed 95%, and the model is also more stable and more computationally efficient. In practical applications, this method also achieves good recognition results.
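The chi-square feature word extraction mentioned in the abstract can be illustrated with a minimal sketch. The thesis does not give its exact formula or code, so this uses the standard document-frequency form of the chi-square statistic for a term and a class; the function names (`chi_square`, `select_features`), the label strings, and the English placeholder tokens are all hypothetical and stand in for already-segmented Uighur or Kazakh words.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic for one term and one class.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, target, k):
    """Rank terms by chi-square against the `target` class and keep the top k.

    docs: list of token lists (assumed to be already segmented)
    labels: list of class labels, parallel to docs
    """
    n_pos = sum(1 for y in labels if y == target)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count each term once per document (document frequency).
        (pos_df if y == target else neg_df).update(set(tokens))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


# Hypothetical toy corpus: two "illegal" and two "normal" pages.
docs = [["bad", "news"], ["bad", "word"], ["good", "news"], ["good", "day"]]
labels = ["illegal", "illegal", "normal", "normal"]
top_terms, scores = select_features(docs, labels, target="illegal", k=2)
```

In this toy corpus, a term that appears equally often in both classes (here "news") scores zero, while terms concentrated in one class score highest. The selected terms would then serve as the dimensions of the weighted feature vectors passed to a classifier such as an RBF-kernel SVM.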

Keywords/Search Tags:

illegal webpage classification, feature word extraction, multiple linear regression, support vector machine, kNN, Naive Bayes

Related items

1	Research On Feature Selection And Classification Based On Intelligent Optimization Algorithms
2	Research And Application Of Distributed Webpage Automatic Classification Algorithm Based On Bayes
3	Comparing Classifiers In Data Mining
4	Design And Implementation Of Short Message Classification System Based On Naive Bayesian
5	Research Of Chinese Text Classification Based On Naive Bayesian Method And Application Of Microblogging Data Classification
6	Research On Text Classification Algorithm Based On Naive Bayes Method
7	Research On Spam Text Classification Based On Improved Naive Bayes Algorithm
8	Research Of Identifying Splog Based On Multiple Structure Features
9	Decoding Emotion From FMRI Based On Machine Learning
10	Study Of Feature Extraction Based Genetic Characteristics And Species Identification Of Weed Seeds