
Research On The Techniques Of Chinese Web Information Identical Cognizance

Posted on: 2011-04-02
Degree: Doctor
Type: Dissertation
Country: China
Candidate: J B Ma
GTID: 1118360305969470
Subject: Agricultural mechanization project
Abstract/Summary:
With the growing popularity of the Internet, web information in many forms, such as BBS posts, blogs, and e-mail, has become an important source of information in daily life and work. While this information is convenient, it also brings many problems. Illegal content, including antisocial, fraudulent, pornographic, terrorist-threat, and gambling information, appears on BBS, in blogs, and in e-mail. The Internet gives criminals new venues and new means for crime. These phenomena have harmful effects and seriously threaten social stability and national security.

At present, the main countermeasure is to install filtering software that blocks information containing sensitive words. Such passive defenses cannot put an end to illegal web information, because criminals can use substitute words to evade the filters. Punishing criminals through the law can combat these crimes effectively. China has enacted relevant laws, so a legal basis exists. However, for lack of effective evidence, cases cannot be brought to court. If the authorship of web information can be identified, evidence against the offender can be found and collected for computer forensics, which has important application value and practical significance for law enforcement, social security and stability, and a cleaner Internet environment.

Using the theory and techniques of stylometry, this dissertation investigated the writing style of the authors of web information. Writing features that could represent an author's style were extracted, and a machine-learning algorithm was used to identify the authorship of web information. The main work was as follows. (1) Related research was surveyed and analyzed comprehensively and in detail.
A model and framework for web information authorship identification were provided. (2) Methods for extracting the content of e-mails and web pages were put forward. (3) Linguistic, structural, and format features that could express an author's writing style were extracted. (4) The support vector machine algorithm was improved, and the PSTSVM algorithm, which is suited to small-sample classification, was put forward. (5) A Chinese web information authorship identification system was developed. (6) To investigate criminals' social relations, social networks were studied, and methods for building a social network based on judging the authenticity of authorship were provided.

To test the validity of the method, large datasets were collected and several experiments were carried out, in which a number of influencing factors were tested. The experimental results showed that the three feature extraction methods were effective and that combining the three feature types gave better results than any single feature type. Classification accuracy on the literature, blog, and e-mail datasets exceeded 86 percent. The results showed that the method was effective and feasible to apply to computer forensics.
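The abstract does not give the concrete feature definitions or the details of PSTSVM, so the following is only a minimal sketch of the general idea: build one vector per text from linguistic, structural, and format features, then assign an unknown text to the closest known author. Every feature chosen here (character bigram frequencies, punctuation rate, paragraph counts, blank-line and indentation ratios) and every function name is an assumption made for illustration. A simple nearest-centroid classifier is swapped in where the dissertation uses its improved support vector machine.

```python
import math
from collections import Counter

# Assumed punctuation set (full-width and ASCII); not taken from the dissertation.
PUNCTUATION = "，。！？；：、,.!?;:"


def linguistic_features(text, top_bigrams):
    """Relative frequency of chosen character bigrams, plus punctuation rate."""
    # Whitespace is dropped, so bigrams may span paragraph boundaries.
    chars = [c for c in text if not c.isspace()]
    bigrams = Counter(a + b for a, b in zip(chars, chars[1:]))
    total = max(sum(bigrams.values()), 1)
    feats = [bigrams[b] / total for b in top_bigrams]
    feats.append(sum(c in PUNCTUATION for c in chars) / max(len(chars), 1))
    return feats


def structural_features(text):
    """Number of non-empty paragraphs and their average length in characters."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    n = len(paragraphs)
    avg_len = sum(len(p) for p in paragraphs) / max(n, 1)
    return [float(n), avg_len]


def format_features(text):
    """Share of blank lines and share of lines that start with indentation."""
    lines = text.split("\n")
    n = max(len(lines), 1)
    blank = sum(1 for line in lines if not line.strip()) / n
    indented = sum(1 for line in lines if line.startswith((" ", "\t", "\u3000"))) / n
    return [blank, indented]


def feature_vector(text, top_bigrams):
    """Concatenate the three feature groups into one vector."""
    return (linguistic_features(text, top_bigrams)
            + structural_features(text)
            + format_features(text))


def train_centroids(samples):
    """samples: iterable of (author, vector). Returns the mean vector per author."""
    sums, counts = {}, Counter()
    for author, vec in samples:
        acc = sums.setdefault(author, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[author] += 1
    return {a: [v / counts[a] for v in acc] for a, acc in sums.items()}


def predict(centroids, vec):
    """Return the author whose centroid is nearest in Euclidean distance."""
    return min(centroids, key=lambda a: math.dist(centroids[a], vec))
```

In this sketch the features are not rescaled, so a feature with a large numeric range, such as average paragraph length, dominates the distance. A real system would normalize the features before training a classifier.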
Keywords/Search Tags: Chinese, Web Information, Authorship Identification, PSTSVM