Research Of Bad Text Filtering Model Based On Improved Vector Space Model

Posted on: 2011-11-01
Degree: Master
Type: Thesis
Country: China
Candidate: Q X Wang
GTID: 2178330332982011
Subject: E-commerce
Abstract/Summary:
With the progress of science and technology and the rapid development of the Internet, networks have brought unprecedented convenience to people's work and daily life. Through the network, people can obtain many kinds of information resources quickly. The network has made its way into every corner of people's work and life and is gradually changing how they live. While the network brings convenience, it also carries a great deal of harmful information. Such content seizes the opportunity to slip in, flooding the network with violence, pornography, cult material, and other objectionable information. This harmful information erodes people's minds, poisons their thinking, and seriously affects social stability. How to distinguish harmful information from the mass of information on the Internet, and how to take measures to control its spread, are questions that must be addressed in the field of network information security.
Therefore, in order to protect social stability and development, to protect network users from harassment by harmful information, to accelerate the construction of spiritual civilization, and to promote social harmony, we should develop more effective recognition and filtering technology to block harmful information spreading on the network, alongside positive ideological education for network users. At present, in the release and transmission of information on the network, the content is complex, the types are numerous and diverse, and good and bad information are intermingled.
In this context, the main goal of this thesis is to use text classification technology to identify the nature of a target text, by studying the bad text representation model, the bad text filtering model, and the key technologies involved. This helps filter bad information on the network and create a green network space for users.
This thesis is aimed at text filtering. It first analyzes the theory and technology involved in text filtering, mainly web text extraction, Chinese word segmentation, feature extraction, text representation models, and the BP artificial neural network. It then analyzes the defects of the traditional vector space model (VSM) and improves it by introducing position weights and an analysis of the tendentiousness of terms. On this foundation, the thesis designs a bad text filtering model based on the improved vector space model and describes the model's overall framework and the implementation of each functional module. Finally, it collects and processes a corpus related to a particular event and evaluates the model through simulation tests in MATLAB. The simulation results show that the model can perform text filtering, that the improved vector space model is more effective, and that filtering based on the improved VSM performs better. To a certain extent, the model can distinguish and filter bad information on the network, help control its spread, and provide a civilized and healthy network environment for users.
The innovation of this thesis is that, by analyzing the flaws of the traditional vector space model, it introduces position weights and term tendentiousness analysis. It improves the traditional VSM by assigning each feature item a specific weight according to its position in the text and by performing emotion tendency analysis on the feature items. It also designs a bad text filtering model based on the improved vector space model. Together, these changes improve the accuracy of text representation and the efficiency of bad text filtering, help control the dissemination of such information, and offer a new direction for text filtering research.
Because of limits on time and on my own ability, this thesis has some shortcomings that need further study. For example, the data used in the simulation test are commentaries on a particular event. They are short and few in number, so the test has certain limitations. Using longer texts as simulation test data is therefore a focus of future research. In addition, although this thesis proposes some improvements to the traditional vector space model, flaws remain, such as in the whole-text tic analysis and the emotion tendency analysis. These need further study.
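The abstract does not give the weighting formula, so the following is only a minimal Python sketch of the general idea it describes: a term's weight in the document vector depends on where the term appears and on a tendency score for the term. The position weights, the tendency lexicon, the smoothed IDF, and all function names are hypothetical choices made for illustration. They are not taken from the thesis, which ran its experiments in MATLAB.

```python
import math
from collections import Counter

# Hypothetical position weights; the abstract does not state actual values.
POSITION_WEIGHT = {"title": 3.0, "body": 1.0}

# Hypothetical tendency lexicon: terms judged harmful get a factor above 1.
TENDENCY = {"violence": 2.0, "cult": 2.0}


def weighted_tf(doc):
    """Count terms, scaling each occurrence by the weight of its position.

    `doc` maps a position name (e.g. "title") to a list of tokens.
    """
    tf = Counter()
    for position, tokens in doc.items():
        weight = POSITION_WEIGHT.get(position, 1.0)
        for token in tokens:
            tf[token] += weight
    return tf


def build_vectors(docs):
    """Return one sparse vector (dict) per document.

    Each weight is position-weighted TF * smoothed IDF * tendency factor.
    """
    n = len(docs)
    tfs = [weighted_tf(d) for d in docs]
    df = Counter()
    for tf in tfs:
        df.update(tf.keys())  # document frequency: one count per document
    vectors = []
    for tf in tfs:
        vec = {}
        for term, freq in tf.items():
            idf = math.log((n + 1) / (df[term] + 1)) + 1.0  # assumed smoothing
            vec[term] = freq * idf * TENDENCY.get(term, 1.0)
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


docs = [
    {"title": ["violence"], "body": ["news", "violence"]},
    {"title": ["weather"], "body": ["news", "sunny"]},
]
vectors = build_vectors(docs)
```

In this sketch a term in the title counts three times as much as the same term in the body, and a term in the tendency lexicon is boosted further, so "violence" dominates the first vector. Vectors of this kind could then serve as input to a classifier such as the BP network the abstract mentions. The classifier itself is not shown here.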
Keywords/Search Tags: Bad text, Text Filtering, Vector Space Model (VSM), BP network