The rapid development of the Internet has made it the richest resource in the world's information network, and its vast body of shared resources is one of the most important ways for people to obtain information. It also raises new problems: how to filter out information unrelated to a user's needs, and how to protect users from illegal information, have become research hot spots in current Internet development. Information filtering systems address these problems to a large extent. Text cannot be computed on or reasoned about directly; it must first be abstracted into a model. Text preprocessing is therefore the first problem a text information filtering system must solve. Word segmentation is the prerequisite for extracting feature values and building the text vector, and segmentation techniques can resolve the ambiguity that exists between words. The accuracy of segmentation directly affects the accuracy and recall of the filtering results. Given the current state of natural language understanding, it remains difficult to analyze a text as a whole, grasp its main idea, and express that analysis concretely and explicitly. Even purely grammatical analysis of an entire text often cannot be completed because of resource and response-time constraints. This paper proposes a binary segmentation algorithm based on language analysis. It combines dictionary matching with statistics-based binary segmentation to perform context-sensitive new-word recognition and disambiguation. In addition, a corrector based on language analysis is applied as a post-processing step.
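The combination of dictionary matching and statistics-based binary segmentation can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes that "binary segmentation" means pairing adjacent characters into two-character units, uses reverse maximum matching as the dictionary step, and omits context analysis and the corrector. The names `reverse_max_match`, `candidate_new_words`, `max_len`, and `min_count` are hypothetical.

```python
from collections import Counter


def reverse_max_match(text, dictionary, max_len=4):
    """Segment text right to left, taking the longest dictionary word each time."""
    tokens = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                end -= size
                break
    tokens.reverse()
    return tokens


def candidate_new_words(texts, dictionary, min_count=2):
    """Count two-character units left over after dictionary matching.

    Assumption for illustration: two adjacent single-character tokens that
    recur at least min_count times are proposed as a candidate new word.
    """
    counts = Counter()
    for text in texts:
        tokens = reverse_max_match(text, dictionary)
        for left, right in zip(tokens, tokens[1:]):
            if len(left) == 1 and len(right) == 1:
                counts[left + right] += 1
    return {word: n for word, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    dictionary = {"喜欢", "很好"}
    texts = ["我喜欢微博", "微博很好"]
    print(reverse_max_match(texts[0], dictionary))
    print(candidate_new_words(texts, dictionary))
```

With the toy dictionary above, the word 微博 is missing, so both texts leave the characters 微 and 博 as adjacent single tokens, and the pair is proposed as a candidate because it recurs. A real system would still need the context-sensitive disambiguation and post-processing that the abstract describes.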
For long texts, where scattered keywords produce spurious matches, screening the segmentation results from a language-analysis perspective handles the problem well. Experiments on five categories (education, sports, entertainment, technology, and life) show that the algorithm achieves an average accuracy of 93.2%, an improvement of 32.3% and 10.2% over the reverse maximum matching algorithm and the term frequency method, respectively. Based on a layered system architecture, we implemented a client-side intelligent text filtering system that learns from sample texts and automatically classifies and filters newly arrived texts. The system's information analysis module adds Chinese word segmentation based on context analysis, disambiguates the screened candidate words, post-processes the segmentation results with grammatical analysis, and uses the Rocchio algorithm to adjust the topic feature vector and refine the template. In addition, users can switch between keyword filtering and template filtering according to their needs, so that the filtering results better match their requirements. Experiments show that the system's average recall and accuracy improve by 16.3% and 12.6%, respectively, compared with the CJKAnalyzer filtering system.
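The Rocchio adjustment of the topic feature vector can be sketched as follows. This is the textbook form of the update, not necessarily the variant used in the system: the function name `rocchio_update`, the default weights `alpha`, `beta`, and `gamma`, and the clipping of negative weights to zero are all assumptions made for illustration.

```python
def rocchio_update(profile, relevant, non_relevant,
                   alpha=1.0, beta=0.75, gamma=0.15):
    """Move a topic vector toward relevant texts and away from non-relevant ones.

    profile: list of term weights for the current topic template.
    relevant / non_relevant: lists of text vectors of the same length.
    """
    def centroid(vectors):
        # An empty feedback set contributes nothing to the update.
        if not vectors:
            return [0.0] * len(profile)
        return [sum(column) / len(vectors) for column in zip(*vectors)]

    rel = centroid(relevant)
    non = centroid(non_relevant)
    # Negative weights are clipped to zero (a common convention, assumed here).
    return [max(0.0, alpha * p + beta * r - gamma * n)
            for p, r, n in zip(profile, rel, non)]


if __name__ == "__main__":
    topic = [1.0, 0.0]
    accepted = [[0.0, 2.0], [0.0, 4.0]]
    rejected = [[2.0, 0.0]]
    print(rocchio_update(topic, accepted, rejected, beta=0.5, gamma=0.5))
```

In this toy run the first term, which appears only in the rejected text, loses its weight, while the second term, which appears only in the accepted texts, gains weight. The template therefore drifts toward what the user actually accepts.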