Research And Implementation Of Key Technologies In Kapok Education News Platform

Posted on: 2013-10-06, Degree: Master, Type: Thesis
Country: China, Candidate: P Zhang
GTID: 2248330374981013, Subject: Computer system architecture
Abstract/Summary:
Information is the basis of human civilization. Since the birth of the Internet, information has grown richer and richer while knowledge has remained scarce. The complexity of the data places higher demands on traditional techniques, and how to exploit huge amounts of text to enhance the value of information is becoming an important issue.

This paper focuses on collecting, refining and organizing mass Internet information on education, and on improving traditional text classification and clustering technology with the help of the large volume and variety of Internet data. A news aggregation system is also designed to help users keep abreast of current affairs easily and quickly, and two key technologies are proposed to solve several important problems.

The detection of out-of-vocabulary (OOV) words is important for text classification, clustering and other applications. Based on LMR-tagging and CRF models, we propose an algorithm for generating an OOV word dictionary that consists of three steps: 1) select potentially unstable sentences from huge amounts of data; 2) extract unstable areas; 3) build the OOV word dictionary. The innovation is to reduce the scale of computation, making the method suitable for mass text mining, and to avoid the noise introduced by the suffix tree.

The auto-tagging algorithm is an iterative feedback framework based on two basic assumptions. The main process starts from an initial query, retrieves documents to obtain related words, combines them to form a new query for searching, and repeats until a termination condition is met. Two variants are designed: LA-1 is strictly constrained by the category name, while LA-2 uses query expansion to generate queries, achieving a balance between accuracy and generalization.

Experiments are set up to verify the effectiveness of the two techniques. The results indicate that OOV dictionaries can increase classification accuracy, especially on short text. The training set produced by the auto-tagging algorithm is as good as manually annotated data and makes personalized categorization possible in actual applications. Both algorithms have achieved good results and have proven their practicality and effectiveness in a real environment.
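
The iterative feedback loop of the auto-tagging algorithm can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not give the retrieval model, the scoring of related words, or the termination condition. The sketch therefore assumes keyword-overlap retrieval, raw term counts, and "no new documents retrieved" as the stopping rule. The names `retrieve`, `auto_tag`, `top_k` and `max_rounds` are hypothetical. Because it expands the query each round, it is closer in spirit to the query-expansion variant than to the strictly constrained one.

```python
from collections import Counter


def retrieve(query, corpus):
    """Toy retrieval (assumption): a document matches if it shares any query term."""
    return {i for i, doc in enumerate(corpus) if query & set(doc.split())}


def auto_tag(category, corpus, stopwords=frozenset(), top_k=2, max_rounds=5):
    """Iteratively grow a query from a category name and collect matching documents.

    Returns (indices of documents tagged with the category, final query terms).
    """
    query = {category}          # initial query: the category name itself
    labeled = set()
    for _ in range(max_rounds):
        hits = retrieve(query, corpus)
        if hits <= labeled:     # assumed termination: nothing new was retrieved
            break
        labeled |= hits
        # Count candidate related words in the retrieved documents.
        counts = Counter(
            word
            for i in sorted(hits)
            for word in corpus[i].split()
            if word not in query and word not in stopwords
        )
        # Feed the most frequent related words back into the query.
        query |= {word for word, _ in counts.most_common(top_k)}
    return labeled, query


corpus = [
    "exam schedule for university admission",
    "university admission scores released",
    "football league results",
    "admission scores and scholarship news",
]
labeled, query = auto_tag("exam", corpus, stopwords=frozenset({"for", "and"}))
print(sorted(labeled))
```

On this toy corpus the loop starts from the single word "exam", picks up related education terms over successive rounds, and never reaches the unrelated football document. With real data, the unbounded growth of the query would need a stricter constraint to limit topic drift.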
Keywords/Search Tags: news aggregation, text categorization, out-of-vocabulary word detection, training set auto-tagging