
Study On Chinese Website Ripping And Transcoding

Posted on: 2014-03-21
Degree: Master
Type: Thesis
Country: China
Candidate: N N Wu
GTID: 2268330401954310
Subject: Agricultural mechanization project
Abstract/Summary:
Based on an analysis of the characteristics of the GB2312, GBK, GB18030, Big5, and UTF-8 Chinese character encodings, this paper focuses on techniques for recognizing the character encoding of Chinese web documents. By comparing different text feature weightings (Boolean weighting, term frequency weighting, and term frequency-inverse document frequency weighting) and machine learning methods (multiple linear regression, Naive Bayes, K-nearest neighbor, and support vector machine), we propose a recognition model for Chinese character encodings that combines encoding-specific rules with page text features. Because UTF-8 has strict encoding rules, the model identifies it from those rules alone. The GB series and Big5, whose code spaces overlap, are identified from the text features of the web page. Tests showed that with the threshold (the number of UTF-8 characters divided by the total number of characters) set to 1, the model's recognition rate for UTF-8 is 100%. When the number of features is greater than 65, all four machine learning methods reach a 100% recognition rate on the GB series and Big5.

To unify character encodings for a vertical search engine for Chinese agriculture, the study designed and developed a generic module that automatically recognizes the Chinese character encoding of agricultural websites and converts all encodings to UTF-8. The module takes pages downloaded by the web crawler as input. It first extracts the Chinese text, then uses the encoding rules to determine whether the page is UTF-8. If it is not, the module applies Boolean weighting and the K-nearest neighbor algorithm, with the feature values determined in the experiments, to decide whether the page is encoded in Big5 or the GB series.
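The two-stage decision described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: the abstract does not say which text features were used, so the sketch assumes each distinct double-byte code in a page is one Boolean feature; the function names, the six toy training texts, and k = 3 are all hypothetical; and the Chinese-extraction step (stripping markup) is omitted.

```python
from collections import Counter

def _decodes(chunk: bytes) -> bool:
    """True if chunk is a well-formed UTF-8 sequence under strict decoding."""
    try:
        chunk.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def utf8_ratio(data: bytes) -> float:
    """Well-formed UTF-8 multi-byte sequences / all non-ASCII sequences tried.
    A simplified reading of the abstract's threshold ratio (assumption)."""
    valid = total = 0
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:              # ASCII is identical in every candidate encoding
            i += 1
            continue
        total += 1
        if 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            length = 0               # byte cannot start a UTF-8 sequence
        chunk = data[i:i + length]
        if length and len(chunk) == length and _decodes(chunk):
            valid += 1
            i += length
        else:
            i += 1
    return valid / total if total else 1.0

def double_byte_features(data: bytes) -> frozenset:
    """Boolean features: the set of double-byte codes present (assumed feature type)."""
    feats = set()
    i = 0
    while i < len(data) - 1:
        if data[i] >= 0x81:          # lead byte of a GB-series or Big5 character
            feats.add(data[i:i + 2])
            i += 2
        else:
            i += 1
    return frozenset(feats)

def classify(data: bytes, training, k: int = 3) -> str:
    """K-nearest neighbor vote using Hamming distance between Boolean feature sets."""
    feats = double_byte_features(data)
    ranked = sorted(training, key=lambda sample: len(feats ^ sample[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def to_utf8(data: bytes, training) -> bytes:
    """Rule check first; fall back to the classifier, then transcode to UTF-8."""
    if utf8_ratio(data) == 1.0:      # threshold of 1, as in the abstract
        return data
    codec = classify(data, training)
    return data.decode(codec, errors="replace").encode("utf-8")

# Toy training set (hypothetical): three Simplified and three Traditional texts.
SIMPLIFIED = ["农业网站提供种植技术信息", "中国农业科学研究", "玉米小麦水稻种植"]
TRADITIONAL = ["農業網站提供種植技術資訊", "中國農業科學研究", "玉米小麥水稻種植"]
TRAINING = (
    [(double_byte_features(t.encode("gbk")), "gb18030") for t in SIMPLIFIED]
    + [(double_byte_features(t.encode("big5")), "big5") for t in TRADITIONAL]
)

page = "中国农业网站种植玉米".encode("gbk")
print(to_utf8(page, TRAINING).decode("utf-8"))
```

The sketch decodes every GB-series page with Python's `gb18030` codec, which also covers GB2312 and GBK, so the classifier only has to separate the GB series from Big5, matching the two-way decision in the abstract.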
Keywords/Search Tags: Chinese character encoding identification, feature selection, feature weighting, machine learning, Web crawler, Chinese character encoding conversion