Research Of Key Techniques Of Large-scale Web Text Fast Categorization

Posted on: 2016-05-26
Degree: Master
Type: Thesis
Country: China
Candidate: Z W Hao
GTID: 2348330542975461
Subject: Computer application technology
Abstract/Summary:
The amount of information on the Internet has increased explosively in recent years. This mass of information makes information classification more challenging than before, and new standards and requirements have arisen. Early work on information classification focused mainly on accuracy and hardly considered time efficiency, which makes it hard to apply to information sorting today. There is an urgent need for information sorting methods that are not only accurate but also fast. This paper focuses on key problems in large-scale web information sorting, including text extraction, text de-duplication, and text classification. Existing work on text extraction is based on analyzing the structure of HTML tags, which is time-inefficient. A fast text extraction method based on the maximum successive substring is proposed. Unlike traditional methods, our method does not analyze the structure of HTML tags, but identifies the location of the main text based on the length of strings in the HTML code. After extraction, text de-duplication is conducted. A lot of information on the Internet is duplicated, and removing duplicates makes text classification easier. Traditional text de-duplication methods for large-scale web information can handle combinations of words, but permutations of words are not considered, which may lead to a decrease in accuracy. A text de-duplication method based on shingles is proposed. The method calculates local impact factors of word permutations and combines these factors with Simhash to decrease the error caused by word permutations. Text classification follows text de-duplication. The proposed text classification method is designed on a distributed parallel framework and uses an improved NB method for classification. First, an inverted tree with intermediate results stored in its index nodes is constructed, and classification is achieved by searching for keys in the tree. Furthermore, time efficiency is greatly improved by cross-optimizing horizontal and vertical pruning.
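
To make the word-permutation issue concrete, here is a minimal Python sketch of a generic Simhash computed over word shingles. It is not the method proposed in the thesis: the "local impact factors" are not specified in the abstract and are not modeled here. The function names, the shingle width `k=2`, the 64-bit fingerprint size, and the use of MD5 as the feature hash are all assumptions made for illustration.

```python
import hashlib


def word_shingles(text, k=2):
    """Return overlapping k-word shingles; unlike single words, they keep local word order."""
    words = text.lower().split()
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def hash64(token):
    """Map a feature string to a 64-bit integer (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(features, bits=64):
    """Classic unweighted Simhash: each feature votes +1 or -1 on every bit position."""
    counts = [0] * bits
    for feature in features:
        h = hash64(feature)
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, count in enumerate(counts):
        if count > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


doc_a = "the cat chased the dog"
doc_b = "the dog chased the cat"

# Single-word features: both documents contain the same multiset of words,
# so their fingerprints are identical and the permutation goes unnoticed.
unigram_distance = hamming(simhash(doc_a.split()), simhash(doc_b.split()))

# Shingle features: the two documents produce different shingle sets,
# so the fingerprints are computed from different evidence.
shingle_distance = hamming(simhash(word_shingles(doc_a)), simhash(word_shingles(doc_b)))
```

With single-word features, `unigram_distance` is 0 because the two sentences are permutations of each other. The shingle lists differ (`"cat chased"` appears only in the first, `"dog chased"` only in the second), which gives a Simhash-style fingerprint the chance to separate the two texts. In a de-duplication pipeline, two documents would typically be treated as near-duplicates when the Hamming distance falls below a chosen threshold.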
Keywords/Search Tags:HTML, content extracting, content distinction, rapid classification, distributed