Research Of Key Techniques Of Large-scale Web Text Fast Categorization

Posted on:2016-05-26

Degree:Master

Type:Thesis

Country:China

Candidate:Z W Hao

GTID:2348330542975461

Subject:Computer application technology

Abstract/Summary:

The amount of information on the Internet has grown explosively in recent years. This volume makes information classification more challenging than before and raises new standards and requirements. Early work on information classification focused mainly on accuracy and gave little consideration to time efficiency, which makes it hard to apply to information sorting today. There is an urgent need for classification methods that are both accurate and fast. This thesis addresses key problems in large-scale web information sorting: text extraction, text de-duplication, and text classification.

Existing work on text extraction is based on analyzing the structure of HTML tags, which is time-inefficient. A fast text extraction method based on the maximum successive substring is proposed. Unlike traditional methods, it does not analyze the structure of HTML tags; instead, it locates the main text from the lengths of the strings in the HTML code.

Text de-duplication follows extraction. Much of the information on the Internet is duplicated, and removing duplicates makes text classification easier. Traditional de-duplication methods for large-scale web information account for combinations of words but not for their permutations, which can reduce accuracy. A shingle-based text de-duplication method is proposed. It computes local impact factors for word permutations and combines these factors with Simhash to decrease the error caused by word permutations.

Text classification follows de-duplication. The proposed classification method is built on a distributed parallel framework and an improved NB method. First, an inverted tree is constructed with intermediate results stored in its index nodes, and classification is performed by searching for keys in the tree. Furthermore, time efficiency is greatly improved by cross-optimizing horizontal and vertical pruning.
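The de-duplication step described above combines shingles with Simhash so that word order affects the fingerprint. The abstract does not define its "local impact factors", so the following is only a minimal sketch of the general idea, not the thesis's method: it computes a plain, unweighted Simhash over word-level shingles. The function names (`shingles`, `simhash`, `hamming`), the 64-bit fingerprint size, and the use of MD5 as the feature hash are all assumptions made for illustration.

```python
import hashlib

BITS = 64  # fingerprint width (an assumption; the abstract does not state one)


def shingles(text, k=2):
    """Return the overlapping k-word windows (shingles) of text, in order."""
    words = text.lower().split()
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def simhash(features):
    """Compute an unweighted 64-bit Simhash from an iterable of string features."""
    counts = [0] * BITS
    for feature in features:
        # Take the first 8 bytes of the MD5 digest as a 64-bit feature hash.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog"
    b = "dog lazy the over jumps fox brown quick the"  # same words, reversed order
    # Single-word features ignore order, so the two fingerprints coincide.
    print(hamming(simhash(shingles(a, 1)), simhash(shingles(b, 1))))
    # Two-word shingles encode local order, so the fingerprints differ.
    print(hamming(simhash(shingles(a, 2)), simhash(shingles(b, 2))))
```

With single-word features the two texts are indistinguishable because they contain the same bag of words. Switching to two-word shingles makes the fingerprint sensitive to word permutations, which is the weakness of word-only methods that the abstract points out.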

Keywords/Search Tags:

HTML, content extracting, content distinction, rapid classification, distributed

Related items

1	Context-based content extraction of HTML documents
2	Network Analysis And Filtering Technology Research Based On The Content
3	Data content mining: Extracting and cataloging content-based metadata from satellite images (remote sensing)
4	Research On Content Extraction In HTML Web Pages Based Multi-Features
5	The Development Of Finical Content Management And Publication System
6	The Cdn System, Content Routing
7	Research On Content Based Classification Of Images
8	Content based image retrieval using evidence combination
9	Research On Content Scheduling Technologies In Content Networks
10	Research And Implementation Of Packet Classification In IPv6 Networks