
Research And Application Of Chinese Web Pages Automatic Classification

Posted on: 2008-10-02; Degree: Master; Type: Thesis
Country: China; Candidate: G H Xie; Full Text: PDF
GTID: 2178360242467051; Subject: Computer application technology
Abstract/Summary:
The rapid development of Internet technology has driven explosive growth in information. The Internet holds a huge amount of digital information, including text, graphics, audio and even video. Most of it is semi-structured or unstructured, so retrieving the most relevant information efficiently is one of the major goals of information processing. A web-page classification system categorizes web pages and shortens the time needed to organize online documents, which helps people find the information they need more easily. In recent years, automatic web-page classification, combined with search engines and with information push, delivery and filtering, has effectively improved information services.

Based on an analysis of the differences between web pages and ordinary text, an information extraction system using DOM tree parsing was developed. The system filters out Hub-type and Pic-type pages and removes HTML tags, advertisements, pictures and other irrelevant information from Theme-type pages, retaining the content and related information.

The main technologies involved in web-page classification, including information extraction, Chinese word segmentation, dimension reduction, text models, classification methods and evaluation criteria, are presented comprehensively and studied in depth. By analyzing the factors that affect term weights, the shortcomings of TF*IDF and the structural characteristics of web pages, a "TF*IDF*CHI" method is proposed for weight calculation. This method takes into account a feature's importance to the single document, to the document set and to the category, as well as the structure of web pages. It improves the descriptive power and category discriminability of valuable terms. Several experiments were then carried out to verify the work in this thesis. The average F1 improved by about 7%.

The web information extractor and web-page classifier were then applied in a criminal investigation information extraction system, which extracts criminal investigation information and publishes it by category, and supplies data to other systems such as an information comparison system, with good results.
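
The abstract does not give the exact form of the "TF*IDF*CHI" weight, so the following is only a minimal sketch of one plausible reading: the usual TF*IDF weight multiplied by the term's largest chi-square statistic over all categories. The function names (`chi_square`, `tf_idf_chi`), the max-over-categories choice and the toy data are assumptions for illustration. The web-page structure factor mentioned in the thesis is not modelled here.

```python
import math
from collections import Counter


def chi_square(term, category, docs, labels):
    """Chi-square statistic of `term` with respect to `category`.

    a: docs in the category containing the term
    b: docs outside the category containing the term
    c: docs in the category without the term
    d: docs outside the category without the term
    """
    n = len(docs)
    a = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab == category)
    b = sum(1 for doc, lab in zip(docs, labels) if term in doc and lab != category)
    c = sum(1 for doc, lab in zip(docs, labels) if term not in doc and lab == category)
    d = n - a - b - c
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def tf_idf_chi(docs, labels):
    """Return one {term: weight} dict per document.

    Assumed form: weight = tf(term, doc) * idf(term) * max_c chi2(term, c).
    """
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    categories = set(labels)
    chi_max = {
        term: max(chi_square(term, cat, docs, labels) for cat in categories)
        for term in doc_freq
    }
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / doc_freq[term]) * chi_max[term]
            for term, count in counts.items()
        })
    return weights


# Toy corpus of already-segmented documents (hypothetical data).
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "win"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]
weights = tf_idf_chi(docs, labels)
```

In this toy corpus, "goal" appears only in sports documents while "team" appears in both categories, so the chi-square factor raises the weight of "goal" relative to "team" in the first document. That is the kind of category discriminability the abstract attributes to the method.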
Keywords/Search Tags: Information Extraction, Web Page Classification, Feature Selection, Vector Space Model, Support Vector Machine