Text Classification Based On Natural Dimension Of Webpage

Posted on:2014-08-04

Degree:Master

Type:Thesis

Country:China

Candidate:L Zhang

GTID:2298330422990608

Subject:Computer technology

Abstract/Summary:

In research on text classification and search engines, building a corpus efficiently is very important. However, this work is usually done manually, so building a corpus takes a great deal of time and manpower. Such a corpus is also inflexible: if we want to change some of its categories, we must change them by hand, which means much of the work has to be redone.

On the Internet, the pages of most websites are organized by information category. This classification is visible in the navigation, which usually reflects how the website classifies its information. We can therefore label the pages of a corpus by their navigation categories. Classifying by label alone yields a corpus with a lot of noise, so we then use a clustering method to remove the noise. The whole system consists of the following steps.

First, we analyze the structure of the collected pages. Each page is converted into several blocks such that the links within a block have the same function; the navigation is then contained in one of these blocks. Using a rule-based method together with link analysis, we identify the navigation.

Second, we take the category anchors from the navigation for analysis. Candidate keywords are used to find the category words in the navigation. The pages that the links refer to are processed to determine whether they have content and to discover the structure of the website. We analyze the website's classification system and compare it with our own to make sure each category is one we need. By computing tag ratios and smoothing these values, we extract the content of the pages.

Third, because classification standards differ and there are many cheating links, the corpus obtained this way is not pure and contains a lot of noise, so it needs further processing. Here we use cluster analysis to obtain distribution information. By removing the clusters that lie far away from most other clusters, we obtain a corpus with higher precision.

To test the corpus, we use it as the training corpus for an SVM classifier and find that the classifier achieves high-precision classification results. The system works well on both English and Chinese corpora. This shows that our corpus-building system can produce a high-precision corpus under a changeable classification system and is effective for text categorization and search engines.
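
The content-extraction step (computing tag ratios and smoothing them) can be illustrated with a minimal sketch. The abstract does not give the thesis's exact formula, so everything here is an assumption made for illustration: the ratio is taken per line as text characters divided by tag count, the smoothing is a simple moving average, the cut-off is the mean of the smoothed values, and the function names (`text_to_tag_ratios`, `smooth`, `extract_content`) are hypothetical.

```python
import re

# Matches any HTML tag; a rough approximation that is enough for a sketch.
TAG_RE = re.compile(r"<[^>]*>")


def text_to_tag_ratios(html_lines):
    """For each line, divide the number of non-tag characters by the tag count.

    Lines without tags use a divisor of 1 (an assumption, to avoid dividing by zero).
    """
    ratios = []
    for line in html_lines:
        tag_count = len(TAG_RE.findall(line))
        text = TAG_RE.sub("", line).strip()
        ratios.append(len(text) / max(tag_count, 1))
    return ratios


def smooth(values, radius=1):
    """Moving average over a window of 2 * radius + 1, truncated at the edges."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def extract_content(html, radius=1):
    """Keep the text of lines whose smoothed ratio exceeds the mean smoothed ratio.

    Using the mean as the threshold is an assumption of this sketch.
    """
    lines = [ln for ln in html.splitlines() if ln.strip()]
    if not lines:
        return []
    smoothed = smooth(text_to_tag_ratios(lines), radius)
    threshold = sum(smoothed) / len(smoothed)
    return [
        TAG_RE.sub("", ln).strip()
        for ln, score in zip(lines, smoothed)
        if score > threshold
    ]


if __name__ == "__main__":
    sample = "\n".join([
        '<div><a href="/a">Home</a><a href="/b">News</a><a href="/c">Sports</a></div>',
        "<p>The match ended late on Sunday after a long delay caused by heavy rain.</p>",
        "<p>Both teams agreed to replay the final next week at the same stadium.</p>",
        '<div><a href="/x">About</a><a href="/y">Contact</a></div>',
    ])
    for text in extract_content(sample):
        print(text)
```

Link-heavy lines such as navigation bars have many tags and little text, so their ratio is low, while body paragraphs score high. Smoothing keeps a short line inside a long article from being dropped just because its own ratio is small.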

Keywords/Search Tags:

text classification, link analysis, Internet information extraction, information retrieval, text cluster

Related items

1	Combining text-, link-, and classification-based retrieval methods to enhance information discovery on the Web
2	Algorithm Research For Text Information Retrieval Based On Web
3	Study Of Text Information Retrieval Algorithms Based On Web
4	Research On The Key Techniques Of Web Information Intelligent Acquisition
5	Web Text Classification System For Chinese Pretreatment Technology
6	Analysis Of Text Information Based On Deep Learning
7	Research On Several Problems In Text Retrieval
8	Information Retrieval Oriented Text Classification Technology Research
9	Information Retrieval Oriented Analysis Of Text Content
10	Study On Information Retrieval Of Quality Internet Public Opinion Monitoring System