Text Classification Based On Natural Dimension Of Webpage

Posted on: 2014-08-04
Degree: Master
Type: Thesis
Country: China
Candidate: L Zhang
GTID: 2298330422990608
Subject: Computer technology
Abstract/Summary:
In text classification and search engine research, building a corpus efficiently is very important. This work is usually done by hand, so it costs a great deal of time and manpower. A manually built corpus is also inflexible: changing some of its categories means relabeling documents by hand, so much of the work has to be redone.

On the Internet, most websites organize their pages by the category of the information they carry, and this classification is visible in the site navigation. We can therefore label the pages of a corpus with the categories shown in the navigation. A corpus labeled this way alone contains a lot of noise, so we then apply clustering to remove it. The whole system consists of the following steps.

First, analyze the structure of the collected pages. Each page is split into several blocks, each holding links that serve the same function, so the navigation falls into one of the blocks. A rule-based method combined with link analysis then identifies the navigation block.

Second, extract the category anchors from the navigation. Candidate keywords are used to find the category words in the navigation. The pages that the links point to are processed to determine whether they contain content and to explore the structure of the website. The site's classification system is analyzed and compared with our own to make sure each category is one we need. The content of each page is extracted by computing tag ratios and smoothing the resulting values.

Third, remove noise. Because websites use different classification standards and contain many deceptive links, the corpus is not pure and needs further processing. We use cluster analysis to obtain the distribution of the documents, and by removing the clusters that lie far from most other clusters we obtain a corpus with higher precision.

To test the corpus, we use it as the training corpus for an SVM classifier and find that the classifier achieves high-precision classification results. The system works well on both English and Chinese corpora. This shows that our corpus-building system can produce a high-precision corpus under a changeable classification system and is effective for text classification and search engines.
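
The abstract mentions extracting page content by computing tag ratios and smoothing them, but gives no details. The sketch below is one plausible reading, not the thesis's actual implementation. It assumes the ratio is taken per HTML line as visible characters divided by tag count, that smoothing is a simple moving average, and that lines scoring above the mean are kept. The function names, the `radius` parameter, and the mean threshold are all illustrative assumptions.

```python
import re

# Matches any HTML tag; a rough approximation, adequate for a sketch.
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratios(html_lines):
    """Per-line ratio of visible characters to tag count (assumed definition)."""
    ratios = []
    for line in html_lines:
        tags = TAG_RE.findall(line)
        text = TAG_RE.sub("", line).strip()
        # Lines without tags divide by 1 so the ratio is just the text length.
        ratios.append(len(text) / max(len(tags), 1))
    return ratios


def smooth(values, radius=1):
    """Moving average over 2*radius+1 neighbours, clipped at both ends."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out


def extract_content(html_lines, radius=1):
    """Keep the text of lines whose smoothed ratio exceeds the mean (assumed threshold)."""
    smoothed = smooth(tag_ratios(html_lines), radius)
    if not smoothed:
        return []
    threshold = sum(smoothed) / len(smoothed)
    kept = []
    for line, score in zip(html_lines, smoothed):
        text = TAG_RE.sub("", line).strip()
        if text and score > threshold:
            kept.append(text)
    return kept
```

Link-heavy lines such as navigation bars have many tags and little text, so they score low, while paragraph lines score high. Smoothing keeps a short line that sits between long content lines from being dropped on its own.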
Keywords/Search Tags: text classification, link analysis, Internet information extraction, information retrieval, text clustering