
The Research And Application Of Related Technology On Website-Oriented Labeling System

Posted on: 2015-08-25
Degree: Master
Type: Thesis
Country: China
Candidate: W P Gao
Full Text: PDF
GTID: 2298330467963258
Subject: Signal and Information Processing
Abstract/Summary:
With the exponential growth in the number of websites in recent years, the Internet now holds far more information than people can read, and finding a specific type of website has become difficult. This motivates a method for classifying a website as a whole. Most current research on website classification focuses on single-label classification. To address the problem better, we present a multi-label labeling system that can tag a website with more than one label.

The introduction of this paper first describes the background and significance of website labeling, the existing research in this domain, and the scope of this work. It then introduces the related technologies, including web crawlers, information extraction, and text classification algorithms. Finally, it poses the questions on which this paper focuses: (1) how to analyze the structure of a website and extract structural information from it; (2) how to locate and extract the "needed" content information from a web page; (3) how to tag a website using the structural information and the "needed" content information.

The main work of this paper falls into the following parts:

1) Backtracking of website topology and extraction of structural features. The structure of a website can be described in two ways: the physical structure, based on the directories in which files are organized on the server, and the link structure, based on the links among web pages. Neither structure reflects the hierarchy of the website clearly. This paper presents a method that uses link topology to backtrack the website into a well-arranged hierarchy, namely the homepage, the list pages, and the content pages. Experimental results show that the site hierarchy backtracking algorithm performs well.

2) Location and extraction of content information in a web page. Most of the information on a website comes from its content pages, which make up the largest part of the site. It is therefore necessary to separate the content information from web noise. This paper presents an improved DSE algorithm that extracts content from a web page by combining the DSE algorithm with paragraph statistical rules. Compared with DSE and other similar algorithms, the improved DSE gives better and more satisfactory results.

3) The labeling system for a website. Because the category feature samples are unevenly distributed, this paper presents an attribute-weighted approach for ML-KNN in which category features with more samples are weighted lower and category features with fewer samples are weighted higher, so that the uneven samples appear more balanced. Experiments show that the attribute-weighted S-ML-KNN achieves better accuracy in multi-label classification.
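
As a rough illustration of the weighting idea in part 3), the following Python sketch assigns each label a weight inversely proportional to its number of training samples, so that frequent labels are weighted lower and rare labels higher. The inverse-frequency formula, the function name `inverse_frequency_weights`, and the sample labels are assumptions made for illustration; the abstract does not give the formula actually used in S-ML-KNN.

```python
from collections import Counter


def inverse_frequency_weights(label_sets):
    """Return one weight per label, inversely proportional to its sample count.

    Assumption: this inverse-frequency formula is illustrative only; the
    abstract does not state the weighting actually used by S-ML-KNN.
    """
    # Count how many training samples carry each label.
    counts = Counter(label for labels in label_sets for label in labels)
    if not counts:
        return {}
    raw = {label: 1.0 / n for label, n in counts.items()}
    # Normalize so the weights average to 1 and stay on a comparable scale.
    mean_raw = sum(raw.values()) / len(raw)
    return {label: w / mean_raw for label, w in raw.items()}


# Hypothetical training data: each website carries one or more labels.
training_labels = [
    {"news"}, {"news"}, {"news", "sports"}, {"news"},
    {"shopping"}, {"news", "shopping"},
]
weights = inverse_frequency_weights(training_labels)
```

In a weighted ML-KNN variant, weights like these could scale each label's contribution when neighbors vote, so that rare labels are not drowned out by frequent ones.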
Keywords/Search Tags: website labeling, website feature, content extraction, multi-label classification