The Research And Application Of Related Technology On Website-Oriented Labeling System

Posted on:2015-08-25

Degree:Master

Type:Thesis

Country:China

Candidate:W P Gao

Full Text:PDF

GTID:2298330467963258

Subject:Signal and Information Processing

Abstract/Summary:

With the exponential growth of websites in recent years, the Internet holds far more information than people can read, and finding websites of a specific type has become difficult. This motivates a method for classifying a website as a whole. Most current studies of website classification focus on single-label classification. To address the problem better, we present a multi-label system that can tag a website with more than one label.

The introduction covers the background and significance of website labeling, the research already done in this domain, and the scope of this paper. The paper then introduces the relevant technologies: web crawlers, information extraction, text classification algorithms, and so on. Finally, it focuses on three questions: (1) how to analyze the structure of a website and extract structural information from it; (2) how to locate and extract the 'needed' content information from a web page; (3) how to tag a website using the structural information and the 'needed' content information.

The main work of this paper falls into the following parts:

1) Backtracking of website topology and extraction of structural features. The structure of a website can be described in two ways: the physical structure, based on the directories in which files are organized on the server, and the link structure, based on the links among web pages. Neither structure reflects the hierarchy of the website clearly. This paper presents a method that uses link topology to backtrack the website into a well-arranged hierarchy, namely the homepage, the list pages, and the content pages. Experimental results show that the site hierarchy backtracking algorithm performs well.

2) Location and extraction of content information in a web page. Most of a website's information comes from its content pages, which make up most of the site. It is therefore necessary to separate the content information from web noise. This paper presents an improved DSE algorithm that extracts content from a web page by combining the DSE algorithm with paragraph statistical rules. Compared with DSE and other similar algorithms, the improved DSE performs better and gives more satisfactory results.

3) The labeling system for a website. Because the samples of category features are unevenly distributed, this paper presents an attribute-weighted approach for ML-KNN in which category features with more samples are weighted lower and category features with fewer samples are weighted higher, so that the uneven samples appear more balanced. Experiments show that the attribute-weighted S-ML-KNN achieves better accuracy in multi-label classification.
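
The abstract does not state the weighting formula used for the attribute-weighted ML-KNN. As a minimal sketch, assuming weights inversely proportional to each label's sample count (the function names and the normalization below are illustrative assumptions, not the thesis's S-ML-KNN), the idea of weighting frequent categories lower and rare categories higher could look like this in Python:

```python
from collections import Counter


def inverse_frequency_weights(label_sets):
    """Return one weight per label; labels with more samples get lower weights.

    Assumed formula (not taken from the thesis):
        weight = total_label_occurrences / (number_of_labels * label_count)
    so every weight is 1.0 when all labels are equally frequent.
    """
    counts = Counter(label for labels in label_sets for label in labels)
    total = sum(counts.values())
    n_labels = len(counts)
    return {label: total / (n_labels * count) for label, count in counts.items()}


def weighted_neighbor_votes(neighbor_label_sets, weights):
    """Sum the weighted votes that the nearest neighbors cast for each label."""
    votes = Counter()
    for labels in neighbor_label_sets:
        for label in labels:
            # Labels never seen during training fall back to a neutral weight.
            votes[label] += weights.get(label, 1.0)
    return dict(votes)


# Hypothetical training labels: "news" is over-represented.
training_labels = [{"news", "sports"}, {"news"}, {"news"}, {"shopping"}]
weights = inverse_frequency_weights(training_labels)

# Hypothetical label sets of the two nearest neighbors of a new website.
votes = weighted_neighbor_votes([{"news"}, {"news", "sports"}], weights)
```

With these assumed weights, a single vote for the rare label "sports" outweighs two votes for the frequent label "news", which is the balancing effect the abstract describes.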

Keywords/Search Tags:

website labeling, website feature, content extraction, multi-label classification

Related items

1	Research On Key Technologies About Labeling The Content Of Internet Websites By Using Multi-tag
2	Research And Realization Of Labeling Techniques Of Internet Website
3	Research On Multi-label Classification Under Labeling Noise And Feature Construction
4	The Research On Tag Library For Labeling The Internet Website
5	Research On Data Extraction For Agency Website
6	Design And Implementation Of Multi-dimensional Website Evaluation System
7	Research On Vertical Search Method Of University Website Group
8	Design And Implementation Of Elementary And Middle Schools' Website Group Platform Based On ASP.NET
9	Design And Implementation Of Website Publishing System Based On SVM
10	The Research Of Product Feature Extraction In B2C Website