
Research On Web Text Hierarchical Categorization Technologies Based On Semantic

Posted on: 2016-06-09
Degree: Master
Type: Thesis
Country: China
Candidate: B Zhang
GTID: 2308330461492018
Subject: Computer application technology
Abstract/Summary:
The arrival of the big data era has made daily life more convenient and has made browsing and searching for information more efficient. Over time, however, people have found that obtaining the knowledge they actually need takes longer and longer: big data brings a great deal of useful information, but it also brings more useless information. How to access the required knowledge quickly and accurately has therefore become a widely discussed problem, and Web text mining technology has emerged to address it.

Web text information extraction and text classification are two important branches of Web text mining. Web text information extraction first extracts useful information from a Web page and then organizes it in a structured format. The extracted information consists of the Web page title and the Web page content. The title is the simplest and clearest overview of the page, and it is very important for Web page extraction and its applications. Web text classification then classifies the structured data, which is represented in DOM (Document Object Model) tree format. Knowing the category of a piece of information makes it much more convenient to browse and retrieve. This dissertation uses hierarchical text classification, in which the classes are organized as a tree. To a certain extent this matches the way people habitually retrieve information; the Yahoo! site, for example, uses this kind of hierarchical tree structure.

This dissertation introduces the concepts of Web information extraction and Web text classification and reviews the research status at home and abroad. Building on previous research, we first propose a Web text information extraction method based on hyperlinks and DOM-tree-based title extraction, and then present a semantic-based multi-level Web text classification method.
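As a rough illustration of the hyperlink-based side of the extraction method, the sketch below collects (URL, anchor text) pairs from a hub page using Python's standard `html.parser`. The names `AnchorCollector` and `extract_links` and any sample markup are hypothetical, and the real-time analysis model and the matching of titles to release times used in the dissertation are not reproduced here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from a hub page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the hub page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html, base_url):
    """Return every (url, anchor_text) pair found in the given HTML."""
    parser = AnchorCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Each returned pair is a candidate page URL with a candidate title; anchors without an `href` attribute are skipped.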
The details are as follows.

We first propose a Web page information extraction method based on hyperlinks and the DOM tree. The method first applies a real-time analysis model to the hub Web page, and then uses a hyperlink-based approach, together with the correspondence between titles and release times, to obtain the URL of each page and its corresponding anchor text. If the anchor text is not the title of the text page, we fetch the page's HTML source code and build a DOM tree for the corresponding theme-based Web page. Based on the visual characteristics of the Web page title, we then traverse the DOM tree in depth-first order to find the title. The experimental results demonstrate that this method achieves high accuracy and is simple to implement.

We then propose a Web text classification method based on the Web page title and a semantic knowledge base. Specifically, we first establish a knowledge base of categories and then use domain knowledge to determine which category a document belongs to. For titles that this step misses, we use a semantic similarity algorithm based on How-net within a top-down hierarchical classification method: we compute the semantic similarity between the test document and each category, and assign the test document to the category with the maximum similarity. Experimental results show that this method can meet practical demands.
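The top-down hierarchical classification step can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's implementation: the `Category` structure and the `similarity` and `classify` functions are hypothetical, and a plain Jaccard word overlap stands in for the How-net-based semantic similarity, which is not specified in this abstract.

```python
from dataclasses import dataclass, field


@dataclass
class Category:
    """A node in the category tree (hypothetical structure)."""
    name: str
    keywords: set
    children: list = field(default_factory=list)


def similarity(words, category):
    """Jaccard overlap; a stand-in for a How-net-based semantic similarity."""
    union = words | category.keywords
    return len(words & category.keywords) / len(union) if union else 0.0


def classify(words, root):
    """Descend from the root, choosing the most similar child at each level.

    Returns the list of category names along the chosen path.
    """
    path = []
    node = root
    while node.children:
        node = max(node.children, key=lambda c: similarity(words, c))
        path.append(node.name)
    return path
```

At each level only the children of the category chosen so far are compared, so a document is scored against one branch of the tree rather than against every leaf category.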
Keywords/Search Tags: information extraction, Web page title, theme-based Web page, hierarchical classification, semantic