
Design And Implementation Of An Automatic Collection And Classification System For Web Text

Posted on: 2018-01-22
Degree: Master
Type: Thesis
Country: China
Candidate: J W Yu
GTID: 2428330569485292
Subject: Electronics and Communications Engineering
Abstract/Summary:
With the continuous development of the Internet, the number of netizens keeps rising and the amount of network information is growing exponentially. Faced with this volume of information across a wide range of subjects, users need to search for and locate the information they want, which depends in large part on search engine technologies, information storage strategies and text categorization methods. Because web text is unstructured and complicated, a single flat category cannot accurately characterize its content, so a hierarchical category structure should be used to classify it. Each online information dissemination platform may cover only a small share of the useful web texts, and category organizations often differ between platforms. We therefore propose a system that automatically collects user-specified content and classifies it according to a customized category organization, in order to summarize web text within a specific domain.

The functional modules of the system are text acquisition, duplicate detection, hierarchical classification and web interaction. The acquisition module is built around a web crawler, which collects web text starting from seed URLs according to user-defined acquisition rules.

For duplicate detection, this thesis proposes a cascade filtering method that combines Simhash fingerprints with document feature vectors constructed from word vectors. In the first filtering phase we increase the Hamming distance threshold on the Simhash fingerprints; in the second phase we select an appropriate similarity threshold between the word-vector document feature vectors to filter the remaining candidates. In experiments on Sogou's news data set, the method achieves better accuracy and recall than the Simhash baseline.

The text classification module uses a top-down approach. It builds a hierarchical category tree according to the classification levels and places a multi-class classifier on each non-leaf node. Our improvements are the following: (1) We train two local classifiers, one on document features based on Word2Vec word vectors and one on TF-IDF features, and combine their results to make the final decision of the local classifier. (2) We obtain a lexicon of category synonyms from the Word2Vec model, take relevance as the weight of each category word, and calculate the sum of the scores of the category words in the test text. (3) We propose a non-mandatory classification method, which assumes there is some probability of assigning the test text to an unknown category; if the unknown probability is greater than the probability of a specific category, the text is labeled unknown and the classification procedure ends. According to the experiments, the proposed hierarchical classification algorithm performs better than the typical top-down classification method.

This thesis also implements an online web text collection and classification system with a B/S (browser/server) architecture. The server is built on Flask. Users access and operate the system through a web browser to control the collection and classification process. Users can register with the system through the web page without installing any software, which greatly simplifies operation. The system has high accuracy and fast response speed while using little memory.
Keywords/Search Tags: Web text collection and classification, Web crawler, Hierarchical classification, Word vector, Duplicate detection
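
The cascade duplicate filter described in the abstract might be sketched roughly as follows. This is an illustrative sketch only, not the thesis implementation: the function names, the use of MD5 as the token hash, and both threshold values are assumptions, since the abstract gives none of them. The document vectors are taken as plain lists of floats supplied by the caller (for example, averaged word vectors, which is also an assumption).

```python
import hashlib
import math
from collections import Counter

BITS = 64  # fingerprint width; 64 bits is a common choice for Simhash


def simhash(tokens):
    """Return a 64-bit Simhash fingerprint for a list of tokens."""
    weights = [0] * BITS
    for token, count in Counter(tokens).items():
        # Assumption: MD5 (first 8 bytes) as the per-token hash function.
        digest = hashlib.md5(token.encode("utf-8")).digest()[:8]
        h = int.from_bytes(digest, "big")
        for i in range(BITS):
            # Add the term frequency for a 1 bit, subtract it for a 0 bit.
            weights[i] += count if (h >> i) & 1 else -count
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def is_duplicate(tokens_a, vec_a, tokens_b, vec_b,
                 max_hamming=10, min_cosine=0.9):
    """Two-stage check; both thresholds are hypothetical placeholder values."""
    # Stage 1: loose Simhash filter that keeps a wide set of candidates.
    if hamming(simhash(tokens_a), simhash(tokens_b)) > max_hamming:
        return False
    # Stage 2: stricter check on the word-vector document features.
    return cosine(vec_a, vec_b) >= min_cosine
```

The first stage is cheap and deliberately permissive, so that near-duplicates are not lost early; the second stage then removes false candidates using the denser document representation.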