
Some Studies on Web Mining

Posted on: 2006-04-17    Degree: Doctor    Type: Dissertation
Country: China    Candidate: J C Xu    Full Text: PDF
GTID: 1118360155953721    Subject: Computer application technology

Abstract/Summary:
The rapid development and popularization of the World Wide Web have made people truly realize that the ocean of data is boundless. Faced with such an enormous data resource, people urgently need new technologies and automated tools to help turn it into useful knowledge and information. Such technology should not only retrieve the surface information of the data, but also, on the basis of a full understanding of the data, uncover the implicit information and the inherent relations among data attributes, that is, obtain important knowledge. Web mining offers a powerful means of transforming vast amounts of data into useful information and knowledge. The characteristics of the Internet, however, confront Web mining with several difficulties: the amount of data is huge, changes dynamically and grows at a surprising rate; each website organizes its information differently; data analysis must integrate data that lack a unifying structure; Web pages are far more complex than any traditional collection of text documents and have no fixed ordering; and the information a user needs is only a small portion of the information on the Web, which is difficult to handle accurately and efficiently.

According to the characteristics and demands of enterprise marketing websites, this dissertation proposes a prototype architecture for an intelligent website and investigates several key Web mining technologies in light of the function of each component of the prototype. The goal is to improve the precision and efficiency of Web mining. The main work focuses on key technologies of Web content mining: Web text feature selection, Web page classification, reduction of massive data, and Web data extraction. The proposed methods have been validated experimentally. The main contributions are as follows:

1. The concept and characteristics of Web mining are analyzed, and its classification and the current state of development and application of Web mining technology are summarized. In accordance with the characteristics and demands of an intelligent enterprise marketing website, a new system architecture for an intelligent website based on Web mining is put forward. Information gathering and information services are provided by using information search, Web content mining, Web structure mining and Web usage mining together. A website built on this architecture can perform information retrieval, extraction, classification, automatic delivery and information services. The dissertation describes the function of every part of this system architecture.

2. Web text feature selection is the task of representing a Web text by automatically extracting a set of keywords that describe its theme, forming a feature vector; it influences the quality of the subsequent step, the classification of Web text. Since genetic algorithms have a strong best-first search capability and the number of feature words in a Web page is unknown, a new feature selection method for Chinese Web pages based on a messy genetic algorithm is put forward. Through synonym substitution of keywords and by jointly considering characteristics of the text such as term frequency, position, word length and visual effect, a new keyword weighting function is proposed. The method represents each feature term as a gene and adopts a symbolic encoding to form chromosomes of indefinite length, and the parents that participate in the next generation of evolution are chosen from both the parent and offspring generations to prevent too many blind changes. A special juxtapositional-phase strategy and a mutation operator are designed, which not only guarantee that every word is a candidate feature word, but also guarantee that the result depends only on fitness and is unaffected by the length of the initial chromosome. Experimental results show that both the reduction ratio and the accuracy are high.
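The abstract does not give the exact form of the keyword weighting function, only the factors it combines (frequency, position, word length, visual effect). The following is a minimal illustrative sketch in Python of such a combined weight; the coefficient values and the feature names (in_title, is_emphasized, and so on) are assumptions made for illustration, not the dissertation's actual formula.

    # Illustrative sketch only: the dissertation's real weighting function is not
    # given in this abstract. Coefficients and feature names are assumed.

    def keyword_weight(freq, max_freq, in_title, in_heading, word_length, is_emphasized,
                       w_freq=0.5, w_pos=0.3, w_len=0.1, w_vis=0.1):
        """Combine term frequency, position, word length and visual effect
        into a single keyword weight."""
        freq_score = freq / max_freq if max_freq else 0.0              # normalized frequency
        pos_score = 1.0 if in_title else (0.6 if in_heading else 0.3)  # position in the page
        len_score = min(word_length / 4.0, 1.0)                        # longer terms tend to be more specific
        vis_score = 1.0 if is_emphasized else 0.0                      # bold, large font, etc.
        return (w_freq * freq_score + w_pos * pos_score
                + w_len * len_score + w_vis * vis_score)

    # Example: a term appearing 5 times (maximum term frequency 10 in the page),
    # found in a heading, 3 characters long and displayed in bold.
    print(keyword_weight(5, 10, in_title=False, in_heading=True,
                         word_length=3, is_emphasized=True))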
3. Classification is the process of predicting the class of objects whose class label is unknown, and research on classification algorithms is a key problem in data mining. An analysis of the classification methods used for Web classification shows their limitations in categorizing Web page texts that belong to more than one class. This dissertation extends the concept of the equilabeled set in the lattice machine: it defines the least upper bound as the intersection of set algebra, defines the ordering relation as the inclusion relation of set algebra, and presents the concept of the intersection label together with an algorithm to obtain it, so as to solve the problem of multiple decision attribute values for the same condition in a decision system. The algorithm is used to categorize Web documents that belong to more than one class. The equilabeled set of the lattice machine is a special case of the intersection label in which each object has a single decision attribute value. The existence of an interior cover in any decision system is proved. Experiments indicate that the method is effective.
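The abstract does not spell out how intersection labels are computed. The sketch below illustrates only the underlying set-algebra idea, under the assumption that objects sharing the same condition values receive the intersection of their label sets; the function and variable names are not the dissertation's notation.

    # Illustrative sketch only: assumes that objects sharing the same condition
    # values receive the intersection of their individual label sets.

    def intersection_labels(objects):
        """objects: list of (condition_tuple, set_of_class_labels) pairs.
        Returns a mapping condition_tuple -> intersection of every label set
        observed for that condition."""
        merged = {}
        for cond, labels in objects:
            if cond in merged:
                merged[cond] &= labels          # keep only labels shared so far
            else:
                merged[cond] = set(labels)
        return merged

    # Example: two documents with the same condition attributes but different label sets.
    docs = [
        (("sports", "long"), {"news", "sports"}),
        (("sports", "long"), {"sports", "entertainment"}),
        (("tech", "short"), {"technology"}),
    ]
    print(intersection_labels(docs))
    # {('sports', 'long'): {'sports'}, ('tech', 'short'): {'technology'}}

When every object carries a single label, a non-empty intersection means that all condition-equivalent objects share that label, which is consistent with the statement that the equilabeled case is a special case of the intersection label.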
4. At present, object (tuple) reduction mostly adopts lattice-algebra-based methods, while attribute (field) reduction mostly adopts rough-set-based methods, and the time complexity of both kinds of algorithms is very high. This dissertation discusses the feasibility of combining the lattice machine with rough sets based on equivalence relations, and proposes a new, efficient data reduction algorithm that, building on lattice learning, can reduce the data both vertically and horizontally. The algorithm constructs hypertuples between the minimal E-set and the maximal E-set by a density-based method and uses the hypertuples to evaluate the importance degree of each attribute automatically; under an acceptable classification precision and complexity, it therefore reduces rows and columns together and obtains reduced classification rules. For higher efficiency, the algorithm does not try to find a minimal sufficient feature subset. Instead, it deletes an attribute from the attribute set of the hypertuples and then examines how the equivalence relation changes: if the change is significant, the deleted attribute has a greater importance degree; otherwise its importance degree is relatively small (a sketch of this importance test is given at the end of this abstract). In this way the weight of each attribute can be appraised, attributes with smaller importance degrees can be reduced, and classification rules with a precision the user approves of can be obtained. The algorithm thus provides a way to estimate attribute importance automatically, without depending on domain experts. When the attributes that make up the classification rules are listed in order of their importance degree, objects to be classified that fail an attribute constraint can be excluded in that order during classification, so the size of the data set shrinks quickly, fewer attributes need to be compared, and categorization efficiency is greatly improved. The time complexity of the algorithm is analyzed, and experiments on UCI data sets are carried out and compared. The experimental results indicate that the algorithm greatly improves data reduction efficiency.

5. The current state of Web information extraction technology and the sources of Web information extraction knowledge are analyzed. Automatically extracting data from Web pages that contain multiple data records is a complicated, difficult and meaningful task for applications. The paper...
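Item 4 evaluates an attribute by removing it and observing how the equivalence relation changes. The real algorithm works on hypertuples built by a density-based method, which this abstract does not detail; the sketch below therefore uses plain tuples and a simple class-count measure of change, purely as an assumed illustration of the idea.

    # Illustrative sketch only: measures how much the partition induced by the
    # condition attributes changes when one attribute is removed. The dissertation's
    # algorithm works on hypertuples; this sketch uses plain tuples and counts
    # equivalence classes as a stand-in for the change measure.

    from collections import defaultdict

    def partition(rows, attrs):
        """Group row indices by their values on the given attribute positions."""
        classes = defaultdict(list)
        for i, row in enumerate(rows):
            classes[tuple(row[a] for a in attrs)].append(i)
        return list(classes.values())

    def attribute_importance(rows, attrs):
        """For each attribute, importance = drop in the number of equivalence
        classes when that attribute is deleted (bigger drop = more important)."""
        full = len(partition(rows, attrs))
        importance = {}
        for a in attrs:
            reduced = len(partition(rows, [b for b in attrs if b != a]))
            importance[a] = full - reduced
        return importance

    # Example: attributes 0 and 1 distinguish the rows, attribute 2 is redundant.
    data = [
        (1, 0, 9),
        (1, 1, 9),
        (2, 0, 9),
        (2, 1, 9),
    ]
    print(attribute_importance(data, [0, 1, 2]))
    # {0: 2, 1: 2, 2: 0}  -> attribute 2 can be reduced first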
Keywords/Search Tags: Studies