A Tool For Web Usage Mining
Jose M. Domenech¹ and Javier Lorenzo²
¹ Hospital Juan Carlos I
Real del Castillo 152 - 3571 Las Palmas - Spain
jdomcab@gobiernodecanarias.org
² Inst. of Intelligent Systems and Num. Applic. in Engineering
Univ. of Las Palmas
Campus Univ. de Tafira - 35017 Las Palmas - Spain
jlorenzo@iusiani.ulpgc.es
Abstract. This paper presents a tool for web usage mining. The aim
is to provide a tool that facilitates the mining process rather than
to implement elaborate algorithms and techniques. The tool covers
different phases of the CRISP-DM methodology, such as data preparation,
data selection, modeling and evaluation. The algorithms used in the
modeling phase are those implemented in the Weka project. The tool
has been tested on a web site to find access and navigation patterns.
1 Introduction
Discovering knowledge from large databases has received great attention during
the last decade, with data mining as the main tool for doing so [1]. The World
Wide Web has been considered the largest repository of information, but it lacks
a well-defined structure. The World Wide Web is therefore a good environment
for data mining, an activity that has received the name Web Mining [2, 3].
Web mining can be divided into three main topics: Content Mining, Structure
Mining and Usage Mining. This work focuses on Web Usage Mining (WUM),
which has been defined as "the application of data mining techniques to discover
usage patterns from Web data" [4]. Web usage mining can provide organizations
with usage patterns from which to obtain customer profiles, so that they can
make website browsing easier or present specific products/pages.
The latter is of great interest to businesses because it can increase sales
if they offer only appealing products to customers, although, as pointed out
by Anand (Anand et al, 2004), it is difficult to present a convincing case for
Return on Investment. The success of data mining applications, like that of many
other applications, depends on the development of a standard. CRISP-DM
(Cross-Industry Standard Process for Data Mining) (CRISP-DM, 2000) is a data
mining process, defined and validated by a consortium of companies, that can be
used in different data mining projects such as web usage mining. CRISP-DM
divides the life cycle of a data mining project into 6 stages: Business
Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and
Deployment. The Business Understanding phase is closely tied to the problem to
be solved because it defines the business objectives of the application. The last
one, Deployment, is not easy to automate because each organization
has its own information processing management. For the remaining stages a tool
can be designed to facilitate the work of web usage mining practitioners
and reduce the development effort of new applications.
8th International Conference on Intelligent Data Engineering and Automated Learning
(IDEAL'07), 16-19 December, 2007, Birmingham, UK.
In this work we implement the WEBMINER architecture [5], which divides
the WUM process into three main parts: preprocessing, pattern discovery and
pattern analysis. These three parts correspond to the data preparation, modeling
and evaluation phases of the CRISP-DM model.
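The correspondence between the three WEBMINER parts and the CRISP-DM stages can be written down explicitly. The following is only an illustrative sketch in Python: the names `CrispDmPhase`, `WEBMINER_TO_CRISP_DM` and `phases_outside_webminer` are hypothetical and are not part of the tool described here.

```python
from enum import Enum


class CrispDmPhase(Enum):
    """The six CRISP-DM stages, in life-cycle order."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# WEBMINER part -> CRISP-DM stage, following the mapping stated in the text.
# The dictionary name and string keys are illustrative assumptions.
WEBMINER_TO_CRISP_DM = {
    "preprocessing": CrispDmPhase.DATA_PREPARATION,
    "pattern discovery": CrispDmPhase.MODELING,
    "pattern analysis": CrispDmPhase.EVALUATION,
}


def phases_outside_webminer():
    """Return the CRISP-DM stages that have no WEBMINER counterpart."""
    covered = set(WEBMINER_TO_CRISP_DM.values())
    return [phase for phase in CrispDmPhase if phase not in covered]
```

Calling `phases_outside_webminer()` lists the stages that the three WEBMINER parts do not map onto, which include the Business Understanding and Deployment stages discussed above.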
In this paper we present a tool to facilitate Web Usage Mining based
on the WEBMINER architecture. The tool is conceived as a framework in which
different techniques can be used at each stage. This facilitates experimentation
and removes the need to program the whole application
when we are interested in studying the effect of a new method on the mining
process. The architecture of the tool is shown in Figure 1, and the different
elements that make it up are described in the following sections. The paper is
organized as follows. Section 2 describes the data preprocessing. In sections 3
and 5 different approaches to user session and transaction identification are
presented. Finally, in sections 6 and 7 the models to be generated and the
results are presented.
[Figure 1. Architecture of the tool. Diagram labels: Web site crawler; Data preprocessing; Session identification; log <<Table>>; Feature Extraction; Classifier training; Clustering; Association rules discovering; ...]
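The framework idea, in which each stage can be swapped without rewriting the whole application, can be sketched as a pipeline of interchangeable functions. This is a minimal Python sketch under stated assumptions, not the implementation of the tool: the record layout, the image filter, the 30-minute timeout heuristic, the two features and every name below (`Pipeline`, `drop_images`, `sessions_by_timeout`, `basic_features`) are hypothetical choices made for illustration. The modeling stage is left out; the resulting feature rows are what a learner would consume.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Assumed log record layout: (client id, timestamp in seconds, requested URL).
LogRecord = Tuple[str, float, str]
Session = List[LogRecord]


def drop_images(records: List[LogRecord]) -> List[LogRecord]:
    """Example preprocessing step (assumption): discard image requests."""
    image_suffixes = (".gif", ".jpg", ".png")
    return [r for r in records if not r[2].lower().endswith(image_suffixes)]


def sessions_by_timeout(records: List[LogRecord],
                        timeout: float = 1800.0) -> List[Session]:
    """Example session identification (assumption): a gap longer than
    `timeout` seconds between two requests of a client starts a new session."""
    by_client: Dict[str, List[LogRecord]] = defaultdict(list)
    for rec in sorted(records, key=lambda r: r[1]):
        by_client[rec[0]].append(rec)
    sessions: List[Session] = []
    for recs in by_client.values():
        current = [recs[0]]
        for prev, rec in zip(recs, recs[1:]):
            if rec[1] - prev[1] > timeout:
                sessions.append(current)
                current = []
            current.append(rec)
        sessions.append(current)
    return sessions


def basic_features(session: Session) -> Dict[str, float]:
    """Example feature extraction (assumption): request count and duration."""
    return {"n_pages": len(session),
            "duration": session[-1][1] - session[0][1]}


class Pipeline:
    """Chains the stages; any stage can be replaced by another callable."""

    def __init__(self,
                 preprocess: Callable[[List[LogRecord]], List[LogRecord]],
                 identify_sessions: Callable[[List[LogRecord]], List[Session]],
                 extract_features: Callable[[Session], Dict[str, float]]):
        self.preprocess = preprocess
        self.identify_sessions = identify_sessions
        self.extract_features = extract_features

    def run(self, records: List[LogRecord]) -> List[Dict[str, float]]:
        clean = self.preprocess(records)
        return [self.extract_features(s) for s in self.identify_sessions(clean)]
```

To study the effect of a different session identification method, only the second argument of `Pipeline(drop_images, sessions_by_timeout, basic_features)` needs to change; the other stages stay untouched.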