A Tool for Web Usage Mining
Jose M. Domenech1 and Javier Lorenzo2
1 Hospital Juan Carlos I
Real del Castillo 152 - 3571 Las Palmas - Spain
jdomcab@gobiernodecanarias.org
2 Inst. of Intelligent Systems and Num. Applic. in Engineering
Univ. of Las Palmas
Campus Univ. de Tafira - 35017 Las Palmas - Spain
jlorenzo@iusiani.ulpgc.es
Abstract. This paper presents a tool for web usage mining. The aim
is to provide a tool that facilitates the mining process rather than to
implement elaborate algorithms and techniques. The tool covers
different phases of the CRISP-DM methodology, such as data preparation,
data selection, modeling and evaluation. The algorithms used in the
modeling phase are those implemented in the Weka project. The tool
has been tested on a web site to find access and navigation patterns.
1 Introduction
Discovering knowledge from large databases has received great attention during
the last decade, with data mining being the main tool for doing so [1]. The
world wide web has been considered the largest repository of information, but
it lacks a well-defined structure. Thus the world wide web is a good environment
for data mining, an activity that receives the name of Web Mining [2, 3].
Web mining can be divided into three main topics: Content Mining, Structure
Mining and Usage Mining. This work focuses on Web Usage Mining (WUM),
which has been defined as "the application of data mining techniques to discover
usage patterns from Web data" [4]. Web usage mining can provide organizations
with usage patterns from which to obtain customer profiles, so that they can
make website browsing easier or present specific products/pages. The latter is
of great interest to businesses because offering customers only appealing
products can increase sales, although, as Anand et al. point out (Anand et al,
2004), it is difficult to present a convincing case for Return on Investment.
The success of data mining applications, like that of many other applications,
depends on the development of a standard. CRISP-DM (Cross-Industry Standard
Process for Data Mining) (CRISP-DM, 2000) is a data mining process, defined and
validated by a consortium of companies, that can be used in different data
mining projects such as web usage mining. CRISP-DM divides the life cycle of a
data mining project into six stages: Business Understanding, Data Understanding,
Data Preparation, Modeling, Evaluation and Deployment.
The Business Understanding phase is highly connected with the problem to
be solved because it defines the business objectives of the application. The
last one, Deployment, is not easy to automate because each organization
has its own information processing management. For the remaining stages, a tool
can be designed to facilitate the work of web usage mining practitioners
and reduce the effort of developing new applications.
8th International Conference on Intelligent Data Engineering and Automated Learning
(IDEAL'07), 16-19 December, 2007, Birmingham, UK.
In this work we implement the WEBMINER architecture [5], which divides
the WUM process into three main parts: preprocessing, pattern discovery and
pattern analysis. These three parts correspond to the data preparation, modeling
and evaluation phases of the CRISP-DM model.
In this paper we present a tool to facilitate Web Usage Mining based
on the WEBMINER architecture. The tool is conceived as a framework where
different techniques can be used in each stage. This facilitates experimentation
and eliminates the need to program the whole application when we are interested
in studying the effect of a new method on the mining process. The architecture
of the tool is shown in Figure 1, and the different elements that make it up
are described in the following sections. The paper is organized as follows.
Section 2 describes the data preprocessing. In sections 3 and 5 different
approaches to user session and transaction identification are presented. Finally,
in sections 6 and 7 the models to be generated and the results are presented.
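
The idea of a framework with interchangeable techniques per stage can be
illustrated with a minimal Python sketch. It is not the tool's implementation:
the class WumPipeline, the type Stage and the three toy stage functions are
hypothetical names introduced here only to show how one stage could be swapped
without touching the others.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical stage type: each stage consumes the previous stage's output.
Stage = Callable[[Any], Any]


@dataclass
class WumPipeline:
    """The three WEBMINER parts, each replaceable independently."""
    preprocess: Stage  # corresponds to CRISP-DM data preparation
    discover: Stage    # corresponds to CRISP-DM modeling
    analyze: Stage     # corresponds to CRISP-DM evaluation

    def run(self, raw_requests):
        prepared = self.preprocess(raw_requests)
        patterns = self.discover(prepared)
        return self.analyze(patterns)


# Toy stages, for illustration only (not the techniques used in the paper).
def drop_image_requests(requests):
    # Keep only requests that do not point at image files.
    return [r for r in requests if not r.endswith((".gif", ".jpg"))]


def count_pages(pages):
    # Count how many times each page was requested.
    counts = {}
    for page in pages:
        counts[page] = counts.get(page, 0) + 1
    return counts


def most_visited(counts):
    # Return the page with the highest request count.
    return max(counts, key=counts.get)


requests = ["/index.html", "/logo.gif", "/news.html", "/index.html"]
pipeline = WumPipeline(drop_image_requests, count_pages, most_visited)
result = pipeline.run(requests)

# Swapping only the analysis stage leaves the other two untouched.
distinct_pages = WumPipeline(drop_image_requests, count_pages, len).run(requests)
```

Passing the stages as plain callables keeps the sketch short; a real framework
would probably also fix the data types exchanged between stages.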
[Figure 1: block diagram. Processing blocks: web site crawler, data
preprocessing, session identification, feature extraction, classifier training,
site page classification, clustering, association rules discovering. Data
elements: server logs, log table, site map, sessions, classified pages, rules,
access patterns, browsing patterns.]
Fig. 1. WUM tool architecture
2 Web Log Processing
Data for Web Usage Mining come from different sources, such as proxies, web
log files, the web site structure and even packet sniffer logs. The most
widely used sources are the web log files. These files record user accesses
to the site, and several formats exist: NCSA (Common Log Format),
W3C Extended, Sun ONE Web Server (iPlanet), IBM Tivoli Access Manager
WebSEAL or WebSphere Application Server Logs. Most web servers
record accesses using an extension of the CLF (ECLF). In ECLF, the
information recorded for each access is basically:
- remote host: Remote hostname. (or IP address number if DNS hostname is
  not available or was not provided)
- rfc931: The remote login name of the user. (If not available a minus sign is
  typically placed in the field)
- authuser: The username as which the user has authenticated himself. This
  is available when using password protected WWW pages. (If not available a
  minus sign is typically placed in the field)
- date: Date and time of the request.
- request: The request line exactly as it came from the client. (i.e., the file
  name, and the method used to retrieve it [typically GET])
- status: The HTTP response code returned to the client. Indicates whether
  or not the file was successfully retrieved, and if not, what error message was
  returned.
- bytes: The number of bytes transferred.
- referer: The URL the client was on before requesting your URL. (If it could not
  be determined a minus sign will be placed in this field)
...
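
The fields listed above can be extracted from a log line with a regular
expression. The following is a minimal Python sketch, not the parser used by
the tool: the pattern ECLF_PATTERN, the function parse_eclf_line and the sample
log line are assumptions made for illustration, and the pattern ignores
anything that follows the referer field.

```python
import re
from typing import Optional

# Assumed layout: host rfc931 authuser [date] "request" status bytes "referer"
ECLF_PATTERN = re.compile(
    r'(?P<remote_host>\S+) '
    r'(?P<rfc931>\S+) '
    r'(?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) '
    r'(?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)"'
)


def parse_eclf_line(line: str) -> Optional[dict]:
    """Return the ECLF fields of one log line, or None if it does not match."""
    match = ECLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A minus sign in the bytes field is treated here as zero bytes transferred.
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry


# Made-up sample line; missing fields carry a minus sign, as described above.
sample = (
    '192.0.2.10 - - [10/Oct/2006:13:55:36 +0100] '
    '"GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/start.html"'
)
entry = parse_eclf_line(sample)
```

Lines that do not follow the assumed layout yield None, so a caller can count
or discard malformed records before the later processing steps.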