

A Tool for Web Usage Mining

Jose M. Domenech1 and Javier Lorenzo2

1 Hospital Juan Carlos I

Real del Castillo 152 - 3571 Las Palmas - Spain

jdomcab@gobiernodecanarias.org

2 Inst. of Intelligent Systems and Num. Applic. in Engineering

Univ. of Las Palmas

Campus Univ. de Tafira - 35017 Las Palmas - Spain

jlorenzo@iusiani.ulpgc.es

Abstract. This paper presents a tool for web usage mining. The aim is to provide
a tool that facilitates the mining process rather than to implement elaborate
algorithms and techniques. The tool covers different phases of the CRISP-DM
methodology, such as data preparation, data selection, modeling and evaluation.
The algorithms used in the modeling phase are those implemented in the Weka
project. The tool has been tested on a web site to find access and navigation
patterns.

1 Introduction

Discovering knowledge from large databases has received great attention during
the last decade, with data mining being the main tool for doing so [1]. The World
Wide Web has been considered the largest repository of information, but it lacks
a well-defined structure. Thus the World Wide Web is a good environment for data
mining, an activity that receives the name of Web Mining [2, 3].

Web mining can be divided into three main topics: Content Mining, Structure
Mining and Usage Mining. This work focuses on Web Usage Mining (WUM), which
has been defined as "the application of data mining techniques to discover
usage patterns from Web data" [4]. Web usage mining can provide organizations
with usage patterns from which to obtain customer profiles, so that they can
make the website easier to browse or present specific products/pages. The
latter is of great interest to businesses because it can increase sales if they
offer only appealing products to customers, although, as pointed out by Anand
(Anand et al, 2004), it is difficult to present a convincing case for Return on
Investment. The success of data mining applications, like that of many other
applications, depends on the development of a standard. CRISP-DM
(Cross-Industry Standard Process for Data Mining) (CRISP-DM, 2000) comes from a
consortium of companies that has defined and validated a data mining process
that can be used in different data mining projects, such as web usage mining.
CRISP-DM divides the life cycle of a data mining project into six stages:
Business Understanding, Data Understanding, Data Preparation, Modeling,
Evaluation and Deployment.

The Business Understanding phase is closely connected with the problem to be
solved because it defines the business objectives of the application. The last
one, Deployment, is not easy to automate because each organization has its own
information processing management. For the remaining stages a tool can be
designed to facilitate the work of web usage mining practitioners and reduce
the development effort for new applications.

8th International Conference on Intelligent Data Engineering and Automated Learning
(IDEAL'07), 16-19 December, 2007, Birmingham, UK.

In this work we implement the WEBMINER architecture [5], which divides the WUM
process into three main parts: preprocessing, pattern discovery and pattern
analysis. These three parts correspond to the data preparation, modeling and
evaluation phases of the CRISP-DM model.

In this paper we present a tool to facilitate Web Usage Mining based on the
WEBMINER architecture. The tool is conceived as a framework where different
techniques can be used in each stage. This facilitates experimentation and
eliminates the need to program the whole application when we are interested in
studying the effect of a new method on the mining process. The architecture of
the tool is shown in Figure 1, and the different elements that make it up are
described in the following sections. The paper is organized as follows. Section
2 describes the data preprocessing. In sections 3 and 5 different approaches to
user session and transaction identification are presented. Finally, in sections
6 and 7 the models to be generated and the results are presented.

[Figure 1: architecture diagram. Labels recovered from the figure: Web site,
crawler, Site map, Site page classification, Feature Extraction, Classifier
training, Classified pages, Server logs, Data preprocessing, log (table),
Session identification, Sessions, Clustering, Association rules discovering,
Rules, Access Patterns, Browsing Patterns.]

Fig. 1. WUM tool architecture
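
The framework idea described above, where each stage can be swapped without
rewriting the whole application, can be illustrated with a minimal Python
sketch. It is not the tool's actual design: the paper does not give its
interfaces, so the function names (run_pipeline, split_fields, count_pages,
keep_frequent), the stage signatures and the toy "host page" input format are
all assumptions made for illustration.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stage signatures; the paper does not specify the tool's interfaces.
Preprocessor = Callable[[Iterable[str]], List[Dict[str, str]]]
Discoverer = Callable[[List[Dict[str, str]]], List[str]]
Analyzer = Callable[[List[str]], List[str]]


def run_pipeline(raw_lines: Iterable[str],
                 preprocess: Preprocessor,
                 discover: Discoverer,
                 analyze: Analyzer) -> List[str]:
    """Chain the three WEBMINER parts; each argument is a swappable stage."""
    records = preprocess(raw_lines)   # preprocessing (data preparation)
    patterns = discover(records)      # pattern discovery (modeling)
    return analyze(patterns)          # pattern analysis (evaluation)


# Toy stages, used only to show how a stage can be replaced independently.
def split_fields(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Turn 'host page' lines (an invented format) into records."""
    records = []
    for line in lines:
        if line.strip():
            host, page = line.split()[:2]
            records.append({"host": host, "page": page})
    return records


def count_pages(records: List[Dict[str, str]]) -> List[str]:
    """Describe each page by its number of accesses, as 'page:count'."""
    counts: Dict[str, int] = {}
    for record in records:
        counts[record["page"]] = counts.get(record["page"], 0) + 1
    return [f"{page}:{n}" for page, n in sorted(counts.items())]


def keep_frequent(patterns: List[str], min_count: int = 2) -> List[str]:
    """Keep only the patterns seen at least min_count times."""
    return [p for p in patterns if int(p.rsplit(":", 1)[1]) >= min_count]


result = run_pipeline(
    ["10.0.0.1 /index.html", "10.0.0.2 /index.html", "10.0.0.1 /news.html"],
    split_fields, count_pages, keep_frequent,
)
```

Replacing count_pages with another discovery function leaves the other two
stages untouched, which is the kind of experimentation the framework is meant
to support.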

2 Web Log Processing

The data for Web Usage Mining come from different sources, such as proxy logs,
web log files, the web site structure and even packet sniffer logs. The most
widely used sources are the web log files. These files record the user accesses
to the site, and several formats exist: NCSA (Common Log Format), W3C Extended,
Sun ONE Web Server (iPlanet), IBM Tivoli Access Manager WebSEAL or WebSphere
Application Server Logs. Most web servers record accesses using an extension of
the CLF (ECLF). In ECLF the information recorded for each access is basically:

- remote host: Remote hostname (or IP address number if the DNS hostname is
  not available or was not provided).
- rfc931: The remote login name of the user. (If not available, a minus sign is
  typically placed in the field.)
- authuser: The username with which the user has authenticated. This is
  available when using password-protected WWW pages. (If not available, a
  minus sign is typically placed in the field.)
- date: Date and time of the request.
- request: The request line exactly as it came from the client (i.e., the file
  name and the method used to retrieve it, typically GET).
- status: The HTTP response code returned to the client. It indicates whether
  or not the file was successfully retrieved and, if not, what error message
  was returned.
- bytes: The number of bytes transferred.
- referer: The URL the client was on before requesting your URL. (If it could
  not be determined, a minus sign is placed in this field.)
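
As an illustration of the fields listed above, the following Python sketch
parses one ECLF line into a dictionary. It is not the tool's implementation:
the regular expression, the function name parse_eclf_line and the sample line
are assumptions made here, and any quoted fields that follow the referer are
ignored.

```python
import re
from datetime import datetime

# One ECLF line: the seven CLF fields followed by an optional quoted referer.
# Assumption: fields are separated by single spaces, as in the common layout.
ECLF_PATTERN = re.compile(
    r'(?P<remote_host>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referer>[^"]*)")?'
)


def parse_eclf_line(line):
    """Return a dict with the ECLF fields, or None if the line does not match."""
    match = ECLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # A minus sign means "not available" for these fields.
    for key in ("rfc931", "authuser", "referer"):
        if record[key] == "-":
            record[key] = None
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    # Example timestamp: 10/Oct/2000:13:55:36 -0700
    record["date"] = datetime.strptime(record["date"], "%d/%b/%Y:%H:%M:%S %z")
    return record


# Invented sample line, used only to exercise the parser.
SAMPLE = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
          '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html"')
record = parse_eclf_line(SAMPLE)
```

Lines that do not match return None, so a caller can count or discard
malformed entries before any further preprocessing.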

...
