A Tool For Web Usage Mining

Jose M. Domenech1 and Javier Lorenzo2

1 Hospital Juan Carlos I

Real del Castillo 152 - 3571 Las Palmas - Spain

jdomcab@gobiernodecanarias.org

2 Inst. of Intelligent Systems and Num. Applic. in Engineering

Univ. of Las Palmas

Campus Univ. de Tafira - 35017 Las Palmas - Spain

jlorenzo@iusiani.ulpgc.es

Abstract. This paper presents a tool for web usage mining. The aim is to provide a tool that facilitates the mining process rather than to implement elaborate algorithms and techniques. The tool covers different phases of the CRISP-DM methodology, such as data preparation, data selection, modeling and evaluation. The algorithms used in the modeling phase are those implemented in the Weka project. The tool has been tested on a web site to find access and navigation patterns.

1 Introduction

Discovering knowledge from large databases has received great attention during the last decade, with data mining being the main tool to do it [1]. The world wide web has been considered the largest repository of information, but it lacks a well-defined structure. Thus the world wide web is a good environment for data mining, which in this setting receives the name of Web Mining [2, 3].

Web mining can be divided into three main topics: Content Mining, Structure Mining and Usage Mining. This work is focused on Web Usage Mining (WUM), which has been defined as "the application of data mining techniques to discover usage patterns from Web data" [4]. Web usage mining can provide organizations with usage patterns in order to obtain customer profiles, so that they can make website browsing easier or present specific products/pages. The latter is of great interest for businesses because it can increase sales if they offer only appealing products to the customers, although, as pointed out by Anand (Anand et al, 2004), it is difficult to present a convincing case for Return on Investment. The success of data mining applications, like that of many other applications, depends on the development of a standard. CRISP-DM (Cross-Industry Standard Process for Data Mining) (CRISP-DM, 2000) is a data mining process, defined and validated by a consortium of companies, that can be used in different data mining projects such as web usage mining. CRISP-DM divides the life cycle of a data mining project into 6 stages: Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation and Deployment. The Business Understanding phase is highly connected with the problem to be solved because it defines the business objectives of the application. The last one, Deployment, is not easy to automate because each organization has its own information processing management. For the rest of the stages a tool can be designed in order to facilitate the work of web usage mining practitioners and reduce the development effort of new applications.

8th International Conference on Intelligent Data Engineering and Automated Learning (IDEAL'07), 16-19 December, 2007, Birmingham, UK.

In this work we implement the WEBMINER architecture [5], which divides the WUM process into three main parts: preprocessing, pattern discovery and pattern analysis. These three parts correspond to the data preparation, modeling and evaluation phases of the CRISP-DM model.

In this paper we present a tool to facilitate Web Usage Mining based on the WEBMINER architecture. The tool is conceived as a framework where different techniques can be used in each stage, which facilitates experimentation and eliminates the need to program the whole application when we are interested in studying the effect of a new method on the mining process. The architecture of the tool is shown in Figure 1, and the different elements that make it up will be described. The paper is organized as follows. Section 2 describes the data preprocessing. In sections 3 and 5 different approaches to user session and transaction identification are presented. Finally, in sections 6 and 7 the models to be generated and the results are presented.

[Figure 1: block diagram. Processing blocks: web site crawler, data preprocessing, session identification, site page classification, classifier training, feature extraction, clustering, association rules discovering. Data elements: server logs, log table, site map, classified pages, sessions, rules, access patterns, browsing patterns.]

Fig. 1. WUM tool architecture

2 Web Log Processing

Data for Web Usage Mining come from different sources such as proxies, web log files, the web site structure and even packet sniffer logs. Normally, the most widely used sources are the web log files. These files record the user accesses to the site, and several formats exist: NCSA (Common Log Format), W3C Extended, Sun ONE Web Server (iPlanet), IBM Tivoli Access Manager WebSEAL or WebSphere Application Server Logs. Most web servers record the accesses using an extension of the CLF (ECLF). In ECLF, the information recorded for each access is basically:

- remote host: Remote hostname (or IP address number if the DNS hostname is not available or was not provided).
- rfc931: The remote login name of the user. (If not available, a minus sign is typically placed in the field.)
- authuser: The username with which the user has authenticated himself. This is available when using password protected WWW pages. (If not available, a minus sign is typically placed in the field.)
- date: Date and time of the request.
- request: The request line exactly as it came from the client (i.e., the file name and the method used to retrieve it, typically GET).
- status: The HTTP response code returned to the client. Indicates whether or not the file was successfully retrieved, and if not, what error message was returned.
- bytes: The number of bytes transferred.
- referer: The url the client was on before requesting your url. (If it could not be determined, a minus sign will be placed in this field.)
- user agent: The software the client claims to be using. (If it could not be determined, a minus sign will be placed in this field.)
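As an illustration of the field list above, the following is a minimal Python sketch that parses one ECLF entry into a dictionary. It is not part of the tool described here: the regular expression, the function name parse_eclf_line, the extra method and url keys, and the sample line are assumptions made for this example, and the pattern assumes the usual layout in which the request, referer and user agent are quoted.

```python
import re
from datetime import datetime

# One ECLF entry; the group names mirror the fields listed above.
ECLF_PATTERN = re.compile(
    r'(?P<remote_host>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_eclf_line(line):
    """Return a dict with the ECLF fields of one log line, or None if it does not match."""
    match = ECLF_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Timestamps look like 10/Oct/2007:13:55:36 +0100
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A minus sign means that no byte count was recorded
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    # The request line has the form "METHOD URL PROTOCOL"
    parts = entry["request"].split()
    entry["method"] = parts[0] if parts else None
    entry["url"] = parts[1] if len(parts) > 1 else None
    return entry


# Hypothetical sample entry, used only to exercise the parser
SAMPLE = ('192.0.2.10 - - [10/Oct/2007:13:55:36 +0100] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://www.example.org/start.html" "Mozilla/4.08 [en] (Win98)"')

entry = parse_eclf_line(SAMPLE)
```

Lines that do not follow the assumed layout yield None, so a caller can count or log them instead of silently losing entries.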

As said before, web server logs record all the user accesses, including, for each visited page, all the elements that compose it, such as gif images, styles or scripts. Other entries in the log refer to failed requests to the server, such as "404 Error: Object not found". So a first phase in data preparation consists of filtering the log to remove all useless entries. Other entries in the web log that must be removed are those that correspond to search robots, because they do not correspond to a "true" user. To filter these entries, the plain text file robots.txt and the list of known search robots (www.robotstxt.org/wc/active/all.txt) can be used. In addition, we have introduced a heuristic that filters very quick consecutive requests, because a characteristic of search robots is the short delay between page requests. So with a threshold of 2 seconds between two consecutive requests the entries that correspond to robots can be eliminated.
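The filtering phase described above could be sketched in Python as follows. This is only one possible reading of the text, not the tool's actual code: entries are assumed to be dictionaries with the keys remote_host, date, url, status and user_agent, the list of static file suffixes is illustrative, robots are matched by exact user agent string, and a host is treated as a robot as soon as two of its page requests are less than 2 seconds apart.

```python
from datetime import datetime, timedelta

# Suffixes of embedded page elements (illustrative list, an assumption)
STATIC_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")
# Threshold between consecutive requests stated in the text
ROBOT_DELAY = timedelta(seconds=2)


def filter_entries(entries, known_robot_agents=()):
    """Drop embedded elements, failed requests and entries of presumed robots."""
    # Hosts that request robots.txt or announce a known robot user agent
    robots = {e["remote_host"] for e in entries
              if e["url"] == "/robots.txt" or e["user_agent"] in known_robot_agents}

    # Keep only successful requests for pages
    pages = [e for e in entries
             if e["status"] < 400
             and not e["url"].lower().endswith(STATIC_SUFFIXES)]

    # Heuristic: very quick consecutive page requests from the same host
    last_seen = {}
    for e in sorted(pages, key=lambda e: e["date"]):
        host = e["remote_host"]
        previous = last_seen.get(host)
        if previous is not None and e["date"] - previous < ROBOT_DELAY:
            robots.add(host)
        last_seen[host] = e["date"]

    return [e for e in pages if e["remote_host"] not in robots]
```

The delay is measured only between page requests, after images, styles and scripts have been discarded, because a browser fetches those elements almost at the same time as the page that contains them.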

The structure of the site has been used as another data source. This structure is obtained with a web crawler starting from the root, so all the pages that can be reached from the root compose the structure of the site. For non-static sites the structure must be introduced by hand.
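The idea that the site structure consists of every page reachable from the root can be sketched as a breadth-first traversal. The paper does not describe how the crawler is implemented, so the function crawl_site and its get_links parameter are hypothetical; here get_links stands in for whatever fetches a page and extracts its links, and a small in-memory site is used to keep the example self-contained.

```python
from collections import deque


def crawl_site(root, get_links):
    """Breadth-first traversal; returns {page: [linked pages]} for pages reachable from root."""
    site_map = {}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        if page in site_map:
            continue  # already expanded
        links = list(get_links(page))
        site_map[page] = links
        queue.extend(link for link in links if link not in site_map)
    return site_map


# Toy site used only for illustration; "/orphan" is not reachable from the root
TOY_SITE = {"/": ["/a", "/b"], "/a": ["/"], "/b": ["/c"], "/c": [], "/orphan": ["/"]}
site_map = crawl_site("/", lambda page: TOY_SITE.get(page, []))
```

Pages that no link path from the root reaches, such as "/orphan" in the toy site, never enter the map, which matches the definition of the structure given above.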

3 User Session Identification

Once the web log file is processed and all the irrelevant entries have been removed, it is necessary to identify the users that visit the site. The visits are concurrent, so in the log file the entries of different users are interlaced, which forces us to process it to collect the entries that belong to the same user.

A first approach to identify a user is to use the IP address and assign all the entries with the same IP to the same user. This approach exhibits some drawbacks. Some users access the internet through a proxy, so many users will share the same IP. In other cases the same user has different IPs because of a dynamic IP configuration at the ISP. In order to minimize these effects some heuristics have been applied. A first heuristic is to detect changes in the browser or operating system fields of the entries that come from the same IP. Another heuristic makes use of the referer field and the map of the site obtained with the site crawler mentioned previously. Thus, if a page is not directly linked to the pages previously visited by the user, it is evidence that another user shares the same IP and browser. With the explained heuristics we will get false positives, that is, only one user is considered when there are actually different users.
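The two heuristics above could be combined as in the following Python sketch. It is an interpretation of the text rather than the tool's implementation: entries are assumed to be in chronological order and to carry the keys remote_host, user_agent, url and referer, the user agent string stands in for the browser and operating system fields, site_map is assumed to map each page to the pages it links to, and treating a revisited page as belonging to the same user is an additional assumption.

```python
def identify_users(entries, site_map):
    """Return one user id per entry, using IP, user agent, referer and site map."""
    users = []       # one record per user: the (IP, agent) key and the visited pages
    assignment = []
    for e in entries:
        # Heuristic 1: a different browser/OS behind the same IP is a different user
        key = (e["remote_host"], e["user_agent"])
        user_id = None
        for i, user in enumerate(users):
            if user["key"] != key:
                continue
            # Heuristic 2: the page must be connected to pages this user already visited
            linked = any(e["url"] in site_map.get(page, ()) for page in user["visited"])
            if linked or e["referer"] in user["visited"] or e["url"] in user["visited"]:
                user_id = i
                break
        if user_id is None:
            # No compatible user found: assume a new user behind this IP
            users.append({"key": key, "visited": set()})
            user_id = len(users) - 1
        users[user_id]["visited"].add(e["url"])
        assignment.append(user_id)
    return assignment
```

Two users behind the same proxy who use the same browser and follow connected pages are still merged into one, which is the kind of false positive mentioned above.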

After

...
