SAS. Introduction to Clustering (Lectures 2013)


Chapter 1: Introduction to Clustering
1.1 Overview
1.2 Types of Clustering
1.3 Measuring Similarity
1.4 Classification Performance
1

Objectives

Define clustering and unsupervised learning.

Outline the two fundamental forms:
– partitive (a.k.a. optimization) clustering
– hierarchical clustering.

Describe several key distance metrics that are used to estimate the similarity between observations.
3

Procedure Overview
– Variable Selection: VARCLUS
– Plot Data: PRINCOMP, MDS
– Preprocessing: ACECLUS, STDIZE, DISTANCE
– Hierarchical Clustering: CLUSTER
– Partitive Clustering, Parametric: FASTCLUS
– Partitive Clustering, Non-Parametric: MODECLUS
4

Example: Clustering for Customer Types
While you have thousands of customers, there are really only a handful of major types into which most of your customers can be grouped:
– Bargain hunter
– Man/woman on a mission
– Impulse shopper
– Weary parent
– DINK (dual income, no kids)
5

Example: Clustering for Store Location
You want to open new grocery stores in the U.S. based on demographics. Where should you locate the following types of new stores?
– low-end budget grocery stores
– small boutique grocery stores
– large full-service supermarkets
6

Cluster Profiling

Cluster profiling can be defined as the derivation of a class label from a proposed cluster solution.

The objective is to identify the features, or combination of features, that uniquely describe each cluster.
7

Chapter 1: Introduction to Clustering
1.1 Overview
1.2 Types of Clustering
1.3 Measuring Similarity
1.4 Classification Performance
8

Hierarchical Clustering
[Figure: agglomerative (bottom-up) versus divisive (top-down) hierarchical clustering]
9

Partitive Clustering
[Figure: initial state with reference vectors (seeds) and final state with observations assigned to clusters]
PROBLEMS!
– make you guess the number of clusters present
– make assumptions about the shape of the clusters
– influenced by seed location, outliers, and order of reading observations
– impossible to determine the optimal grouping, due to the combinatorial explosion of potential solutions.
10

Heuristic Search
1. Generate an initial partitioning (based on the seeds)
of the observations into clusters.
2. Calculate the change in error produced by moving
each observation from its own cluster to each of the
other clusters.
3. Make the move that produces the greatest reduction.
4. Repeat steps 2 and 3 until no move reduces error.
11

Chapter 1: Introduction to Clustering
1.1 Overview
1.2 Types of Clustering
1.3 Measuring Similarity
1.4 Classification Performance
12

Principles of a Good Similarity Metric
Properties of a good similarity metric:
1. symmetry: $d(x,y) = d(y,x)$
2. non-identical distinguishability: $d(x,y) > 0$ if $x \neq y$
3. identical non-distinguishability: $d(x,y) = 0$ if $x = y$
4. triangle inequality: $d(x,y) \leq d(x,z) + d(y,z)$
13

The DISTANCE Procedure
General form of the DISTANCE procedure:

PROC DISTANCE DATA=SAS-data-set
              METHOD=similarity-metric;
   VAR level (variables < / option-list >);
RUN;

A distance method must be specified (no default), and all input variables are identified by level.
14
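A concrete call (the data set and variable names here are illustrative, not from the course data) that computes Euclidean distances between standardized interval inputs might look like this:

proc distance data=customers method=euclid out=dist;
   /* interval-level inputs, standardized before distances are computed */
   var interval(income age spending / std=std);
run;

The OUT= data set then holds the pairwise distance matrix, which can be passed to PROC CLUSTER or PROC MDS.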

Simple Popular Distance Metrics

Euclidean distance:
$$D_E(\mathbf{x}, \mathbf{w}) = \sqrt{\sum_{i=1}^{d} (x_i - w_i)^2}$$

City block distance:
$$D_{M1}(\mathbf{x}, \mathbf{w}) = \sum_{i=1}^{d} \lvert x_i - w_i \rvert$$

Correlation
15

Go Beyond: Density-Based Similarity
[Figure: density estimate 1 (cluster 1) and density estimate 2 (cluster 2)]
Density-based methods define similarity as the distance between derived density “bubbles” (hyper-spheres).
16

Chapter 1: Introduction to Clustering
1.1 Overview
1.2 Types of Clustering
1.3 Measuring Similarity
1.4 Classification Performance
17

Assessing the Cluster Classification
[Figure: cluster-by-class cross-tabulations for a perfect solution, a typical solution, and no solution]
18

From Clusters to Class Probabilities
The probability that a cluster represents a given class is given by the cluster’s proportion of the row total.
[Figure: frequency cross-tabulation converted to row probabilities]
19

Quality of Classification

The chi-square statistic is used to determine whether an association exists:
$$\chi^2 = \sum_i \sum_j \frac{(\text{observed}_{ij} - \text{expected}_{ij})^2}{\text{expected}_{ij}}$$

Because the chi-square value grows with sample size, it does not measure the strength of the association.

Cramér’s V measures the strength of the association on a scale from 0 (weak) to 1 (strong); for 2x2 tables only, it ranges between -1 and 1.
20
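Both statistics are reported by PROC FREQ with the CHISQ option. A minimal sketch, assuming a data set (hypothetical name clus_vs_class) with one row per observation and variables named cluster and class:

proc freq data=clus_vs_class;
   /* CHISQ prints the chi-square test and Cramer's V for the cluster-by-class table */
   tables cluster*class / chisq;
run;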

Chapter 2: Preparation for Clustering
2.1 Preparing Data for Cluster Analysis
2.2 Variable Clustering
2.3 Graphical Aids to Clustering
2.4 Variable Standardization
2.5 Cluster Preprocessing
21

The Challenge of Opportunistic Data
Getting anything useful out of tons of data
23

Preparation for Clustering
1. Data and Sample Selection (Who am I clustering?)
2. Variable Selection (What characteristics matter?)
3. Graphical Exploration (What shape/how many clusters?)
4. Variable Standardization (Are variable scales comparable?)
5. Variable Transformation (Are variables correlated? Are clusters elongated?)
24

Data and Sample Selection

It is not necessary to cluster a large population if you use clustering techniques that lend themselves to scoring (for example, Ward’s, k-means).

It is useful to take a random sample for clustering and score the remainder of the larger population.
26

Chapter 2: Preparation for Clustering
2.1 Preparing Data for Cluster Analysis
2.2 Variable Clustering
2.3 Graphical Aids to Clustering
2.4 Variable Standardization
2.5 Cluster Preprocessing
27

28
Preparation for Clustering
1. Data and Sample Selection (Who am I clustering?)
2. Variable Selection (What characteristics matter?)
3. Graphical Exploration (What shape/how many
clusters?)
4. Variable Standardization (Are variable scales
comparable?)
5. Variable Transformation (Are variables correlated? Are
clusters elongated?)
28

Variable Reduction
Should one analyze all the available data?
[Figure: redundancy (Input3 versus Input1) and irrelevancy (Input1, Input2 versus E(Target))]
29

Variable Reduction for Irrelevancy

Regression models automatically weight input variables depending on their impact.

But in cluster analysis, there is no dependent variable, so irrelevant variables should be eliminated before performing any cluster analysis:
– Perform a variable importance analysis on a specially prepared data sample with a target variable.
– Include a priori real-world considerations.
30

The Secret to Better Clusters
[Figure: fraud versus OK transactions plotted on Transaction Amt. alone]
31

The Secret to Better Clusters
[Figure: fraud versus OK transactions plotted on Time of Day by Transaction Amt.]
32

The Secret to Better Clusters
[Figure: fraud versus OK transactions plotted on Time of Day, Transaction Amt., and Cheatin’ Heart]
More non-correlated variables = better clusters
33

Variable Reduction for Redundancy

PROC VARCLUS DATA=SAS-data-set;
   BY variables;
   VAR variables;
RUN;

PROC VARCLUS groups redundant variables.

One representative from each cluster can be chosen, and the remaining variables discarded, reducing both collinearity and the number of variables.
34
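As a sketch (the data set and variable names are invented), a typical run stops splitting once every variable cluster’s second eigenvalue is small; the variable with the lowest 1-R**2 ratio in each cluster is then kept as its representative:

proc varclus data=survey maxeigen=0.7 short;
   /* MAXEIGEN= controls when a variable cluster stops being split */
   var q1-q20;
run;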

Divisive Clustering
PROC VARCLUS uses divisive clustering to create variable subgroups that are as dissimilar as possible.
[Figure: divisive variable-clustering tree; variables below the cut are ignored]
35

Variable Selection Using Variable Clustering
This demonstration illustrates the concepts discussed previously.
clus02d01.sas
36

[Figure: VARCLUS output marking which variables to keep and which to ignore]
clus02d01.sas
37

Chapter 2: Preparation for Clustering
2.1 Preparing Data for Cluster Analysis
2.2 Variable Clustering
2.3 Graphical Aids to Clustering
2.4 Variable Standardization
2.5 Cluster Preprocessing
38

Graphical Exploration

Plotting can help to determine such key things as
– the shape of the clusters
– relative cluster dispersion (variation)
– the approximate number of clusters in the data.
39

Principal Component Plots

PROC PRINCOMP DATA=SAS-data-set;
   BY variables;
   VAR variables;
RUN;

[Figure: data cloud in (x1, x2) with Eigenvector 1 / Eigenvalue 1 and Eigenvector 2 / Eigenvalue 2]
40
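A minimal sketch of plotting observations on the first two principal components (data set and variable names are hypothetical):

proc princomp data=customers n=2 out=pc_scores;
   var income age spending;
run;

proc sgplot data=pc_scores;
   /* Prin1 and Prin2 are the component scores written to the OUT= data set */
   scatter x=Prin1 y=Prin2;
run;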

Multidimensional Scaling Plots

PROC MDS DATA=distance_matrix;
   VAR variables;
RUN;
41

Chapter 2: Preparation for Clustering
2.1 Preparing Data for Cluster Analysis
2.2 Variable Clustering
2.3 Graphical Aids to Clustering
2.4 Variable Standardization
2.5 Cluster Preprocessing
42

43
Preparation for Clustering
1. Data and Sample Selection (Who am I clustering?)
2. Variable Selection (What characteristics matter?)
3. Graphical Exploration (What shape/how many
clusters?)
4. Variable Standardization (Are variable scales
comparable?)
5. Variable Transformation (Are variables correlated? Are
clusters elongated?)
43

The STDIZE Procedure
General form of the STDIZE procedure:

PROC STDIZE DATA=SAS-data-set
            METHOD=method;
   VAR variables;
RUN;
44
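For instance (names are made up), standardizing the inputs to mean 0 and standard deviation 1 before clustering:

proc stdize data=customers method=std out=customers_std;
   var income age spending;
run;

METHOD=RANGE, which rescales each variable by its range, is a common alternative.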

Standardization Methods
[Table of METHOD= options for PROC STDIZE]
45–46

Chapter 2: Preparation for Clustering
2.1 Preparing Data for Cluster Analysis
2.2 Variable Clustering
2.3 Graphical Aids to Clustering
2.4 Variable Standardization
2.5 Cluster Preprocessing
47

48
Preparation for Clustering
1. Data and Sample Selection (Who am I clustering?)
2. Variable Selection (What characteristics matter?)
3. Graphical Exploration (What shape/how many
clusters?)
4. Variable Standardization (Are variable scales
comparable?)
5. Variable Transformation (Are variables correlated? Are
clusters elongated?)
48

The ACECLUS Procedure
General form of the ACECLUS procedure:

PROC ACECLUS DATA=SAS-data-set;
   VAR variables;
RUN;

[Figure: scatter plots of the data before ACECLUS and after ACECLUS]
49
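A hedged sketch (data set and variable names are invented): either PROPORTION= or THRESHOLD= must be specified, and the transformed scores in the OUT= data set are what you would pass on to PROC FASTCLUS or PROC CLUSTER:

proc aceclus data=customers_std proportion=0.03 out=ace_scores;
   /* PROPORTION= gives the share of pairs assumed to come from the same cluster */
   var income age spending;
run;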

Chapter 3: Partitive Clustering
3.1 Introduction to K-Means Clustering
3.2 K-Means Clustering Using the FASTCLUS Procedure
3.3 Nonparametric Clustering
52

Partitive (Optimization) Clustering

Partitive clustering minimizes or maximizes a specified error criterion, for example
– cluster separation, or
– within-cluster similarity (homogeneity).
54

Partitive (Optimization) Clustering
– Natural Grouping Criterion (k-means)
– Parametric (Expectation-Maximization)
– Non-Parametric (Kernel-based)
55

Natural Grouping Criterion

Borrowing concepts from least-squares estimation yields a natural grouping criterion:
– maximize the between-cluster sum of squares, or
– minimize the within-cluster sum of squares.

A large between-cluster sum of squares value implies that the cluster is well separated.

A small within-cluster sum of squares value implies that the members of the cluster are homogeneous.
56

Cross-Cluster Variation Matrix
$$W = \begin{pmatrix} W_{11} & W_{12} & W_{13} & \cdots \\ W_{21} & W_{22} & W_{23} & \cdots \\ W_{31} & W_{32} & W_{33} & \cdots \\ \vdots & & & W_{nn} \end{pmatrix}$$
57

The Trace Function

Trace summarizes matrix W into a single number by adding together its diagonal (variance) elements.

Simply adding matrix elements together makes trace very efficient, but it also makes it scale dependent.

It ignores the off-diagonal elements, so variables are treated as if they were independent (uncorrelated), which diminishes the impact of information from correlated variables.
58

Basic Trace(W) Problems

Spherical Structure Problem
– Because the trace function only looks at the diagonal elements of W, it tends to form spherical clusters.
– Use data transformation techniques.

Similar Size Problem
– Trace(W) also tends to produce clusters with about the same number of observations.
– Alternative clustering techniques exist to manage this problem.
59

Chapter 3: Partitive Clustering
3.1 Introduction to K-Means Clustering
3.2 K-Means Clustering Using the FASTCLUS
Procedure
3.3 Nonparametric Clustering
60

The K-Means Methodology
The three-step k-means methodology:
1. Select (or specify) an initial set of cluster seeds.
2. Read the observations and update the seeds (known after the update as reference vectors). Repeat until convergence is attained.
3. Make one final pass through the data, assigning each observation to its nearest reference vector.
63

k-Means Clustering Algorithm
1. Select inputs.
2. Select k cluster centers.
3. Assign cases to closest center.
4. Update cluster centers.
5. Reassign cases.
6. Repeat steps 4 and 5 until convergence.
[Figure: animation of cluster centers and case assignments updating over iterations]
64–73

Segmentation Analysis
When no clusters exist,
use the k-means algorithm
to partition cases into
contiguous groups.
74

The FASTCLUS Procedure
General form of the FASTCLUS procedure:

PROC FASTCLUS DATA=SAS-data-set <options>;
   VAR variables;
RUN;

Because PROC FASTCLUS produces relatively little output, it is often a good idea to create an output data set, and then use other procedures such as PROC MEANS, PROC SGPLOT, PROC DISCRIM, or PROC CANDISC to study the clusters.
75
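A hedged, concrete sketch (data set and variable names are invented) that requests five clusters and then profiles them with PROC MEANS:

proc fastclus data=customers_std maxclusters=5 maxiter=100 out=clus_out;
   var income age spending;
run;

proc means data=clus_out mean std;
   /* the OUT= data set contains CLUSTER and DISTANCE variables */
   class cluster;
   var income age spending;
run;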

The MAXITER= Option

The MAXITER= option sets the number of k-means iterations (the default number of iterations is 1).

[Figure: seed positions at Time 0, Time 1, ..., Time n as the iterations proceed]
76

The DRIFT Option
The DRIFT option adjusts the nearest reference vector as each observation is assigned.

[Figure: reference vector positions at Time 0, Time 1, and Time 2 as observations are assigned]
77

The LEAST= Option
The LEAST= option provides the argument for the Minkowski distance metric, changes the number of iterations, and changes the convergence criterion.

Option    | Distance   | Max Iterations | CONVERGE=
default   | EUCLIDEAN  | 1              | 0.02
LEAST=1   | CITY BLOCK | 20             | 0.0001
LEAST=2   | EUCLIDEAN  | 10             | 0.0001
78

What Value of k to Use?
The number of seeds, k, typically translates to the final number of clusters obtained. The choice of k can be made using a variety of methods:
– Subject-matter knowledge (There are most likely five groups.)
– Convenience (It is convenient to market to three to four groups.)
– Constraints (You have six products and need six segments.)
– Arbitrarily (Always pick 20.)
– Based on the data (combined with Ward’s method).
79

Problems with K-Means
– The resulting partition of the space is not always optimal.
– It ignores the density of the sample (“Sample density? Never heard of it!”).
80

Grocery Store Case Study: Census Data
Analysis goal:
Where should you open new grocery store locations? Group geographic regions based on income, household size, and population density.
Analysis plan:
– Explore the data.
– Select the number of segments to create.
– Create segments with a clustering procedure.
– Interpret the segments.
– Map the segments.
81

K-Means Clustering for Segmentation
This demonstration illustrates the concepts discussed previously.
clus03d01.sas
82


Chapter 3: Partitive Clustering
3.1 Introduction to K-Means Clustering
3.2 K-Means Clustering Using the FASTCLUS Procedure
3.3 Nonparametric Clustering
86

Parametric vs. Non-Parametric Clustering
[Figure: Expectation-Maximization succeeding (+) and failing (−) on different cluster shapes]
Parametric clustering performs poorly on density-based clusters.
87

Developing Kernel Intuition
[Figure: kernel density estimate with its modes marked]
88

Advantages of Nonparametric Clustering

– It still obtains good results on compact clusters.
– It is capable of detecting clusters of unequal size and dispersion, even if they have irregular shapes.
– It is less sensitive (but not insensitive) to changes in scale than most clustering methods.
– It does not require that you guess the number of clusters present in the data.

PROC MODECLUS DATA=SAS-data-set
              METHOD=method;
   VAR variables;
RUN;
89
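A hedged sketch (data set, variable names, and option values are illustrative, not prescribed by the course):

proc modeclus data=customers_std method=1 r=2 join=0.05 out=mode_out;
   /* R= sets the fixed kernel radius; JOIN= hierarchically joins clusters
      whose separation is not significant at the given p-value */
   var income age spending;
run;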

Significance Tests

If requested (the JOIN= option), PROC MODECLUS can hierarchically join non-significant clusters.

Although a fixed-radius kernel (R=) must be specified, the choice of smoothing parameter is not critical.
90

Valley-Seeking Method
[Figure: modal region 1 (cluster 1) and modal region 2 (cluster 2) separated by a valley in the density]
91

Saddle Density Estimation
[Figure: saddle between two modes, contrasting no density difference with a density difference]
92

Hierarchically Joining Non-Significant Clusters
This demonstration illustrates the concepts discussed previously.
clus03d03.sas
93

Chapter 4: Hierarchical Clustering
4.1 Introduction
4.2 Hierarchical Clustering Methods
96

Hierarchical Clustering
98

The CLUSTER Procedure
General form of the CLUSTER procedure:

PROC CLUSTER DATA=SAS-data-set
             METHOD=method;
   VAR variables;
   FREQ variable;
   RMSSTD variable;
RUN;

The required METHOD= option specifies the hierarchical technique to be used to cluster the observations.
99

Cluster and Data Types

Hierarchical Method    | Distance Data Required?
Average Linkage        | Yes
Two-Stage Linkage      | Some Options
Ward’s Method          | Yes
Centroid Linkage       | Yes
Complete Linkage       | Yes
Density Linkage        | Some Options
EML                    | No
Flexible-Beta Method   | Yes
McQuitty’s Similarity  | Yes
Median Linkage         | Yes
Single Linkage         | Yes
100

The TREE Procedure
General form of the TREE procedure:

PROC TREE DATA=SAS-data-set;
RUN;

The TREE procedure either
– displays the dendrogram (LEVEL= option), or
– assigns the observations to a specified number of clusters (NCLUSTERS= option).
101
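A minimal end-to-end sketch (data set and variable names are invented): PROC CLUSTER writes the tree, and PROC TREE either draws it or cuts it into a chosen number of clusters:

proc cluster data=customers_std method=average outtree=tree noprint;
   var income age spending;
run;

proc tree data=tree;                          /* display the dendrogram */
run;

proc tree data=tree nclusters=3 out=clus_assign noprint;
   copy income age spending;                  /* keep the inputs with the cluster assignments */
run;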

Chapter 4: Hierarchical Clustering
4.1 Introduction
4.2 Hierarchical Clustering Methods
102

Average Linkage
The distance between clusters is the average distance between pairs of observations:
$$D_{KL} = \frac{1}{n_K n_L} \sum_{i \in C_K} \sum_{j \in C_L} d(x_i, x_j)$$
103

Two-Stage Density Linkage
A nonparametric density estimate is used to determine distances and to recover irregularly shaped clusters.
[Figure: 1. Form “modal” clusters; 2. Apply single linkage between modal clusters K and L (distance D_KL)]
104

Ward’s Method
Ward’s method uses ANOVA at each fusion point to determine whether the proposed fusion is warranted:
$$D_{KL} = \frac{\lVert \bar{x}_K - \bar{x}_L \rVert^2}{\frac{1}{n_K} + \frac{1}{n_L}}$$
106

Additional Clustering Methods
[Figure: schematic comparison of Centroid Linkage, Complete Linkage, Density Linkage, and Single Linkage between clusters C_K and C_L]
107

Comparing Hierarchical Clustering Methods
This demonstration illustrates the concepts discussed previously.
clus04d01.sas
112

Chapter 5: Assessing Clustering Results
5.1 Determining the Number of Clusters
5.2 Cluster Profiling
5.3 Scoring New Observations
113

Interpreting Dendrograms
For interpreting any hierarchical clustering method:
[Figure: dendrogram; a large change in fusion level suggests preferring 3 clusters]
115

Cubic Clustering Criterion
$$CCC = \ln\!\left[\frac{1 - E(R^2)}{1 - R^2}\right] \cdot \frac{\sqrt{\frac{np^*}{2}}}{\left(0.001 + E(R^2)\right)^{1.2}}$$

Sarle’s Cubic Clustering Criterion compares observed and expected R² values.

It tests the null hypothesis (H0) that the data was sampled from a uniform distribution over a hyper-box.

CCC values greater than 2 suggest there is sufficient evidence of cluster structure (reject H0).

Join clusters at local MAXIMA of CCC.
116

Other Useful Statistics

Pseudo-F statistic:
$$PSF = \frac{B/(g-1)}{W/(n-g)}$$
Join clusters if the statistic is at a local MAXIMUM.

Pseudo-T² statistic:
$$PST2 = \frac{W_m - W_k - W_l}{(W_k + W_l)/(n_k + n_l - 2)}$$
Join clusters if the T² statistic is at a local MINIMUM.
117

Interpreting PSF and PST2
[Figure: candidate numbers of clusters marked on plots of the pseudo-F and pseudo-T² statistics]
118

Chapter 5: Assessing Clustering Results
5.1 Determining the Number of Clusters
5.2 Cluster Profiling
5.3 Scoring New Observations
120

Cluster Profiling

Generation of unique cluster descriptions from the input variables.

It can be implemented using many approaches:
– Generate the “typical” member of each cluster.
– Use ANOVA to determine the inputs that uniquely define each of the typical members.
– Use graphs to compare and describe the clusters.

In addition, one can compare each cluster against the whole cluster population.
121

One-Against-All Comparison
1. For cluster k, classify each observation as being a member of cluster k (with a value of 1) or not a member of cluster k (with a value of 0).
2. Use logistic regression to rank the input variables by their ability to distinguish cluster k from the others.
3. Generate a comparative plot of cluster k and the rest of the data.
122
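A hedged sketch of steps 1 and 2 (data set, variable, and cluster names are invented):

data one_vs_all;
   set clus_assign;                 /* clustering output containing a CLUSTER variable */
   in_cluster3 = (cluster = 3);     /* 1 = member of cluster 3, 0 = everything else */
run;

proc logistic data=one_vs_all;
   model in_cluster3(event='1') = income age spending;
run;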

Chapter 5: Assessing Clustering Results
5.1 Determining the Number of Clusters
5.2 Cluster Profiling
5.3 Scoring New Observations
124

Scoring PROC FASTCLUS Results
1. Perform cluster analysis and save the centroids.

PROC FASTCLUS OUTSTAT=centroids;

2. Load the saved centroids and score a new file.

PROC FASTCLUS INSTAT=centroids OUT=SAS-data-set;

125

Scoring PROC CLUSTER Results
1. Perform the hierarchical cluster analysis.

PROC CLUSTER METHOD=method OUTTREE=tree;
   VAR variables;
RUN;

2. Generate the cluster assignments.

PROC TREE DATA=tree N=nclusters OUT=treeout;
RUN;

126
continued...

Scoring PROC CLUSTER Results
3. Calculate the cluster centroids.

PROC MEANS DATA=treeout;
   CLASS cluster;
   OUTPUT MEAN= OUT=centroids;
RUN;

4. Read the centroids and score the new file.

PROC FASTCLUS DATA=newdata SEED=centroids
              MAXCLUSTERS=n MAXITER=0 OUT=results;
RUN;

127

Chapter 6: Cluster Analysis Case Study
6.1 Happy Household Case Study
129

The Happy Household Catalog
A retail catalog company with a strong online presence monitors quarterly purchasing behavior for its customers, including sales figures summarized across departments and quarterly totals for 5.5 years of sales.
– HH wants to improve customer relations by tailoring promotions to customers based on their preferred type of shopping experience.
– Customer preferences are difficult to ascertain based solely on opportunistic data.
130

Cluster Analysis as a Predictive Modeling Tool
The marketing team gathers questionnaire data:
– Identify patterns in customer attitudes toward shopping.
– Generate attitude profiles (clusters) and tie them to specific marketing promotions.
– Use the attitude profiles as the target variable in a predictive model with shopping behavior as inputs.
– Score the large customer database (n=48K) using the predictive model, and assign promotions based on predicted cluster groupings.
131

Preparation for Clustering
1. Data and Sample Selection
2. Variable Selection (What characteristics matter?)
3. Graphical Exploration (What shape/how many
clusters?)
4. Variable Standardization (Are variable scales
comparable?)
5. Variable Transformation (Are variables correlated? Are
clusters elongated?)
132

Data and Sample Selection
A study is conducted to identify patterns in customer
attitudes toward shopping
Online customers are asked to complete a questionnaire during a visit to the company’s retail Web site. A sample of 200 completed questionnaires is analyzed.
133

Preparation for Clustering
1. Data and Sample Selection (Who am I clustering?)
2. Variable Selection
3. Graphical Exploration (What shape/how many
clusters?)
4. Variable Standardization (Are variable scales
comparable?)
5. Variable Transformation (Are variables correlated? Are
clusters elongated?)
134

Variable Selection
This demonstration illustrates the concepts
discussed previously.
135
clus06d01.sas

What Have You Learned?
Three variables will be used for cluster analysis:
HH5  – I prefer to shop online rather than offline.
HH10 – I believe that good service is the most important thing a company can provide.
HH11 – Good value for the money is hard to find.
136

Preparation for Clustering
1. Data and Sample Selection (Who am I clustering?)
2. Variable Selection (What characteristics matter?)
3. Graphical Exploration
4. Variable Standardization (Are variable scales comparable?)
5. Variable Transformation (Are variables correlated? Are clusters elongated?)
137

Graphical Exploration of
Selected Variables
This demonstration illustrates the concepts
discussed previously.
138
clus06d02.sas

Preparation for Clustering
1. Data and Sample Selection (Who am I clustering?)
2. Variable Selection (What characteristics matter?)
3. Graphical Exploration (What shape/how many
clusters?)
4. Variable Standardization
5. Variable Transformation
139

What Have You Learned?
– Standardization is unnecessary in this example because all variables are on the same scale of measurement.
– Transformation might be unnecessary in this example because there is no evidence of elongated cluster structure in the plots, and the variables have low correlation.
140

Selecting a Clustering Method
– With 200 observations, it is a good idea to use a hierarchical clustering technique.
– Ward’s method is selected for ease of interpretation.
– Select the number of clusters with CCC, PSF, and PST2.
– Use cluster plots to assist in providing cluster labels.
141
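A hedged sketch of this plan (the data set name is invented; HH5, HH10, and HH11 are the selected questionnaire items): request the CCC and pseudo statistics from PROC CLUSTER, pick the number of clusters from them, and then cut the tree with PROC TREE:

proc cluster data=hh_sample method=ward ccc pseudo outtree=hh_tree;
   var HH5 HH10 HH11;
run;

proc tree data=hh_tree nclusters=7 out=hh_clusters noprint;
   copy HH5 HH10 HH11;
run;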

Hierarchical Clustering and
Determining the Number of
Clusters
This demonstration illustrates the concepts
discussed previously.
142
clus06d03.sas

Profiling the Clusters
– There are seven clusters.
– There are three marketing promotions.
– Determine whether the seven cluster profiles are good complements to the three marketing promotions.
– Otherwise, try another number of clusters.
143

Profiling the Seven-Cluster
Solution
This demonstration illustrates the concepts
discussed previously.
144
clus06d04.sas

What Have You Learned?
145

What Have You Learned?
146

What Will You Offer?
Offer 1: Coupon for free shipping if > 6 months since the last purchase.
Offer 2: Fee-based membership in an exclusive club to get “valet” service and a personal (online) shopper.
Offer 3: Coupon for a product of a brand different from those previously purchased.

Cluster profiles:
1. Discriminating online tastes
2. Savings and service anywhere
3. Values in-store service
4. Seeks in-store savings
5. Reluctant shopper, online
6. Reluctant shopper, in-store
7. Seeks online savings

Offers will be made based on cluster classification and a high customer lifetime value score.
147–148

Predictive Modeling
The marketing team can choose from a variety of predictive modeling tools, including logistic regression, decision trees, neural networks, and discriminant analysis.
Logistic regression and neural networks are set aside because of the small sample and the large number of input variables.
Discriminant analysis is used in this example.

PROC DISCRIM DATA=data-set-1;
   <PRIORS priors-specification;>
   CLASS cluster-variable;
   VAR input-variables;
RUN;

149

Modeling Cluster Membership
This demonstration illustrates the concepts
discussed previously.
150
clus0605.sas

Scoring the Database
Once a model has been developed to predict cluster membership from purchasing data, the full customer database can be scored. Customers are offered specific promotions based on predicted cluster membership.

PROC DISCRIM DATA=data-set-1
             TESTDATA=data-set-2 TESTOUT=scored-data;
   PRIORS priors-specification;
   CLASS cluster-variable;
   VAR input-variables;
RUN;

151
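A hedged, concrete version of this step (data set and variable names are invented; the cluster variable is assumed to come from the earlier clustering run):

proc discrim data=hh_sample testdata=hh_database testout=hh_scored;
   priors proportional;            /* weight classes by their sample proportions */
   class cluster;
   var purchase1-purchase10;       /* hypothetical behavioral inputs */
run;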

Let’s Cluster the World!
152
