Machine Learning in Action ch.11.3 & 11.4 어프라이어리 알고리즘

- 2월 09, 2016

11.3 어프라이어리 알고리즘으로 빈발 아이템 집합 찾기

연관 분석은 빈발 아이템 집합과 연관 규칙이라는 두 가지로 표현된다고 했다.
빈발 아이템 집합을 먼저 찾아야만 연관 규칙을 찾을 수 있다.

빈발 아이템 집합을 찾는 방법이 어프라이어리 알고리즘이다.

11.3.1 후보 아이템 집합 생성

부분집합이 되는 아이템들을 찾기 위해 데이터 집합을 살펴보는 함수를 생성한다.

ef createC1(dataSet):
C1 = []
for transaction in dataSet:
for item in transaction:
if not [item] in C1:
C1.append([item])

C1.sort()
return map(frozenset, C1)#use frozen set so we
#can use it as a key in a dict

def scanD(D, Ck, minSupport):
ssCnt = {}
for tid in D:
for can in Ck:
if can.issubset(tid):
if not ssCnt.has_key(can): ssCnt[can]=1
else: ssCnt[can] += 1
numItems = float(len(D))
retList = []
supportData = {}
for key in ssCnt:
support = ssCnt[key]/numItems
if support >= minSupport:
retList.insert(0,key)
supportData[key] = support
return retList, supportData

1번째 함수는 하나의 후보 아이템 집합인 C1 을 만드는 함수이다.
2번째 함수는 지지도를 측정하고 비교해서 버릴 지 말지 스캔하는 함수이다.

11.3.2 전체 어프라이어리 알고리즘 사용

def aprioriGen(Lk, k): #creates Ck
retList = []
lenLk = len(Lk)
for i in range(lenLk):
for j in range(i+1, lenLk):
L1 = list(Lk[i])[:k-2]; L2 = list(Lk[j])[:k-2]
L1.sort(); L2.sort()
if L1==L2: #if first k-2 elements are equal
retList.append(Lk[i] | Lk[j]) #set union
return retList

def apriori(dataSet, minSupport = 0.5):
C1 = createC1(dataSet)
D = map(set, dataSet)
L1, supportData = scanD(D, C1, minSupport)
L = [L1]
k = 2
while (len(L[k-2]) > 0):
Ck = aprioriGen(L[k-2], k)
Lk, supK = scanD(D, Ck, minSupport)#scan DB to get Lk
supportData.update(supK)
L.append(Lk)
k += 1
return L, supportData

11.4 빈발 아이템 집합으로 연관 규칙 마이닝하기

11.3 에는 빈발 아이템 집합을 알아봤고

이제 연관 규칙 찾는 방법을 알아보자.

우선 연관 규칙을 찾는 것은 빈발 아이템 집합을 가지고 시작한다.
만약, {두유,상추}라는 빈발 아이템 집합을 가지고 있다면, 연관 규칙의 한 예는 두유 -> 상추가 된다. 이러한 표현은 두유를 구매한 사람은 통계적으로 상추도 구매할 확률이 크다는 것이다.

우리는 가능한 규칙 리스트를 생성한 다음 각 규칙에 대한 신뢰도를 검토하고 최소 요구 사항을 만족하지 못한다면 , 이 규칙은 버린다.

아래의 그림을 보면 0,1,2 -> 3 이라는 규칙이 최소 신뢰도를 만족하지 못한다고 가정해보자. 그렇다면 3을 결과로 갖고 {0.1.2}의 부분집합을 원인으로 하는 집합들은 버린다.

이렇게 우리는 연관 규칙의 이러한 속성을 사용하여 검사를 통해 규칙의 개수를 줄일 수 있다.

코드로 표현하면 아래와 같다.

def generateRules(L, supportData, minConf=0.7): #supportData is a dict coming from scanD
bigRuleList = []
for i in range(1, len(L)):#only get the sets with two or more items
for freqSet in L[i]:
H1 = [frozenset([item]) for item in freqSet]
if (i > 1):
rulesFromConseq(freqSet, H1, supportData, bigRuleList, minConf)
else:
calcConf(freqSet, H1, supportData, bigRuleList, minConf)
return bigRuleList

def calcConf(freqSet, H, supportData, brl, minConf=0.7):
prunedH = [] #create new list to return
for conseq in H:
conf = supportData[freqSet]/supportData[freqSet-conseq] #calc confidence
if conf >= minConf:
print freqSet-conseq,'-->',conseq,'conf:',conf
brl.append((freqSet-conseq, conseq, conf))
prunedH.append(conseq)
return prunedH

def rulesFromConseq(freqSet, H, supportData, brl, minConf=0.7):
m = len(H[0])
if (len(freqSet) > (m + 1)): #try further merging
Hmp1 = aprioriGen(H, m+1)#create Hm+1 new candidates
Hmp1 = calcConf(freqSet, Hmp1, supportData, brl, minConf)
if (len(Hmp1) > 1): #need at least two sets to merge
rulesFromConseq(freqSet, Hmp1, supportData, brl, minConf)

이 블로그 검색

Computer & Books