Build a recommendation system using R

1. Data load
- 1. 1. Format transformation
2. Calculate similarity
3. Predict user ratings
4. Summary

Required R packages: dplyr, data.table, reshape2

library(dplyr)
library(data.table)
library(reshape2)

1. Data load

The example uses a small data set in which six users rate six movies on a scale of 0 to 5. It is light enough to try out very easily.

Source: https://github.com/sureshgorakala/RecommenderSystems_R/movie_rating.csv

data1 <- read.table(choose.files(), header = TRUE, sep = ",")

knitr::kable(data1)

critic	title	rating
Jack Matthews	Lady in the Water	3.0
Jack Matthews	Snakes on a Plane	4.0
Jack Matthews	You Me and Dupree	3.5
Jack Matthews	Superman Returns	5.0
Jack Matthews	The Night Listener	3.0
Mick LaSalle	Lady in the Water	3.0
Mick LaSalle	Snakes on a Plane	4.0
Mick LaSalle	Just My Luck	2.0
Mick LaSalle	Superman Returns	3.0
Mick LaSalle	You Me and Dupree	2.0
Mick LaSalle	The Night Listener	3.0
Claudia Puig	Snakes on a Plane	3.5
Claudia Puig	Just My Luck	3.0
Claudia Puig	You Me and Dupree	2.5
Claudia Puig	Superman Returns	4.0
Claudia Puig	The Night Listener	4.5
Lisa Rose	Lady in the Water	2.5
Lisa Rose	Snakes on a Plane	3.5
Lisa Rose	Just My Luck	3.0
Lisa Rose	Superman Returns	3.5
Lisa Rose	The Night Listener	3.0
Lisa Rose	You Me and Dupree	2.5
Toby	Snakes on a Plane	4.5
Toby	Superman Returns	4.0
Toby	You Me and Dupree	1.0
Gene Seymour	Lady in the Water	3.0
Gene Seymour	Snakes on a Plane	3.5
Gene Seymour	Just My Luck	1.5
Gene Seymour	Superman Returns	5.0
Gene Seymour	You Me and Dupree	3.5
Gene Seymour	The Night Listener	3.0

1. 1. Format transformation

Before building the recommendation system, we need a matrix whose rows contain users, whose columns contain items, and whose cells contain each user's rating of each item.

The acast() function in the reshape2 package can do this. acast() converts a data frame into a matrix.

acast(data, formula, value.var)

data : a data.frame
formula : row ~ column variable names
value.var : name of the column that supplies the cell values
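
To see what the three arguments do, here is a minimal sketch on a made-up long-format table (the `toy` data frame and its values are hypothetical, not part of the movie data). It assumes reshape2 is installed.

```r
library(reshape2)

# Hypothetical toy data in long format: one row per (critic, title) rating.
toy <- data.frame(
  critic = c("A", "A", "B"),
  title  = c("Movie 1", "Movie 2", "Movie 1"),
  rating = c(4.0, 2.5, 3.0),
  stringsAsFactors = FALSE
)

# Rows come from the left side of the formula, columns from the right side,
# and the cells are filled from the column named in value.var.
toy_matrix <- acast(toy, title ~ critic, value.var = "rating")
toy_matrix
```

Critic B never rated "Movie 2", so that cell comes back as NA, which is the same pattern the movie matrix shows for unrated films.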

movie_ratings <- acast(data1, title ~ critic, value.var = "rating")
movie_ratings <- as.data.frame(movie_ratings)

knitr::kable(movie_ratings)

	Claudia Puig	Gene Seymour	Jack Matthews	Lisa Rose	Mick LaSalle	Toby
Just My Luck	3.0	1.5	NA	3.0	2	NA
Lady in the Water	NA	3.0	3.0	2.5	3	NA
Snakes on a Plane	3.5	3.5	4.0	3.5	4	4.5
Superman Returns	4.0	5.0	5.0	3.5	3	4.0
The Night Listener	4.5	3.0	3.0	3.0	3	NA
You Me and Dupree	2.5	3.5	3.5	2.5	2	1.0

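In the converted matrix (movies in rows, critics in columns), we can see that Toby rated only three movies.
From here, we want to recommend to each user the movies they have not yet rated, based on the ratings of similar users.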

2. Calculate similarity

There are many ways to measure similarity, including Euclidean distance, cosine distance, Pearson's correlation coefficient, and Jaccard distance.
Here we use correlation as the similarity measure between users. Correlation shows how strongly two items are associated, or how closely two item vectors covary.
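
To make the measure concrete, the sketch below computes the Pearson correlation by hand for two critics, using the three movies that both Toby and Lisa Rose rated in the matrix above, and compares it with base R's cor(). The helper names `pearson` and `cosine` are made up for this illustration; cosine similarity is included only for contrast.

```r
# Ratings for the three movies both critics rated, taken from the matrix above
# (Snakes on a Plane, Superman Returns, You Me and Dupree).
toby <- c(4.5, 4.0, 1.0)
lisa <- c(3.5, 3.5, 2.5)

# Pearson correlation: sum of products of the centered vectors divided by
# the product of their norms.
pearson <- function(a, b) {
  a_c <- a - mean(a)
  b_c <- b - mean(b)
  sum(a_c * b_c) / sqrt(sum(a_c^2) * sum(b_c^2))
}

# Cosine similarity: the same formula without centering.
cosine <- function(a, b) {
  sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

r_manual  <- pearson(toby, lisa)
r_builtin <- cor(toby, lisa)   # base R, Pearson by default
cos_sim   <- cosine(toby, lisa)
```

The hand-written version and cor() should agree up to floating-point error, which is a quick check that the formula is the one cor() applies.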

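The function is cor(). We pass the use = "complete.obs" argument, which computes the correlations using only the complete observations, that is, the rows (movies) that have no missing values.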

sim_users <- cor(movie_ratings[, 1:6], use = "complete.obs")
sim_users

##               Claudia Puig Gene Seymour Jack Matthews Lisa Rose
## Claudia Puig     1.0000000    0.7559289     0.9285714 0.9449112
## Gene Seymour     0.7559289    1.0000000     0.9449112 0.5000000
## Jack Matthews    0.9285714    0.9449112     1.0000000 0.7559289
## Lisa Rose        0.9449112    0.5000000     0.7559289 1.0000000
## Mick LaSalle     0.6546537    0.0000000     0.3273268 0.8660254
## Toby             0.8934051    0.3812464     0.6628490 0.9912407
##               Mick LaSalle      Toby
## Claudia Puig     0.6546537 0.8934051
## Gene Seymour     0.0000000 0.3812464
## Jack Matthews    0.3273268 0.6628490
## Lisa Rose        0.8660254 0.9912407
## Mick LaSalle     1.0000000 0.9244735
## Toby             0.9244735 1.0000000

Relative to Toby, Lisa Rose (0.99) and Mick LaSalle (0.92) turn out to be very similar.
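
It can help to see which movies "complete.obs" actually keeps. The sketch below rebuilds the rating matrix by hand from the table above (the object names `ratings`, `sim_complete` and `sim_pairwise` are chosen for this example) and contrasts it with "pairwise.complete.obs", another option of cor() that uses every movie a given pair of critics has in common.

```r
critics <- c("Claudia Puig", "Gene Seymour", "Jack Matthews",
             "Lisa Rose", "Mick LaSalle", "Toby")
titles <- c("Just My Luck", "Lady in the Water", "Snakes on a Plane",
            "Superman Returns", "The Night Listener", "You Me and Dupree")

# Same values as the movie_ratings table, entered row by row.
ratings <- matrix(c(
  3.0, 1.5,  NA, 3.0, 2,  NA,
   NA, 3.0, 3.0, 2.5, 3,  NA,
  3.5, 3.5, 4.0, 3.5, 4, 4.5,
  4.0, 5.0, 5.0, 3.5, 3, 4.0,
  4.5, 3.0, 3.0, 3.0, 3,  NA,
  2.5, 3.5, 3.5, 2.5, 2, 1.0
), nrow = 6, byrow = TRUE, dimnames = list(titles, critics))

# Movies with no missing rating: the only rows "complete.obs" uses.
complete_titles <- rownames(ratings)[complete.cases(ratings)]

sim_complete <- cor(ratings, use = "complete.obs")
sim_pairwise <- cor(ratings, use = "pairwise.complete.obs")
```

Only three movies survive the complete-case filter, so every correlation in the matrix above rests on three data points. Toby's column is the same under both options because he rated only those three movies; the other pairs can differ.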

3. Predict user ratings

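Let's predict ratings for the movies Toby has not rated. This requires the following steps.
1. Pick out the movies Toby has not yet rated.
2. Separate out all the ratings that other users gave to those movies.
3. Multiply the ratings that every user except Toby gave to those movies by the similarity between that user and Toby.
4. For each movie, sum the weighted ratings from the previous step and divide by the sum of the users' similarity values.
Here we use the data.table package. data.table provides an enhanced data.frame that improves on speed and functionality. It is also designed for large in-memory data sets; its package description cites data on the order of 100GB in RAM.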
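
Steps 3 and 4 above amount to a similarity-weighted average. As a minimal sketch, here is that arithmetic for a single movie, "The Night Listener", with the ratings and the similarities to Toby copied by hand from the tables earlier in this post. The variable names are chosen for this example.

```r
# "The Night Listener": ratings from the five critics who rated it, and each
# critic's similarity to Toby (values copied from the tables in this post).
rating     <- c(4.5, 3.0, 3.0, 3.0, 3.0)
similarity <- c(0.8934051, 0.3812464, 0.6628490, 0.9912407, 0.9244735)
names(rating) <- names(similarity) <-
  c("Claudia Puig", "Gene Seymour", "Jack Matthews", "Lisa Rose", "Mick LaSalle")

# Step 3: weight each rating by the critic's similarity to Toby.
sim_rating <- rating * similarity

# Step 4: divide the weighted sum by the sum of the similarities.
predicted <- sum(sim_rating) / sum(similarity)

# Base R's weighted.mean() performs the same calculation.
predicted_wm <- weighted.mean(rating, similarity)
```

Critics who resemble Toby more pull the prediction toward their own rating, which is why Claudia Puig's 4.5 lifts the result above 3.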

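Selecting the movies Toby has not rated

We use setDT() from the data.table package, which converts a data frame to a data.table and can keep the row names (here the movie titles) as a column. First, extract Toby's ratings.

# 6th user: Toby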
rating_critic <- setDT(movie_ratings[colnames(movie_ratings)[6]],
                       keep.rownames = TRUE)[]
names(rating_critic) <- c("title", "rating")
rating_critic

##                 title rating
## 1:       Just My Luck     NA
## 2:  Lady in the Water     NA
## 3:  Snakes on a Plane    4.5
## 4:   Superman Returns    4.0
## 5: The Night Listener     NA
## 6:  You Me and Dupree    1.0

Next, separate out only the unrated movies. The is.na() function is used to pick out the NA values.

titles_NA_critic <- rating_critic$title[is.na(rating_critic$rating)]
titles_NA_critic

## [1] "Just My Luck"       "Lady in the Water"  "The Night Listener"

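Separating out all the ratings that other users gave to those movies

Next, from the original data set, extract the rows in which other users rated the movies Toby has not rated. Here %in% behaves like an IN condition in a SQL WHERE clause.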

ratings_t <- data1[data1$title %in% titles_NA_critic, ]
ratings_t

##           critic              title rating
## 1  Jack Matthews  Lady in the Water    3.0
## 5  Jack Matthews The Night Listener    3.0
## 6   Mick LaSalle  Lady in the Water    3.0
## 8   Mick LaSalle       Just My Luck    2.0
## 11  Mick LaSalle The Night Listener    3.0
## 13  Claudia Puig       Just My Luck    3.0
## 16  Claudia Puig The Night Listener    4.5
## 17     Lisa Rose  Lady in the Water    2.5
## 19     Lisa Rose       Just My Luck    3.0
## 21     Lisa Rose The Night Listener    3.0
## 26  Gene Seymour  Lady in the Water    3.0
## 28  Gene Seymour       Just My Luck    1.5
## 31  Gene Seymour The Night Listener    3.0
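
As a small illustration of how %in% drives this kind of row filter, the sketch below uses a made-up four-row table (`ratings_small` and `wanted` are hypothetical names and values, not the real data).

```r
# Hypothetical miniature of the ratings table.
ratings_small <- data.frame(
  critic = c("A", "A", "B", "B"),
  title  = c("Just My Luck", "Superman Returns",
             "Just My Luck", "The Night Listener"),
  rating = c(3.0, 4.0, 2.0, 3.0),
  stringsAsFactors = FALSE
)
wanted <- c("Just My Luck", "The Night Listener")

# %in% returns one TRUE/FALSE per element of the left-hand vector.
keep <- ratings_small$title %in% wanted

# Using the logical vector as a row index keeps only the matching rows.
subset_rows <- ratings_small[keep, ]
```

The "Superman Returns" row is dropped because its title is not in `wanted`; the other three rows are kept.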

Next, add a similarity variable holding each user's similarity to Toby.

x <- setDT(data.frame(sim_users[, 6]),
           keep.rownames = TRUE)[]
names(x) <- c("critic", "similarity")

ratings_t <- merge(ratings_t, x,
                   by = "critic", all.x = TRUE)

knitr::kable(ratings_t)

critic	title	rating	similarity
Claudia Puig	Just My Luck	3.0	0.8934051
Claudia Puig	The Night Listener	4.5	0.8934051
Gene Seymour	Lady in the Water	3.0	0.3812464
Gene Seymour	Just My Luck	1.5	0.3812464
Gene Seymour	The Night Listener	3.0	0.3812464
Jack Matthews	Lady in the Water	3.0	0.6628490
Jack Matthews	The Night Listener	3.0	0.6628490
Lisa Rose	Lady in the Water	2.5	0.9912407
Lisa Rose	Just My Luck	3.0	0.9912407
Lisa Rose	The Night Listener	3.0	0.9912407
Mick LaSalle	Lady in the Water	3.0	0.9244735
Mick LaSalle	Just My Luck	2.0	0.9244735
Mick LaSalle	The Night Listener	3.0	0.9244735

Multiplying ratings by similarity values

Multiply each rating by the similarity value and store the result in a new variable, sim_rating.

ratings_t$sim_rating <- ratings_t$rating * ratings_t$similarity

knitr::kable(ratings_t)

critic	title	rating	similarity	sim_rating
Claudia Puig	Just My Luck	3.0	0.8934051	2.6802154
Claudia Puig	The Night Listener	4.5	0.8934051	4.0203232
Gene Seymour	Lady in the Water	3.0	0.3812464	1.1437393
Gene Seymour	Just My Luck	1.5	0.3812464	0.5718696
Gene Seymour	The Night Listener	3.0	0.3812464	1.1437393
Jack Matthews	Lady in the Water	3.0	0.6628490	1.9885469
Jack Matthews	The Night Listener	3.0	0.6628490	1.9885469
Lisa Rose	Lady in the Water	2.5	0.9912407	2.4781018
Lisa Rose	Just My Luck	3.0	0.9912407	2.9737221
Lisa Rose	The Night Listener	3.0	0.9912407	2.9737221
Mick LaSalle	Lady in the Water	3.0	0.9244735	2.7734204
Mick LaSalle	Just My Luck	2.0	0.9244735	1.8489469
Mick LaSalle	The Night Listener	3.0	0.9244735	2.7734204

Summing the weighted ratings for each movie and dividing by the sum of the users' similarity values

This can be computed with the group_by() and summarise() functions from the dplyr package.

result1 <- ratings_t %>% group_by(title) %>% summarise( sum(sim_rating) / sum(similarity) )
result1

## # A tibble: 3 x 2
##                title `sum(sim_rating)/sum(similarity)`
##               <fctr>                             <dbl>
## 1       Just My Luck                          2.530981
## 2  Lady in the Water                          2.832550
## 3 The Night Listener                          3.347790

This gives the predicted ratings for the movies Toby has not rated.
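
The result column above carries the whole expression as its name. A possible variation, sketched here on just the "Just My Luck" rows copied from the ratings_t table, is to name the summary column inside summarise(). The name `predicted` is an arbitrary choice for this example.

```r
library(dplyr)

# Rows for "Just My Luck", copied from the ratings_t table above.
luck <- data.frame(
  title      = "Just My Luck",
  rating     = c(3.0, 1.5, 3.0, 2.0),
  similarity = c(0.8934051, 0.3812464, 0.9912407, 0.9244735)
)

# Naming the summary column avoids the backtick-quoted default name.
result_named <- luck %>%
  group_by(title) %>%
  summarise(predicted = sum(rating * similarity) / sum(similarity))
```

A named column is easier to refer to in later steps, for example when filtering or sorting the predictions.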

We can now recommend the movies whose predicted rating is higher than the average of the ratings Toby has given.
```
mean(rating_critic$rating, na.rm = TRUE)
```
```
## [1] 3.166667
```
- Toby's average rating is 3.17, so we recommend the movies whose predicted rating is above that average. The Night Listener can therefore be recommended.
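
The comparison against the average can also be done in code. This is a minimal sketch that re-enters Toby's ratings and the three predicted values from above by hand; the variable names are chosen for this example.

```r
# Toby's own ratings and the predicted ratings computed above.
toby_ratings <- c(4.5, 4.0, 1.0)
predicted <- c("Just My Luck"       = 2.530981,
               "Lady in the Water"  = 2.832550,
               "The Night Listener" = 3.347790)

toby_mean <- mean(toby_ratings)

# Keep the titles whose predicted rating exceeds Toby's average,
# highest prediction first.
above <- predicted[predicted > toby_mean]
recommended <- names(sort(above, decreasing = TRUE))
recommended
```

Using the user's own mean as the cutoff is the rule this post follows; a fixed threshold or a top-N list would be equally simple to swap in.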

4. Summary

To extend the steps above to generate recommendations for every user, we can write a function like the following.

# INPUT : user index (column number in movie_ratings)
generate_Recomm <- function(user) {

  # 3. 1.
  rating_critic <- setDT(movie_ratings[colnames(movie_ratings)[user]],
                         keep.rownames = TRUE)[]
  names(rating_critic) <- c("title", "rating")

  # 3. 2.
  title_NA_critic <- rating_critic$title[is.na(rating_critic$rating)]

  ratings_t <- data1[data1$title %in% title_NA_critic, ]

  x <- setDT(data.frame(sim_users[, user]),
             keep.rownames = TRUE)[]
  names(x) <- c("critic", "similarity")

  ratings_t <- merge(ratings_t, x,
                     by = "critic", all.x = TRUE)

  # 3. 3.
  ratings_t$sim_rating <- ratings_t$rating * ratings_t$similarity

  # 3. 4.
  result <- ratings_t %>% group_by(title) %>% summarise( sum(sim_rating) / sum(similarity) )

  return(result)
}

for (i in 1:6) {
  print( generate_Recomm(i) )
}

## # A tibble: 1 x 2
##               title `sum(sim_rating)/sum(similarity)`
##              <fctr>                             <dbl>
## 1 Lady in the Water                          2.856137
## # A tibble: 0 x 2
## # ... with 2 variables: title <fctr>,
## #   sum(sim_rating)/sum(similarity) <lgl>
## # A tibble: 1 x 2
##          title `sum(sim_rating)/sum(similarity)`
##         <fctr>                             <dbl>
## 1 Just My Luck                          2.409926
## # A tibble: 0 x 2
## # ... with 2 variables: title <fctr>,
## #   sum(sim_rating)/sum(similarity) <lgl>
## # A tibble: 0 x 2
## # ... with 2 variables: title <fctr>,
## #   sum(sim_rating)/sum(similarity) <lgl>
## # A tibble: 3 x 2
##                title `sum(sim_rating)/sum(similarity)`
##               <fctr>                             <dbl>
## 1       Just My Luck                          2.530981
## 2  Lady in the Water                          2.832550
## 3 The Night Listener                          3.347790
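
For comparison, the same weighted-average logic can be sketched in base R alone, working directly on the rating matrix. This is an alternative illustration, not the function above: `predict_unrated` is a hypothetical helper, it takes a critic's name instead of a column index, and the matrix is re-entered by hand from the movie_ratings table.

```r
critics <- c("Claudia Puig", "Gene Seymour", "Jack Matthews",
             "Lisa Rose", "Mick LaSalle", "Toby")
titles <- c("Just My Luck", "Lady in the Water", "Snakes on a Plane",
            "Superman Returns", "The Night Listener", "You Me and Dupree")
ratings <- matrix(c(
  3.0, 1.5,  NA, 3.0, 2,  NA,
   NA, 3.0, 3.0, 2.5, 3,  NA,
  3.5, 3.5, 4.0, 3.5, 4, 4.5,
  4.0, 5.0, 5.0, 3.5, 3, 4.0,
  4.5, 3.0, 3.0, 3.0, 3,  NA,
  2.5, 3.5, 3.5, 2.5, 2, 1.0
), nrow = 6, byrow = TRUE, dimnames = list(titles, critics))

sim <- cor(ratings, use = "complete.obs")

# Hypothetical helper: similarity-weighted predictions for one critic.
predict_unrated <- function(ratings, sim, user) {
  unrated <- rownames(ratings)[is.na(ratings[, user])]
  vapply(unrated, function(movie) {
    r <- ratings[movie, ]
    rated <- !is.na(r)   # critics who rated this movie
    sum(r[rated] * sim[rated, user]) / sum(sim[rated, user])
  }, numeric(1))
}

# One named numeric vector of predictions per critic.
predictions <- lapply(setNames(critics, critics),
                      function(u) predict_unrated(ratings, sim, u))
```

Critics who have rated every movie get an empty vector, which corresponds to the zero-row tibbles in the output above.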
