[R] 11. t-Test

Example data used in this post

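library(tidyverse)  # provides tibble(), mutate(), case_when(), %>% and map()

set.seed(2021)

# Generate random data (streaming history of 150 users for particular songs)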
temp <- tibble(
  user_id = c(10000:10149),
  user_age = sample(x = round(runif(n = 50, min = 18, max = 50), 0), size = 150, replace = TRUE),
  user_gender = sample(x = c("남성", "여성"), size = 150, prob = c(0.5, 0.5), replace = TRUE),
  song_id = sample(x = letters[1:15], size = 150, replace = TRUE),
  streaming_count = rpois(n = 150, lambda = 20),
  download_count = rpois(n = 150, lambda = 5)
) %>% 
  mutate(
    song_class_flag = case_when(
      song_id %in% c("d", "e", "f") ~ "인기곡",
      TRUE ~ "비인기곡"
    )
  )

temp

## # A tibble: 150 x 7
##    user_id user_age user_gender song_id streaming_count download_count
##      <int>    <dbl> <chr>       <chr>             <int>          <int>
##  1   10000       44 여성        e                    20              6
##  2   10001       47 남성        f                    21              6
##  3   10002       49 남성        k                    14              3
##  4   10003       44 남성        j                     8              4
##  5   10004       26 여성        f                    20              5
##  6   10005       44 여성        j                    17              2
##  7   10006       20 여성        j                    24              3
##  8   10007       27 남성        l                    24              1
##  9   10008       34 여성        f                    20              5
## 10   10009       26 남성        d                    27              6
## # … with 140 more rows, and 1 more variable: song_class_flag <chr>

t-Test

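The t-test is commonly used in experimental settings such as A/B tests to check whether the means of two groups differ.
A test that compares the means of two independent groups is called an independent two-sample t-test.
In R, it is run with the t.test() function.
Whether the two groups are assumed to have equal variances is controlled by the var.equal argument of that function. It defaults to FALSE, in which case t.test() runs Welch's t-test.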
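
As a rough sketch of how the equal-variance assumption might be checked before choosing var.equal, the example below runs var.test() (the F test in base R's stats package) on two simulated groups and then compares the pooled and Welch versions of the test. The vectors group_a and group_b, their sizes, and the seed are made up for illustration and are not the post's data.

```r
# Hypothetical samples: two groups drawn from the same Poisson distribution
set.seed(1)
group_a <- rpois(n = 80, lambda = 20)
group_b <- rpois(n = 70, lambda = 20)

# F test for equality of two variances; a large p-value gives no reason
# to doubt the equal-variance assumption
f_result <- var.test(group_a, group_b)
f_result$p.value

# Pooled-variance (Student) t-test vs. Welch's t-test (the t.test() default)
student <- t.test(group_a, group_b, var.equal = TRUE)
welch   <- t.test(group_a, group_b, var.equal = FALSE)

student$method     # method label without "Welch"
welch$method       # method label containing "Welch"
student$parameter  # degrees of freedom: 80 + 70 - 2 = 148
welch$parameter    # Welch-Satterthwaite degrees of freedom, at most 148
```

When the two variances are close, both versions give nearly the same result, so leaving var.equal at its default is usually a safe choice.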

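The code below tests whether the mean streaming count differs by gender (user_gender) in the example data. In the data, 남성 means male and 여성 means female.
- The streaming counts were sampled from the same Poisson distribution regardless of gender, so equal variances are assumed (var.equal = TRUE).
Readers who already know the statistics behind the t-test should find the result easy to interpret: the p-value of 0.6611 is well above 0.05, so there is no evidence that the mean streaming count differs by gender.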

# t.test(value ~ grouping factor)
temp %>% 
  t.test(streaming_count ~ user_gender, data = ., var.equal = TRUE)

## 
##  Two Sample t-test
## 
## data:  streaming_count by user_gender
## t = -0.43925, df = 148, p-value = 0.6611
## alternative hypothesis: true difference in means between group 남성 and group 여성 is not equal to 0
## 95 percent confidence interval:
##  -2.032707  1.293390
## sample estimates:
## mean in group 남성 mean in group 여성 
##           19.76923           20.13889
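
To make the t, df, p-value and confidence interval in this kind of output less of a black box, here is a minimal sketch that recomputes them by hand from the pooled-variance formulas and compares them with t.test(). The vectors x and y are simulated for illustration (they are not the post's data), and the names s2_pooled, se_diff and so on are hypothetical helpers.

```r
# Hypothetical samples standing in for the two groups
set.seed(1)
x <- rpois(n = 78, lambda = 20)
y <- rpois(n = 72, lambda = 20)

n_x <- length(x)
n_y <- length(y)

# Pooled variance: weighted average of the two sample variances
s2_pooled <- ((n_x - 1) * var(x) + (n_y - 1) * var(y)) / (n_x + n_y - 2)

# Standard error of the difference in means
se_diff <- sqrt(s2_pooled * (1 / n_x + 1 / n_y))

# Test statistic, degrees of freedom, two-sided p-value, 95% interval
t_manual  <- (mean(x) - mean(y)) / se_diff
df_manual <- n_x + n_y - 2
p_manual  <- 2 * pt(-abs(t_manual), df = df_manual)
ci_manual <- (mean(x) - mean(y)) + c(-1, 1) * qt(0.975, df = df_manual) * se_diff

# The built-in function should agree with the manual calculation
fit <- t.test(x, y, var.equal = TRUE)
all.equal(t_manual, unname(fit$statistic))
all.equal(p_manual, fit$p.value)
all.equal(ci_manual, as.numeric(fit$conf.int))
```

With 150 observations in total, the degrees of freedom are 150 - 2 = 148, which is the df value reported in the output above.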

As before, tidy() from the broom package converts this result into a data frame that is easier to work with.

temp %>% 
  t.test(streaming_count ~ user_gender, data = ., var.equal = TRUE) %>% 
  broom::tidy()

## # A tibble: 1 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1   -0.370      19.8      20.1    -0.439   0.661       148    -2.03      1.29
## # … with 2 more variables: method <chr>, alternative <chr>

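Next, let's test the difference in mean streaming count by gender separately for each song (song_id).
- Combine split() and map() to do this.
- Inside the map() formula, the current subset is referenced as .x rather than with the pipe placeholder (.).
- var.equal is not set here, so each test is a Welch t-test, as the output shows.

# Split the data by song with split(), then apply t.test() to each subset with map()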
temp %>% 
  filter(song_id %in% c("a", "b", "c", "d")) %>% 
  split(.$song_id) %>% 
  map(~ t.test(streaming_count ~ user_gender, data = .x))

## $a
## 
##  Welch Two Sample t-test
## 
## data:  streaming_count by user_gender
## t = 0.46265, df = 6.2242, p-value = 0.6594
## alternative hypothesis: true difference in means between group 남성 and group 여성 is not equal to 0
## 95 percent confidence interval:
##  -7.071859 10.405192
## sample estimates:
## mean in group 남성 mean in group 여성 
##           26.66667           25.00000 
## 
## 
## $b
## 
##  Welch Two Sample t-test
## 
## data:  streaming_count by user_gender
## t = 0.10234, df = 6.1282, p-value = 0.9218
## alternative hypothesis: true difference in means between group 남성 and group 여성 is not equal to 0
## 95 percent confidence interval:
##  -10.25536  11.15536
## sample estimates:
## mean in group 남성 mean in group 여성 
##              21.20              20.75 
## 
## 
## $c
## 
##  Welch Two Sample t-test
## 
## data:  streaming_count by user_gender
## t = -0.74764, df = 5.0154, p-value = 0.4882
## alternative hypothesis: true difference in means between group 남성 and group 여성 is not equal to 0
## 95 percent confidence interval:
##  -14.783576   8.116909
## sample estimates:
## mean in group 남성 mean in group 여성 
##           16.66667           20.00000 
## 
## 
## $d
## 
##  Welch Two Sample t-test
## 
## data:  streaming_count by user_gender
## t = 0.035337, df = 4.794, p-value = 0.9732
## alternative hypothesis: true difference in means between group 남성 and group 여성 is not equal to 0
## 95 percent confidence interval:
##  -9.086786  9.336786
## sample estimates:
## mean in group 남성 mean in group 여성 
##             20.375             20.250

With map_dfr(), the mapped results can be combined into a single data frame with one row per song, which is easier to read.

temp %>% 
  filter(song_id %in% c("a", "b", "c", "d")) %>% 
  split(.$song_id) %>% 
  map(~ t.test(streaming_count ~ user_gender, data = .x)) %>% 
  map_dfr(~ broom::tidy(.x))  # broom is not attached by library(tidyverse)

## # A tibble: 4 x 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1    1.67       26.7      25      0.463    0.659      6.22    -7.07     10.4 
## 2    0.450      21.2      20.8    0.102    0.922      6.13   -10.3      11.2 
## 3   -3.33       16.7      20     -0.748    0.488      5.02   -14.8       8.12
## 4    0.125      20.4      20.2    0.0353   0.973      4.79    -9.09      9.34
## # … with 2 more variables: method <chr>, alternative <chr>
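
One limitation of the combined table is that it no longer shows which row belongs to which group. A possible way around this, sketched here on R's built-in ToothGrowth dataset rather than the post's data, is the .id argument of map_dfr(), which stores the names of the split list in a new column. The object names by_dose and result are illustrative only.

```r
library(purrr)
library(broom)

# ToothGrowth ships with R: len (numeric), supp (factor: OJ / VC), dose (0.5, 1, 2)
# split() returns a named list with one data frame per dose
by_dose <- split(ToothGrowth, ToothGrowth$dose)

# Run a Welch t-test per dose, tidy each result, and row-bind them;
# .id = "dose" copies the list names ("0.5", "1", "2") into a column
result <- map_dfr(
  by_dose,
  ~ tidy(t.test(len ~ supp, data = .x)),
  .id = "dose"
)

result[, c("dose", "estimate", "statistic", "p.value")]
```

In purrr 1.0.0 and later, map_dfr() is marked as superseded. It still works, and map() followed by list_rbind(names_to = "dose") is the suggested replacement.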

License: Attribution, NonCommercial
