제이드의 낙서장

[R] 7. ARIMA (AutoRegressive Integrated Moving Average)

택2 — Fri, 6 Aug 2021 18:14:10 +0900

library(fpp3)
#library(fable)

Stationarity and differencing

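What is a stationary time series?
- A stationary time series is one whose statistical properties do not depend on the time at which the series is observed.
- Time series with trend or seasonality are therefore not stationary, because the trend and seasonality affect the value of the series at different times.
What is differencing?
- Differencing computes the differences between consecutive observations, and is one way to make a non-stationary time series stationary.
- Differencing can help stabilise the mean of a time series by removing or reducing trend and seasonality.
- Differencing can be applied more than once: second-order differencing takes the differences of the first differences, and is occasionally needed when the first differences are still not stationary.
Transformations such as logarithms can also help stabilise the variance of a time series.
The ACF plot is also useful for identifying non-stationary time series.
- For a stationary time series the ACF drops to zero relatively quickly, whereas for a non-stationary series it decreases slowly.
- For non-stationary data, the lag-1 autocorrelation \(r_{1}\) is also often large and positive.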

Random walk model

Suppose \(\epsilon_{t}\) denotes white noise and the differenced series is white noise:

\[y_{t} - y_{t-1} = \epsilon_{t}\]

Rearranging gives the model below, which is called the random walk model.

\[y_{t} = y_{t-1} + \epsilon_{t}\]

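The random walk model is a typical example of a non-stationary time series and is widely used in finance and economics.
The forecasts from this model are equal to the last observation.
- Future movements are unpredictable and are equally likely to be up or down.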
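As a rough illustration of the random walk model, the sketch below simulates one in base R and checks that first differencing recovers the white noise. The seed, the series length and all variable names are arbitrary choices made for this example.

```r
set.seed(1)
eps <- rnorm(200)     # white noise
y <- cumsum(eps)      # random walk: y_t = y_{t-1} + eps_t
dy <- diff(y)         # first difference: y_t - y_{t-1}

# Lag-1 autocorrelation of the level and of the differenced series
r1_level <- acf(y, plot = FALSE)$acf[2]
r1_diff <- acf(dy, plot = FALSE)$acf[2]

# Forecasts from a random walk repeat the last observation
naive_fc <- rep(tail(y, 1), 5)
```

In a run like this, `r1_level` should be large and positive while `r1_diff` should be close to zero, which matches the ACF behaviour described for non-stationary and stationary series.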

Seasonal differencing

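Just as ordinary differencing subtracts the previous observation, seasonal differencing subtracts the observation from the same season of the previous period.
Sometimes both a seasonal difference and a first difference are needed to obtain stationary data.
See the example below.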

PBS %>% 
  filter(ATC2 == "H02") %>% 
  summarise(Cost = sum(Cost)/1000000) %>% 
  transmute(
    `Sales ($million)` = Cost,
    `Log sales` = log(Cost),
    `Annual change in log sales` = difference(log(Cost), 12), # lag 12 because the data are monthly
    `Doubly differenced log sales` = difference(difference(log(Cost), 12), 1)
  ) %>% 
  pivot_longer(
    cols = -Month,
    names_to = "Type",
    values_to = "Sales"
  ) %>% 
  mutate(Type = factor(Type, levels = c("Sales ($million)", 
                                        "Log sales",
                                        "Annual change in log sales", 
                                        "Doubly differenced log sales"))) %>% 
  ggplot(aes(x = Month, y = Sales)) +
  geom_line(size = 0.7) +
  facet_wrap(. ~ Type, scales = "free_y", nrow = 4) +
  labs(title = "Corticosteroid drug sales", y = NULL)

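When both are needed, it is better to apply the seasonal difference first and then the first difference.
- If the first difference is taken first, seasonality may still be present, whereas after seasonal differencing the series is sometimes already stationary.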
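The recommendation about ordering concerns which intermediate series you get to inspect; the final doubly differenced series is the same either way. A minimal base R sketch, using the built-in `AirPassengers` monthly series as a stand-in for the PBS data (an assumption made only for this example):

```r
x <- log(AirPassengers)                          # monthly data, so the seasonal lag is 12

seasonal_first <- diff(diff(x, lag = 12), lag = 1)
first_first <- diff(diff(x, lag = 1), lag = 12)

# Both orders give the same doubly differenced series
same_result <- all.equal(as.numeric(seasonal_first), as.numeric(first_first))

# 12 observations are lost to the seasonal difference and 1 to the first difference
n_lost <- length(x) - length(seasonal_first)
```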

Unit root tests

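One objective way to decide whether differencing is required is to use a unit root test.
These are statistical hypothesis tests of stationarity.
Here we use the KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test.
- In this test, the null hypothesis is that the data are stationary.
- So a p-value at or below the significance level rejects the null hypothesis, indicating that differencing is required.
The test can be run by passing unitroot_kpss to the features() function.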

aus_total_retail <- aus_retail %>% 
  summarise(Turnover = sum(Turnover))

aus_total_retail %>% 
  mutate(log_turnover = log(Turnover)) %>% 
  features(.var = log_turnover, features = unitroot_kpss)

## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       <dbl>       <dbl>
## 1      7.35        0.01

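The p-value is reported as 0.01, which is less than 0.05, so we reject the null hypothesis. In other words, differencing is needed.
So how many differences are required?
The unitroot_nsdiffs() and unitroot_ndiffs() features answer that question.
- unitroot_nsdiffs() gives the number of seasonal differences
- unitroot_ndiffs() gives the number of ordinary (first) differences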

aus_total_retail %>% 
  mutate(log_turnover = log(Turnover)) %>%
  features(.var = log_turnover, features = unitroot_nsdiffs)

## # A tibble: 1 x 1
##   nsdiffs
##     <int>
## 1       1

aus_total_retail %>% 
  mutate(log_turnover = difference(log(Turnover), 12)) %>%
  features(.var = log_turnover, features = unitroot_ndiffs)

## # A tibble: 1 x 1
##   ndiffs
##    <int>
## 1      1

aus_total_retail %>% 
  mutate(log_turnover = difference(difference(log(Turnover), 12), 1)) %>%
  features(.var = log_turnover, features = unitroot_ndiffs)

## # A tibble: 1 x 1
##   ndiffs
##    <int>
## 1      0

These results confirm that this series needs one seasonal difference and one first difference.

Autoregressive models

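In a multiple regression model, we forecast the variable of interest using a linear combination of predictors.
In an autoregressive model, we forecast it using a linear combination of its own past values.
The term autoregressive indicates that it is a regression of the variable against itself.
An autoregressive model of order \(p\) can be written as follows.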

\[y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + ... + \phi_{p}y_{t-p} + \epsilon_{t}\]

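This is called an \(\text{AR}(p)\) model, an autoregressive model of order \(p\).
Some constraints on the parameters are required depending on the model; for details, see the lower part of the page linked here.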
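To make the AR equation concrete, here is a small base R sketch that generates an AR(2) series directly from the recursion and then re-estimates it with `stats::arima()`. The parameter values, seed and variable names are made up for illustration. Note that `arima()` reports an "intercept" that is the process mean \(c / (1 - \phi_{1} - \phi_{2})\), not the constant \(c\) itself.

```r
set.seed(42)
n <- 1000
c0 <- 2                           # constant c (hypothetical value)
phi <- c(0.5, 0.3)                # AR coefficients inside the stationary region
eps <- rnorm(n)                   # white noise

y <- numeric(n)
y[1:2] <- c0 / (1 - sum(phi))     # start at the process mean
for (t in 3:n) {
  y[t] <- c0 + phi[1] * y[t - 1] + phi[2] * y[t - 2] + eps[t]
}

fit_ar2 <- arima(y, order = c(2, 0, 0))
est <- coef(fit_ar2)              # named vector: ar1, ar2, intercept (the mean)
```

With these settings the estimates should land near 0.5 and 0.3, with a mean near 10.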

Moving average models

A moving average model looks similar to a multiple regression model, but it uses past forecast errors as predictors.

\[y_{t} = c + \epsilon_{t} + \theta_{1}\epsilon_{t-1} + \theta_{2}\epsilon_{t-2} + ... + \theta_{q}\epsilon_{t-q}\]

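This is called an \(\text{MA}(q)\) model, a moving average model of order \(q\).
Note that each value of \(y_{t}\) can be thought of as a weighted moving average of the past few forecast errors.
- Do not confuse this with moving average smoothing.
- A moving average model is used for forecasting future values, whereas moving average smoothing is used for estimating the trend-cycle of past values.
Likewise, for the constraints the model requires, see the lower part of the page linked here.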

Non-seasonal ARIMA models

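Combining differencing with autoregressive (AR) and moving average (MA) models gives a non-seasonal ARIMA model.
ARIMA stands for AutoRegressive Integrated Moving Average, where "integrated" refers to the reverse of differencing.
The model can be written as follows.

\[y_{t}^{'} = c + \phi_{1}y_{t-1}^{'} + ... + \phi_{p}y_{t-p}^{'} + \theta_{1}\epsilon_{t-1} + ... + \theta_{q}\epsilon_{t-q} + \epsilon_{t}\]

Here \(y_{t}^{'}\) is the differenced series.
The right-hand side includes both lagged values of \(y_{t}^{'}\) and lagged errors.
This is called an \(\text{ARIMA}(p, d, q)\) model.
- \(p\) is the order of the autoregressive part, \(d\) is the degree of differencing involved, and \(q\) is the order of the moving average part.
Let's look at an example.

The plot below shows Egyptian exports as a percentage of GDP from 1960 to 2017.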

global_economy %>% 
  filter(Code == "EGY") %>% 
  autoplot(Exports) +
  labs(y = "% of GDP", title = "Egyptian Exports")

The function for fitting an ARIMA model is ARIMA(). When no orders are specified, it selects the orders automatically.

fit <- global_economy %>% 
  filter(Code == "EGY") %>% 
  model(ARIMA(Exports))

fit %>% 
  report()

## Series: Exports 
## Model: ARIMA(2,0,1) w/ mean 
## 
## Coefficients:
##          ar1      ar2      ma1  constant
##       1.6764  -0.8034  -0.6896    2.5623
## s.e.  0.1111   0.0928   0.1492    0.1161
## 
## sigma^2 estimated as 8.046:  log likelihood=-141.57
## AIC=293.13   AICc=294.29   BIC=303.43

As shown above, an \(\text{ARIMA}(2, 0, 1)\) model was selected.
Forecasts from this model are shown below.

fit %>% 
  forecast(h = 10) %>% 
  autoplot(global_economy) +
  labs(y = "% of GDP", title = "Egyptian Exports")

Understanding ARIMA models

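Letting the code choose all the orders automatically is convenient, but it is still worth understanding roughly how the model behaves.
The constant \(c\) has an important effect on the long-term forecasts.
- If \(c=0\) and \(d=0\), the long-term forecasts will go to zero.
- If \(c=0\) and \(d=1\), the long-term forecasts will go to a non-zero constant.
- If \(c=0\) and \(d=2\), the long-term forecasts will follow a straight line.
- If \(c\neq0\) and \(d=0\), the long-term forecasts will go to the mean of the data.
- If \(c\neq0\) and \(d=1\), the long-term forecasts will follow a straight line.
- If \(c\neq0\) and \(d=2\), the long-term forecasts will follow a quadratic trend.
The value of \(d\) also affects the prediction intervals.
- The higher the value of \(d\), the more rapidly the prediction intervals increase in size.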
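Two of these cases can be checked numerically. The sketch below uses base R's `stats::arima()` and `predict()` rather than fable, on simulated data with arbitrary parameters, so it is only an illustration of the behaviour and not the workflow used elsewhere in this post.

```r
set.seed(7)

# c != 0, d = 0: long-term forecasts approach the estimated mean
x0 <- 10 + arima.sim(model = list(ar = 0.6), n = 300)
fit_d0 <- arima(x0, order = c(1, 0, 0))
fc_d0 <- predict(fit_d0, n.ahead = 100)

# c = 0, d = 1: forecasts settle on a non-zero constant
x1 <- cumsum(rnorm(300))
fit_d1 <- arima(x1, order = c(0, 1, 1))
fc_d1 <- predict(fit_d1, n.ahead = 100)

gap_to_mean <- abs(fc_d0$pred[100] - unname(coef(fit_d0)["intercept"]))
flat_spread <- diff(range(fc_d1$pred))
```

Here `fc_d0$se` levels off as the horizon grows, while `fc_d1$se` keeps increasing, which is the widening of prediction intervals that comes with differencing.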

ACF and PACF plots

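It is usually hard to tell from a time plot alone which values of \(p\) and \(q\) are appropriate for the data.
The ACF and PACF plots can help.
The ACF plot shows the autocorrelations, which measure the relationship between \(y_{t}\) and \(y_{t-k}\) for different values of \(k\).
If \(y_{t}\) and \(y_{t-1}\) are correlated, then \(y_{t-1}\) and \(y_{t-2}\) must also be correlated.
But then \(y_{t}\) and \(y_{t-2}\) might be correlated simply because both are connected to \(y_{t-1}\), rather than because \(y_{t-2}\) carries any new information.
Partial autocorrelations can be used to overcome this problem.
They measure the relationship between \(y_{t}\) and \(y_{t-k}\) after removing the effects of lags \(1, 2, \dots, k-1\).
The two quantities are computed with the ACF() and PACF() functions.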

global_economy %>% 
  filter(Code == "EGY") %>% 
  ACF(Exports) %>% 
  autoplot()

global_economy %>% 
  filter(Code == "EGY") %>% 
  PACF(Exports) %>% 
  autoplot()

To draw the ACF and PACF plots together, use gg_tsdisplay() as in the code below.

global_economy %>% 
  filter(Code == "EGY") %>% 
  gg_tsdisplay(Exports, plot_type = "partial")

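These plots can be interpreted as follows.
- The data may follow an \(\text{ARIMA}(p, d, 0)\) model if the ACF and PACF plots of the differenced data show both of the following patterns.
  - The ACF is exponentially decaying or sinusoidal.
  - The PACF has a significant spike at lag \(p\), but none beyond lag \(p\).
- The data may follow an \(\text{ARIMA}(0, d, q)\) model if the ACF and PACF plots of the differenced data show both of the following patterns.
  - The PACF is exponentially decaying or sinusoidal.
  - The ACF has a significant spike at lag \(q\), but none beyond lag \(q\).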
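These cut-off patterns can be seen in the theoretical ACF and PACF, which base R computes with `stats::ARMAacf()`. The coefficients below are arbitrary illustrative values, not estimates from any data in this post.

```r
# AR(2): the PACF cuts off after lag 2
ar_pacf <- ARMAacf(ar = c(0.5, 0.3), lag.max = 6, pacf = TRUE)   # lags 1..6

# MA(2): the ACF cuts off after lag 2
ma_acf <- ARMAacf(ma = c(0.6, 0.3), lag.max = 6)                 # lags 0..6

round(ar_pacf, 3)
round(ma_acf, 3)
```

For the AR(2) case the partial autocorrelation at lag 2 equals \(\phi_{2} = 0.3\) and is zero afterwards; for the MA(2) case the autocorrelations at lags 3 and beyond are zero. Sample ACF and PACF plots only approximate these shapes.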
Looking at the ACF and PACF plots above again from this point of view:
- the ACF has a roughly sinusoidal shape, and
- the last significant spike in the PACF is at lag 4.
So an \(\text{ARIMA}(4, 0, 0)\) model is a reasonable candidate for these data.
To specify the orders manually instead of having ARIMA() select them, use pdq().

fit2 <- global_economy %>% 
  filter(Code == "EGY") %>% 
  model(ARIMA(Exports ~ pdq(4, 0, 0)))

fit2 %>% 
  report()

## Series: Exports 
## Model: ARIMA(4,0,0) w/ mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4  constant
##       0.9861  -0.1715  0.1807  -0.3283    6.6922
## s.e.  0.1247   0.1865  0.1865   0.1273    0.3562
## 
## sigma^2 estimated as 7.885:  log likelihood=-140.53
## AIC=293.05   AICc=294.7   BIC=305.41

Comparing the two, the \(\text{ARIMA}(2, 0, 1)\) model has a slightly lower AICc (294.29 versus 294.7), so it can be considered the better model.

Estimation and order selection

Maximum likelihood estimation

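Once the model order (\(p\), \(d\), \(q\)) has been identified, we need to estimate the parameters \(c\), \(\phi_{1},\dots,\phi_{p}\) and \(\theta_{1},\dots,\theta_{q}\).
In R, ARIMA models are estimated by MLE (maximum likelihood estimation).
For ARIMA models, MLE is similar to the least squares estimates that would be obtained by minimising the sum of squared errors below.

\[\sum_{t=1}^{T} \epsilon_{t}^{2}\]

- In practice, the ARIMA() function used here finds the parameter values that maximise the log likelihood.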

Information Criteria

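The Akaike information criterion (AIC) is also useful for determining the order of an ARIMA model.
- \(L\) is the likelihood mentioned above
- \(k=1\) if \(c\neq0\) and \(k=0\) if \(c=0\)

\[\text{AIC} = -2\log(L) + 2(p+q+k+1)\]

The corrected AIC (AICc) for ARIMA models is as follows.

\[\text{AIC}_{c}=\text{AIC}+\frac{2(p+q+k+1)(p+q+k+2)}{T-p-q-k-2}\]

Note that model selection with AICc is only meaningful among ARIMA models with the same order of differencing, because differencing changes the data on which the likelihood is computed.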
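The two formulas can be reproduced by hand from a fitted model. The sketch below uses base R's `stats::arima()` on the built-in `lh` series (both are assumptions for this example; the post itself uses fable) and compares the manual AIC with the value stored in the fit.

```r
fit_lh <- arima(lh, order = c(1, 0, 0))   # AR(1) with a mean term
p <- 1; q <- 0; k <- 1                    # k = 1 because a mean/constant is estimated
n_obs <- length(lh)                       # T in the formulas

n_par <- p + q + k + 1                    # the extra 1 counts the error variance
aic_manual <- -2 * fit_lh$loglik + 2 * n_par
aicc_manual <- aic_manual + 2 * n_par * (n_par + 1) / (n_obs - n_par - 1)

c(AIC = aic_manual, AIC_from_fit = fit_lh$aic, AICc = aicc_manual)
```

The correction term is always positive and shrinks as \(T\) grows, so AICc and AIC agree closely for long series.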

ARIMA modelling in `fable`

How does `ARIMA()` work?

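The ARIMA() function in the fable package uses a variation of the Hyndman-Khandakar algorithm, which combines unit root tests, minimisation of the AICc and MLE.
Very briefly, the Hyndman-Khandakar algorithm for automatic ARIMA modelling works as follows.
1. The number of differences (\(0 \le d \le 2\)) is determined using repeated KPSS tests.
2. After differencing the data \(d\) times, \(p\) and \(q\) are chosen by minimising the AICc, using a stepwise search rather than considering every combination.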

Modelling procedure

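When fitting an ARIMA model to a time series, the following procedure is a useful general approach.
1. Plot the data and identify any unusual observations.
2. If necessary, transform the data (using a Box-Cox transformation) to stabilise the variance.
3. If the data are non-stationary, take first differences of the data until they are stationary.
4. Examine the ACF and PACF: is an \(\text{ARIMA}(p, d, 0)\) or \(\text{ARIMA}(0, d, q)\) model appropriate?
5. Try the chosen model(s) and use the AICc to search for a better model.
6. Check the residuals by plotting their ACF and doing a portmanteau test. Once the residuals look like white noise, calculate forecasts.
Now for a full example. We again use the global_economy data, this time looking at the exports of the Central African Republic.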

global_economy %>% 
  filter(Code == "CAF") %>% 
  autoplot(Exports) +
  labs(y = "% of GDP", title = "Central African Republic exports")

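The plot shows a downward trend and some non-stationarity.
There is no evidence of changing variance, so we skip the Box-Cox transformation.
To address the non-stationarity, we take a first difference and inspect its ACF and PACF plots with gg_tsdisplay().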

global_economy %>% 
  filter(Code == "CAF") %>% 
  gg_tsdisplay(difference(Exports, 1), plot_type = "partial")

## Warning: Removed 1 row(s) containing missing values (geom_path).

## Warning: Removed 1 rows containing missing values (geom_point).

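The differenced series now looks reasonably stationary.
The ACF plot suggests \(q=3\), that is ARIMA(0, 1, 3), and the PACF plot suggests \(p=2\), that is ARIMA(2, 1, 0).
We fit these two models together with two automatically selected ones: one using the default stepwise procedure (stepwise) and one searching a larger model space (search).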

caf_fit <- global_economy %>% 
  filter(Code == "CAF") %>% 
  model(
    arima013 = ARIMA(Exports ~ pdq(0, 1, 3)),
    arima210 = ARIMA(Exports ~ pdq(2, 1, 0)),
    stepwise = ARIMA(Exports),
    search = ARIMA(Exports, stepwise = FALSE)
  )

caf_fit %>% 
  pivot_longer(
    cols = 2:5,
    names_to = "ModelName",
    values_to = "Orders"
  )

## # A mable: 4 x 3
## # Key:     Country, ModelName [4]
##   Country                  ModelName         Orders
##   <fct>                    <chr>            <model>
## 1 Central African Republic arima013  <ARIMA(0,1,3)>
## 2 Central African Republic arima210  <ARIMA(2,1,0)>
## 3 Central African Republic stepwise  <ARIMA(2,1,2)>
## 4 Central African Republic search    <ARIMA(3,1,0)>

caf_fit %>% 
  select(stepwise) %>% 
  report()

## Series: Exports 
## Model: ARIMA(2,1,2) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2
##       -0.6741  -0.7142  0.2468  0.4831
## s.e.   0.1821   0.2037  0.2531  0.2576
## 
## sigma^2 estimated as 6.416:  log likelihood=-132.1
## AIC=274.2   AICc=275.37   BIC=284.41

caf_fit %>% 
  select(search) %>% 
  report()

## Series: Exports 
## Model: ARIMA(3,1,0) 
## 
## Coefficients:
##           ar1      ar2     ar3
##       -0.4419  -0.1850  0.2055
## s.e.   0.1295   0.1385  0.1274
## 
## sigma^2 estimated as 6.519:  log likelihood=-133
## AIC=274   AICc=274.77   BIC=282.18

caf_fit %>% 
  glance() %>% 
  arrange(AICc) %>% 
  select(.model, AICc)

## # A tibble: 4 x 2
##   .model    AICc
##   <chr>    <dbl>
## 1 search    275.
## 2 arima210  275.
## 3 arima013  275.
## 4 stepwise  275.

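The four models have almost identical AICc values.
Of these, the ARIMA(3, 1, 0) model fitted under the name search has the lowest AICc, by a narrow margin.
So we take that model and use gg_tsresiduals() to check its residuals and their ACF for white noise behaviour.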

caf_fit %>% 
  select(search) %>% 
  gg_tsresiduals()

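A portmanteau test can also be run with the ljung_box feature.
- lag is the number of autocorrelation lags used in the test statistic
- dof is the number of parameters estimated in the fitted model (here \(p + q = 3\))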

caf_fit %>% 
  augment() %>% 
  filter(.model == "search") %>% 
  features(
    .var = .innov,
    features = ljung_box,
    lag = 10,
    dof = 3
  )

## # A tibble: 1 x 4
##   Country                  .model lb_stat lb_pvalue
##   <fct>                    <chr>    <dbl>     <dbl>
## 1 Central African Republic search    5.75     0.569

The p-value (0.569) is larger than the significance level, so there is no evidence against the residuals being white noise (no meaningful pattern remains).
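For readers without fable at hand, the same kind of portmanteau check can be sketched in base R with `stats::Box.test()`, where `fitdf` plays the role of `dof`. The simulated AR(2) series and its parameters are assumptions made only for this example.

```r
set.seed(3)
x <- arima.sim(model = list(ar = c(0.5, -0.3)), n = 300)
fit <- arima(x, order = c(2, 0, 0))

# Ljung-Box test on the residuals, using 10 lags and p + q = 2 fitted parameters
lb <- Box.test(residuals(fit), lag = 10, type = "Ljung-Box", fitdf = 2)

lb$parameter   # degrees of freedom: lag - fitdf
lb$p.value     # a large p-value is consistent with white noise residuals
```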
Finally, forecasts can be obtained with the forecast() function.

caf_fit %>% 
  select(Country, search) %>% 
  forecast(h = 5) %>% 
  autoplot(global_economy)

[R] 6. Exponential smoothing

택2 — Thu, 29 Jul 2021 19:55:43 +0900

library(fpp3)
#library(fable)

Simple exponential smoothing

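SES (simple exponential smoothing) is the simplest of the exponential smoothing methods.
It is suitable for forecasting data with no clear trend or seasonal pattern.
With the naïve method, all forecasts for the future are equal to the last observed value of the series.
The naïve method therefore assumes that the most recent observation is the only important one and that all earlier observations provide no information about the future.
It can be thought of as a weighted average in which all of the weight is given to the last observation.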

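At the other extreme, the average method forecasts with the simple average of the observed data, giving every observation equal weight.

\[\hat{y}_{T+h|T} = \frac{1}{T}\sum^{T}_{t=1}y_{t}\]

- However, it may be sensible to attach larger weights to more recent observations than to observations from the distant past.
- This is exactly the idea behind SES.
- Forecasts are calculated using weighted averages, where the weights decrease exponentially as the observations come from further in the past.

\[\hat{y}_{T+1|T} = \alpha y_{T} + \alpha(1-\alpha)y_{T-1} + \alpha(1-\alpha)^{2}y_{T-2} + ...\]

Here \(\alpha\) is the smoothing parameter, which takes a value between 0 and 1 and controls the rate at which the weights decrease.
The full derivation is not covered here.
SES has a flat forecast function.
That is, all forecasts take the same value, equal to the last level component \(l_{T}\).
Note that such forecasts are only suitable when the time series has no trend or seasonal component.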

\[\hat{y}_{T+h|T} = \hat{y}_{T+1|T} = l_{T}\]
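The weighted-average form and the level recursion \(l_{t} = \alpha y_{t} + (1-\alpha)l_{t-1}\) give the same forecast. A minimal base R sketch with made-up data; `ses_level()` is a hypothetical helper written for this example, not a fable function.

```r
# Final level l_T from the recursion l_t = alpha * y_t + (1 - alpha) * l_{t-1}
ses_level <- function(y, alpha, l0) {
  level <- l0
  for (obs in y) level <- alpha * obs + (1 - alpha) * level
  level
}

y <- c(39, 42, 41, 45, 44)   # toy observations
alpha <- 0.8
l0 <- 40                     # initial level

fc_recursive <- ses_level(y, alpha, l0)

# Same forecast as an explicit weighted average with exponentially decaying weights
n <- length(y)
fc_weighted <- sum(alpha * (1 - alpha)^(0:(n - 1)) * rev(y)) + (1 - alpha)^n * l0
```

Setting `alpha = 1` puts all the weight on the last observation, which reproduces the naïve forecast.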

Optimisation

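Just as regression coefficients are estimated by minimising the residual sum of squares, the unknown parameters here are estimated by minimising the SSE below.
The smoothing parameter \(\alpha\) and the initial level \(l_{0}\) can be set by the analyst in advance, but estimating them from the data is usually more reliable.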

\[\text{SSE} = \sum^{T}_{t=1}\bigg(y_{t} - \hat{y}_{t|t-1}\bigg)^{2} = \sum^{T}_{t=1}\epsilon_{t}^{2}\]
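As a sketch of what minimising the SSE involves, the base R code below estimates \(\alpha\) and \(l_{0}\) with `stats::optim()` on the built-in `Nile` series. The helper `sse_ses()`, the choice of data and the starting values are assumptions for this example; ETS() in fable has its own estimation routine, which this does not replicate.

```r
# Sum of squared one-step-ahead errors for SES with parameters (alpha, l0)
sse_ses <- function(par, y) {
  alpha <- par[1]
  level <- par[2]
  sse <- 0
  for (obs in y) {
    sse <- sse + (obs - level)^2                 # error of the one-step forecast
    level <- alpha * obs + (1 - alpha) * level   # update the level
  }
  sse
}

y <- as.numeric(Nile)
start <- c(0.5, y[1])                            # starting values for alpha and l0
opt <- optim(start, sse_ses, y = y, method = "L-BFGS-B",
             lower = c(0, -Inf), upper = c(1, Inf))

opt$par    # estimated alpha and initial level
```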

For the example we use the global_economy data from the tsibbledata package.

algeria_economy <- global_economy %>% 
  filter(Country == "Algeria")

algeria_economy %>% 
  autoplot(Exports) +
  labs(y = "% of GDP", title = "Exports: Algeria")

The function that fits exponential smoothing models is ETS().
- The form of each term (error, trend and season) can be specified separately.
- “A”: additive
- “M”: multiplicative
- “Ad”, “Md”: damped variants
- “N”: none

# No trend or seasonality, so both are set to "N"
fit <- algeria_economy %>% 
  model(ETS(formula = Exports ~ error("A") + trend("N") + season("N")))

fit %>% 
  report()

## Series: Exports 
## Model: ETS(A,N,N) 
##   Smoothing parameters:
##     alpha = 0.8399875 
## 
##   Initial states:
##    l[0]
##  39.539
## 
##   sigma^2:  35.6301
## 
##      AIC     AICc      BIC 
## 446.7154 447.1599 452.8968

The estimated smoothing parameter \(\alpha\) is approximately 0.84, and the estimated initial level \(l_{0}\) is about 39.5.

fc <- fit %>% 
  forecast(h = 5)

fc %>% 
  autoplot(algeria_economy) +
  geom_line(data = fit %>% augment(), aes(y = .fitted), color = "#D55E00") +
  labs(y = "% of GDP", title = "Exports: Algeria")

Methods with trend

Holt’s linear trend method

Simple exponential smoothing has been extended to allow forecasting of data with a trend.
This method involves a forecast equation and two smoothing equations.
- Forecast equation: \(\hat{y}_{t+h|t} = l_t + hb_{t}\)
- Level equation: \(l_{t} = \alpha y_{t} + (1-\alpha)(l_{t-1}+b_{t-1})\)
- Trend equation: \(b_{t} = \beta^{*}(l_{t} - l_{t-1})+(1-\beta^{*})b_{t-1}\)
- \(l_{t}\) is an estimate of the level of the series at time \(t\), and \(b_{t}\) is an estimate of its trend (slope).
- As before, the smoothing parameters \(\alpha\) and \(\beta^{*}\) each take a value between 0 and 1.
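The three equations translate almost line by line into code. Below is a minimal base R sketch with fixed, hand-picked parameters; `holt_forecast()` is a hypothetical helper for illustration, and no parameter estimation is done.

```r
# Holt's linear trend method with given alpha, beta* and initial states
holt_forecast <- function(y, alpha, beta, l0, b0, h) {
  level <- l0
  trend <- b0
  for (obs in y) {
    prev_level <- level
    level <- alpha * obs + (1 - alpha) * (prev_level + trend)    # level equation
    trend <- beta * (level - prev_level) + (1 - beta) * trend    # trend equation
  }
  level + (1:h) * trend                                          # forecast equation
}

# On an exactly linear series the method extends the line
t_idx <- 1:10
y <- 3 + 2 * t_idx
fc <- holt_forecast(y, alpha = 0.8, beta = 0.2, l0 = 3, b0 = 2, h = 3)
fc   # 25 27 29
```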
With this method the forecast function is no longer flat but trending.
The example below uses the annual population of Australia from 1960 to 2017.

aus_economy <- global_economy %>% 
  filter(Code == "AUS") %>% 
  mutate(Pop = Population / 1000000)

aus_economy

## # A tsibble: 58 x 10 [1Y]
## # Key:       Country [1]
##    Country   Code   Year       GDP Growth   CPI Imports Exports Population   Pop
##    <fct>     <fct> <dbl>     <dbl>  <dbl> <dbl>   <dbl>   <dbl>      <dbl> <dbl>
##  1 Australia AUS    1960   1.86e10  NA     7.96    14.1    13.0   10276477  10.3
##  2 Australia AUS    1961   1.96e10   2.49  8.14    15.0    12.4   10483000  10.5
##  3 Australia AUS    1962   1.99e10   1.30  8.12    12.6    13.9   10742000  10.7
##  4 Australia AUS    1963   2.15e10   6.21  8.17    13.8    13.0   10950000  11.0
##  5 Australia AUS    1964   2.38e10   6.98  8.40    13.8    14.9   11167000  11.2
##  6 Australia AUS    1965   2.59e10   5.98  8.69    15.3    13.2   11388000  11.4
##  7 Australia AUS    1966   2.73e10   2.38  8.98    15.1    12.9   11651000  11.7
##  8 Australia AUS    1967   3.04e10   6.30  9.29    13.9    12.9   11799000  11.8
##  9 Australia AUS    1968   3.27e10   5.10  9.52    14.5    12.3   12009000  12.0
## 10 Australia AUS    1969   3.66e10   7.04  9.83    13.3    12.0   12263000  12.3
## # … with 48 more rows

aus_economy %>% 
  autoplot(Pop) +
  labs(y = "Millions", title = "Australian population")

We apply Holt’s linear trend method to these data.
Again we use ETS(), this time with an additive trend term.

fit <- aus_economy %>% 
  model(AAN = ETS(formula = Pop ~ error("A") + trend("A") + season("N")))

fit %>% 
  report()

## Series: Pop 
## Model: ETS(A,A,N) 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.3266366 
## 
##   Initial states:
##      l[0]      b[0]
##  10.05414 0.2224818
## 
##   sigma^2:  0.0041
## 
##       AIC      AICc       BIC 
## -76.98569 -75.83184 -66.68347

\(\hat{\alpha} = 0.9999\), \(\hat{\beta^{*}} = 0.3266\)

fc <- fit %>% 
  forecast(h = 5)

fc %>% 
  autoplot(aus_economy) +
  geom_line(data = fit %>% augment(), aes(y = .fitted), color = "#D55E00") +
  labs(y = "Millions", title = "Australian population")

Damped trend methods

Forecasts from Holt’s linear trend method above tend to over-forecast, especially for longer forecast horizons.
One way to address this is to introduce a parameter that dampens the trend to a flat line some time in the future.
Gardner & McKenzie (1985) introduced a damping parameter \(\phi\) with a value between 0 and 1, as follows.

\[\hat{y}_{t+h|t} = l_{t} + (\phi + \phi^2 + ... + \phi^{h})b_{t}\]
\[l_{t} = \alpha y_{t} + (1-\alpha)(l_{t-1} + \phi b_{t-1})\]
\[b_{t} = \beta^{*}(l_{t}-l_{t-1}) + (1-\beta^{*})\phi b_{t-1}\]

If \(\phi = 1\), the method is identical to Holt’s linear method.
Because of the damping parameter, short-run forecasts are trended while long-run forecasts level off to a constant.
In practice, \(\phi\) is usually restricted to a minimum of 0.8 and a maximum of 0.98.
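The effect of \(\phi\) on the forecast path is easy to see numerically. In the base R sketch below the level, slope and \(\phi\) values are made up, and `damped_path()` is a hypothetical helper that evaluates only the forecast equation.

```r
# h-step forecasts: l_T + (phi + phi^2 + ... + phi^h) * b_T
damped_path <- function(level, trend, phi, h) {
  level + cumsum(phi^(1:h)) * trend
}

holt_fc <- damped_path(level = 100, trend = 2, phi = 1, h = 50)      # undamped: straight line
damped_fc <- damped_path(level = 100, trend = 2, phi = 0.9, h = 50)  # damped: levels off

# Long-run limit of the damped forecasts: l_T + phi / (1 - phi) * b_T
limit <- 100 + 0.9 / (1 - 0.9) * 2   # 118
```

With \(\phi = 0.9\) the forecasts rise towards 118 and stay below it, whereas the undamped path reaches 200 after 50 steps.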
Let's compare the two methods with the example code below.
- For the damped method, the trend term is specified as “Ad”.

aus_economy %>% 
  model(
    `Holt's method` = ETS(Pop ~ error("A") + trend("A") + season("N")),
    `Damped Holt's method` = ETS(Pop ~ error("A") + trend("Ad", phi = 0.9) + season("N"))
  ) %>% 
  forecast(h = 15) %>% 
  autoplot(aus_economy, level = NULL) +
  labs(y = "Millions", title = "Australian population") +
  guides(color = guide_legend(title = "Forecast method"))

Here is another example: the number of internet users per minute, observed over 100 minutes.

www_usage <- WWWusage %>% 
  as_tsibble()

www_usage

## # A tsibble: 100 x 2 [1]
##    index value
##    <dbl> <dbl>
##  1     1    88
##  2     2    84
##  3     3    85
##  4     4    85
##  5     5    84
##  6     6    85
##  7     7    83
##  8     8    85
##  9     9    88
## 10    10    89
## # … with 90 more rows

www_usage %>% 
  autoplot(value) +
  labs(x = "Minute", y = "Number of users", title = "Internet usage per minute")

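We use time series cross-validation to compare the one-step forecast accuracy of the three methods below.
- stretch_tsibble() splits the observations into a sequence of expanding training sets, starting from the first 10 observations (.init = 10).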

www_usage %>% 
  stretch_tsibble(.init = 10) %>% 
  model(
    `SES` = ETS(value ~ error("A") + trend("N") + season("N")),
    `Holt` = ETS(value ~ error("A") + trend("A") + season("N")),
    `Damped` = ETS(value ~ error("A") + trend("Ad") + season("N"))
  ) %>% 
  forecast(h = 1) %>% 
  accuracy(www_usage)

## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 1 observation is missing at 101

## # A tibble: 3 x 10
##   .model .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   <chr>  <chr>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Damped Test  0.288   3.69  3.00 0.347  2.26 0.663 0.636 0.336
## 2 Holt   Test  0.0610  3.87  3.17 0.244  2.38 0.701 0.668 0.296
## 3 SES    Test  1.46    6.05  4.81 0.904  3.55 1.06  1.04  0.803

Comparing RMSE values, the damped Holt’s method performs best.
So we fit the damped method to the whole series and look at its forecasts.

fit <- www_usage %>% 
  model(`Damped` = ETS(value ~ error("A") + trend("Ad") + season("N")))

fit %>% 
  report()

## Series: value 
## Model: ETS(A,Ad,N) 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.9966439 
##     phi   = 0.814958 
## 
##   Initial states:
##      l[0]        b[0]
##  90.35177 -0.01728234
## 
##   sigma^2:  12.2244
## 
##      AIC     AICc      BIC 
## 717.7310 718.6342 733.3620

fit %>% 
  tidy()

## # A tibble: 5 x 3
##   .model term  estimate
##   <chr>  <chr>    <dbl>
## 1 Damped alpha   1.00  
## 2 Damped beta    0.997 
## 3 Damped phi     0.815 
## 4 Damped l[0]   90.4   
## 5 Damped b[0]   -0.0173

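The smoothing parameter for the slope, \(\beta\), is estimated to be close to 1, and \(\alpha\) is also very close to 1, which suggests that both the level and the slope react strongly to each new observation.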

fit %>% 
  forecast(h = 10) %>% 
  autoplot(www_usage) +
  labs(x = "Minute", y = "Number of users", title = "Internet usage per minute")

Methods with seasonality

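Holt’s method was extended to capture seasonality as well as trend (Holt, 1957; Winters, 1960).
In addition to the parameters \(\alpha\) and \(\beta^{*}\) above, this method has a smoothing parameter \(\gamma\) for the seasonal component.
There are two variations of the method, depending on the nature of the seasonal component.
- The additive method is preferred when the seasonal variations are roughly constant through the series, and the multiplicative method is preferred when the seasonal variations change in proportion to the level of the series.
For the equations, please see the links below.
- Holt-Winters’ additive method
- Holt-Winters’ multiplicative method
Let's go straight to an example. This time we use the tourism data, which is loaded with fpp3.
- The data record quarterly overnight trips in Australia; in this example the goal is to forecast trips made for holidays.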

aus_holidays <- tourism %>% 
  filter(Purpose == "Holiday") %>% 
  summarise(Trips = sum(Trips/1000))

aus_holidays

## # A tsibble: 80 x 2 [1Q]
##    Quarter Trips
##      <qtr> <dbl>
##  1 1998 Q1 11.8 
##  2 1998 Q2  9.28
##  3 1998 Q3  8.64
##  4 1998 Q4  9.30
##  5 1999 Q1 11.2 
##  6 1999 Q2  9.61
##  7 1999 Q3  8.91
##  8 1999 Q4  9.03
##  9 2000 Q1 11.1 
## 10 2000 Q2  9.20
## # … with 70 more rows

fit <- aus_holidays %>% 
  model(
    `additive` = ETS(Trips ~ error("A") + trend("A") + season("A")),
    `multiplicative` = ETS(Trips ~ error("M") + trend("A") + season("M"))
  )

fc <- fit %>% 
  forecast(h = "3 years")

fc %>% 
  autoplot(aus_holidays, level = NULL) +
  labs(y = "Overnight trips (millions)", title = "Australian domestic tourism") +
  guides(color = guide_legend(title = "Forecast"))

Checking the model evaluation metrics with glance() below, the multiplicative model has the lower MSE.

fit %>% 
  glance()

## # A tibble: 2 x 9
##   .model          sigma2 log_lik   AIC  AICc   BIC   MSE  AMSE    MAE
##   <chr>            <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1 additive       0.193     -105.  229.  231.  250. 0.174 0.184 0.321 
## 2 multiplicative 0.00212   -104.  227.  229.  248. 0.170 0.183 0.0328

Holt-Winters’ damped method

This method can also be combined with damping. Let's go straight to an example.

sth_cross_ped <- pedestrian %>% 
  filter(Date >= "2016-07-01", Sensor == "Southern Cross Station") %>% 
  index_by(Date) %>% 
  summarise(Count = sum(Count)/1000)

sth_cross_ped

## # A tsibble: 184 x 2 [1D]
##    Date       Count
##    <date>     <dbl>
##  1 2016-07-01 17.6 
##  2 2016-07-02  2.52
##  3 2016-07-03  1.73
##  4 2016-07-04 17.3 
##  5 2016-07-05 16.5 
##  6 2016-07-06 16.8 
##  7 2016-07-07 17.0 
##  8 2016-07-08 17.2 
##  9 2016-07-09  3.21
## 10 2016-07-10  1.75
## # … with 174 more rows

sth_cross_ped %>%
  filter(Date <= "2016-07-31") %>%
  model(hw = ETS(Count ~ error("M") + trend("Ad") + season("M"))) %>%
  forecast(h = "2 weeks") %>%
  autoplot(sth_cross_ped %>% filter(Date <= "2016-08-14")) +
  labs(y = "Pedestrians ('000)", title = "Daily traffic: Southern Cross")

The model has clearly picked up the weekly seasonal pattern and the increasing trend at the end of the data.

Model selection

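A great advantage of the ETS statistical framework is that information criteria can be used for model selection.
These include the AIC, AICc and BIC.
With \(k\) the number of parameters estimated in the model (including the initial states and the residual variance), \(L\) the likelihood of the model and \(T\) the number of observations, they are defined as follows.

\[\text{AIC} = -2\log{L} + 2k\]
\[\text{AIC}_{c} = \text{AIC} + \frac{2k(k+1)}{T-k-1}\]
\[\text{BIC} = \text{AIC} + k(\log{T}-2)\]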

Let's look at an example.
The examples above all specified every term of the formula inside ETS(), but when only the response variable is given, ETS() selects the model that minimises the \(\text{AIC}_{c}\).

fit <- aus_holidays %>% 
  model(ETS(Trips))

fit %>% 
  report()

## Series: Trips 
## Model: ETS(M,N,A) 
##   Smoothing parameters:
##     alpha = 0.3484054 
##     gamma = 0.0001000018 
## 
##   Initial states:
##      l[0]       s[0]      s[-1]      s[-2]    s[-3]
##  9.727072 -0.5376106 -0.6884343 -0.2933663 1.519411
## 
##   sigma^2:  0.0022
## 
##      AIC     AICc      BIC 
## 226.2289 227.7845 242.9031
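The AICc and BIC in this output can be recovered from the AIC with the formulas given earlier. The sketch below assumes \(k = 7\) estimated parameters and \(T = 80\) quarterly observations; the value of \(k\) is inferred here because it reproduces the reported numbers, and it is not stated in the output itself. `ic_from_aic()` is a hypothetical helper.

```r
# AICc and BIC computed from AIC, the parameter count k and the sample size T
ic_from_aic <- function(aic, k, n_obs) {
  c(AICc = aic + 2 * k * (k + 1) / (n_obs - k - 1),
    BIC = aic + k * (log(n_obs) - 2))
}

ic <- ic_from_aic(aic = 226.2289, k = 7, n_obs = 80)
round(ic, 4)   # compare with the AICc and BIC reported above
```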

The components() function extracts the estimated components of the fitted model.

fit %>% 
  components() %>% 
  autoplot() +
  labs(title = "ETS(M, N, A) components")

## Warning: Removed 4 row(s) containing missing values (geom_path).

fit %>% 
  augment() %>% 
  autoplot(.resid)

fit %>% 
  augment() %>% 
  autoplot(.innov)

[R] 5. Time-series Regression

택2 — Thu, 29 Jul 2021 11:50:42 +0900

library(fpp3)
#library(fable)

The linear model

\[y_{t} = \beta_0 + \beta_1 x_{t} + \epsilon_{t}\]

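This is the familiar simple linear regression model.
When it is applied to time series data, the usual assumptions about the error term still apply.
- iid (independent and identically distributed)
Let's explore it with the functions in the fable package.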

The example data set is us_change, which is loaded with fpp3.
- It contains quarterly percentage changes (growth rates) in US personal consumption expenditure and personal disposable income, among other series, from 1970 Q1 to 2019 Q2.

us_change

## # A tsibble: 198 x 6 [1Q]
##    Quarter Consumption Income Production Savings Unemployment
##      <qtr>       <dbl>  <dbl>      <dbl>   <dbl>        <dbl>
##  1 1970 Q1       0.619  1.04      -2.45    5.30         0.9  
##  2 1970 Q2       0.452  1.23      -0.551   7.79         0.5  
##  3 1970 Q3       0.873  1.59      -0.359   7.40         0.5  
##  4 1970 Q4      -0.272 -0.240     -2.19    1.17         0.700
##  5 1971 Q1       1.90   1.98       1.91    3.54        -0.100
##  6 1971 Q2       0.915  1.45       0.902   5.87        -0.100
##  7 1971 Q3       0.794  0.521      0.308  -0.406        0.100
##  8 1971 Q4       1.65   1.16       2.29   -1.49         0    
##  9 1972 Q1       1.31   0.457      4.15   -4.29        -0.200
## 10 1972 Q2       1.89   1.03       1.89   -4.69        -0.100
## # … with 188 more rows

To look at the quarterly movements of each variable, we reshape the data with tidyr::pivot_longer() and plot it.

us_change %>% 
  pivot_longer(
    cols = c("Consumption", "Income"),
    names_to = "Series",
    values_to = "value"
  ) %>% 
  autoplot(value) +
  labs(y = "% change")

Next we look at the linear relationship between income and consumption, using geom_smooth(method = "lm").

us_change %>% 
  ggplot(aes(x = Income, y = Consumption)) +
  geom_point(size = 1.2) +
  geom_smooth(formula = y ~ x, method = "lm", se = FALSE) + # method = "lm" fits a linear regression line
  labs(
    x = "Income (quarterly % change)",
    y = "Consumption (quarterly % change)"
  )

We now estimate the same relationship with the TSLM() function.
The results are printed with report().

us_change %>% 
  model(TSLM(Consumption ~ Income)) %>% 
  report()

## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58236 -0.27777  0.01862  0.32330  1.42229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.54454    0.05403  10.079  < 2e-16 ***
## Income       0.27183    0.04673   5.817  2.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5905 on 196 degrees of freedom
## Multiple R-squared: 0.1472,  Adjusted R-squared: 0.1429
## F-statistic: 33.84 on 1 and 196 DF, p-value: 2.4022e-08

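The fitted line is \(\widehat{\text{Consumption}} = 0.54454 + 0.27183\times\text{Income}\).
The coefficient is statistically significant, and it means that a one unit increase in income is associated, on average, with a 0.27 unit increase in consumption.
(Taking the intercept into account, when income grows by 1%, the predicted growth in consumption is 0.545 + 0.272 × 1 ≈ 0.82%.)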

Multiple linear model

\[y_{t} = \beta_0 + \beta_1 x_{1, t} + \beta_2 x_{2, t} + ... + \beta_k x_{k, t} + \epsilon_{t}\]

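This is the linear regression model with two or more predictors.
Let's apply it right away. To forecast consumption expenditure, we include industrial production, personal savings and the unemployment rate as predictors in addition to personal income.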

us_change %>% 
  pivot_longer(
    cols = c("Production", "Savings", "Unemployment"),
    names_to = "Series",
    values_to = "value"
  ) %>% 
  autoplot(value, show.legend = FALSE) +
  labs(y = "% change") +
  facet_wrap(. ~ Series, scales = "free_y", nrow = 3)

First we want to check the correlations between the variables.
This can be done with the ggpairs() function from the GGally package.

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

us_change %>% 
  ggpairs(columns = 2:6)

Least Squares Estimation

In linear regression, the coefficients \(\beta_{0}, \dots, \beta_{k}\) are estimated by minimising the sum of squared errors below.

\[\sum^{T}_{t=1}\epsilon_{t}^{2} = \sum^{T}_{t=1}\bigg(y_{t} - \beta_0 - \beta_1x_{1,t}-...-\beta_{k}x_{k, t} \bigg)^{2}\]

This is called least squares estimation.
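For intuition, the minimiser of this sum of squares has the closed form \(\hat{\beta} = (X'X)^{-1}X'y\), which can be checked against `lm()` in base R. The simulated data and coefficient values below are assumptions made only for this sketch.

```r
set.seed(10)
n <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 0.5 + 0.7 * x1 - 0.2 * x2 + rnorm(n, sd = 0.3)

X <- cbind(1, x1, x2)                               # design matrix with intercept column
beta_hat <- solve(crossprod(X), crossprod(X, y))    # solves (X'X) b = X'y

fit_lm <- lm(y ~ x1 + x2)
cbind(normal_equations = as.numeric(beta_hat), lm = unname(coef(fit_lm)))
```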
The TSLM() function fits a linear regression model to time series data.
It is similar to the widely used lm() function, but TSLM() provides additional facilities for handling time series.
We continue with the example above.

fit_consMR <- us_change %>% 
  model(tslm = TSLM(formula = Consumption ~ Income + Production + Unemployment + Savings))

fit_consMR %>% 
  report()

## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90555 -0.15821 -0.03608  0.13618  1.15471 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.253105   0.034470   7.343 5.71e-12 ***
## Income        0.740583   0.040115  18.461  < 2e-16 ***
## Production    0.047173   0.023142   2.038   0.0429 *  
## Unemployment -0.174685   0.095511  -1.829   0.0689 .  
## Savings      -0.052890   0.002924 -18.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3102 on 193 degrees of freedom
## Multiple R-squared: 0.7683,  Adjusted R-squared: 0.7635
## F-statistic:   160 on 4 and 193 DF, p-value: < 2.22e-16

Interpreting the coefficients is left to the reader.

Fitted values

The fitted values can be obtained with the augment() function.

fit_consMR %>% 
  augment()

## # A tsibble: 198 x 6 [1Q]
## # Key:       .model [1]
##    .model Quarter Consumption .fitted  .resid  .innov
##    <chr>    <qtr>       <dbl>   <dbl>   <dbl>   <dbl>
##  1 tslm   1970 Q1       0.619   0.474  0.145   0.145 
##  2 tslm   1970 Q2       0.452   0.635 -0.183  -0.183 
##  3 tslm   1970 Q3       0.873   0.931 -0.0583 -0.0583
##  4 tslm   1970 Q4      -0.272  -0.212 -0.0603 -0.0603
##  5 tslm   1971 Q1       1.90    1.64   0.264   0.264 
##  6 tslm   1971 Q2       0.915   1.07  -0.158  -0.158 
##  7 tslm   1971 Q3       0.794   0.658  0.137   0.137 
##  8 tslm   1971 Q4       1.65    1.30   0.347   0.347 
##  9 tslm   1972 Q1       1.31    1.05   0.262   0.262 
## 10 tslm   1972 Q2       1.89    1.37   0.513   0.513 
## # … with 188 more rows

fit_consMR %>% 
  augment() %>% 
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Consumption, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  labs(y = NULL, title = "Percent change in US consumption expenditure") +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00")) +
  guides(colour = guide_legend(title = NULL))

The plot below is a scatterplot of the fitted consumption expenditure against the actual consumption expenditure.

fit_consMR %>% 
  augment() %>% 
  ggplot(aes(x = Consumption, y = .fitted)) +
  geom_point(size = 1.2) +
  labs(
    x = "Data (actual values)",
    y = "Fitted (predicted values)",
    title = "Percent change in US consumption expenditure"
  ) +
  geom_abline(intercept = 0, slope = 1, color = "grey50")

Goodness-of-fit

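A common way to summarise how well a linear regression model fits the data is the coefficient of determination, \(R^{2}\).
It can be calculated as the square of the correlation between the observed values and the fitted values, or as follows.

\[R^{2} = \frac{\sum(\hat{y}_{t} - \bar{y})^2}{\sum(y_{t} - \bar{y})^2}\]

It lies between 0 and 1, and is closer to 1 the closer the fitted values are to the actual values.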
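Both ways of computing \(R^{2}\) agree for a least squares fit with an intercept. A small base R check on simulated data (the data and variable names are assumptions for this sketch):

```r
set.seed(11)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

fit_lm <- lm(y ~ x)
y_hat <- fitted(fit_lm)

r2_ratio <- sum((y_hat - mean(y))^2) / sum((y - mean(y))^2)   # formula above
r2_cor <- cor(y, y_hat)^2                                     # squared correlation
r2_lm <- summary(fit_lm)$r.squared                            # value reported by lm()
```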
In the output of report() it appears as Multiple R-squared; Adjusted R-squared is a variant that also penalises the number of predictors.

fit_consMR %>% 
  report()

## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90555 -0.15821 -0.03608  0.13618  1.15471 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.253105   0.034470   7.343 5.71e-12 ***
## Income        0.740583   0.040115  18.461  < 2e-16 ***
## Production    0.047173   0.023142   2.038   0.0429 *  
## Unemployment -0.174685   0.095511  -1.829   0.0689 .  
## Savings      -0.052890   0.002924 -18.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3102 on 193 degrees of freedom
## Multiple R-squared: 0.7683,  Adjusted R-squared: 0.7635
## F-statistic:   160 on 4 and 193 DF, p-value: < 2.22e-16
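As a quick check that the two definitions of \(R^{2}\) agree, here is a minimal sketch in base R. It uses the built-in cars dataset and lm() instead of us_change and TSLM(); that substitution is an assumption made only so the snippet runs without fpp3.

```r
# Illustration only: built-in `cars` data with lm(), not the us_change model above.
fit <- lm(dist ~ speed, data = cars)
y <- cars$dist
y_hat <- fitted(fit)

# Definition 1: explained sum of squares over total sum of squares
r2_ratio <- sum((y_hat - mean(y))^2) / sum((y - mean(y))^2)

# Definition 2: squared correlation between actual and fitted values
r2_cor <- cor(y, y_hat)^2

# Both should agree with the value reported by summary()
c(ratio = r2_ratio, cor = r2_cor, summary = summary(fit)$r.squared)
```

The equality holds for least-squares fits that include an intercept, which is the case here.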

Standard error of the regression

Another measure of how well the model fits the data is the standard deviation of the residuals, usually called the residual standard error.

\[\hat{\sigma}_{e} = \sqrt{\frac{1}{T-k-1}\sum^{T}_{t=1}e_{t}^2}\]

This also appears in the report() output, as Residual standard error.
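The formula can be verified by hand. The sketch below again assumes the built-in cars data and lm() (not the us_change model), with \(k = 1\) predictor, and compares the hand-computed value with what R reports.

```r
# Illustration only: built-in `cars` data, one predictor (k = 1).
fit <- lm(dist ~ speed, data = cars)
e <- residuals(fit)
T_obs <- length(e)  # number of observations
k <- 1              # number of predictors (excluding the intercept)

# Residual standard error: sqrt(SSE / (T - k - 1))
sigma_hat <- sqrt(sum(e^2) / (T_obs - k - 1))

c(by_hand = sigma_hat, from_lm = sigma(fit))
```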

Evaluating the regression model

After fitting a regression model, check the residuals to see whether the model's assumptions hold.

ACF plot of residuals & Histogram of residuals

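With time series data, the value at time t is very likely to be similar to, or influenced by, the value at time t-1 or earlier.
For that reason it is common to find autocorrelation in the residuals when a regression model is fitted to time series data.
Forecasts from a model with autocorrelated errors are still unbiased, so they are not wrong, but they usually come with larger prediction errors than necessary.
It is also worth checking whether the residuals are normally distributed.
The gg_tsresiduals() function produces the diagnostic plots for the residuals.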

fit_consMR %>% 
  gg_tsresiduals()

fit_consMR %>% 
  augment() %>% 
  features(
    .var = .innov,
    features = ljung_box,
    lag = 10, 
    dof = 5
  )

## # A tibble: 1 x 3
##   .model lb_stat lb_pvalue
##   <chr>    <dbl>     <dbl>
## 1 tslm      18.9   0.00204

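The ACF plot shows a spike outside the bounds at lag 7, and the Ljung-Box test above is significant at the 5% level (p-value 0.00204), so some autocorrelation remains in the residuals. It is not large, however, and may not have much effect on the forecasts.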
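For readers who want to see what the Ljung-Box statistic computes, here is a minimal base R sketch. The helper name ljung_box_by_hand is hypothetical, the input is simulated white noise instead of the model residuals, and the result is checked against stats::Box.test(). The statistic is \(Q^{*} = T(T+2)\sum_{k=1}^{\ell} r_{k}^{2}/(T-k)\), compared with a chi-squared distribution on \(\ell\) minus dof degrees of freedom.

```r
# Hypothetical helper, for illustration; not part of feasts.
ljung_box_by_hand <- function(e, lag, dof = 0) {
  n <- length(e)
  # Sample autocorrelations r_1, ..., r_lag (drop lag 0)
  r <- as.numeric(acf(e, lag.max = lag, plot = FALSE)$acf)[-1]
  stat <- n * (n + 2) * sum(r^2 / (n - seq_len(lag)))
  p <- pchisq(stat, df = lag - dof, lower.tail = FALSE)
  c(lb_stat = stat, lb_pvalue = p)
}

set.seed(1)      # arbitrary seed so the example is reproducible
e <- rnorm(100)  # simulated white noise, not the fit_consMR residuals

by_hand <- ljung_box_by_hand(e, lag = 10)
reference <- Box.test(e, lag = 10, type = "Ljung-Box")
by_hand
```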

Residual plots against predictors

It is also worth checking scatter plots of the residuals against each predictor.

fit_consMR %>% 
  residuals()

## # A tsibble: 198 x 3 [1Q]
## # Key:       .model [1]
##    .model Quarter  .resid
##    <chr>    <qtr>   <dbl>
##  1 tslm   1970 Q1  0.145 
##  2 tslm   1970 Q2 -0.183 
##  3 tslm   1970 Q3 -0.0583
##  4 tslm   1970 Q4 -0.0603
##  5 tslm   1971 Q1  0.264 
##  6 tslm   1971 Q2 -0.158 
##  7 tslm   1971 Q3  0.137 
##  8 tslm   1971 Q4  0.347 
##  9 tslm   1972 Q1  0.262 
## 10 tslm   1972 Q2  0.513 
## # … with 188 more rows

us_change %>% 
  left_join(residuals(fit_consMR), by = "Quarter") %>% 
  pivot_longer(
    cols = Income:Unemployment,
    names_to = "regressor",
    values_to = "x"
  ) %>% 
  ggplot(aes(x = x, y = .resid)) +
  geom_point(size = 1.2) +
  facet_wrap(. ~ regressor, scales = "free_x") +
  labs(x = NULL, y = "Residuals")

As the plots above show, the residuals look randomly scattered against each predictor, so the assumptions appear to be reasonably satisfied.

Residual plots against fitted values

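A plot of the residuals against the fitted values should not show any pattern either.
A visible pattern suggests that the errors may violate the constant-variance (homoscedasticity) assumption.
If that assumption fails, a transformation of the variable, such as a logarithm or square root, may be needed.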

fit_consMR %>% 
  augment() %>% 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(size = 1.2) +
  labs(x = "Fitted", y = "Residuals")

No particular pattern is visible, so the constant-variance assumption looks reasonable.

Outliers and influential observations

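Observations that take extreme values compared with the majority of the data are called outliers.
Observations that have a large influence on the estimated coefficients of a regression model are called influential observations.
Influential observations are often extreme outliers as well.
Simply removing such outliers is not automatically the right thing to do.
An outlier can carry real information, so it is important to work out why that value occurred.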

Spurious regression

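Time series data are often non-stationary.
- Roughly speaking, a series is stationary when its statistical properties, such as its mean and variance, stay constant over time.
Non-stationary series usually need to be transformed into stationary ones before modelling.
Regressing one non-stationary series on another can give a spurious regression: the fit looks strong even though the variables are unrelated, as in the example below, which regresses Australian air passengers on rice production in Guinea.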

temp_fit <- aus_airpassengers %>% 
  filter(Year <= 2011) %>% 
  left_join(guinea_rice, by = "Year") %>% 
  model(TSLM(Passengers ~ Production))

temp_fit %>% 
  report()

## Series: Passengers 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9448 -1.8917 -0.3272  1.8620 10.4210 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.493      1.203  -6.229 2.25e-07 ***
## Production    40.288      1.337  30.135  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.239 on 40 degrees of freedom
## Multiple R-squared: 0.9578,  Adjusted R-squared: 0.9568
## F-statistic: 908.1 on 1 and 40 DF, p-value: < 2.22e-16

temp_fit %>% 
  gg_tsresiduals()
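The same effect can be reproduced with simulated data. The sketch below is an illustration under stated assumptions: it generates two independent random walks (so any relationship between them is spurious by construction), regresses one on the other in levels, and then repeats the regression on the differenced series. No particular output is claimed; the point is to compare the two \(R^{2}\) values on your own run.

```r
set.seed(123)  # arbitrary seed, for reproducibility only
n <- 200
x <- cumsum(rnorm(n))  # random walk 1
y <- cumsum(rnorm(n))  # random walk 2, generated independently of x

# Regression in levels: both series are non-stationary
r2_levels <- summary(lm(y ~ x))$r.squared

# Regression on first differences: both series are now stationary white noise
dy <- diff(y)
dx <- diff(x)
r2_diffs <- summary(lm(dy ~ dx))$r.squared

c(levels = r2_levels, differences = r2_diffs)
```

Trying several seeds shows how often the levels regression looks convincing despite there being no real relationship.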

Some useful predictors

Trend

\[y_{t} = \beta_{0} + \beta_{1} t + \epsilon_{t}\]

Inside TSLM(), a trend variable can be specified with trend().

Dummy variables

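Suppose we want to account for whether a given date is a public holiday. We can add a variable that takes the value 1 on a holiday and 0 otherwise.
This is called a dummy variable, also known as an indicator variable.
When a categorical variable has k categories (three or more), it can be encoded with (k-1) dummy variables.
TSLM() handles this automatically when the categorical variable is supplied as a predictor.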
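To make the (k-1) rule concrete, here is a small base R sketch using model.matrix() on a made-up quarterly factor. It shows how R's formula machinery encodes a four-level factor as an intercept plus three dummies; it is assumed here, not asserted, that TSLM() relies on the same kind of encoding.

```r
# Made-up example: two years of quarterly labels as a factor with k = 4 levels
quarter <- factor(rep(1:4, times = 2))

# The default (treatment) coding uses the first level as the baseline
X <- model.matrix(~ quarter)

colnames(X)  # intercept plus k - 1 = 3 dummy columns
X[1:4, ]     # quarter 1 rows have all three dummies equal to 0
```

The first quarter is absorbed into the intercept, which is why the coefficients of the remaining dummies are read as differences from the first quarter.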

Seasonal dummy variables

Seasonal dummy variables (day of the week, quarter, and so on) can likewise be specified with season() inside TSLM().
The example below uses quarterly beer production in Australia.

recent_production <- aus_production %>% 
  filter(year(Quarter) >= 1992)

recent_production %>% 
  autoplot(Beer) +
  labs(y = "Megalitres", title = "Australian quarterly beer production")

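Suppose we want to forecast future beer production.
We can model the data with a regression that has a linear trend and quarterly dummy variables.
- The data are quarterly, so there are three dummies. (4-1 = 3)

\[y_{t} = \beta_{0} + \beta_{1}t + \beta_{2}d_{2, t} + \beta_{3}d_{3, t} + \beta_{4}d_{4, t} + \epsilon_t\]

Here \(d_{i, t} = 1\) if \(t\) falls in quarter \(i\) and 0 otherwise.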

# trend() and season() are not ordinary standalone functions; use them inside the TSLM() formula when needed.
fit_beer <- recent_production %>% 
  model(TSLM(formula = Beer ~ trend() + season()))

fit_beer %>% 
  report()

## Series: Beer 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -42.9029  -7.5995  -0.4594   7.9908  21.7895 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   441.80044    3.73353 118.333  < 2e-16 ***
## trend()        -0.34027    0.06657  -5.111 2.73e-06 ***
## season()year2 -34.65973    3.96832  -8.734 9.10e-13 ***
## season()year3 -17.82164    4.02249  -4.430 3.45e-05 ***
## season()year4  72.79641    4.02305  18.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared: 0.9243,  Adjusted R-squared: 0.9199
## F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16

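The fitted model shows an average downward trend of 0.34 megalitres per quarter.
On average, production in the second quarter is 34.7 megalitres lower than in the first quarter, the third quarter is 17.8 megalitres lower, and the fourth quarter is 72.8 megalitres higher.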

fit_beer %>% 
  augment() %>% 
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, colour = "Data")) +
  geom_line(aes(y = .fitted, colour = "Fitted")) +
  scale_colour_manual(values = c(Data = "black", Fitted = "#D55E00")) +
  labs(y = "Megalitres", title = "Australian quarterly beer production") +
  guides(colour = guide_legend(title = "Series"))

The actual and fitted beer production can also be plotted against each other, coloured by quarter, as below.

fit_beer %>% 
  augment() %>% 
  ggplot(aes(x = Beer, y = .fitted, colour = factor(quarter(Quarter)))) +
  geom_point(size = 1.2) +
  labs(x = "Actual values", y = "Fitted", title = "Australian quarterly beer production") +
  geom_abline(intercept = 0, slope = 1) +
  guides(colour = guide_legend(title = "Quarter"))

Intervention variables

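It is often necessary to model interventions that may have affected the variable being forecast.
For example, if there is a difference before and after a promotion, we can add a dummy-style variable that is 0 before the promotion and 1 afterwards.
A 0/1 step variable like this captures a level shift. When the intervention instead changes the slope, it is handled with a piecewise linear trend, which bends at the time of the intervention and is therefore nonlinear.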

Fourier series

An alternative to seasonal dummy variables, especially for long seasonal periods, is to use Fourier terms.
With \(m\) denoting the seasonal period, the terms are:
- \(x_{1, t} = \sin\big(\frac{2\pi t}{m}\big)\)
- \(x_{2, t} = \cos\big(\frac{2\pi t}{m}\big)\)
- \(x_{3, t} = \sin\big(\frac{4\pi t}{m}\big)\)
- \(x_{4, t} = \cos\big(\frac{4\pi t}{m}\big)\)
- \(x_{5, t} = \sin\big(\frac{6\pi t}{m}\big)\)
- \(x_{6, t} = \cos\big(\frac{6\pi t}{m}\big)\)
- …
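The advantage of Fourier terms is that they often need fewer predictors than dummy variables.
Weekly data is a typical case. (\(m \approx 52\))
Fourier terms are specified with the fourier() function.
- The K argument sets how many pairs of sin and cos terms to include.
- With seasonal period \(m\), the maximum allowed value is \(K = m/2\).
- The data here are quarterly, so \(m=4\) and we use \(K=2\).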

recent_production %>% 
  model(TSLM(formula = Beer ~ trend() + fourier(K = 2))) %>% 
  report()

## Series: Beer 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -42.9029  -7.5995  -0.4594   7.9908  21.7895 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        446.87920    2.87321 155.533  < 2e-16 ***
## trend()             -0.34027    0.06657  -5.111 2.73e-06 ***
## fourier(K = 2)C1_4   8.91082    2.01125   4.430 3.45e-05 ***
## fourier(K = 2)S1_4 -53.72807    2.01125 -26.714  < 2e-16 ***
## fourier(K = 2)C2_4 -13.98958    1.42256  -9.834 9.26e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared: 0.9243,  Adjusted R-squared: 0.9199
## F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16
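The report above lists only three Fourier coefficients even though \(K = 2\) nominally gives four terms. A likely explanation, sketched below in base R, is that for \(m = 4\) the second sine term \(\sin(4\pi t/4) = \sin(\pi t)\) is zero at every integer \(t\), so it carries no information. The column names mimic the labels in the report and are otherwise just labels chosen for this illustration.

```r
m <- 4          # seasonal period (quarterly data)
K <- 2          # number of sin/cos pairs requested
time_idx <- 1:8 # two years of quarterly time points

# Build the sin/cos pairs for k = 1, ..., K
X <- do.call(cbind, lapply(seq_len(K), function(k) {
  cbind(sin(2 * pi * k * time_idx / m), cos(2 * pi * k * time_idx / m))
}))
colnames(X) <- c("S1_4", "C1_4", "S2_4", "C2_4")

# S2_4 is zero (up to floating point), C2_4 alternates between -1 and 1
round(X, 10)
```

With the redundant column dropped, the Fourier model has the same number of coefficients as the seasonal dummy model, which is consistent with the two reports showing identical residuals and R-squared values.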

Selecting predictors

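When choosing predictors after fitting a regression, a common habit is to keep the variables whose p-values fall below some significance level.
Statistical significance, however, is not always a reliable guide.
In particular, p-values can be misleading when two or more predictors are correlated with each other.
To reduce this risk, candidate models can instead be compared using measures of overall predictive accuracy.
glance() provides these measures.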

fit_consMR %>% 
  glance() %>% 
  select(adj_r_squared, CV, AIC, AICc, BIC)

## # A tibble: 1 x 5
##   adj_r_squared    CV   AIC  AICc   BIC
##           <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         0.763 0.104 -457. -456. -437.

Adjusted R-squared

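Plain \(R^2\) measures how well the model fits the historical data, not how well it will forecast the future.
It also ignores degrees of freedom, so it never decreases as predictors are added.
Adjusted R-squared corrects for this by taking the degrees of freedom into account.
- In Korean it is referred to as 조정된 결정계수 or 수정된 결정계수 (adjusted or corrected coefficient of determination).
Maximising it is equivalent to minimising the standard error of the regression, \(\hat{\sigma}_{e}\).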
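The usual formula is \(\bar{R}^{2} = 1 - (1 - R^{2})\frac{T-1}{T-k-1}\). As a sanity check, the sketch below plugs in the numbers printed in the fit_consMR report earlier (Multiple R-squared 0.7683, 198 observations, 4 predictors); the variable names are chosen for this illustration.

```r
r2 <- 0.7683   # Multiple R-squared from the fit_consMR report
T_obs <- 198   # number of observations
k <- 4         # number of predictors (Income, Production, Unemployment, Savings)

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (T_obs - 1) / (T_obs - k - 1)

round(adj_r2, 4)  # 0.7635, matching the Adjusted R-squared in the report
```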

Cross-validation

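A full explanation would run long, so for now see the Wikipedia article on cross-validation.
For time series too, cross-validation is a common way to assess the predictive ability of a model.
The smaller the CV value, the better the model.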

Akaike’s Information Criterion

\[\text{AIC} = T \log\bigg(\frac{\text{SSE}}{T}\bigg) + 2(k+2)\]

\(T\) is the number of observations and \(k\) is the number of predictors.
Again, the model with the smallest value is considered the best.
Minimising the AIC is in the same spirit as minimising the CV.

Corrected Akaike’s Information Criterion

When the number of observations \(T\) is small, the AIC tends to select too many predictors, so a bias-corrected version is defined as follows.

\[\text{AIC}_{c} = \text{AIC} + \frac{2(k+2)(k+3)}{T-k-3}\]

As before, smaller is better.

Schwarz’s Bayesian Information Criterion

\[\text{BIC} = T\log\bigg(\frac{\text{SSE}}{T}\bigg) + (k+2)\log(T)\]

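As with the AIC, the aim is to minimise the BIC. Because \(\log(T)\) exceeds 2 once \(T\) is 8 or more, the BIC penalises extra parameters more heavily than the AIC.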
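The three formulas can be checked against the glance() output shown earlier. The sketch below is a back-of-the-envelope calculation, not a call into fable: it recovers the SSE from the rounded Residual standard error (0.3102 on 193 degrees of freedom) in the fit_consMR report, so small rounding differences are to be expected.

```r
T_obs <- 198  # number of observations
k <- 4        # number of predictors

# SSE recovered from the reported residual standard error: sigma^2 * (T - k - 1)
sse <- 0.3102^2 * (T_obs - k - 1)

aic  <- T_obs * log(sse / T_obs) + 2 * (k + 2)
aicc <- aic + 2 * (k + 2) * (k + 3) / (T_obs - k - 3)
bic  <- T_obs * log(sse / T_obs) + (k + 2) * log(T_obs)

# Rounded to whole numbers: -457, -456, -437, in line with the glance() output
round(c(AIC = aic, AICc = aicc, BIC = bic))
```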

Which measure should we use?

The reference text recommends using one of the AICc, AIC, or CV statistics.
It is not long, so I recommend reading it at least once.
Most of the examples here also select the forecasting model by AICc.

Forecasting with regression

Forecasts can be produced by applying forecast() to an object fitted with TSLM().

fit_beer %>% 
  report()

## Series: Beer 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -42.9029  -7.5995  -0.4594   7.9908  21.7895 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   441.80044    3.73353 118.333  < 2e-16 ***
## trend()        -0.34027    0.06657  -5.111 2.73e-06 ***
## season()year2 -34.65973    3.96832  -8.734 9.10e-13 ***
## season()year3 -17.82164    4.02249  -4.430 3.45e-05 ***
## season()year4  72.79641    4.02305  18.095  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared: 0.9243,  Adjusted R-squared: 0.9199
## F-statistic: 210.7 on 4 and 69 DF, p-value: < 2.22e-16

fc_beer <- fit_beer %>% 
  forecast()

fc_beer

## # A fable: 8 x 4 [1Q]
## # Key:     .model [1]
##   .model                                    Quarter        Beer .mean
##   <chr>                                       <qtr>      <dist> <dbl>
## 1 TSLM(formula = Beer ~ trend() + season()) 2010 Q3 N(398, 164)  398.
## 2 TSLM(formula = Beer ~ trend() + season()) 2010 Q4 N(489, 164)  489.
## 3 TSLM(formula = Beer ~ trend() + season()) 2011 Q1 N(416, 165)  416.
## 4 TSLM(formula = Beer ~ trend() + season()) 2011 Q2 N(381, 165)  381.
## 5 TSLM(formula = Beer ~ trend() + season()) 2011 Q3 N(397, 166)  397.
## 6 TSLM(formula = Beer ~ trend() + season()) 2011 Q4 N(487, 166)  487.
## 7 TSLM(formula = Beer ~ trend() + season()) 2012 Q1 N(414, 166)  414.
## 8 TSLM(formula = Beer ~ trend() + season()) 2012 Q2 N(379, 166)  379.
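The point forecasts can be reproduced by hand from the coefficients in the report. The sketch below assumes that trend() counts 1, 2, 3, ... from the first observation (1992 Q1), so that the 74 fitted quarters end at 2010 Q2 and the first forecast quarter, 2010 Q3, has trend value 75. That assumption is consistent with the table above but is not stated in the post.

```r
# Coefficients copied from the fit_beer report
b0 <- 441.80044
b_trend <- -0.34027
b_season <- c(Q1 = 0, Q2 = -34.65973, Q3 = -17.82164, Q4 = 72.79641)  # Q1 is the baseline

# First four forecast quarters: 2010 Q3, 2010 Q4, 2011 Q1, 2011 Q2
t_new <- 75:78                      # assumed trend index (see lead-in)
q_new <- c("Q3", "Q4", "Q1", "Q2")

fc_by_hand <- b0 + b_trend * t_new + b_season[q_new]
round(unname(fc_by_hand))  # 398 489 416 381, matching .mean in the first four rows
```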

The forecasts are plotted below.
- Of the shaded regions, the darker one is the 80% prediction interval and the lighter one is the 95% prediction interval.

fc_beer %>% 
  autoplot(recent_production) +
  labs(y = "megalitres", title = "Forecasts of beer production using regression")

Next, let's try scenario-based forecasting with a different dataset, us_change.

us_change

## # A tsibble: 198 x 6 [1Q]
##    Quarter Consumption Income Production Savings Unemployment
##      <qtr>       <dbl>  <dbl>      <dbl>   <dbl>        <dbl>
##  1 1970 Q1       0.619  1.04      -2.45    5.30         0.9  
##  2 1970 Q2       0.452  1.23      -0.551   7.79         0.5  
##  3 1970 Q3       0.873  1.59      -0.359   7.40         0.5  
##  4 1970 Q4      -0.272 -0.240     -2.19    1.17         0.700
##  5 1971 Q1       1.90   1.98       1.91    3.54        -0.100
##  6 1971 Q2       0.915  1.45       0.902   5.87        -0.100
##  7 1971 Q3       0.794  0.521      0.308  -0.406        0.100
##  8 1971 Q4       1.65   1.16       2.29   -1.49         0    
##  9 1972 Q1       1.31   0.457      4.15   -4.29        -0.200
## 10 1972 Q2       1.89   1.03       1.89   -4.69        -0.100
## # … with 188 more rows

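The scenarios are set up as follows.
- With no change in the unemployment rate (Unemployment), income (Income) and savings (Savings) either grow by 1% and 0.5% respectively, or decline by the same amounts.
- The scenario sets used for forecasting are assembled with the scenarios() function.
new_data(), from the tsibble package, generates the requested number of future time points for each key-index combination.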

# Fit the regression model
fit_consBest <- us_change %>% 
  model(lm = TSLM(formula = Consumption ~ Income + Savings + Unemployment))

# Build the scenarios
future_scenarios <- scenarios(
  Increase = new_data(us_change, 4) %>% 
    mutate(
      Income = 1,
      Savings = 0.5,
      Unemployment = 0
    ),
  Decrease = new_data(us_change, 4) %>% 
    mutate(
      Income = -1,
      Savings = -0.5,
      Unemployment = 0
    ),
  names_to = "Scenario"
)

future_scenarios

## $Increase
## # A tsibble: 4 x 4 [1Q]
##   Quarter Income Savings Unemployment
##     <qtr>  <dbl>   <dbl>        <dbl>
## 1 2019 Q3      1     0.5            0
## 2 2019 Q4      1     0.5            0
## 3 2020 Q1      1     0.5            0
## 4 2020 Q2      1     0.5            0
## 
## $Decrease
## # A tsibble: 4 x 4 [1Q]
##   Quarter Income Savings Unemployment
##     <qtr>  <dbl>   <dbl>        <dbl>
## 1 2019 Q3     -1    -0.5            0
## 2 2019 Q4     -1    -0.5            0
## 3 2020 Q1     -1    -0.5            0
## 4 2020 Q2     -1    -0.5            0
## 
## attr(,"names_to")
## [1] "Scenario"

The scenarios are then passed to the new_data argument of forecast().

fc <- fit_consBest %>% 
  forecast(new_data = future_scenarios)

fc

## # A fable: 8 x 8 [1Q]
## # Key:     Scenario, .model [2]
##   Scenario .model Quarter   Consumption  .mean Income Savings Unemployment
##   <chr>    <chr>    <qtr>        <dist>  <dbl>  <dbl>   <dbl>        <dbl>
## 1 Increase lm     2019 Q3   N(1, 0.098)  0.996      1     0.5            0
## 2 Increase lm     2019 Q4   N(1, 0.098)  0.996      1     0.5            0
## 3 Increase lm     2020 Q1   N(1, 0.098)  0.996      1     0.5            0
## 4 Increase lm     2020 Q2   N(1, 0.098)  0.996      1     0.5            0
## 5 Decrease lm     2019 Q3 N(-0.46, 0.1) -0.464     -1    -0.5            0
## 6 Decrease lm     2019 Q4 N(-0.46, 0.1) -0.464     -1    -0.5            0
## 7 Decrease lm     2020 Q1 N(-0.46, 0.1) -0.464     -1    -0.5            0
## 8 Decrease lm     2020 Q2 N(-0.46, 0.1) -0.464     -1    -0.5            0

Passing the scenario forecasts to autolayer() gives the plot below.

us_change %>% 
  autoplot(Consumption) +
  autolayer(fc) +
  labs(title = "US consumption", y = "% change")

Nonlinear regression

The general form of the model is as follows.

\[y_{t} = f(x_{t})+\epsilon_{t}\]

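Here \(f()\) is taken to be a nonlinear function.
- One of the simplest forms of \(f()\) is a piecewise linear function.
- Regression splines are another option.
- Other options are polynomial terms of degree two or higher, or a natural log transformation (when the values are strictly positive).
An example follows.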

boston_men <- boston_marathon %>% 
  filter(Year >= 1924 & Event == "Men's open division") %>% 
  mutate(Minutes = as.numeric(Time)/60)

boston_men

## # A tsibble: 96 x 6 [1Y]
## # Key:       Event [1]
##    Event              Year Champion               Country      Time      Minutes
##    <fct>             <int> <chr>                  <chr>        <drtn>      <dbl>
##  1 Men's open divis…  1924 Clarence H. DeMar      United Stat…  8980 se…    150.
##  2 Men's open divis…  1925 Charles L. (Chuck) Me… United Stat…  9180 se…    153 
##  3 Men's open divis…  1926 John C. Miles          Canada        8740 se…    146.
##  4 Men's open divis…  1927 Clarence H. DeMar      United Stat…  9622 se…    160.
##  5 Men's open divis…  1928 Clarence H. DeMar      United Stat…  9427 se…    157.
##  6 Men's open divis…  1929 John C. Miles          Canada        9188 se…    153.
##  7 Men's open divis…  1930 Clarence H. DeMar      United Stat…  9288 se…    155.
##  8 Men's open divis…  1931 James P. Henigan       United Stat… 10005 se…    167.
##  9 Men's open divis…  1932 Paul de Bruyn          Germany       9216 se…    154.
## 10 Men's open divis…  1933 Leslie S. Pawson       United Stat…  9061 se…    151.
## # … with 86 more rows

boston_men %>% 
  autoplot(Minutes) +
  geom_smooth(formula = y ~ x, se = FALSE, method = "lm") +
  labs(x = "Year", y = "Minutes")

boston_men %>% 
  model(TSLM(formula = Minutes ~ trend())) %>% 
  gg_tsresiduals()

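The plot above suggests a linear decline over time,
but the residuals from the linear trend show a nonlinear pattern.
So the code below also fits an exponential trend (log-transformed response) and a piecewise regression for comparison.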

fit_trends <- boston_men %>% 
  model(
    linear = TSLM(formula = Minutes ~ trend()),
    exponential = TSLM(formula = log(Minutes) ~ trend()),
    piecewise = TSLM(formula = Minutes ~ trend(knots = c(1950, 1980)))
  )

fc_trends <- fit_trends %>% 
  forecast(h = 10)

boston_men %>% 
  autoplot(Minutes) +
  geom_line(data = fit_trends %>% fitted(), aes(y = .fitted, colour = .model)) +
  autolayer(fc_trends, alpha = 0.5, level = 95) +
  labs(y = "Minutes", title = "Boston marathon winning times")

The piecewise regression looks like it gives the best result.
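A piecewise linear trend is commonly built from a plain trend column plus one "hinge" column per knot, \((t - c)_{+} = \max(0, t - c)\). The sketch below shows that construction in base R for the knots used above; piecewise_basis is a hypothetical helper, and it is assumed, not asserted, that trend(knots = ...) builds something equivalent internally.

```r
# Hypothetical helper (not part of fable): piecewise linear trend basis.
piecewise_basis <- function(time, knots) {
  # One hinge column per knot: 0 before the knot, then rising by 1 per time step
  hinge <- sapply(knots, function(k) pmax(0, time - k))
  colnames(hinge) <- paste0("after_", knots)
  cbind(trend = time, hinge)
}

years <- 1924:2019
X <- piecewise_basis(years, knots = c(1950, 1980))

# Rows around the knots: each hinge column switches on just after its knot
X[years %in% c(1949, 1951, 1981), ]
```

Regressing on these columns lets the slope change at 1950 and again at 1980 while keeping the fitted line continuous.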

[R] 4. feasts

택2 — Wed, 28 Jul 2021 13:56:35 +0900

`feasts`

The name feasts stands for Feature Extraction And Statistics for Time Series (FEASTS).
It is a package that provides a range of functions for time series analysis.
- Time series decomposition, feature extraction, visualisation, and so on

Graphics: `gg_season()`, `gg_subseries()`, `gg_lag()`, `ACF()`

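The first step in understanding the patterns in a time series is to plot it.
gg_season() is a good way to inspect seasonality.
The example uses the aus_production data from the tsibbledata package.
- The data contain quarterly production estimates for several commodities in Australia, such as beer and tobacco.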

aus_production %>% 
  gg_season(Beer)

gg_subseries() plots the series separately for each season.

aus_production %>% 
  gg_subseries(Beer)

gg_lag() draws scatter plots of the series against its lagged values, coloured by season.

aus_production %>% 
  filter(year(Quarter) > 1991) %>% 
  gg_lag(Beer, geom = "point")

Because the data are quarterly, the points at lag 4 and lag 8 fall close to a straight line for every season, showing a strong linear relationship between the series and its lagged values.

The ACF (autocorrelation function) can be computed with ACF() and plotted with autoplot().
- The autocorrelation at lag k is the correlation between the series at time {i} and at time {i+k}.

aus_production %>% 
  ACF()

## Response variable not specified, automatically selected `var = Beer`

## # A tsibble: 23 x 2 [1Q]
##      lag   acf
##    <lag> <dbl>
##  1    1Q 0.684
##  2    2Q 0.500
##  3    3Q 0.667
##  4    4Q 0.940
##  5    5Q 0.644
##  6    6Q 0.458
##  7    7Q 0.621
##  8    8Q 0.887
##  9    9Q 0.598
## 10   10Q 0.410
## # … with 13 more rows

aus_production %>% 
  ACF() %>% 
  autoplot()

## Response variable not specified, automatically selected `var = Beer`
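To make the definition concrete, here is a minimal base R sketch that computes a sample autocorrelation by hand and compares it with stats::acf(). The helper acf_by_hand and the short series y are made up for this illustration; the formula is the standard sample autocorrelation \(r_{k} = \sum_{t=k+1}^{T}(y_{t}-\bar{y})(y_{t-k}-\bar{y}) / \sum_{t=1}^{T}(y_{t}-\bar{y})^{2}\).

```r
# Hypothetical helper: sample autocorrelation at lag k
acf_by_hand <- function(y, k) {
  n <- length(y)
  y_bar <- mean(y)
  num <- sum((y[(k + 1):n] - y_bar) * (y[1:(n - k)] - y_bar))
  den <- sum((y - y_bar)^2)
  num / den
}

y <- c(5, 7, 6, 9, 8, 11, 10, 13)  # made-up series, for illustration only

r1 <- acf_by_hand(y, 1)
r1_ref <- as.numeric(acf(y, lag.max = 1, plot = FALSE)$acf)[2]  # element 1 is lag 0

c(by_hand = r1, stats_acf = r1_ref)
```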

Decompositions

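Decomposition is one of the most common tasks in time series analysis.
It helps in understanding the patterns in a series and can make later forecasting models more accurate.
In other words, it is an important preprocessing step for describing the pattern of a series more precisely and improving forecast performance.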

Decompositions: Classical decomposition

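Classical decomposition comes in two forms, additive and multiplicative.
- Additive is usually appropriate when the size of the seasonal fluctuations stays roughly constant regardless of the trend.
- Multiplicative is appropriate when the size of the seasonal fluctuations changes with the level of the trend.
This post only sketches the idea.
- Step 1: estimate a trend line by least squares and remove it from the original data to obtain a trend-adjusted series.
- Step 2: take a moving average of the trend-adjusted series from step 1 over the length of the seasonal period, which removes the seasonality.
- Step 3: divide the series by the moving average from step 2 to get a first estimate of the seasonality.
- Step 4: average the step 3 estimates for each season and rescale them so that the seasonal averages sum to the length of the seasonal period. These rescaled values are the seasonal indices.
The steps above describe the multiplicative case; the additive model subtracts in step 3 instead of dividing by a ratio.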
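For comparison, here is a minimal base R sketch of the additive case in its common textbook form, which is what stats::decompose() implements: the trend is estimated with a centred moving average (not the least-squares line in the outline above), and the seasonal effects are centred to sum to zero. The function name classical_additive and the test series are made up for this illustration, and whether classical_decomposition() in feasts follows exactly the same steps is an assumption.

```r
# Hypothetical helper: classical additive decomposition with seasonal period m
classical_additive <- function(y, m) {
  n <- length(y)
  # Centred moving average: 2 x m-MA for even m, plain m-MA for odd m
  w <- if (m %% 2 == 0) c(0.5, rep(1, m - 1), 0.5) / m else rep(1, m) / m
  trend <- as.numeric(stats::filter(y, w, sides = 2))
  detrended <- y - trend
  # Average the detrended values season by season, then centre them
  season_id <- (seq_len(n) - 1) %% m + 1
  s <- tapply(detrended, season_id, mean, na.rm = TRUE)
  s <- s - mean(s)
  seasonal <- as.numeric(s[season_id])
  list(trend = trend, seasonal = seasonal, random = y - trend - seasonal)
}

# Made-up quarterly series: linear trend plus a fixed seasonal pattern summing to zero
time_idx <- 1:24
pattern <- c(2, -40, -30, 68)
y <- 100 + 0.5 * time_idx + rep(pattern, times = 6)

res <- classical_additive(y, m = 4)
res$seasonal[1:4]  # recovers the seasonal pattern 2, -40, -30, 68
```

The first and last two trend values are NA because the centred moving average needs two observations on either side, which is also why the components() output for the beer data starts with NA in the trend column.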

Here is an additive decomposition of the example data above.
Use the type = "additive" option of classical_decomposition().

dcmp <- aus_production %>% 
  model(classical_decomposition(Beer, type = "additive"))

The decomposed components can be retrieved with components().

dcmp %>% 
  components()

## # A dable: 218 x 7 [1Q]
## # Key:     .model [1]
## # :        Beer = trend + seasonal + random
##    .model                     Quarter  Beer trend seasonal  random season_adjust
##    <chr>                        <qtr> <dbl> <dbl>    <dbl>   <dbl>         <dbl>
##  1 "classical_decomposition(… 1956 Q1   284   NA      2.13  NA              282.
##  2 "classical_decomposition(… 1956 Q2   213   NA    -42.5   NA              256.
##  3 "classical_decomposition(… 1956 Q3   227  255.   -28.5    0.256          256.
##  4 "classical_decomposition(… 1956 Q4   308  254.    68.9  -15.3            239.
##  5 "classical_decomposition(… 1957 Q1   262  257.     2.13   2.49           260.
##  6 "classical_decomposition(… 1957 Q2   228  260    -42.5   10.5            271.
##  7 "classical_decomposition(… 1957 Q3   236  263.   -28.5    1.76           265.
##  8 "classical_decomposition(… 1957 Q4   320  265.    68.9  -13.5            251.
##  9 "classical_decomposition(… 1958 Q1   272  265.     2.13   4.49           270.
## 10 "classical_decomposition(… 1958 Q2   233  265.   -42.5   10.9            276.
## # … with 208 more rows

Passing these components to autoplot() produces the plot below.

dcmp %>% 
  components() %>% 
  autoplot() +
  labs(title = "Classical additive decomposition of Quarterly production of Beer in Australia")

Likewise, a multiplicative decomposition uses the type = "multiplicative" option of classical_decomposition().

Decompositions: STL decomposition

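STL stands for Seasonal and Trend decomposition using Loess, a robust method for decomposing time series.
- Loess here is local regression, a method for estimating nonlinear relationships, as opposed to the ordinary linear regression most of us know.
- The series is split into a seasonal component (S), a trend component (T), and a remainder component.
Compared with other decomposition methods it is more flexible about seasonality: the seasonal component may change over time, and the analyst controls how quickly it is allowed to change.
For more detail on STL decomposition, the linked reference is worth a look.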

An example follows.

aus_production %>% 
  model(STL(formula = Beer ~ trend(window = 4) + season(window = "periodic"), robust = TRUE)) %>% 
  components() %>% 
  autoplot()

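As the code above shows, the trend component can be estimated flexibly through the window option,
while window = "periodic" fixes the seasonal component so that it is identical across years.
See ?STL for the remaining options.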

Feature extraction and statistics

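features() extracts summary statistics, ACF-based features, and more.
The example uses the tourism data.
- The data record quarterly numbers of trips by travellers in Australia.
Simple statistics such as the mean and quantiles can be computed as follows.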

# Pass the target variable and the summary functions to features().
tourism %>% 
  features(
    .var = Trips, 
    features = list(avg = mean, quantile)
  )

## # A tibble: 304 x 9
##    Region      State         Purpose    avg    `0%`  `25%`   `50%`  `75%` `100%`
##    <chr>       <chr>         <chr>    <dbl>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
##  1 Adelaide    South Austra… Busine… 156.    68.7   134.   153.    177.   242.  
##  2 Adelaide    South Austra… Holiday 157.   108.    135.   154.    172.   224.  
##  3 Adelaide    South Austra… Other    56.6   25.9    43.9   53.8    62.5  107.  
##  4 Adelaide    South Austra… Visiti… 205.   137.    179.   206.    229.   270.  
##  5 Adelaide H… South Austra… Busine…   2.66   0       0      1.26    3.92  28.6 
##  6 Adelaide H… South Austra… Holiday  10.5    0       5.77   8.52   14.1   35.8 
##  7 Adelaide H… South Austra… Other     1.40   0       0      0.908   2.09   8.95
##  8 Adelaide H… South Austra… Visiti…  14.2    0.778   8.91  12.2    16.8   81.1 
##  9 Alice Spri… Northern Ter… Busine…  14.6    1.01    9.13  13.3    18.5   34.1 
## 10 Alice Spri… Northern Ter… Holiday  31.9    2.81   16.9   31.5    44.8   76.5 
## # … with 294 more rows

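ACF-related features come from feat_acf().
By default it returns the following values.
- acf1: first autocorrelation coefficient of the original series
- acf10: sum of squares of the first ten autocorrelation coefficients of the original series
- diff1_acf1: first autocorrelation coefficient of the first-differenced series
- diff1_acf10: sum of squares of the first ten autocorrelation coefficients of the first-differenced series
- diff2_acf1: first autocorrelation coefficient of the twice-differenced series
- diff2_acf10: sum of squares of the first ten autocorrelation coefficients of the twice-differenced series
- season_acf1: autocorrelation coefficient at the first seasonal lag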

tourism %>% 
  features(
    .var = Trips,
    features = feat_acf
  )

## # A tibble: 304 x 10
##    Region    State      Purpose     acf1 acf10 diff1_acf1 diff1_acf10 diff2_acf1
##    <chr>     <chr>      <chr>      <dbl> <dbl>      <dbl>       <dbl>      <dbl>
##  1 Adelaide  South Aus… Busine…  0.0333  0.131     -0.520       0.463     -0.676
##  2 Adelaide  South Aus… Holiday  0.0456  0.372     -0.343       0.614     -0.487
##  3 Adelaide  South Aus… Other    0.517   1.15      -0.409       0.383     -0.675
##  4 Adelaide  South Aus… Visiti…  0.0684  0.294     -0.394       0.452     -0.518
##  5 Adelaide… South Aus… Busine…  0.0709  0.134     -0.580       0.415     -0.750
##  6 Adelaide… South Aus… Holiday  0.131   0.313     -0.536       0.500     -0.716
##  7 Adelaide… South Aus… Other    0.261   0.330     -0.253       0.317     -0.457
##  8 Adelaide… South Aus… Visiti…  0.139   0.117     -0.472       0.239     -0.626
##  9 Alice Sp… Northern … Busine…  0.217   0.367     -0.500       0.381     -0.658
## 10 Alice Sp… Northern … Holiday -0.00660 2.11      -0.153       2.11      -0.274
## # … with 294 more rows, and 2 more variables: diff2_acf10 <dbl>,
## #   season_acf1 <dbl>

In the result above, season_acf1 is the ACF at the first seasonal lag. The data are quarterly, so the seasonal period is 4.
In other words, this value is the ACF of the original series at lag 4.

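feat_stl() returns features based on an STL decomposition.
Alongside the strength of trend and seasonality, it also outputs the following.
- seasonal_peak_year: the timing of the seasonal peak, i.e. the period (quarter, etc.) with the largest seasonal component
- seasonal_trough_year: the timing of the seasonal trough, i.e. the period with the smallest seasonal component
- spikiness: the variance of the leave-one-out variances of the remainder component; loosely, how prone the remainder is to spikes
- linearity: linearity of the trend component of the STL decomposition
- curvature: curvature of the trend component of the STL decomposition
- stl_e_acf1: first autocorrelation coefficient of the remainder series (what is left after removing trend and seasonality)
- stl_e_acf10: sum of squares of the first ten autocorrelation coefficients of the remainder series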
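For reference, the two strength measures are usually defined as follows for a decomposition \(y_{t} = T_{t} + S_{t} + R_{t}\) (trend, seasonal, remainder). The notation is introduced here for illustration and is not taken from the post.

\[F_{T} = \max\bigg(0,\ 1 - \frac{\text{Var}(R_{t})}{\text{Var}(T_{t} + R_{t})}\bigg), \qquad F_{S} = \max\bigg(0,\ 1 - \frac{\text{Var}(R_{t})}{\text{Var}(S_{t} + R_{t})}\bigg)\]

Both measures lie between 0 and 1. A value near 1 means the remainder varies little relative to the trend (or seasonal) component, so that component dominates; a value near 0 means the component is weak compared with the noise.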

tourism %>% 
  features(
    .var = Trips,
    features = feat_stl
  )

## # A tibble: 304 x 12
##    Region    State     Purpose trend_strength seasonal_strengt… seasonal_peak_y…
##    <chr>     <chr>     <chr>            <dbl>             <dbl>            <dbl>
##  1 Adelaide  South Au… Busine…          0.464             0.407                3
##  2 Adelaide  South Au… Holiday          0.554             0.619                1
##  3 Adelaide  South Au… Other            0.746             0.202                2
##  4 Adelaide  South Au… Visiti…          0.435             0.452                1
##  5 Adelaide… South Au… Busine…          0.464             0.179                3
##  6 Adelaide… South Au… Holiday          0.528             0.296                2
##  7 Adelaide… South Au… Other            0.593             0.404                2
##  8 Adelaide… South Au… Visiti…          0.488             0.254                0
##  9 Alice Sp… Northern… Busine…          0.534             0.251                0
## 10 Alice Sp… Northern… Holiday          0.381             0.832                3
## # … with 294 more rows, and 6 more variables: seasonal_trough_year <dbl>,
## #   spikiness <dbl>, linearity <dbl>, curvature <dbl>, stl_e_acf1 <dbl>,
## #   stl_e_acf10 <dbl>

These features can be plotted, as below, to see which types of series are the most trended (x-axis) and the most seasonal (y-axis).

tourism %>% 
  features(.var = Trips, features = feat_stl) %>% 
  ggplot(aes(x = trend_strength, y = seasonal_strength_year, color = Purpose)) +
  geom_point() +
  facet_wrap(~ State)

Travel for holidays shows the strongest seasonal pattern.
Trend is strongest in Western Australia.

tourism %>% 
  features(.var = Trips, features = feat_stl) %>% 
  filter(seasonal_strength_year == max(seasonal_strength_year)) %>% 
  select(Region, State, Purpose)

## # A tibble: 1 x 3
##   Region          State           Purpose
##   <chr>           <chr>           <chr>  
## 1 Snowy Mountains New South Wales Holiday

tourism %>% 
  features(.var = Trips, features = feat_stl) %>% 
  filter(seasonal_strength_year == max(seasonal_strength_year)) %>% 
  select(Region, State, Purpose) %>% 
  left_join(tourism, by = c("Region", "State", "Purpose")) %>% 
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line(size = 0.7)

[R] 3. tsibbledata

택2 — Wed, 28 Jul 2021 11:20:50 +0900

`tsibbledata`

The tsibbledata package provides datasets that work well as time series examples.
The example shown on its GitHub page is the olympic_running data.
The data record the fastest times by sex in Olympic running events.

olympic_running

## # A tsibble: 312 x 4 [4Y]
## # Key:       Length, Sex [14]
##     Year Length Sex    Time
##    <int>  <int> <chr> <dbl>
##  1  1896    100 men    12  
##  2  1900    100 men    11  
##  3  1904    100 men    11  
##  4  1908    100 men    10.8
##  5  1912    100 men    10.8
##  6  1916    100 men    NA  
##  7  1920    100 men    10.8
##  8  1924    100 men    10.6
##  9  1928    100 men    10.8
## 10  1932    100 men    10.3
## # … with 302 more rows

The example below plots the fastest running times by sex.
- Note that 1916, 1940, and 1944 are missing because of the World Wars.

olympic_running %>% 
  ggplot(aes(x = Year, y = Time, color = Sex, group = Sex)) +
  geom_line(size = 0.7) +
  geom_point(size = 1.2) +
  facet_wrap(. ~ Length, scales = "free_y", nrow = 2) +
  labs(x = "Year", y = "Running time (seconds)") +
  scale_color_brewer(palette = "Dark2") + 
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank())

[R] 2. tsibble

택2 — Tue, 27 Jul 2021 23:10:45 +0900

`tsibble()`

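A tsibble object follows these basic principles.
- index: the time at which each observation was recorded, ordered from past to present
- key: the set of variables that define the units observed over time
- Each observation must be uniquely identified by its index and key.
- Each observation must be recorded at regular intervals.
So, to convert data to the tsibble format, the key and index have to be specified.
The example below uses the weather data from the nycflights13 package.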

weather_sample <- nycflights13::weather %>% 
  select(origin, time_hour, temp, humid, precip)

weather_sample

## # A tibble: 26,115 x 5
##    origin time_hour            temp humid precip
##    <chr>  <dttm>              <dbl> <dbl>  <dbl>
##  1 EWR    2013-01-01 01:00:00  39.0  59.4      0
##  2 EWR    2013-01-01 02:00:00  39.0  61.6      0
##  3 EWR    2013-01-01 03:00:00  39.0  64.4      0
##  4 EWR    2013-01-01 04:00:00  39.9  62.2      0
##  5 EWR    2013-01-01 05:00:00  39.0  64.4      0
##  6 EWR    2013-01-01 06:00:00  37.9  67.2      0
##  7 EWR    2013-01-01 07:00:00  39.0  64.4      0
##  8 EWR    2013-01-01 08:00:00  39.9  62.2      0
##  9 EWR    2013-01-01 09:00:00  39.9  62.2      0
## 10 EWR    2013-01-01 10:00:00  41    59.6      0
## # … with 26,105 more rows

The origin variable is used as the key and time_hour as the index.
- For a single time series, as opposed to multiple series, the key can be omitted.
- The example data are multiple time series, one per origin airport (origin).

weather_tsbl <- weather_sample %>% 
  as_tsibble(key = origin, index = time_hour)

weather_tsbl

## # A tsibble: 26,115 x 5 [1h] <America/New_York>
## # Key:       origin [3]
##    origin time_hour            temp humid precip
##    <chr>  <dttm>              <dbl> <dbl>  <dbl>
##  1 EWR    2013-01-01 01:00:00  39.0  59.4      0
##  2 EWR    2013-01-01 02:00:00  39.0  61.6      0
##  3 EWR    2013-01-01 03:00:00  39.0  64.4      0
##  4 EWR    2013-01-01 04:00:00  39.9  62.2      0
##  5 EWR    2013-01-01 05:00:00  39.0  64.4      0
##  6 EWR    2013-01-01 06:00:00  37.9  67.2      0
##  7 EWR    2013-01-01 07:00:00  39.0  64.4      0
##  8 EWR    2013-01-01 08:00:00  39.9  62.2      0
##  9 EWR    2013-01-01 09:00:00  39.9  62.2      0
## 10 EWR    2013-01-01 10:00:00  41    59.6      0
## # … with 26,105 more rows

The interval of the index is computed from components that sort numerically, ranging from year down to nanosecond.
The table below lists the index classes used in tsibble.

Interval	Class
Annual	`integer`, `double`
Quarterly	`yearquarter`
Monthly	`yearmonth`
Weekly	`yearweek`
Daily	`Date`, `difftime`
Subdaily	`POSIXt`, `difftime`, `hms`
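
As a small sketch of how some of these classes are created, the tsibble helpers below convert one arbitrary date into monthly, quarterly, and weekly index values. The date itself is just an example.

```r
library(tsibble)

d <- as.Date("2021-08-06")

ym <- yearmonth(d)    # monthly index
yq <- yearquarter(d)  # quarterly index
yw <- yearweek(d)     # weekly index

class(ym)
class(yq)
class(yw)
```

Using one of these classes as the index is what lets tsibble report the interval as [1M], [1Q], or [1W] in the printed header.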

So far we have assumed a regularly spaced index, but a tsibble can also hold irregularly spaced data.
Set the argument regular = FALSE in tsibble() or as_tsibble(). (The default is TRUE.)

# make_datetime() is a lubridate function that builds a timestamp in the given time zone.
nycflights13::flights %>% 
  mutate(sched_dep_datetime = make_datetime(year, month, day, hour, minute, tz = "America/New_York")) %>% 
  select(carrier, flight, sched_dep_datetime, air_time, distance) %>% 
  as_tsibble(
    key = c("carrier", "flight"),
    index = sched_dep_datetime,
    regular = FALSE
  )

## # A tsibble: 336,776 x 5 [!] <America/New_York>
## # Key:       carrier, flight [5,725]
##    carrier flight sched_dep_datetime  air_time distance
##    <chr>    <int> <dttm>                 <dbl>    <dbl>
##  1 9E        2900 2013-11-03 15:40:00      113      765
##  2 9E        2900 2013-11-04 15:40:00      117      765
##  3 9E        2900 2013-11-05 15:40:00      120      765
##  4 9E        2900 2013-11-06 15:40:00      118      765
##  5 9E        2900 2013-11-07 15:40:00      131      765
##  6 9E        2900 2013-11-08 15:40:00      114      765
##  7 9E        2900 2013-11-09 15:40:00      121      765
##  8 9E        2900 2013-11-10 15:40:00      115      765
##  9 9E        2900 2013-11-11 15:40:00      119      765
## 10 9E        2900 2013-11-12 15:40:00      118      765
## # … with 336,766 more rows

Irregular tsibble objects like this are marked with [!] in the printed header.
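
The same property can be checked in code rather than read off the header. A minimal sketch with a made-up event log (the timestamps and values are hypothetical), using is_regular() from tsibble:

```r
library(tsibble)

# Hypothetical event log with unevenly spaced timestamps
events <- tsibble(
  time = as.POSIXct(
    c("2021-01-01 09:15:00", "2021-01-01 17:40:00", "2021-01-03 12:30:00"),
    tz = "UTC"
  ),
  value = c(1, 2, 3),
  index = time,
  regular = FALSE
)

is_regular(events)  # FALSE, because regular = FALSE was requested
```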

Making implicit missing values explicit: `fill_gaps()`

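Time series data can contain implicit missing values, that is, time points that are simply absent from the data rather than recorded as NA.
The fill_gaps() function turns such implicit gaps into explicit missing values or fills them with a chosen value.
Let's look at an example. (reference)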

harvest <- tsibble(
  year = c(2010, 2011, 2013, 2011, 2012, 2014),
  fruit = rep(c("kiwi", "cherry"), each = 3),
  kilo = sample(1:10, size = 6),
  key = fruit, 
  index = year
)

harvest

## # A tsibble: 6 x 3 [1Y]
## # Key:       fruit [2]
##    year fruit   kilo
##   <dbl> <chr>  <int>
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2014 cherry     6
## 4  2010 kiwi       7
## 5  2011 kiwi       2
## 6  2013 kiwi       4

In this data, cherry has no recorded harvest for 2010 and 2013 (and kiwi none for 2012 and 2014), so those years do not appear as rows at all.
The code below makes them explicit as NA.

harvest %>% 
  fill_gaps(.full = TRUE)

## # A tsibble: 10 x 3 [1Y]
## # Key:       fruit [2]
##     year fruit   kilo
##    <dbl> <chr>  <int>
##  1  2010 cherry    NA
##  2  2011 cherry     8
##  3  2012 cherry     9
##  4  2013 cherry    NA
##  5  2014 cherry     6
##  6  2010 kiwi       7
##  7  2011 kiwi       2
##  8  2012 kiwi      NA
##  9  2013 kiwi       4
## 10  2014 kiwi      NA

With .full = FALSE, gaps are filled only within each key's own time span, so years before a key's first observation or after its last one are not added. (FALSE is the default.)

harvest %>% 
  fill_gaps(.full = FALSE)

## # A tsibble: 8 x 3 [1Y]
## # Key:       fruit [2]
##    year fruit   kilo
##   <dbl> <chr>  <int>
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2013 cherry    NA
## 4  2014 cherry     6
## 5  2010 kiwi       7
## 6  2011 kiwi       2
## 7  2012 kiwi      NA
## 8  2013 kiwi       4
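
Before filling anything, it can help to inspect where the gaps are. The sketch below rebuilds the harvest data with fixed kilo values (copied from the output above instead of sample(), so the result is reproducible) and uses has_gaps(), count_gaps(), and scan_gaps() from tsibble. The name harvest_fixed is introduced here only for illustration.

```r
library(tsibble)

# Same structure as harvest, with fixed kilo values for reproducibility
harvest_fixed <- tsibble(
  year = c(2010, 2011, 2013, 2011, 2012, 2014),
  fruit = rep(c("kiwi", "cherry"), each = 3),
  kilo = c(7, 2, 4, 8, 9, 6),
  key = fruit,
  index = year
)

# Does each key have implicit gaps?
has_gaps(harvest_fixed)

# Where does each gap start and end, and how long is it?
gaps <- count_gaps(harvest_fixed)
gaps

# The missing key-index pairs themselves
missing_rows <- scan_gaps(harvest_fixed)
missing_rows
```

Like fill_gaps(), these functions take a .full argument, so the same distinction between each key's own span and the full span applies.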

Gaps can also be filled with a specific value.

harvest %>% 
  fill_gaps(kilo = 0)

## # A tsibble: 8 x 3 [1Y]
## # Key:       fruit [2]
##    year fruit   kilo
##   <dbl> <chr>  <dbl>
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2013 cherry     0
## 4  2014 cherry     6
## 5  2010 kiwi       7
## 6  2011 kiwi       2
## 7  2012 kiwi       0
## 8  2013 kiwi       4

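Functions of the existing values can be used as well. Without grouping, the function is evaluated over the whole column, so sum(kilo) here is the total across both fruits (36).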

harvest %>% 
  fill_gaps(kilo = sum(kilo))

## # A tsibble: 8 x 3 [1Y]
## # Key:       fruit [2]
##    year fruit   kilo
##   <dbl> <chr>  <int>
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2013 cherry    36
## 4  2014 cherry     6
## 5  2010 kiwi       7
## 6  2011 kiwi       2
## 7  2012 kiwi      36
## 8  2013 kiwi       4

# Use group_by_key() to fill with the median of each key (the same idea as group_by() in the tidyverse)
harvest %>% 
  group_by_key() %>% 
  fill_gaps(kilo = median(kilo, na.rm = TRUE))

## # A tsibble: 8 x 3 [1Y]
## # Key:       fruit [2]
## # Groups:    fruit [2]
##    year fruit   kilo
##   <dbl> <chr>  <int>
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2013 cherry     8
## 4  2014 cherry     6
## 5  2010 kiwi       7
## 6  2011 kiwi       2
## 7  2012 kiwi       4
## 8  2013 kiwi       4

Combining fill_gaps() with fill() from the tidyr library replaces the implicit missing values with the previous or next observed value.

harvest

## # A tsibble: 6 x 3 [1Y]
## # Key:       fruit [2]
##    year fruit   kilo
##   <dbl> <chr>  <int>
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2014 cherry     6
## 4  2010 kiwi       7
## 5  2011 kiwi       2
## 6  2013 kiwi       4

# Fill with the previous observation
harvest %>% 
  group_by_key() %>% 
  fill_gaps() %>% 
  tidyr::fill(kilo, .direction = "down")

## # A tsibble: 8 x 3 [1Y]
## # Key:       fruit [2]
## # Groups:    fruit [2]
##    year fruit   kilo
##   <dbl> <chr>  <int>
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2013 cherry     9
## 4  2014 cherry     6
## 5  2010 kiwi       7
## 6  2011 kiwi       2
## 7  2012 kiwi       2
## 8  2013 kiwi       4

# Fill with the next observation
harvest %>% 
  group_by_key() %>% 
  fill_gaps() %>% 
  tidyr::fill(kilo, .direction = "up")

## # A tsibble: 8 x 3 [1Y]
## # Key:       fruit [2]
## # Groups:    fruit [2]
##    year fruit   kilo
##   <dbl> <chr>  <int>
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2013 cherry     6
## 4  2014 cherry     6
## 5  2010 kiwi       7
## 6  2011 kiwi       2
## 7  2012 kiwi       4
## 8  2013 kiwi       4

# Apply the same approach to the weather data from the first example
weather_tsbl %>% 
  filter(origin == "EWR") %>% 
  fill_gaps(precip = 0) %>% 
  group_by_key() %>% 
  tidyr::fill(temp, humid, .direction = "down")

## # A tsibble: 8,730 x 5 [1h] <America/New_York>
## # Key:       origin [1]
## # Groups:    origin [1]
##    origin time_hour            temp humid precip
##    <chr>  <dttm>              <dbl> <dbl>  <dbl>
##  1 EWR    2013-01-01 01:00:00  39.0  59.4      0
##  2 EWR    2013-01-01 02:00:00  39.0  61.6      0
##  3 EWR    2013-01-01 03:00:00  39.0  64.4      0
##  4 EWR    2013-01-01 04:00:00  39.9  62.2      0
##  5 EWR    2013-01-01 05:00:00  39.0  64.4      0
##  6 EWR    2013-01-01 06:00:00  37.9  67.2      0
##  7 EWR    2013-01-01 07:00:00  39.0  64.4      0
##  8 EWR    2013-01-01 08:00:00  39.9  62.2      0
##  9 EWR    2013-01-01 09:00:00  39.9  62.2      0
## 10 EWR    2013-01-01 10:00:00  41    59.6      0
## # … with 8,720 more rows

Applying functions over index groups: `index_by()` + `summarise()`

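The index_by() function groups observations by a function of the index, and it works with lubridate-style date/time functions as well.
The example below computes the monthly average temperature and total precipitation.
As in a purrr map() formula, the index of the tsibble is referred to with a dot (.).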

weather_tsbl %>% 
  group_by_key() %>% 
  index_by(year_month = ~yearmonth(.)) %>% 
  summarise(
    avg_temp = mean(temp, na.rm = TRUE),
    total_precip = sum(precip, na.rm = TRUE)
  )

## # A tsibble: 36 x 4 [1M]
## # Key:       origin [3]
##    origin year_month avg_temp total_precip
##    <chr>       <mth>    <dbl>        <dbl>
##  1 EWR       2013  1     35.6         3.53
##  2 EWR       2013  2     34.3         3.83
##  3 EWR       2013  3     40.1         3   
##  4 EWR       2013  4     53.0         1.47
##  5 EWR       2013  5     63.3         5.44
##  6 EWR       2013  6     73.3         8.73
##  7 EWR       2013  7     80.7         3.74
##  8 EWR       2013  8     74.5         4.57
##  9 EWR       2013  9     67.3         1.54
## 10 EWR       2013 10     59.8         0.5 
## # … with 26 more rows

According to the documentation, this combination can also be used on an irregular index:
(This index_by() + summarise() combo can help with regularising a tsibble of irregular time space too.)
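
A minimal sketch of that idea, assuming a made-up irregular event log (the timestamps and values are hypothetical): aggregating to calendar days with index_by() + summarise() yields a regular daily tsibble.

```r
library(tsibble)
library(dplyr)

# Hypothetical irregular observations
events <- tsibble(
  time = as.POSIXct(
    c("2021-01-01 09:15:00", "2021-01-01 17:40:00",
      "2021-01-02 08:05:00", "2021-01-03 12:30:00"),
    tz = "UTC"
  ),
  value = c(1, 2, 3, 4),
  index = time,
  regular = FALSE
)

# Collapse the timestamps to calendar days and sum within each day
daily <- events %>%
  index_by(date = ~ as.Date(.)) %>%
  summarise(total = sum(value))

daily
is_regular(daily)
```

The new index is the date column created inside index_by(), so the result is a daily series even though the input timestamps were unevenly spaced.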

[R] 1. A brief introduction to fpp3

택2 — Tue, 27 Jul 2021 22:22:33 +0900

Introduction

https://tidyverts.org/
tidyverts is an ecosystem for analysing time series data with a tidy approach.
In R, loading fpp3 attaches the libraries that make up tidyverts.
- Alternatively, only the libraries you need can be installed separately with install.packages("...") or install_github("tidyverts/...") (from the remotes library).
- fpp3 is short for Forecasting: Principles and Practice, 3rd edition.

library(fpp3)

## ─ Attaching packages ────────────────────── fpp3 0.4.0 ─

## ✓ tibble      3.1.2      ✓ tsibble     1.0.1 
## ✓ dplyr       1.0.7      ✓ tsibbledata 0.3.0 
## ✓ tidyr       1.1.3      ✓ feasts      0.2.2 
## ✓ lubridate   1.7.10     ✓ fable       0.3.1 
## ✓ ggplot2     3.3.5

## ─ Conflicts ───────────────────────── fpp3_conflicts ─
## x lubridate::date()    masks base::date()
## x dplyr::filter()      masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x tsibble::setdiff()   masks base::setdiff()
## x tsibble::union()     masks base::union()

The list of attached packages shows a few libraries beyond those seen when loading tidyverse.

`lubridate`

A very useful library for working with date and time variables; it is often used alongside other tools for handling tidy data.

`tsibble`

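A library for organising time series data following a tidy approach. (GitHub)
Its tsibble() function creates an object of class tsibble,
and a tsibble object follows these basic principles.
- index: the observation times of the values, ordered from past to present
- key: the set of variables that define the observational units over time
- Each observation must be uniquely identified by its index and key.
- Observations must be recorded at regular intervals.
  (As mentioned later, this can be relaxed to some extent through a function option.)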
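
The uniqueness rule can be checked before conversion. The sketch below uses made-up sensor readings (all names and values are hypothetical) together with is_duplicated(), duplicates(), and are_duplicated() from tsibble.

```r
library(tsibble)
library(tibble)

# Hypothetical readings: sensor "a" is recorded twice on the same day
readings <- tibble(
  sensor = c("a", "a", "b", "b"),
  day = as.Date(c("2021-07-01", "2021-07-01", "2021-07-01", "2021-07-02")),
  value = c(1.2, 1.3, 0.7, 0.9)
)

# Is any key-index pair repeated?
has_dups <- is_duplicated(readings, key = sensor, index = day)
has_dups

# Which rows are involved?
duplicates(readings, key = sensor, index = day)

# Drop the later copy of each repeated pair, then convert
deduped <- readings[!are_duplicated(readings, key = sensor, index = day), ]
readings_tsbl <- as_tsibble(deduped, key = sensor, index = day)
```

Keeping the first copy is only one possible choice; which duplicate to keep (or how to combine them) depends on the data.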

`tsibbledata`

Like the familiar iris data, this library provides a variety of example datasets in tsibble format.
The olympic_running dataset seems to be one of the most commonly used examples.

`feasts`

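feasts stands for Feature Extraction And Statistics for Time Series (FEASTS).
It provides a range of functions needed for time series analysis.
- Time series decomposition, feature extraction, visualisation, and so on
It works with tsibble objects and is tightly integrated with the fable library.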
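
As a rough illustration of what feasts offers, the sketch below computes autocorrelations and an STL decomposition for the Beer series in aus_production (a tsibbledata dataset). The choice of series and of 8 lags is arbitrary.

```r
library(fpp3)

beer <- aus_production %>%
  select(Quarter, Beer)

# Autocorrelations for the first 8 lags
beer_acf <- beer %>%
  ACF(Beer, lag_max = 8)

# STL decomposition into trend, seasonal, and remainder components
beer_stl <- beer %>%
  model(stl = STL(Beer)) %>%
  components()

beer_acf
beer_stl
```

Both results are tidy tables, so they can be passed to autoplot() or to ordinary dplyr verbs.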

`fable`

A library providing commonly used univariate and multivariate time series forecasting models, such as ARIMA and exponential smoothing.
It supports estimating, comparing, combining, and forecasting with models.
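
A minimal sketch of the fable workflow, assuming the Beer series from aus_production as example data: two automatically selected models are fitted side by side and then forecast two years ahead. The model names ets and arima are just labels chosen here.

```r
library(fpp3)

# Fit an exponential smoothing model and an ARIMA model to the same series
fit <- aus_production %>%
  model(
    ets = ETS(Beer),
    arima = ARIMA(Beer)
  )

# Forecast 2 years (8 quarters) ahead with both models
fc <- fit %>%
  forecast(h = "2 years")

# In-sample accuracy measures for comparing the two fits
accuracy(fit)
fc
```

Because the fitted models live in a single table (a mable), comparing or combining them uses the same tidy verbs as the rest of the workflow.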

[R] 6. Topic modeling

택2 — Tue, 20 Jul 2021 20:05:56 +0900

6. Topic modeling

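Topic modeling is an unsupervised method for grouping text documents, similar to clustering.
Of the many topic models available, we look at the widely used LDA (Latent Dirichlet Allocation).
The required library is topicmodels; this post introduces how to work with the LDA objects it produces.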

library(topicmodels)

6. 1. Latent Dirichlet Allocation

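LDA is one of the most common algorithms for topic modeling.
This post skips the mathematical derivation and summarises only the two principles below.
- Every document is a mixture of topics.
  - Each document is assumed to contain words from several topics in particular proportions.
  - For example, document 1 might be 90% topic A and 10% topic B, while document 2 is 30% topic A and 70% topic B.
- Every topic is a mixture of words.
  - For example, suppose one topic is “politics” and another is “entertainment”.
  - The most common words in the “politics” topic might be “president”, “congress”, and “government”,
    while those in the “entertainment” topic might be “movie”, “TV”, and “actor”.
  - Importantly, words can be shared between topics.
  - A word like “budget” could appear equally in both topics.
- LDA is a mathematical method that estimates both mixtures under these two assumptions.
  - Note that it shares its abbreviation with linear discriminant analysis (LDA) from supervised learning, so take care not to confuse the two.
- It finds the mix of words associated with each topic while also determining the mix of topics that describes each document.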
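
The two mixtures can be made concrete with a toy calculation in base R. This is only an illustrative sketch of the generative idea, not how LDA() estimates anything; the vocabulary and all probabilities are made up.

```r
set.seed(20210720)

vocab <- c("president", "congress", "budget", "movie", "actor")

# Every topic is a mixture of words: one row of word probabilities per topic
beta <- rbind(
  politics      = c(0.40, 0.30, 0.20, 0.05, 0.05),
  entertainment = c(0.05, 0.05, 0.20, 0.40, 0.30)
)
colnames(beta) <- vocab

# Every document is a mixture of topics: 90% politics, 10% entertainment
gamma <- c(politics = 0.9, entertainment = 0.1)

# Word probabilities for this document are the topic-weighted average of beta
word_probs <- drop(gamma %*% beta)
word_probs

# Generate a 20-word document from those probabilities
doc_words <- sample(vocab, size = 20, replace = TRUE, prob = word_probs)
table(doc_words)
```

Here “budget” has the same probability in both topics, so its probability in the document stays at 0.2 whatever the topic proportions are. LDA works in the opposite direction: it starts from observed word counts and estimates beta and gamma.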

For the example we use the AssociatedPress data, a DocumentTermMatrix object.
- A collection of 2,246 news articles from an American news agency, published around 1988

data("AssociatedPress")

AssociatedPress

## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

The function to apply is LDA() from the topicmodels library.

# k is the number of topics to fit
ap_lda <- LDA(AssociatedPress, k = 2, control = list(seed = 20210720))
ap_lda

## A LDA_VEM topic model with 2 topics.

6. 1. 1. Word-topic probabilities

To obtain the per-topic-per-word probabilities, called \(\beta\) in the model, apply the tidy() function.

ap_topics <- ap_lda %>% 
  tidy(matrix = "beta")

ap_topics

## # A tibble: 20,946 x 3
##    topic term           beta
##    <int> <chr>         <dbl>
##  1     1 aaron      3.31e- 5
##  2     2 aaron      6.71e- 6
##  3     1 abandon    3.23e- 5
##  4     2 abandon    3.77e- 5
##  5     1 abandoned  1.11e- 4
##  6     2 abandoned  6.05e- 5
##  7     1 abandoning 2.24e- 5
##  8     2 abandoning 3.16e-16
##  9     1 abbott     9.99e- 7
## 10     2 abbott     4.61e- 5
## # … with 20,936 more rows

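The result is a table with one row per topic per term, each holding a single probability (beta).
In other words, beta is the probability of that term being generated from that topic.
- For example, the term “aaron” has a probability of about 3.31e-5 of being generated from topic 1 but only 6.71e-6 from topic 2.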

The slice_max() function can be used to find the most common terms within each topic.

ap_top10_terms <- ap_topics %>% 
  group_by(topic) %>% 
  slice_max(beta, n = 10) %>% 
  ungroup() %>% 
  arrange(topic, desc(beta))

ap_top10_terms

## # A tibble: 20 x 3
##    topic term          beta
##    <int> <chr>        <dbl>
##  1     1 i          0.00716
##  2     1 people     0.00504
##  3     1 two        0.00445
##  4     1 president  0.00426
##  5     1 police     0.00404
##  6     1 government 0.00402
##  7     1 soviet     0.00364
##  8     1 bush       0.00343
##  9     1 new        0.00338
## 10     1 years      0.00323
## 11     2 percent    0.0108 
## 12     2 million    0.00767
## 13     2 new        0.00661
## 14     2 year       0.00647
## 15     2 billion    0.00507
## 16     2 last       0.00404
## 17     2 company    0.00369
## 18     2 market     0.00358
## 19     2 federal    0.00345
## 20     2 years      0.00284

ap_top10_terms %>% 
  ggplot(aes(x = reorder_within(term, beta, within = topic), y = beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()

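The result gives a rough sense of the two topics.
Topic 1 features words such as “people”, “president”, “police”, and “government”, which suggests political news,
while topic 2 features “percent”, “million”, and “company”, which suggests business or financial news.
Some words, such as “new” and “years”, have a high probability in both topics (clusters).
This shows an advantage of topic modeling over the hard clustering we are used to: it allows soft clustering, in which topics can overlap in their words.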

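Next, we can look at the terms that differ most between the two topics.
The difference is measured with a log ratio, comparing the beta of one topic against the other (here log2 of topic2 / topic1).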

ap_topics_wider <- ap_topics %>% 
  mutate(topic = paste0("topic", topic)) %>% 
  pivot_wider(
    names_from = "topic",
    values_from = beta,
    values_fill = 0
  ) %>% 
  filter(topic1 > 0.001 | topic2 > 0.001) %>% 
  mutate(log_ratio = log2(topic2/topic1))

ap_topics_wider

## # A tibble: 209 x 4
##    term             topic1    topic2 log_ratio
##    <chr>             <dbl>     <dbl>     <dbl>
##  1 administration 1.13e- 3 0.000767     -0.562
##  2 agreement      6.29e- 4 0.00130       1.05 
##  3 aid            1.01e- 3 0.0000431    -4.54 
##  4 air            8.71e- 4 0.00134       0.628
##  5 american       1.67e- 3 0.00207       0.309
##  6 analysts       2.37e- 6 0.00116       8.94 
##  7 army           1.17e- 3 0.0000110    -6.74 
##  8 asked          1.41e- 3 0.000327     -2.11 
##  9 authorities    1.11e- 3 0.000138     -3.01 
## 10 average        7.39e-12 0.00144      27.5  
## # … with 199 more rows
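
A small base R sketch of why the log ratio is a convenient scale; the beta values below are made up for illustration.

```r
# Hypothetical beta values for three terms in two topics
topic1 <- c(aid = 0.004, stock = 0.001, years = 0.003)
topic2 <- c(aid = 0.001, stock = 0.004, years = 0.003)

log_ratio <- log2(topic2 / topic1)
log_ratio
```

A term that is four times more likely in topic 2 gets +2, one that is four times more likely in topic 1 gets -2, and a term that is equally likely in both gets 0, so the measure is symmetric around zero. Very large values in the output above, such as 27.5 for “average”, arise when one of the two betas is nearly zero.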

ap_topics_wider %>% 
  group_by(group = ifelse(log_ratio >= 0, "+", "-")) %>% 
  slice_max(abs(log_ratio), n = 10) %>% 
  ungroup() %>% 
  ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) +
  geom_bar(stat = "identity", width = 0.8) +
  coord_flip() +
  labs(x = NULL, y = "Log2 ratio of beta in topic2 / topic1")

The result shows that topic 2 is characterised by words such as “stock” and “dollar”, which again indicates that it relates to financial news.

6. 1. 2. Document-topic probabilities

To obtain the per-document-per-topic probabilities, called \(\gamma\) in the model, apply the tidy() function in the same way.

ap_documents <- ap_lda %>% 
  tidy(matrix = "gamma")

ap_documents

## # A tibble: 4,492 x 3
##    document topic  gamma
##       <int> <int>  <dbl>
##  1        1     1 0.999 
##  2        2     1 0.612 
##  3        3     1 0.959 
##  4        4     1 0.797 
##  5        5     1 0.997 
##  6        6     1 1.00  
##  7        7     1 0.193 
##  8        8     1 0.997 
##  9        9     1 0.0261
## 10       10     1 0.926 
## # … with 4,482 more rows

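The result is a table with one row per document per topic, each holding a single probability (gamma).
In other words, gamma is the estimated proportion of words in that document that are generated from that topic.
- For example, only about 61% of the words in document 2 are estimated to come from topic 1. (second row)
In the output above, nearly 100% of the words in document 6 are estimated to come from topic 1.
We can therefore check which words are most frequent in that document.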

AssociatedPress %>% 
  tidy() %>% 
  filter(document == 6) %>% 
  arrange(desc(count))

## # A tibble: 287 x 3
##    document term           count
##       <int> <chr>          <dbl>
##  1        6 noriega           16
##  2        6 panama            12
##  3        6 jackson            6
##  4        6 powell             6
##  5        6 administration     5
##  6        6 economic           5
##  7        6 general            5
##  8        6 i                  5
##  9        6 panamanian         5
## 10        6 american           4
## # … with 277 more rows
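
The gamma values can also be used to assign each document to its most probable topic. The sketch below uses a hand-built table rather than ap_documents: the topic 1 values for documents 1, 2, and 7 are copied from the output above, and the topic 2 values are filled in as 1 - gamma on the assumption that the two topics sum to 1 within a document.

```r
library(dplyr)

gamma_sample <- tibble(
  document = rep(c(1L, 2L, 7L), each = 2),
  topic = rep(1:2, times = 3),
  gamma = c(0.999, 0.001, 0.612, 0.388, 0.193, 0.807)
)

# Keep the topic with the highest gamma for each document
doc_topic <- gamma_sample %>%
  group_by(document) %>%
  slice_max(gamma, n = 1) %>%
  ungroup()

doc_topic
```

Documents 1 and 2 are assigned to topic 1 and document 7 to topic 2. A hard assignment like this discards the mixture information, so document 2 (61% / 39%) is treated the same as document 1.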

6. 2. Example: the great library heist

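I tried to reproduce this example exactly, but for some reason the files could not be loaded.
I will look into the cause and update this post once it is resolved.
I strongly recommend working through the example at the link below at least once.
Those who need it can refer to the link here.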

[R] 5. Converting to and from non-tidy formats

택2 — Mon, 19 Jul 2021 12:33:04 +0900

5. Converting to and from non-tidy formats

This chapter explains how to work with text data that is held not in the tidy text format but in non-tidy structures, such as the corpus objects used by the tm and quanteda libraries.

5. 1. Tidying a document-term matrix

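A document-term matrix (DTM) is one of the most common structures in text analysis.
It has the following form.
- Each row represents one document (e.g. a book or an article).
- Each column represents one term.
- Each value is typically the number of times that term appears in that document.
Because most document-term pairs never occur (their value is zero), DTMs are usually implemented as sparse matrices.
DTM objects cannot be used directly with tidy tools, but the tidytext library provides a function that converts them into tidy data frames.
- tidy(): DTM to tidy data (the generic comes from the broom library)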
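
To make the structure concrete, here is a base R sketch that builds a tiny DTM by hand from three made-up documents and measures how sparse it is. It uses a dense table purely for illustration; real DTM classes store only the non-zero entries.

```r
# Three hypothetical documents
docs <- c(d1 = "the cat sat", d2 = "the dog sat down", d3 = "a bird flew")

# Split into words and record which document each word came from
tokens <- strsplit(docs, " ")
long <- data.frame(
  document = rep(names(tokens), lengths(tokens)),
  term = unlist(tokens, use.names = FALSE)
)

# Rows are documents, columns are terms, values are counts
dtm <- table(long$document, long$term)
dtm

# Share of document-term pairs that are zero
sparsity <- mean(dtm == 0)
sparsity
```

Even with three short documents, 14 of the 24 cells are zero, and the share grows quickly as the vocabulary gets larger.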

5. 1. 1. Tidying DocumentTermMatrix objects

The most widely used DTM implementation in R is the DocumentTermMatrix class from the tm library.
As an example we use the Associated Press news article data from the topicmodels library.

library(tm)

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

data("AssociatedPress", package = "topicmodels")
AssociatedPress

## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

As shown above, this DTM contains 2,246 news articles and 10,473 terms, and it is 99% sparse, meaning that 99% of the document-term pairs are zero.
The Terms() function gives access to the terms in the documents.

terms <- Terms(AssociatedPress)

terms %>% 
  head(20)

##  [1] "aaron"      "abandon"    "abandoned"  "abandoning" "abbott"    
##  [6] "abboud"     "abc"        "abcs"       "abctvs"     "abdomen"   
## [11] "abducted"   "abduction"  "abductors"  "abdul"      "abide"     
## [16] "abilities"  "ability"    "ablaze"     "able"       "abm"

To convert it to the tidy data format, use the tidy() function.

ap_tidy <- AssociatedPress %>% 
  tidy()

ap_tidy

## # A tibble: 302,031 x 3
##    document term       count
##       <int> <chr>      <dbl>
##  1        1 adding         1
##  2        1 adult          2
##  3        1 ago            1
##  4        1 alcohol        1
##  5        1 allegedly      1
##  6        1 allen          1
##  7        1 apparently     2
##  8        1 appeared       1
##  9        1 arrested       1
## 10        1 assault        1
## # … with 302,021 more rows

The result is similar to what melt() from the reshape2 library produces for an ordinary matrix.
Note that document-term pairs whose value is zero in the sparse matrix do not appear in the tidy output.

With this result we can now carry out a sentiment analysis.

ap_sentiments <- ap_tidy %>% 
  inner_join(get_sentiments("bing"), by = c(term = "word"))

ap_sentiments

## # A tibble: 30,094 x 4
##    document term    count sentiment
##       <int> <chr>   <dbl> <chr>    
##  1        1 assault     1 negative 
##  2        1 complex     1 negative 
##  3        1 death       1 negative 
##  4        1 died        1 negative 
##  5        1 good        2 positive 
##  6        1 illness     1 negative 
##  7        1 killed      2 negative 
##  8        1 like        2 positive 
##  9        1 liked       1 positive 
## 10        1 miracle     1 positive 
## # … with 30,084 more rows

ap_sentiments %>% 
  count(sentiment, term, wt = count) %>% 
  ungroup() %>% 
  filter(n >= 200) %>% 
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>% 
  ggplot(aes(x = reorder(term, n), y = n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Contribution to sentiment")

The most common positive words are “like”, “work”, “support”, and “good”,
and the most common negative words include “killed” and “death”.

5. 1. 2. Tidying `dfm` objects

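The quanteda library provides another implementation of the document-term matrix: the dfm class, created with its dfm() function.
- dtm vs. dfm
- They differ only in the middle letter.
- The f in dfm stands for feature (document-feature matrix).
As an example we use the data_corpus_inaugural data from the quanteda library (a corpus of presidential inauguration speeches).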

library(quanteda)

## Package version: 3.0.0
## Unicode version: 13.0
## ICU version: 69.1

## Parallel computing: 16 of 16 threads used.

## See https://quanteda.io for tutorials and examples.

## 
## Attaching package: 'quanteda'

## The following object is masked from 'package:tm':
## 
##     stopwords

## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-

data("data_corpus_inaugural", package = "quanteda")

data_corpus_inaugural

## Corpus consisting of 59 documents and 4 docvars.
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
## 
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
## 
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
## 
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
## 
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
## 
## 1809-Madison :
## "Unwilling to depart from examples of the most revered author..."
## 
## [ reached max_ndoc ... 53 more documents ]

# quanteda 3 deprecates calling dfm() directly on a corpus, so tokenise first
inaug_dfm <- data_corpus_inaugural %>% 
  tokens() %>% 
  dfm(verbose = FALSE)

inaug_dfm

## Document-feature matrix of: 59 documents, 9,439 features (91.84% sparse) and 4 docvars.
##                  features
## docs              fellow-citizens  of the senate and house representatives :
##   1789-Washington               1  71 116      1  48     2               2 1
##   1793-Washington               0  11  13      0   2     0               0 1
##   1797-Adams                    3 140 163      1 130     0               2 0
##   1801-Jefferson                2 104 130      0  81     0               0 1
##   1805-Jefferson                0 101 143      0  93     0               0 0
##   1809-Madison                  1  69 104      0  43     0               0 0
##                  features
## docs              among vicissitudes
##   1789-Washington     1            1
##   1793-Washington     0            0
##   1797-Adams          4            0
##   1801-Jefferson      1            0
##   1805-Jefferson      7            0
##   1809-Madison        0            0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,429 more features ]

The tidy() function can be applied here as well.

inaug_tidy <- inaug_dfm %>% 
  tidy()

inaug_tidy

## # A tibble: 45,453 x 3
##    document        term            count
##    <chr>           <chr>           <dbl>
##  1 1789-Washington fellow-citizens     1
##  2 1797-Adams      fellow-citizens     3
##  3 1801-Jefferson  fellow-citizens     2
##  4 1809-Madison    fellow-citizens     1
##  5 1813-Madison    fellow-citizens     1
##  6 1817-Monroe     fellow-citizens     5
##  7 1821-Monroe     fellow-citizens     1
##  8 1841-Harrison   fellow-citizens    11
##  9 1845-Polk       fellow-citizens     1
## 10 1849-Taylor     fellow-citizens     1
## # … with 45,443 more rows

From this we can compute TF-IDF values with the bind_tf_idf() function and visualise them.

inaug_tf_idf <- inaug_tidy %>% 
  bind_tf_idf(
    term = term,
    document = document,
    n = count
  )

inaug_tf_idf

## # A tibble: 45,453 x 6
##    document        term            count       tf   idf   tf_idf
##    <chr>           <chr>           <dbl>    <dbl> <dbl>    <dbl>
##  1 1789-Washington fellow-citizens     1 0.000651  1.13 0.000737
##  2 1797-Adams      fellow-citizens     3 0.00116   1.13 0.00132 
##  3 1801-Jefferson  fellow-citizens     2 0.00104   1.13 0.00118 
##  4 1809-Madison    fellow-citizens     1 0.000793  1.13 0.000899
##  5 1813-Madison    fellow-citizens     1 0.000768  1.13 0.000870
##  6 1817-Monroe     fellow-citizens     5 0.00136   1.13 0.00154 
##  7 1821-Monroe     fellow-citizens     1 0.000205  1.13 0.000232
##  8 1841-Harrison   fellow-citizens    11 0.00121   1.13 0.00137 
##  9 1845-Polk       fellow-citizens     1 0.000193  1.13 0.000218
## 10 1849-Taylor     fellow-citizens     1 0.000849  1.13 0.000962
## # … with 45,443 more rows
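
To see what these columns mean, the sketch below computes the same quantities by hand with dplyr on a made-up count table: tf is a term's share of the words in its document, and idf is the natural log of the number of documents divided by the number of documents containing the term. The data and the names counts, n_docs, and tf_idf_manual are hypothetical.

```r
library(dplyr)

# Hypothetical term counts for three documents
counts <- tibble(
  document = c("d1", "d1", "d2", "d2", "d3"),
  term = c("union", "freedom", "union", "god", "union"),
  n = c(2, 2, 1, 3, 5)
)

n_docs <- n_distinct(counts$document)

tf_idf_manual <- counts %>%
  group_by(document) %>%
  mutate(tf = n / sum(n)) %>%                          # share of the document's words
  group_by(term) %>%
  mutate(idf = log(n_docs / n_distinct(document))) %>% # rarer terms get a larger idf
  ungroup() %>%
  mutate(tf_idf = tf * idf)

tf_idf_manual
```

“union” appears in every document, so its idf and tf-idf are 0, while “freedom” appears in only one and gets idf = log(3). Under the same definition, the idf of 1.13 shown above for “fellow-citizens” would correspond to the term appearing in 19 of the 59 speeches, since log(59 / 19) is about 1.13.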

inaug_tf_idf %>% 
  filter(document %in% c("1861-Lincoln", "1933-Roosevelt", "1961-Kennedy", "2009-Obama")) %>% 
  group_by(document) %>% 
  slice_max(tf_idf, n = 10) %>% 
  ungroup() %>% 
  ggplot(aes(x = reorder(term, tf_idf), y = tf_idf, fill = document)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ document, ncol = 2, scales = "free")

As another visualisation example, we can extract the year from each document name and compute the total number of words in each year.

year_term_counts <- inaug_tidy %>% 
  extract(
    col = document,
    into = "year",
    regex = "(\\d+)",
    convert = TRUE
  ) %>% 
  complete(year, term, fill = list(count = 0)) %>% # add a zero count for terms that do not appear in a document
  group_by(year) %>% 
  mutate(total_count = sum(count)) %>% 
  ungroup()

year_term_counts

## # A tibble: 556,901 x 4
##     year term  count total_count
##    <int> <chr> <dbl>       <dbl>
##  1  1789 "-"       1        1537
##  2  1789 ","      70        1537
##  3  1789 ";"       8        1537
##  4  1789 ":"       1        1537
##  5  1789 "!"       0        1537
##  6  1789 "?"       0        1537
##  7  1789 "."      23        1537
##  8  1789 "…"       0        1537
##  9  1789 "'"       0        1537
## 10  1789 "\""      2        1537
## # … with 556,891 more rows

We can then pick a few terms and see how their frequencies changed over time.

year_term_counts %>% 
  filter(term %in% c("god", "america", "foreign", "union", "constitution", "freedom")) %>% 
  mutate(ratio = count/total_count) %>% 
  ggplot(aes(x = year, y = ratio)) +
  geom_point(size = 1.2) +
  geom_smooth(formula = y ~ x, method = "loess") +
  facet_wrap(~ term, scales = "free_y") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "% frequency of word in inaugural address")

5. 2. Casting tidy text data into a matrix

Some functions require a DTM rather than a tidy data frame as input.
The tidytext library also provides functions that convert from the tidy format to a DTM.
One of them is cast_dtm().
Let's use the ap_tidy object that we converted from a DTM earlier.

ap_tidy

## # A tibble: 302,031 x 3
##    document term       count
##       <int> <chr>      <dbl>
##  1        1 adding         1
##  2        1 adult          2
##  3        1 ago            1
##  4        1 alcohol        1
##  5        1 allegedly      1
##  6        1 allen          1
##  7        1 apparently     2
##  8        1 appeared       1
##  9        1 arrested       1
## 10        1 assault        1
## # … with 302,021 more rows

ap_dtm <- ap_tidy %>% 
  cast_dtm(
    term = term,
    document = document,
    value = count
  )

ap_dtm

## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

Similarly, the tidy data can be cast into a dfm object for the quanteda library with cast_dfm().

ap_dfm <- ap_tidy %>% 
  cast_dfm(
    term = term,
    document = document,
    value = count
  )

ap_dfm

## Document-feature matrix of: 2,246 documents, 10,473 features (98.72% sparse) and 0 docvars.
##     features
## docs adding adult ago alcohol allegedly allen apparently appeared arrested
##    1      1     2   1       1         1     1          2        1        1
##    2      0     0   0       0         0     0          0        1        0
##    3      0     0   1       0         0     0          0        1        0
##    4      0     0   3       0         0     0          0        0        0
##    5      0     0   0       0         0     0          0        0        0
##    6      0     0   2       0         0     0          0        0        0
##     features
## docs assault
##    1       1
##    2       0
##    3       0
##    4       0
##    5       0
##    6       0
## [ reached max_ndoc ... 2,240 more documents, reached max_nfeat ... 10,463 more features ]

With this kind of conversion, the Jane Austen books used as examples in earlier chapters can also be turned into a DTM object.

library(janeaustenr)

austen_dtm <- austen_books() %>% 
  unnest_tokens(input = text, output = "word") %>% 
  count(book, word) %>% 
  cast_dtm(
    term = word,
    document = book,
    value = n
  )

austen_dtm

## <<DocumentTermMatrix (documents: 6, terms: 14520)>>
## Non-/sparse entries: 40379/46741
## Sparsity           : 54%
## Maximal term length: 19
## Weighting          : term frequency (tf)
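
When neither tm nor quanteda is needed, tidytext also offers cast_sparse(), which casts tidy counts into a plain sparse matrix from the Matrix library. A minimal sketch with made-up counts (the table word_counts is hypothetical):

```r
library(tidytext)
library(tibble)

# Hypothetical tidy counts: one row per document-term pair
word_counts <- tibble(
  document = c("d1", "d1", "d2"),
  term = c("cat", "sat", "cat"),
  n = c(2, 1, 3)
)

# Rows become documents, columns become terms, values are the counts
m <- cast_sparse(word_counts, document, term, n)

dim(m)
m
```

The pair (d2, sat) was never listed in the tidy data, so it is simply stored as an implicit zero in the resulting matrix.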

5. 3. Tidying corpus objects with metadata

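A corpus object stores a collection of documents before tokenisation.
It stores the text together with metadata, which can include each document's unique ID, date/time, title, and so on.
Let's look at an example (the acq data from the tm library, containing 50 news articles).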

data("acq")

acq

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 50

class(acq)

## [1] "VCorpus" "Corpus"

As shown below, each document in a corpus object contains both text and metadata.

acq[[1]]

## <<PlainTextDocument>>
## Metadata:  15
## Content:  chars: 1287
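
The two parts can be pulled out separately. A short sketch using the meta() and content() generics from NLP (attached together with tm); the NLP:: prefix is spelled out because quanteda masks meta() when it is attached, as the startup messages earlier show.

```r
library(tm)

data("acq")

first_doc <- acq[[1]]

# Metadata fields are accessed by tag
doc_id <- NLP::meta(first_doc, "id")
doc_id
NLP::meta(first_doc, "heading")

# The raw text of the document (first 80 characters)
substr(paste(NLP::content(first_doc), collapse = " "), 1, 80)
```

These metadata tags are the same fields that become columns when the corpus is converted with tidy().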

This is a flexible way to store text data, but it is not suited to processing with tidy tools.
We therefore use the tidy() function to convert it to the tidy data format before analysis.

acq_tidy <- acq %>% 
  tidy()

acq_tidy

## # A tibble: 50 x 16
##    author  datetimestamp       description heading  id    language origin topics
##    <chr>   <dttm>              <chr>       <chr>    <chr> <chr>    <chr>  <chr> 
##  1 <NA>    1987-02-26 15:18:06 ""          COMPUTE… 10    en       Reute… YES   
##  2 <NA>    1987-02-26 15:19:15 ""          OHIO MA… 12    en       Reute… YES   
##  3 <NA>    1987-02-26 15:49:56 ""          MCLEAN'… 44    en       Reute… YES   
##  4 By Cal… 1987-02-26 15:51:17 ""          CHEMLAW… 45    en       Reute… YES   
##  5 <NA>    1987-02-26 16:08:33 ""          <COFAB … 68    en       Reute… YES   
##  6 <NA>    1987-02-26 16:32:37 ""          INVESTM… 96    en       Reute… YES   
##  7 By Pat… 1987-02-26 16:43:13 ""          AMERICA… 110   en       Reute… YES   
##  8 <NA>    1987-02-26 16:59:25 ""          HONG KO… 125   en       Reute… YES   
##  9 <NA>    1987-02-26 17:01:28 ""          LIEBERT… 128   en       Reute… YES   
## 10 <NA>    1987-02-26 17:08:27 ""          GULF AP… 134   en       Reute… YES   
## # … with 40 more rows, and 8 more variables: lewissplit <chr>, cgisplit <chr>,
## #   oldid <chr>, places <named list>, people <lgl>, orgs <lgl>,
## #   exchanges <lgl>, text <chr>

The text can then be tokenised into words with the unnest_tokens() function.

acq_tokens <- acq_tidy %>% 
  select(-places) %>% 
  unnest_tokens(
    input = text,
    output = "word"
  ) %>% 
  anti_join(stop_words, by = "word")

## Warning: Outer names are only allowed for unnamed scalar atomic inputs

# most common words
acq_tokens %>% 
  count(word, sort = TRUE)

## # A tibble: 1,566 x 2
##    word         n
##    <chr>    <int>
##  1 dlrs       100
##  2 pct         70
##  3 mln         65
##  4 company     63
##  5 shares      52
##  6 reuter      50
##  7 stock       46
##  8 offer       34
##  9 share       34
## 10 american    28
## # … with 1,556 more rows

# tf-idf
acq_tokens %>% 
  count(id, word) %>% 
  bind_tf_idf(
    term = word,
    document = id,
    n = n
  ) %>% 
  arrange(desc(tf_idf))

## # A tibble: 2,853 x 6
##    id    word         n     tf   idf tf_idf
##    <chr> <chr>    <int>  <dbl> <dbl>  <dbl>
##  1 186   groupe       2 0.133   3.91  0.522
##  2 128   liebert      3 0.130   3.91  0.510
##  3 474   esselte      5 0.109   3.91  0.425
##  4 371   burdett      6 0.103   3.91  0.405
##  5 442   hazleton     4 0.103   3.91  0.401
##  6 199   circuit      5 0.102   3.91  0.399
##  7 162   suffield     2 0.1     3.91  0.391
##  8 498   west         3 0.1     3.91  0.391
##  9 441   rmj          8 0.121   3.22  0.390
## 10 467   nursery      3 0.0968  3.91  0.379
## # … with 2,843 more rows

[R] 4. Relationships between words: n-grams and correlations

택2 — Sun, 18 Jul 2021 19:22:40 +0900

4. Relationships between words: n-grams and correlations

4. 1. Tokenizing by n-gram

So far we have used the unnest_tokens() function to tokenise by word or by sentence,
and such tokens are useful for sentiment and frequency analyses.
The same function can also tokenise text into consecutive sequences of words, called n-grams.
By seeing how often one word is followed by another, we can examine the relationships between them.
It is simple: pass token = "ngrams" and n = 2 (the number of consecutive words) to unnest_tokens().

library(janeaustenr)

austen_bigrams <- austen_books() %>% 
  unnest_tokens(input = text, output = "bigram", token = "ngrams", n = 2)

austen_bigrams

## # A tibble: 675,025 x 2
##    book                bigram         
##    <fct>               <chr>          
##  1 Sense & Sensibility sense and      
##  2 Sense & Sensibility and sensibility
##  3 Sense & Sensibility <NA>           
##  4 Sense & Sensibility by jane        
##  5 Sense & Sensibility jane austen    
##  6 Sense & Sensibility <NA>           
##  7 Sense & Sensibility <NA>           
##  8 Sense & Sensibility <NA>           
##  9 Sense & Sensibility <NA>           
## 10 Sense & Sensibility <NA>           
## # … with 675,015 more rows
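
To make the overlapping nature of bigrams concrete, here is a minimal base R sketch that builds them by hand from one hypothetical line of text. It is not how unnest_tokens() is implemented; it only illustrates the sliding window of width 2.

```r
# Hypothetical one-line input; the real data has one row per line of a book.
line <- "Sense and Sensibility by Jane Austen"
words <- strsplit(tolower(line), " ", fixed = TRUE)[[1]]

# Pair each word with the one that follows it (a sliding window of width 2).
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams
```

A line of k words yields k - 1 bigrams, and every word except the first and last appears in two of them. In the output above, the pairs do not seem to span rows, and rows with fewer than two words (blank lines, for instance) appear to be what shows up as <NA>.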

4. 1. 1. Counting and filtering n-grams

Here too, the count() function can be used to check frequencies.

austen_bigrams %>% 
  count(bigram, sort = TRUE)

## # A tibble: 193,210 x 2
##    bigram      n
##    <chr>   <int>
##  1 <NA>    12242
##  2 of the   2853
##  3 to be    2670
##  4 in the   2221
##  5 it was   1691
##  6 i am     1485
##  7 she had  1405
##  8 of her   1363
##  9 to the   1315
## 10 she was  1309
## # … with 193,200 more rows

The separate() function splits one column into several columns at a delimiter.
We can use it to split the result above into two columns.
Because the two words in each austen_bigrams entry are separated by a single space, the delimiter is one space.

bigrams_separated <- austen_bigrams %>% 
  separate(
    col = bigram, 
    into = c("word1", "word2"),
    sep = " "
  )

bigrams_separated

## # A tibble: 675,025 x 3
##    book                word1 word2      
##    <fct>               <chr> <chr>      
##  1 Sense & Sensibility sense and        
##  2 Sense & Sensibility and   sensibility
##  3 Sense & Sensibility <NA>  <NA>       
##  4 Sense & Sensibility by    jane       
##  5 Sense & Sensibility jane  austen     
##  6 Sense & Sensibility <NA>  <NA>       
##  7 Sense & Sensibility <NA>  <NA>       
##  8 Sense & Sensibility <NA>  <NA>       
##  9 Sense & Sensibility <NA>  <NA>       
## 10 Sense & Sensibility <NA>  <NA>       
## # … with 675,015 more rows

Let us remove stop words using stop_words and then check the frequencies.

data("stop_words")

bigrams_filtered <- bigrams_separated %>% 
  filter(!word1 %in% stop_words$word) %>% 
  filter(!word2 %in% stop_words$word)

bigram_counts <- bigrams_filtered %>% 
  count(word1, word2, sort = TRUE)

bigram_counts

## # A tibble: 28,975 x 3
##    word1   word2         n
##    <chr>   <chr>     <int>
##  1 <NA>    <NA>      12242
##  2 sir     thomas      266
##  3 miss    crawford    196
##  4 captain wentworth   143
##  5 miss    woodhouse   143
##  6 frank   churchill   114
##  7 lady    russell     110
##  8 sir     walter      108
##  9 lady    bertram     101
## 10 miss    fairfax      98
## # … with 28,965 more rows
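
The <NA> pair survives the stop-word filter because %in% returns FALSE, never NA, for a missing value, so the negated test is TRUE for those rows. The base R sketch below uses a made-up stop-word vector and data frame (both hypothetical stand-ins) to show the behaviour and one way to drop the missing pairs.

```r
# Hypothetical stand-ins for stop_words$word and bigrams_separated.
stops <- c("the", "of", "and")
pairs <- data.frame(
  word1 = c("sir", NA, "of", "miss"),
  word2 = c("thomas", NA, "the", "crawford"),
  stringsAsFactors = FALSE
)

# NA %in% stops is FALSE, so the negated test keeps the NA row.
keep <- !(pairs$word1 %in% stops) & !(pairs$word2 %in% stops)
keep

# An explicit complete.cases() check removes the missing pairs as well.
pairs_clean <- pairs[keep & complete.cases(pairs), ]
pairs_clean
```

In a dplyr pipeline the same idea can be expressed by adding filter(!is.na(word1), !is.na(word2)) next to the stop-word filters.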

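We can see that in Jane Austen's books the most frequent pairs are names, either a title followed by a name or a first and last name.

In other analyses we may want to work with the recombined words.
The unite() function is the inverse of separate() and merges the columns back into one.
Together, separate(), filter(), count(), and unite() let us find the most common two-word pairs.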

bigrams_united <- bigrams_filtered %>% 
  unite(
    col = "bigram",
    word1, word2,
    sep = " "
  )

bigrams_united

## # A tibble: 51,155 x 2
##    book                bigram     
##    <fct>               <chr>      
##  1 Sense & Sensibility NA NA      
##  2 Sense & Sensibility jane austen
##  3 Sense & Sensibility NA NA      
##  4 Sense & Sensibility NA NA      
##  5 Sense & Sensibility NA NA      
##  6 Sense & Sensibility NA NA      
##  7 Sense & Sensibility NA NA      
##  8 Sense & Sensibility NA NA      
##  9 Sense & Sensibility chapter 1  
## 10 Sense & Sensibility NA NA      
## # … with 51,145 more rows

4. 1. 2. Analyzing bigrams

Suppose we are interested in the word “street” as mentioned in each book.
To explore which words come before “street” most often, we can run the following.

bigrams_filtered %>% 
  filter(word2 == "street") %>% 
  count(book, word1) # add sort = TRUE to order the rows by n

## # A tibble: 33 x 3
##    book                word1           n
##    <fct>               <chr>       <int>
##  1 Sense & Sensibility berkeley       15
##  2 Sense & Sensibility bond            4
##  3 Sense & Sensibility conduit         4
##  4 Sense & Sensibility harley         16
##  5 Sense & Sensibility james           1
##  6 Sense & Sensibility park            1
##  7 Sense & Sensibility sackville       1
##  8 Pride & Prejudice   edward          1
##  9 Pride & Prejudice   gracechurch     8
## 10 Pride & Prejudice   grosvenor       2
## # … with 23 more rows

Because an n-gram is also treated as a token within the text, its TF-IDF can be computed as well.

bigram_tf_idf <- bigrams_united %>% 
  count(book, bigram) %>% 
  bind_tf_idf(
    term = bigram,
    document = book,
    n = n
  ) %>% 
  arrange(desc(tf_idf))

bigram_tf_idf %>% 
  group_by(book) %>% 
  slice_max(tf_idf, n = 10) %>% 
  ungroup() %>% 
  ggplot(aes(x = reorder(bigram, tf_idf), y = tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = "free") +
  coord_flip() +
  labs(x = "bi-gram", y = NULL)

Compared with individual words, the TF-IDF of n-grams can capture structure that is invisible when counting single words, and it can make the tokens easier to interpret.

4. 1. 3. Using bigrams to provide context in sentiment analysis

In Chapter 2 we used a lexicon and simply counted how often positive or negative words appeared.
Now that we have n-grams, we can also see how often a word is preceded by a word such as “not”.

bigrams_separated %>% 
  filter(word1 == "not") %>% 
  count(word1, word2, sort = TRUE)

## # A tibble: 1,178 x 3
##    word1 word2     n
##    <chr> <chr> <int>
##  1 not   be      580
##  2 not   to      335
##  3 not   have    307
##  4 not   know    237
##  5 not   a       184
##  6 not   think   162
##  7 not   been    151
##  8 not   the     135
##  9 not   at      126
## 10 not   in      110
## # … with 1,168 more rows

Building on this, we use the AFINN lexicon to express the sentiment of each word as a number.

not_words <- bigrams_separated %>% 
  filter(word1 == "not") %>% 
  inner_join(get_sentiments("afinn"), by = c(word2 = "word")) %>% 
  count(word2, value, sort = TRUE)

not_words

## # A tibble: 229 x 3
##    word2   value     n
##    <chr>   <dbl> <int>
##  1 like        2    95
##  2 help        2    77
##  3 want        1    41
##  4 wish        1    39
##  5 allow       1    30
##  6 care        2    21
##  7 sorry      -1    20
##  8 leave      -1    17
##  9 pretend    -1    17
## 10 worth       2    17
## # … with 219 more rows

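For example, the most common sentiment-bearing word following “not” is “like”, which has a score of 2.
In the same way, we can check which words preceded by “not” contributed most to the sentiment score, and therefore to scoring the text in the wrong direction.
- This can be measured as each word's value multiplied by its frequency.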

not_words %>% 
  mutate(contribution = n*value) %>% 
  arrange(desc(abs(contribution))) %>% 
  head(20) %>% 
  ggplot(aes(x = reorder(word2, contribution), y = contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(y = "Sentiment value * number of occurrences", x = "Words preceded by \"not\"")
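
The contribution calculation can be worked through by hand. This base R sketch copies only the ten rows of not_words printed earlier into a data frame with the hypothetical name not_words_top, so the ranking it gives is not the full top 20 used in the plot.

```r
# First ten rows of not_words, copied from the printed output.
not_words_top <- data.frame(
  word2 = c("like", "help", "want", "wish", "allow",
            "care", "sorry", "leave", "pretend", "worth"),
  value = c(2, 2, 1, 1, 1, 2, -1, -1, -1, 2),
  n     = c(95, 77, 41, 39, 30, 21, 20, 17, 17, 17)
)

# contribution = sentiment value * number of occurrences after "not"
not_words_top$contribution <- not_words_top$n * not_words_top$value

# Order by absolute contribution, as the plotting code does.
not_words_top[order(-abs(not_words_top$contribution)), ]
```

For “like” this gives 95 * 2 = 190, the largest value among these ten rows: the phrase “not like” is counted as positive 95 times even though it usually expresses the opposite.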

4. 1. 4. Visualizing a network of bigrams with `ggraph`

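Beyond showing only the top few words for one word at a time, we may want to visualize all of the relationships among words at once.
A network graph, a format that shows a set of connected nodes, lets us lay these relationships out.
We use the igraph library here, specifically the graph_from_data_frame() function, which builds an igraph object from tidy data.
For the visualization we use the ggraph library.
- igraph has its own plotting functions, but we use ggraph because it follows the familiar ggplot2 grammar.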

library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:purrr':
## 
##     compose, simplify

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following object is masked from 'package:tibble':
## 
##     as_data_frame

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

library(ggraph)

bigram_counts

## # A tibble: 28,975 x 3
##    word1   word2         n
##    <chr>   <chr>     <int>
##  1 <NA>    <NA>      12242
##  2 sir     thomas      266
##  3 miss    crawford    196
##  4 captain wentworth   143
##  5 miss    woodhouse   143
##  6 frank   churchill   114
##  7 lady    russell     110
##  8 sir     walter      108
##  9 lady    bertram     101
## 10 miss    fairfax      98
## # … with 28,965 more rows

bigram_graph <- bigram_counts %>% 
  filter(n > 20) %>% # keep only word pairs that occur more than 20 times
  graph_from_data_frame()

## Warning in graph_from_data_frame(.): In `d' `NA' elements were replaced with
## string "NA"

bigram_graph

## IGRAPH 68c3987 DN-- 86 71 -- 
## + attr: name (v/c), n (e/n)
## + edges from 68c3987 (vertex names):
##  [1] NA      ->NA         sir     ->thomas     miss    ->crawford  
##  [4] captain ->wentworth  miss    ->woodhouse  frank   ->churchill 
##  [7] lady    ->russell    sir     ->walter     lady    ->bertram   
## [10] miss    ->fairfax    colonel ->brandon    sir     ->john      
## [13] miss    ->bates      jane    ->fairfax    lady    ->catherine 
## [16] lady    ->middleton  miss    ->tilney     miss    ->bingley   
## [19] thousand->pounds     miss    ->dashwood   dear    ->miss      
## [22] miss    ->bennet     miss    ->morland    captain ->benwick   
## + ... omitted several edges

Passing this object to the ggraph() function turns the igraph object into a ggraph plot.
Layers are then added just as they are in ggplot2.
At a minimum, three layers are needed: nodes, edges, and text.

set.seed(2021)

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)

In the graph above, words such as “miss”, “lady”, and “sir” form common central nodes that are often followed by names.
A few finishing touches, shown below, produce a better-looking graph.

ggraph(bigram_graph, layout = "fr") +
  geom_edge_link(
    aes(edge_alpha = n), 
    arrow = grid::arrow(type = "closed", length = unit(.15, "inches")),
    end_cap = circle(.07, "inches"),
    show.legend = FALSE
  ) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()

4. 2. Counting and correlating pairs of words with the `widyr` package

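As seen above, tokenizing by n-gram is a useful way to explore pairs of adjacent words.
However, we may also be interested in words that tend to occur together within a particular document or chapter even when they are not next to each other.
Tidy data is a useful structure for comparing or grouping variables, but comparisons between rows can be awkward.
For example, to count how many times two words appear in the same document, or to see how strongly they are correlated, the data must first be converted to a wide format (a matrix).
The widyr library helps with this step.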

4. 2. 1. Counting and correlating among sections

Let us tokenize the text of “Pride & Prejudice” into words.
Sections are defined as blocks of 10 lines.

austen_section_words <- austen_books() %>% 
  filter(book == "Pride & Prejudice") %>% 
  mutate(section = row_number() %/% 10) %>% 
  filter(section > 0) %>% 
  unnest_tokens(
    input = text,
    output = "word",
    token = "words"
  ) %>% 
  filter(!word %in% stop_words$word)

austen_section_words

## # A tibble: 37,240 x 3
##    book              section word        
##    <fct>               <dbl> <chr>       
##  1 Pride & Prejudice       1 truth       
##  2 Pride & Prejudice       1 universally 
##  3 Pride & Prejudice       1 acknowledged
##  4 Pride & Prejudice       1 single      
##  5 Pride & Prejudice       1 possession  
##  6 Pride & Prejudice       1 fortune     
##  7 Pride & Prejudice       1 wife        
##  8 Pride & Prejudice       1 feelings    
##  9 Pride & Prejudice       1 views       
## 10 Pride & Prejudice       1 entering    
## # … with 37,230 more rows
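
The sectioning rule row_number() %/% 10 is plain integer division. A minimal base R sketch with a hypothetical 25-line text shows how lines map to sections:

```r
line_number <- 1:25               # stand-in for row_number()
section <- line_number %/% 10     # integer division by 10

# How many lines fall into each section?
table(section)
```

Lines 1 to 9 land in section 0, lines 10 to 19 in section 1, and so on. Section 0 therefore holds only 9 lines, and it is the one that filter(section > 0) drops in the code above.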

One useful function in widyr is pairwise_count().
- The pairwise_ prefix means that the result has one row for each pair of words in the word variable.
- The item argument takes the words, and the feature argument takes the sections.

library(widyr)

word_pairs <- austen_section_words %>% 
  pairwise_count(
    item = word,
    feature = section,
    sort = TRUE
  )

## Warning: `distinct_()` was deprecated in dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.

word_pairs

## # A tibble: 796,008 x 3
##    item1     item2         n
##    <chr>     <chr>     <dbl>
##  1 darcy     elizabeth   144
##  2 elizabeth darcy       144
##  3 miss      elizabeth   110
##  4 elizabeth miss        110
##  5 elizabeth jane        106
##  6 jane      elizabeth   106
##  7 miss      darcy        92
##  8 darcy     miss         92
##  9 elizabeth bingley      91
## 10 bingley   elizabeth    91
## # … with 795,998 more rows

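The result has one row for each pair of words, with n giving the number of sections in which both appear.
This makes it easy to find the words that most often appear in the same section as a given word.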
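
The wide-format idea behind this count can be sketched in base R. This is only an illustration on hypothetical toy data, not widyr's implementation, and it assumes that a pair is counted once per section in which both words appear.

```r
# Hypothetical toy data: one row per (section, word), like austen_section_words.
toy <- data.frame(
  section = c(1, 1, 1, 2, 2, 3),
  word    = c("darcy", "elizabeth", "jane", "darcy", "elizabeth", "jane")
)

# Wide 0/1 matrix: rows are sections, columns are words.
incidence <- 1 * (unclass(table(toy$section, toy$word)) > 0)

# t(incidence) %*% incidence counts, for every pair of words,
# the number of sections in which both appear.
cooccurrence <- crossprod(incidence)
cooccurrence
```

The matrix is symmetric, which matches the printed result listing each pair twice, once in each order. The diagonal holds the number of sections containing each word on its own.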

# Darcy, like Elizabeth, is said to be one of the main characters.
word_pairs %>% 
  filter(item1 == "darcy")

## # A tibble: 2,930 x 3
##    item1 item2         n
##    <chr> <chr>     <dbl>
##  1 darcy elizabeth   144
##  2 darcy miss         92
##  3 darcy bingley      86
##  4 darcy jane         46
##  5 darcy bennet       45
##  6 darcy sister       45
##  7 darcy time         41
##  8 darcy lady         38
##  9 darcy friend       37
## 10 darcy wickham      37
## # … with 2,920 more rows

4. 2. 2. Pairwise correlation

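The phi coefficient between two words is defined from a 2x2 table that counts the sections containing both words, only one of them, or neither.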
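
The standard definition can be written out as follows. The notation is an assumption chosen here for illustration: for two words X and Y, \(n_{11}\) is the number of sections containing both, \(n_{00}\) the number containing neither, \(n_{10}\) the number with X but not Y, and \(n_{01}\) the number with Y but not X. The marginal totals are \(n_{1\cdot} = n_{11} + n_{10}\), \(n_{0\cdot} = n_{01} + n_{00}\), \(n_{\cdot 1} = n_{11} + n_{01}\), and \(n_{\cdot 0} = n_{10} + n_{00}\).

```latex
\[
\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}
            {\sqrt{n_{1\cdot}\,n_{0\cdot}\,n_{\cdot 0}\,n_{\cdot 1}}}
\]
```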

It is equivalent to the Pearson correlation coefficient computed on binary data.
To compute the correlation between two words, use the pairwise_cor() function.

word_corr <- austen_section_words %>% 
  group_by(word) %>% 
  filter(n() > 20) %>% 
  ungroup() %>% 
  pairwise_cor(
    item = word,
    feature = section,
    sort = TRUE
  )

word_corr

## # A tibble: 140,250 x 3
##    item1     item2     correlation
##    <chr>     <chr>           <dbl>
##  1 bourgh    de              0.951
##  2 de        bourgh          0.951
##  3 pounds    thousand        0.701
##  4 thousand  pounds          0.701
##  5 william   sir             0.664
##  6 sir       william         0.664
##  7 catherine lady            0.663
##  8 lady      catherine       0.663
##  9 forster   colonel         0.622
## 10 colonel   forster         0.622
## # … with 140,240 more rows
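
The equivalence between the phi coefficient and the Pearson correlation on binary data can be checked directly. This base R sketch uses two hypothetical presence/absence vectors, applies the standard 2x2 formula for phi, and compares the result with cor().

```r
# Hypothetical presence (1) / absence (0) of two words across 8 sections.
x <- c(1, 1, 1, 0, 0, 1, 0, 0)
y <- c(1, 1, 0, 0, 0, 1, 1, 0)

n11 <- sum(x == 1 & y == 1)   # sections with both words
n00 <- sum(x == 0 & y == 0)   # sections with neither word
n10 <- sum(x == 1 & y == 0)   # sections with x only
n01 <- sum(x == 0 & y == 1)   # sections with y only

# phi = (n11 * n00 - n10 * n01) / sqrt(product of the four marginal totals)
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

phi
cor(x, y)   # Pearson correlation of the same 0/1 vectors
```

Both expressions give the same value. A word pair such as “bourgh” and “de” scores close to 1 because the two words are almost always present or absent in the same sections.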

From this result we can filter on particular words of interest and find the other words most strongly correlated with them.

word_corr %>% 
  filter(item1 %in% c("elizabeth", "pounds", "married", "pride")) %>% 
  group_by(item1) %>% 
  slice_max(correlation, n = 6) %>% 
  ungroup() %>% 
  ggplot(aes(x = reorder(item2, correlation), y = correlation)) +
  geom_bar(stat = "identity", colour = "black", fill = "grey85") +
  facet_wrap(~ item1, scales = "free") +
  coord_flip()

Likewise, the ggraph library can draw a network graph based on the correlations.

word_corr %>% 
  filter(correlation > .15) %>% # keep only word pairs with a correlation above 0.15
  graph_from_data_frame() %>% 
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = "lightblue", size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) + # repel = TRUE keeps the labels from overlapping
  theme_void()
