<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>제이드의 낙서장</title>
    <link>https://rstatistics.tistory.com/</link>
    <description>Notes from studying R, organized by topic.</description>
    <language>ko</language>
    <pubDate>Tue, 14 Apr 2026 19:57:27 +0900</pubDate>
    <generator>TISTORY</generator>
    <ttl>100</ttl>
    <managingEditor>택2</managingEditor>
    <image>
      <title>제이드의 낙서장</title>
      <url>https://t1.daumcdn.net/cfile/tistory/22722A33595A619530</url>
      <link>https://rstatistics.tistory.com</link>
    </image>
    <item>
      <title>[R] 7. ARIMA (AutoRegressive Integrated Moving Average)</title>
      <link>https://rstatistics.tistory.com/80</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;
&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
	MathJax.Hub.Config({
    	tex2jax: {
    		inlineMath: [ ['$','$'], ['\\(','\\)'] ],
    		processEscapes: true
		},
		TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } }
	});
&lt;/script&gt;
&lt;script src=&quot;https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;&lt;/script&gt;
&lt;/p&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)
#library(fable)&lt;/code&gt;&lt;/pre&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Stationarity and differencing&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
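&lt;li&gt;This section covers stationarity and differencing.&lt;/li&gt;
&lt;li&gt;What is a stationary time series?
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A stationary time series is one whose statistical properties do not depend on the time at which the series is observed.&lt;/li&gt;
&lt;li&gt;Series with a trend or with seasonality are therefore not stationary, because the trend and the seasonality affect the value of the series at different times.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;What is differencing?
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Differencing means computing the differences between consecutive observations. It is one way to make a non-stationary series stationary.&lt;/li&gt;
&lt;li&gt;Differencing can help stabilise the mean of a series by removing or reducing trend and seasonality.&lt;/li&gt;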
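&lt;li&gt;Differencing can be applied more than once: second-order differencing takes the difference of the already differenced series, and it is occasionally needed. The order counts how many times differencing is applied, which is different from the lag between the two observations being subtracted.&lt;/li&gt;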
&lt;/ul&gt;
&lt;/li&gt;
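&lt;li&gt;In addition, transformations such as logarithms can help stabilise the variance of a series.&lt;/li&gt;
&lt;li&gt;The ACF plot is also useful for identifying non-stationary series.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For a stationary series the ACF drops to zero relatively quickly, whereas for a non-stationary series it decreases slowly.&lt;/li&gt;
&lt;li&gt;For a non-stationary series the lag-1 autocorrelation is also often large and positive.&lt;/li&gt;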
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
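&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The points above can be illustrated with a minimal sketch. It is an added example, not part of the original analysis: it uses base R only and a simulated random walk (an assumed toy series with an arbitrary seed) to show that first differencing recovers the stationary increments and sharply lowers the lag-1 autocorrelation.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;set.seed(123)                 # arbitrary seed, for reproducibility
eps  &amp;lt;- rnorm(300)            # white noise (stationary)
rw   &amp;lt;- cumsum(eps)           # random walk (non-stationary)
d_rw &amp;lt;- diff(rw)              # first difference: rw[t] - rw[t-1]

# Differencing the random walk gives back the white-noise increments
all.equal(d_rw, eps[-1])

# Lag-1 autocorrelation before and after differencing
r1 &amp;lt;- function(x) acf(x, lag.max = 1, plot = FALSE)$acf[2]
c(random_walk = r1(rw), differenced = r1(d_rw))&lt;/code&gt;&lt;/pre&gt;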
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Random walk model&lt;/h3&gt;
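&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t} - y_{t-1} = \epsilon_{t}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;When the differenced series is white noise, written &lt;span class=&quot;math inline&quot;&gt;\(\epsilon_{t}\)&lt;/span&gt;, rearranging gives the random walk model:&lt;/li&gt;
&lt;/ul&gt;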
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t} = y_{t-1} + \epsilon_{t}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
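&lt;li&gt;The random walk is a classic example of a non-stationary series and is widely used for financial and economic data.&lt;/li&gt;
&lt;li&gt;Forecasts from this model are equal to the last observation,
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;because future movements are unpredictable and are equally likely to be up or down.&lt;/li&gt;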
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
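&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Why the forecast equals the last observation can be seen by iterating the model forward &lt;span class=&quot;math inline&quot;&gt;\(h\)&lt;/span&gt; steps from the last observed time &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt;. This short derivation is an added illustration and assumes the &lt;span class=&quot;math inline&quot;&gt;\(\epsilon_{t}\)&lt;/span&gt; are uncorrelated with mean zero and variance &lt;span class=&quot;math inline&quot;&gt;\(\sigma^{2}\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{T+h} = y_{T} + \epsilon_{T+1} + \cdots + \epsilon_{T+h}\]&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\text{E}[y_{T+h} \mid y_{1},\dots,y_{T}] = y_{T}, \qquad \text{Var}[y_{T+h} \mid y_{1},\dots,y_{T}] = h\sigma^{2}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;So the point forecast stays flat at &lt;span class=&quot;math inline&quot;&gt;\(y_{T}\)&lt;/span&gt;, while the forecast variance grows linearly with the horizon.&lt;/li&gt;
&lt;/ul&gt;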
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Seasonal differencing&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
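&lt;li&gt;Just as ordinary differencing subtracts the previous observation, seasonal differencing subtracts the observation from the same season one period earlier.&lt;/li&gt;
&lt;li&gt;Sometimes both a first difference and a seasonal difference are needed to obtain a stationary series.&lt;/li&gt;
&lt;li&gt;The example below illustrates this.&lt;/li&gt;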
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;PBS %&amp;gt;% 
  filter(ATC2 == &quot;H02&quot;) %&amp;gt;% 
  summarise(Cost = sum(Cost)/1000000) %&amp;gt;% 
  transmute(
    `Sales ($million)` = Cost,
    `Log sales` = log(Cost),
    `Annual change in log sales` = difference(log(Cost), 12), # monthly data, so the seasonal lag is 12
    `Doubly differenced log sales` = difference(difference(log(Cost), 12), 1)
  ) %&amp;gt;% 
  pivot_longer(
    cols = -Month,
    names_to = &quot;Type&quot;,
    values_to = &quot;Sales&quot;
  ) %&amp;gt;% 
  mutate(Type = factor(Type, levels = c(&quot;Sales ($million)&quot;, 
                                        &quot;Log sales&quot;,
                                        &quot;Annual change in log sales&quot;, 
                                        &quot;Doubly differenced log sales&quot;))) %&amp;gt;% 
  ggplot(aes(x = Month, y = Sales)) +
  geom_line(size = 0.7) +
  facet_wrap(. ~ Type, scales = &quot;free_y&quot;, nrow = 4) +
  labs(title = &quot;Corticosteroid drug sales&quot;, y = NULL)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.10.40.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/qawVv/btrbrVAL6hQ/HoH0RQqpzTcUD7h5D6gnO1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/qawVv/btrbrVAL6hQ/HoH0RQqpzTcUD7h5D6gnO1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/qawVv/btrbrVAL6hQ/HoH0RQqpzTcUD7h5D6gnO1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FqawVv%2FbtrbrVAL6hQ%2FHoH0RQqpzTcUD7h5D6gnO1%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.10.40.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
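&lt;li&gt;Applying both differences gives the same result in either order, but when the seasonal pattern is strong it is better to take the seasonal difference first and then the first difference.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The seasonally differenced series is sometimes already stationary, whereas after a first difference alone the seasonality is likely to remain.&lt;/li&gt;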
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
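&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As a quick check of how the two differences combine, the sketch below applies them in both orders. It is an added example that uses base R and the built-in &lt;code&gt;AirPassengers&lt;/code&gt; series as a stand-in (an assumption; it is not the PBS data used above).&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;x &amp;lt;- log(AirPassengers)      # built-in monthly series, used as a stand-in

seasonal_then_first &amp;lt;- diff(diff(x, lag = 12), lag = 1)
first_then_seasonal &amp;lt;- diff(diff(x, lag = 1), lag = 12)

# Both orders produce the same doubly differenced series
all.equal(seasonal_then_first, first_then_seasonal)

# 12 + 1 observations are lost at the start of the series
length(x) - length(seasonal_then_first)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The final series is identical either way; the recommendation about order only concerns which intermediate series is inspected first.&lt;/li&gt;
&lt;/ul&gt;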
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Unit root tests&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;One objective way to decide whether differencing is needed is to use a unit root test.&lt;/li&gt;
&lt;li&gt;Unit root tests are statistical hypothesis tests of stationarity.&lt;/li&gt;
&lt;li&gt;Here we use the KPSS (Kwiatkowski-Phillips-Schmidt-Shin) test.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The null hypothesis of this test is &amp;ldquo;the data are stationary&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;A p-value below the significance level therefore rejects the null hypothesis and indicates that differencing is required.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The test can be run by passing &lt;code&gt;unitroot_kpss()&lt;/code&gt; to the &lt;code&gt;features()&lt;/code&gt; function.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_total_retail &amp;lt;- aus_retail %&amp;gt;% 
  summarise(Turnover = sum(Turnover))

aus_total_retail %&amp;gt;% 
  mutate(log_turnover = log(Turnover)) %&amp;gt;% 
  features(.var = log_turnover, feature = unitroot_kpss)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 1 x 2
##   kpss_stat kpss_pvalue
##       &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;
## 1      7.35        0.01&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The p-value (0.01) is below 0.05, so the null hypothesis is rejected. In other words, the series needs to be differenced.&lt;/li&gt;
&lt;li&gt;How many differences are needed, then?&lt;/li&gt;
&lt;li&gt;That can be answered with &lt;code&gt;unitroot_nsdiffs()&lt;/code&gt; and &lt;code&gt;unitroot_ndiffs()&lt;/code&gt;.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;unitroot_nsdiffs()&lt;/code&gt; returns the number of seasonal differences required.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;unitroot_ndiffs()&lt;/code&gt; returns the number of ordinary (first) differences required.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_total_retail %&amp;gt;% 
  mutate(log_turnover = log(Turnover)) %&amp;gt;%
  features(.var = log_turnover, feature = unitroot_nsdiffs)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 1 x 1
##   nsdiffs
##     &amp;lt;int&amp;gt;
## 1       1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_total_retail %&amp;gt;% 
  mutate(log_turnover = difference(log(Turnover), 12)) %&amp;gt;%
  features(.var = log_turnover, feature = unitroot_ndiffs)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 1 x 1
##   ndiffs
##    &amp;lt;int&amp;gt;
## 1      1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_total_retail %&amp;gt;% 
  mutate(log_turnover = difference(difference(log(Turnover), 12), 1)) %&amp;gt;%
  features(.var = log_turnover, feature = unitroot_ndiffs)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 1 x 1
##   ndiffs
##    &amp;lt;int&amp;gt;
## 1      0&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;These results confirm that this series needs one seasonal difference followed by one first difference.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Autoregressive models&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In a multiple regression model, the variable of interest is forecast using a linear combination of predictors.&lt;/li&gt;
&lt;li&gt;In an autoregressive model, the variable is forecast using a linear combination of its own past values.&lt;/li&gt;
&lt;li&gt;The term autoregressive indicates a regression of the variable against itself.&lt;/li&gt;
&lt;li&gt;An autoregressive model of order &lt;span class=&quot;math inline&quot;&gt;\(p\)&lt;/span&gt; is defined as follows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t} = c + \phi_{1}y_{t-1} + \phi_{2}y_{t-2} + ... + \phi_{p}y_{t-p} + \epsilon_{t}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This is called an &lt;span class=&quot;math inline&quot;&gt;\(\text{AR}(p)\)&lt;/span&gt; model, an autoregressive model of order &lt;span class=&quot;math inline&quot;&gt;\(p\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;The parameters must satisfy some constraints, which depend on the model order; for details, see the bottom of &lt;a href=&quot;https://otexts.com/fpp3/AR.html#AR&quot;&gt;this page&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
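&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To make the definition concrete, the sketch below simulates an &lt;span class=&quot;math inline&quot;&gt;\(\text{AR}(2)\)&lt;/span&gt; series by applying the equation term by term. It is an added illustration in base R; the constant, the coefficients and the seed are hypothetical values chosen to give a stationary series.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;set.seed(42)                       # arbitrary seed
n     &amp;lt;- 500
const &amp;lt;- 2                         # c in the equation (hypothetical value)
phi   &amp;lt;- c(0.6, -0.3)              # phi_1 and phi_2 (hypothetical, stationary)
eps   &amp;lt;- rnorm(n)                  # white noise

# For a stationary AR(p), the process mean is c / (1 - phi_1 - ... - phi_p)
mu &amp;lt;- const / (1 - sum(phi))

y &amp;lt;- numeric(n)
y[1:2] &amp;lt;- mu                       # start the recursion at the implied mean
for (t in 3:n) {
  y[t] &amp;lt;- const + phi[1] * y[t - 1] + phi[2] * y[t - 2] + eps[t]
}

c(implied_mean = mu, sample_mean = mean(y))&lt;/code&gt;&lt;/pre&gt;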
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Moving average models&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A moving average model looks similar to a multiple regression, but it uses past forecast errors as the predictors.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t} = c + \epsilon_{t} + \theta_{1}\epsilon_{t-1} + \theta_{2}\epsilon_{t-2} + ... + \theta_{q}\epsilon_{t-q}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This is called an &lt;span class=&quot;math inline&quot;&gt;\(\text{MA}(q)\)&lt;/span&gt; model, a moving average model of order &lt;span class=&quot;math inline&quot;&gt;\(q\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;Note that each value of &lt;span class=&quot;math inline&quot;&gt;\(y_{t}\)&lt;/span&gt; can be thought of as a weighted moving average of the past few forecast errors.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This should not be confused with moving average smoothing.&lt;/li&gt;
&lt;li&gt;A moving average model is used to forecast future values, whereas moving average smoothing is used to estimate the trend-cycle of past values.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;As before, the constraints on the parameters are described at the bottom of &lt;a href=&quot;https://otexts.com/fpp3/MA.html#MA&quot;&gt;this page&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
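&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The weighted-average reading of the equation can be sketched with a one-sided filter over simulated errors. This is an added base R illustration of an &lt;span class=&quot;math inline&quot;&gt;\(\text{MA}(2)\)&lt;/span&gt; series; the constant, the weights and the seed are hypothetical.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;set.seed(7)                        # arbitrary seed
n     &amp;lt;- 200
const &amp;lt;- 10                        # c in the equation (hypothetical value)
theta &amp;lt;- c(0.8, 0.4)               # theta_1 and theta_2 (hypothetical)
eps   &amp;lt;- rnorm(n)                  # white noise errors

# One-sided convolution: eps[t] + theta_1 * eps[t-1] + theta_2 * eps[t-2]
# (the first two values are NA because their lagged errors do not exist)
y &amp;lt;- const + stats::filter(eps, c(1, theta), method = &quot;convolution&quot;, sides = 1)

# Check one value against the equation written out by hand
t0 &amp;lt;- 10
by_hand &amp;lt;- const + eps[t0] + theta[1] * eps[t0 - 1] + theta[2] * eps[t0 - 2]
all.equal(as.numeric(y[t0]), by_hand)&lt;/code&gt;&lt;/pre&gt;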
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Non-seasonal ARIMA models&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
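&lt;li&gt;Combining differencing with an autoregressive (AR) model and a moving average (MA) model gives a non-seasonal ARIMA model.&lt;/li&gt;
&lt;li&gt;ARIMA stands for AutoRegressive Integrated Moving Average; &amp;ldquo;integrated&amp;rdquo; refers to integration as the reverse of differencing.&lt;/li&gt;
&lt;li&gt;The full model can be written as follows.&lt;/li&gt;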
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t}^{'} = c + \phi_{1}y_{t-1}^{'} + \cdots + \phi_{p}y_{t-p}^{'} + \theta_{1}\epsilon_{t-1} + \cdots + \theta_{q}\epsilon_{t-q} + \epsilon_{t}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
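&lt;li&gt;On the left-hand side, &lt;span class=&quot;math inline&quot;&gt;\(y_{t}^{'}\)&lt;/span&gt; is the differenced series (it may have been differenced more than once).&lt;/li&gt;
&lt;li&gt;The right-hand side includes both lagged values of the series and lagged errors.&lt;/li&gt;
&lt;li&gt;This is called an &lt;span class=&quot;math inline&quot;&gt;\(\text{ARIMA}(p, d, q)\)&lt;/span&gt; model.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\(d\)&lt;/span&gt; is the number of times the series has been differenced, while &lt;span class=&quot;math inline&quot;&gt;\(p\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(q\)&lt;/span&gt; are the orders of the autoregressive and moving average parts.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;An example follows.&lt;/li&gt;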
&lt;/ul&gt;
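&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Before the example, it may help to expand one small case by hand (an added illustration). For &lt;span class=&quot;math inline&quot;&gt;\(\text{ARIMA}(1, 1, 0)\)&lt;/span&gt; the differenced series is &lt;span class=&quot;math inline&quot;&gt;\(y_{t}^{'} = y_{t} - y_{t-1}\)&lt;/span&gt;, so the model equation becomes:&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t} - y_{t-1} = c + \phi_{1}(y_{t-1} - y_{t-2}) + \epsilon_{t}\]&lt;/span&gt;&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t} = c + (1 + \phi_{1})y_{t-1} - \phi_{1}y_{t-2} + \epsilon_{t}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In terms of the original series this is a recursion on two lags whose coefficients sum to one, which is the unit root that the differencing step removes.&lt;/li&gt;
&lt;/ul&gt;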
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The plot below shows Egyptian exports as a percentage of GDP from 1960 to 2017.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;global_economy %&amp;gt;% 
  filter(Code == &quot;EGY&quot;) %&amp;gt;% 
  autoplot(Exports) +
  labs(y = &quot;% of GDP&quot;, title = &quot;Egyptian Exports&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.10.51.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/beb0ck/btrbh2nDFTS/e2QIceHJgKehvlqIfzDIA1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/beb0ck/btrbh2nDFTS/e2QIceHJgKehvlqIfzDIA1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/beb0ck/btrbh2nDFTS/e2QIceHJgKehvlqIfzDIA1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbeb0ck%2Fbtrbh2nDFTS%2Fe2QIceHJgKehvlqIfzDIA1%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.10.51.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The function that fits an ARIMA model is &lt;code&gt;ARIMA()&lt;/code&gt;. When no orders are specified, it selects the order of each part automatically.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit &amp;lt;- global_economy %&amp;gt;% 
  filter(Code == &quot;EGY&quot;) %&amp;gt;% 
  model(ARIMA(Exports))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Series: Exports 
## Model: ARIMA(2,0,1) w/ mean 
## 
## Coefficients:
##          ar1      ar2      ma1  constant
##       1.6764  -0.8034  -0.6896    2.5623
## s.e.  0.1111   0.0928   0.1492    0.1161
## 
## sigma^2 estimated as 8.046:  log likelihood=-141.57
## AIC=293.13   AICc=294.29   BIC=303.43&lt;/code&gt;&lt;/pre&gt;
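&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Reading the estimates off the printed output, and assuming the sign convention of the ARIMA equation given earlier (the MA coefficient multiplies the lagged error), the fitted model can be written out as:&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t} = 2.5623 + 1.6764y_{t-1} - 0.8034y_{t-2} - 0.6896\epsilon_{t-1} + \epsilon_{t}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Here &lt;span class=&quot;math inline&quot;&gt;\(\epsilon_{t}\)&lt;/span&gt; is white noise with estimated variance 8.046, that is, a standard deviation of about &lt;span class=&quot;math inline&quot;&gt;\(\sqrt{8.046} \approx 2.84\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;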
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As the output shows, an &lt;span class=&quot;math inline&quot;&gt;\(\text{ARIMA}(2, 0, 1)\)&lt;/span&gt; model was fitted.&lt;/li&gt;
&lt;li&gt;Forecasts from this model are shown below.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit %&amp;gt;% 
  forecast(h = 10) %&amp;gt;% 
  autoplot(global_economy) +
  labs(y = &quot;% of GDP&quot;, title = &quot;Egyptian Exports&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.10.58.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cMNOe5/btrblkIEWi0/iBgdtRK4K9pACKy9QzsxK1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cMNOe5/btrblkIEWi0/iBgdtRK4K9pACKy9QzsxK1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cMNOe5/btrblkIEWi0/iBgdtRK4K9pACKy9QzsxK1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcMNOe5%2FbtrblkIEWi0%2FiBgdtRK4K9pACKy9QzsxK1%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.10.58.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Understanding ARIMA models&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Letting the code choose every order automatically is convenient, but it is still worth understanding roughly how the model behaves.&lt;/li&gt;
&lt;li&gt;The constant &lt;span class=&quot;math inline&quot;&gt;\(c\)&lt;/span&gt; has an important effect on the long-term forecasts.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;If &lt;span class=&quot;math inline&quot;&gt;\(c=0\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(d=0\)&lt;/span&gt;, the long-term forecasts go to zero.&lt;/li&gt;
&lt;li&gt;If &lt;span class=&quot;math inline&quot;&gt;\(c=0\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(d=1\)&lt;/span&gt;, the long-term forecasts go to a non-zero constant.&lt;/li&gt;
&lt;li&gt;If &lt;span class=&quot;math inline&quot;&gt;\(c=0\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(d=2\)&lt;/span&gt;, the long-term forecasts follow a straight line.&lt;/li&gt;
&lt;li&gt;If &lt;span class=&quot;math inline&quot;&gt;\(c\neq0\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(d=0\)&lt;/span&gt;, the long-term forecasts go to the mean of the data.&lt;/li&gt;
&lt;li&gt;If &lt;span class=&quot;math inline&quot;&gt;\(c\neq0\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(d=1\)&lt;/span&gt;, the long-term forecasts follow a straight line.&lt;/li&gt;
&lt;li&gt;If &lt;span class=&quot;math inline&quot;&gt;\(c\neq0\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(d=2\)&lt;/span&gt;, the long-term forecasts follow a quadratic trend.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The value of &lt;span class=&quot;math inline&quot;&gt;\(d\)&lt;/span&gt; also affects the prediction intervals.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The higher the value of &lt;span class=&quot;math inline&quot;&gt;\(d\)&lt;/span&gt;, the more rapidly the prediction intervals widen.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
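&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For the case &lt;span class=&quot;math inline&quot;&gt;\(c\neq0\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(d=0\)&lt;/span&gt;, the level that the long-term forecasts approach can be computed directly from the coefficients. The sketch below is an added example that plugs in the rounded estimates printed for the &lt;span class=&quot;math inline&quot;&gt;\(\text{ARIMA}(2, 0, 1)\)&lt;/span&gt; fit above, so the result is approximate.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Estimates copied from the ARIMA(2,0,1) output shown earlier in this post
const &amp;lt;- 2.5623
ar    &amp;lt;- c(1.6764, -0.8034)

# With d = 0 and c != 0, long-term forecasts approach
# c / (1 - phi_1 - ... - phi_p); the MA terms do not affect this level
long_run_mean &amp;lt;- const / (1 - sum(ar))
round(long_run_mean, 2)   # about 20.18 (% of GDP)&lt;/code&gt;&lt;/pre&gt;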
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;ACF and PACF plots&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;It is usually not possible to tell from a time plot alone which values of &lt;span class=&quot;math inline&quot;&gt;\(p\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(q\)&lt;/span&gt; are appropriate for the data.&lt;/li&gt;
&lt;li&gt;The ACF and PACF plots can help with this.&lt;/li&gt;
&lt;li&gt;Recall that an ACF plot shows the autocorrelations, which measure the relationship between &lt;span class=&quot;math inline&quot;&gt;\(y_{t}\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(y_{t-k}\)&lt;/span&gt;.&lt;br /&gt;If &lt;span class=&quot;math inline&quot;&gt;\(y_{t}\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(y_{t-1}\)&lt;/span&gt; are correlated, then &lt;span class=&quot;math inline&quot;&gt;\(y_{t-1}\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(y_{t-2}\)&lt;/span&gt; must also be correlated.&lt;/li&gt;
&lt;li&gt;However, &lt;span class=&quot;math inline&quot;&gt;\(y_{t}\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(y_{t-2}\)&lt;/span&gt; might then be correlated simply because both are connected to &lt;span class=&quot;math inline&quot;&gt;\(y_{t-1}\)&lt;/span&gt;, rather than because of any direct dependence between them.&lt;/li&gt;
&lt;li&gt;Partial autocorrelations can be used to overcome this problem.&lt;/li&gt;
&lt;li&gt;They measure the relationship between &lt;span class=&quot;math inline&quot;&gt;\(y_{t}\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(y_{t-k}\)&lt;/span&gt; after removing the effects of the intermediate lags &lt;span class=&quot;math inline&quot;&gt;\(1, 2, \dots, k-1\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;The two quantities are computed with &lt;code&gt;ACF()&lt;/code&gt; and &lt;code&gt;PACF()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;global_economy %&amp;gt;% 
  filter(Code == &quot;EGY&quot;) %&amp;gt;% 
  ACF(Exports) %&amp;gt;% 
  autoplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.05.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cqSBrK/btrbq27H4LO/NKpUXBlZmuOHAoKgZEKI6K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cqSBrK/btrbq27H4LO/NKpUXBlZmuOHAoKgZEKI6K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cqSBrK/btrbq27H4LO/NKpUXBlZmuOHAoKgZEKI6K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcqSBrK%2Fbtrbq27H4LO%2FNKpUXBlZmuOHAoKgZEKI6K%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.05.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;global_economy %&amp;gt;% 
  filter(Code == &quot;EGY&quot;) %&amp;gt;% 
  PACF(Exports) %&amp;gt;% 
  autoplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.11.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Z67zw/btrblkhzhO8/bLPixXCh0GWdhMveQ3Szr1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Z67zw/btrblkhzhO8/bLPixXCh0GWdhMveQ3Szr1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Z67zw/btrblkhzhO8/bLPixXCh0GWdhMveQ3Szr1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FZ67zw%2FbtrblkhzhO8%2FbLPixXCh0GWdhMveQ3Szr1%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.11.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To draw the ACF and PACF plots together in one figure, use &lt;code&gt;gg_tsdisplay()&lt;/code&gt; as in the code below.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;global_economy %&amp;gt;% 
  filter(Code == &quot;EGY&quot;) %&amp;gt;% 
  gg_tsdisplay(Exports, plot_type = &quot;partial&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.17.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bEDGXL/btrboslPUiV/l05mYJ3IEkAL4xuLWo3VF1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bEDGXL/btrboslPUiV/l05mYJ3IEkAL4xuLWo3VF1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bEDGXL/btrboslPUiV/l05mYJ3IEkAL4xuLWo3VF1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbEDGXL%2FbtrboslPUiV%2Fl05mYJ3IEkAL4xuLWo3VF1%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.17.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;These plots can be interpreted as follows.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The data may follow an &lt;span class=&quot;math inline&quot;&gt;\(\text{ARIMA}(p, d, 0)\)&lt;/span&gt; model if the ACF and PACF plots of the differenced data show both of the following patterns:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;the ACF is exponentially decaying or sinusoidal, like &lt;span class=&quot;math inline&quot;&gt;\(\sin(x)\)&lt;/span&gt;;&lt;/li&gt;
&lt;li&gt;the PACF has a significant spike at lag &lt;span class=&quot;math inline&quot;&gt;\(p\)&lt;/span&gt;, but none beyond lag &lt;span class=&quot;math inline&quot;&gt;\(p\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The data may follow an &lt;span class=&quot;math inline&quot;&gt;\(\text{ARIMA}(0, d, q)\)&lt;/span&gt; model if the ACF and PACF plots of the differenced data show both of the following patterns:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;the PACF is exponentially decaying or sinusoidal, like &lt;span class=&quot;math inline&quot;&gt;\(\sin(x)\)&lt;/span&gt;;&lt;/li&gt;
&lt;li&gt;the ACF has a significant spike at lag &lt;span class=&quot;math inline&quot;&gt;\(q\)&lt;/span&gt;, but none beyond lag &lt;span class=&quot;math inline&quot;&gt;\(q\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Looking at the ACF and PACF plots above again from this point of view:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;the ACF resembles a decaying &lt;span class=&quot;math inline&quot;&gt;\(\sin(x)\)&lt;/span&gt; pattern, and&lt;/li&gt;
&lt;li&gt;the last significant spike in the PACF is at lag 4.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;An &lt;span class=&quot;math inline&quot;&gt;\(\text{ARIMA}(4, 0, 0)\)&lt;/span&gt; model is therefore a plausible candidate for these data.&lt;/li&gt;
&lt;li&gt;To fit a model with manually specified orders instead of the automatic selection in &lt;code&gt;ARIMA()&lt;/code&gt;, use &lt;code&gt;pdq()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit2 &amp;lt;- global_economy %&amp;gt;% 
  filter(Code == &quot;EGY&quot;) %&amp;gt;% 
  model(ARIMA(Exports ~ pdq(4, 0, 0)))

fit2 %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Series: Exports 
## Model: ARIMA(4,0,0) w/ mean 
## 
## Coefficients:
##          ar1      ar2     ar3      ar4  constant
##       0.9861  -0.1715  0.1807  -0.3283    6.6922
## s.e.  0.1247   0.1865  0.1865   0.1273    0.3562
## 
## sigma^2 estimated as 7.885:  log likelihood=-140.53
## AIC=293.05   AICc=294.7   BIC=305.41&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
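&lt;li&gt;The AICc of the &lt;span class=&quot;math inline&quot;&gt;\(\text{ARIMA}(2, 0, 1)\)&lt;/span&gt; model (294.29) is slightly lower than that of the &lt;span class=&quot;math inline&quot;&gt;\(\text{ARIMA}(4, 0, 0)\)&lt;/span&gt; model (294.7), so the automatically selected model can be considered slightly better.&lt;/li&gt;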
&lt;/ul&gt;
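&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The ACF and PACF signatures described in this section can be checked against theoretical values. The sketch below is an added example using &lt;code&gt;ARMAacf()&lt;/code&gt; from base R (the stats package) rather than the fpp3 functions; the coefficient values 0.7 and 0.8 are hypothetical.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Theoretical ACF and PACF of an AR(1) with phi = 0.7 (hypothetical value)
ar1_acf  &amp;lt;- ARMAacf(ar = 0.7, lag.max = 4)               # lags 0 to 4
ar1_pacf &amp;lt;- ARMAacf(ar = 0.7, lag.max = 4, pacf = TRUE)  # lags 1 to 4

# Theoretical ACF of an MA(1) with theta = 0.8 (hypothetical value)
ma1_acf  &amp;lt;- ARMAacf(ma = 0.8, lag.max = 4)               # lags 0 to 4

round(ar1_acf, 3)    # AR(1) ACF decays geometrically as 0.7^k
round(ar1_pacf, 3)   # AR(1) PACF has a single spike at lag 1
round(ma1_acf, 3)    # MA(1) ACF has a single spike at lag 1&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Sample ACF and PACF plots from real data are noisier than these theoretical values, which is why the patterns are treated as a guide rather than a rule.&lt;/li&gt;
&lt;/ul&gt;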
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Estimation and order selection&lt;/h2&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Maximum likelihood estimation&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Once the model order (&lt;span class=&quot;math inline&quot;&gt;\(p\)&lt;/span&gt;, &lt;span class=&quot;math inline&quot;&gt;\(d\)&lt;/span&gt;, &lt;span class=&quot;math inline&quot;&gt;\(q\)&lt;/span&gt;) has been identified, the constant &lt;span class=&quot;math inline&quot;&gt;\(c\)&lt;/span&gt; and the &lt;span class=&quot;math inline&quot;&gt;\(\theta\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(\phi\)&lt;/span&gt; parameters need to be estimated.&lt;/li&gt;
&lt;li&gt;R estimates ARIMA models by maximum likelihood estimation (MLE).&lt;/li&gt;
&lt;li&gt;For ARIMA models, MLE is similar to the least squares estimates obtained by minimising the sum of squared errors shown below.&lt;/li&gt;
&lt;/ul&gt;
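&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\sum_{t=1}^{T} \epsilon_{t}^{2}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;When estimating the parameters, the &lt;code&gt;ARIMA()&lt;/code&gt; function in this library looks for the values that maximise the log likelihood.&lt;/li&gt;
&lt;/ul&gt;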
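&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The least-squares idea can be sketched for the simplest case, an &lt;span class=&quot;math inline&quot;&gt;\(\text{AR}(1)\)&lt;/span&gt; model without a constant. This is an added base R illustration on simulated data (the true coefficient 0.5 and the seed are assumptions); it shows only the sum-of-squares criterion and is not how &lt;code&gt;ARIMA()&lt;/code&gt; itself estimates parameters.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;set.seed(2021)                     # arbitrary seed
y &amp;lt;- as.numeric(arima.sim(model = list(ar = 0.5), n = 300))  # simulated AR(1)
n &amp;lt;- length(y)

# Sum of squared one-step errors for a candidate value of phi
sse &amp;lt;- function(phi) sum((y[-1] - phi * y[-n])^2)

# Numerical minimiser of the sum of squares ...
phi_numeric &amp;lt;- optimize(sse, interval = c(-0.99, 0.99))$minimum
# ... and the closed-form least squares solution for comparison
phi_closed &amp;lt;- sum(y[-1] * y[-n]) / sum(y[-n]^2)

c(numeric = phi_numeric, closed_form = phi_closed)&lt;/code&gt;&lt;/pre&gt;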
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Information Criteria&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Akaike&apos;s Information Criterion (AIC) is also useful for determining the order of an ARIMA model. In the formula that follows:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\(L\)&lt;/span&gt; is the likelihood mentioned just above;&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\(k=1\)&lt;/span&gt; if &lt;span class=&quot;math inline&quot;&gt;\(c\neq0\)&lt;/span&gt;, and &lt;span class=&quot;math inline&quot;&gt;\(k=0\)&lt;/span&gt; if &lt;span class=&quot;math inline&quot;&gt;\(c=0\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\text{AIC} = -2\log(L) + 2(p+q+k+1)\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For ARIMA models, the corrected AIC (AICc) is as follows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\text{AIC}_{c}=\text{AIC}+\frac{2(p+q+k+1)(p+q+k+2)}{T-p-q-k-2}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
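&lt;li&gt;Note that model selection with the AICc is only meaningful between ARIMA models with the same order of differencing, because differencing changes the data on which the likelihood is computed.&lt;/li&gt;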
&lt;/ul&gt;
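&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The two formulas can be checked against the &lt;code&gt;report()&lt;/code&gt; outputs shown earlier. The sketch below is an added example: the helper names &lt;code&gt;arima_aic()&lt;/code&gt; and &lt;code&gt;arima_aicc()&lt;/code&gt; are hypothetical, and &lt;span class=&quot;math inline&quot;&gt;\(T = 58\)&lt;/span&gt; assumes one observation per year from 1960 to 2017. Small differences from the printed values come from the rounded log likelihoods.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# AIC and AICc from the formulas above; k = 1 if the model has a constant, else 0
arima_aic  &amp;lt;- function(loglik, p, q, k) -2 * loglik + 2 * (p + q + k + 1)
arima_aicc &amp;lt;- function(loglik, p, q, k, T_obs) {
  arima_aic(loglik, p, q, k) +
    2 * (p + q + k + 1) * (p + q + k + 2) / (T_obs - p - q - k - 2)
}

T_obs &amp;lt;- 58   # assumed: one observation per year, 1960 to 2017

# Log likelihoods copied from the two report() outputs earlier in this post
aicc_201 &amp;lt;- arima_aicc(-141.57, p = 2, q = 1, k = 1, T_obs = T_obs)
aicc_400 &amp;lt;- arima_aicc(-140.53, p = 4, q = 0, k = 1, T_obs = T_obs)
round(c(ARIMA_2_0_1 = aicc_201, ARIMA_4_0_0 = aicc_400), 2)&lt;/code&gt;&lt;/pre&gt;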
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;ARIMA modelling in &lt;code&gt;fable&lt;/code&gt;&lt;/h2&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;How does &lt;code&gt;ARIMA()&lt;/code&gt; work?&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
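&lt;li&gt;The &lt;code&gt;ARIMA()&lt;/code&gt; function in the &lt;code&gt;fable&lt;/code&gt; library uses a variation of the Hyndman-Khandakar algorithm, which combines unit root tests, minimisation of the AICc, and MLE.&lt;/li&gt;
&lt;li&gt;Summarised very briefly, the Hyndman-Khandakar algorithm for automatic ARIMA modelling works as follows.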
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li style=&quot;list-style-type: none;&quot;&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The number of differences (&lt;span class=&quot;math inline&quot;&gt;\(0 \le d \le 2\)&lt;/span&gt;) is determined using repeated KPSS tests.&lt;/li&gt;
&lt;/ol&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; start=&quot;2&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;데이터를 &lt;span class=&quot;math inline&quot;&gt;\(d\)&lt;/span&gt;번 차분한 후 AICc를 최소화하여 &lt;span class=&quot;math inline&quot;&gt;\(p\)&lt;/span&gt;와 &lt;span class=&quot;math inline&quot;&gt;\(q\)&lt;/span&gt;를 결정합니다. (stepwise exploration)&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
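&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Step 1 can be tried by hand. The sketch below is only an illustration, not the internals of &lt;code&gt;ARIMA()&lt;/code&gt;: it assumes the &lt;code&gt;unitroot_kpss&lt;/code&gt; and &lt;code&gt;unitroot_ndiffs&lt;/code&gt; features that are loaded with &lt;code&gt;fpp3&lt;/code&gt;, and it borrows the Central African Republic export series used later in this post. The object names &lt;code&gt;caf&lt;/code&gt; and &lt;code&gt;caf_ndiffs&lt;/code&gt; are my own.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)

# Series used for illustration: Central African Republic exports
caf &amp;lt;- global_economy %&amp;gt;% 
  filter(Code == &quot;CAF&quot;)

# KPSS test on the original series
# (null hypothesis: the series is stationary, so a small p-value suggests differencing)
caf %&amp;gt;% 
  features(Exports, unitroot_kpss)

# The same test after one difference
caf %&amp;gt;% 
  features(difference(Exports), unitroot_kpss)

# Number of first differences suggested by repeated KPSS tests
caf_ndiffs &amp;lt;- caf %&amp;gt;% 
  features(Exports, unitroot_ndiffs)
caf_ndiffs&lt;/code&gt;&lt;/pre&gt;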
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Modelling procedure&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In general, fitting an ARIMA model to a time series follows the procedure below.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li style=&quot;list-style-type: none;&quot;&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Plot the data and identify any unusual observations.&lt;/li&gt;
&lt;/ol&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; start=&quot;2&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;If necessary, transform the data to stabilise the variance. (Box-Cox transformation)&lt;/li&gt;
&lt;/ol&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; start=&quot;3&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;If the data are non-stationary, take first differences until they are stationary.&lt;/li&gt;
&lt;/ol&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; start=&quot;4&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Examine the ACF and PACF to judge whether an &lt;span class=&quot;math inline&quot;&gt;\(\text{ARIMA}(p, d, 0)\)&lt;/span&gt; or an &lt;span class=&quot;math inline&quot;&gt;\(\text{ARIMA}(0, d, q)\)&lt;/span&gt; model is appropriate.&lt;/li&gt;
&lt;/ol&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; start=&quot;5&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Fit other candidate models as well, and use the AICc to search for a better one.&lt;/li&gt;
&lt;/ol&gt;
&lt;ol style=&quot;list-style-type: decimal;&quot; start=&quot;6&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Check the residuals by plotting their ACF and running a portmanteau test. If they show no particular pattern (white noise), calculate forecasts; otherwise, try a modified model.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Now for a worked example. As before, we use the &lt;code&gt;global_economy&lt;/code&gt; data, this time to look at the exports of the Central African Republic.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;global_economy %&amp;gt;% 
  filter(Code == &quot;CAF&quot;) %&amp;gt;% 
  autoplot(Exports) +
  labs(y = &quot;% of GDP&quot;, title = &quot;Central African Republic exports&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.25.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cROetb/btrbrUaPmLu/W9aUeOEg6H6bIEfgxUSkkK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cROetb/btrbrUaPmLu/W9aUeOEg6H6bIEfgxUSkkK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cROetb/btrbrUaPmLu/W9aUeOEg6H6bIEfgxUSkkK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcROetb%2FbtrbrUaPmLu%2FW9aUeOEg6H6bIEfgxUSkkK%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.25.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
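&lt;li&gt;The plot shows a downward trend, so the series is non-stationary.&lt;/li&gt;
&lt;li&gt;There is no evidence of changing variance, so we skip the Box-Cox transformation.&lt;/li&gt;
&lt;li&gt;To address the non-stationarity, we take a first difference and use &lt;code&gt;gg_tsdisplay()&lt;/code&gt; to look at the differenced series together with its ACF and PACF plots.&lt;/li&gt;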
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;global_economy %&amp;gt;% 
  filter(Code == &quot;CAF&quot;) %&amp;gt;% 
  gg_tsdisplay(difference(Exports, 1), plot_type = &quot;partial&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;pgsql&quot;&gt;&lt;code&gt;## Warning: Removed 1 row(s) containing missing values (geom_path).&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;pgsql&quot;&gt;&lt;code&gt;## Warning: Removed 1 rows containing missing values (geom_point).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.36.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/RhKJo/btrbrTXidoh/UE0S76UELhHsW6rjFkegbK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/RhKJo/btrbrTXidoh/UE0S76UELhHsW6rjFkegbK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/RhKJo/btrbrTXidoh/UE0S76UELhHsW6rjFkegbK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FRhKJo%2FbtrbrTXidoh%2FUE0S76UELhHsW6rjFkegbK%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.36.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The differenced series now looks roughly stationary.&lt;br /&gt;The ACF plot suggests an ARIMA(0, 1, 3) model with &lt;span class=&quot;math inline&quot;&gt;\(q=3\)&lt;/span&gt;, and the PACF plot suggests an ARIMA(2, 1, 0) model with &lt;span class=&quot;math inline&quot;&gt;\(p=2\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;In addition to these two models, we fit one whose orders are chosen automatically by the default stepwise procedure (stepwise) and one that searches a larger model space (search).&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;caf_fit &amp;lt;- global_economy %&amp;gt;% 
  filter(Code == &quot;CAF&quot;) %&amp;gt;% 
  model(
    arima013 = ARIMA(Exports ~ pdq(0, 1, 3)),
    arima210 = ARIMA(Exports ~ pdq(2, 1, 0)),
    stepwise = ARIMA(Exports),
    search = ARIMA(Exports, stepwise = FALSE)
  )

caf_fit %&amp;gt;% 
  pivot_longer(
    cols = 2:5,
    names_to = &quot;ModelName&quot;,
    values_to = &quot;Orders&quot;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A mable: 4 x 3
## # Key:     Country, ModelName [4]
##   Country                  ModelName         Orders
##   &amp;lt;fct&amp;gt;                    &amp;lt;chr&amp;gt;            &amp;lt;model&amp;gt;
## 1 Central African Republic arima013  &amp;lt;ARIMA(0,1,3)&amp;gt;
## 2 Central African Republic arima210  &amp;lt;ARIMA(2,1,0)&amp;gt;
## 3 Central African Republic stepwise  &amp;lt;ARIMA(2,1,2)&amp;gt;
## 4 Central African Republic search    &amp;lt;ARIMA(3,1,0)&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;caf_fit %&amp;gt;% 
  select(stepwise) %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Series: Exports 
## Model: ARIMA(2,1,2) 
## 
## Coefficients:
##           ar1      ar2     ma1     ma2
##       -0.6741  -0.7142  0.2468  0.4831
## s.e.   0.1821   0.2037  0.2531  0.2576
## 
## sigma^2 estimated as 6.416:  log likelihood=-132.1
## AIC=274.2   AICc=275.37   BIC=284.41&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;caf_fit %&amp;gt;% 
  select(search) %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Series: Exports 
## Model: ARIMA(3,1,0) 
## 
## Coefficients:
##           ar1      ar2     ar3
##       -0.4419  -0.1850  0.2055
## s.e.   0.1295   0.1385  0.1274
## 
## sigma^2 estimated as 6.519:  log likelihood=-133
## AIC=274   AICc=274.77   BIC=282.18&lt;/code&gt;&lt;/pre&gt;
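&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As a sanity check, the AIC and AICc formulas given earlier in this post reproduce the values in the two reports above. The helper &lt;code&gt;arima_ic()&lt;/code&gt; is a hypothetical name of my own, and the effective sample size &lt;span class=&quot;math inline&quot;&gt;\(T = 57\)&lt;/span&gt; (58 annual observations minus the one lost to first differencing) is my assumption. The log-likelihoods are the rounded values printed in the reports, so the results agree only up to rounding.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Hypothetical helper: AIC and AICc of an ARIMA model from its log-likelihood
arima_ic &amp;lt;- function(loglik, p, q, k, n_obs) {
  n_par &amp;lt;- p + q + k + 1                  # the term (p + q + k + 1) in the formulas
  aic &amp;lt;- -2 * loglik + 2 * n_par
  aicc &amp;lt;- aic + 2 * n_par * (n_par + 1) / (n_obs - p - q - k - 2)
  c(AIC = aic, AICc = aicc)
}

# ARIMA(3,1,0) without a constant (k = 0); n_obs = 57 is an assumption
ic_search &amp;lt;- arima_ic(loglik = -133, p = 3, q = 0, k = 0, n_obs = 57)

# ARIMA(2,1,2) without a constant (k = 0)
ic_stepwise &amp;lt;- arima_ic(loglik = -132.1, p = 2, q = 2, k = 0, n_obs = 57)

round(ic_search, 2)
round(ic_stepwise, 2)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For ARIMA(3, 1, 0) this gives &lt;span class=&quot;math inline&quot;&gt;\(\text{AIC} = 266 + 8 = 274\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(\text{AIC}_{c} = 274 + 40/52 \approx 274.77\)&lt;/span&gt;, matching the report.&lt;/li&gt;
&lt;/ul&gt;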
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;caf_fit %&amp;gt;% 
  glance() %&amp;gt;% 
  arrange(AICc) %&amp;gt;% 
  select(.model, AICc)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 4 x 2
##   .model    AICc
##   &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt;
## 1 search    275.
## 2 arima210  275.
## 3 arima013  275.
## 4 stepwise  275.&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The four models have almost identical AICc values.&lt;br /&gt;By a narrow margin, the ARIMA(3, 1, 0) model fitted under the name search has the lowest AICc.&lt;/li&gt;
&lt;li&gt;We therefore take that model and use &lt;code&gt;gg_tsresiduals()&lt;/code&gt; to check the residual ACF and whether the residuals look like white noise.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;caf_fit %&amp;gt;% 
  select(search) %&amp;gt;% 
  gg_tsresiduals()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.45.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/lbFNY/btrbkhyAO69/DsnEyGa5vapApfG4dwbebk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/lbFNY/btrbkhyAO69/DsnEyGa5vapApfG4dwbebk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/lbFNY/btrbkhyAO69/DsnEyGa5vapApfG4dwbebk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FlbFNY%2FbtrbkhyAO69%2FDsnEyGa5vapApfG4dwbebk%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.45.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
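&lt;li&gt;We can also run a portmanteau test with &lt;code&gt;ljung_box&lt;/code&gt;.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;lag&lt;/code&gt;: the number of lagged autocorrelation coefficients used in the statistic&lt;/li&gt;
&lt;li&gt;&lt;code&gt;dof&lt;/code&gt;: the degrees of freedom of the fitted model, i.e. the number of estimated coefficients (here &lt;span class=&quot;math inline&quot;&gt;\(p+q=3\)&lt;/span&gt; for ARIMA(3, 1, 0))&lt;/li&gt;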
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;caf_fit %&amp;gt;% 
  augment() %&amp;gt;% 
  filter(.model == &quot;search&quot;) %&amp;gt;% 
  features(
    .var = .innov,
    features = ljung_box,
    lag = 10,
    dof = 3
  )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 1 x 4
##   Country                  .model lb_stat lb_pvalue
##   &amp;lt;fct&amp;gt;                    &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
## 1 Central African Republic search    5.75     0.569&lt;/code&gt;&lt;/pre&gt;
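&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As an aside, the Ljung-Box statistic is simple enough to compute by hand. The following is a minimal base R sketch on simulated white noise (not on the CAF residuals); the simulated series and the variable names are my own, and the result is cross-checked against &lt;code&gt;Box.test()&lt;/code&gt; from the &lt;code&gt;stats&lt;/code&gt; package. The p-value comes from a chi-squared distribution with &lt;code&gt;lag - dof&lt;/code&gt; degrees of freedom.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;set.seed(1)
e &amp;lt;- rnorm(57)   # simulated stand-in for model residuals
lag &amp;lt;- 10        # number of autocorrelations used
dof &amp;lt;- 3         # number of estimated ARMA coefficients (p + q)

n &amp;lt;- length(e)
r &amp;lt;- acf(e, lag.max = lag, plot = FALSE)$acf[-1]   # r_1, ..., r_lag (lag 0 dropped)

# Ljung-Box statistic: Q = n (n + 2) * sum over k of r_k^2 / (n - k)
Q &amp;lt;- n * (n + 2) * sum(r^2 / (n - seq_len(lag)))
p_value &amp;lt;- pchisq(Q, df = lag - dof, lower.tail = FALSE)

# Cross-check against the built-in test
bt &amp;lt;- Box.test(e, lag = lag, type = &quot;Ljung-Box&quot;, fitdf = dof)
c(by_hand = Q, box_test = unname(bt$statistic))
c(by_hand = p_value, box_test = bt$p.value)&lt;/code&gt;&lt;/pre&gt;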
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
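&lt;li&gt;The p-value (0.569) is larger than the usual 0.05 significance level, so we cannot reject the null hypothesis that the residuals are white noise; they show no significant pattern.&lt;/li&gt;
&lt;li&gt;Finally, forecasts can be obtained with the &lt;code&gt;forecast()&lt;/code&gt; function.&lt;/li&gt;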
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;caf_fit %&amp;gt;% 
  select(Country, search) %&amp;gt;% 
  forecast(h = 5) %&amp;gt;% 
  autoplot(global_economy)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.51.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/x79Id/btrbuI8HWkL/4hs2ytTgn1YghjrjptrGH1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/x79Id/btrbuI8HWkL/4hs2ytTgn1YghjrjptrGH1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/x79Id/btrbuI8HWkL/4hs2ytTgn1YghjrjptrGH1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fx79Id%2FbtrbuI8HWkL%2F4hs2ytTgn1YghjrjptrGH1%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;477&quot; data-filename=&quot;스크린샷 2021-08-06 오후 6.11.51.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
</description>
      <category>time-series (tidy approach)</category>
      <category>AR</category>
      <category>ARIMA</category>
      <category>difference</category>
      <category>Fable</category>
      <category>fpp3</category>
      <category>Ma</category>
      <category>Time Series</category>
      <category>시계열</category>
      <category>정상성</category>
      <category>차분</category>
      <author>택2</author>
      <guid isPermaLink="true">https://rstatistics.tistory.com/80</guid>
      <comments>https://rstatistics.tistory.com/80#entry80comment</comments>
      <pubDate>Fri, 6 Aug 2021 18:14:10 +0900</pubDate>
    </item>
    <item>
      <title>[R] 6. Exponential smoothing</title>
      <link>https://rstatistics.tistory.com/79</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;
&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
	MathJax.Hub.Config({
    	tex2jax: {
    		inlineMath: [ ['$','$'], ['\\(','\\)'] ],
    		processEscapes: true
		},
		TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } }
	});
&lt;/script&gt;
&lt;script src=&quot;https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;&lt;/script&gt;
&lt;/p&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)
#library(fable)&lt;/code&gt;&lt;/pre&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Simple exponential smoothing&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
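&lt;li&gt;SES (simple exponential smoothing) is the simplest of the exponential smoothing methods.&lt;/li&gt;
&lt;li&gt;It is suitable for forecasting data with no clear trend or seasonal pattern.&lt;/li&gt;
&lt;li&gt;With the naïve method, every forecast of the future equals the last observed value of the series.&lt;/li&gt;
&lt;li&gt;In other words, the naïve method assumes that the most recent observation is the only important one and that earlier observations carry no information about the future.&lt;/li&gt;
&lt;li&gt;It can be thought of as a weighted average in which all of the weight is given to the last observation.&lt;/li&gt;
&lt;li&gt;At the other extreme, the average method gives equal weight to every observation, so all forecasts equal the simple mean of the observed data:&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\hat{y}_{T+h|T} = \frac{1}{T}\sum^{T}_{t=1}y_{t}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;However, it is often sensible to give larger weights to recent observations than to observations from the distant past.&lt;/li&gt;
&lt;li&gt;This is exactly the idea behind SES.&lt;/li&gt;
&lt;li&gt;Forecasts are computed as weighted averages, with weights that decrease exponentially as the observations come from further in the past.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\hat{y}_{T+1|T} = \alpha y_{T} + \alpha(1-\alpha)y_{T-1} + \alpha(1-\alpha)^{2}y_{T-2} + \cdots\]&lt;/span&gt;&lt;/p&gt;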
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In the equation above, &lt;span class=&quot;math inline&quot;&gt;\(\alpha\)&lt;/span&gt; is the smoothing parameter, with a value between 0 and 1; it controls the rate at which the weights decrease.&lt;/li&gt;
&lt;li&gt;We will not go through the full derivation here.&lt;/li&gt;
&lt;li&gt;SES produces flat forecasts.&lt;br /&gt;That is, every forecast takes the same value, equal to the last level component.&lt;/li&gt;
&lt;li&gt;Keep in mind that such forecasts are only suitable when the series has no trend or seasonal component!&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\hat{y}_{T+h|T} = \hat{y}_{T+1|T} = l_{T}\]&lt;/span&gt;&lt;/p&gt;
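&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To make the weighted-average idea concrete, here is a small base R sketch. It uses the standard component form of SES, &lt;span class=&quot;math inline&quot;&gt;\(l_{t} = \alpha y_{t} + (1-\alpha)l_{t-1}\)&lt;/span&gt;, which is not written out in this post. The function name &lt;code&gt;ses_level()&lt;/code&gt;, the toy data, and the values of &lt;span class=&quot;math inline&quot;&gt;\(\alpha\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(l_{0}\)&lt;/span&gt; are made up for illustration.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Hypothetical helper: run the SES level recursion and return the final level l_T
ses_level &amp;lt;- function(y, alpha, l0) {
  level &amp;lt;- l0
  for (obs in y) {
    level &amp;lt;- alpha * obs + (1 - alpha) * level
  }
  level
}

y &amp;lt;- c(3, 5, 4, 6)   # made-up toy series
alpha &amp;lt;- 0.5
l0 &amp;lt;- 4

# Flat forecast: every horizon h gets the same value l_T
l_T &amp;lt;- ses_level(y, alpha, l0)

# Same number as an exponentially weighted average of the observations;
# with a finite sample the initial level keeps the leftover weight (1 - alpha)^T
n &amp;lt;- length(y)
w &amp;lt;- alpha * (1 - alpha)^(0:(n - 1))          # weights for y_T, y_{T-1}, ...
weighted_avg &amp;lt;- sum(w * rev(y)) + (1 - alpha)^n * l0

c(recursion = l_T, weighted_average = weighted_avg)   # both are 5.0625&lt;/code&gt;&lt;/pre&gt;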
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Optimisation&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
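&lt;li&gt;Just as the coefficients of a regression model are estimated by minimising the sum of squared residuals, the parameters here are estimated by minimising the SSE below.&lt;/li&gt;
&lt;li&gt;The unknown parameter &lt;span class=&quot;math inline&quot;&gt;\(\alpha\)&lt;/span&gt; and the initial level &lt;span class=&quot;math inline&quot;&gt;\(l_{0}\)&lt;/span&gt; could be set by the analyst in advance, but estimating them from the observed data is more reliable.&lt;/li&gt;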
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\text{SSE} = \sum^{T}_{t=1}\bigg(y_{t} - \hat{y}_{t|t-1}\bigg)^{2} = \sum^{T}_{t=1}\epsilon_{t}^{2}\]&lt;/span&gt;&lt;/p&gt;
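&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The SSE minimisation can be sketched with base R&amp;rsquo;s &lt;code&gt;optim()&lt;/code&gt;. This only illustrates the idea and is not what &lt;code&gt;ETS()&lt;/code&gt; does internally; &lt;code&gt;ses_sse()&lt;/code&gt; and the toy series are hypothetical.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Sum of squared one-step-ahead errors for SES, given par = c(alpha, l0)
ses_sse &amp;lt;- function(par, y) {
  alpha &amp;lt;- par[1]
  level &amp;lt;- par[2]
  sse &amp;lt;- 0
  for (obs in y) {
    err &amp;lt;- obs - level               # the one-step forecast is the current level
    sse &amp;lt;- sse + err^2
    level &amp;lt;- level + alpha * err     # same as alpha * obs + (1 - alpha) * level
  }
  sse
}

# Made-up toy series
y &amp;lt;- c(39.0, 46.2, 19.8, 24.7, 25.1, 22.6, 26.0, 23.4, 23.1, 23.8)

# Minimise the SSE over alpha in [0, 1] and an unrestricted initial level
start &amp;lt;- c(0.5, y[1])
fit_sse &amp;lt;- optim(start, ses_sse, y = y, method = &quot;L-BFGS-B&quot;,
                 lower = c(0, -Inf), upper = c(1, Inf))

fit_sse$par     # estimated alpha and l0
fit_sse$value   # minimised SSE&lt;/code&gt;&lt;/pre&gt;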
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For the example, we use the &lt;code&gt;global_economy&lt;/code&gt; data from the &lt;code&gt;tsibbledata&lt;/code&gt; package, as shown below.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;algeria_economy &amp;lt;- global_economy %&amp;gt;% 
  filter(Country == &quot;Algeria&quot;)

algeria_economy %&amp;gt;% 
  autoplot(Exports) +
  labs(y = &quot;% of GDP&quot;, title = &quot;Exports: Algeria&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.52.44.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/kOi2U/btraLaZVxMI/UqH9RcGdsVEOWLVsmDblv1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/kOi2U/btraLaZVxMI/UqH9RcGdsVEOWLVsmDblv1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/kOi2U/btraLaZVxMI/UqH9RcGdsVEOWLVsmDblv1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FkOi2U%2FbtraLaZVxMI%2FUqH9RcGdsVEOWLVsmDblv1%2Fimg.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.52.44.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The function that fits exponential smoothing models is &lt;code&gt;ETS()&lt;/code&gt;.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The form of each term (error, trend, season) can be specified separately.&lt;/li&gt;
&lt;li&gt;&amp;ldquo;A&amp;rdquo;: additive; &amp;ldquo;M&amp;rdquo;: multiplicative&lt;/li&gt;
&lt;li&gt;&amp;ldquo;Ad&amp;rdquo;, &amp;ldquo;Md&amp;rdquo;: damped variants&lt;/li&gt;
&lt;li&gt;&amp;ldquo;N&amp;rdquo;: none&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# No trend or seasonality here, so both the trend and season methods are &quot;N&quot;
fit &amp;lt;- algeria_economy %&amp;gt;% 
  model(ETS(formula = Exports ~ error(&quot;A&quot;) + trend(&quot;N&quot;) + season(&quot;N&quot;)))

fit %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;shell&quot;&gt;&lt;code&gt;## Series: Exports 
## Model: ETS(A,N,N) 
##   Smoothing parameters:
##     alpha = 0.8399875 
## 
##   Initial states:
##    l[0]
##  39.539
## 
##   sigma^2:  35.6301
## 
##      AIC     AICc      BIC 
## 446.7154 447.1599 452.8968&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
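&lt;li&gt;The smoothing parameter &lt;span class=&quot;math inline&quot;&gt;\(\alpha\)&lt;/span&gt; is estimated at about 0.84, and the initial level &lt;span class=&quot;math inline&quot;&gt;\(l_{0}\)&lt;/span&gt; obtained by minimising the SSE is about 39.5.&lt;/li&gt;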
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fc &amp;lt;- fit %&amp;gt;% 
  forecast(h = 5)

fc %&amp;gt;% 
  autoplot(algeria_economy) +
  geom_line(data = fit %&amp;gt;% augment(), aes(y = .fitted), color = &quot;#D55E00&quot;) +
  labs(y = &quot;% of GDP&quot;, title = &quot;Exports: Algeria&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.52.53.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bCTDKu/btraJArrQi0/MZlRZ6IjLXvY6cp53hYYj1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bCTDKu/btraJArrQi0/MZlRZ6IjLXvY6cp53hYYj1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bCTDKu/btraJArrQi0/MZlRZ6IjLXvY6cp53hYYj1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbCTDKu%2FbtraJArrQi0%2FMZlRZ6IjLXvY6cp53hYYj1%2Fimg.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.52.53.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Methods with trend&lt;/h2&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Holt&amp;rsquo;s linear trend method&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Holt extended simple exponential smoothing so that it can forecast data with a trend.&lt;/li&gt;
&lt;li&gt;The method involves a forecast equation and two smoothing equations.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Forecast equation: &lt;span class=&quot;math inline&quot;&gt;\(\hat{y}_{t+h|t} = l_t + hb_{t}\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Level equation: &lt;span class=&quot;math inline&quot;&gt;\(l_{t} = \alpha y_{t} + (1-\alpha)(l_{t-1}+b_{t-1})\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Trend equation: &lt;span class=&quot;math inline&quot;&gt;\(b_{t} = \beta^{*}(l_{t} - l_{t-1})+(1-\beta^{*})b_{t-1}\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\(l_{t}\)&lt;/span&gt; is the estimate of the level at time &lt;span class=&quot;math inline&quot;&gt;\(t\)&lt;/span&gt;, and &lt;span class=&quot;math inline&quot;&gt;\(b_{t}\)&lt;/span&gt; is the estimate of the trend (slope) at time &lt;span class=&quot;math inline&quot;&gt;\(t\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;As before, the smoothing parameters &lt;span class=&quot;math inline&quot;&gt;\(\alpha\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(\beta^{*}\)&lt;/span&gt; each take a value between 0 and 1.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The forecast function is therefore no longer flat but follows a linear trend: the &lt;span class=&quot;math inline&quot;&gt;\(h\)&lt;/span&gt;-step-ahead forecast is the last estimated level plus &lt;span class=&quot;math inline&quot;&gt;\(h\)&lt;/span&gt; times the last estimated trend.&lt;/li&gt;
&lt;li&gt;The example below shows Australia&amp;rsquo;s annual population from 1960 to 2017.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_economy &amp;lt;- global_economy %&amp;gt;% 
  filter(Code == &quot;AUS&quot;) %&amp;gt;% 
  mutate(Pop = Population / 1000000)

aus_economy&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 58 x 10 [1Y]
## # Key:       Country [1]
##    Country   Code   Year       GDP Growth   CPI Imports Exports Population   Pop
##    &amp;lt;fct&amp;gt;     &amp;lt;fct&amp;gt; &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
##  1 Australia AUS    1960   1.86e10  NA     7.96    14.1    13.0   10276477  10.3
##  2 Australia AUS    1961   1.96e10   2.49  8.14    15.0    12.4   10483000  10.5
##  3 Australia AUS    1962   1.99e10   1.30  8.12    12.6    13.9   10742000  10.7
##  4 Australia AUS    1963   2.15e10   6.21  8.17    13.8    13.0   10950000  11.0
##  5 Australia AUS    1964   2.38e10   6.98  8.40    13.8    14.9   11167000  11.2
##  6 Australia AUS    1965   2.59e10   5.98  8.69    15.3    13.2   11388000  11.4
##  7 Australia AUS    1966   2.73e10   2.38  8.98    15.1    12.9   11651000  11.7
##  8 Australia AUS    1967   3.04e10   6.30  9.29    13.9    12.9   11799000  11.8
##  9 Australia AUS    1968   3.27e10   5.10  9.52    14.5    12.3   12009000  12.0
## 10 Australia AUS    1969   3.66e10   7.04  9.83    13.3    12.0   12263000  12.3
## # &amp;hellip; with 48 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_economy %&amp;gt;% 
  autoplot(Pop) +
  labs(y = &quot;Millions&quot;, title = &quot;Australian population&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.00.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/pQ6Ou/btraKD2yEaO/WZL5iDfyUPDXWyqHeAf9ik/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/pQ6Ou/btraKD2yEaO/WZL5iDfyUPDXWyqHeAf9ik/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/pQ6Ou/btraKD2yEaO/WZL5iDfyUPDXWyqHeAf9ik/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FpQ6Ou%2FbtraKD2yEaO%2FWZL5iDfyUPDXWyqHeAf9ik%2Fimg.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.00.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;We now apply Holt&amp;rsquo;s linear trend method to this series.&lt;/li&gt;
&lt;li&gt;Again we use the &lt;code&gt;ETS()&lt;/code&gt; function, this time with an additive (&amp;ldquo;A&amp;rdquo;) trend term.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit &amp;lt;- aus_economy %&amp;gt;% 
  model(AAN = ETS(formula = Pop ~ error(&quot;A&quot;) + trend(&quot;A&quot;) + season(&quot;N&quot;)))

fit %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;shell&quot;&gt;&lt;code&gt;## Series: Pop 
## Model: ETS(A,A,N) 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.3266366 
## 
##   Initial states:
##      l[0]      b[0]
##  10.05414 0.2224818
## 
##   sigma^2:  0.0041
## 
##       AIC      AICc       BIC 
## -76.98569 -75.83184 -66.68347&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The estimated smoothing parameters are &lt;span class=&quot;math inline&quot;&gt;\(\hat{\alpha} = 0.9999\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(\hat{\beta}^{*} = 0.3266\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fc &amp;lt;- fit %&amp;gt;% 
  forecast(h = 5)

fc %&amp;gt;% 
  autoplot(aus_economy) +
  geom_line(data = fit %&amp;gt;% augment(), aes(y = .fitted), color = &quot;#D55E00&quot;) +
  labs(y = &quot;Millions&quot;, title = &quot;Australian population&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.06.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/eJt1qx/btraMqn7jzh/ifg8zfz6jk27U3eVFKxq11/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/eJt1qx/btraMqn7jzh/ifg8zfz6jk27U3eVFKxq11/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/eJt1qx/btraMqn7jzh/ifg8zfz6jk27U3eVFKxq11/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FeJt1qx%2FbtraMqn7jzh%2Fifg8zfz6jk27U3eVFKxq11%2Fimg.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.06.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
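&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The three equations of Holt&amp;rsquo;s method can also be traced step by step in base R. This is a minimal sketch with a hypothetical function name (&lt;code&gt;holt_forecast()&lt;/code&gt;) and made-up inputs, and it is not what &lt;code&gt;ETS()&lt;/code&gt; does internally: here the parameters and initial states are supplied by hand rather than estimated.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Hypothetical helper: Holt's linear trend recursion followed by h-step forecasts
holt_forecast &amp;lt;- function(y, alpha, beta_star, l0, b0, h) {
  level &amp;lt;- l0
  trend &amp;lt;- b0
  for (obs in y) {
    prev_level &amp;lt;- level
    # Level equation
    level &amp;lt;- alpha * obs + (1 - alpha) * (prev_level + trend)
    # Trend equation
    trend &amp;lt;- beta_star * (level - prev_level) + (1 - beta_star) * trend
  }
  # Forecast equation: last level plus h times the last trend
  level + seq_len(h) * trend
}

# Made-up, perfectly linear toy series with matching initial states
y &amp;lt;- c(10, 12, 14, 16)
fc_holt &amp;lt;- holt_forecast(y, alpha = 0.5, beta_star = 0.5, l0 = 8, b0 = 2, h = 3)
fc_holt   # 18 20 22: the forecasts continue the line&lt;/code&gt;&lt;/pre&gt;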
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Damped trend methods&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
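&lt;li&gt;Forecasts from Holt&amp;rsquo;s linear trend method above keep a constant trend indefinitely, and they tend to over-forecast, especially at longer horizons.&lt;/li&gt;
&lt;li&gt;To compensate for this, there is a method that introduces a parameter which flattens the trend at some point in the future.&lt;/li&gt;
&lt;li&gt;Gardner &amp;amp; McKenzie (1985) introduced a damping parameter &lt;span class=&quot;math inline&quot;&gt;\(\phi\)&lt;/span&gt;, with a value between 0 and 1, as follows.&lt;/li&gt;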
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\hat{y}_{t+h|t} = l_{t} + (\phi + \phi^2 + ... + \phi^{h})b_{t}\]&lt;/span&gt; &lt;span class=&quot;math display&quot;&gt;\[l_{t} = \alpha y_{t} + (1-\alpha)(l_{t-1} + \phi b_{t-1})\]&lt;/span&gt; &lt;span class=&quot;math display&quot;&gt;\[b_{t} = \beta^{*}(l_{t}-l_{t-1}) + (1-\beta^{*})\phi b_{t-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
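&lt;li&gt;If &lt;span class=&quot;math inline&quot;&gt;\(\phi = 1\)&lt;/span&gt;, the method is identical to Holt&amp;rsquo;s linear trend method.&lt;/li&gt;
&lt;li&gt;For values between 0 and 1, the damping parameter keeps the trend in short-run forecasts but makes long-run forecasts approach a constant.&lt;/li&gt;
&lt;li&gt;In practice, the damping parameter is usually restricted to a minimum of 0.8 and a maximum of 0.98.&lt;/li&gt;
&lt;li&gt;Let us compare the two methods with the example code below.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For the damped method, the trend term is specified as &amp;ldquo;Ad&amp;rdquo;; here &lt;code&gt;phi&lt;/code&gt; is fixed at 0.9 rather than estimated.&lt;/li&gt;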
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_economy %&amp;gt;% 
  model(
    `Holt's method` = ETS(Pop ~ error(&quot;A&quot;) + trend(&quot;A&quot;) + season(&quot;N&quot;)),
    `Damped Holt's method` = ETS(Pop ~ error(&quot;A&quot;) + trend(&quot;Ad&quot;, phi = 0.9) + season(&quot;N&quot;))
  ) %&amp;gt;% 
  forecast(h = 15) %&amp;gt;% 
  autoplot(aus_economy, level = NULL) +
  labs(y = &quot;Millions&quot;, title = &quot;Australian population&quot;) +
  guides(color = guide_legend(title = &quot;Forecast method&quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.11.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bp8R7X/btraIn61FlV/3WX1XAmCE56llAXOadltcK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bp8R7X/btraIn61FlV/3WX1XAmCE56llAXOadltcK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bp8R7X/btraIn61FlV/3WX1XAmCE56llAXOadltcK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbp8R7X%2FbtraIn61FlV%2F3WX1XAmCE56llAXOadltcK%2Fimg.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.11.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
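&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Why the damped forecasts level off can be seen from the multiplier of &lt;span class=&quot;math inline&quot;&gt;\(b_{t}\)&lt;/span&gt; in the forecast equation, &lt;span class=&quot;math inline&quot;&gt;\(\phi + \phi^2 + \cdots + \phi^{h}\)&lt;/span&gt;. The short base R sketch below uses a hypothetical helper name, &lt;code&gt;damping_multiplier()&lt;/code&gt;. For &lt;span class=&quot;math inline&quot;&gt;\(0 &amp;lt; \phi &amp;lt; 1\)&lt;/span&gt; the multiplier converges to &lt;span class=&quot;math inline&quot;&gt;\(\phi/(1-\phi)\)&lt;/span&gt;, so the forecasts converge to &lt;span class=&quot;math inline&quot;&gt;\(l_{T} + \phi b_{T}/(1-\phi)\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Multiplier of b_t in the h-step forecast: phi + phi^2 + ... + phi^h
damping_multiplier &amp;lt;- function(phi, h) {
  cumsum(phi^seq_len(h))
}

damping_multiplier(1, 5)      # phi = 1 gives 1, 2, ..., h (Holt's linear trend)
damping_multiplier(0.9, 15)   # phi = 0.9 grows more and more slowly

# Limit of the multiplier as h grows, for phi = 0.9
0.9 / (1 - 0.9)&lt;/code&gt;&lt;/pre&gt;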
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Here is another example: the number of users connected to the internet through a server, recorded every minute for 100 minutes.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;www_usage &amp;lt;- WWWusage %&amp;gt;% 
  as_tsibble()

www_usage&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 100 x 2 [1]
##    index value
##    &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
##  1     1    88
##  2     2    84
##  3     3    85
##  4     4    85
##  5     5    84
##  6     6    85
##  7     7    83
##  8     8    85
##  9     9    88
## 10    10    89
## # &amp;hellip; with 90 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;www_usage %&amp;gt;% 
  autoplot(value) +
  labs(x = &quot;Minute&quot;, y = &quot;Number of users&quot;, title = &quot;Internet usage per minute&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.17.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/2buXA/btraLs7AGbQ/0KByrCXO89woWyNAmbRsP0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/2buXA/btraLs7AGbQ/0KByrCXO89woWyNAmbRsP0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/2buXA/btraLs7AGbQ/0KByrCXO89woWyNAmbRsP0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F2buXA%2FbtraLs7AGbQ%2F0KByrCXO89woWyNAmbRsP0%2Fimg.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.17.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
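&lt;li&gt;We use time series cross-validation to compare the one-step-ahead forecast accuracy of the three methods below (SES, Holt&amp;rsquo;s method, and damped Holt&amp;rsquo;s method).
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;stretch_tsibble()&lt;/code&gt; splits the series into a sequence of expanding training sets: with &lt;code&gt;.init = 10&lt;/code&gt;, the first set holds the first 10 observations and each later set adds one more.&lt;/li&gt;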
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
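&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To make the folds concrete, here is a minimal sketch that inspects what &lt;code&gt;stretch_tsibble()&lt;/code&gt; returns. The names &lt;code&gt;www_cv&lt;/code&gt; and &lt;code&gt;fold_sizes&lt;/code&gt; are illustrative and not part of the original analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)

# Expanding-window folds: fold 1 has 10 rows, each later fold has one more
www_cv &amp;lt;- WWWusage %&amp;gt;%
  as_tsibble() %&amp;gt;%
  stretch_tsibble(.init = 10, .step = 1)

# Count the rows in each fold; the new key column .id identifies the fold
fold_sizes &amp;lt;- www_cv %&amp;gt;%
  as_tibble() %&amp;gt;%
  count(.id)

head(fold_sizes, 3)
nrow(fold_sizes)  # number of folds&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;With 100 observations and &lt;code&gt;.init = 10&lt;/code&gt; this should give 91 folds of sizes 10, 11, and so on up to 100. The last fold has no later observation to forecast, which is why &lt;code&gt;accuracy()&lt;/code&gt; warns about a missing value at index 101.&lt;/li&gt;
&lt;/ul&gt;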
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;www_usage %&amp;gt;% 
  stretch_tsibble(.init = 10) %&amp;gt;% 
  model(
    `SES` = ETS(value ~ error(&quot;A&quot;) + trend(&quot;N&quot;) + season(&quot;N&quot;)),
    `Holt` = ETS(value ~ error(&quot;A&quot;) + trend(&quot;A&quot;) + season(&quot;N&quot;)),
    `Damped` = ETS(value ~ error(&quot;A&quot;) + trend(&quot;Ad&quot;) + season(&quot;N&quot;))
  ) %&amp;gt;% 
  forecast(h = 1) %&amp;gt;% 
  accuracy(www_usage)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;shell&quot;&gt;&lt;code&gt;## Warning: The future dataset is incomplete, incomplete out-of-sample data will be treated as missing. 
## 1 observation is missing at 101&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 3 x 10
##   .model .type     ME  RMSE   MAE   MPE  MAPE  MASE RMSSE  ACF1
##   &amp;lt;chr&amp;gt;  &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1 Damped Test  0.288   3.69  3.00 0.347  2.26 0.663 0.636 0.336
## 2 Holt   Test  0.0610  3.87  3.17 0.244  2.38 0.701 0.668 0.296
## 3 SES    Test  1.46    6.05  4.81 0.904  3.55 1.06  1.04  0.803&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
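&lt;li&gt;Comparing by RMSE, damped Holt&amp;rsquo;s method performs best (3.69, versus 3.87 for Holt and 6.05 for SES); it also has the lowest MAE.&lt;/li&gt;
&lt;li&gt;We therefore use the damped method to produce forecasts.&lt;/li&gt;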
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit &amp;lt;- www_usage %&amp;gt;% 
  model(`Damped` = ETS(value ~ error(&quot;A&quot;) + trend(&quot;Ad&quot;) + season(&quot;N&quot;)))

fit %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;shell&quot;&gt;&lt;code&gt;## Series: value 
## Model: ETS(A,Ad,N) 
##   Smoothing parameters:
##     alpha = 0.9999 
##     beta  = 0.9966439 
##     phi   = 0.814958 
## 
##   Initial states:
##      l[0]        b[0]
##  90.35177 -0.01728234
## 
##   sigma^2:  12.2244
## 
##      AIC     AICc      BIC 
## 717.7310 718.6342 733.3620&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit %&amp;gt;% 
  tidy()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 5 x 3
##   .model term  estimate
##   &amp;lt;chr&amp;gt;  &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt;
## 1 Damped alpha   1.00  
## 2 Damped beta    0.997 
## 3 Damped phi     0.815 
## 4 Damped l[0]   90.4   
## 5 Damped b[0]   -0.0173&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
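&lt;li&gt;The smoothing parameter for the slope, &lt;span class=&quot;math inline&quot;&gt;\(\beta\)&lt;/span&gt;, is estimated to be almost 1, so the trend mostly tracks the change between the two most recent minutes. &lt;span class=&quot;math inline&quot;&gt;\(\alpha\)&lt;/span&gt; is also very close to 1, which means the level reacts strongly to each new observation.&lt;/li&gt;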
&lt;/ul&gt;
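&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The report also estimates &lt;span class=&quot;math inline&quot;&gt;\(\phi \approx 0.815\)&lt;/span&gt;. As a rough sketch of what that implies, assume the damped trend forecast function given in fpp3, &lt;span class=&quot;math inline&quot;&gt;\(\hat{y}_{T+h|T} = \ell_{T} + (\phi + \phi^{2} + \cdots + \phi^{h})\,b_{T}\)&lt;/span&gt;. The base R snippet below computes the multiplier on the last slope; the variable names are illustrative.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;phi &amp;lt;- 0.814958   # damping parameter from report() above
h &amp;lt;- 1:10         # forecast horizons

# Multiplier applied to the last slope b_T at each horizon
slope_multiplier &amp;lt;- cumsum(phi^h)

# Upper bound of the multiplier as h grows: phi / (1 - phi)
limit &amp;lt;- phi / (1 - phi)

round(slope_multiplier, 3)
round(limit, 3)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The multiplier grows more and more slowly and is bounded by about 4.4, so the forecasts flatten out instead of continuing along a straight line.&lt;/li&gt;
&lt;/ul&gt;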
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit %&amp;gt;% 
  forecast(h = 10) %&amp;gt;% 
  autoplot(www_usage) +
  labs(x = &quot;Minute&quot;, y = &quot;Number of users&quot;, title = &quot;Internet usage per minute&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.23.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/KLYer/btraGPvq4OE/uuT8RjscCC8q0gkx1mbjfk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/KLYer/btraGPvq4OE/uuT8RjscCC8q0gkx1mbjfk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/KLYer/btraGPvq4OE/uuT8RjscCC8q0gkx1mbjfk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FKLYer%2FbtraGPvq4OE%2FuuT8RjscCC8q0gkx1mbjfk%2Fimg.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.23.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Methods with seasonality&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Holt (1957) and Winters (1960) extended Holt&amp;rsquo;s method above to capture seasonality as well as trend.&lt;/li&gt;
&lt;li&gt;In addition to the parameters &lt;span class=&quot;math inline&quot;&gt;\(\alpha\)&lt;/span&gt; and &lt;span class=&quot;math inline&quot;&gt;\(\beta^{*}\)&lt;/span&gt; above, the method has a smoothing parameter &lt;span class=&quot;math inline&quot;&gt;\(\gamma\)&lt;/span&gt; for the seasonal component.&lt;/li&gt;
&lt;li&gt;The method comes in two variants, depending on the nature of the seasonal component.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The additive method is preferred when the seasonal variations are roughly constant throughout the series;&lt;br /&gt;the multiplicative method is preferred when the seasonal variations change in proportion to the level of the series.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;For the equations, please see the links below.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;a href=&quot;https://otexts.com/fpp3/holt-winters.html#holt-winters-additive-method&quot;&gt;Holt-Winters&amp;rsquo; additive method&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://otexts.com/fpp3/holt-winters.html#holt-winters-multiplicative-method&quot;&gt;Holt-Winters&amp;rsquo; multiplicative method&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
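&lt;li&gt;Let&amp;rsquo;s go straight to an example, using the &lt;code&gt;tourism&lt;/code&gt; data from the &lt;code&gt;tsibble&lt;/code&gt; package (loaded with &lt;code&gt;fpp3&lt;/code&gt;).
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The data record quarterly domestic overnight trips in Australia; in this example we forecast the trips whose purpose is a holiday.&lt;/li&gt;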
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_holidays &amp;lt;- tourism %&amp;gt;% 
  filter(Purpose == &quot;Holiday&quot;) %&amp;gt;% 
  summarise(Trips = sum(Trips/1000))

aus_holidays&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 80 x 2 [1Q]
##    Quarter Trips
##      &amp;lt;qtr&amp;gt; &amp;lt;dbl&amp;gt;
##  1 1998 Q1 11.8 
##  2 1998 Q2  9.28
##  3 1998 Q3  8.64
##  4 1998 Q4  9.30
##  5 1999 Q1 11.2 
##  6 1999 Q2  9.61
##  7 1999 Q3  8.91
##  8 1999 Q4  9.03
##  9 2000 Q1 11.1 
## 10 2000 Q2  9.20
## # &amp;hellip; with 70 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit &amp;lt;- aus_holidays %&amp;gt;% 
  model(
    `additive` = ETS(Trips ~ error(&quot;A&quot;) + trend(&quot;A&quot;) + season(&quot;A&quot;)),
    `multiplicative` = ETS(Trips ~ error(&quot;M&quot;) + trend(&quot;A&quot;) + season(&quot;M&quot;))
  )

fc &amp;lt;- fit %&amp;gt;% 
  forecast(h = &quot;3 years&quot;)

fc %&amp;gt;% 
  autoplot(aus_holidays, level = NULL) +
  labs(y = &quot;Overnight trips (millions)&quot;, title = &quot;Australian domestic tourism&quot;) +
  guides(color = guide_legend(title = &quot;Forecast&quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.28.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cFgVh1/btraHYeBsRI/NNFZo4wnlWQKZMbGXaaKK1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cFgVh1/btraHYeBsRI/NNFZo4wnlWQKZMbGXaaKK1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cFgVh1/btraHYeBsRI/NNFZo4wnlWQKZMbGXaaKK1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcFgVh1%2FbtraHYeBsRI%2FNNFZo4wnlWQKZMbGXaaKK1%2Fimg.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.28.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Checking the fit statistics with &lt;code&gt;glance()&lt;/code&gt; below, the multiplicative model has the slightly lower MSE (0.170 versus 0.174), along with lower AIC, AICc, and BIC.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit %&amp;gt;% 
  glance()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 2 x 9
##   .model          sigma2 log_lik   AIC  AICc   BIC   MSE  AMSE    MAE
##   &amp;lt;chr&amp;gt;            &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
## 1 additive       0.193     -105.  229.  231.  250. 0.174 0.184 0.321 
## 2 multiplicative 0.00212   -104.  227.  229.  248. 0.170 0.183 0.0328&lt;/code&gt;&lt;/pre&gt;
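&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The distinction between additive and multiplicative seasonality can be illustrated with made-up numbers. The sketch below is not the tourism data: the level and the seasonal values are hypothetical, and it uses only base R.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Hypothetical level rising over 3 years of quarterly data
level &amp;lt;- seq(10, 43, by = 3)
year &amp;lt;- rep(1:3, each = 4)

s_add &amp;lt;- rep(c(3, -1, -2, 0), times = 3)        # additive offsets, sum to 0 each year
s_mul &amp;lt;- rep(c(1.3, 0.9, 0.8, 1.0), times = 3)  # multiplicative indices, average 1 each year

y_add &amp;lt;- level + s_add
y_mul &amp;lt;- level * s_mul

# Size of the seasonal swing within each year
swing &amp;lt;- function(x) diff(range(x))
swing_add &amp;lt;- tapply(y_add - level, year, swing)
swing_mul &amp;lt;- tapply(y_mul - level, year, swing)

swing_add
swing_mul&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The additive swing is the same every year, while the multiplicative swing grows with the level. The second pattern is the one that favours &lt;code&gt;season(&quot;M&quot;)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;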
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Holt-Winters&amp;rsquo; damped method&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
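&lt;li&gt;As with Holt&amp;rsquo;s method, the trend in the Holt-Winters method can also be damped. Let&amp;rsquo;s look at an example using daily pedestrian counts at Southern Cross Station.&lt;/li&gt;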
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;sth_cross_ped &amp;lt;- pedestrian %&amp;gt;% 
  filter(Date &amp;gt;= &quot;2016-07-01&quot;, Sensor == &quot;Southern Cross Station&quot;) %&amp;gt;% 
  index_by(Date) %&amp;gt;% 
  summarise(Count = sum(Count)/1000)

sth_cross_ped&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 184 x 2 [1D]
##    Date       Count
##    &amp;lt;date&amp;gt;     &amp;lt;dbl&amp;gt;
##  1 2016-07-01 17.6 
##  2 2016-07-02  2.52
##  3 2016-07-03  1.73
##  4 2016-07-04 17.3 
##  5 2016-07-05 16.5 
##  6 2016-07-06 16.8 
##  7 2016-07-07 17.0 
##  8 2016-07-08 17.2 
##  9 2016-07-09  3.21
## 10 2016-07-10  1.75
## # &amp;hellip; with 174 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;sth_cross_ped %&amp;gt;%
  filter(Date &amp;lt;= &quot;2016-07-31&quot;) %&amp;gt;%
  model(hw = ETS(Count ~ error(&quot;M&quot;) + trend(&quot;Ad&quot;) + season(&quot;M&quot;))) %&amp;gt;%
  forecast(h = &quot;2 weeks&quot;) %&amp;gt;%
  autoplot(sth_cross_ped %&amp;gt;% filter(Date &amp;lt;= &quot;2016-08-14&quot;)) +
  labs(y = &quot;Pedestrians ('000)&quot;, title = &quot;Daily traffic: Southern Cross&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.37.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Rqeh2/btraRyem4Au/AEkRP3C4f8KT3CqSG1EsuK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Rqeh2/btraRyem4Au/AEkRP3C4f8KT3CqSG1EsuK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Rqeh2/btraRyem4Au/AEkRP3C4f8KT3CqSG1EsuK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FRqeh2%2FbtraRyem4Au%2FAEkRP3C4f8KT3CqSG1EsuK%2Fimg.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.37.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The forecasts show that the model has picked up both the weekly seasonal pattern and the increasing trend at the end of the training data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Model selection&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
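&lt;li&gt;A major advantage of the ETS statistical framework is that information criteria can be used for model selection.&lt;/li&gt;
&lt;li&gt;These include the AIC, the AICc, and the BIC.&lt;/li&gt;
&lt;li&gt;With &lt;span class=&quot;math inline&quot;&gt;\(k\)&lt;/span&gt; the total number of estimated parameters (including the initial states and the residual variance), &lt;span class=&quot;math inline&quot;&gt;\(L\)&lt;/span&gt; the likelihood of the model, and &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt; the number of observations, they are defined as follows.&lt;/li&gt;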
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\text{AIC} = -2\log{L} + 2k\]&lt;/span&gt; &lt;span class=&quot;math display&quot;&gt;\[\text{AIC}_{c} = \text{AIC} + \frac{2k(k+1)}{T-k-1}\]&lt;/span&gt; &lt;span class=&quot;math display&quot;&gt;\[\text{BIC} = \text{AIC} + k(\log{T}-2)\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Let&amp;rsquo;s look at an example.&lt;/li&gt;
&lt;li&gt;The examples above spelled out every term of the &lt;code&gt;formula&lt;/code&gt; passed to &lt;code&gt;ETS()&lt;/code&gt;. If only the response variable is supplied, &lt;code&gt;ETS()&lt;/code&gt; selects the model that minimises &lt;span class=&quot;math inline&quot;&gt;\(\text{AIC}_{c}\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit &amp;lt;- aus_holidays %&amp;gt;% 
  model(ETS(Trips))

fit %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;shell&quot;&gt;&lt;code&gt;## Series: Trips 
## Model: ETS(M,N,A) 
##   Smoothing parameters:
##     alpha = 0.3484054 
##     gamma = 0.0001000018 
## 
##   Initial states:
##      l[0]       s[0]      s[-1]      s[-2]    s[-3]
##  9.727072 -0.5376106 -0.6884343 -0.2933663 1.519411
## 
##   sigma^2:  0.0022
## 
##      AIC     AICc      BIC 
## 226.2289 227.7845 242.9031&lt;/code&gt;&lt;/pre&gt;
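&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The reported criteria can be checked against the formulas above with base R. Here the number of observations (80) comes from the tsibble header, and &lt;code&gt;k = 7&lt;/code&gt; is an assumption (&lt;span class=&quot;math inline&quot;&gt;\(\alpha\)&lt;/span&gt;, &lt;span class=&quot;math inline&quot;&gt;\(\gamma\)&lt;/span&gt;, the initial level, three free initial seasonal states, and the residual variance), chosen because it reproduces the printed values.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aic &amp;lt;- 226.2289   # AIC printed by report()
n_obs &amp;lt;- 80       # number of quarters in aus_holidays
k &amp;lt;- 7            # assumed number of estimated parameters (see above)

# Small-sample correction and BIC, both expressed in terms of the AIC
aicc &amp;lt;- aic + 2 * k * (k + 1) / (n_obs - k - 1)
bic &amp;lt;- aic + k * (log(n_obs) - 2)

round(c(AICc = aicc, BIC = bic), 4)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Both values agree with the report (227.7845 and 242.9031) to about four decimal places.&lt;/li&gt;
&lt;/ul&gt;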
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The &lt;code&gt;components()&lt;/code&gt; function returns the values of each component of the fitted model.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit %&amp;gt;% 
  components() %&amp;gt;% 
  autoplot() +
  labs(title = &quot;ETS(M, N, A) components&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;pgsql&quot;&gt;&lt;code&gt;## Warning: Removed 4 row(s) containing missing values (geom_path).&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.45.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/byZJie/btraMq9wPU3/NPKKKJ72vxNPSTdifVPZn1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/byZJie/btraMq9wPU3/NPKKKJ72vxNPSTdifVPZn1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/byZJie/btraMq9wPU3/NPKKKJ72vxNPSTdifVPZn1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbyZJie%2FbtraMq9wPU3%2FNPKKKJ72vxNPSTdifVPZn1%2Fimg.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.45.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit %&amp;gt;% 
  augment() %&amp;gt;% 
  autoplot(.resid)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.51.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dhuBdW/btraP9y9FPB/LzSs4aZekAk3KxXfPw8qKK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dhuBdW/btraP9y9FPB/LzSs4aZekAk3KxXfPw8qKK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dhuBdW/btraP9y9FPB/LzSs4aZekAk3KxXfPw8qKK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdhuBdW%2FbtraP9y9FPB%2FLzSs4aZekAk3KxXfPw8qKK%2Fimg.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.51.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit %&amp;gt;% 
  augment() %&amp;gt;% 
  autoplot(.innov)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.56.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ckZmqY/btraPuDzs2E/EIEcFkhx7qBQaKTvxPe6SK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ckZmqY/btraPuDzs2E/EIEcFkhx7qBQaKTvxPe6SK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ckZmqY/btraPuDzs2E/EIEcFkhx7qBQaKTvxPe6SK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FckZmqY%2FbtraPuDzs2E%2FEIEcFkhx7qBQaKTvxPe6SK%2Fimg.png&quot; data-origin-width=&quot;671&quot; data-origin-height=&quot;479&quot; data-filename=&quot;스크린샷 2021-07-29 오후 7.53.56.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;</description>
      <category>time-series (tidy approach)</category>
      <category>error</category>
      <category>ETS</category>
      <category>Exponential Smoothing</category>
      <category>holt's linear trend</category>
      <category>Season</category>
      <category>SES</category>
      <category>trend</category>
      <category>지수평활법</category>
      <author>택2</author>
      <guid isPermaLink="true">https://rstatistics.tistory.com/79</guid>
      <comments>https://rstatistics.tistory.com/79#entry79comment</comments>
      <pubDate>Thu, 29 Jul 2021 19:55:43 +0900</pubDate>
    </item>
    <item>
      <title>[R] 5. Time-series Regression</title>
      <link>https://rstatistics.tistory.com/78</link>
      <description>&lt;p data-ke-size=&quot;size16&quot;&gt;
&lt;script type=&quot;text/x-mathjax-config&quot;&gt;
	MathJax.Hub.Config({
    	tex2jax: {
    		inlineMath: [ ['$','$'], ['\\(','\\)'] ],
    		processEscapes: true
		},
		TeX: { equationNumbers: { autoNumber: &quot;AMS&quot; } }
	});
&lt;/script&gt;
&lt;script src=&quot;https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js?config=TeX-MML-AM_CHTML&quot;&gt;&lt;/script&gt;
&lt;/p&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)
#library(fable)&lt;/code&gt;&lt;/pre&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;The linear model&lt;/h2&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t} = \beta_0 + \beta_1 x_{t} + \epsilon_{t}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This is the familiar simple linear regression model.&lt;/li&gt;
&lt;li&gt;When it is applied to time series data, the same assumption is made about the error term:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;iid (independent and identically distributed)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Let&amp;rsquo;s explore this with functions from the &lt;code&gt;fable&lt;/code&gt; package.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The example data set is &lt;code&gt;us_change&lt;/code&gt;, which ships with the &lt;code&gt;fpp3&lt;/code&gt; package.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;It contains quarterly percentage changes (growth rates) in US personal consumption expenditure and personal disposable income, along with production, savings, and unemployment, from 1970 Q1 to 2019 Q2.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;us_change&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 198 x 6 [1Q]
##    Quarter Consumption Income Production Savings Unemployment
##      &amp;lt;qtr&amp;gt;       &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
##  1 1970 Q1       0.619  1.04      -2.45    5.30         0.9  
##  2 1970 Q2       0.452  1.23      -0.551   7.79         0.5  
##  3 1970 Q3       0.873  1.59      -0.359   7.40         0.5  
##  4 1970 Q4      -0.272 -0.240     -2.19    1.17         0.700
##  5 1971 Q1       1.90   1.98       1.91    3.54        -0.100
##  6 1971 Q2       0.915  1.45       0.902   5.87        -0.100
##  7 1971 Q3       0.794  0.521      0.308  -0.406        0.100
##  8 1971 Q4       1.65   1.16       2.29   -1.49         0    
##  9 1972 Q1       1.31   0.457      4.15   -4.29        -0.200
## 10 1972 Q2       1.89   1.03       1.89   -4.69        -0.100
## # &amp;hellip; with 188 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To view each variable&amp;rsquo;s quarterly movement, we reshape the data with &lt;code&gt;tidyr::pivot_longer()&lt;/code&gt; and then plot it.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;us_change %&amp;gt;% 
  pivot_longer(
    cols = c(&quot;Consumption&quot;, &quot;Income&quot;),
    names_to = &quot;Series&quot;,
    values_to = &quot;value&quot;
  ) %&amp;gt;% 
  autoplot(value) +
  labs(y = &quot;% change&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.17.59.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bWH2F8/btraGOXcqUI/m4UGJDiaWR3RgwniT8Ojk1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bWH2F8/btraGOXcqUI/m4UGJDiaWR3RgwniT8Ojk1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bWH2F8/btraGOXcqUI/m4UGJDiaWR3RgwniT8Ojk1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbWH2F8%2FbtraGOXcqUI%2Fm4UGJDiaWR3RgwniT8Ojk1%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.17.59.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Next, we look at the linear relationship between income and consumption (using &lt;code&gt;geom_smooth(method = &quot;lm&quot;)&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;us_change %&amp;gt;% 
  ggplot(aes(x = Income, y = Consumption)) +
  geom_point(size = 1.2) +
  geom_smooth(formula = y ~ x, method = &quot;lm&quot;, se = FALSE) + # method = &quot;lm&quot; fits a linear regression line
  labs(
    x = &quot;Income (quarterly % change)&quot;,
    y = &quot;Consumption (quarterly % change)&quot;
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.18.09.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/qDMxh/btraGPuZpwL/WJgJ2rArfisf7bfUYHgV30/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/qDMxh/btraGPuZpwL/WJgJ2rArfisf7bfUYHgV30/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/qDMxh/btraGPuZpwL/WJgJ2rArfisf7bfUYHgV30/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FqDMxh%2FbtraGPuZpwL%2FWJgJ2rArfisf7bfUYHgV30%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.18.09.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Now we estimate the same relationship with the &lt;code&gt;TSLM()&lt;/code&gt; function.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;report()&lt;/code&gt; function prints the fitted model.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;us_change %&amp;gt;% 
  model(TSLM(Consumption ~ Income)) %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.58236 -0.27777  0.01862  0.32330  1.42229 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept)  0.54454    0.05403  10.079  &amp;lt; 2e-16 ***
## Income       0.27183    0.04673   5.817  2.4e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5905 on 196 degrees of freedom
## Multiple R-squared: 0.1472,  Adjusted R-squared: 0.1429
## F-statistic: 33.84 on 1 and 196 DF, p-value: 2.4022e-08&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
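&lt;li&gt;The fitted line is &lt;span class=&quot;math inline&quot;&gt;\(\text{Consumption} = 0.54454 + 0.27183\times\text{Income}\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;The coefficient on income is statistically significant. A one-unit (1 percentage point) increase in income growth is associated with an average increase of 0.27 percentage points in consumption growth.&lt;br /&gt;(The intercept does not add to this effect; it sets the level. When income growth is 1%, the fitted consumption growth is 0.54454 + 0.27183 = 0.81637, about 0.82%.)&lt;/li&gt;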
&lt;/ul&gt;
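&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A quick base R sketch of how the two coefficients are used: the intercept gives the fitted value at zero income growth, and the slope is the change per additional percentage point. The income values below are hypothetical inputs, not observations.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;b0 &amp;lt;- 0.54454   # intercept from report()
b1 &amp;lt;- 0.27183   # slope on Income from report()

income &amp;lt;- c(0, 1, 2)   # hypothetical income growth rates (%)
fitted_consumption &amp;lt;- b0 + b1 * income

fitted_consumption
diff(fitted_consumption)   # change per extra percentage point of income growth&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Each step in income changes the fitted value by 0.27183 whatever the starting point, and the fitted value at 1% income growth is 0.81637.&lt;/li&gt;
&lt;/ul&gt;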
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Multiple linear model&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t} = \beta_0 + \beta_1 x_{1, t} + \beta_2 x_{2, t} + \cdots + \beta_k x_{k, t} + \epsilon_{t}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
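&lt;li&gt;This is a linear regression model with two or more predictors.&lt;/li&gt;
&lt;li&gt;Let&amp;rsquo;s apply it right away. To forecast consumption expenditure we include not only personal income but also industrial production, personal savings, and the unemployment rate.&lt;/li&gt;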
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;us_change %&amp;gt;% 
  pivot_longer(
    cols = c(&quot;Production&quot;, &quot;Savings&quot;, &quot;Unemployment&quot;),
    names_to = &quot;Series&quot;,
    values_to = &quot;value&quot;
  ) %&amp;gt;% 
  autoplot(value, show.legend = FALSE) +
  labs(y = &quot;% change&quot;) +
  facet_wrap(. ~ Series, scales = &quot;free_y&quot;, nrow = 3)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.18.17.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/qOdkq/btraPuXiXi7/uHZUhyVtE30dsUKERQVh71/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/qOdkq/btraPuXiXi7/uHZUhyVtE30dsUKERQVh71/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/qOdkq/btraPuXiXi7/uHZUhyVtE30dsUKERQVh71/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FqOdkq%2FbtraPuXiXi7%2FuHZUhyVtE30dsUKERQVh71%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.18.17.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;First, we check the pairwise correlations between the variables.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;ggpairs()&lt;/code&gt; function from the &lt;code&gt;GGally&lt;/code&gt; package shows them.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(GGally)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;oxygene&quot;&gt;&lt;code&gt;## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;us_change %&amp;gt;% 
  ggpairs(columns = 2:6)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.18.23.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dRkWtp/btraJoX3bQL/SfQZlCI3FBgaqiK0cp4Mt0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dRkWtp/btraJoX3bQL/SfQZlCI3FBgaqiK0cp4Mt0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dRkWtp/btraJoX3bQL/SfQZlCI3FBgaqiK0cp4Mt0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdRkWtp%2FbtraJoX3bQL%2FSfQZlCI3FBgaqiK0cp4Mt0%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.18.23.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Least Squares Estimation&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In linear regression, the coefficients &lt;span class=&quot;math inline&quot;&gt;\(\beta_{0}, \dots, \beta_{k}\)&lt;/span&gt; are estimated by minimizing the sum of squared errors below.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\sum^{T}_{t=1}\epsilon_{t}^{2} = \sum^{T}_{t=1}\bigg(y_{t} - \beta_0 - \beta_1x_{1,t}-\cdots-\beta_{k}x_{k, t} \bigg)^{2}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This procedure is called least squares estimation.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;TSLM()&lt;/code&gt; function fits a linear regression model to time series data.&lt;/li&gt;
&lt;li&gt;It is similar to the widely used &lt;code&gt;lm()&lt;/code&gt; function, but &lt;code&gt;TSLM()&lt;/code&gt; provides additional facilities for handling time series.&lt;/li&gt;
&lt;li&gt;We continue with the example above.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_consMR &amp;lt;- us_change %&amp;gt;% 
  model(tslm = TSLM(formula = Consumption ~ Income + Production + Unemployment + Savings))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_consMR %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90555 -0.15821 -0.03608  0.13618  1.15471 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept)   0.253105   0.034470   7.343 5.71e-12 ***
## Income        0.740583   0.040115  18.461  &amp;lt; 2e-16 ***
## Production    0.047173   0.023142   2.038   0.0429 *  
## Unemployment -0.174685   0.095511  -1.829   0.0689 .  
## Savings      -0.052890   0.002924 -18.088  &amp;lt; 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3102 on 193 degrees of freedom
## Multiple R-squared: 0.7683,  Adjusted R-squared: 0.7635
## F-statistic:   160 on 4 and 193 DF, p-value: &amp;lt; 2.22e-16&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The coefficients are interpreted as in ordinary linear regression, so we skip the details here.&lt;/li&gt;
&lt;/ul&gt;
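&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As a rough illustration of the least squares criterion, the sketch below solves the normal equations by hand in base R and compares the result with &lt;code&gt;lm()&lt;/code&gt;. The data are simulated and the variable names (&lt;code&gt;x1&lt;/code&gt;, &lt;code&gt;x2&lt;/code&gt;, &lt;code&gt;beta_hat&lt;/code&gt;) are made up for this example; it is not how &lt;code&gt;TSLM()&lt;/code&gt; is implemented internally.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Simulated data (hypothetical, not us_change): y depends linearly on x1 and x2.
set.seed(1)
n  &lt;- 100
x1 &lt;- rnorm(n)
x2 &lt;- rnorm(n)
y  &lt;- 0.3 + 0.7 * x1 - 0.2 * x2 + rnorm(n, sd = 0.3)

# Design matrix with a column of ones for the intercept beta_0.
X &lt;- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Solve the normal equations (X'X) beta = X'y, which minimise the sum of squared errors.
beta_hat &lt;- solve(crossprod(X), crossprod(X, y))

# lm() minimises the same criterion, so the two columns should agree.
fit &lt;- lm(y ~ x1 + x2)
cbind(normal_equations = drop(beta_hat), lm = coef(fit))&lt;/code&gt;&lt;/pre&gt;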
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Fitted values&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The &lt;code&gt;augment()&lt;/code&gt; function returns the fitted values together with the residuals.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_consMR %&amp;gt;% 
  augment()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 198 x 6 [1Q]
## # Key:       .model [1]
##    .model Quarter Consumption .fitted  .resid  .innov
##    &amp;lt;chr&amp;gt;    &amp;lt;qtr&amp;gt;       &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;
##  1 tslm   1970 Q1       0.619   0.474  0.145   0.145 
##  2 tslm   1970 Q2       0.452   0.635 -0.183  -0.183 
##  3 tslm   1970 Q3       0.873   0.931 -0.0583 -0.0583
##  4 tslm   1970 Q4      -0.272  -0.212 -0.0603 -0.0603
##  5 tslm   1971 Q1       1.90    1.64   0.264   0.264 
##  6 tslm   1971 Q2       0.915   1.07  -0.158  -0.158 
##  7 tslm   1971 Q3       0.794   0.658  0.137   0.137 
##  8 tslm   1971 Q4       1.65    1.30   0.347   0.347 
##  9 tslm   1972 Q1       1.31    1.05   0.262   0.262 
## 10 tslm   1972 Q2       1.89    1.37   0.513   0.513 
## # &amp;hellip; with 188 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_consMR %&amp;gt;% 
  augment() %&amp;gt;% 
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Consumption, colour = &quot;Data&quot;)) +
  geom_line(aes(y = .fitted, colour = &quot;Fitted&quot;)) +
  labs(y = NULL, title = &quot;Percent change in US consumption expenditure&quot;) +
  scale_colour_manual(values = c(Data = &quot;black&quot;, Fitted = &quot;#D55E00&quot;)) +
  guides(colour = guide_legend(title = NULL))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.18.32.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/VFGhG/btraJoDH9Py/C5MnKZvPGv1hVuuWG7gFl1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/VFGhG/btraJoDH9Py/C5MnKZvPGv1hVuuWG7gFl1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/VFGhG/btraJoDH9Py/C5MnKZvPGv1hVuuWG7gFl1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FVFGhG%2FbtraJoDH9Py%2FC5MnKZvPGv1hVuuWG7gFl1%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.18.32.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The plot below is a scatterplot of actual consumption expenditure against the fitted (predicted) values.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_consMR %&amp;gt;% 
  augment() %&amp;gt;% 
  ggplot(aes(x = Consumption, y = .fitted)) +
  geom_point(size = 1.2) +
  labs(
    x = &quot;Data (actual values)&quot;,
    y = &quot;Fitted (predicted values)&quot;,
    title = &quot;Percent change in US consumption expenditure&quot;
  ) +
  geom_abline(intercept = 0, slope = 1, color = &quot;grey50&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.20.43.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/zbPQR/btraKFS6hpP/LmbJEe7JFYtJifwFOBiMkk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/zbPQR/btraKFS6hpP/LmbJEe7JFYtJifwFOBiMkk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/zbPQR/btraKFS6hpP/LmbJEe7JFYtJifwFOBiMkk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FzbPQR%2FbtraKFS6hpP%2FLmbJEe7JFYtJifwFOBiMkk%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.20.43.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Goodness-of-fit&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A common way to summarize how well a linear regression model fits the data is the coefficient of determination, &lt;span class=&quot;math inline&quot;&gt;\(R^{2}\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;It can be computed as the square of the correlation between the observed and fitted values, or as follows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[R^{2} = \frac{\sum(\hat{y}_{t} - \bar{y})^2}{\sum(y_{t} - \bar{y})^2}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
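&lt;li&gt;For a least squares fit with an intercept, the value lies between 0 and 1 and moves closer to 1 as the fitted values get closer to the observed values.&lt;/li&gt;
&lt;li&gt;In the &lt;code&gt;report()&lt;/code&gt; output it appears as &lt;code&gt;Multiple R-squared&lt;/code&gt;; &lt;code&gt;Adjusted R-squared&lt;/code&gt; is the version penalized for the number of predictors.&lt;/li&gt;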
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_consMR %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Series: Consumption 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90555 -0.15821 -0.03608  0.13618  1.15471 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept)   0.253105   0.034470   7.343 5.71e-12 ***
## Income        0.740583   0.040115  18.461  &amp;lt; 2e-16 ***
## Production    0.047173   0.023142   2.038   0.0429 *  
## Unemployment -0.174685   0.095511  -1.829   0.0689 .  
## Savings      -0.052890   0.002924 -18.088  &amp;lt; 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3102 on 193 degrees of freedom
## Multiple R-squared: 0.7683,  Adjusted R-squared: 0.7635
## F-statistic:   160 on 4 and 193 DF, p-value: &amp;lt; 2.22e-16&lt;/code&gt;&lt;/pre&gt;
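&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The two ways of computing &lt;span class=&quot;math inline&quot;&gt;\(R^{2}\)&lt;/span&gt; mentioned above can be checked with a small base R sketch. The data are simulated for illustration only, and the adjusted value uses the usual formula &lt;span class=&quot;math inline&quot;&gt;\(1 - (1 - R^{2})\frac{T-1}{T-k-1}\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Simulated data (hypothetical): one predictor, k = 1.
set.seed(2)
n_obs &lt;- 120
k     &lt;- 1
x &lt;- rnorm(n_obs)
y &lt;- 1 + 0.5 * x + rnorm(n_obs)
fit &lt;- lm(y ~ x)

y_hat    &lt;- fitted(fit)
r2_ratio &lt;- sum((y_hat - mean(y))^2) / sum((y - mean(y))^2)  # ratio formula
r2_cor   &lt;- cor(y, y_hat)^2                                   # squared correlation
r2_lm    &lt;- summary(fit)$r.squared                            # &quot;Multiple R-squared&quot;

# Adjusted R-squared penalises the number of predictors k.
r2_adj &lt;- 1 - (1 - r2_ratio) * (n_obs - 1) / (n_obs - k - 1)

c(ratio = r2_ratio, correlation = r2_cor, lm = r2_lm,
  adjusted = r2_adj, lm_adjusted = summary(fit)$adj.r.squared)&lt;/code&gt;&lt;/pre&gt;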
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Standard error of the regression&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Another measure of how well the model fits the data is the standard deviation of the residuals, often called the residual standard error.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\hat{\sigma}_{e} = \sqrt{\frac{1}{T-k-1}\sum^{T}_{t=1}e_{t}^2}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
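&lt;li&gt;This value also appears in the &lt;code&gt;report()&lt;/code&gt; output, as &lt;code&gt;Residual standard error&lt;/code&gt;. Here &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt; is the number of observations and &lt;span class=&quot;math inline&quot;&gt;\(k\)&lt;/span&gt; the number of predictors, so for this model the divisor is &lt;span class=&quot;math inline&quot;&gt;\(198 - 4 - 1 = 193\)&lt;/span&gt;, the degrees of freedom shown in the output.&lt;/li&gt;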
&lt;/ul&gt;
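&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The formula can be verified by hand on simulated data. This is only a sketch in base R with made-up variable names; &lt;code&gt;summary(fit)$sigma&lt;/code&gt; is the value that &lt;code&gt;lm()&lt;/code&gt; reports as the residual standard error.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Simulated data (hypothetical) with k = 2 predictors.
set.seed(3)
n_obs &lt;- 80   # T in the formula
k     &lt;- 2    # number of predictors
x1 &lt;- rnorm(n_obs)
x2 &lt;- rnorm(n_obs)
y  &lt;- 2 + x1 - 0.5 * x2 + rnorm(n_obs, sd = 0.4)
fit &lt;- lm(y ~ x1 + x2)

# Residual standard error: root of the residual sum of squares over T - k - 1.
e         &lt;- residuals(fit)
sigma_hat &lt;- sqrt(sum(e^2) / (n_obs - k - 1))

c(manual = sigma_hat, lm = summary(fit)$sigma)&lt;/code&gt;&lt;/pre&gt;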
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Evaluating the regression model&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;After fitting a regression model, check the residuals to verify that the model assumptions are satisfied.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;ACF plot of residuals &amp;amp; Histogram of residuals&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
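&lt;li&gt;In time series data, the value at time t is very likely to be similar to, or influenced by, the value at time t-1 or earlier.&lt;/li&gt;
&lt;li&gt;It is therefore common to find autocorrelation in the residuals when a regression model is fitted to time series data.&lt;/li&gt;
&lt;li&gt;Forecasts from a model with autocorrelated errors are still unbiased, so they are not wrong, but their prediction intervals are usually wider than they need to be.&lt;/li&gt;
&lt;li&gt;We should also check whether the residuals are normally distributed.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;gg_tsresiduals()&lt;/code&gt; function produces diagnostic plots of the residuals.&lt;/li&gt;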
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_consMR %&amp;gt;% 
  gg_tsresiduals()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.20.52.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bjZjt0/btraNI9mbuP/NtFHkpulED0P5Nn0O4qd10/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bjZjt0/btraNI9mbuP/NtFHkpulED0P5Nn0O4qd10/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bjZjt0/btraNI9mbuP/NtFHkpulED0P5Nn0O4qd10/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbjZjt0%2FbtraNI9mbuP%2FNtFHkpulED0P5Nn0O4qd10%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.20.52.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_consMR %&amp;gt;% 
  augment() %&amp;gt;% 
  features(
    .var = .innov,
    features = ljung_box,
    lag = 10, 
    dof = 5
  )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 1 x 3
##   .model lb_stat lb_pvalue
##   &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
## 1 tslm      18.9   0.00204&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
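&lt;li&gt;The ACF plot shows a spike outside the bounds at lag 7, and the Ljung-Box test above rejects the hypothesis of no autocorrelation (p-value of about 0.002). Even so, this may not have much effect on the forecasts.&lt;/li&gt;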
&lt;/ul&gt;
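&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To get a feel for what the Ljung-Box test detects, the sketch below applies base R's &lt;code&gt;Box.test()&lt;/code&gt; to two simulated series: white noise and an AR(1) process. The series are hypothetical stand-ins for residuals, not the residuals of &lt;code&gt;fit_consMR&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;set.seed(4)
n &lt;- 200
white_noise &lt;- rnorm(n)                                      # no autocorrelation
ar1 &lt;- as.numeric(arima.sim(model = list(ar = 0.7), n = n))  # strong lag-1 autocorrelation

# Ljung-Box test on the first 10 autocorrelations of each series.
lb_white &lt;- Box.test(white_noise, lag = 10, type = &quot;Ljung-Box&quot;)
lb_ar1   &lt;- Box.test(ar1,         lag = 10, type = &quot;Ljung-Box&quot;)

# A small p-value is evidence against the hypothesis of no autocorrelation.
c(white_noise = lb_white$p.value, ar1 = lb_ar1$p.value)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The autocorrelated series should give a far larger test statistic, and hence a much smaller p-value, than the white noise series.&lt;/li&gt;
&lt;/ul&gt;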
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Residual plots against predictors&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;It is also worth examining scatterplots of the residuals against each predictor.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_consMR %&amp;gt;% 
  residuals()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 198 x 3 [1Q]
## # Key:       .model [1]
##    .model Quarter  .resid
##    &amp;lt;chr&amp;gt;    &amp;lt;qtr&amp;gt;   &amp;lt;dbl&amp;gt;
##  1 tslm   1970 Q1  0.145 
##  2 tslm   1970 Q2 -0.183 
##  3 tslm   1970 Q3 -0.0583
##  4 tslm   1970 Q4 -0.0603
##  5 tslm   1971 Q1  0.264 
##  6 tslm   1971 Q2 -0.158 
##  7 tslm   1971 Q3  0.137 
##  8 tslm   1971 Q4  0.347 
##  9 tslm   1972 Q1  0.262 
## 10 tslm   1972 Q2  0.513 
## # &amp;hellip; with 188 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;us_change %&amp;gt;% 
  left_join(residuals(fit_consMR), by = &quot;Quarter&quot;) %&amp;gt;% 
  pivot_longer(
    cols = Income:Unemployment,
    names_to = &quot;regressor&quot;,
    values_to = &quot;x&quot;
  ) %&amp;gt;% 
  ggplot(aes(x = x, y = .resid)) +
  geom_point(size = 1.2) +
  facet_wrap(. ~ regressor, scales = &quot;free_x&quot;) +
  labs(x = NULL, y = &quot;Residuals&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.01.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/qZrIt/btraJowUUpD/kCxS3P30kG0EiYurQSdYr1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/qZrIt/btraJowUUpD/kCxS3P30kG0EiYurQSdYr1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/qZrIt/btraJowUUpD/kCxS3P30kG0EiYurQSdYr1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FqZrIt%2FbtraJowUUpD%2FkCxS3P30kG0EiYurQSdYr1%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.01.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As the plots above show, the residuals appear to be randomly scattered against each predictor, so the assumptions seem to be reasonably well satisfied.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Residual plots against fitted values&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
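&lt;li&gt;A plot of the residuals against the fitted values should also show no pattern.&lt;/li&gt;
&lt;li&gt;If a pattern is observed, the errors may violate the assumption of constant variance (homoscedasticity).&lt;/li&gt;
&lt;li&gt;If this assumption is not met, a transformation of the variable, such as a logarithm or square root, may be required.&lt;/li&gt;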
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_consMR %&amp;gt;% 
  augment() %&amp;gt;% 
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point(size = 1.2) +
  labs(x = &quot;Fitted&quot;, y = &quot;Residuals&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.08.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/d1IjYn/btraGbrx3x7/aw4cNTxq33iKbKIQIbg77K/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/d1IjYn/btraGbrx3x7/aw4cNTxq33iKbKIQIbg77K/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/d1IjYn/btraGbrx3x7/aw4cNTxq33iKbKIQIbg77K/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fd1IjYn%2FbtraGbrx3x7%2Faw4cNTxq33iKbKIQIbg77K%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.08.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;No particular pattern is visible, so the constant variance assumption appears to hold.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Outliers and influential observations&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
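&lt;li&gt;Observations that take extreme values compared to the majority of the data are called outliers.&lt;/li&gt;
&lt;li&gt;Observations that have a large influence on the estimated coefficients of a regression model are called influential observations.&lt;/li&gt;
&lt;li&gt;Influential observations are often also extreme outliers.&lt;/li&gt;
&lt;li&gt;Simply removing outliers is not necessarily the right choice.&lt;br /&gt;An outlier can still carry meaningful information, so it is important to analyze why such a value could occur.&lt;/li&gt;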
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Spurious regression&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Time series data are often non-stationary.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A stationary series is one whose statistical properties, such as its mean and variance, do not change over time.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
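&lt;li&gt;A non-stationary series therefore usually needs to be transformed into a stationary one.&lt;/li&gt;
&lt;li&gt;Regressing one non-stationary series on another can produce a spurious regression: the fit looks strong even though the variables are unrelated. A high &lt;span class=&quot;math inline&quot;&gt;\(R^{2}\)&lt;/span&gt; combined with strongly autocorrelated residuals is a typical warning sign. The example below regresses Australian air passengers on rice production in Guinea.&lt;/li&gt;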
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;temp_fit &amp;lt;- aus_airpassengers %&amp;gt;% 
  filter(Year &amp;lt;= 2011) %&amp;gt;% 
  left_join(guinea_rice, by = &quot;Year&quot;) %&amp;gt;% 
  model(TSLM(Passengers ~ Production))

temp_fit %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Series: Passengers 
## Model: TSLM 
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9448 -1.8917 -0.3272  1.8620 10.4210 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept)   -7.493      1.203  -6.229 2.25e-07 ***
## Production    40.288      1.337  30.135  &amp;lt; 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.239 on 40 degrees of freedom
## Multiple R-squared: 0.9578,  Adjusted R-squared: 0.9568
## F-statistic: 908.1 on 1 and 40 DF, p-value: &amp;lt; 2.22e-16&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;temp_fit %&amp;gt;% 
  gg_tsresiduals()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.16.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/zcKAr/btraJzS7Q91/yseOxlzTLYPcIVEFrgHZo1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/zcKAr/btraJzS7Q91/yseOxlzTLYPcIVEFrgHZo1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/zcKAr/btraJzS7Q91/yseOxlzTLYPcIVEFrgHZo1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FzcKAr%2FbtraJzS7Q91%2FyseOxlzTLYPcIVEFrgHZo1%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.16.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
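&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The same effect can be reproduced with simulated data. The sketch below, in base R, regresses one random walk on another that is independent of it, then repeats the regression on the differenced series. The series are hypothetical and the exact numbers depend on the seed; levels regressions of this kind often show a sizeable &lt;span class=&quot;math inline&quot;&gt;\(R^{2}\)&lt;/span&gt;, but not always.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;set.seed(5)
n &lt;- 200
# Two independent random walks: neither has any effect on the other.
x &lt;- cumsum(rnorm(n))
y &lt;- cumsum(rnorm(n))

fit_levels &lt;- lm(y ~ x)              # regression on the non-stationary levels
fit_diff   &lt;- lm(diff(y) ~ diff(x))  # regression after first differencing

# Residual autocorrelation in the levels regression (Ljung-Box, 10 lags).
lb_levels &lt;- Box.test(residuals(fit_levels), lag = 10, type = &quot;Ljung-Box&quot;)

c(r2_levels = summary(fit_levels)$r.squared,
  r2_diff   = summary(fit_diff)$r.squared,
  lb_pvalue = lb_levels$p.value)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The residuals of the levels regression are heavily autocorrelated, while the regression on the differenced (stationary) series shows essentially no relationship.&lt;/li&gt;
&lt;/ul&gt;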
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Some useful predictors&lt;/h2&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Trend&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t} = \beta_{0} + \beta_{1} t + \epsilon_{t}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In &lt;code&gt;TSLM()&lt;/code&gt;, a linear trend can be specified with &lt;code&gt;trend()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Dummy variables&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
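&lt;li&gt;For example, to account for whether a given date is a public holiday, we can use a predictor that takes the value 1 on a public holiday and 0 otherwise.&lt;/li&gt;
&lt;li&gt;Such a predictor is called a dummy variable, also known as an indicator variable.&lt;/li&gt;
&lt;li&gt;If there are more than two categories, a variable with k categories can be coded using (k-1) dummy variables.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;TSLM()&lt;/code&gt; handles this automatically when a factor variable is specified as a predictor.&lt;/li&gt;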
&lt;/ul&gt;
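&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The (k-1) coding can be seen directly with base R's &lt;code&gt;model.matrix()&lt;/code&gt;. The &lt;code&gt;quarter&lt;/code&gt; factor below is a hypothetical example with k = 4 categories; this is only meant to show the coding, not what &lt;code&gt;TSLM()&lt;/code&gt; does internally.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Hypothetical categorical predictor with k = 4 categories (two years of quarters).
quarter &lt;- factor(rep(c(&quot;Q1&quot;, &quot;Q2&quot;, &quot;Q3&quot;, &quot;Q4&quot;), times = 2))

# model.matrix() expands the factor into an intercept plus k - 1 = 3 dummy columns.
# The first level (Q1) is the reference category and is absorbed by the intercept.
X &lt;- model.matrix(~ quarter)

colnames(X)
X[1:4, ]&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Including a fourth dummy alongside the intercept would make the columns perfectly collinear, which is why one category is left out.&lt;/li&gt;
&lt;/ul&gt;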
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Seasonal dummy variables&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Seasonal dummy variables (day of the week, quarter, and so on) can likewise be specified with &lt;code&gt;season()&lt;/code&gt; inside &lt;code&gt;TSLM()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Let us look at an example using quarterly Australian beer production.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;recent_production &amp;lt;- aus_production %&amp;gt;% 
  filter(year(Quarter) &amp;gt;= 1992)

recent_production %&amp;gt;% 
  autoplot(Beer) +
  labs(y = &quot;Megalitres&quot;, title = &quot;Australian quarterly beer production&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.26.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/0YH1K/btraDzzuRNi/qfPSx74kKKSYif4WdtreB0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/0YH1K/btraDzzuRNi/qfPSx74kKKSYif4WdtreB0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/0YH1K/btraDzzuRNi/qfPSx74kKKSYif4WdtreB0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F0YH1K%2FbtraDzzuRNi%2FqfPSx74kKKSYif4WdtreB0%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.26.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;We want to forecast future beer production.&lt;/li&gt;
&lt;li&gt;We can model the data with a regression model that has a linear trend and quarterly dummy variables.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Because the data are quarterly, there are three dummy variables (4 - 1 = 3).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
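&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t} = \beta_{0} + \beta_{1}t + \beta_{2}d_{2, t} + \beta_{3}d_{3, t} + \beta_{4}d_{4, t} + \epsilon_t\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Here &lt;span class=&quot;math inline&quot;&gt;\(d_{i, t} = 1\)&lt;/span&gt; if &lt;span class=&quot;math inline&quot;&gt;\(t\)&lt;/span&gt; falls in quarter &lt;span class=&quot;math inline&quot;&gt;\(i\)&lt;/span&gt; and 0 otherwise; the first quarter is the reference category.&lt;/li&gt;
&lt;/ul&gt;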
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# trend() and season() are not standard R functions; they are special terms that only work inside a TSLM() model formula.
fit_beer &amp;lt;- recent_production %&amp;gt;% 
  model(TSLM(formula = Beer ~ trend() + season()))

fit_beer %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Series: Beer 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -42.9029  -7.5995  -0.4594   7.9908  21.7895 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept)   441.80044    3.73353 118.333  &amp;lt; 2e-16 ***
## trend()        -0.34027    0.06657  -5.111 2.73e-06 ***
## season()year2 -34.65973    3.96832  -8.734 9.10e-13 ***
## season()year3 -17.82164    4.02249  -4.430 3.45e-05 ***
## season()year4  72.79641    4.02305  18.095  &amp;lt; 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared: 0.9243,  Adjusted R-squared: 0.9199
## F-statistic: 210.7 on 4 and 69 DF, p-value: &amp;lt; 2.22e-16&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
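&lt;li&gt;The fitted model shows an average downward trend of 0.34 megalitres per quarter.&lt;/li&gt;
&lt;li&gt;On average, production in the second quarter is 34.7 megalitres lower than in the first quarter, the third quarter is 17.8 megalitres lower, and the fourth quarter is 72.8 megalitres higher.&lt;/li&gt;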
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_beer %&amp;gt;% 
  augment() %&amp;gt;% 
  ggplot(aes(x = Quarter)) +
  geom_line(aes(y = Beer, colour = &quot;Data&quot;)) +
  geom_line(aes(y = .fitted, colour = &quot;Fitted&quot;)) +
  scale_colour_manual(values = c(Data = &quot;black&quot;, Fitted = &quot;#D55E00&quot;)) +
  labs(y = &quot;Megalitres&quot;, title = &quot;Australian quarterly beer production&quot;) +
  guides(colour = guide_legend(title = &quot;Series&quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.33.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cnfLoK/btraHsfTRXv/UY1O8raKzkOKU28KTaSx41/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cnfLoK/btraHsfTRXv/UY1O8raKzkOKU28KTaSx41/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cnfLoK/btraHsfTRXv/UY1O8raKzkOKU28KTaSx41/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcnfLoK%2FbtraHsfTRXv%2FUY1O8raKzkOKU28KTaSx41%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.33.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;We can also plot actual beer production against the fitted values, coloured by quarter, as shown below.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_beer %&amp;gt;% 
  augment() %&amp;gt;% 
  ggplot(aes(x = Beer, y = .fitted, colour = factor(quarter(Quarter)))) +
  geom_point(size = 1.2) +
  labs(x = &quot;Actual values&quot;, y = &quot;Fitted&quot;, title = &quot;Australian quarterly beer production&quot;) +
  geom_abline(intercept = 0, slope = 1) +
  guides(colour = guide_legend(title = &quot;Quarter&quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.41.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/diYdjr/btraFEHUzIr/MBol8lkQxLVfB5Kc7s4i4k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/diYdjr/btraFEHUzIr/MBol8lkQxLVfB5Kc7s4i4k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/diYdjr/btraFEHUzIr/MBol8lkQxLVfB5Kc7s4i4k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdiYdjr%2FbtraFEHUzIr%2FMBol8lkQxLVfB5Kc7s4i4k%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.41.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Intervention variables&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
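&lt;li&gt;It is often necessary to model interventions that may have affected the response variable.&lt;/li&gt;
&lt;li&gt;For example, if we assume a difference before and after a particular promotion, we can add a dummy-like step variable that is 0 before the promotion and 1 afterwards; this captures a shift in level.&lt;/li&gt;
&lt;li&gt;If the intervention changes the slope instead, it is handled with a piecewise linear trend, which bends at the time of the intervention and is therefore nonlinear.&lt;/li&gt;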
&lt;/ul&gt;
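&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A minimal base R sketch of the two kinds of intervention variable: a step variable for a level shift and a &quot;knot&quot; variable for a change of slope. The intervention time &lt;code&gt;tau&lt;/code&gt;, the variable names and the simulated series are all assumptions made for this illustration.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;set.seed(7)
time_idx &lt;- 1:60
tau      &lt;- 30                         # hypothetical intervention time

step &lt;- as.numeric(time_idx &gt;= tau)    # 0 before, 1 from tau onwards: level shift
knot &lt;- pmax(0, time_idx - tau)        # 0 up to tau, then 1, 2, 3, ...: change of slope

# Simulated series with a level shift of 5 and a slope change of 2 at tau.
y &lt;- 10 + 0.5 * time_idx + 5 * step + 2 * knot +
  rnorm(length(time_idx), sd = 0.5)

# The coefficients on step and knot estimate the level shift and the slope change.
fit &lt;- lm(y ~ time_idx + step + knot)
round(coef(fit), 2)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The model stays linear in its coefficients, so it can still be fitted by least squares even though the fitted trend is a bent line.&lt;/li&gt;
&lt;/ul&gt;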
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Fourier series&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;An alternative to seasonal dummy variables, especially for long seasonal periods, is to use Fourier terms.&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\(m\)&lt;/span&gt; = seasonal period
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\(x_{1, t} = \sin\big(\frac{2\pi t}{m}\big)\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\(x_{2, t} = \cos\big(\frac{2\pi t}{m}\big)\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\(x_{3, t} = \sin\big(\frac{4\pi t}{m}\big)\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\(x_{4, t} = \cos\big(\frac{4\pi t}{m}\big)\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\(x_{5, t} = \sin\big(\frac{6\pi t}{m}\big)\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\(x_{6, t} = \cos\big(\frac{6\pi t}{m}\big)\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&amp;hellip;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;With Fourier terms, fewer predictors are often needed than with dummy variables.&lt;/li&gt;
&lt;li&gt;This is useful, for example, when the series is weekly (&lt;span class=&quot;math inline&quot;&gt;\(m \approx 52\)&lt;/span&gt;).&lt;/li&gt;
&lt;li&gt;Fourier terms are specified with the &lt;code&gt;fourier()&lt;/code&gt; function.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The &lt;code&gt;K&lt;/code&gt; argument specifies how many pairs of sin and cos terms to include.&lt;/li&gt;
&lt;li&gt;If &lt;span class=&quot;math inline&quot;&gt;\(m\)&lt;/span&gt; is the seasonal period, the maximum allowed value of &lt;span class=&quot;math inline&quot;&gt;\(K\)&lt;/span&gt; is &lt;span class=&quot;math inline&quot;&gt;\(K = m/2\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;The data here are quarterly, so &lt;span class=&quot;math inline&quot;&gt;\(m=4\)&lt;/span&gt; and we use &lt;span class=&quot;math inline&quot;&gt;\(K=2\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;recent_production %&amp;gt;% 
  model(TSLM(formula = Beer ~ trend() + fourier(K = 2))) %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Series: Beer 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -42.9029  -7.5995  -0.4594   7.9908  21.7895 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept)        446.87920    2.87321 155.533  &amp;lt; 2e-16 ***
## trend()             -0.34027    0.06657  -5.111 2.73e-06 ***
## fourier(K = 2)C1_4   8.91082    2.01125   4.430 3.45e-05 ***
## fourier(K = 2)S1_4 -53.72807    2.01125 -26.714  &amp;lt; 2e-16 ***
## fourier(K = 2)C2_4 -13.98958    1.42256  -9.834 9.26e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared: 0.9243,  Adjusted R-squared: 0.9199
## F-statistic: 210.7 on 4 and 69 DF, p-value: &amp;lt; 2.22e-16&lt;/code&gt;&lt;/pre&gt;
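&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The output above lists only three Fourier coefficients even though &lt;code&gt;K = 2&lt;/code&gt; nominally asks for two sin/cos pairs. The base R sketch below suggests why. The helper &lt;code&gt;fourier_terms()&lt;/code&gt; is hypothetical and written only for illustration (it is not the function used by &lt;code&gt;fable&lt;/code&gt;); its column names simply mimic the labels in the output. With &lt;span class=&quot;math inline&quot;&gt;\(m = 4\)&lt;/span&gt;, the term &lt;span class=&quot;math inline&quot;&gt;\(\sin(4\pi t/m) = \sin(\pi t)\)&lt;/span&gt; is zero for every integer &lt;span class=&quot;math inline&quot;&gt;\(t\)&lt;/span&gt;, so it carries no information.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Hypothetical helper (not part of fable): K pairs of sin/cos regressors for period m
fourier_terms &amp;lt;- function(t, m, K) {
  cols &amp;lt;- list()
  for (k in seq_len(K)) {
    cols[[paste0(&quot;S&quot;, k, &quot;_&quot;, m)]] &amp;lt;- sin(2 * pi * k * t / m)
    cols[[paste0(&quot;C&quot;, k, &quot;_&quot;, m)]] &amp;lt;- cos(2 * pi * k * t / m)
  }
  as.data.frame(cols)
}

# Two years of quarterly time indices
X &amp;lt;- fourier_terms(t = 1:8, m = 4, K = 2)
round(X, 3)

# The second sine term is zero (up to floating-point noise) at every integer t
max(abs(X[[&quot;S2_4&quot;]]))&lt;/code&gt;&lt;/pre&gt;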
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Selecting predictors&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
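&lt;li&gt;After fitting a regression model, a common way to select predictors is to keep those whose p-values fall below a chosen significance level.&lt;/li&gt;
&lt;li&gt;However, statistical significance based on p-values is not always a reliable guide.&lt;/li&gt;
&lt;li&gt;When two or more predictors are correlated with each other, judging by p-values can be misleading.&lt;/li&gt;
&lt;li&gt;To reduce this risk, we can instead evaluate the final model, after the predictors have been screened, using measures of predictive accuracy.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;glance()&lt;/code&gt; function returns these measures.&lt;/li&gt;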
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_consMR %&amp;gt;% 
  glance() %&amp;gt;% 
  select(adj_r_squared, CV, AIC, AICc, BIC)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 1 x 5
##   adj_r_squared    CV   AIC  AICc   BIC
##           &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;
## 1         0.763 0.104 -457. -456. -437.&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Adjusted R-squared&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The ordinary &lt;span class=&quot;math inline&quot;&gt;\(R^2\)&lt;/span&gt; measures how well the model fits the historical data, not how well it forecasts the future.&lt;/li&gt;
&lt;li&gt;It also ignores degrees of freedom, so &lt;span class=&quot;math inline&quot;&gt;\(R^2\)&lt;/span&gt; does not decrease, and usually increases, as more predictors are added.&lt;/li&gt;
&lt;li&gt;Adjusted R-squared corrects for this by taking degrees of freedom into account.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In Korean it is called 조정된 결정계수 or 수정된 결정계수.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Maximizing this value is equivalent to minimizing the standard error of the model's error term, &lt;span class=&quot;math inline&quot;&gt;\(\sigma_{e}\)&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
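&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\bar{R}^2 = 1 - (1 - R^2)\frac{T-1}{T-k-1}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The formula above is the usual textbook definition, added here for reference, with &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt; observations and &lt;span class=&quot;math inline&quot;&gt;\(k\)&lt;/span&gt; predictors.&lt;/li&gt;
&lt;li&gt;As a quick check against the beer regression reported earlier (&lt;span class=&quot;math inline&quot;&gt;\(R^2 = 0.9243\)&lt;/span&gt;, 4 predictors and 69 residual degrees of freedom, hence &lt;span class=&quot;math inline&quot;&gt;\(T = 74\)&lt;/span&gt;): &lt;span class=&quot;math inline&quot;&gt;\(1 - 0.0757 \times 73/69 \approx 0.9199\)&lt;/span&gt;, which agrees with the Adjusted R-squared in that output.&lt;/li&gt;
&lt;/ul&gt;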
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Cross-validation&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A full explanation would be lengthy, so for background see the Korean Wikipedia article on &lt;a href=&quot;https://ko.wikipedia.org/wiki/%EA%B5%90%EC%B0%A8%ED%83%80%EB%8B%B9%EB%8F%84&quot;&gt;cross-validation&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;For time series too, cross-validation is a common way to evaluate a model's predictive ability.&lt;/li&gt;
&lt;li&gt;The smaller the CV value, the better the model.&lt;/li&gt;
&lt;/ul&gt;
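&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For linear regression, leave-one-out CV does not require refitting the model &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt; times: it can be computed from the residuals and the leverages (hat values) of a single fit. The base R sketch below checks this on simulated data; the data and variable names are made up for illustration, and it is an assumption here that the &lt;code&gt;CV&lt;/code&gt; column from &lt;code&gt;glance()&lt;/code&gt; is this leave-one-out statistic.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Simulated data (illustrative only)
set.seed(1)
n &amp;lt;- 40
x &amp;lt;- rnorm(n)
dat &amp;lt;- data.frame(x = x, y = 2 + 3 * x + rnorm(n))
fit &amp;lt;- lm(y ~ x, data = dat)

# Shortcut: leave-one-out CV from residuals and leverages of the full fit
cv_shortcut &amp;lt;- mean((residuals(fit) / (1 - hatvalues(fit)))^2)

# Brute force: refit without observation i, then predict observation i
cv_brute &amp;lt;- mean(sapply(seq_len(n), function(i) {
  fit_i &amp;lt;- lm(y ~ x, data = dat[-i, ])
  (dat[i, &quot;y&quot;] - predict(fit_i, newdata = dat[i, ]))^2
}))

# The two versions should agree up to floating-point error
all.equal(cv_shortcut, cv_brute)&lt;/code&gt;&lt;/pre&gt;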
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Akaike&amp;rsquo;s Information Criterion&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\text{AIC} = T \log\bigg(\frac{\text{SSE}}{T}\bigg) + 2(k+2)\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt; is the number of observations and &lt;span class=&quot;math inline&quot;&gt;\(k\)&lt;/span&gt; is the number of predictors.&lt;/li&gt;
&lt;li&gt;As with CV, the model with the smallest AIC is preferred.&lt;/li&gt;
&lt;li&gt;For large &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt;, minimizing the AIC is equivalent to minimizing the CV value.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Corrected Akaike&amp;rsquo;s Information Criterion&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;When the number of observations &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt; is small, the AIC tends to select too many predictors, so a bias-corrected version is defined as follows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\text{AIC}_{c} = \text{AIC} + \frac{2(k+2)(k+3)}{T-k-3}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Again, smaller is better.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Schwarz&amp;rsquo;s Bayesian Information Criterion&lt;/h3&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[\text{BIC} = T\log\bigg(\frac{\text{SSE}}{T}\bigg) + (k+2)\log(T)\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
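&lt;li&gt;As with the AIC, the model that minimizes the BIC is preferred. Its penalty, &lt;span class=&quot;math inline&quot;&gt;\((k+2)\log(T)\)&lt;/span&gt;, is heavier than the AIC penalty whenever &lt;span class=&quot;math inline&quot;&gt;\(\log(T) &amp;gt; 2\)&lt;/span&gt;.&lt;/li&gt;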
&lt;/ul&gt;
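&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The three criteria above can be computed directly from the sum of squared errors. The base R sketch below applies the formulas to a regression on simulated data; the data and variable names are made up for illustration and this is not how &lt;code&gt;glance()&lt;/code&gt; is implemented.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Simulated data with two predictors (illustrative only)
set.seed(42)
n_obs &amp;lt;- 60
dat &amp;lt;- data.frame(x1 = rnorm(n_obs), x2 = rnorm(n_obs))
dat[[&quot;y&quot;]] &amp;lt;- 1 + 0.5 * dat[[&quot;x1&quot;]] - 0.8 * dat[[&quot;x2&quot;]] + rnorm(n_obs)

fit &amp;lt;- lm(y ~ x1 + x2, data = dat)
T_obs &amp;lt;- nrow(dat)              # T: number of observations
k &amp;lt;- length(coef(fit)) - 1      # k: number of predictors (intercept excluded)
SSE &amp;lt;- sum(residuals(fit)^2)    # sum of squared errors

aic  &amp;lt;- T_obs * log(SSE / T_obs) + 2 * (k + 2)
aicc &amp;lt;- aic + 2 * (k + 2) * (k + 3) / (T_obs - k - 3)
bic  &amp;lt;- T_obs * log(SSE / T_obs) + (k + 2) * log(T_obs)

c(AIC = aic, AICc = aicc, BIC = bic)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;These values differ from base R's &lt;code&gt;AIC()&lt;/code&gt; and &lt;code&gt;BIC()&lt;/code&gt; by an additive constant that depends only on &lt;span class=&quot;math inline&quot;&gt;\(T\)&lt;/span&gt;, so rankings of models fitted to the same data are unaffected.&lt;/li&gt;
&lt;/ul&gt;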
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Which measure should we use?&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The &lt;a href=&quot;https://otexts.com/fpp3/selecting-predictors.html#which-measure-should-we-use&quot;&gt;reference text&lt;/a&gt; recommends using one of the AICc, AIC, or CV statistics.&lt;/li&gt;
&lt;li&gt;It is short, so it is well worth reading.&lt;/li&gt;
&lt;li&gt;Most of the examples here select the forecasting model by AICc.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Forecasting with regression&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Forecasts can be produced by applying &lt;code&gt;forecast()&lt;/code&gt; to a model object fitted with &lt;code&gt;TSLM()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_beer %&amp;gt;% 
  report()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Series: Beer 
## Model: TSLM 
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -42.9029  -7.5995  -0.4594   7.9908  21.7895 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(&amp;gt;|t|)    
## (Intercept)   441.80044    3.73353 118.333  &amp;lt; 2e-16 ***
## trend()        -0.34027    0.06657  -5.111 2.73e-06 ***
## season()year2 -34.65973    3.96832  -8.734 9.10e-13 ***
## season()year3 -17.82164    4.02249  -4.430 3.45e-05 ***
## season()year4  72.79641    4.02305  18.095  &amp;lt; 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.23 on 69 degrees of freedom
## Multiple R-squared: 0.9243,  Adjusted R-squared: 0.9199
## F-statistic: 210.7 on 4 and 69 DF, p-value: &amp;lt; 2.22e-16&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fc_beer &amp;lt;- fit_beer %&amp;gt;% 
  forecast()

fc_beer&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A fable: 8 x 4 [1Q]
## # Key:     .model [1]
##   .model                                    Quarter        Beer .mean
##   &amp;lt;chr&amp;gt;                                       &amp;lt;qtr&amp;gt;      &amp;lt;dist&amp;gt; &amp;lt;dbl&amp;gt;
## 1 TSLM(formula = Beer ~ trend() + season()) 2010 Q3 N(398, 164)  398.
## 2 TSLM(formula = Beer ~ trend() + season()) 2010 Q4 N(489, 164)  489.
## 3 TSLM(formula = Beer ~ trend() + season()) 2011 Q1 N(416, 165)  416.
## 4 TSLM(formula = Beer ~ trend() + season()) 2011 Q2 N(381, 165)  381.
## 5 TSLM(formula = Beer ~ trend() + season()) 2011 Q3 N(397, 166)  397.
## 6 TSLM(formula = Beer ~ trend() + season()) 2011 Q4 N(487, 166)  487.
## 7 TSLM(formula = Beer ~ trend() + season()) 2012 Q1 N(414, 166)  414.
## 8 TSLM(formula = Beer ~ trend() + season()) 2012 Q2 N(379, 166)  379.&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The forecasts are plotted below.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The darker shaded band is the 80% prediction interval and the lighter band is the 95% prediction interval.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fc_beer %&amp;gt;% 
  autoplot(recent_production) +
  labs(y = &quot;megalitres&quot;, title = &quot;Forecasts of beer production using regression&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.51.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/b4QGnm/btraKEUgPW5/jBOB5rVhz0xQwuSn5vVb7k/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/b4QGnm/btraKEUgPW5/jBOB5rVhz0xQwuSn5vVb7k/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/b4QGnm/btraKEUgPW5/jBOB5rVhz0xQwuSn5vVb7k/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fb4QGnm%2FbtraKEUgPW5%2FjBOB5rVhz0xQwuSn5vVb7k%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.51.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
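&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The shaded bands can be reproduced by hand from the &lt;code&gt;Beer&lt;/code&gt; column of the forecast table. The base R sketch below takes the first row, &lt;code&gt;N(398, 164)&lt;/code&gt;, and assumes the second parameter is a variance rather than a standard deviation. The numbers are the rounded values printed in the table, so the result is only approximate.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Values read off the first forecast row: N(398, 164)
mu &amp;lt;- 398
sigma &amp;lt;- sqrt(164)   # assumption: 164 is the forecast variance

# Symmetric normal intervals: mean +/- quantile * standard deviation
pi80 &amp;lt;- mu + c(-1, 1) * qnorm(0.90) * sigma
pi95 &amp;lt;- mu + c(-1, 1) * qnorm(0.975) * sigma

round(rbind(pi80 = pi80, pi95 = pi95), 1)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Within the &lt;code&gt;fable&lt;/code&gt; workflow, &lt;code&gt;hilo()&lt;/code&gt; is the usual way to extract such intervals from a forecast object.&lt;/li&gt;
&lt;/ul&gt;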
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Next, we try scenario-based forecasting with a different dataset, &lt;code&gt;us_change&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;us_change&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 198 x 6 [1Q]
##    Quarter Consumption Income Production Savings Unemployment
##      &amp;lt;qtr&amp;gt;       &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
##  1 1970 Q1       0.619  1.04      -2.45    5.30         0.9  
##  2 1970 Q2       0.452  1.23      -0.551   7.79         0.5  
##  3 1970 Q3       0.873  1.59      -0.359   7.40         0.5  
##  4 1970 Q4      -0.272 -0.240     -2.19    1.17         0.700
##  5 1971 Q1       1.90   1.98       1.91    3.54        -0.100
##  6 1971 Q2       0.915  1.45       0.902   5.87        -0.100
##  7 1971 Q3       0.794  0.521      0.308  -0.406        0.100
##  8 1971 Q4       1.65   1.16       2.29   -1.49         0    
##  9 1972 Q1       1.31   0.457      4.15   -4.29        -0.200
## 10 1972 Q2       1.89   1.03       1.89   -4.69        -0.100
## # &amp;hellip; with 188 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The scenarios are set up as follows.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;With no change in the unemployment rate (Unemployment), income (Income) and savings (Savings) either increase by 1% and 0.5%, respectively, or decrease by the same amounts.&lt;/li&gt;
&lt;li&gt;The scenarios for forecasting can be assembled with the &lt;code&gt;scenarios()&lt;/code&gt; function.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;new_data()&lt;/code&gt; function in the &lt;code&gt;tsibble&lt;/code&gt; package generates the requested number of future time points for each key-index combination.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Fit the regression model
fit_consBest &amp;lt;- us_change %&amp;gt;% 
  model(lm = TSLM(formula = Consumption ~ Income + Savings + Unemployment))

# Build the scenarios
future_scenarios &amp;lt;- scenarios(
  Increase = new_data(us_change, 4) %&amp;gt;% 
    mutate(
      Income = 1,
      Savings = 0.5,
      Unemployment = 0
    ),
  Decrease = new_data(us_change, 4) %&amp;gt;% 
    mutate(
      Income = -1,
      Savings = -0.5,
      Unemployment = 0
    ),
  names_to = &quot;Scenario&quot;
)

future_scenarios&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## $Increase
## # A tsibble: 4 x 4 [1Q]
##   Quarter Income Savings Unemployment
##     &amp;lt;qtr&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
## 1 2019 Q3      1     0.5            0
## 2 2019 Q4      1     0.5            0
## 3 2020 Q1      1     0.5            0
## 4 2020 Q2      1     0.5            0
## 
## $Decrease
## # A tsibble: 4 x 4 [1Q]
##   Quarter Income Savings Unemployment
##     &amp;lt;qtr&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
## 1 2019 Q3     -1    -0.5            0
## 2 2019 Q4     -1    -0.5            0
## 3 2020 Q1     -1    -0.5            0
## 4 2020 Q2     -1    -0.5            0
## 
## attr(,&quot;names_to&quot;)
## [1] &quot;Scenario&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;These scenarios are passed to the &lt;code&gt;new_data&lt;/code&gt; argument of &lt;code&gt;forecast()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fc &amp;lt;- fit_consBest %&amp;gt;% 
  forecast(new_data = future_scenarios)

fc&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A fable: 8 x 8 [1Q]
## # Key:     Scenario, .model [2]
##   Scenario .model Quarter   Consumption  .mean Income Savings Unemployment
##   &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;    &amp;lt;qtr&amp;gt;        &amp;lt;dist&amp;gt;  &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
## 1 Increase lm     2019 Q3   N(1, 0.098)  0.996      1     0.5            0
## 2 Increase lm     2019 Q4   N(1, 0.098)  0.996      1     0.5            0
## 3 Increase lm     2020 Q1   N(1, 0.098)  0.996      1     0.5            0
## 4 Increase lm     2020 Q2   N(1, 0.098)  0.996      1     0.5            0
## 5 Decrease lm     2019 Q3 N(-0.46, 0.1) -0.464     -1    -0.5            0
## 6 Decrease lm     2019 Q4 N(-0.46, 0.1) -0.464     -1    -0.5            0
## 7 Decrease lm     2020 Q1 N(-0.46, 0.1) -0.464     -1    -0.5            0
## 8 Decrease lm     2020 Q2 N(-0.46, 0.1) -0.464     -1    -0.5            0&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Passing the scenario forecasts to &lt;code&gt;autolayer()&lt;/code&gt; overlays them on the plot of the original series, as shown below.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;us_change %&amp;gt;% 
  autoplot(Consumption) +
  autolayer(fc) +
  labs(title = &quot;US consumption&quot;, y = &quot;% change&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.58.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dTSJWS/btraPuQxyP4/zY9ikj0hg9WCmA464tfIhk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dTSJWS/btraPuQxyP4/zY9ikj0hg9WCmA464tfIhk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dTSJWS/btraPuQxyP4/zY9ikj0hg9WCmA464tfIhk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdTSJWS%2FbtraPuQxyP4%2FzY9ikj0hg9WCmA464tfIhk%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.21.58.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Nonlinear regression&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The general form of the model is as follows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;span class=&quot;math display&quot;&gt;\[y_{t} = f(x_{t})+\epsilon_{t}\]&lt;/span&gt;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Here the function &lt;span class=&quot;math inline&quot;&gt;\(f()\)&lt;/span&gt; is taken to be nonlinear.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;One of the simplest choices for &lt;span class=&quot;math inline&quot;&gt;\(f()\)&lt;/span&gt; is a &lt;a href=&quot;https://en.wikipedia.org/wiki/Piecewise_linear_function&quot;&gt;piecewise linear&lt;/a&gt; function.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://en.wikipedia.org/wiki/Multivariate_adaptive_regression_spline&quot;&gt;Regression splines&lt;/a&gt; are another option.&lt;/li&gt;
&lt;li&gt;Other options include polynomial terms of degree two or higher, or a natural log transform (when all values are strictly positive).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Consider the following example.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;boston_men &amp;lt;- boston_marathon %&amp;gt;% 
  filter(Year &amp;gt;= 1924 &amp;amp; Event == &quot;Men's open division&quot;) %&amp;gt;% 
  mutate(Minutes = as.numeric(Time)/60)

boston_men&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;shell&quot;&gt;&lt;code&gt;## # A tsibble: 96 x 6 [1Y]
## # Key:       Event [1]
##    Event              Year Champion               Country      Time      Minutes
##    &amp;lt;fct&amp;gt;             &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;                  &amp;lt;chr&amp;gt;        &amp;lt;drtn&amp;gt;      &amp;lt;dbl&amp;gt;
##  1 Men's open divis&amp;hellip;  1924 Clarence H. DeMar      United Stat&amp;hellip;  8980 se&amp;hellip;    150.
##  2 Men's open divis&amp;hellip;  1925 Charles L. (Chuck) Me&amp;hellip; United Stat&amp;hellip;  9180 se&amp;hellip;    153 
##  3 Men's open divis&amp;hellip;  1926 John C. Miles          Canada        8740 se&amp;hellip;    146.
##  4 Men's open divis&amp;hellip;  1927 Clarence H. DeMar      United Stat&amp;hellip;  9622 se&amp;hellip;    160.
##  5 Men's open divis&amp;hellip;  1928 Clarence H. DeMar      United Stat&amp;hellip;  9427 se&amp;hellip;    157.
##  6 Men's open divis&amp;hellip;  1929 John C. Miles          Canada        9188 se&amp;hellip;    153.
##  7 Men's open divis&amp;hellip;  1930 Clarence H. DeMar      United Stat&amp;hellip;  9288 se&amp;hellip;    155.
##  8 Men's open divis&amp;hellip;  1931 James P. Henigan       United Stat&amp;hellip; 10005 se&amp;hellip;    167.
##  9 Men's open divis&amp;hellip;  1932 Paul de Bruyn          Germany       9216 se&amp;hellip;    154.
## 10 Men's open divis&amp;hellip;  1933 Leslie S. Pawson       United Stat&amp;hellip;  9061 se&amp;hellip;    151.
## # &amp;hellip; with 86 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;boston_men %&amp;gt;% 
  autoplot(Minutes) +
  geom_smooth(formula = y ~ x, se = FALSE, method = &quot;lm&quot;) +
  labs(x = &quot;Year&quot;, y = &quot;Minutes&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.22.07.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bufVXU/btraLbKQOBg/hzHjoLNfz254MIXICH1SqK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bufVXU/btraLbKQOBg/hzHjoLNfz254MIXICH1SqK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bufVXU/btraLbKQOBg/hzHjoLNfz254MIXICH1SqK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbufVXU%2FbtraLbKQOBg%2FhzHjoLNfz254MIXICH1SqK%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.22.07.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;boston_men %&amp;gt;% 
  model(TSLM(formula = Minutes ~ trend())) %&amp;gt;% 
  gg_tsresiduals()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.22.13.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/nRfkE/btraNJNYCFo/IEuw9TSA1Gabbxi6lK3mVK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/nRfkE/btraNJNYCFo/IEuw9TSA1Gabbxi6lK3mVK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/nRfkE/btraNJNYCFo/IEuw9TSA1Gabbxi6lK3mVK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FnRfkE%2FbtraNJNYCFo%2FIEuw9TSA1Gabbxi6lK3mVK%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.22.13.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
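&lt;li&gt;The plot above suggests a roughly linear downward trend over time,&lt;br /&gt;but the residuals from the linear trend show a nonlinear pattern.&lt;/li&gt;
&lt;li&gt;So the code below also fits an exponential trend (via a log transform) and a piecewise linear trend, and compares them with the linear trend.&lt;/li&gt;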
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;fit_trends &amp;lt;- boston_men %&amp;gt;% 
  model(
    linear = TSLM(formula = Minutes ~ trend()),
    exponential = TSLM(formula = log(Minutes) ~ trend()),
    piecewise = TSLM(formula = Minutes ~ trend(knots = c(1950, 1980)))
  )

fc_trends &amp;lt;- fit_trends %&amp;gt;% 
  forecast(h = 10)

boston_men %&amp;gt;% 
  autoplot(Minutes) +
  geom_line(data = fit_trends %&amp;gt;% fitted(), aes(y = .fitted, colour = .model)) +
  autolayer(fc_trends, alpha = 0.5, level = 95) +
  labs(y = &quot;Minutes&quot;, title = &quot;Boston marathon winning times&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.22.19.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/4hi7K/btraGFmflFK/9ZjNqhfUftJFvJgabmGfz1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/4hi7K/btraGFmflFK/9ZjNqhfUftJFvJgabmGfz1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/4hi7K/btraGFmflFK/9ZjNqhfUftJFvJgabmGfz1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F4hi7K%2FbtraGFmflFK%2F9ZjNqhfUftJFvJgabmGfz1%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-29 오전 11.22.19.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The piecewise linear trend appears to give the best forecasts.&lt;/li&gt;
&lt;/ul&gt;
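&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A piecewise linear trend is still an ordinary linear regression: each knot &lt;span class=&quot;math inline&quot;&gt;\(c\)&lt;/span&gt; adds a regressor &lt;span class=&quot;math inline&quot;&gt;\((t - c)_{+} = \max(0, t - c)\)&lt;/span&gt;, whose coefficient is the change in slope at that knot. The base R sketch below builds these terms by hand on simulated data. The data are made up (this is not the &lt;code&gt;boston_marathon&lt;/code&gt; series) and the code is not claimed to mirror how &lt;code&gt;TSLM()&lt;/code&gt; handles &lt;code&gt;knots&lt;/code&gt; internally.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Simulated series with slope changes at 1950 and 1980 (illustrative only)
set.seed(7)
year &amp;lt;- 1924:2019
minutes &amp;lt;- 160 - 0.1 * (year - 1924) -
  0.5 * pmax(0, year - 1950) + 0.55 * pmax(0, year - 1980) +
  rnorm(length(year), sd = 2)

knots &amp;lt;- c(1950, 1980)
dat &amp;lt;- data.frame(
  minutes = minutes,
  t  = year,
  k1 = pmax(0, year - knots[1]),   # zero before the first knot, linear after it
  k2 = pmax(0, year - knots[2])    # zero before the second knot, linear after it
)

# Slope is b_t before 1950, b_t + b_k1 until 1980, and b_t + b_k1 + b_k2 afterwards
fit_pw &amp;lt;- lm(minutes ~ t + k1 + k2, data = dat)
coef(fit_pw)&lt;/code&gt;&lt;/pre&gt;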
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;</description>
      <category>time-series (tidy approach)</category>
      <category>Augment</category>
      <category>autoplot</category>
      <category>fpp3</category>
      <category>GGally</category>
      <category>ggpairs</category>
      <category>regression</category>
      <category>tidyverts</category>
      <category>time-series</category>
      <category>trend</category>
      <category>TSLM</category>
      <author>택2</author>
      <guid isPermaLink="true">https://rstatistics.tistory.com/78</guid>
      <comments>https://rstatistics.tistory.com/78#entry78comment</comments>
      <pubDate>Thu, 29 Jul 2021 11:50:42 +0900</pubDate>
    </item>
    <item>
      <title>[R] 4. feasts</title>
      <link>https://rstatistics.tistory.com/77</link>
      <description>&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;code&gt;feasts&lt;/code&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;feasts&lt;/code&gt; stands for Feature Extraction And Statistics for Time Series (FEASTS).&lt;/li&gt;
&lt;li&gt;It is a package that provides a range of functions for analysing time series data.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Time series decomposition, feature extraction, visualization, and so on&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Graphics: &lt;code&gt;gg_season()&lt;/code&gt;, &lt;code&gt;gg_subseries()&lt;/code&gt;, &lt;code&gt;gg_lag()&lt;/code&gt;, &lt;code&gt;ACF()&lt;/code&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The first step in understanding the patterns in a time series is to visualize it.&lt;/li&gt;
&lt;li&gt;To begin, &lt;code&gt;gg_season()&lt;/code&gt; can be used to check for seasonality.&lt;/li&gt;
&lt;li&gt;As an example, we use the &lt;code&gt;aus_production&lt;/code&gt; data from the &lt;code&gt;tsibbledata&lt;/code&gt; package.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The data contain quarterly production estimates for several commodities in Australia, such as beer and tobacco.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_production %&amp;gt;% 
  gg_season(Beer)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;670&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.51.33.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/ermgXY/btrazuE5uAE/dE2PZtLMR2dhkPo2KeXxrK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/ermgXY/btrazuE5uAE/dE2PZtLMR2dhkPo2KeXxrK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/ermgXY/btrazuE5uAE/dE2PZtLMR2dhkPo2KeXxrK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FermgXY%2FbtrazuE5uAE%2FdE2PZtLMR2dhkPo2KeXxrK%2Fimg.png&quot; data-origin-width=&quot;670&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.51.33.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;gg_subseries()&lt;/code&gt; plots each season of the series in its own panel.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_production %&amp;gt;% 
  gg_subseries(Beer)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;670&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.51.48.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/beU1S5/btraBT53Sfd/eBj7rG6d3DXoxmkcsBzLgk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/beU1S5/btraBT53Sfd/eBj7rG6d3DXoxmkcsBzLgk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/beU1S5/btraBT53Sfd/eBj7rG6d3DXoxmkcsBzLgk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbeU1S5%2FbtraBT53Sfd%2FeBj7rG6d3DXoxmkcsBzLgk%2Fimg.png&quot; data-origin-width=&quot;670&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.51.48.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;gg_lag()&lt;/code&gt; draws scatterplots of the series against its lagged values, coloured by season.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_production %&amp;gt;% 
  filter(year(Quarter) &amp;gt; 1991) %&amp;gt;% 
  gg_lag(Beer, geom = &quot;point&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;497&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.51.55.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dciAPx/btraJoQF7fw/DwBWXZ0Tn0zg9agcQFp5z0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dciAPx/btraJoQF7fw/DwBWXZ0Tn0zg9agcQFp5z0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dciAPx/btraJoQF7fw/DwBWXZ0Tn0zg9agcQFp5z0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdciAPx%2FbtraJoQF7fw%2FDwBWXZ0Tn0zg9agcQFp5z0%2Fimg.png&quot; data-origin-width=&quot;497&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.51.55.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Because the data are quarterly, the panels for lag 4 and lag 8 show a clear linear relationship between the lagged values (x-axis) and the original series (y-axis) in every season.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
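&lt;li&gt;The ACF (autocorrelation function) can be computed with &lt;code&gt;ACF()&lt;/code&gt; and plotted with &lt;code&gt;autoplot()&lt;/code&gt;.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The autocorrelation at lag &lt;span class=&quot;math inline&quot;&gt;\(k\)&lt;/span&gt; can be understood as the correlation coefficient between the series at time &lt;span class=&quot;math inline&quot;&gt;\(i\)&lt;/span&gt; and at time &lt;span class=&quot;math inline&quot;&gt;\(i+k\)&lt;/span&gt;.&lt;/li&gt;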
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_production %&amp;gt;% 
  ACF()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Response variable not specified, automatically selected `var = Beer`&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 23 x 2 [1Q]
##      lag   acf
##    &amp;lt;lag&amp;gt; &amp;lt;dbl&amp;gt;
##  1    1Q 0.684
##  2    2Q 0.500
##  3    3Q 0.667
##  4    4Q 0.940
##  5    5Q 0.644
##  6    6Q 0.458
##  7    7Q 0.621
##  8    8Q 0.887
##  9    9Q 0.598
## 10   10Q 0.410
## # &amp;hellip; with 13 more rows&lt;/code&gt;&lt;/pre&gt;
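&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As a rough check on the definition, the lag-4 autocorrelation can be computed by hand. This is a minimal sketch that assumes &lt;code&gt;Beer&lt;/code&gt; has no missing values and that &lt;code&gt;ACF()&lt;/code&gt; uses the usual sample autocorrelation formula. The names &lt;code&gt;beer&lt;/code&gt;, &lt;code&gt;k&lt;/code&gt; and &lt;code&gt;r_k&lt;/code&gt; are illustrative only. The result should be close to the 4Q row in the table above.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)

beer &amp;lt;- aus_production$Beer   # quarterly beer production as a plain numeric vector
k &amp;lt;- 4                        # a lag of one year for quarterly data
n &amp;lt;- length(beer)
beer_mean &amp;lt;- mean(beer)

# Sample autocorrelation at lag k:
# sum of lagged cross-products divided by the total sum of squares
r_k &amp;lt;- sum((beer[(k + 1):n] - beer_mean) * (beer[1:(n - k)] - beer_mean)) /
  sum((beer - beer_mean)^2)
r_k&lt;/code&gt;&lt;/pre&gt;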
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_production %&amp;gt;% 
  ACF() %&amp;gt;% 
  autoplot()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Response variable not specified, automatically selected `var = Beer`&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.52.05.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/buOzko/btraLbjefoo/YfVby4EmE8iHKZTBEGhmmk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/buOzko/btraLbjefoo/YfVby4EmE8iHKZTBEGhmmk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/buOzko/btraLbjefoo/YfVby4EmE8iHKZTBEGhmmk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbuOzko%2FbtraLbjefoo%2FYfVby4EmE8iHKZTBEGhmmk%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.52.05.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Decompositions&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
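&lt;li&gt;Time series decomposition is one of the most common tasks in time series analysis.&lt;/li&gt;
&lt;li&gt;It helps in understanding the patterns in a series and can make later forecasting models more precise.&lt;/li&gt;
&lt;li&gt;In short, it is a common preprocessing step used to bring out the patterns in a series more clearly and to improve forecasting performance.&lt;/li&gt;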
&lt;/ul&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Decompositions: Classical decomposition&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
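&lt;li&gt;Depending on the approach, classical decomposition is either additive or multiplicative.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Additive decomposition usually fits when the size of the seasonal variation stays roughly constant regardless of the trend.&lt;/li&gt;
&lt;li&gt;Multiplicative decomposition fits when the size of the seasonal variation changes with the level of the trend.&lt;/li&gt;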
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;This post only gives a brief outline of how it works.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
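&lt;li&gt;Step 1: Estimate the trend with a moving average that spans one seasonal period (a 2×m-MA when the period m is even, as with m = 4 for quarterly data, and an m-MA when m is odd).&lt;/li&gt;
&lt;li&gt;Step 2: Divide the original series by the estimated trend to obtain a detrended series.&lt;/li&gt;
&lt;li&gt;Step 3: Average the detrended values for each season (for example, all Q1 values) to get a first estimate of the seasonal effect.&lt;/li&gt;
&lt;li&gt;Step 4: Rescale these seasonal averages so that they sum to the seasonal period m. The rescaled values are the seasonal indices. Dividing the original series by the trend and the seasonal index leaves the remainder.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;The steps above describe multiplicative decomposition. The additive model follows the same steps but subtracts instead of dividing, and the seasonal indices are adjusted to sum to zero.&lt;/li&gt;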
&lt;/ul&gt;
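&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A multiplicative classical decomposition can be sketched by hand as follows. This is only an illustration, assuming a 2×4 moving average for the trend, division to detrend, and per-quarter averages rescaled to sum to 4. All variable names are hypothetical, and the numbers should be close to, but are not guaranteed to match, the output of &lt;code&gt;classical_decomposition(Beer, type = &quot;multiplicative&quot;)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)

y &amp;lt;- aus_production$Beer   # the series starts in 1956 Q1
m &amp;lt;- 4                     # seasonal period for quarterly data

# Trend: 2x4 moving average (half weights at both ends, centred on each quarter).
# stats::filter is called explicitly because dplyr::filter masks it.
trend &amp;lt;- as.numeric(stats::filter(y, c(0.5, rep(1, m - 1), 0.5) / m, sides = 2))

# Detrend by division (multiplicative model)
detrended &amp;lt;- y / trend

# Seasonal indices: average per quarter, rescaled so that they sum to m
season_id &amp;lt;- (seq_along(y) - 1) %% m + 1
raw_index &amp;lt;- tapply(detrended, season_id, mean, na.rm = TRUE)
seasonal_index &amp;lt;- raw_index / sum(raw_index) * m
seasonal &amp;lt;- as.numeric(seasonal_index[season_id])

# Remainder: what is left after removing trend and seasonality
remainder &amp;lt;- y / (trend * seasonal)

head(data.frame(y, trend, seasonal, remainder))&lt;/code&gt;&lt;/pre&gt;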
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The following applies additive decomposition to the example data above.&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;type = &quot;additive&quot;&lt;/code&gt; in &lt;code&gt;classical_decomposition()&lt;/code&gt; to run the decomposition.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;dcmp &amp;lt;- aus_production %&amp;gt;% 
  model(classical_decomposition(Beer, type = &quot;additive&quot;))&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The decomposed components can be retrieved with &lt;code&gt;components()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;dcmp %&amp;gt;% 
  components()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;shell&quot;&gt;&lt;code&gt;## # A dable: 218 x 7 [1Q]
## # Key:     .model [1]
## # :        Beer = trend + seasonal + random
##    .model                     Quarter  Beer trend seasonal  random season_adjust
##    &amp;lt;chr&amp;gt;                        &amp;lt;qtr&amp;gt; &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;         &amp;lt;dbl&amp;gt;
##  1 &quot;classical_decomposition(&amp;hellip; 1956 Q1   284   NA      2.13  NA              282.
##  2 &quot;classical_decomposition(&amp;hellip; 1956 Q2   213   NA    -42.5   NA              256.
##  3 &quot;classical_decomposition(&amp;hellip; 1956 Q3   227  255.   -28.5    0.256          256.
##  4 &quot;classical_decomposition(&amp;hellip; 1956 Q4   308  254.    68.9  -15.3            239.
##  5 &quot;classical_decomposition(&amp;hellip; 1957 Q1   262  257.     2.13   2.49           260.
##  6 &quot;classical_decomposition(&amp;hellip; 1957 Q2   228  260    -42.5   10.5            271.
##  7 &quot;classical_decomposition(&amp;hellip; 1957 Q3   236  263.   -28.5    1.76           265.
##  8 &quot;classical_decomposition(&amp;hellip; 1957 Q4   320  265.    68.9  -13.5            251.
##  9 &quot;classical_decomposition(&amp;hellip; 1958 Q1   272  265.     2.13   4.49           270.
## 10 &quot;classical_decomposition(&amp;hellip; 1958 Q2   233  265.   -42.5   10.9            276.
## # &amp;hellip; with 208 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Passing these components to &lt;code&gt;autoplot()&lt;/code&gt; produces the plot below.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;dcmp %&amp;gt;% 
  components() %&amp;gt;% 
  autoplot() +
  labs(title = &quot;Classical additive decomposition of quarterly production of beer in Australia&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.52.20.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dPHNQ6/btrazvYlmAk/AhKvZUuB2gQgBIwr5B2CT1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dPHNQ6/btrazvYlmAk/AhKvZUuB2gQgBIwr5B2CT1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dPHNQ6/btrazvYlmAk/AhKvZUuB2gQgBIwr5B2CT1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdPHNQ6%2FbtrazvYlmAk%2FAhKvZUuB2gQgBIwr5B2CT1%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.52.20.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Likewise, multiplicative decomposition is available by setting &lt;code&gt;type = &quot;multiplicative&quot;&lt;/code&gt; in &lt;code&gt;classical_decomposition()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
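&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For reference, a multiplicative version of the additive example might look like the sketch below. Only the &lt;code&gt;type&lt;/code&gt; argument changes, and the plot title is an arbitrary choice.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)

# Same pipeline as the additive example, with type = &quot;multiplicative&quot;
aus_production %&amp;gt;% 
  model(classical_decomposition(Beer, type = &quot;multiplicative&quot;)) %&amp;gt;% 
  components() %&amp;gt;% 
  autoplot() +
  labs(title = &quot;Classical multiplicative decomposition of quarterly production of beer in Australia&quot;)&lt;/code&gt;&lt;/pre&gt;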
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;Decompositions: STL decomposition&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;STL stands for Seasonal and Trend decomposition using Loess and is a robust method for decomposing time series.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Loess is local regression: a nonparametric method that fits smooth nonlinear relationships, unlike ordinary linear regression.&lt;/li&gt;
&lt;li&gt;The series is decomposed into seasonal (S) + trend (T) + remainder components.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Compared with other decomposition methods, STL is more flexible about seasonality: the analyst can control how quickly the seasonal component is allowed to change over time.&lt;/li&gt;
&lt;li&gt;For more detail on STL decomposition, see &lt;a href=&quot;https://otexts.com/fppkr/stl.html&quot;&gt;this chapter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The example below illustrates this.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;aus_production %&amp;gt;% 
  model(STL(formula = Beer ~ trend(window = 4) + season(window = &quot;periodic&quot;), robust = TRUE)) %&amp;gt;% 
  components() %&amp;gt;% 
  autoplot()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.52.27.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bJgRFw/btraJnRL00L/QgZJ0Xs4wmJctIkgS9qXK1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bJgRFw/btraJnRL00L/QgZJ0Xs4wmJctIkgS9qXK1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bJgRFw/btraJnRL00L/QgZJ0Xs4wmJctIkgS9qXK1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbJgRFw%2FbtraJnRL00L%2FQgZJ0Xs4wmJctIkgS9qXK1%2Fimg.png&quot; data-origin-width=&quot;667&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.52.27.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
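&lt;li&gt;As the code above shows, the trend component is estimated flexibly by giving &lt;code&gt;trend()&lt;/code&gt; a &lt;code&gt;window&lt;/code&gt; value (smaller windows follow the data more closely),&lt;br /&gt;while the seasonal component is held fixed with &lt;code&gt;window = &quot;periodic&quot;&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;See &lt;code&gt;?STL&lt;/code&gt; for the remaining options.&lt;/li&gt;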
&lt;/ul&gt;
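&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To see what the seasonal &lt;code&gt;window&lt;/code&gt; does, a fixed seasonal pattern can be compared with one that is allowed to change over time. This is a sketch only: the window values 21 and 5 and the model names &lt;code&gt;periodic&lt;/code&gt; and &lt;code&gt;changing&lt;/code&gt; are arbitrary choices for illustration, and the seasonal column is assumed to be named &lt;code&gt;season_year&lt;/code&gt; for this quarterly data.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)

# Two STL fits that differ only in the seasonal window
stl_fits &amp;lt;- aus_production %&amp;gt;% 
  model(
    periodic = STL(Beer ~ trend(window = 21) + season(window = &quot;periodic&quot;), robust = TRUE),
    changing = STL(Beer ~ trend(window = 21) + season(window = 5), robust = TRUE)
  )

# Plot the seasonal component of each fit in its own panel
stl_fits %&amp;gt;% 
  components() %&amp;gt;% 
  as_tibble() %&amp;gt;% 
  ggplot(aes(x = Quarter, y = season_year)) +
  geom_line() +
  facet_wrap(~ .model, ncol = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;With &lt;code&gt;window = &quot;periodic&quot;&lt;/code&gt; the seasonal pattern repeats identically every year, whereas a small numeric window lets its shape drift over time.&lt;/li&gt;
&lt;/ul&gt;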
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Feature extraction and statistics&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The &lt;code&gt;features()&lt;/code&gt; function extracts summary statistics, ACF-based measures and other features from a series.&lt;/li&gt;
&lt;li&gt;The examples use the &lt;code&gt;tourism&lt;/code&gt; data.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This dataset records the quarterly number of trips taken by travelers in Australia.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;First, simple statistics such as the mean and quantiles can be extracted as follows.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Pass the target variable and the summary functions to features().
tourism %&amp;gt;% 
  features(
    .var = Trips, 
    features = list(avg = mean, quantile)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 304 x 9
##    Region      State         Purpose    avg    `0%`  `25%`   `50%`  `75%` `100%`
##    &amp;lt;chr&amp;gt;       &amp;lt;chr&amp;gt;         &amp;lt;chr&amp;gt;    &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;   &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
##  1 Adelaide    South Austra&amp;hellip; Busine&amp;hellip; 156.    68.7   134.   153.    177.   242.  
##  2 Adelaide    South Austra&amp;hellip; Holiday 157.   108.    135.   154.    172.   224.  
##  3 Adelaide    South Austra&amp;hellip; Other    56.6   25.9    43.9   53.8    62.5  107.  
##  4 Adelaide    South Austra&amp;hellip; Visiti&amp;hellip; 205.   137.    179.   206.    229.   270.  
##  5 Adelaide H&amp;hellip; South Austra&amp;hellip; Busine&amp;hellip;   2.66   0       0      1.26    3.92  28.6 
##  6 Adelaide H&amp;hellip; South Austra&amp;hellip; Holiday  10.5    0       5.77   8.52   14.1   35.8 
##  7 Adelaide H&amp;hellip; South Austra&amp;hellip; Other     1.40   0       0      0.908   2.09   8.95
##  8 Adelaide H&amp;hellip; South Austra&amp;hellip; Visiti&amp;hellip;  14.2    0.778   8.91  12.2    16.8   81.1 
##  9 Alice Spri&amp;hellip; Northern Ter&amp;hellip; Busine&amp;hellip;  14.6    1.01    9.13  13.3    18.5   34.1 
## 10 Alice Spri&amp;hellip; Northern Ter&amp;hellip; Holiday  31.9    2.81   16.9   31.5    44.8   76.5 
## # &amp;hellip; with 294 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;ACF-related information is obtained with &lt;code&gt;feat_acf()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;feat_acf()&lt;/code&gt; returns the following ACF-based values.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;acf1&lt;/code&gt;: first autocorrelation coefficient of the original series&lt;/li&gt;
&lt;li&gt;&lt;code&gt;acf10&lt;/code&gt;: sum of squares of the first ten autocorrelation coefficients&lt;/li&gt;
&lt;li&gt;&lt;code&gt;diff1_acf1&lt;/code&gt;: first autocorrelation coefficient of the first-differenced series&lt;/li&gt;
&lt;li&gt;&lt;code&gt;diff1_acf10&lt;/code&gt;: sum of squares of the first ten autocorrelation coefficients of the first-differenced series&lt;/li&gt;
&lt;li&gt;&lt;code&gt;diff2_acf1&lt;/code&gt;: first autocorrelation coefficient of the twice-differenced series&lt;/li&gt;
&lt;li&gt;&lt;code&gt;diff2_acf10&lt;/code&gt;: sum of squares of the first ten autocorrelation coefficients of the twice-differenced series&lt;/li&gt;
&lt;li&gt;&lt;code&gt;season_acf1&lt;/code&gt;: autocorrelation coefficient at the first seasonal lag&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;tourism %&amp;gt;% 
  features(
    .var = Trips,
    features = feat_acf
  )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 304 x 10
##    Region    State      Purpose     acf1 acf10 diff1_acf1 diff1_acf10 diff2_acf1
##    &amp;lt;chr&amp;gt;     &amp;lt;chr&amp;gt;      &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;
##  1 Adelaide  South Aus&amp;hellip; Busine&amp;hellip;  0.0333  0.131     -0.520       0.463     -0.676
##  2 Adelaide  South Aus&amp;hellip; Holiday  0.0456  0.372     -0.343       0.614     -0.487
##  3 Adelaide  South Aus&amp;hellip; Other    0.517   1.15      -0.409       0.383     -0.675
##  4 Adelaide  South Aus&amp;hellip; Visiti&amp;hellip;  0.0684  0.294     -0.394       0.452     -0.518
##  5 Adelaide&amp;hellip; South Aus&amp;hellip; Busine&amp;hellip;  0.0709  0.134     -0.580       0.415     -0.750
##  6 Adelaide&amp;hellip; South Aus&amp;hellip; Holiday  0.131   0.313     -0.536       0.500     -0.716
##  7 Adelaide&amp;hellip; South Aus&amp;hellip; Other    0.261   0.330     -0.253       0.317     -0.457
##  8 Adelaide&amp;hellip; South Aus&amp;hellip; Visiti&amp;hellip;  0.139   0.117     -0.472       0.239     -0.626
##  9 Alice Sp&amp;hellip; Northern &amp;hellip; Busine&amp;hellip;  0.217   0.367     -0.500       0.381     -0.658
## 10 Alice Sp&amp;hellip; Northern &amp;hellip; Holiday -0.00660 2.11      -0.153       2.11      -0.274
## # &amp;hellip; with 294 more rows, and 2 more variables: diff2_acf10 &amp;lt;dbl&amp;gt;,
## #   season_acf1 &amp;lt;dbl&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In the output above, &lt;code&gt;season_acf1&lt;/code&gt; is the ACF at the first seasonal lag. Because the data are quarterly, the seasonal period is 4.&lt;/li&gt;
&lt;li&gt;In other words, this value is the ACF of the original series at lag 4.&lt;/li&gt;
&lt;/ul&gt;
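&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This can be checked for a single series by putting the two results side by side. The sketch below picks the Adelaide holiday series arbitrarily, and &lt;code&gt;adelaide_holiday&lt;/code&gt; is just an illustrative name. The lag-4 row of &lt;code&gt;ACF()&lt;/code&gt; is expected to agree with &lt;code&gt;season_acf1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)

# One series from the tourism data (arbitrary choice for illustration)
adelaide_holiday &amp;lt;- tourism %&amp;gt;% 
  filter(Region == &quot;Adelaide&quot;, Purpose == &quot;Holiday&quot;)

# ACF up to lag 4: the last row is the first seasonal lag for quarterly data
adelaide_holiday %&amp;gt;% 
  ACF(Trips, lag_max = 4)

# season_acf1 from feat_acf() for the same series
adelaide_holiday %&amp;gt;% 
  features(Trips, feat_acf) %&amp;gt;% 
  select(Region, Purpose, season_acf1)&lt;/code&gt;&lt;/pre&gt;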
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
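&lt;li&gt;The &lt;code&gt;feat_stl()&lt;/code&gt; function returns features based on an STL decomposition.&lt;/li&gt;
&lt;li&gt;It reports the strength of the trend and of the seasonality, along with the following features.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;seasonal_peak_year&lt;/code&gt;: the season (for example, the quarter) in which the seasonal component is largest&lt;/li&gt;
&lt;li&gt;&lt;code&gt;seasonal_trough_year&lt;/code&gt;: the season in which the seasonal component is smallest&lt;/li&gt;
&lt;li&gt;&lt;code&gt;spikiness&lt;/code&gt;: the variance of the leave-one-out variances of the remainder component. Put simply, it measures how prevalent spikes are in the remainder.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;linearity&lt;/code&gt;: linearity of the trend component of the STL decomposition&lt;/li&gt;
&lt;li&gt;&lt;code&gt;curvature&lt;/code&gt;: curvature of the trend component of the STL decomposition&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stl_e_acf1&lt;/code&gt;: first autocorrelation coefficient of the remainder series, after the seasonal and trend components are removed&lt;/li&gt;
&lt;li&gt;&lt;code&gt;stl_e_acf10&lt;/code&gt;: sum of squares of the first ten autocorrelation coefficients of the remainder series&lt;/li&gt;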
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
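&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The two strength measures can be approximated by hand from an STL decomposition. The sketch below assumes the usual definitions, 1 - Var(R) / Var(T + R) for the trend and 1 - Var(R) / Var(S + R) for the seasonality, each floored at 0. It uses the default settings of &lt;code&gt;STL()&lt;/code&gt;, which may differ from those used inside &lt;code&gt;feat_stl()&lt;/code&gt;, so the values need not match exactly. The chosen series and the variable names are illustrative.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)

# STL components for one series (default STL settings, an assumption)
cmp &amp;lt;- tourism %&amp;gt;% 
  filter(Region == &quot;Snowy Mountains&quot;, Purpose == &quot;Holiday&quot;) %&amp;gt;% 
  model(STL(Trips)) %&amp;gt;% 
  components()

# Strength of trend: share of variance in (trend + remainder) not due to the remainder
trend_strength &amp;lt;- max(0, 1 - var(cmp$remainder) / var(cmp$trend + cmp$remainder))

# Strength of seasonality: same idea with the seasonal component
seasonal_strength &amp;lt;- max(0, 1 - var(cmp$remainder) / var(cmp$season_year + cmp$remainder))

c(trend = trend_strength, seasonal = seasonal_strength)&lt;/code&gt;&lt;/pre&gt;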
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;tourism %&amp;gt;% 
  features(
    .var = Trips,
    features = feat_stl
  )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 304 x 12
##    Region    State     Purpose trend_strength seasonal_strengt&amp;hellip; seasonal_peak_y&amp;hellip;
##    &amp;lt;chr&amp;gt;     &amp;lt;chr&amp;gt;     &amp;lt;chr&amp;gt;            &amp;lt;dbl&amp;gt;             &amp;lt;dbl&amp;gt;            &amp;lt;dbl&amp;gt;
##  1 Adelaide  South Au&amp;hellip; Busine&amp;hellip;          0.464             0.407                3
##  2 Adelaide  South Au&amp;hellip; Holiday          0.554             0.619                1
##  3 Adelaide  South Au&amp;hellip; Other            0.746             0.202                2
##  4 Adelaide  South Au&amp;hellip; Visiti&amp;hellip;          0.435             0.452                1
##  5 Adelaide&amp;hellip; South Au&amp;hellip; Busine&amp;hellip;          0.464             0.179                3
##  6 Adelaide&amp;hellip; South Au&amp;hellip; Holiday          0.528             0.296                2
##  7 Adelaide&amp;hellip; South Au&amp;hellip; Other            0.593             0.404                2
##  8 Adelaide&amp;hellip; South Au&amp;hellip; Visiti&amp;hellip;          0.488             0.254                0
##  9 Alice Sp&amp;hellip; Northern&amp;hellip; Busine&amp;hellip;          0.534             0.251                0
## 10 Alice Sp&amp;hellip; Northern&amp;hellip; Holiday          0.381             0.832                3
## # &amp;hellip; with 294 more rows, and 6 more variables: seasonal_trough_year &amp;lt;dbl&amp;gt;,
## #   spikiness &amp;lt;dbl&amp;gt;, linearity &amp;lt;dbl&amp;gt;, curvature &amp;lt;dbl&amp;gt;, stl_e_acf1 &amp;lt;dbl&amp;gt;,
## #   stl_e_acf10 &amp;lt;dbl&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Plotting these results as below shows which series are the most trended (x-axis) and the most seasonal (y-axis).&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;tourism %&amp;gt;% 
  features(.var = Trips, features = feat_stl) %&amp;gt;% 
  ggplot(aes(x = trend_strength, y = seasonal_strength_year, color = Purpose)) +
  geom_point() +
  facet_wrap(~ State)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;662&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.52.36.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/oIvT8/btraLcCr15j/lCc0FtxgXcIMREuACc8kjk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/oIvT8/btraLcCr15j/lCc0FtxgXcIMREuACc8kjk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/oIvT8/btraLcCr15j/lCc0FtxgXcIMREuACc8kjk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FoIvT8%2FbtraLcCr15j%2FlCc0FtxgXcIMREuACc8kjk%2Fimg.png&quot; data-origin-width=&quot;662&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.52.36.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Holiday travel shows the strongest seasonal pattern.&lt;/li&gt;
&lt;li&gt;The trend is strongest in Western Australia.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;tourism %&amp;gt;% 
  features(.var = Trips, features = feat_stl) %&amp;gt;% 
  filter(seasonal_strength_year == max(seasonal_strength_year)) %&amp;gt;% 
  select(Region, State, Purpose)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 1 x 3
##   Region          State           Purpose
##   &amp;lt;chr&amp;gt;           &amp;lt;chr&amp;gt;           &amp;lt;chr&amp;gt;  
## 1 Snowy Mountains New South Wales Holiday&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;tourism %&amp;gt;% 
  features(.var = Trips, features = feat_stl) %&amp;gt;% 
  filter(seasonal_strength_year == max(seasonal_strength_year)) %&amp;gt;% 
  select(Region, State, Purpose) %&amp;gt;% 
  left_join(tourism, by = c(&quot;Region&quot;, &quot;State&quot;, &quot;Purpose&quot;)) %&amp;gt;% 
  ggplot(aes(x = Quarter, y = Trips)) +
  geom_line(size = 0.7) &lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;672&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.52.46.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Ecob5/btraGOoO92T/9mxbYdy1ZPMLyAntC0Kye1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Ecob5/btraGOoO92T/9mxbYdy1ZPMLyAntC0Kye1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Ecob5/btraGOoO92T/9mxbYdy1ZPMLyAntC0Kye1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FEcob5%2FbtraGOoO92T%2F9mxbYdy1ZPMLyAntC0Kye1%2Fimg.png&quot; data-origin-width=&quot;672&quot; data-origin-height=&quot;475&quot; data-filename=&quot;스크린샷 2021-07-28 오후 1.52.46.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;</description>
      <category>time-series (tidy approach)</category>
      <category>ACF</category>
      <category>auto correlation function</category>
      <category>Decomposition</category>
      <category>feasts</category>
      <category>features</category>
      <category>gg_lag</category>
      <category>gg_season</category>
      <category>gg_subseries</category>
      <category>STL</category>
      <category>STL decomposition</category>
      <author>택2</author>
      <guid isPermaLink="true">https://rstatistics.tistory.com/77</guid>
      <comments>https://rstatistics.tistory.com/77#entry77comment</comments>
      <pubDate>Wed, 28 Jul 2021 13:56:35 +0900</pubDate>
    </item>
    <item>
      <title>[R] 3. tsibbledata</title>
      <link>https://rstatistics.tistory.com/76</link>
      <description>&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;code&gt;tsibbledata&lt;/code&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The &lt;code&gt;tsibbledata&lt;/code&gt; package provides datasets that work well as time series examples.&lt;/li&gt;
&lt;li&gt;The example shown on &lt;a href=&quot;https://github.com/tidyverts/tsibbledata&quot;&gt;GitHub&lt;/a&gt; uses the &lt;code&gt;olympic_running&lt;/code&gt; data.&lt;/li&gt;
&lt;li&gt;The dataset below contains the fastest times in Olympic running events, by sex.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;olympic_running&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 312 x 4 [4Y]
## # Key:       Length, Sex [14]
##     Year Length Sex    Time
##    &amp;lt;int&amp;gt;  &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt;
##  1  1896    100 men    12  
##  2  1900    100 men    11  
##  3  1904    100 men    11  
##  4  1908    100 men    10.8
##  5  1912    100 men    10.8
##  6  1916    100 men    NA  
##  7  1920    100 men    10.8
##  8  1924    100 men    10.6
##  9  1928    100 men    10.8
## 10  1932    100 men    10.3
## # &amp;hellip; with 302 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The following example plots the fastest running times by sex using this data.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Note that 1916, 1940 and 1944 are missing because of the World Wars.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;olympic_running %&amp;gt;% 
  ggplot(aes(x = Year, y = Time, color = Sex, group = Sex)) +
  geom_line(size = 0.7) +
  geom_point(size = 1.2) +
  facet_wrap(. ~ Length, scales = &quot;free_y&quot;, nrow = 2) +
  labs(x = &quot;Year&quot;, y = &quot;Running time (seconds)&quot;) +
  scale_color_brewer(palette = &quot;Dark2&quot;) + 
  theme_minimal() +
  theme(legend.position = &quot;bottom&quot;, legend.title = element_blank())&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;670&quot; data-origin-height=&quot;464&quot; data-filename=&quot;스크린샷 2021-07-28 오전 11.16.56.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/u972w/btraGFS4fkG/wlZwvPJ0PKp2CMpZpKAyxk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/u972w/btraGFS4fkG/wlZwvPJ0PKp2CMpZpKAyxk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/u972w/btraGFS4fkG/wlZwvPJ0PKp2CMpZpKAyxk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fu972w%2FbtraGFS4fkG%2FwlZwvPJ0PKp2CMpZpKAyxk%2Fimg.png&quot; data-origin-width=&quot;670&quot; data-origin-height=&quot;464&quot; data-filename=&quot;스크린샷 2021-07-28 오전 11.16.56.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;</description>
      <category>time-series (tidy approach)</category>
      <category>tsibble</category>
      <category>tsibbledata</category>
      <author>택2</author>
      <guid isPermaLink="true">https://rstatistics.tistory.com/76</guid>
      <comments>https://rstatistics.tistory.com/76#entry76comment</comments>
      <pubDate>Wed, 28 Jul 2021 11:20:50 +0900</pubDate>
    </item>
    <item>
      <title>[R] 2. tsibble</title>
      <link>https://rstatistics.tistory.com/75</link>
      <description>&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;code&gt;tsibble()&lt;/code&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A &lt;code&gt;tsibble&lt;/code&gt; object follows these basic principles.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;index&lt;/code&gt;: the time of each observation, ordered from past to present&lt;/li&gt;
&lt;li&gt;&lt;code&gt;key&lt;/code&gt;: a set of variables that define the observational units measured over time&lt;/li&gt;
&lt;li&gt;Each observation must be uniquely identified by the combination of &lt;code&gt;index&lt;/code&gt; and &lt;code&gt;key&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Observations must be recorded at regular intervals.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In other words, converting data to the &lt;code&gt;tsibble&lt;/code&gt; format requires specifying the &lt;code&gt;key&lt;/code&gt; and the &lt;code&gt;index&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The example below uses the &lt;code&gt;weather&lt;/code&gt; data from the &lt;code&gt;nycflights13&lt;/code&gt; package.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;weather_sample &amp;lt;- nycflights13::weather %&amp;gt;% 
  select(origin, time_hour, temp, humid, precip)

weather_sample&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 26,115 x 5
##    origin time_hour            temp humid precip
##    &amp;lt;chr&amp;gt;  &amp;lt;dttm&amp;gt;              &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
##  1 EWR    2013-01-01 01:00:00  39.0  59.4      0
##  2 EWR    2013-01-01 02:00:00  39.0  61.6      0
##  3 EWR    2013-01-01 03:00:00  39.0  64.4      0
##  4 EWR    2013-01-01 04:00:00  39.9  62.2      0
##  5 EWR    2013-01-01 05:00:00  39.0  64.4      0
##  6 EWR    2013-01-01 06:00:00  37.9  67.2      0
##  7 EWR    2013-01-01 07:00:00  39.0  64.4      0
##  8 EWR    2013-01-01 08:00:00  39.9  62.2      0
##  9 EWR    2013-01-01 09:00:00  39.9  62.2      0
## 10 EWR    2013-01-01 10:00:00  41    59.6      0
## # &amp;hellip; with 26,105 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The origin variable is used as the &lt;code&gt;key&lt;/code&gt; and the time_hour variable as the &lt;code&gt;index&lt;/code&gt;.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For a single time series rather than multiple series, the &lt;code&gt;key&lt;/code&gt; can be omitted.&lt;/li&gt;
&lt;li&gt;The example data is a multiple time series, observed separately for each origin airport.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;weather_tsbl &amp;lt;- weather_sample %&amp;gt;% 
  as_tsibble(key = origin, index = time_hour)

weather_tsbl&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 26,115 x 5 [1h] &amp;lt;America/New_York&amp;gt;
## # Key:       origin [3]
##    origin time_hour            temp humid precip
##    &amp;lt;chr&amp;gt;  &amp;lt;dttm&amp;gt;              &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
##  1 EWR    2013-01-01 01:00:00  39.0  59.4      0
##  2 EWR    2013-01-01 02:00:00  39.0  61.6      0
##  3 EWR    2013-01-01 03:00:00  39.0  64.4      0
##  4 EWR    2013-01-01 04:00:00  39.9  62.2      0
##  5 EWR    2013-01-01 05:00:00  39.0  64.4      0
##  6 EWR    2013-01-01 06:00:00  37.9  67.2      0
##  7 EWR    2013-01-01 07:00:00  39.0  64.4      0
##  8 EWR    2013-01-01 08:00:00  39.9  62.2      0
##  9 EWR    2013-01-01 09:00:00  39.9  62.2      0
## 10 EWR    2013-01-01 10:00:00  41    59.6      0
## # &amp;hellip; with 26,105 more rows&lt;/code&gt;&lt;/pre&gt;
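&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The uniqueness requirement can be checked before converting. The sketch below uses &lt;code&gt;is_duplicated()&lt;/code&gt; and &lt;code&gt;duplicates()&lt;/code&gt; from &lt;code&gt;tsibble&lt;/code&gt;. With &lt;code&gt;origin&lt;/code&gt; as the key no duplicates are expected, since the conversion above succeeds. Without a key, the three airports share the same timestamps, so duplicates should be reported.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)

weather_sample &amp;lt;- nycflights13::weather %&amp;gt;% 
  select(origin, time_hour, temp, humid, precip)

# With origin as the key, each (origin, time_hour) pair should appear only once
is_duplicated(weather_sample, key = origin, index = time_hour)

# Without a key, rows from different airports collide on time_hour
is_duplicated(weather_sample, index = time_hour)

# duplicates() returns the offending rows, if there are any
duplicates(weather_sample, key = origin, index = time_hour)&lt;/code&gt;&lt;/pre&gt;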
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The interval is computed from the index based on its representation, ranging from years down to nanoseconds.&lt;/li&gt;
&lt;li&gt;The table below lists the index classes that &lt;code&gt;tsibble&lt;/code&gt; recognizes for each interval.&lt;/li&gt;
&lt;/ul&gt;
&lt;table data-ke-align=&quot;alignLeft&quot;&gt;
&lt;thead&gt;
&lt;tr class=&quot;header&quot;&gt;
&lt;th&gt;Interval&lt;/th&gt;
&lt;th&gt;Class&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td&gt;Annual&lt;/td&gt;
&lt;td&gt;&lt;code&gt;integer&lt;/code&gt;, &lt;code&gt;double&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td&gt;Quarterly&lt;/td&gt;
&lt;td&gt;&lt;code&gt;yearquarter&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td&gt;Monthly&lt;/td&gt;
&lt;td&gt;&lt;code&gt;yearmonth&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td&gt;Weekly&lt;/td&gt;
&lt;td&gt;&lt;code&gt;yearweek&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;odd&quot;&gt;
&lt;td&gt;Daily&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Date&lt;/code&gt;, &lt;code&gt;difftime&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr class=&quot;even&quot;&gt;
&lt;td&gt;Subdaily&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POSIXt&lt;/code&gt;, &lt;code&gt;difftime&lt;/code&gt;, &lt;code&gt;hms&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
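&lt;li&gt;So far the index has been assumed to be regularly spaced, but &lt;code&gt;tsibble()&lt;/code&gt; can also handle irregularly spaced data.&lt;/li&gt;
&lt;li&gt;To do so, set the &lt;code&gt;regular = FALSE&lt;/code&gt; argument of &lt;code&gt;tsibble()&lt;/code&gt; (or of &lt;code&gt;as_tsibble()&lt;/code&gt;, as in the example below). The default is &lt;code&gt;TRUE&lt;/code&gt;.&lt;/li&gt;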
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# make_datetime() is a lubridate function that builds a date-time from its components in the given time zone.
nycflights13::flights %&amp;gt;% 
  mutate(sched_dep_datetime = make_datetime(year, month, day, hour, minute, tz = &quot;America/New_York&quot;)) %&amp;gt;% 
  select(carrier, flight, sched_dep_datetime, air_time, distance) %&amp;gt;% 
  as_tsibble(
    key = c(&quot;carrier&quot;, &quot;flight&quot;),
    index = sched_dep_datetime,
    regular = FALSE
  )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 336,776 x 5 [!] &amp;lt;America/New_York&amp;gt;
## # Key:       carrier, flight [5,725]
##    carrier flight sched_dep_datetime  air_time distance
##    &amp;lt;chr&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;dttm&amp;gt;                 &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1 9E        2900 2013-11-03 15:40:00      113      765
##  2 9E        2900 2013-11-04 15:40:00      117      765
##  3 9E        2900 2013-11-05 15:40:00      120      765
##  4 9E        2900 2013-11-06 15:40:00      118      765
##  5 9E        2900 2013-11-07 15:40:00      131      765
##  6 9E        2900 2013-11-08 15:40:00      114      765
##  7 9E        2900 2013-11-09 15:40:00      121      765
##  8 9E        2900 2013-11-10 15:40:00      115      765
##  9 9E        2900 2013-11-11 15:40:00      119      765
## 10 9E        2900 2013-11-12 15:40:00      118      765
## # &amp;hellip; with 336,766 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;An irregular &lt;code&gt;tsibble&lt;/code&gt; object like this one is marked with &lt;code&gt;[!]&lt;/code&gt; in the header of the printed output.&lt;/li&gt;
&lt;/ul&gt;
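&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As a minimal sketch, regularity can also be checked in code with &lt;code&gt;is_regular()&lt;/code&gt; from &lt;code&gt;tsibble&lt;/code&gt;. The &lt;code&gt;events&lt;/code&gt; data below is a made-up event log used only for illustration, not part of the original example.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(tsibble)

# Hypothetical event log: the timestamps are not evenly spaced
events &lt;- tibble::tibble(
  time = as.POSIXct(
    c(&quot;2021-07-01 09:00:00&quot;, &quot;2021-07-01 09:07:00&quot;, &quot;2021-07-01 11:30:00&quot;),
    tz = &quot;UTC&quot;
  ),
  value = c(3, 5, 4)
)

# Declare the index as irregular when building the tsibble
events_tsbl &lt;- as_tsibble(events, index = time, regular = FALSE)

# FALSE for a tsibble created with regular = FALSE
is_regular(events_tsbl)&lt;/code&gt;&lt;/pre&gt;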
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Making implicit missing values explicit: &lt;code&gt;fill_gaps()&lt;/code&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Time series data can contain implicit missing values, that is, time points for which no row exists at all.&lt;/li&gt;
&lt;li&gt;In that case, &lt;code&gt;fill_gaps()&lt;/code&gt; either makes them explicit as &lt;code&gt;NA&lt;/code&gt; or fills them with a value you specify.&lt;/li&gt;
&lt;li&gt;Let's look at an example. &lt;a href=&quot;https://tsibble.tidyverts.org/reference/fill_gaps.html&quot;&gt;Reference&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;harvest &amp;lt;- tsibble(
  year = c(2010, 2011, 2013, 2011, 2012, 2014),
  fruit = rep(c(&quot;kiwi&quot;, &quot;cherry&quot;), each = 3),
  kilo = sample(1:10, size = 6),
  key = fruit, 
  index = year
)

harvest&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 6 x 3 [1Y]
## # Key:       fruit [2]
##    year fruit   kilo
##   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;int&amp;gt;
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2014 cherry     6
## 4  2010 kiwi       7
## 5  2011 kiwi       2
## 6  2013 kiwi       4&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
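&lt;li&gt;In the data above, cherry has no rows for 2010 and 2013 (and kiwi has none for 2012 and 2014), so those years are implicitly missing rather than recorded as values.&lt;/li&gt;
&lt;li&gt;The code below makes them explicit as &lt;code&gt;NA&lt;/code&gt;.&lt;/li&gt;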
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;harvest %&amp;gt;% 
  fill_gaps(.full = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 10 x 3 [1Y]
## # Key:       fruit [2]
##     year fruit   kilo
##    &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;int&amp;gt;
##  1  2010 cherry    NA
##  2  2011 cherry     8
##  3  2012 cherry     9
##  4  2013 cherry    NA
##  5  2014 cherry     6
##  6  2010 kiwi       7
##  7  2011 kiwi       2
##  8  2012 kiwi      NA
##  9  2013 kiwi       4
## 10  2014 kiwi      NA&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;With &lt;code&gt;.full = FALSE&lt;/code&gt;, only the gaps that fall inside each key's own time span are made explicit. (&lt;code&gt;FALSE&lt;/code&gt; is the default.)&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;harvest %&amp;gt;% 
  fill_gaps(.full = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 8 x 3 [1Y]
## # Key:       fruit [2]
##    year fruit   kilo
##   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;int&amp;gt;
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2013 cherry    NA
## 4  2014 cherry     6
## 5  2010 kiwi       7
## 6  2011 kiwi       2
## 7  2012 kiwi      NA
## 8  2013 kiwi       4&lt;/code&gt;&lt;/pre&gt;
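&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To see which rows each &lt;code&gt;.full&lt;/code&gt; setting would add before filling anything, the gaps can be inspected with &lt;code&gt;has_gaps()&lt;/code&gt; and &lt;code&gt;scan_gaps()&lt;/code&gt; from &lt;code&gt;tsibble&lt;/code&gt;. This is only a sketch: &lt;code&gt;harvest_demo&lt;/code&gt; is a hypothetical copy of the harvest data with fixed &lt;code&gt;kilo&lt;/code&gt; values in place of &lt;code&gt;sample()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(tsibble)

# Same shape as the harvest example, with fixed (illustrative) kilo values
harvest_demo &lt;- tsibble(
  year = c(2010, 2011, 2013, 2011, 2012, 2014),
  fruit = rep(c(&quot;kiwi&quot;, &quot;cherry&quot;), each = 3),
  kilo = c(7, 2, 4, 8, 9, 6),
  key = fruit,
  index = year
)

# One row per key, flagging whether the key has gaps inside its own span
has_gaps(harvest_demo)

# The implicit missing rows themselves (default .full = FALSE)
scan_gaps(harvest_demo)

# With .full = TRUE, each key is also padded out to the overall time span
scan_gaps(harvest_demo, .full = TRUE)&lt;/code&gt;&lt;/pre&gt;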
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Gaps can also be filled with a specific value.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;harvest %&amp;gt;% 
  fill_gaps(kilo = 0)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 8 x 3 [1Y]
## # Key:       fruit [2]
##    year fruit   kilo
##   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;dbl&amp;gt;
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2013 cherry     0
## 4  2014 cherry     6
## 5  2010 kiwi       7
## 6  2011 kiwi       2
## 7  2012 kiwi       0
## 8  2013 kiwi       4&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
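&lt;li&gt;The fill value can also be computed with a function. Here &lt;code&gt;sum(kilo)&lt;/code&gt; is evaluated over the whole table (all keys), which is why both gaps receive 36 in the output below.&lt;/li&gt;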
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;harvest %&amp;gt;% 
  fill_gaps(kilo = sum(kilo))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 8 x 3 [1Y]
## # Key:       fruit [2]
##    year fruit   kilo
##   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;int&amp;gt;
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2013 cherry    36
## 4  2014 cherry     6
## 5  2010 kiwi       7
## 6  2011 kiwi       2
## 7  2012 kiwi      36
## 8  2013 kiwi       4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Apply group_by_key() so each gap is filled with the median of its own key (the same idea as dplyr's group_by())
harvest %&amp;gt;% 
  group_by_key() %&amp;gt;% 
  fill_gaps(kilo = median(kilo, na.rm = TRUE))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 8 x 3 [1Y]
## # Key:       fruit [2]
## # Groups:    fruit [2]
##    year fruit   kilo
##   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;int&amp;gt;
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2013 cherry     8
## 4  2014 cherry     6
## 5  2010 kiwi       7
## 6  2011 kiwi       2
## 7  2012 kiwi       4
## 8  2013 kiwi       4&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Combining &lt;code&gt;fill_gaps()&lt;/code&gt; with the &lt;code&gt;fill()&lt;/code&gt; function from the &lt;code&gt;tidyr&lt;/code&gt; library replaces implicit missing values with the previous or next observed value.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;harvest&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 6 x 3 [1Y]
## # Key:       fruit [2]
##    year fruit   kilo
##   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;int&amp;gt;
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2014 cherry     6
## 4  2010 kiwi       7
## 5  2011 kiwi       2
## 6  2013 kiwi       4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Fill with the previous observation
harvest %&amp;gt;% 
  group_by_key() %&amp;gt;% 
  fill_gaps() %&amp;gt;% 
  tidyr::fill(kilo, .direction = &quot;down&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 8 x 3 [1Y]
## # Key:       fruit [2]
## # Groups:    fruit [2]
##    year fruit   kilo
##   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;int&amp;gt;
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2013 cherry     9
## 4  2014 cherry     6
## 5  2010 kiwi       7
## 6  2011 kiwi       2
## 7  2012 kiwi       2
## 8  2013 kiwi       4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Fill with the next observation
harvest %&amp;gt;% 
  group_by_key() %&amp;gt;% 
  fill_gaps() %&amp;gt;% 
  tidyr::fill(kilo, .direction = &quot;up&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 8 x 3 [1Y]
## # Key:       fruit [2]
## # Groups:    fruit [2]
##    year fruit   kilo
##   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;  &amp;lt;int&amp;gt;
## 1  2011 cherry     8
## 2  2012 cherry     9
## 3  2013 cherry     6
## 4  2014 cherry     6
## 5  2010 kiwi       7
## 6  2011 kiwi       2
## 7  2012 kiwi       4
## 8  2013 kiwi       4&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Apply the same approach to the weather data from the first example
weather_tsbl %&amp;gt;% 
  filter(origin == &quot;EWR&quot;) %&amp;gt;% 
  fill_gaps(precip = 0) %&amp;gt;% 
  group_by_key() %&amp;gt;% 
  tidyr::fill(temp, humid, .direction = &quot;down&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 8,730 x 5 [1h] &amp;lt;America/New_York&amp;gt;
## # Key:       origin [1]
## # Groups:    origin [1]
##    origin time_hour            temp humid precip
##    &amp;lt;chr&amp;gt;  &amp;lt;dttm&amp;gt;              &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
##  1 EWR    2013-01-01 01:00:00  39.0  59.4      0
##  2 EWR    2013-01-01 02:00:00  39.0  61.6      0
##  3 EWR    2013-01-01 03:00:00  39.0  64.4      0
##  4 EWR    2013-01-01 04:00:00  39.9  62.2      0
##  5 EWR    2013-01-01 05:00:00  39.0  64.4      0
##  6 EWR    2013-01-01 06:00:00  37.9  67.2      0
##  7 EWR    2013-01-01 07:00:00  39.0  64.4      0
##  8 EWR    2013-01-01 08:00:00  39.9  62.2      0
##  9 EWR    2013-01-01 09:00:00  39.9  62.2      0
## 10 EWR    2013-01-01 10:00:00  41    59.6      0
## # &amp;hellip; with 8,720 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;Applying functions over a chosen index: &lt;code&gt;index_by()&lt;/code&gt; + &lt;code&gt;summarise()&lt;/code&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
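&lt;li&gt;&lt;code&gt;index_by()&lt;/code&gt; groups a &lt;code&gt;tsibble&lt;/code&gt; by its time index, and it also works with the date/time classes of the &lt;code&gt;lubridate&lt;/code&gt; family.&lt;/li&gt;
&lt;li&gt;The example below computes the monthly average temperature and total precipitation.&lt;/li&gt;
&lt;li&gt;As when writing a &lt;code&gt;map()&lt;/code&gt;-style formula, &lt;code&gt;.&lt;/code&gt; refers to the index of the &lt;code&gt;tsibble&lt;/code&gt; object.&lt;/li&gt;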
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;weather_tsbl %&amp;gt;% 
  group_by_key() %&amp;gt;% 
  index_by(year_month = ~yearmonth(.)) %&amp;gt;% 
  summarise(
    avg_temp = mean(temp, na.rm = TRUE),
    total_precip = sum(precip, na.rm = TRUE)
  )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tsibble: 36 x 4 [1M]
## # Key:       origin [3]
##    origin year_month avg_temp total_precip
##    &amp;lt;chr&amp;gt;       &amp;lt;mth&amp;gt;    &amp;lt;dbl&amp;gt;        &amp;lt;dbl&amp;gt;
##  1 EWR       2013  1     35.6         3.53
##  2 EWR       2013  2     34.3         3.83
##  3 EWR       2013  3     40.1         3   
##  4 EWR       2013  4     53.0         1.47
##  5 EWR       2013  5     63.3         5.44
##  6 EWR       2013  6     73.3         8.73
##  7 EWR       2013  7     80.7         3.74
##  8 EWR       2013  8     74.5         4.57
##  9 EWR       2013  9     67.3         1.54
## 10 EWR       2013 10     59.8         0.5 
## # &amp;hellip; with 26 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;According to the tsibble documentation, this combination can also be used on an irregular index:&lt;br /&gt;&amp;ldquo;This &lt;code&gt;index_by()&lt;/code&gt; + &lt;code&gt;summarise()&lt;/code&gt; combo can help with regularising a tsibble of irregular time space too.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;</description>
      <category>time-series (tidy approach)</category>
      <category>fill_gaps</category>
      <category>index_by</category>
      <category>Lubridate</category>
      <category>tsibble</category>
      <author>택2</author>
      <guid isPermaLink="true">https://rstatistics.tistory.com/75</guid>
      <comments>https://rstatistics.tistory.com/75#entry75comment</comments>
      <pubDate>Tue, 27 Jul 2021 23:10:45 +0900</pubDate>
    </item>
    <item>
      <title>[R] 1. A brief introduction to fpp3</title>
      <link>https://rstatistics.tistory.com/74</link>
      <description>&lt;h2 data-ke-size=&quot;size26&quot;&gt;Introduction&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;a class=&quot;uri&quot; href=&quot;https://tidyverts.org/&quot;&gt;https://tidyverts.org/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;tidyverts&lt;/code&gt; is an ecosystem for analysing time series data with a tidy approach.&lt;/li&gt;
&lt;li&gt;In R, loading the &lt;code&gt;fpp3&lt;/code&gt; package attaches the libraries that make up &lt;code&gt;tidyverts&lt;/code&gt;.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Alternatively, you can install only the libraries you need with &lt;code&gt;install.packages(&quot;...&quot;)&lt;/code&gt; or &lt;code&gt;remotes::install_github(&quot;tidyverts/...&quot;)&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;fpp3&lt;/code&gt; is named after &lt;a href=&quot;https://otexts.com/fpp3/&quot;&gt;Forecasting: Principles and Practice, 3rd edition&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## ─ Attaching packages ────────────────────── fpp3 0.4.0 ─&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## ✓ tibble      3.1.2      ✓ tsibble     1.0.1 
## ✓ dplyr       1.0.7      ✓ tsibbledata 0.3.0 
## ✓ tidyr       1.1.3      ✓ feasts      0.2.2 
## ✓ lubridate   1.7.10     ✓ fable       0.3.1 
## ✓ ggplot2     3.3.5&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;rust&quot;&gt;&lt;code&gt;## ─ Conflicts ───────────────────────── fpp3_conflicts ─
## x lubridate::date()    masks base::date()
## x dplyr::filter()      masks stats::filter()
## x tsibble::intersect() masks base::intersect()
## x tsibble::interval()  masks lubridate::interval()
## x dplyr::lag()         masks stats::lag()
## x tsibble::setdiff()   masks base::setdiff()
## x tsibble::union()     masks base::union()&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The attach message lists several libraries beyond the ones that appear when loading &lt;code&gt;tidyverse&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;code&gt;lubridate&lt;/code&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A very useful library for working with variables that hold date and time information, often used alongside other tools when handling tidy data.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;code&gt;tsibble&lt;/code&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A library for organising time series data following a tidy approach. &lt;a href=&quot;https://github.com/tidyverts/tsibble&quot;&gt;github&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Its &lt;code&gt;tsibble()&lt;/code&gt; function creates an object of class &lt;code&gt;tsibble&lt;/code&gt;.&lt;br /&gt;&lt;b&gt;A &lt;code&gt;tsibble&lt;/code&gt; object follows the basic principles below&lt;/b&gt;.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;index&lt;/code&gt;: the observation time of each value, ordered from past to present&lt;/li&gt;
&lt;li&gt;&lt;code&gt;key&lt;/code&gt;: a set of variables that define the observational units tracked over time&lt;/li&gt;
&lt;li&gt;Each observation must be uniquely identified by &lt;code&gt;index&lt;/code&gt; and &lt;code&gt;key&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Each observation should be measured at regular intervals.&lt;br /&gt;(As discussed later, this requirement can be relaxed to some extent through a function option.)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
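&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The principles above can be sketched with a small example. The &lt;code&gt;readings&lt;/code&gt; table and its column names are made up for illustration; &lt;code&gt;as_tsibble()&lt;/code&gt; and &lt;code&gt;is_duplicated()&lt;/code&gt; are &lt;code&gt;tsibble&lt;/code&gt; functions.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(tsibble)

# Hypothetical toy data: two sensors, each observed once a year
readings &lt;- tibble::tibble(
  year   = c(2019, 2020, 2021, 2019, 2020, 2021),
  sensor = rep(c(&quot;a&quot;, &quot;b&quot;), each = 3),
  value  = c(1.2, 1.5, 1.1, 3.4, 3.3, 3.9)
)

# FALSE here, because no key/index pair is repeated
is_duplicated(readings, key = sensor, index = year)

# key = sensor identifies the series, index = year orders it in time
readings_tsbl &lt;- as_tsibble(readings, key = sensor, index = year)
readings_tsbl&lt;/code&gt;&lt;/pre&gt;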
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;code&gt;tsibbledata&lt;/code&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Much like the familiar &lt;code&gt;iris&lt;/code&gt; dataset, this library provides a variety of example datasets, here in &lt;code&gt;tsibble&lt;/code&gt; format.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;olympic_running&lt;/code&gt; dataset seems to be one of the most frequently used examples.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;code&gt;feasts&lt;/code&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;feasts&lt;/code&gt; stands for Feature Extraction And Statistics for Time Series (FEASTS).&lt;/li&gt;
&lt;li&gt;A library that provides a range of functions needed for time series analysis.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Time series decomposition, feature extraction, visualisation, and so on&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;It works with &lt;code&gt;tsibble&lt;/code&gt; objects and is used in close combination with the &lt;code&gt;fable&lt;/code&gt; library.&lt;/li&gt;
&lt;/ul&gt;
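&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As a rough sketch of how the two libraries fit together, the code below decomposes a series with &lt;code&gt;STL()&lt;/code&gt; from &lt;code&gt;feasts&lt;/code&gt; and forecasts it with &lt;code&gt;ETS()&lt;/code&gt; from &lt;code&gt;fable&lt;/code&gt;. The choice of the &lt;code&gt;tourism&lt;/code&gt; data (shipped with &lt;code&gt;tsibble&lt;/code&gt;), the holiday filter, and the two models are assumptions made for illustration.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(fpp3)

# Quarterly holiday trips, summed over all regions into a single series
holidays &lt;- tourism %&gt;% 
  filter(Purpose == &quot;Holiday&quot;) %&gt;% 
  summarise(Trips = sum(Trips))

# feasts: STL decomposition into trend, seasonal and remainder components
holiday_stl &lt;- holidays %&gt;% 
  model(stl = STL(Trips)) %&gt;% 
  components()

# fable: fit an exponential smoothing model and forecast two years ahead
holiday_fc &lt;- holidays %&gt;% 
  model(ets = ETS(Trips)) %&gt;% 
  forecast(h = &quot;2 years&quot;)

holiday_stl
holiday_fc&lt;/code&gt;&lt;/pre&gt;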
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;&lt;code&gt;fable&lt;/code&gt;&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
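&lt;li&gt;A library that provides commonly used univariate and multivariate time series forecasting models, such as ARIMA and exponential smoothing.&lt;/li&gt;
&lt;li&gt;It supports estimating, comparing, and combining models, as well as forecasting with them.&lt;/li&gt;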
&lt;/ul&gt;</description>
      <category>time-series (tidy approach)</category>
      <category>Forecasting: principles and practice</category>
      <category>fpp3</category>
      <category>tidy data</category>
      <category>tidyverse</category>
      <category>tidyverts</category>
      <author>택2</author>
      <guid isPermaLink="true">https://rstatistics.tistory.com/74</guid>
      <comments>https://rstatistics.tistory.com/74#entry74comment</comments>
      <pubDate>Tue, 27 Jul 2021 22:22:33 +0900</pubDate>
    </item>
    <item>
      <title>[R]  6. Topic modeling</title>
      <link>https://rstatistics.tistory.com/73</link>
      <description>&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## [1] &quot;ko_KR.UTF-8&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;h2 data-ke-size=&quot;size26&quot;&gt;6. Topic modeling&lt;/h2&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;1340&quot; data-origin-height=&quot;818&quot; data-filename=&quot;스크린샷 2021-07-20 오후 8.02.20.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/Ni9mD/btq9WpLrVh7/c9tzaedQ5K1si6sPFb05jk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/Ni9mD/btq9WpLrVh7/c9tzaedQ5K1si6sPFb05jk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/Ni9mD/btq9WpLrVh7/c9tzaedQ5K1si6sPFb05jk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FNi9mD%2Fbtq9WpLrVh7%2Fc9tzaedQ5K1si6sPFb05jk%2Fimg.png&quot; data-origin-width=&quot;1340&quot; data-origin-height=&quot;818&quot; data-filename=&quot;스크린샷 2021-07-20 오후 8.02.20.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Topic modeling is an unsupervised classification method for text data, similar in spirit to clustering.&lt;/li&gt;
&lt;li&gt;Of the several topic models available, this post looks at the widely used LDA (Latent Dirichlet Allocation).&lt;/li&gt;
&lt;li&gt;The required library is &lt;code&gt;topicmodels&lt;/code&gt;; this post shows how to work with the LDA objects it produces.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(topicmodels)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;6. 1. Latent Dirichlet Allocation&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;LDA is one of the most common algorithms for topic modeling.&lt;/li&gt;
&lt;li&gt;This post skips the mathematical derivation of the model and summarises only the two principles below.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Every document is a mixture of topics.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Each document is assumed to contain words from several topics in particular proportions.&lt;/li&gt;
&lt;li&gt;For example, document 1 may be 90% topic A and 10% topic B, while document 2 is 30% topic A and 70% topic B.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Every topic is a mixture of words.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For example, suppose one topic is &amp;ldquo;politics&amp;rdquo; and another is &amp;ldquo;entertainment&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;Words common in the &amp;ldquo;politics&amp;rdquo; topic might be &amp;ldquo;president&amp;rdquo;, &amp;ldquo;congress&amp;rdquo;, and &amp;ldquo;government&amp;rdquo;,&lt;br /&gt;while the &amp;ldquo;entertainment&amp;rdquo; topic might feature &amp;ldquo;movie&amp;rdquo;, &amp;ldquo;TV&amp;rdquo;, and &amp;ldquo;actor&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;The key point is that words can be shared between topics.&lt;/li&gt;
&lt;li&gt;A word like &amp;ldquo;budget&amp;rdquo; might appear equally in both topics.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;LDA is a mathematical method for estimating these mixtures under the two assumptions above.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Note that it shares its abbreviation with Linear Discriminant Analysis (LDA) from supervised learning, so take care not to confuse the two.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;It finds the mixture of words associated with each topic while also determining the mixture of topics that describes each document.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
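&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The two principles can be sketched numerically in base R. Every number below is made up for illustration, and treating topic A as &amp;ldquo;politics&amp;rdquo; and topic B as &amp;ldquo;entertainment&amp;rdquo; is an assumption; this shows only the mixture idea, not how LDA estimates anything.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Hypothetical topic-word probabilities: each topic is a mixture of words (rows sum to 1)
beta &lt;- rbind(
  politics      = c(president = 0.5, congress = 0.3, budget = 0.2, movie = 0.0, actor = 0.0),
  entertainment = c(president = 0.0, congress = 0.0, budget = 0.2, movie = 0.5, actor = 0.3)
)

# Document-topic proportions from the text: each document is a mixture of topics
theta &lt;- rbind(
  doc1 = c(politics = 0.9, entertainment = 0.1),
  doc2 = c(politics = 0.3, entertainment = 0.7)
)

# Word distribution of each document = topic proportions times topic-word probabilities
word_probs &lt;- theta %*% beta
word_probs

# &quot;budget&quot; gets 0.2 in both documents because both topics give it the same weight
word_probs[, &quot;budget&quot;]

# Each document's word probabilities still sum to 1
rowSums(word_probs)&lt;/code&gt;&lt;/pre&gt;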
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For the example, we use the &lt;code&gt;AssociatedPress&lt;/code&gt; data, which is a &lt;code&gt;DocumentTermMatrix&lt;/code&gt; object.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A collection of 2,246 news articles from an American news agency, mostly published around 1988&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;data(&quot;AssociatedPress&quot;)

AssociatedPress&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## &amp;lt;&amp;lt;DocumentTermMatrix (documents: 2246, terms: 10473)&amp;gt;&amp;gt;
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The function to apply is &lt;code&gt;LDA()&lt;/code&gt; from the &lt;code&gt;topicmodels&lt;/code&gt; library.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# k is the number of topics to fit
ap_lda &amp;lt;- LDA(AssociatedPress, k = 2, control = list(seed = 20210720))
ap_lda&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## A LDA_VEM topic model with 2 topics.&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size26&quot;&gt;6. 1. 1. Word-topic probabilities&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
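&lt;li&gt;To obtain the per-topic-per-word probabilities, called &lt;span class=&quot;math inline&quot;&gt;\(\beta\)&lt;/span&gt; in the model, apply the &lt;code&gt;tidy()&lt;/code&gt; function (the method for LDA objects is provided by &lt;code&gt;tidytext&lt;/code&gt;).&lt;/li&gt;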
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;ap_topics &amp;lt;- ap_lda %&amp;gt;% 
  tidy(matrix = &quot;beta&quot;)

ap_topics&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 20,946 x 3
##    topic term           beta
##    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;         &amp;lt;dbl&amp;gt;
##  1     1 aaron      3.31e- 5
##  2     2 aaron      6.71e- 6
##  3     1 abandon    3.23e- 5
##  4     2 abandon    3.77e- 5
##  5     1 abandoned  1.11e- 4
##  6     2 abandoned  6.05e- 5
##  7     1 abandoning 2.24e- 5
##  8     2 abandoning 3.16e-16
##  9     1 abbott     9.99e- 7
## 10     2 abbott     4.61e- 5
## # &amp;hellip; with 20,936 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The result is a table with one probability (beta) per topic per term.&lt;/li&gt;
&lt;li&gt;In other words, beta is the probability of that term being generated from that topic.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For example, the term &amp;ldquo;aaron&amp;rdquo; has a different probability of being generated from topic 1 (3.31e-5) than from topic 2 (6.71e-6).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;slice_max()&lt;/code&gt; picks out the terms that are most common within each topic.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;ap_top10_terms &amp;lt;- ap_topics %&amp;gt;% 
  group_by(topic) %&amp;gt;% 
  slice_max(beta, n = 10) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  arrange(topic, desc(beta))

ap_top10_terms&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 20 x 3
##    topic term          beta
##    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;        &amp;lt;dbl&amp;gt;
##  1     1 i          0.00716
##  2     1 people     0.00504
##  3     1 two        0.00445
##  4     1 president  0.00426
##  5     1 police     0.00404
##  6     1 government 0.00402
##  7     1 soviet     0.00364
##  8     1 bush       0.00343
##  9     1 new        0.00338
## 10     1 years      0.00323
## 11     2 percent    0.0108 
## 12     2 million    0.00767
## 13     2 new        0.00661
## 14     2 year       0.00647
## 15     2 billion    0.00507
## 16     2 last       0.00404
## 17     2 company    0.00369
## 18     2 market     0.00358
## 19     2 federal    0.00345
## 20     2 years      0.00284&lt;/code&gt;&lt;/pre&gt;
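&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# reorder_within() and scale_x_reordered() come from tidytext; they order the bars within each facet
ap_top10_terms %&amp;gt;% 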
  ggplot(aes(x = reorder_within(term, beta, within = topic), y = beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = &quot;free&quot;) +
  coord_flip() +
  scale_x_reordered()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;672&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-20 오후 8.02.35.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bs4Tm1/btq94randTl/3SY7ci3mTd0VCkKAKg82XK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bs4Tm1/btq94randTl/3SY7ci3mTd0VCkKAKg82XK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bs4Tm1/btq94randTl/3SY7ci3mTd0VCkKAKg82XK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fbs4Tm1%2Fbtq94randTl%2F3SY7ci3mTd0VCkKAKg82XK%2Fimg.png&quot; data-origin-width=&quot;672&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-20 오후 8.02.35.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;These results give a rough sense of the two topics.&lt;/li&gt;
&lt;li&gt;Topic 1 contains &amp;ldquo;people&amp;rdquo;, &amp;ldquo;president&amp;rdquo;, &amp;ldquo;police&amp;rdquo;, and &amp;ldquo;government&amp;rdquo;, which suggests political news,&lt;br /&gt;while topic 2 contains &amp;ldquo;percent&amp;rdquo;, &amp;ldquo;million&amp;rdquo;, and &amp;ldquo;company&amp;rdquo;, which suggests economic news.&lt;/li&gt;
&lt;li&gt;Some words, such as &amp;ldquo;new&amp;rdquo; and &amp;ldquo;years&amp;rdquo;, have a high probability in both topics (clusters).&lt;/li&gt;
&lt;li&gt;This points to an advantage of topic modeling: unlike the &amp;ldquo;hard&amp;rdquo; clustering we are used to, it clusters softly, so topics can overlap in their words.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
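&lt;li&gt;Next, we can look at the terms whose beta differs most between the two topics.&lt;/li&gt;
&lt;li&gt;The difference is measured as a log ratio of one topic against the other, here &lt;span class=&quot;math inline&quot;&gt;\(\log_2(\beta_2 / \beta_1)\)&lt;/span&gt;.&lt;/li&gt;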
&lt;/ul&gt;
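&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A log ratio is convenient because it is symmetric around zero. The base R sketch below uses made-up beta values (not model output) to show this: a term twice as likely in topic 2 scores 1, an equally likely term scores 0, and a term four times as likely in topic 1 scores -2.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Illustrative beta values for three hypothetical terms
beta_topic1 &lt;- c(term_a = 0.002, term_b = 0.001, term_c = 0.004)
beta_topic2 &lt;- c(term_a = 0.004, term_b = 0.001, term_c = 0.001)

# Positive values favour topic 2, negative values favour topic 1
log_ratio &lt;- log2(beta_topic2 / beta_topic1)
log_ratio&lt;/code&gt;&lt;/pre&gt;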
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;ap_topics_wider &amp;lt;- ap_topics %&amp;gt;% 
  mutate(topic = paste0(&quot;topic&quot;, topic)) %&amp;gt;% 
  pivot_wider(
    names_from = &quot;topic&quot;,
    values_from = beta,
    values_fill = 0
  ) %&amp;gt;% 
  filter(topic1 &amp;gt; 0.001 | topic2 &amp;gt; 0.001) %&amp;gt;% 
  mutate(log_ratio = log2(topic2/topic1))

ap_topics_wider&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 209 x 4
##    term             topic1    topic2 log_ratio
##    &amp;lt;chr&amp;gt;             &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
##  1 administration 1.13e- 3 0.000767     -0.562
##  2 agreement      6.29e- 4 0.00130       1.05 
##  3 aid            1.01e- 3 0.0000431    -4.54 
##  4 air            8.71e- 4 0.00134       0.628
##  5 american       1.67e- 3 0.00207       0.309
##  6 analysts       2.37e- 6 0.00116       8.94 
##  7 army           1.17e- 3 0.0000110    -6.74 
##  8 asked          1.41e- 3 0.000327     -2.11 
##  9 authorities    1.11e- 3 0.000138     -3.01 
## 10 average        7.39e-12 0.00144      27.5  
## # &amp;hellip; with 199 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;ap_topics_wider %&amp;gt;% 
  group_by(group = ifelse(log_ratio &amp;gt;= 0, &quot;+&quot;, &quot;-&quot;)) %&amp;gt;% 
  slice_max(abs(log_ratio), n = 10) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  ggplot(aes(x = reorder(term, log_ratio), y = log_ratio)) +
  geom_bar(stat = &quot;identity&quot;, width = 0.8) +
  coord_flip() +
  labs(x = NULL, y = &quot;Log2 ratio of beta in topic2 / topic1&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;672&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-20 오후 8.02.47.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bTOuIc/btq91NSI6YI/snvcNnNgCEwgvmHjskjku1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bTOuIc/btq91NSI6YI/snvcNnNgCEwgvmHjskjku1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bTOuIc/btq91NSI6YI/snvcNnNgCEwgvmHjskjku1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbTOuIc%2Fbtq91NSI6YI%2FsnvcNnNgCEwgvmHjskjku1%2Fimg.png&quot; data-origin-width=&quot;672&quot; data-origin-height=&quot;472&quot; data-filename=&quot;스크린샷 2021-07-20 오후 8.02.47.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The plot shows that words such as &amp;ldquo;stock&amp;rdquo; and &amp;ldquo;dollar&amp;rdquo; are relatively more common in topic 2, which suggests that topic 2 corresponds to economy-related articles.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size26&quot;&gt;6. 1. 2. Document-topic probabilities&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To obtain the document-topic probabilities, which the model calls &lt;span class=&quot;math inline&quot;&gt;\(\gamma\)&lt;/span&gt;, we again apply the &lt;code&gt;tidy()&lt;/code&gt; function.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;ap_documents &amp;lt;- ap_lda %&amp;gt;% 
  tidy(matrix = &quot;gamma&quot;)

ap_documents&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 4,492 x 3
##    document topic  gamma
##       &amp;lt;int&amp;gt; &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt;
##  1        1     1 0.999 
##  2        2     1 0.612 
##  3        3     1 0.959 
##  4        4     1 0.797 
##  5        5     1 0.997 
##  6        6     1 1.00  
##  7        7     1 0.193 
##  8        8     1 0.997 
##  9        9     1 0.0261
## 10       10     1 0.926 
## # &amp;hellip; with 4,482 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The result is a data frame with one probability (gamma) per document per topic.&lt;/li&gt;
&lt;li&gt;Each gamma value is the estimated proportion of words in that document that were generated from that topic.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For example, the model estimates that only about 61% of the words in document 2 were generated from topic 1 (second row).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;In the output above, almost 100% of the words in document 6 are estimated to come from topic 1.&lt;/li&gt;
&lt;li&gt;To see why, we can check which words are most frequent in that document.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;AssociatedPress %&amp;gt;% 
  tidy() %&amp;gt;% 
  filter(document == 6) %&amp;gt;% 
  arrange(desc(count))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 287 x 3
##    document term           count
##       &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
##  1        6 noriega           16
##  2        6 panama            12
##  3        6 jackson            6
##  4        6 powell             6
##  5        6 administration     5
##  6        6 economic           5
##  7        6 general            5
##  8        6 i                  5
##  9        6 panamanian         5
## 10        6 american           4
## # &amp;hellip; with 277 more rows&lt;/code&gt;&lt;/pre&gt;
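&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As a rough sketch of how a gamma table can be read, the toy example below checks that the gamma values of each document sum to 1 and then picks the most probable topic per document. The &lt;code&gt;toy_gamma&lt;/code&gt; tibble and its values are made up for illustration (they are not taken from &lt;code&gt;ap_documents&lt;/code&gt;); only the column layout mirrors the output of &lt;code&gt;tidy(matrix = &quot;gamma&quot;)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(dplyr)

# Toy gamma table (made-up values), same columns as tidy(matrix = &quot;gamma&quot;)
toy_gamma &amp;lt;- tibble(
  document = c(1L, 1L, 2L, 2L),
  topic    = c(1L, 2L, 1L, 2L),
  gamma    = c(0.75, 0.25, 0.125, 0.875)
)

# Within each document, gamma over all topics should sum to 1
gamma_totals &amp;lt;- toy_gamma %&amp;gt;% 
  group_by(document) %&amp;gt;% 
  summarise(total = sum(gamma))

# Most probable topic for each document
top_topic &amp;lt;- toy_gamma %&amp;gt;% 
  group_by(document) %&amp;gt;% 
  slice_max(gamma, n = 1) %&amp;gt;% 
  ungroup()

gamma_totals
top_topic&lt;/code&gt;&lt;/pre&gt;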
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;6. 2. Example: the great library heist&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;I tried to reproduce this example as written, but for some reason I could not load the files.&lt;/li&gt;
&lt;li&gt;I will look into the cause and re-upload this part once it is sorted out.&lt;/li&gt;
&lt;li&gt;I strongly recommend working through the example at least once using the link below.&lt;/li&gt;
&lt;li&gt;If you need it, see &lt;a href=&quot;https://www.tidytextmining.com/topicmodeling.html#library-heist&quot;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;</description>
      <category>tidytext</category>
      <category>DocumentTermMatrix</category>
      <category>fuzzy clustering</category>
      <category>Latent Dirichlet Allocation</category>
      <category>LDA</category>
      <category>soft clustering</category>
      <category>tidy data</category>
      <category>tidytext</category>
      <category>topic model</category>
      <category>topicmodels</category>
      <category>토픽모델링</category>
      <author>택2</author>
      <guid isPermaLink="true">https://rstatistics.tistory.com/73</guid>
      <comments>https://rstatistics.tistory.com/73#entry73comment</comments>
      <pubDate>Tue, 20 Jul 2021 20:05:56 +0900</pubDate>
    </item>
    <item>
      <title>[R] 5. Converting to and from non-tidy formats</title>
      <link>https://rstatistics.tistory.com/72</link>
      <description>&lt;h2 data-ke-size=&quot;size26&quot;&gt;5. Converting to and from non-tidy formats&lt;/h2&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This chapter explains how to analyze text data that is stored not in the tidy text format but in the corpus (&lt;code&gt;corpus&lt;/code&gt;) objects used by libraries such as &lt;code&gt;tm&lt;/code&gt; and &lt;code&gt;quanteda&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;1362&quot; data-origin-height=&quot;832&quot; data-filename=&quot;스크린샷 2021-07-19 오전 10.25.16.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/q300g/btq9RbLYe4U/X54Y1JLQgVckeyAHukkwV0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/q300g/btq9RbLYe4U/X54Y1JLQgVckeyAHukkwV0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/q300g/btq9RbLYe4U/X54Y1JLQgVckeyAHukkwV0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fq300g%2Fbtq9RbLYe4U%2FX54Y1JLQgVckeyAHukkwV0%2Fimg.png&quot; data-origin-width=&quot;1362&quot; data-origin-height=&quot;832&quot; data-filename=&quot;스크린샷 2021-07-19 오전 10.25.16.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;5. 1. Tidying a document-term matrix&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The document-term matrix (DTM) is one of the most common structures in text analysis.&lt;/li&gt;
&lt;li&gt;It has the following form:
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Each row represents one document (e.g. a book, an article, &amp;hellip;).&lt;/li&gt;
&lt;li&gt;Each column represents one term.&lt;/li&gt;
&lt;li&gt;Each value is typically the number of times that term appears in that document.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Because most document-term pairs never occur (most values are 0), a DTM is usually implemented as a sparse matrix.&lt;/li&gt;
&lt;li&gt;DTM objects cannot be used directly with tidy tools, but the &lt;code&gt;tidytext&lt;/code&gt; library provides a function that converts them into a tidy data frame.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;tidy()&lt;/code&gt;: converts a DTM to tidy data (the generic comes from the &lt;code&gt;broom&lt;/code&gt; library; &lt;code&gt;tidytext&lt;/code&gt; supplies the methods for DTM objects)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
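&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To make the structure concrete, here is a minimal base R sketch with a made-up 2 x 3 matrix (&lt;code&gt;toy_dtm&lt;/code&gt; is hypothetical, not a real &lt;code&gt;DocumentTermMatrix&lt;/code&gt;). It computes the share of zero cells and then builds a long table with one row per non-zero document-term pair, which is roughly the shape that tidying a DTM produces.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Toy document-term matrix (made-up counts): rows are documents, columns are terms
toy_dtm &amp;lt;- matrix(
  c(2, 0, 1,
    0, 3, 0),
  nrow = 2, byrow = TRUE,
  dimnames = list(c(&quot;doc1&quot;, &quot;doc2&quot;), c(&quot;apple&quot;, &quot;banana&quot;, &quot;cherry&quot;))
)

# Sparsity: share of cells that are zero (3 of the 6 cells here)
sparsity &amp;lt;- mean(toy_dtm == 0)

# Long (tidy-style) form: one row per non-zero document-term pair
nonzero &amp;lt;- which(toy_dtm != 0, arr.ind = TRUE)
toy_long &amp;lt;- data.frame(
  document = rownames(toy_dtm)[nonzero[, 1]],
  term     = colnames(toy_dtm)[nonzero[, 2]],
  count    = toy_dtm[nonzero]
)

sparsity
toy_long&lt;/code&gt;&lt;/pre&gt;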
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size26&quot;&gt;5. 1. 1. Tidying DocumentTermMatrix objects&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The most widely used DTM implementation in R is the &lt;code&gt;DocumentTermMatrix&lt;/code&gt; class from the &lt;code&gt;tm&lt;/code&gt; library.&lt;/li&gt;
&lt;li&gt;As an example, we use the Associated Press news article data (&lt;code&gt;AssociatedPress&lt;/code&gt;) included in the &lt;code&gt;topicmodels&lt;/code&gt; library.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(tm)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Loading required package: NLP&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## 
## Attaching package: 'NLP'&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## The following object is masked from 'package:ggplot2':
## 
##     annotate&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;data(&quot;AssociatedPress&quot;, package = &quot;topicmodels&quot;)
AssociatedPress&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## &amp;lt;&amp;lt;DocumentTermMatrix (documents: 2246, terms: 10473)&amp;gt;&amp;gt;
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As shown above, this is a DTM with 2,246 news articles and 10,473 terms. It is 99% sparse, meaning that 99% of the document-term pairs have a value of 0.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;Terms()&lt;/code&gt; function gives access to the terms in the matrix.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;terms &amp;lt;- Terms(AssociatedPress)

terms %&amp;gt;% 
  head(20)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;##  [1] &quot;aaron&quot;      &quot;abandon&quot;    &quot;abandoned&quot;  &quot;abandoning&quot; &quot;abbott&quot;    
##  [6] &quot;abboud&quot;     &quot;abc&quot;        &quot;abcs&quot;       &quot;abctvs&quot;     &quot;abdomen&quot;   
## [11] &quot;abducted&quot;   &quot;abduction&quot;  &quot;abductors&quot;  &quot;abdul&quot;      &quot;abide&quot;     
## [16] &quot;abilities&quot;  &quot;ability&quot;    &quot;ablaze&quot;     &quot;able&quot;       &quot;abm&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To convert this into the tidy data format, we use the &lt;code&gt;tidy()&lt;/code&gt; function.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;ap_tidy &amp;lt;- AssociatedPress %&amp;gt;% 
  tidy()

ap_tidy&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 302,031 x 3
##    document term       count
##       &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt;
##  1        1 adding         1
##  2        1 adult          2
##  3        1 ago            1
##  4        1 alcohol        1
##  5        1 allegedly      1
##  6        1 allen          1
##  7        1 apparently     2
##  8        1 appeared       1
##  9        1 arrested       1
## 10        1 assault        1
## # &amp;hellip; with 302,021 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The result is similar to what the &lt;code&gt;melt()&lt;/code&gt; function from the &lt;code&gt;reshape2&lt;/code&gt; library produces for an ordinary (non-sparse) matrix.&lt;/li&gt;
&lt;li&gt;Note that &lt;code&gt;tidy()&lt;/code&gt; keeps only the non-zero values: document-term pairs with a count of 0 do not appear in the output.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;With the data in this form, we can now run a sentiment analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;ap_sentiments &amp;lt;- ap_tidy %&amp;gt;% 
  inner_join(get_sentiments(&quot;bing&quot;), by = c(term = &quot;word&quot;))

ap_sentiments&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 30,094 x 4
##    document term    count sentiment
##       &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;    
##  1        1 assault     1 negative 
##  2        1 complex     1 negative 
##  3        1 death       1 negative 
##  4        1 died        1 negative 
##  5        1 good        2 positive 
##  6        1 illness     1 negative 
##  7        1 killed      2 negative 
##  8        1 like        2 positive 
##  9        1 liked       1 positive 
## 10        1 miracle     1 positive 
## # &amp;hellip; with 30,084 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;ap_sentiments %&amp;gt;% 
  count(sentiment, term, wt = count) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  filter(n &amp;gt;= 200) %&amp;gt;% 
  mutate(n = ifelse(sentiment == &quot;negative&quot;, -n, n)) %&amp;gt;% 
  ggplot(aes(x = reorder(term, n), y = n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = &quot;Contribution to sentiment&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;663&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-19 오후 12.32.04.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/dVWcMz/btq9OwRyd6x/NPYlNdOmn4QTXyTddYpeqk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/dVWcMz/btq9OwRyd6x/NPYlNdOmn4QTXyTddYpeqk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/dVWcMz/btq9OwRyd6x/NPYlNdOmn4QTXyTddYpeqk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FdVWcMz%2Fbtq9OwRyd6x%2FNPYlNdOmn4QTXyTddYpeqk%2Fimg.png&quot; data-origin-width=&quot;663&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-19 오후 12.32.04.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The most common positive words include &amp;ldquo;like&amp;rdquo;, &amp;ldquo;work&amp;rdquo;, &amp;ldquo;support&amp;rdquo;, and &amp;ldquo;good&amp;rdquo;;&lt;br /&gt;the most common negative words include &amp;ldquo;killed&amp;rdquo; and &amp;ldquo;death&amp;rdquo;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size26&quot;&gt;5. 1. 2. Tidying &lt;code&gt;dfm&lt;/code&gt; objects&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Another library, &lt;code&gt;quanteda&lt;/code&gt;, provides its own document-term matrix implementation: its &lt;code&gt;dfm()&lt;/code&gt; function returns an object of class &lt;code&gt;dfm&lt;/code&gt;.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;&lt;code&gt;dtm&lt;/code&gt;, &lt;code&gt;dfm&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;The two names differ only in the middle letter.&lt;/li&gt;
&lt;li&gt;The f in &lt;code&gt;dfm&lt;/code&gt; stands for feature (document-feature matrix).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;As an example, we use the &lt;code&gt;data_corpus_inaugural&lt;/code&gt; data from the &lt;code&gt;quanteda&lt;/code&gt; library (a corpus of inaugural addresses).&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(quanteda)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Package version: 3.0.0
## Unicode version: 13.0
## ICU version: 69.1&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## Parallel computing: 16 of 16 threads used.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## See https://quanteda.io for tutorials and examples.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## 
## Attaching package: 'quanteda'&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## The following object is masked from 'package:tm':
## 
##     stopwords&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## The following objects are masked from 'package:NLP':
## 
##     meta, meta&amp;lt;-&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;data(&quot;data_corpus_inaugural&quot;, package = &quot;quanteda&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;data_corpus_inaugural&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;shell&quot;&gt;&lt;code&gt;## Corpus consisting of 59 documents and 4 docvars.
## 1789-Washington :
## &quot;Fellow-Citizens of the Senate and of the House of Representa...&quot;
## 
## 1793-Washington :
## &quot;Fellow citizens, I am again called upon by the voice of my c...&quot;
## 
## 1797-Adams :
## &quot;When it was first perceived, in early times, that no middle ...&quot;
## 
## 1801-Jefferson :
## &quot;Friends and Fellow Citizens: Called upon to undertake the du...&quot;
## 
## 1805-Jefferson :
## &quot;Proceeding, fellow citizens, to that qualification which the...&quot;
## 
## 1809-Madison :
## &quot;Unwilling to depart from examples of the most revered author...&quot;
## 
## [ reached max_ndoc ... 53 more documents ]&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# quanteda 3 deprecates calling dfm() directly on a corpus, so tokenize first
inaug_dfm &amp;lt;- data_corpus_inaugural %&amp;gt;% 
  tokens() %&amp;gt;% 
  dfm(verbose = FALSE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;inaug_dfm&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## Document-feature matrix of: 59 documents, 9,439 features (91.84% sparse) and 4 docvars.
##                  features
## docs              fellow-citizens  of the senate and house representatives :
##   1789-Washington               1  71 116      1  48     2               2 1
##   1793-Washington               0  11  13      0   2     0               0 1
##   1797-Adams                    3 140 163      1 130     0               2 0
##   1801-Jefferson                2 104 130      0  81     0               0 1
##   1805-Jefferson                0 101 143      0  93     0               0 0
##   1809-Madison                  1  69 104      0  43     0               0 0
##                  features
## docs              among vicissitudes
##   1789-Washington     1            1
##   1793-Washington     0            0
##   1797-Adams          4            0
##   1801-Jefferson      1            0
##   1805-Jefferson      7            0
##   1809-Madison        0            0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,429 more features ]&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The &lt;code&gt;tidy()&lt;/code&gt; function can be applied to this object as well.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;inaug_tidy &amp;lt;- inaug_dfm %&amp;gt;% 
  tidy()

inaug_tidy&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 45,453 x 3
##    document        term            count
##    &amp;lt;chr&amp;gt;           &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;
##  1 1789-Washington fellow-citizens     1
##  2 1797-Adams      fellow-citizens     3
##  3 1801-Jefferson  fellow-citizens     2
##  4 1809-Madison    fellow-citizens     1
##  5 1813-Madison    fellow-citizens     1
##  6 1817-Monroe     fellow-citizens     5
##  7 1821-Monroe     fellow-citizens     1
##  8 1841-Harrison   fellow-citizens    11
##  9 1845-Polk       fellow-citizens     1
## 10 1849-Taylor     fellow-citizens     1
## # &amp;hellip; with 45,443 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;From here we can compute TF-IDF values with the &lt;code&gt;bind_tf_idf()&lt;/code&gt; function and visualize them.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;inaug_tf_idf &amp;lt;- inaug_tidy %&amp;gt;% 
  bind_tf_idf(
    term = term,
    document = document,
    n = count
  )

inaug_tf_idf&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 45,453 x 6
##    document        term            count       tf   idf   tf_idf
##    &amp;lt;chr&amp;gt;           &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;    &amp;lt;dbl&amp;gt;
##  1 1789-Washington fellow-citizens     1 0.000651  1.13 0.000737
##  2 1797-Adams      fellow-citizens     3 0.00116   1.13 0.00132 
##  3 1801-Jefferson  fellow-citizens     2 0.00104   1.13 0.00118 
##  4 1809-Madison    fellow-citizens     1 0.000793  1.13 0.000899
##  5 1813-Madison    fellow-citizens     1 0.000768  1.13 0.000870
##  6 1817-Monroe     fellow-citizens     5 0.00136   1.13 0.00154 
##  7 1821-Monroe     fellow-citizens     1 0.000205  1.13 0.000232
##  8 1841-Harrison   fellow-citizens    11 0.00121   1.13 0.00137 
##  9 1845-Polk       fellow-citizens     1 0.000193  1.13 0.000218
## 10 1849-Taylor     fellow-citizens     1 0.000849  1.13 0.000962
## # &amp;hellip; with 45,443 more rows&lt;/code&gt;&lt;/pre&gt;
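&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To see where these columns come from, the sketch below runs &lt;code&gt;bind_tf_idf()&lt;/code&gt; on a made-up table (&lt;code&gt;toy_counts&lt;/code&gt; is hypothetical) and recomputes the values by hand, assuming tf is the count divided by the total count in the document and idf is the natural log of the number of documents divided by the number of documents containing the term. A term that appears in every document therefore gets an idf of 0.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(dplyr)
library(tidytext)

# Toy counts (made up): &quot;apple&quot; occurs in both documents, &quot;banana&quot; only in d1
toy_counts &amp;lt;- tibble(
  document = c(&quot;d1&quot;, &quot;d1&quot;, &quot;d2&quot;),
  term     = c(&quot;apple&quot;, &quot;banana&quot;, &quot;apple&quot;),
  count    = c(3, 1, 2)
)

toy_tf_idf &amp;lt;- toy_counts %&amp;gt;% 
  bind_tf_idf(term = term, document = document, n = count)

# Hand calculation under the assumptions stated above
n_docs &amp;lt;- n_distinct(toy_counts$document)

by_hand &amp;lt;- toy_counts %&amp;gt;% 
  group_by(document) %&amp;gt;% 
  mutate(tf = count / sum(count)) %&amp;gt;% 
  group_by(term) %&amp;gt;% 
  mutate(idf = log(n_docs / n_distinct(document))) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  mutate(tf_idf = tf * idf)

toy_tf_idf
by_hand&lt;/code&gt;&lt;/pre&gt;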
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;inaug_tf_idf %&amp;gt;% 
  filter(document %in% c(&quot;1861-Lincoln&quot;, &quot;1933-Roosevelt&quot;, &quot;1961-Kennedy&quot;, &quot;2009-Obama&quot;)) %&amp;gt;% 
  group_by(document) %&amp;gt;% 
  slice_max(tf_idf, n = 10) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  ggplot(aes(x = reorder(term, tf_idf), y = tf_idf, fill = document)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ document, ncol = 2, scales = &quot;free&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;678&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-19 오후 12.32.17.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/em5JFG/btq9Mttzv9L/ziupy6CKEfCbuqoVFEpXM1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/em5JFG/btq9Mttzv9L/ziupy6CKEfCbuqoVFEpXM1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/em5JFG/btq9Mttzv9L/ziupy6CKEfCbuqoVFEpXM1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2Fem5JFG%2Fbtq9Mttzv9L%2Fziupy6CKEfCbuqoVFEpXM1%2Fimg.png&quot; data-origin-width=&quot;678&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-19 오후 12.32.17.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As another visualization example, we can extract the year from each document name and compute the total number of words in each year.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;year_term_counts &amp;lt;- inaug_tidy %&amp;gt;% 
  extract(
    col = document,
    into = &quot;year&quot;,
    regex = &quot;(\\d+)&quot;,
    convert = TRUE
  ) %&amp;gt;% 
  complete(year, term, fill = list(count = 0)) %&amp;gt;% # add zero counts for terms that do not appear in a given year
  group_by(year) %&amp;gt;% 
  mutate(total_count = sum(count)) %&amp;gt;% 
  ungroup()

year_term_counts&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 556,901 x 4
##     year term  count total_count
##    &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;
##  1  1789 &quot;-&quot;       1        1537
##  2  1789 &quot;,&quot;      70        1537
##  3  1789 &quot;;&quot;       8        1537
##  4  1789 &quot;:&quot;       1        1537
##  5  1789 &quot;!&quot;       0        1537
##  6  1789 &quot;?&quot;       0        1537
##  7  1789 &quot;.&quot;      23        1537
##  8  1789 &quot;&amp;hellip;&quot;       0        1537
##  9  1789 &quot;'&quot;       0        1537
## 10  1789 &quot;\&quot;&quot;      2        1537
## # &amp;hellip; with 556,891 more rows&lt;/code&gt;&lt;/pre&gt;
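&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The same &lt;code&gt;extract()&lt;/code&gt; and &lt;code&gt;complete()&lt;/code&gt; steps are easier to follow on a tiny table. In the sketch below, &lt;code&gt;toy_counts&lt;/code&gt; and its counts are made up; the point is that the missing year-term combination is added with a count of 0, so every year ends up with one row per term.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(dplyr)
library(tidyr)

# Toy counts (made up): the term &quot;the&quot; is missing for 1793
toy_counts &amp;lt;- tibble(
  document = c(&quot;1789-Washington&quot;, &quot;1789-Washington&quot;, &quot;1793-Washington&quot;),
  term     = c(&quot;of&quot;, &quot;the&quot;, &quot;of&quot;),
  count    = c(5, 7, 2)
)

toy_year_counts &amp;lt;- toy_counts %&amp;gt;% 
  # keep the first run of digits in the document name as an integer year
  extract(col = document, into = &quot;year&quot;, regex = &quot;(\\d+)&quot;, convert = TRUE) %&amp;gt;% 
  # add the missing year-term combinations with a count of 0
  complete(year, term, fill = list(count = 0)) %&amp;gt;% 
  group_by(year) %&amp;gt;% 
  mutate(total_count = sum(count)) %&amp;gt;% 
  ungroup()

toy_year_counts&lt;/code&gt;&lt;/pre&gt;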
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Filtering to a few words of interest shows how their frequency has changed over time.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;year_term_counts %&amp;gt;% 
  filter(term %in% c(&quot;god&quot;, &quot;america&quot;, &quot;foreign&quot;, &quot;union&quot;, &quot;constitution&quot;, &quot;freedom&quot;)) %&amp;gt;% 
  mutate(ratio = count/total_count) %&amp;gt;% 
  ggplot(aes(x = year, y = ratio)) +
  geom_point(size = 1.2) +
  geom_smooth(formula = y ~ x, method = &quot;loess&quot;) +
  facet_wrap(~ term, scales = &quot;free_y&quot;) +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = &quot;% frequency of word in inaugural address&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;678&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-19 오후 12.32.28.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/caJpm5/btq9Re3FDhZ/1FILK7yM6zUjRx4JyJLRe1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/caJpm5/btq9Re3FDhZ/1FILK7yM6zUjRx4JyJLRe1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/caJpm5/btq9Re3FDhZ/1FILK7yM6zUjRx4JyJLRe1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcaJpm5%2Fbtq9Re3FDhZ%2F1FILK7yM6zUjRx4JyJLRe1%2Fimg.png&quot; data-origin-width=&quot;678&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-19 오후 12.32.28.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;5. 2. Casting tidy text data into a matrix&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Some functions require a DTM rather than tidy data as input.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;tidytext&lt;/code&gt; library also provides a function that converts tidy data into a DTM.&lt;/li&gt;
&lt;li&gt;That function is &lt;code&gt;cast_dtm()&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;We reuse the &lt;code&gt;ap_tidy&lt;/code&gt; object that was converted from a DTM to tidy data above.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;ap_tidy&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 302,031 x 3
##    document term       count
##       &amp;lt;int&amp;gt; &amp;lt;chr&amp;gt;      &amp;lt;dbl&amp;gt;
##  1        1 adding         1
##  2        1 adult          2
##  3        1 ago            1
##  4        1 alcohol        1
##  5        1 allegedly      1
##  6        1 allen          1
##  7        1 apparently     2
##  8        1 appeared       1
##  9        1 arrested       1
## 10        1 assault        1
## # &amp;hellip; with 302,021 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;ap_dtm &amp;lt;- ap_tidy %&amp;gt;% 
  cast_dtm(
    term = term,
    document = document,
    value = count
  )

ap_dtm&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## &amp;lt;&amp;lt;DocumentTermMatrix (documents: 2246, terms: 10473)&amp;gt;&amp;gt;
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)&lt;/code&gt;&lt;/pre&gt;
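&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The round trip can also be checked on a tiny table. This is only a sketch: &lt;code&gt;toy_tidy&lt;/code&gt; is made up, and it assumes the &lt;code&gt;tm&lt;/code&gt; library is installed, since &lt;code&gt;cast_dtm()&lt;/code&gt; builds a &lt;code&gt;DocumentTermMatrix&lt;/code&gt;. Casting fills in the missing pair as an implicit zero, and tidying the result drops it again.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(dplyr)
library(tidytext)  # cast_dtm() also needs the tm package to be installed

# Toy tidy table (made up): document &quot;b&quot; has no row for term &quot;y&quot;
toy_tidy &amp;lt;- tibble(
  document = c(&quot;a&quot;, &quot;a&quot;, &quot;b&quot;),
  term     = c(&quot;x&quot;, &quot;y&quot;, &quot;x&quot;),
  count    = c(1, 2, 3)
)

toy_dtm &amp;lt;- toy_tidy %&amp;gt;% 
  cast_dtm(document = document, term = term, value = count)

# 2 documents x 2 terms; the (b, y) cell is an implicit zero
dim(toy_dtm)

# Converting back keeps only the three non-zero pairs
toy_round_trip &amp;lt;- tidy(toy_dtm)

toy_round_trip&lt;/code&gt;&lt;/pre&gt;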
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Likewise, tidy data can be converted into a &lt;code&gt;dfm&lt;/code&gt; object for the &lt;code&gt;quanteda&lt;/code&gt; library with the &lt;code&gt;cast_dfm()&lt;/code&gt; function.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;ap_dfm &amp;lt;- ap_tidy %&amp;gt;% 
  cast_dfm(
    term = term,
    document = document,
    value = count
  )

ap_dfm&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## Document-feature matrix of: 2,246 documents, 10,473 features (98.72% sparse) and 0 docvars.
##     features
## docs adding adult ago alcohol allegedly allen apparently appeared arrested
##    1      1     2   1       1         1     1          2        1        1
##    2      0     0   0       0         0     0          0        1        0
##    3      0     0   1       0         0     0          0        1        0
##    4      0     0   3       0         0     0          0        0        0
##    5      0     0   0       0         0     0          0        0        0
##    6      0     0   2       0         0     0          0        0        0
##     features
## docs assault
##    1       1
##    2       0
##    3       0
##    4       0
##    5       0
##    6       0
## [ reached max_ndoc ... 2,240 more documents, reached max_nfeat ... 10,463 more features ]&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;With this kind of conversion, the Jane Austen books used as examples in earlier chapters can also be turned into a DTM object.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(janeaustenr)

austen_dtm &amp;lt;- austen_books() %&amp;gt;% 
  unnest_tokens(input = text, output = &quot;word&quot;) %&amp;gt;% 
  count(book, word) %&amp;gt;% 
  cast_dtm(
    term = word,
    document = book,
    value = n
  )

austen_dtm&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## &amp;lt;&amp;lt;DocumentTermMatrix (documents: 6, terms: 14520)&amp;gt;&amp;gt;
## Non-/sparse entries: 40379/46741
## Sparsity           : 54%
## Maximal term length: 19
## Weighting          : term frequency (tf)&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;5. 3. Tidying corpus objects with metadata&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A &lt;code&gt;Corpus&lt;/code&gt; object stores a collection of documents before tokenization.&lt;/li&gt;
&lt;li&gt;It stores the text together with metadata, which may include an ID, a date/time, or a title for each document.&lt;/li&gt;
&lt;li&gt;Let's look at an example: the &lt;code&gt;acq&lt;/code&gt; data from the &lt;code&gt;tm&lt;/code&gt; library (50 news articles).&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;data(&quot;acq&quot;)

acq&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;pgsql&quot;&gt;&lt;code&gt;## &amp;lt;&amp;lt;VCorpus&amp;gt;&amp;gt;
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 50&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;class(acq)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## [1] &quot;VCorpus&quot; &quot;Corpus&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As shown below, each document in a &lt;code&gt;Corpus&lt;/code&gt; object contains both text and metadata.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;acq[[1]]&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;pgsql&quot;&gt;&lt;code&gt;## &amp;lt;&amp;lt;PlainTextDocument&amp;gt;&amp;gt;
## Metadata:  15
## Content:  chars: 1287&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This is a flexible way to store text data, but it is not well suited to processing with the &lt;code&gt;tidytext&lt;/code&gt; library.&lt;/li&gt;
&lt;li&gt;We therefore use the &lt;code&gt;tidy()&lt;/code&gt; function to convert it into the tidy data format before analysis.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;acq_tidy &amp;lt;- acq %&amp;gt;% 
  tidy()

acq_tidy&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;objectivec&quot;&gt;&lt;code&gt;## # A tibble: 50 x 16
##    author  datetimestamp       description heading  id    language origin topics
##    &amp;lt;chr&amp;gt;   &amp;lt;dttm&amp;gt;              &amp;lt;chr&amp;gt;       &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;chr&amp;gt;  &amp;lt;chr&amp;gt; 
##  1 &amp;lt;NA&amp;gt;    1987-02-26 15:18:06 &quot;&quot;          COMPUTE&amp;hellip; 10    en       Reute&amp;hellip; YES   
##  2 &amp;lt;NA&amp;gt;    1987-02-26 15:19:15 &quot;&quot;          OHIO MA&amp;hellip; 12    en       Reute&amp;hellip; YES   
##  3 &amp;lt;NA&amp;gt;    1987-02-26 15:49:56 &quot;&quot;          MCLEAN'&amp;hellip; 44    en       Reute&amp;hellip; YES   
##  4 By Cal&amp;hellip; 1987-02-26 15:51:17 &quot;&quot;          CHEMLAW&amp;hellip; 45    en       Reute&amp;hellip; YES   
##  5 &amp;lt;NA&amp;gt;    1987-02-26 16:08:33 &quot;&quot;          &amp;lt;COFAB &amp;hellip; 68    en       Reute&amp;hellip; YES   
##  6 &amp;lt;NA&amp;gt;    1987-02-26 16:32:37 &quot;&quot;          INVESTM&amp;hellip; 96    en       Reute&amp;hellip; YES   
##  7 By Pat&amp;hellip; 1987-02-26 16:43:13 &quot;&quot;          AMERICA&amp;hellip; 110   en       Reute&amp;hellip; YES   
##  8 &amp;lt;NA&amp;gt;    1987-02-26 16:59:25 &quot;&quot;          HONG KO&amp;hellip; 125   en       Reute&amp;hellip; YES   
##  9 &amp;lt;NA&amp;gt;    1987-02-26 17:01:28 &quot;&quot;          LIEBERT&amp;hellip; 128   en       Reute&amp;hellip; YES   
## 10 &amp;lt;NA&amp;gt;    1987-02-26 17:08:27 &quot;&quot;          GULF AP&amp;hellip; 134   en       Reute&amp;hellip; YES   
## # &amp;hellip; with 40 more rows, and 8 more variables: lewissplit &amp;lt;chr&amp;gt;, cgisplit &amp;lt;chr&amp;gt;,
## #   oldid &amp;lt;chr&amp;gt;, places &amp;lt;named list&amp;gt;, people &amp;lt;lgl&amp;gt;, orgs &amp;lt;lgl&amp;gt;,
## #   exchanges &amp;lt;lgl&amp;gt;, text &amp;lt;chr&amp;gt;&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;We can then tokenize the text into words with the &lt;code&gt;unnest_tokens()&lt;/code&gt; function.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;acq_tokens &amp;lt;- acq_tidy %&amp;gt;% 
  select(-places) %&amp;gt;% 
  unnest_tokens(
    input = text,
    output = &quot;word&quot;
  ) %&amp;gt;% 
  anti_join(stop_words, by = &quot;word&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;pgsql&quot;&gt;&lt;code&gt;## Warning: Outer names are only allowed for unnamed scalar atomic inputs&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# most common words
acq_tokens %&amp;gt;% 
  count(word, sort = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 1,566 x 2
##    word         n
##    &amp;lt;chr&amp;gt;    &amp;lt;int&amp;gt;
##  1 dlrs       100
##  2 pct         70
##  3 mln         65
##  4 company     63
##  5 shares      52
##  6 reuter      50
##  7 stock       46
##  8 offer       34
##  9 share       34
## 10 american    28
## # &amp;hellip; with 1,556 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# tf-idf
acq_tokens %&amp;gt;% 
  count(id, word) %&amp;gt;% 
  bind_tf_idf(
    term = word,
    document = id,
    n = n
  ) %&amp;gt;% 
  arrange(desc(tf_idf))&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 2,853 x 6
##    id    word         n     tf   idf tf_idf
##    &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;    &amp;lt;int&amp;gt;  &amp;lt;dbl&amp;gt; &amp;lt;dbl&amp;gt;  &amp;lt;dbl&amp;gt;
##  1 186   groupe       2 0.133   3.91  0.522
##  2 128   liebert      3 0.130   3.91  0.510
##  3 474   esselte      5 0.109   3.91  0.425
##  4 371   burdett      6 0.103   3.91  0.405
##  5 442   hazleton     4 0.103   3.91  0.401
##  6 199   circuit      5 0.102   3.91  0.399
##  7 162   suffield     2 0.1     3.91  0.391
##  8 498   west         3 0.1     3.91  0.391
##  9 441   rmj          8 0.121   3.22  0.390
## 10 467   nursery      3 0.0968  3.91  0.379
## # &amp;hellip; with 2,843 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;</description>
      <category>tidytext</category>
      <category>cast_dfm</category>
      <category>cast_dtm</category>
      <category>dfm</category>
      <category>document-feature matrix</category>
      <category>document-term matrix</category>
      <category>DTM</category>
      <category>quanteda</category>
      <category>sprase matrix</category>
      <category>TIDY</category>
      <category>tm</category>
      <author>택2</author>
      <guid isPermaLink="true">https://rstatistics.tistory.com/72</guid>
      <comments>https://rstatistics.tistory.com/72#entry72comment</comments>
      <pubDate>Mon, 19 Jul 2021 12:33:04 +0900</pubDate>
    </item>
    <item>
      <title>[R] 4. Relationships between words: n-grams and correlations</title>
      <link>https://rstatistics.tistory.com/71</link>
      <description>&lt;h2 data-ke-size=&quot;size26&quot;&gt;4. Relationships between words: n-grams and correlations&lt;/h2&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;4. 1. Tokenizing by n-gram&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;So far we have used the &lt;code&gt;unnest_tokens()&lt;/code&gt; function to tokenize by word or by sentence;&lt;br /&gt;these token units are useful for sentiment and frequency analysis.&lt;/li&gt;
&lt;li&gt;However, the same function can also tokenize into consecutive sequences of words, called n-grams.&lt;/li&gt;
&lt;li&gt;That is, by checking how often a particular word follows another, we can examine the relationships between them.&lt;/li&gt;
&lt;li&gt;The approach is simple: pass &lt;code&gt;token = &quot;ngrams&quot;&lt;/code&gt; and &lt;code&gt;n = 2&lt;/code&gt; (the number of consecutive words) as arguments to &lt;code&gt;unnest_tokens()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(janeaustenr)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;austen_bigrams &amp;lt;- austen_books() %&amp;gt;% 
  unnest_tokens(input = text, output = &quot;bigram&quot;, token = &quot;ngrams&quot;, n = 2)

austen_bigrams&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 675,025 x 2
##    book                bigram         
##    &amp;lt;fct&amp;gt;               &amp;lt;chr&amp;gt;          
##  1 Sense &amp;amp; Sensibility sense and      
##  2 Sense &amp;amp; Sensibility and sensibility
##  3 Sense &amp;amp; Sensibility &amp;lt;NA&amp;gt;           
##  4 Sense &amp;amp; Sensibility by jane        
##  5 Sense &amp;amp; Sensibility jane austen    
##  6 Sense &amp;amp; Sensibility &amp;lt;NA&amp;gt;           
##  7 Sense &amp;amp; Sensibility &amp;lt;NA&amp;gt;           
##  8 Sense &amp;amp; Sensibility &amp;lt;NA&amp;gt;           
##  9 Sense &amp;amp; Sensibility &amp;lt;NA&amp;gt;           
## 10 Sense &amp;amp; Sensibility &amp;lt;NA&amp;gt;           
## # &amp;hellip; with 675,015 more rows&lt;/code&gt;&lt;/pre&gt;
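&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To make the overlap between consecutive n-grams concrete, here is a minimal sketch on a made-up one-line data frame. The names &lt;code&gt;toy_text&lt;/code&gt;, &lt;code&gt;toy_bigrams&lt;/code&gt;, and &lt;code&gt;toy_trigrams&lt;/code&gt; are assumptions for illustration and are not part of the Austen data.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(dplyr)
library(tidytext)

# hypothetical toy input: a single line with six words
toy_text &amp;lt;- tibble(line = 1, text = &quot;It is a truth universally acknowledged&quot;)

# bigrams: a sliding window of two words, so six words give five bigrams
toy_bigrams &amp;lt;- toy_text %&amp;gt;% 
  unnest_tokens(input = text, output = &quot;bigram&quot;, token = &quot;ngrams&quot;, n = 2)

# trigrams: the same call with n = 3, so six words give four trigrams
toy_trigrams &amp;lt;- toy_text %&amp;gt;% 
  unnest_tokens(input = text, output = &quot;trigram&quot;, token = &quot;ngrams&quot;, n = 3)

toy_bigrams$bigram
# &quot;it is&quot; &quot;is a&quot; &quot;a truth&quot; &quot;truth universally&quot; &quot;universally acknowledged&quot;
toy_trigrams$trigram&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Every word except the first and last appears in two bigrams, which is why the bigram table has roughly as many rows as the text has words.&lt;/li&gt;
&lt;/ul&gt;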
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size26&quot;&gt;4. 1. 1. Counting and filtering n-grams&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As with single words, we can check bigram frequencies with the &lt;code&gt;count()&lt;/code&gt; function.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;austen_bigrams %&amp;gt;% 
  count(bigram, sort = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;shell&quot;&gt;&lt;code&gt;## # A tibble: 193,210 x 2
##    bigram      n
##    &amp;lt;chr&amp;gt;   &amp;lt;int&amp;gt;
##  1 &amp;lt;NA&amp;gt;    12242
##  2 of the   2853
##  3 to be    2670
##  4 in the   2221
##  5 it was   1691
##  6 i am     1485
##  7 she had  1405
##  8 of her   1363
##  9 to the   1315
## 10 she was  1309
## # &amp;hellip; with 193,200 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The &lt;code&gt;separate()&lt;/code&gt; function splits one column into several columns at a delimiter.&lt;/li&gt;
&lt;li&gt;With it we can split the bigram column above into two columns.&lt;/li&gt;
&lt;li&gt;Because each bigram in &lt;code&gt;austen_bigrams&lt;/code&gt; joins its two words with a single space, the delimiter is &lt;code&gt;&quot; &quot;&lt;/code&gt; (one space).&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;bigrams_separated &amp;lt;- austen_bigrams %&amp;gt;% 
  separate(
    col = bigram, 
    into = c(&quot;word1&quot;, &quot;word2&quot;),
    sep = &quot; &quot;
  )

bigrams_separated&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 675,025 x 3
##    book                word1 word2      
##    &amp;lt;fct&amp;gt;               &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;      
##  1 Sense &amp;amp; Sensibility sense and        
##  2 Sense &amp;amp; Sensibility and   sensibility
##  3 Sense &amp;amp; Sensibility &amp;lt;NA&amp;gt;  &amp;lt;NA&amp;gt;       
##  4 Sense &amp;amp; Sensibility by    jane       
##  5 Sense &amp;amp; Sensibility jane  austen     
##  6 Sense &amp;amp; Sensibility &amp;lt;NA&amp;gt;  &amp;lt;NA&amp;gt;       
##  7 Sense &amp;amp; Sensibility &amp;lt;NA&amp;gt;  &amp;lt;NA&amp;gt;       
##  8 Sense &amp;amp; Sensibility &amp;lt;NA&amp;gt;  &amp;lt;NA&amp;gt;       
##  9 Sense &amp;amp; Sensibility &amp;lt;NA&amp;gt;  &amp;lt;NA&amp;gt;       
## 10 Sense &amp;amp; Sensibility &amp;lt;NA&amp;gt;  &amp;lt;NA&amp;gt;       
## # &amp;hellip; with 675,015 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;We use &lt;code&gt;stop_words&lt;/code&gt; to remove stop words and then check the frequencies.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;data(&quot;stop_words&quot;)

bigrams_filtered &amp;lt;- bigrams_separated %&amp;gt;% 
  filter(!word1 %in% stop_words$word) %&amp;gt;% 
  filter(!word2 %in% stop_words$word)

bigram_counts &amp;lt;- bigrams_filtered %&amp;gt;% 
  count(word1, word2, sort = TRUE)

bigram_counts&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 28,975 x 3
##    word1   word2         n
##    &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;     &amp;lt;int&amp;gt;
##  1 &amp;lt;NA&amp;gt;    &amp;lt;NA&amp;gt;      12242
##  2 sir     thomas      266
##  3 miss    crawford    196
##  4 captain wentworth   143
##  5 miss    woodhouse   143
##  6 frank   churchill   114
##  7 lady    russell     110
##  8 sir     walter      108
##  9 lady    bertram     101
## 10 miss    fairfax      98
## # &amp;hellip; with 28,965 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
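&lt;li&gt;Apart from the &lt;code&gt;NA&lt;/code&gt; rows, the most frequent pairs in Jane Austen's books are names: a title or first name followed by a surname.&lt;/li&gt;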
&lt;/ul&gt;
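&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The &lt;code&gt;NA&lt;/code&gt; / &lt;code&gt;NA&lt;/code&gt; row at the top of &lt;code&gt;bigram_counts&lt;/code&gt; survives the stop-word filter because &lt;code&gt;NA %in% x&lt;/code&gt; evaluates to &lt;code&gt;FALSE&lt;/code&gt;. The following is a minimal sketch of dropping such rows explicitly; &lt;code&gt;toy_separated&lt;/code&gt; and &lt;code&gt;toy_stop&lt;/code&gt; are made-up stand-ins for &lt;code&gt;bigrams_separated&lt;/code&gt; and &lt;code&gt;stop_words$word&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(dplyr)

# hypothetical separated bigrams, including one all-NA row
toy_separated &amp;lt;- tibble(
  word1 = c(&quot;sir&quot;, NA, &quot;of&quot;, &quot;miss&quot;),
  word2 = c(&quot;thomas&quot;, NA, &quot;the&quot;, &quot;crawford&quot;)
)
toy_stop &amp;lt;- c(&quot;of&quot;, &quot;the&quot;)  # stand-in for stop_words$word

# NA %in% toy_stop is FALSE, so the NA row passes the stop-word filter
kept_with_na &amp;lt;- toy_separated %&amp;gt;% 
  filter(!word1 %in% toy_stop, !word2 %in% toy_stop)

# drop the missing pairs explicitly
kept_clean &amp;lt;- kept_with_na %&amp;gt;% 
  filter(!is.na(word1), !is.na(word2))

nrow(kept_with_na)  # 3 rows: the NA pair is still there
nrow(kept_clean)    # 2 rows: only real word pairs remain&lt;/code&gt;&lt;/pre&gt;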
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In other analyses we may want to work with the recombined words.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;unite()&lt;/code&gt; function is the inverse of &lt;code&gt;separate()&lt;/code&gt;: it combines several columns back into one.&lt;/li&gt;
&lt;li&gt;Together, &lt;code&gt;separate()&lt;/code&gt;, &lt;code&gt;filter()&lt;/code&gt;, &lt;code&gt;count()&lt;/code&gt;, and &lt;code&gt;unite()&lt;/code&gt; let us find the most common two-word pairs that contain no stop words.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;bigrams_united &amp;lt;- bigrams_filtered %&amp;gt;% 
  unite(
    col = &quot;bigram&quot;,
    word1, word2,
    sep = &quot; &quot;
  )

bigrams_united&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 51,155 x 2
##    book                bigram     
##    &amp;lt;fct&amp;gt;               &amp;lt;chr&amp;gt;      
##  1 Sense &amp;amp; Sensibility NA NA      
##  2 Sense &amp;amp; Sensibility jane austen
##  3 Sense &amp;amp; Sensibility NA NA      
##  4 Sense &amp;amp; Sensibility NA NA      
##  5 Sense &amp;amp; Sensibility NA NA      
##  6 Sense &amp;amp; Sensibility NA NA      
##  7 Sense &amp;amp; Sensibility NA NA      
##  8 Sense &amp;amp; Sensibility NA NA      
##  9 Sense &amp;amp; Sensibility chapter 1  
## 10 Sense &amp;amp; Sensibility NA NA      
## # &amp;hellip; with 51,145 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size26&quot;&gt;4. 1. 2. Analyzing bigrams&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Suppose we are interested in the word &amp;ldquo;street&amp;rdquo; as mentioned in each book.&lt;/li&gt;
&lt;li&gt;To explore, from an EDA point of view, which words most often come before &amp;ldquo;street&amp;rdquo;, we can run the following.&lt;/li&gt;
&lt;/ul&gt;
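&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# count the words that precede &quot;street&quot; in each book, most frequent first
# (the output shown below was printed with the argument misspelled as `srot`,
#  which count() treats as an extra column, so it has a `srot` column and is unsorted)
bigrams_filtered %&amp;gt;% 
  filter(word2 == &quot;street&quot;) %&amp;gt;% 
  count(book, word1, sort = TRUE)&lt;/code&gt;&lt;/pre&gt;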
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 33 x 4
##    book                word1       srot      n
##    &amp;lt;fct&amp;gt;               &amp;lt;chr&amp;gt;       &amp;lt;lgl&amp;gt; &amp;lt;int&amp;gt;
##  1 Sense &amp;amp; Sensibility berkeley    TRUE     15
##  2 Sense &amp;amp; Sensibility bond        TRUE      4
##  3 Sense &amp;amp; Sensibility conduit     TRUE      4
##  4 Sense &amp;amp; Sensibility harley      TRUE     16
##  5 Sense &amp;amp; Sensibility james       TRUE      1
##  6 Sense &amp;amp; Sensibility park        TRUE      1
##  7 Sense &amp;amp; Sensibility sackville   TRUE      1
##  8 Pride &amp;amp; Prejudice   edward      TRUE      1
##  9 Pride &amp;amp; Prejudice   gracechurch TRUE      8
## 10 Pride &amp;amp; Prejudice   grosvenor   TRUE      2
## # &amp;hellip; with 23 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A bigram can be treated as a term in a document just like a single word, so TF-IDF can be computed for n-grams as well (here each book is one document).&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;bigram_tf_idf &amp;lt;- bigrams_united %&amp;gt;% 
  count(book, bigram) %&amp;gt;% 
  bind_tf_idf(
    term = bigram,
    document = book,
    n = n
  ) %&amp;gt;% 
  arrange(desc(tf_idf))

bigram_tf_idf %&amp;gt;% 
  group_by(book) %&amp;gt;% 
  slice_max(tf_idf, n = 10) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  ggplot(aes(x = reorder(bigram, tf_idf), y = tf_idf, fill = book)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ book, ncol = 2, scales = &quot;free&quot;) +
  coord_flip() +
  labs(x = &quot;bi-gram&quot;, y = NULL)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;677&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.17.27.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/mPxjL/btq9PUcNpBZ/JJyl7pNPJZbrx2KBnAk0U0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/mPxjL/btq9PUcNpBZ/JJyl7pNPJZbrx2KBnAk0U0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/mPxjL/btq9PUcNpBZ/JJyl7pNPJZbrx2KBnAk0U0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FmPxjL%2Fbtq9PUcNpBZ%2FJJyl7pNPJZbrx2KBnAk0U0%2Fimg.png&quot; data-origin-width=&quot;677&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.17.27.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Compared with individual words, TF-IDF on n-grams can capture structure that is not visible when counting single words, and it can make the tokens easier to interpret.&lt;/li&gt;
&lt;/ul&gt;
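&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;As a rough check of what &lt;code&gt;bind_tf_idf()&lt;/code&gt; computes, the sketch below reproduces it by hand on made-up counts: tf is a term's count divided by the document total, and idf is the natural log of the number of documents divided by the number of documents containing the term. The data frame &lt;code&gt;toy_counts&lt;/code&gt; and its values are assumptions for illustration only.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(dplyr)
library(tidytext)

# hypothetical bigram counts for two documents, one row per (book, bigram)
toy_counts &amp;lt;- tibble(
  book   = c(&quot;A&quot;, &quot;A&quot;, &quot;B&quot;, &quot;B&quot;),
  bigram = c(&quot;sir thomas&quot;, &quot;miss crawford&quot;, &quot;miss crawford&quot;, &quot;captain wentworth&quot;),
  n      = c(3, 1, 2, 2)
)

# tidytext's result
by_function &amp;lt;- toy_counts %&amp;gt;% 
  bind_tf_idf(term = bigram, document = book, n = n)

# the same quantities computed step by step
n_docs &amp;lt;- n_distinct(toy_counts$book)
by_hand &amp;lt;- toy_counts %&amp;gt;% 
  group_by(book) %&amp;gt;% 
  mutate(tf = n / sum(n)) %&amp;gt;%                      # share of the document's total
  group_by(bigram) %&amp;gt;% 
  mutate(idf = log(n_docs / n_distinct(book))) %&amp;gt;% # natural log
  ungroup() %&amp;gt;% 
  mutate(tf_idf = tf * idf)

all.equal(by_function$tf_idf, by_hand$tf_idf)&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;A bigram that appears in every document, such as &amp;ldquo;miss crawford&amp;rdquo; in this toy data, gets an idf of 0 and therefore a TF-IDF of 0.&lt;/li&gt;
&lt;/ul&gt;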
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size26&quot;&gt;4. 1. 3. Using bigrams to provide context in sentiment analysis&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In Chapter 2 we simply counted the frequencies of positive or negative words using a lexicon.&lt;/li&gt;
&lt;li&gt;Now that we have n-grams, we can also see how often a word is preceded by a word such as &amp;ldquo;not&amp;rdquo;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;bigrams_separated %&amp;gt;% 
  filter(word1 == &quot;not&quot;) %&amp;gt;% 
  count(word1, word2, sort = TRUE)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;shell&quot;&gt;&lt;code&gt;## # A tibble: 1,178 x 3
##    word1 word2     n
##    &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt; &amp;lt;int&amp;gt;
##  1 not   be      580
##  2 not   to      335
##  3 not   have    307
##  4 not   know    237
##  5 not   a       184
##  6 not   think   162
##  7 not   been    151
##  8 not   the     135
##  9 not   at      126
## 10 not   in      110
## # &amp;hellip; with 1,168 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Using this, we assign a numeric sentiment score to each following word with the AFINN lexicon.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;not_words &amp;lt;- bigrams_separated %&amp;gt;% 
  filter(word1 == &quot;not&quot;) %&amp;gt;% 
  inner_join(get_sentiments(&quot;afinn&quot;), by = c(word2 = &quot;word&quot;)) %&amp;gt;% 
  count(word2, value, sort = TRUE)

not_words&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;angelscript&quot;&gt;&lt;code&gt;## # A tibble: 229 x 3
##    word2   value     n
##    &amp;lt;chr&amp;gt;   &amp;lt;dbl&amp;gt; &amp;lt;int&amp;gt;
##  1 like        2    95
##  2 help        2    77
##  3 want        1    41
##  4 wish        1    39
##  5 allow       1    30
##  6 care        2    21
##  7 sorry      -1    20
##  8 leave      -1    17
##  9 pretend    -1    17
## 10 worth       2    17
## # &amp;hellip; with 219 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
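&lt;li&gt;In the result above, the most common sentiment-associated word following &amp;ldquo;not&amp;rdquo; is &amp;ldquo;like&amp;rdquo;, which has a score of 2.&lt;/li&gt;
&lt;li&gt;In the same way we can check which words preceded by &amp;ldquo;not&amp;rdquo; contributed most to the sentiment score, and in which direction.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This is measured by multiplying each word's value by its frequency.&lt;/li&gt;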
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;not_words %&amp;gt;% 
  mutate(contribution = n*value) %&amp;gt;% 
  arrange(desc(abs(contribution))) %&amp;gt;% 
  head(20) %&amp;gt;% 
  ggplot(aes(x = reorder(word2, contribution), y = contribution, fill = contribution &amp;gt; 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  labs(y = &quot;Sentiment value * number of occurrences&quot;, x = &quot;Words preceded by \&quot;not\&quot;&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.17.41.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/DNlJx/btq9JNZ8Chd/7wkNnI9QhpN6KknUETGJQ1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/DNlJx/btq9JNZ8Chd/7wkNnI9QhpN6KknUETGJQ1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/DNlJx/btq9JNZ8Chd/7wkNnI9QhpN6KknUETGJQ1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FDNlJx%2Fbtq9JNZ8Chd%2F7wkNnI9QhpN6KknUETGJQ1%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.17.41.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size26&quot;&gt;4. 1. 4. Visualizing a network of bigrams with &lt;code&gt;ggraph&lt;/code&gt;&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Beyond showing only the top few pairs for one word at a time, we may want to visualize all of the relationships among words simultaneously.&lt;/li&gt;
&lt;li&gt;We can arrange them as a network graph, a format that shows combinations of connected nodes.&lt;/li&gt;
&lt;li&gt;For this we use the &lt;code&gt;igraph&lt;/code&gt; library, specifically &lt;code&gt;graph_from_data_frame()&lt;/code&gt;, which creates an igraph object from tidy data.&lt;/li&gt;
&lt;li&gt;For the visualization we use the &lt;code&gt;ggraph&lt;/code&gt; library.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The &lt;code&gt;igraph&lt;/code&gt; library has its own plotting functions, but we use &lt;code&gt;ggraph&lt;/code&gt;, whose syntax will be familiar from &lt;code&gt;ggplot2&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(igraph)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## 
## Attaching package: 'igraph'&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## The following objects are masked from 'package:purrr':
## 
##     compose, simplify&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## The following object is masked from 'package:tidyr':
## 
##     crossing&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## The following object is masked from 'package:tibble':
## 
##     as_data_frame&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## The following object is masked from 'package:base':
## 
##     union&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(ggraph)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;bigram_counts&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 28,975 x 3
##    word1   word2         n
##    &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;     &amp;lt;int&amp;gt;
##  1 &amp;lt;NA&amp;gt;    &amp;lt;NA&amp;gt;      12242
##  2 sir     thomas      266
##  3 miss    crawford    196
##  4 captain wentworth   143
##  5 miss    woodhouse   143
##  6 frank   churchill   114
##  7 lady    russell     110
##  8 sir     walter      108
##  9 lady    bertram     101
## 10 miss    fairfax      98
## # &amp;hellip; with 28,965 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;bigram_graph &amp;lt;- bigram_counts %&amp;gt;% 
  filter(n &amp;gt; 20) %&amp;gt;% # keep word pairs that occur more than 20 times
  graph_from_data_frame()&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Warning in graph_from_data_frame(.): In `d' `NA' elements were replaced with
## string &quot;NA&quot;&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;bigram_graph&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## IGRAPH 68c3987 DN-- 86 71 -- 
## + attr: name (v/c), n (e/n)
## + edges from 68c3987 (vertex names):
##  [1] NA      -&amp;gt;NA         sir     -&amp;gt;thomas     miss    -&amp;gt;crawford  
##  [4] captain -&amp;gt;wentworth  miss    -&amp;gt;woodhouse  frank   -&amp;gt;churchill 
##  [7] lady    -&amp;gt;russell    sir     -&amp;gt;walter     lady    -&amp;gt;bertram   
## [10] miss    -&amp;gt;fairfax    colonel -&amp;gt;brandon    sir     -&amp;gt;john      
## [13] miss    -&amp;gt;bates      jane    -&amp;gt;fairfax    lady    -&amp;gt;catherine 
## [16] lady    -&amp;gt;middleton  miss    -&amp;gt;tilney     miss    -&amp;gt;bingley   
## [19] thousand-&amp;gt;pounds     miss    -&amp;gt;dashwood   dear    -&amp;gt;miss      
## [22] miss    -&amp;gt;bennet     miss    -&amp;gt;morland    captain -&amp;gt;benwick   
## + ... omitted several edges&lt;/code&gt;&lt;/pre&gt;
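&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To see how &lt;code&gt;graph_from_data_frame()&lt;/code&gt; reads its input, here is a minimal sketch on three rows taken from the counts above. The first two columns are treated as the &amp;ldquo;from&amp;rdquo; and &amp;ldquo;to&amp;rdquo; nodes, and any remaining column (here &lt;code&gt;n&lt;/code&gt;) becomes an edge attribute. The names &lt;code&gt;toy_edges&lt;/code&gt; and &lt;code&gt;toy_graph&lt;/code&gt; are assumptions for illustration.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(igraph)

# hypothetical edge list: word1 -&amp;gt; word2, weighted by bigram count
toy_edges &amp;lt;- data.frame(
  word1 = c(&quot;sir&quot;, &quot;sir&quot;, &quot;miss&quot;),
  word2 = c(&quot;thomas&quot;, &quot;walter&quot;, &quot;crawford&quot;),
  n     = c(266, 108, 196)
)

toy_graph &amp;lt;- graph_from_data_frame(toy_edges)

vcount(toy_graph)       # number of distinct words (nodes)
ecount(toy_graph)       # number of bigrams (edges)
V(toy_graph)$name       # node names come from the first two columns
E(toy_graph)$n          # the count is stored as an edge attribute
is_directed(toy_graph)  # edges point from word1 to word2&lt;/code&gt;&lt;/pre&gt;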
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Passing the &lt;code&gt;bigram_graph&lt;/code&gt; object to the &lt;code&gt;ggraph()&lt;/code&gt; function converts the &lt;code&gt;igraph&lt;/code&gt; object into a &lt;code&gt;ggraph&lt;/code&gt; object.&lt;/li&gt;
&lt;li&gt;We then add layers, just as we would in &lt;code&gt;ggplot2&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;At a minimum we need three layers: nodes, edges, and text.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;set.seed(2021)

ggraph(bigram_graph, layout = &quot;fr&quot;) +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1)&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.17.53.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/cJsDrs/btq9NX1J6fU/7dP1knHFJLFLzy32GCKwW1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/cJsDrs/btq9NX1J6fU/7dP1knHFJLFLzy32GCKwW1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/cJsDrs/btq9NX1J6fU/7dP1knHFJLFLzy32GCKwW1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FcJsDrs%2Fbtq9NX1J6fU%2F7dP1knHFJLFLzy32GCKwW1%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.17.53.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In the graph above, words such as &amp;ldquo;miss&amp;rdquo;, &amp;ldquo;lady&amp;rdquo;, and &amp;ldquo;sir&amp;rdquo; form common centers of nodes and are often followed by names.&lt;/li&gt;
&lt;li&gt;To make the graph look better, we finish with a few polishing steps as shown below.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;set.seed(2021) # fix the seed so the random &quot;fr&quot; layout is reproducible

ggraph(bigram_graph, layout = &quot;fr&quot;) +
  geom_edge_link(
    aes(edge_alpha = n), 
    arrow = grid::arrow(type = &quot;closed&quot;, length = unit(.15, &quot;inches&quot;)),
    end_cap = circle(.07, &quot;inches&quot;),
    show.legend = FALSE
  ) +
  geom_node_point(color = &quot;lightblue&quot;, size = 5) +
  geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  theme_void()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.18.00.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/2gNoj/btq9PS69lTH/fDXhfoRqYPdXXryT9W64ek/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/2gNoj/btq9PS69lTH/fDXhfoRqYPdXXryT9W64ek/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/2gNoj/btq9PS69lTH/fDXhfoRqYPdXXryT9W64ek/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2F2gNoj%2Fbtq9PS69lTH%2FfDXhfoRqYPdXXryT9W64ek%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.18.00.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h3 data-ke-size=&quot;size26&quot;&gt;4. 2. Counting and correlating pairs of words with the &lt;code&gt;widyr&lt;/code&gt; package&lt;/h3&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
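&lt;li&gt;As seen above, tokenizing by n-gram is a useful way to explore pairs of adjacent words.&lt;/li&gt;
&lt;li&gt;However, we may also be interested in words that tend to co-occur within a particular document or chapter even when they do not appear next to each other.&lt;/li&gt;
&lt;li&gt;Tidy data is a useful structure for comparing or grouping variables, but comparisons between rows can be somewhat difficult.&lt;/li&gt;
&lt;li&gt;For example, to count how many times two words appear in the same document, or to see how correlated they are, the data must first be converted to a wide format (a matrix).&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;widyr&lt;/code&gt; library helps here by widening the data, performing the operation, and returning the result in tidy form.&lt;/li&gt;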
&lt;/ul&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size26&quot;&gt;4. 2. 1. Counting and correlating among sections&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Below we tokenize the text of &amp;ldquo;Pride &amp;amp; Prejudice&amp;rdquo; into words.&lt;/li&gt;
&lt;li&gt;We divide the text into sections of 10 lines each.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;austen_section_words &amp;lt;- austen_books() %&amp;gt;% 
  filter(book == &quot;Pride &amp;amp; Prejudice&quot;) %&amp;gt;% 
  mutate(section = row_number() %/% 10) %&amp;gt;% 
  filter(section &amp;gt; 0) %&amp;gt;% 
  unnest_tokens(
    input = text,
    output = &quot;word&quot;,
    token = &quot;words&quot;
  ) %&amp;gt;% 
  filter(!word %in% stop_words$word)

austen_section_words&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 37,240 x 3
##    book              section word        
##    &amp;lt;fct&amp;gt;               &amp;lt;dbl&amp;gt; &amp;lt;chr&amp;gt;       
##  1 Pride &amp;amp; Prejudice       1 truth       
##  2 Pride &amp;amp; Prejudice       1 universally 
##  3 Pride &amp;amp; Prejudice       1 acknowledged
##  4 Pride &amp;amp; Prejudice       1 single      
##  5 Pride &amp;amp; Prejudice       1 possession  
##  6 Pride &amp;amp; Prejudice       1 fortune     
##  7 Pride &amp;amp; Prejudice       1 wife        
##  8 Pride &amp;amp; Prejudice       1 feelings    
##  9 Pride &amp;amp; Prejudice       1 views       
## 10 Pride &amp;amp; Prejudice       1 entering    
## # &amp;hellip; with 37,230 more rows&lt;/code&gt;&lt;/pre&gt;
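&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The sectioning relies on integer division with &lt;code&gt;%/%&lt;/code&gt;. A minimal base-R sketch on 25 made-up row numbers shows which rows land in which section, and which rows &lt;code&gt;filter(section &amp;gt; 0)&lt;/code&gt; keeps; &lt;code&gt;row_id&lt;/code&gt; is a hypothetical stand-in for &lt;code&gt;row_number()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# integer division by 10 maps row numbers to section ids
row_id  &amp;lt;- 1:25
section &amp;lt;- row_id %/% 10

# rows 1-9 fall in section 0, rows 10-19 in section 1, rows 20-25 in section 2
table(section)

# rows kept by filter(section &amp;gt; 0): everything from row 10 onward
row_id[section &amp;gt; 0]&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Section 0 holds only the first nine lines, and it is the section that the code above drops.&lt;/li&gt;
&lt;/ul&gt;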
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;One useful function in &lt;code&gt;widyr&lt;/code&gt; is &lt;code&gt;pairwise_count()&lt;/code&gt;.
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The &lt;code&gt;pairwise_&lt;/code&gt; prefix means the result has one row for each pair of words in the &lt;code&gt;word&lt;/code&gt; variable.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;item&lt;/code&gt; argument takes the words, and the &lt;code&gt;feature&lt;/code&gt; argument takes the sections.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(widyr)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;word_pairs &amp;lt;- austen_section_words %&amp;gt;% 
  pairwise_count(
    item = word,
    feature = section,
    sort = TRUE
  )&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## Warning: `distinct_()` was deprecated in dplyr 0.7.0.
## Please use `distinct()` instead.
## See vignette('programming') for more help&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;autohotkey&quot;&gt;&lt;code&gt;## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;word_pairs&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 796,008 x 3
##    item1     item2         n
##    &amp;lt;chr&amp;gt;     &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt;
##  1 darcy     elizabeth   144
##  2 elizabeth darcy       144
##  3 miss      elizabeth   110
##  4 elizabeth miss        110
##  5 elizabeth jane        106
##  6 jane      elizabeth   106
##  7 miss      darcy        92
##  8 darcy     miss         92
##  9 elizabeth bingley      91
## 10 bingley   elizabeth    91
## # &amp;hellip; with 795,998 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
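&lt;li&gt;The result has one row per word pair, where &lt;code&gt;n&lt;/code&gt; is the number of sections in which both words appear.&lt;/li&gt;
&lt;li&gt;This makes it easy to find the words that most often appear in the same section as a given word.&lt;/li&gt;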
&lt;/ul&gt;
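&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;To check what &lt;code&gt;n&lt;/code&gt; counts, here is a minimal sketch on a made-up dataset. The &lt;code&gt;toy&lt;/code&gt; tibble and its values are assumptions for illustration and are not part of the Austen data.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;library(dplyr)
library(widyr)

# Hypothetical data: two sections, three distinct words
toy &amp;lt;- tibble(
  section = c(1, 1, 2, 2, 2),
  word = c(&quot;darcy&quot;, &quot;elizabeth&quot;, &quot;darcy&quot;, &quot;elizabeth&quot;, &quot;jane&quot;)
)

# Count, for every ordered pair of words, the sections they share
toy_pairs &amp;lt;- toy %&amp;gt;%
  pairwise_count(item = word, feature = section, sort = TRUE)

toy_pairs&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;In this toy data darcy and elizabeth appear together in both sections, so their pair should get &lt;code&gt;n = 2&lt;/code&gt; in each order, while any pair involving jane should get &lt;code&gt;n = 1&lt;/code&gt;, because jane appears only in section 2.&lt;/li&gt;
&lt;/ul&gt;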
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Darcy, like Elizabeth, is one of the main characters.
word_pairs %&amp;gt;% 
  filter(item1 == &quot;darcy&quot;)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 2,930 x 3
##    item1 item2         n
##    &amp;lt;chr&amp;gt; &amp;lt;chr&amp;gt;     &amp;lt;dbl&amp;gt;
##  1 darcy elizabeth   144
##  2 darcy miss         92
##  3 darcy bingley      86
##  4 darcy jane         46
##  5 darcy bennet       45
##  6 darcy sister       45
##  7 darcy time         41
##  8 darcy lady         38
##  9 darcy friend       37
## 10 darcy wickham      37
## # &amp;hellip; with 2,920 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;p data-ke-size=&quot;size18&quot;&gt;&lt;br /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h4 data-ke-size=&quot;size26&quot;&gt;4. 2. 2. Pairwise correlation&lt;/h4&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Using the table below, the phi coefficient can be defined as follows.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;1386&quot; data-origin-height=&quot;318&quot; data-filename=&quot;스크린샷 2021-07-18 오후 6.49.29.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bhhHHR/btq9R6jwauB/R3OzLcGGhZdkOeB9EYD8xk/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bhhHHR/btq9R6jwauB/R3OzLcGGhZdkOeB9EYD8xk/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bhhHHR/btq9R6jwauB/R3OzLcGGhZdkOeB9EYD8xk/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbhhHHR%2Fbtq9R6jwauB%2FR3OzLcGGhZdkOeB9EYD8xk%2Fimg.png&quot; data-origin-width=&quot;1386&quot; data-origin-height=&quot;318&quot; data-filename=&quot;스크린샷 2021-07-18 오후 6.49.29.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;1412&quot; data-origin-height=&quot;124&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.06.34.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bUgJJb/btq9JO5M5th/CTcdh7z6R8CTnJ1ydPE0Q1/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bUgJJb/btq9JO5M5th/CTcdh7z6R8CTnJ1ydPE0Q1/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bUgJJb/btq9JO5M5th/CTcdh7z6R8CTnJ1ydPE0Q1/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbUgJJb%2Fbtq9JO5M5th%2FCTcdh7z6R8CTnJ1ydPE0Q1%2Fimg.png&quot; data-origin-width=&quot;1412&quot; data-origin-height=&quot;124&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.06.34.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;p data-ke-size=&quot;size16&quot;&gt;&amp;nbsp;&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;This is equivalent to the Pearson correlation coefficient computed on binary data.&lt;/li&gt;
&lt;li&gt;To compute the correlation between pairs of words, use the &lt;code&gt;pairwise_cor()&lt;/code&gt; function.&lt;/li&gt;
&lt;/ul&gt;
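&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;The equivalence with the Pearson correlation can be checked by hand. The sketch below uses only base R and two made-up indicator vectors, &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; (assumptions for illustration), where 1 means the word appears in a section and 0 means it does not.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;# Hypothetical presence/absence of two words across eight sections
x &amp;lt;- c(1, 1, 1, 0, 0, 0, 1, 0)
y &amp;lt;- c(1, 1, 0, 0, 0, 1, 1, 0)

# Cells of the 2 x 2 table
n11 &amp;lt;- sum(x * y)              # both words present
n10 &amp;lt;- sum(x * (1 - y))        # only x present
n01 &amp;lt;- sum((1 - x) * y)        # only y present
n00 &amp;lt;- sum((1 - x) * (1 - y))  # neither present

# Phi coefficient: cross-product difference over the root of the marginal totals
phi &amp;lt;- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

phi
cor(x, y)
all.equal(phi, cor(x, y))&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;For these vectors the four cells are 3, 1, 1 and 3, so the phi coefficient is (9 - 1) / 16 = 0.5, which matches &lt;code&gt;cor(x, y)&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;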
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;word_corr &amp;lt;- austen_section_words %&amp;gt;% 
  group_by(word) %&amp;gt;% 
  filter(n() &amp;gt; 20) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  pairwise_cor(
    item = word,
    feature = section,
    sort = TRUE
  )

word_corr&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&quot;clean&quot;&gt;&lt;code&gt;## # A tibble: 140,250 x 3
##    item1     item2     correlation
##    &amp;lt;chr&amp;gt;     &amp;lt;chr&amp;gt;           &amp;lt;dbl&amp;gt;
##  1 bourgh    de              0.951
##  2 de        bourgh          0.951
##  3 pounds    thousand        0.701
##  4 thousand  pounds          0.701
##  5 william   sir             0.664
##  6 sir       william         0.664
##  7 catherine lady            0.663
##  8 lady      catherine       0.663
##  9 forster   colonel         0.622
## 10 colonel   forster         0.622
## # &amp;hellip; with 140,240 more rows&lt;/code&gt;&lt;/pre&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;By filtering on particular words of interest, we can find the other words most strongly correlated with them.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;word_corr %&amp;gt;% 
  filter(item1 %in% c(&quot;elizabeth&quot;, &quot;pounds&quot;, &quot;married&quot;, &quot;pride&quot;)) %&amp;gt;% 
  group_by(item1) %&amp;gt;% 
  slice_max(correlation, n = 6) %&amp;gt;% 
  ungroup() %&amp;gt;% 
  ggplot(aes(x = reorder(item2, correlation), y = correlation)) +
  geom_bar(stat = &quot;identity&quot;, colour = &quot;black&quot;, fill = &quot;grey85&quot;) +
  facet_wrap(~ item1, scales = &quot;free&quot;) +
  coord_flip()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.18.09.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/OYBfW/btq9JOLriZK/gVAouiPF5wnkHZdzV2rcI0/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/OYBfW/btq9JOLriZK/gVAouiPF5wnkHZdzV2rcI0/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/OYBfW/btq9JOLriZK/gVAouiPF5wnkHZdzV2rcI0/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FOYBfW%2Fbtq9JOLriZK%2FgVAouiPF5wnkHZdzV2rcI0%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.18.09.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;
&lt;ul style=&quot;list-style-type: disc;&quot; data-ke-list-type=&quot;disc&quot;&gt;
&lt;li&gt;Likewise, the &lt;code&gt;ggraph&lt;/code&gt; library can be used to draw a network graph based on the correlations.&lt;/li&gt;
&lt;/ul&gt;
&lt;pre class=&quot;language-r&quot;&gt;&lt;code&gt;word_corr %&amp;gt;% 
  filter(correlation &amp;gt; .15) %&amp;gt;% # keep only word pairs whose correlation exceeds 0.15
  graph_from_data_frame() %&amp;gt;% 
  ggraph(layout = &quot;fr&quot;) +
  geom_edge_link(aes(edge_alpha = correlation), show.legend = FALSE) +
  geom_node_point(color = &quot;lightblue&quot;, size = 5) +
  geom_node_text(aes(label = name), repel = TRUE) + # repel = TRUE keeps the labels from overlapping
  theme_void()&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;figure class=&quot;imageblock alignCenter&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.18.17.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot;&gt;&lt;span data-url=&quot;https://blog.kakaocdn.net/dn/bsEdyw/btq9O8h4gwE/2ckEnof1KbmIPjwJyqz0gK/img.png&quot; data-phocus=&quot;https://blog.kakaocdn.net/dn/bsEdyw/btq9O8h4gwE/2ckEnof1KbmIPjwJyqz0gK/img.png&quot;&gt;&lt;img src=&quot;https://blog.kakaocdn.net/dn/bsEdyw/btq9O8h4gwE/2ckEnof1KbmIPjwJyqz0gK/img.png&quot; srcset=&quot;https://img1.daumcdn.net/thumb/R1280x0/?scode=mtistory2&amp;fname=https%3A%2F%2Fblog.kakaocdn.net%2Fdn%2FbsEdyw%2Fbtq9O8h4gwE%2F2ckEnof1KbmIPjwJyqz0gK%2Fimg.png&quot; data-origin-width=&quot;673&quot; data-origin-height=&quot;478&quot; data-filename=&quot;스크린샷 2021-07-18 오후 7.18.17.png&quot; data-ke-mobilestyle=&quot;widthOrigin&quot; onerror=&quot;this.onerror=null; this.src='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png'; this.srcset='//t1.daumcdn.net/tistory_admin/static/images/no-image-v1.png';&quot;/&gt;&lt;/span&gt;&lt;/figure&gt;
&lt;/p&gt;</description>
      <category>tidytext</category>
      <category>bigram</category>
      <category>ggraph</category>
      <category>graph_from_data_frame</category>
      <category>igraph</category>
      <category>N-Gram</category>
      <category>pairwise_cor</category>
      <category>pairwise_count</category>
      <category>tidy data</category>
      <category>widyr</category>
      <author>택2</author>
      <guid isPermaLink="true">https://rstatistics.tistory.com/71</guid>
      <comments>https://rstatistics.tistory.com/71#entry71comment</comments>
      <pubDate>Sun, 18 Jul 2021 19:22:40 +0900</pubDate>
    </item>
  </channel>
</rss>