[R] 5. Converting to and from non-tidy formats

5. Converting to and from non-tidy formats

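This chapter explains how to work with text data held in non-tidy structures, such as the document-term matrices and corpus objects used by the tm and quanteda libraries, rather than in the tidy text format.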

5. 1. Tidying a document-term matrix

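The document-term matrix (DTM) is one of the most common structures in text analysis.
It has the following form.
- Each row represents one document (e.g. a book or an article).
- Each column represents one term.
- Each cell typically holds the number of times that term appears in that document.
Because most document-term pairs never occur (their value is zero), a DTM is usually implemented as a sparse matrix.
DTM objects cannot be used directly with the tidytext library, but a function is provided to convert them into tidy data frames.
- tidy(): converts a DTM into tidy data (the generic comes from the broom library)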
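
As a rough illustration of the structure described above, the following sketch builds a tiny document-term matrix by hand in base R. The toy data frame and its values are made up for illustration and are not part of the Associated Press data used below.

```r
# Toy tidy data: one row per document-term pair (hypothetical values)
toy <- data.frame(
  document = c("doc1", "doc1", "doc2", "doc3"),
  term     = c("apple", "banana", "apple", "cherry"),
  n        = c(2, 1, 1, 3)
)

# Rows are documents, columns are terms, cells are counts
toy_dtm <- xtabs(n ~ document + term, data = toy)
toy_dtm

# Share of cells that are zero, i.e. the sparsity of the matrix
toy_sparsity <- mean(toy_dtm == 0)
toy_sparsity
```

Even in this three-document example more than half of the cells are zero, which is why real DTMs are stored as sparse matrices.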

5. 1. 1. Tidying DocumentTermMatrix objects

The most widely used DTM implementation in R is the DocumentTermMatrix class in the tm library.
As an example, we use the Associated Press news article data that ships with the topicmodels library.

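# Packages used throughout this chapter (attached before tm in this session)
library(dplyr)
library(tidyr)
library(ggplot2)
library(tidytext)
library(tm)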

## Loading required package: NLP

## 
## Attaching package: 'NLP'

## The following object is masked from 'package:ggplot2':
## 
##     annotate

data("AssociatedPress", package = "topicmodels")
AssociatedPress

## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)

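As shown above, this dataset is a DTM object made up of 2,246 news articles and 10,473 terms, and it is a sparse matrix in which 99% of the document-term pairs have a value of 0.
The Terms() function gives access to the terms in the documents.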

terms <- Terms(AssociatedPress)

terms %>% 
  head(20)

##  [1] "aaron"      "abandon"    "abandoned"  "abandoning" "abbott"    
##  [6] "abboud"     "abc"        "abcs"       "abctvs"     "abdomen"   
## [11] "abducted"   "abduction"  "abductors"  "abdul"      "abide"     
## [16] "abilities"  "ability"    "ablaze"     "able"       "abm"

To convert this into the tidy data format, use the tidy() function.

ap_tidy <- AssociatedPress %>% 
  tidy()

ap_tidy

## # A tibble: 302,031 x 3
##    document term       count
##       <int> <chr>      <dbl>
##  1        1 adding         1
##  2        1 adult          2
##  3        1 ago            1
##  4        1 alcohol        1
##  5        1 allegedly      1
##  6        1 allen          1
##  7        1 apparently     2
##  8        1 appeared       1
##  9        1 arrested       1
## 10        1 assault        1
## # … with 302,021 more rows

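The result resembles what the melt() function from the reshape2 library produces for an ordinary (non-sparse) matrix.
In addition, tidy() keeps only the non-zero entries, so document-term pairs whose value is 0 in the sparse matrix do not appear in the output.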
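
The idea of melting a matrix and dropping its zero cells can be sketched in base R. This is only an analogy under made-up values, not how tidy() is implemented internally.

```r
# A small dense count matrix (hypothetical values)
m <- matrix(
  c(2, 1, 0,
    1, 0, 0,
    0, 0, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    document = c("doc1", "doc2", "doc3"),
    term     = c("apple", "banana", "cherry")
  )
)

# Melt to one row per cell, then keep only the non-zero cells
long <- as.data.frame(as.table(m), responseName = "count",
                      stringsAsFactors = FALSE)
long_nonzero <- long[long$count != 0, ]
long_nonzero
```

The melted table has one row for each of the nine cells, while the filtered version keeps only the four document-term pairs that actually occur.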

With this result in hand, we can now run a sentiment analysis.

ap_sentiments <- ap_tidy %>% 
  inner_join(get_sentiments("bing"), by = c(term = "word"))

ap_sentiments

## # A tibble: 30,094 x 4
##    document term    count sentiment
##       <int> <chr>   <dbl> <chr>    
##  1        1 assault     1 negative 
##  2        1 complex     1 negative 
##  3        1 death       1 negative 
##  4        1 died        1 negative 
##  5        1 good        2 positive 
##  6        1 illness     1 negative 
##  7        1 killed      2 negative 
##  8        1 like        2 positive 
##  9        1 liked       1 positive 
## 10        1 miracle     1 positive 
## # … with 30,084 more rows

ap_sentiments %>% 
  count(sentiment, term, wt = count) %>% 
  ungroup() %>% 
  filter(n >= 200) %>% 
  mutate(n = ifelse(sentiment == "negative", -n, n)) %>% 
  ggplot(aes(x = reorder(term, n), y = n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Contribution to sentiment")

The most common positive words include "like", "work", "support", and "good".
The most common negative words include "killed" and "death".
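
The sentiment totals above rely on count() with the wt argument, which sums a column instead of counting rows. A minimal sketch with a hypothetical toy table (the values are invented for illustration):

```r
library(dplyr)

# Hypothetical tidy sentiment rows: one row per document-term pair
toy_sentiments <- tibble(
  document  = c(1, 1, 2, 2),
  term      = c("good", "death", "good", "death"),
  count     = c(2, 1, 3, 1),
  sentiment = c("positive", "negative", "positive", "negative")
)

# wt = count sums the count column instead of counting rows
toy_totals <- toy_sentiments %>%
  count(sentiment, term, wt = count)

toy_totals
```

Without wt, each term would get n = 2 (two rows each); with wt = count, the totals reflect how often each word was actually used.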

5. 1. 2. Tidying `dfm` objects

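The quanteda library offers another implementation of the document-term matrix: the dfm class, created with the dfm() function.
- The names dtm and dfm differ only in the middle letter.
- The f in dfm stands for feature (document-feature matrix).
As an example, we use the data_corpus_inaugural data from the quanteda library, a corpus of presidential inaugural addresses.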

library(quanteda)

## Package version: 3.0.0
## Unicode version: 13.0
## ICU version: 69.1

## Parallel computing: 16 of 16 threads used.

## See https://quanteda.io for tutorials and examples.

## 
## Attaching package: 'quanteda'

## The following object is masked from 'package:tm':
## 
##     stopwords

## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-

data("data_corpus_inaugural", package = "quanteda")

data_corpus_inaugural

## Corpus consisting of 59 documents and 4 docvars.
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
## 
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
## 
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
## 
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
## 
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
## 
## 1809-Madison :
## "Unwilling to depart from examples of the most revered author..."
## 
## [ reached max_ndoc ... 53 more documents ]

inaug_dfm <- data_corpus_inaugural %>% 
  tokens() %>% 
  dfm(verbose = FALSE)

inaug_dfm

## Document-feature matrix of: 59 documents, 9,439 features (91.84% sparse) and 4 docvars.
##                  features
## docs              fellow-citizens  of the senate and house representatives :
##   1789-Washington               1  71 116      1  48     2               2 1
##   1793-Washington               0  11  13      0   2     0               0 1
##   1797-Adams                    3 140 163      1 130     0               2 0
##   1801-Jefferson                2 104 130      0  81     0               0 1
##   1805-Jefferson                0 101 143      0  93     0               0 0
##   1809-Madison                  1  69 104      0  43     0               0 0
##                  features
## docs              among vicissitudes
##   1789-Washington     1            1
##   1793-Washington     0            0
##   1797-Adams          4            0
##   1801-Jefferson      1            0
##   1805-Jefferson      7            0
##   1809-Madison        0            0
## [ reached max_ndoc ... 53 more documents, reached max_nfeat ... 9,429 more features ]

The tidy() function can be applied to this object as well.

inaug_tidy <- inaug_dfm %>% 
  tidy()

inaug_tidy

## # A tibble: 45,453 x 3
##    document        term            count
##    <chr>           <chr>           <dbl>
##  1 1789-Washington fellow-citizens     1
##  2 1797-Adams      fellow-citizens     3
##  3 1801-Jefferson  fellow-citizens     2
##  4 1809-Madison    fellow-citizens     1
##  5 1813-Madison    fellow-citizens     1
##  6 1817-Monroe     fellow-citizens     5
##  7 1821-Monroe     fellow-citizens     1
##  8 1841-Harrison   fellow-citizens    11
##  9 1845-Polk       fellow-citizens     1
## 10 1849-Taylor     fellow-citizens     1
## # … with 45,443 more rows

From here we can compute TF-IDF values with the bind_tf_idf() function and visualize them.

inaug_tf_idf <- inaug_tidy %>% 
  bind_tf_idf(
    term = term,
    document = document,
    n = count
  )

inaug_tf_idf

## # A tibble: 45,453 x 6
##    document        term            count       tf   idf   tf_idf
##    <chr>           <chr>           <dbl>    <dbl> <dbl>    <dbl>
##  1 1789-Washington fellow-citizens     1 0.000651  1.13 0.000737
##  2 1797-Adams      fellow-citizens     3 0.00116   1.13 0.00132 
##  3 1801-Jefferson  fellow-citizens     2 0.00104   1.13 0.00118 
##  4 1809-Madison    fellow-citizens     1 0.000793  1.13 0.000899
##  5 1813-Madison    fellow-citizens     1 0.000768  1.13 0.000870
##  6 1817-Monroe     fellow-citizens     5 0.00136   1.13 0.00154 
##  7 1821-Monroe     fellow-citizens     1 0.000205  1.13 0.000232
##  8 1841-Harrison   fellow-citizens    11 0.00121   1.13 0.00137 
##  9 1845-Polk       fellow-citizens     1 0.000193  1.13 0.000218
## 10 1849-Taylor     fellow-citizens     1 0.000849  1.13 0.000962
## # … with 45,443 more rows
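
To make the three new columns concrete, the sketch below recomputes them by hand on a hypothetical toy table and compares the result with bind_tf_idf(). It assumes tf is the within-document share of a term and idf is the natural log of (number of documents / number of documents containing the term); the toy values are invented.

```r
library(dplyr)
library(tidytext)

# Hypothetical counts for three short documents
toy <- tibble(
  document = c("a", "a", "b", "b", "c"),
  term     = c("x", "y", "x", "z", "x"),
  count    = c(1, 3, 2, 2, 4)
)

n_docs <- n_distinct(toy$document)

# Manual calculation: tf is a within-document share,
# idf is the natural log of (documents / documents containing the term)
manual <- toy %>%
  group_by(document) %>%
  mutate(tf = count / sum(count)) %>%
  group_by(term) %>%
  mutate(idf = log(n_docs / n_distinct(document))) %>%
  ungroup() %>%
  mutate(tf_idf = tf * idf)

via_tidytext <- toy %>%
  bind_tf_idf(term = term, document = document, n = count)

# Should agree if the assumptions above hold
all.equal(manual$tf_idf, via_tidytext$tf_idf)
```

A term such as "x" that appears in every document gets an idf of zero, so its tf-idf is zero no matter how often it is used.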

inaug_tf_idf %>% 
  filter(document %in% c("1861-Lincoln", "1933-Roosevelt", "1961-Kennedy", "2009-Obama")) %>% 
  group_by(document) %>% 
  slice_max(tf_idf, n = 10) %>% 
  ungroup() %>% 
  ggplot(aes(x = reorder(term, tf_idf), y = tf_idf, fill = document)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  facet_wrap(~ document, ncol = 2, scales = "free")

As another visualization example, we can extract the year from each document name and compute the total number of words for each year.

year_term_counts <- inaug_tidy %>% 
  extract(
    col = document,
    into = "year",
    regex = "(\\d+)",
    convert = TRUE
  ) %>% 
  complete(year, term, fill = list(count = 0)) %>% # add explicit zero rows for terms that do not appear in a document
  group_by(year) %>% 
  mutate(total_count = sum(count)) %>% 
  ungroup()

year_term_counts

## # A tibble: 556,901 x 4
##     year term  count total_count
##    <int> <chr> <dbl>       <dbl>
##  1  1789 "-"       1        1537
##  2  1789 ","      70        1537
##  3  1789 ";"       8        1537
##  4  1789 ":"       1        1537
##  5  1789 "!"       0        1537
##  6  1789 "?"       0        1537
##  7  1789 "."      23        1537
##  8  1789 "…"       0        1537
##  9  1789 "'"       0        1537
## 10  1789 "\""      2        1537
## # … with 556,891 more rows
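
The effect of complete() is easier to see on a small table. The sketch below uses hypothetical counts in which one year-term combination is missing.

```r
library(dplyr)
library(tidyr)

# Hypothetical counts: "union" is missing for 1793
toy_counts <- tibble(
  year  = c(1789L, 1789L, 1793L),
  term  = c("god", "union", "god"),
  count = c(1, 2, 3)
)

# Add every year-term combination, filling missing counts with 0
toy_completed <- toy_counts %>%
  complete(year, term, fill = list(count = 0))

toy_completed
```

The same expansion explains the size of the output above: 59 documents times 9,439 features gives 556,901 rows. The added zeros do not change total_count, but they let a word's frequency be plotted as zero in years where it does not appear.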

We can filter for a few words of interest and see how their frequency has changed over time.

year_term_counts %>% 
  filter(term %in% c("god", "america", "foreign", "union", "constitution", "freedom")) %>% 
  mutate(ratio = count/total_count) %>% 
  ggplot(aes(x = year, y = ratio)) +
  geom_point(size = 1.2) +
  geom_smooth(formula = y ~ x, method = "loess") +
  facet_wrap(~ term, scales = "free_y") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(y = "% frequency of word in inaugural address")

5. 2. Casting tidy text data into a matrix

Some functions require a DTM rather than the tidy format as input.
The tidytext library therefore also provides a function that converts the tidy format into a DTM.
That function is cast_dtm().
We will use the ap_tidy object that was converted from a DTM to tidy data above.

ap_tidy

## # A tibble: 302,031 x 3
##    document term       count
##       <int> <chr>      <dbl>
##  1        1 adding         1
##  2        1 adult          2
##  3        1 ago            1
##  4        1 alcohol        1
##  5        1 allegedly      1
##  6        1 allen          1
##  7        1 apparently     2
##  8        1 appeared       1
##  9        1 arrested       1
## 10        1 assault        1
## # … with 302,021 more rows

ap_dtm <- ap_tidy %>% 
  cast_dtm(
    term = term,
    document = document,
    value = count
  )

ap_dtm

## <<DocumentTermMatrix (documents: 2246, terms: 10473)>>
## Non-/sparse entries: 302031/23220327
## Sparsity           : 99%
## Maximal term length: 18
## Weighting          : term frequency (tf)
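
A quick way to convince yourself that casting and tidying are inverse operations is a round trip on a small table. This is a sketch with hypothetical values; rows are sorted before comparison because the order after the round trip is not assumed to match the input.

```r
library(dplyr)
library(tidytext)
library(tm)

# Hypothetical tidy counts
toy <- tibble(
  document = c("d1", "d1", "d2"),
  term     = c("apple", "pear", "apple"),
  count    = c(2, 1, 5)
)

toy_dtm <- toy %>%
  cast_dtm(document = document, term = term, value = count)

# Tidy it back and sort so the rows can be compared with the input
round_trip <- tidy(toy_dtm) %>%
  arrange(document, term)

dim(toy_dtm)
round_trip
```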

In the same way, the cast_dfm() function converts tidy data into a dfm object from the quanteda library.

ap_dfm <- ap_tidy %>% 
  cast_dfm(
    term = term,
    document = document,
    value = count
  )

ap_dfm

## Document-feature matrix of: 2,246 documents, 10,473 features (98.72% sparse) and 0 docvars.
##     features
## docs adding adult ago alcohol allegedly allen apparently appeared arrested
##    1      1     2   1       1         1     1          2        1        1
##    2      0     0   0       0         0     0          0        1        0
##    3      0     0   1       0         0     0          0        1        0
##    4      0     0   3       0         0     0          0        0        0
##    5      0     0   0       0         0     0          0        0        0
##    6      0     0   2       0         0     0          0        0        0
##     features
## docs assault
##    1       1
##    2       0
##    3       0
##    4       0
##    5       0
##    6       0
## [ reached max_ndoc ... 2,240 more documents, reached max_nfeat ... 10,463 more features ]

With this kind of conversion, the Jane Austen books used as examples in earlier chapters can also be turned into a DTM object.

library(janeaustenr)

austen_dtm <- austen_books() %>% 
  unnest_tokens(input = text, output = "word") %>% 
  count(book, word) %>% 
  cast_dtm(
    term = word,
    document = book,
    value = n
  )

austen_dtm

## <<DocumentTermMatrix (documents: 6, terms: 14520)>>
## Non-/sparse entries: 40379/46741
## Sparsity           : 54%
## Maximal term length: 19
## Weighting          : term frequency (tf)

5. 3. Tidying corpus objects with metadata

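A Corpus object stores a collection of documents before tokenization.
It keeps the text together with metadata, which may include each document's unique ID, date/time, title, and so on.
The example below uses the acq data from the tm library, which contains 50 news articles.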

data("acq")

acq

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 50

class(acq)

## [1] "VCorpus" "Corpus"

As shown below, each document in a Corpus object holds both text and metadata.

acq[[1]]

## <<PlainTextDocument>>
## Metadata:  15
## Content:  chars: 1287
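
The text and metadata of a single document can be inspected separately. The sketch below assumes the meta() and content() generics from the NLP package (loaded with tm); they are called with an explicit NLP:: prefix because quanteda masks meta() in this session.

```r
library(tm)

data("acq")

first_doc <- acq[[1]]

# Metadata fields attached to the document (NLP provides the generic)
names(NLP::meta(first_doc))
NLP::meta(first_doc, "heading")

# The text itself, truncated for display
substr(NLP::content(first_doc), 1, 80)
```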

This is a flexible way to store text data, but it is not well suited to processing with the tidytext library.
We therefore use the tidy() function to convert it into the tidy data format before analysis.

acq_tidy <- acq %>% 
  tidy()

acq_tidy

## # A tibble: 50 x 16
##    author  datetimestamp       description heading  id    language origin topics
##    <chr>   <dttm>              <chr>       <chr>    <chr> <chr>    <chr>  <chr> 
##  1 <NA>    1987-02-26 15:18:06 ""          COMPUTE… 10    en       Reute… YES   
##  2 <NA>    1987-02-26 15:19:15 ""          OHIO MA… 12    en       Reute… YES   
##  3 <NA>    1987-02-26 15:49:56 ""          MCLEAN'… 44    en       Reute… YES   
##  4 By Cal… 1987-02-26 15:51:17 ""          CHEMLAW… 45    en       Reute… YES   
##  5 <NA>    1987-02-26 16:08:33 ""          <COFAB … 68    en       Reute… YES   
##  6 <NA>    1987-02-26 16:32:37 ""          INVESTM… 96    en       Reute… YES   
##  7 By Pat… 1987-02-26 16:43:13 ""          AMERICA… 110   en       Reute… YES   
##  8 <NA>    1987-02-26 16:59:25 ""          HONG KO… 125   en       Reute… YES   
##  9 <NA>    1987-02-26 17:01:28 ""          LIEBERT… 128   en       Reute… YES   
## 10 <NA>    1987-02-26 17:08:27 ""          GULF AP… 134   en       Reute… YES   
## # … with 40 more rows, and 8 more variables: lewissplit <chr>, cgisplit <chr>,
## #   oldid <chr>, places <named list>, people <lgl>, orgs <lgl>,
## #   exchanges <lgl>, text <chr>

The text can then be tokenized into words with the unnest_tokens() function.

acq_tokens <- acq_tidy %>% 
  select(-places) %>% 
  unnest_tokens(
    input = text,
    output = "word"
  ) %>% 
  anti_join(stop_words, by = "word")

## Warning: Outer names are only allowed for unnamed scalar atomic inputs

# most common words
acq_tokens %>% 
  count(word, sort = TRUE)

## # A tibble: 1,566 x 2
##    word         n
##    <chr>    <int>
##  1 dlrs       100
##  2 pct         70
##  3 mln         65
##  4 company     63
##  5 shares      52
##  6 reuter      50
##  7 stock       46
##  8 offer       34
##  9 share       34
## 10 american    28
## # … with 1,556 more rows

# tf-idf
acq_tokens %>% 
  count(id, word) %>% 
  bind_tf_idf(
    term = word,
    document = id,
    n = n
  ) %>% 
  arrange(desc(tf_idf))

## # A tibble: 2,853 x 6
##    id    word         n     tf   idf tf_idf
##    <chr> <chr>    <int>  <dbl> <dbl>  <dbl>
##  1 186   groupe       2 0.133   3.91  0.522
##  2 128   liebert      3 0.130   3.91  0.510
##  3 474   esselte      5 0.109   3.91  0.425
##  4 371   burdett      6 0.103   3.91  0.405
##  5 442   hazleton     4 0.103   3.91  0.401
##  6 199   circuit      5 0.102   3.91  0.399
##  7 162   suffield     2 0.1     3.91  0.391
##  8 498   west         3 0.1     3.91  0.391
##  9 441   rmj          8 0.121   3.22  0.390
## 10 467   nursery      3 0.0968  3.91  0.379
## # … with 2,843 more rows
