[R] unnest_tokens()

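Tokenization

Tokenization is the process of splitting text data into units that can be analyzed.
Depending on the goal of the analysis, the unit can be characters, words, n-grams, sentences, paragraphs, and so on; the default is words (token = "words").
- Words: token = "words"
- Characters: token = "characters"
- Multi-character units: token = "character_shingles"
- Multi-word units: token = "ngrams"
- Units split by a regular expression: token = "regex"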

library(dplyr)     # tibble() and %>%
library(tidytext)  # unnest_tokens()

text <- c("I'm not lazy.\nI'm just on my energy saving mode.")

tibble(text = text) %>% 
  unnest_tokens(
    input = text,
    output = "word",
    token = "words"
  )

## # A tibble: 10 x 1
##    word  
##    <chr> 
##  1 i'm   
##  2 not   
##  3 lazy  
##  4 i'm   
##  5 just  
##  6 on    
##  7 my    
##  8 energy
##  9 saving
## 10 mode

tibble(text = text) %>% 
  unnest_tokens(
    input = text,
    output = "word",
    token = "characters"
  )

## # A tibble: 35 x 1
##    word 
##    <chr>
##  1 i    
##  2 m    
##  3 n    
##  4 o    
##  5 t    
##  6 l    
##  7 a    
##  8 z    
##  9 y    
## 10 i    
## # … with 25 more rows

tibble(text = text) %>% 
  unnest_tokens(
    input = text,
    output = "word",
    token = "character_shingles",
    n = 5
  )

## # A tibble: 31 x 1
##    word 
##    <chr>
##  1 imnot
##  2 mnotl
##  3 notla
##  4 otlaz
##  5 tlazy
##  6 lazyi
##  7 azyim
##  8 zyimj
##  9 yimju
## 10 imjus
## # … with 21 more rows

tibble(text = text) %>% 
  unnest_tokens(
    input = text,
    output = "word",
    token = "ngrams",
    n = 2
  )

## # A tibble: 9 x 1
##   word         
##   <chr>        
## 1 i'm not      
## 2 not lazy     
## 3 lazy i'm     
## 4 i'm just     
## 5 just on      
## 6 on my        
## 7 my energy    
## 8 energy saving
## 9 saving mode
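
If single words and bigrams are both needed in one pass, a minimal sketch is to pass n_min as well. This assumes that unnest_tokens() forwards extra arguments to tokenizers::tokenize_ngrams(), which accepts n_min; the output column name ngram is only an illustrative choice.

```r
library(dplyr)
library(tidytext)

text <- c("I'm not lazy.\nI'm just on my energy saving mode.")

# n = 2 with n_min = 1 requests every n-gram of length 1 and 2
ngrams_1_2 <- tibble(text = text) %>%
  unnest_tokens(
    input = text,
    output = ngram,
    token = "ngrams",
    n = 2,
    n_min = 1
  )

# Number of unigrams plus bigrams
nrow(ngrams_1_2)
```

With the 10 words in this text, the result should contain the 10 single words plus the 9 bigrams shown above.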

tibble(text = text) %>% 
  unnest_tokens(
    input = text,
    output = "word",
    token = "regex",
    pattern = "\n"
  )

## # A tibble: 2 x 1
##   word                              
##   <chr>                             
## 1 i'm not lazy.                     
## 2 i'm just on my energy saving mode.
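
Sentence units are mentioned at the top but not demonstrated. A short sketch, assuming token = "sentences" is available in the installed tidytext version; the column name sentence is an illustrative choice.

```r
library(dplyr)
library(tidytext)

text <- c("I'm not lazy.\nI'm just on my energy saving mode.")

# Split the text into sentence-level tokens instead of words
sentences <- tibble(text = text) %>%
  unnest_tokens(
    input = text,
    output = sentence,
    token = "sentences"
  )

sentences
```

Unlike the regex example, this does not depend on the line break: sentence boundaries are detected from the punctuation.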

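By default, tokenization converts all uppercase letters to lowercase.
Passing to_lower = FALSE keeps the original case, which is useful when, for example, proper nouns such as people's names need to be distinguished.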

tibble(text = text) %>% 
  unnest_tokens(
    input = text,
    output = "word",
    token = "regex",
    pattern = "\n",
    to_lower = FALSE
  )

## # A tibble: 2 x 1
##   word                              
##   <chr>                             
## 1 I'm not lazy.                     
## 2 I'm just on my energy saving mode.
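
The point about names can be illustrated with a small made-up sentence (the text below is hypothetical and not part of the original example). A sketch comparing both settings with word tokens:

```r
library(dplyr)
library(tidytext)

# Hypothetical sentence: "May" is a name, "may" is a modal verb
names_text <- c("May said it may rain.")

# Default behaviour: everything is lowercased
lowered <- tibble(text = names_text) %>%
  unnest_tokens(input = text, output = word, token = "words")

# Original case is kept
kept <- tibble(text = names_text) %>%
  unnest_tokens(input = text, output = word, token = "words", to_lower = FALSE)

# With the default, both forms collapse into the same token
count(lowered, word, sort = TRUE)

# With to_lower = FALSE, "May" and "may" are counted separately
count(kept, word, sort = TRUE)
```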

To keep punctuation marks as tokens instead of removing them, pass strip_punct = FALSE.

tibble(text = text) %>% 
  unnest_tokens(
    input = text,
    output = "word",
    token = "words",
    strip_punct = FALSE
  )

## # A tibble: 12 x 1
##    word  
##    <chr> 
##  1 i'm   
##  2 not   
##  3 lazy  
##  4 .     
##  5 i'm   
##  6 just  
##  7 on    
##  8 my    
##  9 energy
## 10 saving
## 11 mode  
## 12 .
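
Arguments such as strip_punct are, as far as I understand, not handled by unnest_tokens() itself but forwarded to the matching function in the tokenizers package (tokenize_words() for token = "words"). A sketch that calls that function directly, assuming tokenizers is installed:

```r
library(tokenizers)

text <- c("I'm not lazy.\nI'm just on my energy saving mode.")

# tokenize_words() returns a list with one character vector per input string
tokens <- tokenize_words(text, lowercase = TRUE, strip_punct = FALSE)

tokens[[1]]
length(tokens[[1]])
```

This should yield the same 12 tokens as the tibble above, which can help when checking what other options a given token type accepts.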

