Document Classification

Document Classification - Basic

1. tm package
2. Transformations
3. Matrix representation of documents
4. Term frequency
5. Correlation between terms

1. `tm` package

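A classic example of document classification, familiar to anyone who has studied the topic, is deciding whether an email is spam. Another is sentiment analysis, which reads a product review and decides whether it is positive or negative.
The tm package is one of R's text mining packages. It represents a collection of documents as a Corpus and each document as a TextDocument.
For Korean text mining, you can combine it with a package that handles Korean morphological analysis well, such as KoNLP. Because this is a basic walkthrough, we will use the English documents bundled with the package.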
```
library(tm)
```
```
summary(corpus)   # print summary information for the corpus
```
```
inspect(corpus)   # takes a corpus or a term-document matrix and prints its document information
```
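You can think of a corpus as an object that is ready for local mining without much further preparation.
Unstructured data can be turned into a corpus with code like the following, where x is a tm Source object (for example, one created with VectorSource() or DirSource()) and vec is a character vector.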

Corpus(x, readerControl = list(language = "lat"))
Corpus( VectorSource(vec) )
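
As a minimal sketch of the second form, the snippet below builds a corpus from a character vector. The vector `vec` and its three sentences are made up for illustration, and `VCorpus()` is used here as the explicit in-memory variant of `Corpus()`.

```r
library(tm)

# Hypothetical input: three short documents held in a character vector
vec <- c("Crude oil prices fell today.",
         "OPEC will meet in Vienna.",
         "Oil markets remain weak.")

# VectorSource() treats each vector element as one document
corpus <- VCorpus(VectorSource(vec))

length(corpus)         # number of documents in the corpus
content(corpus[[1]])   # text of the first document
```

Individual documents are accessed with `[[ ]]`, while `[ ]` returns a smaller corpus.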

Let's look at crude, a dataset bundled with tm that holds 20 news articles about crude oil.

data(crude)

summary(crude)

##     Length Class             Mode
## 127 2      PlainTextDocument list
## 144 2      PlainTextDocument list
## 191 2      PlainTextDocument list
## 194 2      PlainTextDocument list
## 211 2      PlainTextDocument list
## 236 2      PlainTextDocument list
## 237 2      PlainTextDocument list
## 242 2      PlainTextDocument list
## 246 2      PlainTextDocument list
## 248 2      PlainTextDocument list
## 273 2      PlainTextDocument list
## 349 2      PlainTextDocument list
## 352 2      PlainTextDocument list
## 353 2      PlainTextDocument list
## 368 2      PlainTextDocument list
## 489 2      PlainTextDocument list
## 502 2      PlainTextDocument list
## 543 2      PlainTextDocument list
## 704 2      PlainTextDocument list
## 708 2      PlainTextDocument list

2. Transformations

Use the tm_map() function to apply transformations to documents, such as removing punctuation, converting all characters to lowercase, or stemming (reducing each word to its root form).
```
tm_map(x, FUN)
```
- x : corpus
- FUN : the function to apply as the transformation
The list of transformation functions in the tm package can be retrieved with getTransformations().
```
getTransformations()
```
```
## [1] "removeNumbers"     "removePunctuation" "removeWords"      
## [4] "stemDocument"      "stripWhitespace"
```
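- stripWhitespace : removes unnecessary whitespace, collapsing runs of whitespace into a single space
- tolower : converts uppercase letters to lowercase (a base R function rather than a tm transformation, which is why getTransformations() does not list it)
- removePunctuation : removes punctuation

The following example converts all letters to lowercase and removes punctuation. Because tolower is passed to tm_map() directly here, each document in the result is a plain character vector, as the output shows; wrapping it as content_transformer(tolower) keeps the documents as text-document objects.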

x <- tm_map(crude, tolower)
inspect(tm_map(x, removePunctuation)[1])

## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## $`reut-00001.xml`
## [1] diamond shamrock corp said that\neffective today it had cut its contract prices for crude oil by\n150 dlrs a barrel\n    the reduction brings its posted price for west texas\nintermediate to 1600 dlrs a barrel the copany said\n    the price reduction today was made in the light of falling\noil product prices and a weak crude oil market a company\nspokeswoman said\n    diamond is the latest in a line of us oil companies that\nhave cut its contract or posted prices over the last two days\nciting weak oil markets\n reuter
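
A hedged sketch of the same cleanup as a small pipeline is shown below. It assumes tm 0.6 or later, where `content_transformer()` is available for wrapping base R functions; the variable name `clean` is only illustrative.

```r
library(tm)
data(crude)

# Wrap the base function so the result is still a text document
clean <- tm_map(crude, content_transformer(tolower))
clean <- tm_map(clean, removePunctuation)   # drop punctuation characters
clean <- tm_map(clean, stripWhitespace)     # collapse runs of whitespace

class(clean[[1]])                           # class of the first document
substr(content(clean[[1]])[1], 1, 40)       # first 40 characters of its text
```

Chaining the steps one at a time makes it easy to inspect the corpus after each transformation.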

3. Matrix representation of documents

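To classify documents, we first extract features that describe each document and then arrange them as a matrix.
To represent a corpus as a matrix of terms and documents, use TermDocumentMatrix() or DocumentTermMatrix().
- TermDocumentMatrix() builds a matrix from the given documents with terms as rows and documents as columns.
- DocumentTermMatrix() puts documents in rows and terms in columns.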
```
TermDocumentMatrix(x, control = list())   # x: a corpus; control: options such as bounds and weighting
```
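
To make the difference in orientation concrete, the sketch below builds both matrices from the crude corpus and compares their dimensions. The variable names `tdm` and `dtm` are just illustrative.

```r
library(tm)
data(crude)

tdm <- TermDocumentMatrix(crude)   # terms in rows, documents in columns
dtm <- DocumentTermMatrix(crude)   # documents in rows, terms in columns

dim(tdm)   # (number of terms, number of documents)
dim(dtm)   # (number of documents, number of terms)
```

Many modeling functions expect one observation per row, so the document-term form is often the more convenient input for classifiers.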

The following example represents crude as a term-document matrix.

x <- TermDocumentMatrix(crude)
x

## <<TermDocumentMatrix (terms: 1266, documents: 20)>>
## Non-/sparse entries: 2255/23065
## Sparsity           : 91%
## Maximal term length: 17
## Weighting          : term frequency (tf)

inspect(x[1:10, 1:10])

## <<TermDocumentMatrix (terms: 10, documents: 10)>>
## Non-/sparse entries: 9/91
## Sparsity           : 91%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## Sample             :
##             Docs
## Terms        127 144 191 194 211 236 237 242 246 248
##   "(it)        0   0   0   0   0   0   1   0   0   0
##   "demand      0   1   0   0   0   0   0   0   0   0
##   "expansion   0   0   0   0   0   0   0   0   0   0
##   "for         0   0   0   0   0   0   1   0   0   0
##   "growth      0   0   0   0   0   0   1   0   0   0
##   "if          0   0   0   0   0   1   0   0   0   0
##   "is          0   0   0   0   0   0   0   0   0   1
##   "may         0   0   0   0   0   0   0   0   0   1
##   "none        0   0   0   0   0   1   0   0   0   0
##   "opec        0   2   0   0   0   0   0   0   0   0
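
The terms above still carry quotation marks and parentheses because punctuation is kept by default. As a sketch, the control list can ask TermDocumentMatrix() to clean terms while counting them. The options chosen here (removePunctuation and stopwords) are assumptions for illustration and are not the settings used for the outputs in this post.

```r
library(tm)
data(crude)

# Strip punctuation and drop English stopwords while building the matrix
tdm_clean <- TermDocumentMatrix(
  crude,
  control = list(removePunctuation = TRUE,
                 stopwords = TRUE)
)

head(Terms(tdm_clean), 10)   # first few cleaned terms
```

Cleaning inside the control list leaves the original corpus untouched, which is handy when comparing several preprocessing choices.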

TF-IDF

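The most commonly used weighting function is TF-IDF. TF stands for Term Frequency, the number of times a term appears in a document. For example, a document in which a term appears ten times most likely discusses that term more, and in more detail, than one in which it appears once or not at all. For this reason, term counts can be a key feature for describing a document.
IDF stands for Inverse Document Frequency. It is based on the inverse of the share of documents in which a term appears, so a term found in almost every document receives a low weight and a rare term receives a high one.
For a fuller explanation of TF-IDF, a web search will turn up many references.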
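
The idea can be sketched in base R on a toy count matrix. The counts below are invented for illustration, and the formula (term frequency divided by document length, times log2 of the number of documents over the document frequency) is one common normalized variant, assumed here rather than taken from this post.

```r
# Toy term-document counts: rows are terms, columns are documents
counts <- matrix(c(2, 0, 1,
                   1, 1, 1,
                   0, 3, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Terms = c("oil", "said", "opec"),
                                 Docs  = c("d1", "d2", "d3")))

tf    <- sweep(counts, 2, colSums(counts), "/")   # counts divided by document length
df    <- rowSums(counts > 0)                      # documents containing each term
idf   <- log2(ncol(counts) / df)                  # rarer terms get a larger idf
tfidf <- tf * idf                                 # row i is scaled by idf[i]

round(tfidf, 3)
```

Here "said" appears in every document, so its idf is log2(3/3) = 0 and its whole row becomes zero, while "opec", which appears in only one document, keeps a high weight.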

The following example uses TF-IDF weighting.

x2 <- TermDocumentMatrix(crude, control = list(weighting = weightTfIdf))
inspect(x2[1:10, 1:10])

## <<TermDocumentMatrix (terms: 10, documents: 10)>>
## Non-/sparse entries: 9/91
## Sparsity           : 91%
## Maximal term length: 10
## Weighting          : term frequency - inverse document frequency (normalized) (tf-idf)
## Sample             :
##             Docs
## Terms        127        144 191 194 211         236        237 242 246
##   "(it)        0 0.00000000   0   0   0 0.000000000 0.01214025   0   0
##   "demand      0 0.01180855   0   0   0 0.000000000 0.00000000   0   0
##   "expansion   0 0.00000000   0   0   0 0.000000000 0.00000000   0   0
##   "for         0 0.00000000   0   0   0 0.000000000 0.01214025   0   0
##   "growth      0 0.00000000   0   0   0 0.000000000 0.01214025   0   0
##   "if          0 0.00000000   0   0   0 0.011618086 0.00000000   0   0
##   "is          0 0.00000000   0   0   0 0.000000000 0.00000000   0   0
##   "may         0 0.00000000   0   0   0 0.000000000 0.00000000   0   0
##   "none        0 0.00000000   0   0   0 0.008929914 0.00000000   0   0
##   "opec        0 0.02361709   0   0   0 0.000000000 0.00000000   0   0
##             Docs
## Terms               248
##   "(it)      0.00000000
##   "demand    0.00000000
##   "expansion 0.00000000
##   "for       0.00000000
##   "growth    0.00000000
##   "if        0.00000000
##   "is        0.01153447
##   "may       0.01500669
##   "none      0.00000000
##   "opec      0.00000000

4. Term frequency

The findFreqTerms() function finds frequently occurring terms in a term-document matrix.
```
findFreqTerms(x,
              lowfreq = 0,
              highfreq = Inf)
```
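- x : a term-document or document-term matrix
- lowfreq : minimum number of occurrences, highfreq : maximum number of occurrences

The following example finds the terms that appear at least 10 times in the crude corpus, which consists of 20 documents.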

findFreqTerms(TermDocumentMatrix(crude), lowfreq = 10)

##  [1] "about"      "and"        "are"        "bpd"        "but"       
##  [6] "crude"      "dlrs"       "for"        "from"       "government"
## [11] "has"        "its"        "kuwait"     "last"       "market"    
## [16] "mln"        "new"        "not"        "official"   "oil"       
## [21] "one"        "opec"       "pct"        "price"      "prices"    
## [26] "reuter"     "said"       "said."      "saudi"      "sheikh"    
## [31] "that"       "the"        "they"       "u.s."       "was"       
## [36] "were"       "will"       "with"       "would"
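
To show what the function is doing, the sketch below reproduces the result by hand: it sums each term's counts across all documents and keeps the terms whose total reaches the threshold. Treating findFreqTerms() as a filter on these row totals is my reading of its behavior, and the names `totals` and `manual` are illustrative.

```r
library(tm)
data(crude)

tdm <- TermDocumentMatrix(crude)

# Total occurrences of each term across all 20 documents
totals <- rowSums(as.matrix(tdm))

# Keep terms whose total count is at least 10
manual <- names(totals)[totals >= 10]

setequal(manual, findFreqTerms(tdm, lowfreq = 10))   # compare with the built-in
```

Converting to a dense matrix is fine for 20 documents, but for a large corpus it is safer to stay with findFreqTerms(), which works on the sparse representation.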

5. Correlation between terms

The findAssocs() function finds the terms that are highly correlated with a given term.
```
findAssocs(x, 
           terms,
           corlimit)
```
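- x : a term-document or document-term matrix
- terms : the terms for which to find highly correlated terms
- corlimit : lower bound on the correlation coefficient
The following example finds the terms whose correlation with oil is at least 0.7.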

findAssocs(TermDocumentMatrix(crude), 
           terms = "oil", 
           corlimit = 0.7)

## $oil
##      15.8      opec   clearly      late    trying       who    winter 
##      0.87      0.87      0.80      0.80      0.80      0.80      0.80 
##  analysts      said   meeting     above emergency    market     fixed 
##      0.79      0.78      0.77      0.76      0.75      0.75      0.73 
##      that    prices agreement    buyers 
##      0.73      0.72      0.71      0.70
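
As a rough sketch of where these numbers come from, the snippet below computes the Pearson correlation between the per-document counts of "oil" and those of every other term, then applies the same 0.7 cutoff. That findAssocs() works this way is my understanding rather than something stated in this post, and the names `m`, `r` and `manual` are illustrative.

```r
library(tm)
data(crude)

dtm <- DocumentTermMatrix(crude)
m   <- as.matrix(dtm)                  # 20 documents x all terms

# Correlate the "oil" column with every term column
r <- suppressWarnings(cor(m[, "oil"], m))[1, ]
r <- r[names(r) != "oil"]              # drop the term itself

# Keep correlations of at least 0.7, rounded and sorted
manual <- sort(round(r[which(r >= 0.7)], 2), decreasing = TRUE)
head(manual)
```

Two terms correlate strongly when their counts rise and fall together across documents, so with only 20 documents a few shared articles can produce a high coefficient.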
