Natural Language Processing: DTM, TF-IDF

Document-Term Matrix and Term Frequency-Inverse Document Frequency

natural language processing

tfidf

Author

Cheonghyo Cho

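To analyze the meaning of a document effectively, it is important to quantify how often each word appears. Two representative methods for this are the DTM (Document-Term Matrix) and TF-IDF (Term Frequency-Inverse Document Frequency). These methods help extract the key information in a document and are useful for evaluating the similarity between documents in natural language processing tasks.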

Bag of Words (BoW)

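Bag of Words is a frequency-based word representation that ignores the order in which words appear. BoW represents text numerically by counting how many times each word occurs, and it is mainly used for tasks that judge the character of a document from which words appear and how often.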

from sklearn.feature_extraction.text import CountVectorizer

# Example documents
documents = [
    "I love programming.",
    "I love coding.",
    "Programming is fun."
]

# Build the BoW representation with CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Print the BoW result
print("BoW:")
print(X.toarray())
print("Vocabulary:")
print(vectorizer.vocabulary_)

BoW:
[[0 0 0 1 1]
 [1 0 0 1 0]
 [0 1 1 0 1]]
Vocabulary:
{'love': 3, 'programming': 4, 'coding': 0, 'is': 2, 'fun': 1}
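
To make the counting behind the output above concrete, here is a minimal pure-Python sketch of the same idea. The `tokenize` helper and its regular expression are assumptions of this sketch, chosen to mimic `CountVectorizer`'s default behaviour of lowercasing text and keeping only tokens of two or more characters, which is also why "I" does not appear in the vocabulary.

```python
import re
from collections import Counter

documents = [
    "I love programming.",
    "I love coding.",
    "Programming is fun.",
]

def tokenize(text):
    # Hypothetical helper: lowercase the text and keep tokens with
    # at least two word characters (so the single letter "I" is dropped).
    return re.findall(r"\b\w\w+\b", text.lower())

# Count tokens per document, then build an alphabetically sorted vocabulary.
counts = [Counter(tokenize(doc)) for doc in documents]
vocabulary = sorted(set().union(*counts))
word_to_index = {word: i for i, word in enumerate(vocabulary)}

# One row per document, one column per vocabulary word.
bow = [[c[word] for word in vocabulary] for c in counts]

print(word_to_index)
for row in bow:
    print(row)
```

For these three documents the rows match the `CountVectorizer` matrix shown above, since word order plays no role in either version.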

Document-Term Matrix (DTM)

A Document-Term Matrix (DTM) represents the frequency of each word across multiple documents as a matrix, with one row per document and one column per word.

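Limitations of DTM

Each row of a DTM (each document vector) has as many dimensions as the size of the entire vocabulary.
- A document vector can have tens of thousands of dimensions or more.
- Most of its values may be 0: a sparse representation.
- It requires a large amount of storage and high computational complexity.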
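
The sparsity described above can be measured directly. The following is a minimal sketch, assuming the same three example documents; it relies on the fact that `CountVectorizer.fit_transform` returns a SciPy sparse matrix, whose `nnz` attribute gives the number of stored non-zero entries.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "I love programming.",
    "I love coding.",
    "Programming is fun.",
]

# fit_transform returns a SciPy sparse matrix rather than a dense array.
X = CountVectorizer().fit_transform(documents)

n_docs, n_terms = X.shape
total_cells = n_docs * n_terms
nonzero_cells = X.nnz  # number of stored non-zero entries

# Fraction of cells that are zero.
sparsity = 1 - nonzero_cells / total_cells

print(f"shape: {X.shape}")
print(f"non-zero cells: {nonzero_cells} of {total_cells}")
print(f"sparsity: {sparsity:.2%}")
```

Even in this tiny corpus more than half of the cells are zero. With a realistic vocabulary the fraction is typically far higher, which is why the matrix is usually kept in sparse form and converted with `toarray()` only for small demonstrations.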

# DTM example: the matrix X built by CountVectorizer above is already a DTM
print("Document-Term Matrix (DTM):")
print(X.toarray())

Document-Term Matrix (DTM):
[[0 0 0 1 1]
 [1 0 0 1 0]
 [0 1 1 0 1]]

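Each document contains a mix of important words and unnecessary ones.

For example, stopwords are words that carry little meaning for natural language processing even when they occur frequently.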

# Stopword removal example
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

print("DTM without stopwords:")
print(X.toarray())
print("Vocabulary without stopwords:")
print(vectorizer.vocabulary_)

DTM without stopwords:
[[0 0 1 1]
 [1 0 1 0]
 [0 1 0 1]]
Vocabulary without stopwords:
{'love': 2, 'programming': 3, 'coding': 0, 'fun': 1}

TF-IDF (Term Frequency-Inverse Document Frequency)

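TF-IDF (Term Frequency-Inverse Document Frequency) weights each word in a DTM by its importance, using the word's frequency within a document and the number of documents in which it appears.

TF-IDF is mainly used for computing document similarity, ranking the importance of search results in retrieval systems, and measuring the importance of a specific word within a document.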

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF vectorization
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(documents)

print("TF-IDF:")
print(X_tfidf.toarray())
print("TF-IDF Vocabulary:")
print(tfidf_vectorizer.vocabulary_)

TF-IDF:
[[0.         0.         0.         0.70710678 0.70710678]
 [0.79596054 0.         0.         0.60534851 0.        ]
 [0.         0.62276601 0.62276601 0.         0.4736296 ]]
TF-IDF Vocabulary:
{'love': 3, 'programming': 4, 'coding': 0, 'is': 2, 'fun': 1}
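
Since document similarity is one of the main uses of TF-IDF, here is a short sketch of comparing the three example documents with cosine similarity. It uses `cosine_similarity` from `sklearn.metrics.pairwise`; applying it to this toy corpus is an illustration only, not a recommendation for a particular retrieval setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "I love programming.",
    "I love coding.",
    "Programming is fun.",
]

# Rows are TF-IDF document vectors.
X_tfidf = TfidfVectorizer().fit_transform(documents)

# similarity[i, j] is the cosine similarity between document i and document j.
similarity = cosine_similarity(X_tfidf)

print(similarity.round(3))
```

Documents 2 and 3 share no terms, so their similarity is 0. Document 1 shares "love" with document 2 and "programming" with document 3, so both of those similarities are positive.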

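tf(d,t): the number of times term t appears in document d
df(t): the number of documents in which term t appears
idf(t): a value inversely related to df(t)
- Here, 1 is added to the denominator and the logarithm is taken.
- \(\text{idf}(t) = \log \left( \frac{n}{1+\text{df}(t)} \right)\)
- The logarithm keeps the IDF value from growing excessively as the total number of documents n increases, and adding 1 to the denominator avoids division by zero.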
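
As a worked example of the IDF definition above, the sketch below evaluates it for the five terms of the example corpus (n = 3). The natural logarithm is an assumption here, since the formula does not state a base, and the `document_frequency` values are read off the example DTM.

```python
import math

n = 3  # total number of documents in the example corpus

# Number of documents containing each term, taken from the example DTM.
document_frequency = {"coding": 1, "fun": 1, "is": 1, "love": 2, "programming": 2}

def idf(df, n_docs):
    # idf(t) = log(n / (1 + df(t))), using the natural logarithm (assumed base).
    return math.log(n_docs / (1 + df))

idf_values = {term: idf(df, n) for term, df in document_frequency.items()}

for term, value in idf_values.items():
    print(f"{term}: {value:.4f}")
```

Terms that occur in a single document get the larger weight, log(3/2), while "love" and "programming", which occur in two of the three documents, get log(3/3) = 0. Under this particular formula a term that appears in every document would receive a negative value, which is one reason libraries often use a slightly different variant.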

\[\text{TF-IDF}(d,t)=\text{tf}(d,t) \times \text{idf}(t)\]
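
Note that the numbers printed by `TfidfVectorizer` earlier do not follow this formula exactly. With its default settings, scikit-learn uses a smoothed IDF, \(\ln\frac{1+n}{1+\text{df}(t)} + 1\), and then scales each document vector to unit Euclidean length. The sketch below reproduces the earlier output with NumPy under those defaults; the `counts` array is simply the example DTM typed in by hand.

```python
import numpy as np

# Raw term counts from the BoW example
# (columns: coding, fun, is, love, programming).
counts = np.array([
    [0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1],
], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # document frequency per term

# Smoothed IDF, matching TfidfVectorizer's default smooth_idf=True.
idf = np.log((1 + n_docs) / (1 + df)) + 1

# Weight the counts, then scale each row to unit L2 norm (default norm="l2").
weighted = counts * idf
tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(tfidf, 8))
```

The smoothing keeps every weight strictly positive, and the normalization makes documents of different lengths comparable, so the dot product of two rows is directly their cosine similarity.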

References

딥 러닝을 이용한 자연어 처리 입문 (Introduction to Natural Language Processing Using Deep Learning) (https://wikidocs.net/book/2155)