Natural Language Processing: Word Embeddings - GloVe, FastText

GloVe, FastText

natural language processing
word embedding
glove
fasttext
Author

Cheonghyo Cho

Word embedding is a technique in natural language processing (NLP) that converts words into real-valued vectors of a fixed size. This post looks at GloVe and FastText.

Count-based - LSA (Latent Semantic Analysis): takes as input the global statistics of the corpus - a matrix such as a DTM or TF-IDF matrix that counts each word's frequency in each document - and reduces its dimensionality (truncated SVD) to draw out latent meaning. Its weakness: poor performance on word analogy tasks.
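As an illustration (not part of the original post), a minimal LSA sketch using scikit-learn's TfidfVectorizer and TruncatedSVD on a toy corpus:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["I like deep learning", "I like NLP", "I enjoy flying"]

# Build the TF-IDF document-term matrix (the global count statistics)
tfidf = TfidfVectorizer().fit_transform(docs)

# Truncated SVD reduces the matrix to 2 latent dimensions (LSA)
svd = TruncatedSVD(n_components=2)
doc_vectors = svd.fit_transform(tfidf)
print(doc_vectors)  # each document as a 2-dimensional latent vector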

Prediction-based - Word2Vec: learns by reducing the error between predicted and actual values through a loss function. Its weakness: because the embedding vectors only consider neighboring words within the window size, they cannot reflect the global statistics of the corpus.
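For comparison, a minimal Word2Vec sketch with gensim on the same toy corpus (the window parameter is exactly what limits the model to local context):

from gensim.models import Word2Vec

sentences = [["i", "like", "deep", "learning"],
             ["i", "like", "nlp"],
             ["i", "enjoy", "flying"]]

# Skip-gram (sg=1): predict context words within `window` positions of the center word
model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)
print(model.wv.most_similar("like"))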

GloVe

A method that uses both the count-based and the prediction-based approach.

Window-based co-occurrence matrix

  • I like deep learning
  • I like NLP
  • I enjoy flying
Count     I   like  enjoy  deep  learning  NLP  flying
I         0   2     1      0     0         0    0
like      2   0     0      1     0         1    0
enjoy     1   0     0      0     0         0    1
deep      0   1     0      0     1         0    0
learning  0   0     0      1     0         0    0
NLP       0   1     0      0     0         0    0
flying    0   0     1      0     0         0    0
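The matrix above can be reproduced in a few lines of Python (a sketch assuming a window size of 1):

from collections import defaultdict

sentences = [["I", "like", "deep", "learning"],
             ["I", "like", "NLP"],
             ["I", "enjoy", "flying"]]

window = 1
cooc = defaultdict(int)
for sent in sentences:
    for i, center in enumerate(sent):
        # Count every word within `window` positions of the center word
        for j in range(max(0, i - window), min(len(sent), i + window + 1)):
            if i != j:
                cooc[(center, sent[j])] += 1

print(cooc[("I", "like")])     # 2 ("I like ..." appears in two sentences)
print(cooc[("like", "deep")])  # 1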

Co-occurrence probability

Co-occurrence probabilities and their ratio   k=solid   k=gas     k=water  k=fashion
\(P(k | ice)\)                                0.00019   0.000066  0.003    0.000017
\(P(k | steam)\)                              0.000022  0.00078   0.0022   0.000018
\(P(k | ice) / P(k | steam)\)                 8.9       0.085     1.36     0.96
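Reading the table: the ratio in the first column, \(P(\text{solid} \mid \text{ice}) / P(\text{solid} \mid \text{steam}) \approx 8.9\), is far above 1 because 'solid' is related to 'ice' but not to 'steam'; conversely, the ratio for 'gas' (0.085) is far below 1, and words related to both ('water') or to neither ('fashion') give ratios close to 1. GloVe is designed to encode exactly these ratios.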

GloVe's idea, summarized in one line: 'make the dot product of the embedded center word vector and context word vector equal to the (log of the) probability of their co-occurrence over the entire corpus.'

\[ \text{dot product}(w_i, \tilde{w}_k) \approx \log P(k \mid i) = \log P_{ik} \]

From this, the loss function is designed. The derivation is omitted here; the result is the generalized loss function below.

\[ \text{Loss function} = \sum_{m,n=1}^{V} f(X_{mn}) \left( w_m^T \tilde{w}_n + b_m + \tilde{b}_n - \log X_{mn} \right)^2 \]
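For reference, the weighting function \(f\) in the loss above is defined in the original GloVe paper (Pennington et al., 2014) as

\[ f(x) = \begin{cases} (x / x_{\max})^{\alpha} & \text{if } x < x_{\max} \\ 1 & \text{otherwise} \end{cases} \]

with \(x_{\max} = 100\) and \(\alpha = 3/4\). It caps the influence of very frequent pairs, and since \(f(0) = 0\), pairs that never co-occur drop out of the sum (avoiding \(\log 0\)).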

import gensim.downloader as api

# Download and load a pretrained GloVe model
glove_vectors = api.load("glove-wiki-gigaword-100")  # 100-dimensional GloVe vectors

# "machine" 단어의 벡터 출력
print("Vector for 'machine':")
print(glove_vectors['machine'])

# "learning" 단어의 벡터 출력
print("Vector for 'learning':")
print(glove_vectors['learning'])

# Compute the similarity between the two words
similarity = glove_vectors.similarity('machine', 'learning')
print(f"Similarity between 'machine' and 'learning': {similarity}")

# Find the 5 most similar words
similar_words = glove_vectors.most_similar('machine', topn=5)
print("Top 5 words similar to 'machine':")
for word, score in similar_words:
    print(f"{word}: {score}")
Vector for 'machine':
[-0.65365    0.49419   -0.26245   -0.20722   -0.11413    0.35701
  1.0454     0.21881    0.52769    0.60606    0.42521   -0.65169
  0.15318   -0.14797    0.12651   -0.017124   0.45325    0.37166
 -0.26847   -0.2627     0.43869   -0.016615   0.12714   -0.54708
  0.089084   0.24336   -0.34415    0.0026505 -0.094268   0.056114
  0.46366    0.68786   -0.20631   -0.088003   0.32153   -0.91399
 -0.080976  -0.90761    0.92889   -0.68033    0.23801   -0.37469
 -0.43278   -0.19243   -0.23711   -0.73041   -0.50592   -0.30237
  0.0017281 -0.60923   -0.21046    0.47403    0.37333    1.2475
  0.6299    -1.5292    -0.32403    0.59681    0.97994    0.59756
  0.67625    0.28223   -0.26748    1.425     -0.34419    0.25212
  0.3024    -0.26582   -0.22583    0.53783   -0.44439   -0.24281
  0.38001    0.085317   0.49694    0.24058    0.20611    0.023896
 -0.53078    0.12086    1.1627    -0.0053908 -0.66132    0.073666
 -1.5987     0.3626     0.68496   -0.93403    0.30523   -0.1688
  0.43895    0.73641    0.56431    1.0804     0.074377  -0.89155
 -0.20935   -0.3041     1.3027     0.1273   ]
Vector for 'learning':
[ 0.64812    0.69878   -0.39947    0.77634   -0.13132    0.2024
 -0.33399   -0.0066588  0.061684   0.1885    -0.10559   -0.31316
 -0.082495  -0.080517   0.3858    -0.10302    0.049431   0.17216
 -0.59079    0.77068   -1.2768    -0.25187    0.2195    -0.20176
 -0.30581   -0.18518    0.010889  -0.07529   -0.34732    0.61998
 -0.99703    1.0516    -0.42071   -0.39635    0.32607   -0.40061
 -0.46462    0.69904    0.29567   -0.35309   -0.59074    0.28999
 -0.25732   -0.1317    -0.69798    0.49818    0.41503    0.1487
  0.083347  -0.43543   -0.093969  -0.3543     0.014998   0.63593
  0.54564   -1.8439     0.78842   -0.19836    1.5707     0.25988
  0.20875    0.7521    -0.085488  -0.70717    0.094104   0.44485
  0.087818  -0.34779    0.57148    0.18662   -0.29435    0.42928
  0.28392   -0.61614   -0.34108    0.58192   -0.16388   -0.0081997
 -0.27162   -0.27112   -0.21471    0.37376   -0.5352    -0.060945
 -1.6317     0.85144    0.056035  -0.53861   -0.58383   -0.19612
 -0.33941   -0.3141     0.22999    0.32688    0.043012  -0.037482
 -0.092067  -1.0734     0.8924     0.41776  ]
Similarity between 'machine' and 'learning': 0.33105602860450745
Top 5 words similar to 'machine':
machines: 0.7854335904121399
device: 0.6772987842559814
equipment: 0.6411973237991333
gun: 0.640908420085907
guns: 0.6361788511276245

FastText

FastText (Facebook AI Research, 2016)

An extension of Word2Vec: even a single word is treated as being composed of several units. That is, it learns embeddings at the subword level.

Subwords

  • Each word is treated as a composition of character-level n-grams.
  • The boundary markers < and > are attached to the start and end of the word, and the whole-word token <word> itself is also added.

  • By default, the minimum n is 3 and the maximum is 6.
  • Each subword is then assigned a vector,
  • and the vector for a word (e.g., apple) is the sum of its subword vectors.

For n = 3 to 6: <ap, app, ppl, ple, le>, <app, appl, pple, ple>, <appl, apple, pple>, ..., <apple>

apple = <ap, app, ppl, ple, le> + <app, appl, pple, ple> + <appl, apple, pple> + ... + <apple>
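The subword extraction described above can be sketched in a few lines (a simplified illustration; the actual FastText implementation additionally hashes n-grams into a fixed number of buckets):

def char_ngrams(word, n_min=3, n_max=6):
    # Wrap the word in boundary markers, then slide windows of size n_min..n_max
    token = f"<{word}>"
    grams = [token[i:i + n]
             for n in range(n_min, n_max + 1)
             for i in range(len(token) - n + 1)]
    grams.append(token)  # the whole word is also kept as its own token
    return grams

print(char_ngrams("apple"))
# ['<ap', 'app', 'ppl', 'ple', 'le>', '<app', 'appl', 'pple', 'ple>', ...]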

Handling unknown words (out of vocabulary, OOV)

  • After FastText's neural network has been trained, a word embedding exists for every n-gram of every word in the dataset.
  • Thanks to subwords, similarities to other words can be computed even for unknown (OOV) words.
  • E.g., if 'birthplace' was never seen in training but 'birth' and 'place' were, FastText can still obtain a vector for 'birthplace'.

Handling rare words

  • Word2Vec embeds rarely occurring words with low accuracy.
  • FastText, however, gains accuracy when a rare word's n-grams overlap with those of other words.
  • It is also strong on noisy corpora with many typos and the like (e.g., 'oranze').

from gensim.models import FastText
from nltk.tokenize import word_tokenize
import nltk

nltk.download('punkt')  # word_tokenize needs the punkt tokenizer data

# Example documents
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "I love natural language processing and machine learning.",
    "Word embeddings are a type of word representation that allows words to be represented as vectors.",
    "Gensim is a useful library for text processing in Python.",
    "Machine learning models can be used for a variety of tasks, including classification and regression.",
    "Deep learning is a subset of machine learning that uses neural networks.",
    "Text data requires preprocessing before it can be used in machine learning models.",
    "Natural language processing involves the interaction between computers and humans using natural language.",
    "The field of artificial intelligence includes machine learning and deep learning.",
    "Python is a popular programming language for data science and machine learning.",
    "In the world of data science, Python is a widely used programming language.",
    "FastText is an extension of Word2Vec, developed by Facebook's AI Research lab.",
    "It is particularly well-suited for text classification and representation learning.",
    "Neural networks are a powerful tool for machine learning and artificial intelligence.",
    "Preprocessing text data is a crucial step in natural language processing.",
    "Data science involves using algorithms, data analysis, and machine learning to extract insights from data.",
    "The history of natural language processing dates back to the 1950s.",
    "Gensim provides efficient implementations of popular algorithms for word vector representations.",
    "The applications of machine learning are vast and include fields such as finance, healthcare, and technology.",
    "Understanding the context of a word is crucial for accurate natural language understanding.",
    "Text classification is a common task in natural language processing.",
    "FastText can be used to create word embeddings that capture the meaning of words in context.",
    "The development of neural networks has significantly advanced the field of machine learning.",
    "Python's libraries and frameworks make it a powerful tool for data scientists.",
    "Machine learning algorithms can identify patterns and make predictions based on data.",
    "Deep learning models are capable of learning complex patterns in data.",
    "Natural language processing enables computers to understand and respond to human language.",
    "The rise of artificial intelligence has led to significant advancements in many fields.",
    "FastText supports both supervised and unsupervised learning.",
    "Word embeddings are used to convert words into numerical vectors.",
    "The performance of a machine learning model depends on the quality of the data and the choice of algorithm."
]
# Tokenize the documents
tokenized_docs = [word_tokenize(doc.lower()) for doc in documents]

# Initialize the FastText model
model = FastText(vector_size=100, window=3, min_count=1)

# Build the vocabulary
model.build_vocab(corpus_iterable=tokenized_docs)

# Train the model
model.train(corpus_iterable=tokenized_docs, total_examples=len(tokenized_docs), epochs=10)

# "machine" 단어의 벡터 출력
print("Vector for 'machine':")
print(model.wv['machine'])

# "learning" 단어의 벡터 출력
print("Vector for 'learning':")
print(model.wv['learning'])

# Compute the similarity between the two words
similarity = model.wv.similarity('machine', 'learning')
print(f"Similarity between 'machine' and 'learning': {similarity}")

# Find the 5 most similar words
similar_words = model.wv.most_similar('machine', topn=5)
print("Top 5 words similar to 'machine':")
for word, score in similar_words:
    print(f"{word}: {score}")
Vector for 'machine':
[ 3.0476719e-04 -4.5854144e-04  1.4809435e-03 -3.2558880e-04
 -1.2461523e-03 -2.6668392e-03 -5.9312402e-04 -7.8534149e-04
 -3.3754381e-04 -8.5237127e-04  1.9166825e-03 -4.5680077e-05
  2.8688330e-05  1.6392061e-03  2.9227816e-04 -4.8287626e-04
 -9.6471590e-04  1.2272107e-03  1.3621673e-03  3.9218552e-04
 -1.1882874e-03  2.5077889e-04 -2.6549169e-04  3.5418704e-04
 -1.1895838e-03  3.3356482e-05  2.1073280e-03 -1.4254816e-03
 -7.2300015e-04  1.5622780e-03  5.5868068e-04  1.1086962e-03
  7.9263339e-04  4.7905196e-04 -3.5625519e-04  1.8447104e-03
 -5.5404962e-04  1.1235311e-03 -3.4734365e-04 -5.2451598e-04
  1.4605122e-03 -2.4416181e-04  6.3423667e-04 -2.3380520e-03
 -9.6629614e-05 -9.4167644e-04 -1.2651831e-03 -2.4021158e-05
 -1.6872913e-03  4.6961497e-05  5.1126088e-04  2.4645268e-03
  1.1963520e-03  2.4838850e-04 -2.5080782e-03 -1.5787635e-03
  3.1514847e-04 -1.4349066e-03 -9.8461658e-04 -5.2130187e-04
  5.0792546e-04 -5.4143183e-04 -2.3813437e-04 -4.0953144e-04
  2.5396477e-04  7.3200627e-04  8.8836439e-04  5.4181961e-04
  1.0334291e-03  7.8741541e-05 -1.3979069e-03 -5.7497567e-05
 -7.5514033e-04 -1.2660893e-03  4.0109811e-04  1.5810893e-03
 -4.9963361e-04  1.9547443e-03  2.8228809e-04  1.2446599e-04
 -4.8071955e-04  6.1670638e-04 -5.6519522e-04 -4.2860306e-04
 -4.0691014e-04 -2.2247438e-03 -1.6878321e-03  5.7058045e-05
 -1.1971041e-03 -4.6619569e-04  3.6342005e-04  3.2276681e-04
 -9.8186603e-05 -8.9766941e-04 -1.2735218e-03 -7.5350417e-04
 -4.0317111e-04 -1.6944273e-04 -1.9314818e-03  1.8059995e-03]
Vector for 'learning':
[ 9.6966513e-04  3.7514817e-04 -1.4854937e-03 -1.9914471e-05
 -1.2665654e-03 -7.9932979e-05 -6.1350228e-04 -7.4773835e-04
  1.3385082e-03  4.2618866e-04 -1.9866110e-04  3.1036657e-04
 -6.0938101e-04  1.2363552e-03  1.0377121e-03 -1.6709780e-03
 -1.4914436e-03 -1.2266421e-03 -1.9124926e-04 -2.2415379e-03
 -2.3766351e-03 -2.7787792e-03  1.2596940e-03 -4.9960025e-04
 -4.9759069e-04 -1.3365583e-03  2.8376342e-04 -1.2224232e-03
 -4.3068963e-04 -1.0582112e-03 -1.5169648e-03  1.5298016e-03
  4.6493908e-04 -2.3008294e-04 -2.0064659e-04  1.0392127e-03
 -3.7839505e-04  7.0851075e-04 -1.4305441e-03 -1.0678984e-03
 -1.1661201e-03 -1.5288709e-03 -8.1254612e-04  9.7811641e-04
 -8.0565893e-04  3.4704155e-04  1.0826325e-04  3.1954201e-04
 -2.1086454e-04  5.8729271e-04  1.3712725e-03 -9.1275712e-04
 -4.9673271e-04  1.5072321e-03  5.6268717e-04  8.2734477e-04
 -2.0161485e-04 -1.3020886e-04 -6.3217379e-04 -1.0167899e-03
  4.0030672e-04 -1.3218716e-03 -1.5925691e-03  1.4435191e-03
 -1.6054848e-03 -1.9919673e-04  1.8706530e-03  9.9316600e-04
 -3.5836015e-04 -1.0444176e-04  1.3183409e-03  3.3161032e-04
 -5.1780342e-04 -1.1585384e-03 -1.9950747e-04 -4.3703237e-04
  1.9279205e-03  1.6133464e-03 -1.7324421e-03  1.2685662e-03
  2.2011364e-03  1.1942191e-03 -1.8379397e-03 -7.8931829e-05
 -1.2394376e-03  3.8314451e-04  1.1821385e-03 -6.7486422e-04
 -2.7123379e-04  4.6274171e-04  8.4478880e-04  5.5789534e-04
  6.8324956e-04  1.0421082e-04  1.4224158e-03  1.7097123e-03
 -1.6650079e-04  1.5255744e-03 -6.9366029e-04 -1.3856716e-03]
Similarity between 'machine' and 'learning': -0.002112696412950754
Top 5 words similar to 'machine':
processing: 0.21260984241962433
language: 0.21258284151554108
patterns: 0.20867955684661865
data: 0.20386236906051636
computers: 0.19721049070358276
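Because of the subword vectors, the model trained above can also handle OOV words directly. A quick check (assuming the `model` from the code above; 'birthplace' is an illustrative word that never occurs in `documents`):

# 'birthplace' is not in the vocabulary, but FastText composes a vector from its n-grams
print('birthplace' in model.wv.key_to_index)        # False: not in the vocabulary
print(model.wv['birthplace'][:5])                   # a subword-based vector still comes back
print(model.wv.most_similar('birthplace', topn=3))  # similarities also work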

References

  • Introduction to Natural Language Processing Using Deep Learning (딥 러닝을 이용한 자연어 처리 입문): https://wikidocs.net/book/2155