
XLNet: Generalized Autoregressive Pretraining for Language Understanding

by ์ œ๋ฃฝ 2023. 7. 5.
💡 Paper: XLNet: Generalized Autoregressive Pretraining for Language Understanding (Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, Quoc V. Le) https://arxiv.org/abs/1906.08237
Korean translation referenced: https://jeonsworld.github.io/NLP/xlnet/
  • XLNet์€ GPT๋กœ ๋Œ€ํ‘œ๋˜๋Š” auto-regressive(AR) ๋ชจ๋ธ๊ณผ BERT๋กœ ๋Œ€ํ‘œ๋˜๋Š” auto-encoder(AE) ๋ชจ๋ธ์˜ ์žฅ์ ๋งŒ์„ ํ•ฉํ•œ generalized AR pretraining model.
  • ์ด๋ฅผ ์œ„ํ•ด permutation language modeling objective๊ณผ two-stream attention mechanism์„ ์ œ์•ˆ.
  • ๋‹ค์–‘ํ•œ NLP ํ…Œ์Šคํฌ์—์„œ ๊ธฐ์กด ๋Œ€๋น„ ์ƒ๋‹นํ•œ ํ–ฅ์ƒ์„ ๋ณด์ด๋ฉฐ state-of-the-art ์„ฑ๋Šฅ์„ ๋ณด์ž„.
1. Intro
  • Unsupervised representation learning using large amounts of corpus data has recently been studied actively.
  • Directly using representations obtained through pre-training (word2vec, ELMo, etc.) or fine-tuning a pre-trained model on a downstream task (GPT, BERT, etc.) has shown successful results.
  • Several objectives have been used in the pre-training stage as well; the two most representative ones are introduced below (AR + AE).

1-1) Autoregressive (AR)

  • The usual way a language model (LM) is trained: predict the next token given the previous tokens (the objective is written out below). ex) ELMo, GPT, RNN LMs, etc.
  • Prediction uses information from a single direction.
  • Since AR requires a fixed direction (forward or backward), only one side's context can be used.
  • It is therefore hard to deeply understand a sentence using bidirectional context.
  • ELMo does use both directions, but because it relies on two independently trained unidirectional models, only a shallow understanding is possible.

1-2) Auto-Encoding (AE)

  • An auto-encoder solves the problem of predicting its input exactly as given, while a denoising auto-encoder predicts the original input from a noise-corrupted input.
  • BERT, given an input sequence with artificially added noise ([MASK] tokens), tries to recover the original input tokens behind the [MASK] tokens.
  • It can therefore be seen as a denoising auto-encoder approach (the objective is written out below).

1-3) Problems with AR and AE

  1. AR
  • Can learn using information from only a single direction.
  2. AE
  • Since the [MASK] tokens are predicted independently (independence assumption), dependencies between those tokens cannot be learned.
  • Since [MASK] tokens never appear during fine-tuning, a mismatch between pre-training and fine-tuning can arise.

 

2. Proposed Method: XLNet
  • To compensate for the weaknesses above while keeping the strengths, three techniques are proposed:
  1. A new objective (permutation language modeling)
  2. A target-aware representation to make this objective work
  3. A new two-stream self-attention architecture for using the above together with the Transformer
1. Permutation Language Modeling Objective
  • A model that keeps the advantages of AR while also securing bidirectional context (apparently to exploit the advantage of AE).
  • It applies the AR approach over all permutations of the input sequence indices (orderings).
  • For an input sequence [x1, x2, x3, x4], there are 4! = 24 permutations of the indices, which can be written as Z_T = [[1,2,3,4], [1,2,4,3], [1,3,2,4], …, [4,3,2,1]].

→ Through the set of permutations, many different orderings of the sequence are considered

→ By plugging this into the AR objective function, the bidirectionality that (unidirectional) AR could not pursue becomes available: bidirectional context around a given token can be considered

โžก๏ธ AR ๋ฐฉ์‹์ด๋ฏ€๋กœ independent assumption(๋…๋ฆฝ ๊ฐ€์ •)์„ ํ•  ํ•„์š”๊ฐ€ ์—†๊ณ , [MASK] token์„ ์ด์šฉํ•˜์ง€ ์•Š์œผ๋ฏ€๋กœ, pre-training๊ณผ fine-tuning์‚ฌ์ด์˜ ๋ถˆ์ผ์น˜๋„ ์—†๊ณ  AE๋ฐฉ์‹์˜ ํ•œ๊ณ„๋ฅผ ๊ทน๋ณตํ•  ์ˆ˜ ์žˆ์Œ.

  • For example, suppose we want to predict token 3 (a runnable version of this example follows the list):
  1. [3,2,4,1] → token 3 comes first, so no other token's information is used
  2. [2,4,3,1] → in the order 2, 4, 3, 1, the information of tokens 2 and 4 is used
  3. [1,4,2,3] → the information of tokens 1, 4, and 2 is used
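A minimal sketch in plain Python (the helper name visible_context is mine) that reproduces the three cases and shows how, across all orders, token 3 gets every possible subset of the other tokens as context:

```python
from itertools import permutations

def visible_context(order, target):
    """Tokens that come before `target` in the factorization order `order`."""
    return order[: order.index(target)]

# The three example orders above, predicting token 3.
for order in ([3, 2, 4, 1], [2, 4, 3, 1], [1, 4, 2, 3]):
    print(order, "-> context for token 3:", visible_context(order, 3))

# Across all 4! = 24 orders, token 3 is conditioned on every subset of
# {1, 2, 4}; this is where the bidirectional context comes from.
contexts = {frozenset(visible_context(list(p), 3)) for p in permutations([1, 2, 3, 4])}
print(sorted(sorted(c) for c in contexts))
```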
2. Architecture: Two-Stream Self-Attention for Target-Aware Representations
  • However, the method proposed so far does not work with the standard Transformer parameterization → during training, the permutation makes it unclear which token is being predicted. For example, if the order is [1,2,3,4], x3 must be predicted from the information of x1 and x2.
  • And in the case of [1,2,4,3], when x4 must be predicted, it must also be predicted from the information of x1 and x2. In other words, the same representation has to predict different targets, so applying the scheme as-is smears the predictions together (training would not go well).

โžก๏ธ ์ด ๋ฌธ์ œ์ ์„ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด์„œ ์ด์ „์˜ context token๋“ค์˜ ์ •๋ณด (xz<t)๋ฟ๋งŒ ์•„๋‹ˆ๋ผ target index์˜ position ์ •๋ณด (zt)๋„ ํ•จ๊ป˜ ์ด์šฉํ•˜๋Š” ์ƒˆ๋กœ์šด Target Position-Aware Representation์„ ์ œ์•ˆํ•จ ⇒  hθ(xz<t)→gθ(xz<t,zt)

1. Two-Stream Self-Attention

→ The question remains of how to construct gθ, which additionally uses the target position. Two conditions on gθ must be considered (the update formulas follow the list):

  1. To predict the token x_{z_t} at target position z_t at a given step t, the hidden representation g(x_{z<t}, z_t) may use only the context information x_{z<t} from before step t and the target position information z_t. Only the position of z_t (the target) is used, never its content.
  2. To predict x_{z_j} at a later step j (> t), the hidden representation g(x_{z<t}, z_t) must encode the content x_{z_t} of step t → meaning it also has to carry the context of step t, as far as I understand.
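These two conditions contradict each other for a single hidden state, so XLNet keeps two streams per position: a query stream g (uses the position z_t but not the content x_{z_t}) and a content stream h (uses both). Each attention layer m updates them as follows (the formulas from the paper):

```latex
% Query stream: attends to the content of earlier positions z_{<t};
% for itself it only knows the position z_t, not the token x_{z_t}.
g_{z_t}^{(m)} \leftarrow \mathrm{Attention}\left(
  Q = g_{z_t}^{(m-1)},\;
  KV = \mathbf{h}_{\mathbf{z}_{<t}}^{(m-1)};\; \theta \right)

% Content stream: standard self-attention; may also see itself (z_{\le t}).
h_{z_t}^{(m)} \leftarrow \mathrm{Attention}\left(
  Q = h_{z_t}^{(m-1)},\;
  KV = \mathbf{h}_{\mathbf{z}_{\le t}}^{(m-1)};\; \theta \right)
```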
2. Query Representation
💡
Trained with the token information of the steps before t plus the position information of step t
  • A representation computed from the content of the tokens at the steps before the current one (excluding the current step) and the position information of the current step

ex) Under the factorization order 3 → 2 → 4 → 1 (a mask sketch follows the examples):

  1. Position 3

: trained with only the position information (the trainable weight w) for position 3

  2. Position 2

: trained with the position information of 2 and the token information of 3

  3. Position 4

: trained with the position information of 4 and the token information of 2 and 3

  4. Position 1

: trained with the position information of 1 and the token information of 2, 3, and 4

※ x1, x2, x3, x4 denote the embedded values of the tokens.

3. Content Representation (same as the original Transformer)
💡
Trained with the token information of step t and the steps before t
  • A representation computed from the content of the tokens at the current and previous steps
  • Plays the same role as the hidden state of a standard Transformer

ex) Under the same order 3 → 2 → 4 → 1 (mask sketch below):

  1. Position 3

: trained with only the token information of position 3

  2. Position 2

: trained with the token information of 2 and 3

  3. Position 4

: trained with the token information of 4, 2, and 3

  4. Position 1

: trained with the token information of 1, 2, 3, and 4

※ x1, x2, x3, x4 denote the embedded values of the tokens.

4. Partial Prediction
  • The objective introduced above performs maximum likelihood over orders from every permutation.
  • However, this causes slow convergence during training.
  • To overcome this optimization difficulty, the authors use only the last few predictions in a given order (formalized below).
  • ex) In the order 3 → 2 → 4 → 1, this means using only the last 2 tokens as prediction targets.
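Formally, each order z is split at a cutoff point c into a non-target part z≤c and a target part z>c, and only the target part is predicted; a hyperparameter K is chosen so that roughly 1/K of the tokens become targets (following the paper):

```latex
% Partial prediction: condition on the tokens of z_{\le c}, predict only
% the tokens of z_{>c} (the last |z| - c positions of the order).
\max_{\theta} \;
\mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \log p_{\theta}\left(\mathbf{x}_{\mathbf{z}_{>c}} \mid \mathbf{x}_{\mathbf{z}_{\le c}}\right) \right]
= \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
\left[ \sum_{t=c+1}^{|\mathbf{z}|} \log p_{\theta}\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
```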

 

3. Incorporating Ideas from Transformer-XL
  • To handle long sequences, XLNet borrows two techniques used in Transformer-XL (Dai et al., 2019).
  • The first is relative positional encoding, and the second is the segment recurrence mechanism.
1. Relative Positional Encoding
💡
A way of passing the relative positions between the words of a sentence to the model
  • The self-attention-based Transformer (Vaswani et al., 2017), unlike CNNs or RNNs, does not directly model the relative or absolute positions of words.
  • Instead, it models word order by adding a representation of each word's absolute position (absolute positional encoding) to the input.
  • However, while absolute positional encoding can express the meaning of a position within a single segment, it becomes a problem when doing recurrent modeling over multiple segments as in Transformer-XL: tokens in different segments can end up with the same absolute position, so absolute position information alone cannot sufficiently model the relationships between words.

ex)

"The cat sat on the mat"

  1. Relative distance:
    • The relative distance between "cat" and "sat" is 1. That is, the two words sit right next to each other in the sentence.
    • The relative distance between "on" and "mat" is 2, because the word "the" sits between them.
  2. Absolute position:
    • "The" is the first word of the sentence, so its absolute position is 1.
    • "mat" is the last word of the sentence, so its absolute position is 6.

Relative distance expresses the positional relationship between words, while absolute position expresses where each word sits within the sentence. Each kind of information can help in understanding sentence structure and the relationships between words.

<์ˆ˜์‹ ์ฐธ๊ณ >

1. Term (b)์™€ (d)์—์„œ ๊ธฐ์กด absolute positional embedding Uj ๋ฅผ relative positional embedding Ri−j๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค. R ์€ learnable parameters๊ฐ€ ์•„๋‹Œ sinusoid encoding matrix (Vaswani et al., 2017)์ž…๋‹ˆ๋‹ค.
2. Term (c) ์™€ (d) ์—์„œ UโŠคiWโŠคq ๋ฅผ ๊ฐ๊ฐ uโŠค∈Rd์™€ vโŠค∈Rd๋กœ ๋Œ€์ฒดํ•ฉ๋‹ˆ๋‹ค. Query vector๊ฐ€ ๋ชจ๋“  query position์— ๋Œ€ํ•ด ๊ฐ™๊ธฐ ๋•Œ๋ฌธ์—, ๋‹ค๋ฅธ ๋‹จ์–ด๋“ค์— ๋Œ€ํ•œ attention bias๊ฐ€ query position์— ์ƒ๊ด€์—†์ด ๋™์ผํ•˜๊ฒŒ ์œ ์ง€๋˜์–ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.
3. Wk ๋ฅผ Wk,E ์™€ Wk,R ๋กœ ๋ถ„๋ฆฌํ•ฉ๋‹ˆ๋‹ค. ์ด๋Š” content ๊ธฐ๋ฐ˜์˜ key vector์™€ location ๊ธฐ๋ฐ˜์˜ key vector๋ฅผ ๊ฐ๊ฐ ๋งŒ๋“ค์–ด๋‚ด๊ธฐ ์œ„ํ•œ ๊ฒƒ์ž…๋‹ˆ๋‹ค.

๊ฒฐ๊ณผ์ ์œผ๋กœ ๊ฐ term๋“ค์€ ๋‹ค์Œ์˜ ์ง๊ด€์ ์ธ ์˜๋ฏธ๋ฅผ ์ง€๋‹™๋‹ˆ๋‹ค: 1) Term (a)๋Š” content๋ฅผ ๊ธฐ๋ฐ˜์˜ ์ฒ˜๋ฆฌ๋ฅผ ํ•˜๊ณ , 2) (b)๋Š” content์— ์˜์กดํ•œ positional bias๋ฅผ ์žก์•„๋‚ด๊ณ , 3) (c)๋Š” global content bias๋ฅผ, 4) (d)๋Š” global positional bias๋ฅผ ์ธ์ฝ”๋”ฉํ•จ

2. Segment Recurrence Mechanism
💡
A mechanism that splits a long sequence into several segments and reuses the hidden states obtained from the previous segment in the next segment, so the model can make use of past state
  • For example, suppose two consecutive segments are taken from a long sequence.
  • After processing the first segment, the content representations obtained from it are stored in a cache.
  • Then, when processing the second segment, attention is computed using the content representations obtained from the previous segment.
  • In other words, whereas one might expect to have to keep the words sorted in order and track the factorization order, the segment recurrence mechanism stores things per segment: the content representations of a segment are kept in memory and can then be reused by later segments → this allows caching and reusing memory without considering the factorization order of past segments (the same point as above).
    1. "I like apples."
    2. "Apples are tasty."
    The first segment contains the sentence "I like apples" and the second contains "Apples are tasty". So when processing "Apples are tasty" in the second segment, the model can make use of the first segment's information, "I like apples". By reusing the information obtained from the previous segment, the relationship between the two segments can be learned.
  • In the permutation-based setting, after the first segment is processed, its content representations are stored in memory; when the second segment is processed, that memory can be used, cached and reused without needing to know the factorization order of the previous segment (a caching sketch follows).

 
