
GPT-1

by 제룽 2023. 7. 5.

1. Intro
  • Unlabeled text data is abundant.
  • Labeled data, by contrast, is scarce.
  • As a result, it is hard to train a model to perform a given task well from labeled data alone.
  • The idea that came out of this: first train the model on unlabeled data, then retrain it on data that has labels.
2. Overall architecture
  • The approach consists of two stages: unsupervised pre-training followed by supervised fine-tuning.
3. Unsupervised pre-training
  1. Unlabeled text is fed in as the input.
  2. The tokens are embedded and positional information is added (GPT-1 uses learned position embeddings).
  3. The result passes through 12 layers modeled on the Transformer decoder, keeping its masked self-attention.
  4. The outputs of these layers are h1, h2, …, hm.
  5. These values go through a linear layer: they are multiplied by WeT (the transpose of the word embedding matrix), which maps each hidden vector back to the vocabulary size.
  6. A softmax over the result gives the final output, a probability distribution over the next token.
  7. L1(U) is the objective for learning the distribution of words: take the log of the softmax probability assigned to each actual next token and sum over the corpus U.
  8. This whole process is pre-training.
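Steps 5 to 7 can be sketched in a few lines of NumPy. This is only a minimal illustration, not the paper's implementation: the 12 decoder layers are replaced by random numbers standing in for h, and the sizes (`vocab`, `d`, `m`) and the helper name `lm_log_likelihood` are assumptions made up for this example.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def lm_log_likelihood(h, W_e, targets):
    """Sum of log P(next token) over positions, i.e. L1 for one sequence.

    h:       (m, d) outputs of the final decoder layer
    W_e:     (vocab, d) word embedding matrix, reused for the output projection
    targets: (m,) id of the token that actually follows each position
    """
    logits = h @ W_e.T                  # (m, vocab): back to vocabulary size
    probs = softmax(logits)             # distribution over the next token
    picked = probs[np.arange(len(targets)), targets]
    return np.log(picked).sum()

# Toy sizes, chosen only for illustration (assumptions, not GPT-1's values).
rng = np.random.default_rng(0)
vocab, d, m = 10, 4, 5
W_e = rng.normal(size=(vocab, d))
h = rng.normal(size=(m, d))             # stand-in for the decoder output
targets = rng.integers(0, vocab, size=m)

L1 = lm_log_likelihood(h, W_e, targets)
```

Pre-training maximizes this quantity, which is the same as minimizing the usual cross-entropy loss.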
4. Supervised fine-tuning
  1. This stage uses labeled data (a different dataset from the unlabeled pre-training data).
  2. Take the model produced by unsupervised pre-training and feed the labeled data in as input.
  3. Passing the labeled data through the pre-trained model updates the model's parameters.
  4. The resulting h value is multiplied by Wy, a new parameter learned during fine-tuning that maps the hidden size to the number of labels.
  5. The product of h and Wy is passed through a softmax.
  6. L2(C): take the log of the softmax probability of the correct label (step 5) and sum over all examples in the labeled dataset C.
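The fine-tuning head described above is small enough to write out. The following is a sketch under assumptions: `h_last` is random data standing in for the pre-trained model's last-token output, and the sizes and function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classify(h_last, W_y):
    """P(y | x) = softmax(h_last @ W_y).

    h_last: (batch, d) last-token output of the pre-trained model
    W_y:    (d, n_classes) the new linear layer learned during fine-tuning
    """
    return softmax(h_last @ W_y)

def supervised_log_likelihood(h_last, W_y, labels):
    """L2 for a batch: sum of log-probabilities of the correct labels."""
    probs = classify(h_last, W_y)
    return np.log(probs[np.arange(len(labels)), labels]).sum()

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d, n_classes, batch = 4, 3, 6
h_last = rng.normal(size=(batch, d))
W_y = rng.normal(size=(d, n_classes))
labels = rng.integers(0, n_classes, size=batch)

L2 = supervised_log_likelihood(h_last, W_y, labels)
```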
5. L3(C)
  1. L3(C) is optimized by maximum likelihood: finding which distribution the data follows (the distribution of words).
  2. L3(C) = L2(C) + λ * L1(C), where L1(C) is the same language-modeling objective as L1(U), computed on the labeled dataset C.
  3. λ is a weight that we choose ourselves (a hyperparameter).
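As a small worked example of the combined objective, the function below just adds the two terms. The name `combined_objective` and the default weight of 0.5 are assumptions for illustration; the notes above only say that the weight is chosen by hand.

```python
def combined_objective(l2_c, l1_c, lam=0.5):
    """L3(C) = L2(C) + lam * L1(C).

    l2_c: supervised log-likelihood on the labeled dataset C
    l1_c: language-modeling log-likelihood on the same dataset C
    lam:  hand-picked weight of the auxiliary language-modeling term
    """
    return l2_c + lam * l1_c

# Made-up log-likelihood values, only to show the arithmetic.
L3 = combined_objective(-2.0, -10.0, lam=0.5)   # -2.0 + 0.5 * (-10.0)
```

Setting `lam` to 0 drops the language-modeling term and leaves plain supervised fine-tuning.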
6. Task Specific Input Transformation
  1. Classification: the sentence to classify is passed through GPT-1 as is, and the output at the last token (the position where the </s> token would be generated) is fed to the classification layer.
  2. Entailment: the input takes the form <s>Premise$Hypothesis</s>, and the last token's output is fed to a linear layer.
  3. Similarity: unlike entailment, there is no rule for the order of the two sentences being compared. Both orderings are therefore run through the model, and the two last-token outputs are summed element-wise and fed to a linear layer.
  4. Multiple Choice: the context comes first, followed by one candidate answer; this is repeated for each candidate, and the last-token outputs go through a linear layer and then a softmax.
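The four input formats can be sketched as plain list operations on tokens. This is a hedged illustration: the function names are hypothetical, the end token is written `</s>`, and placing the `$` delimiter between context and answer in the multiple-choice case is an assumption carried over from the entailment format.

```python
START, DELIM, END = "<s>", "$", "</s>"

def classification_input(text):
    """One sequence: the text wrapped in start and end tokens."""
    return [[START, *text, END]]

def entailment_input(premise, hypothesis):
    """One sequence: premise and hypothesis joined by the delimiter."""
    return [[START, *premise, DELIM, *hypothesis, END]]

def similarity_input(a, b):
    """Two sequences, one per ordering, since sentence order carries no meaning."""
    return [
        [START, *a, DELIM, *b, END],
        [START, *b, DELIM, *a, END],
    ]

def multiple_choice_input(context, answers):
    """One sequence per candidate answer, each starting with the context."""
    return [[START, *context, DELIM, *ans, END] for ans in answers]

# Each function returns a list of token sequences to run through the model.
pairs = similarity_input(["a", "cat"], ["a", "dog"])
```

Every sequence is read out at its last token; for similarity the two outputs are summed, and for multiple choice the per-candidate scores are normalized with a softmax.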

⇒ The last token's output is used for classification because information in a decoder flows only from earlier positions to later ones. Only the last token has seen every token in the sentence, so its output is the one used for classification.
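The one-directional flow of information comes from the causal mask used in masked self-attention. A minimal sketch (the helper name `causal_mask` is hypothetical) shows that only the last position can attend to every token:

```python
import numpy as np

def causal_mask(m):
    """mask[i, j] is True when position i may attend to position j (j <= i)."""
    return np.tril(np.ones((m, m), dtype=bool))

mask = causal_mask(4)
visible_counts = mask.sum(axis=1)   # how many tokens each position can see
```

For a 4-token input the first position sees only itself, while the last row of the mask is entirely True.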

7. Advantages
  1. Almost no change to the model architecture ⇒ earlier pre-trained models had to be restructured for each task at fine-tuning time. GPT leaves the Transformer itself untouched, which makes retraining easy.
  2. Very few added parameters ⇒ because the architecture is unchanged, fine-tuning only adds Wy and the embeddings for the delimiter tokens.
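To get a feel for how few parameters are added, here is a rough count. It is a sketch under assumptions that are not in the notes above: a hidden size of 768, a linear layer without a bias term, and three new special tokens (start, delimiter, end).

```python
def added_finetune_params(d_model, n_classes, n_new_tokens):
    """Parameters added at fine-tuning: Wy plus embeddings for new tokens."""
    w_y = d_model * n_classes               # linear output layer, no bias assumed
    new_embeddings = n_new_tokens * d_model
    return w_y + new_embeddings

# Binary classification with an assumed hidden size of 768.
extra = added_finetune_params(d_model=768, n_classes=2, n_new_tokens=3)
```

Under these assumptions the count comes to a few thousand parameters, which is negligible next to the pre-trained Transformer.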

 
