
[Transformer] Implementing the Architecture - 1 (PyTorch)

by ์ œ๋ฃฝ 2024. 2. 17.

https://arxiv.org/pdf/1706.03762.pdf

 

 

I had only ever read the Transformer paper; this is my first time taking it apart in code.

The authors of the paper really do seem like geniuses.

I implemented the code by following a YouTube tutorial, and this post focuses purely on the architecture.

The data part will be posted next week.

 


 

1. Implementing the Input Embedding

Note the use of math.sqrt below.

import torch
import torch.nn as nn
import math

# Input embedding
class InputEmbeddings(nn.Module):
    # d_model: embedding dimension, vocab_size: how many words the vocabulary holds
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.d_model = d_model
        self.vocab_size = vocab_size
        # Input embedding table (vocab size x embedding dimension)
        self.embedding = nn.Embedding(vocab_size, d_model)

    def forward(self, x):
        # Scale the embeddings by sqrt(d_model), as in the paper
        return self.embedding(x) * math.sqrt(self.d_model)
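As a quick sanity check (the vocabulary size, batch size, and sequence length below are arbitrary values for illustration), the layer maps integer token IDs to scaled d_model-dimensional vectors:

# Hypothetical sanity check with arbitrary sizes: token IDs -> d_model-dimensional vectors
embed = InputEmbeddings(d_model=512, vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 10))   # (batch, seq_len) of integer token IDs
print(embed(tokens).shape)                 # torch.Size([2, 10, 512])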

 

 

2. Implementing the Positional Encoding

See the (seq_len, d_model) dimensions below.
https://code-angie.tistory.com/9
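For reference, the encoding from the paper is PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), where pos is the token position and 2i / 2i+1 index the even and odd embedding dimensions. The code below fills a (seq_len, d_model) tensor with exactly these values.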

# Positional Encoding
class PositionalEncoding(nn.Module):

    # "-> None" is only a return-type annotation:
    # it documents the expected return type of the function (purely for readability)
    def __init__(self, d_model: int, seq_len: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.seq_len = seq_len
        self.dropout = nn.Dropout(dropout)

        # 1. Create an empty tensor of shape (seq_len, d_model)
        pe = torch.zeros(seq_len, d_model)
        # 2. Positions 0..seq_len-1 along the row direction (unsqueeze(dim=1))
        #    Each value is the position of a token in the sequence
        position = torch.arange(0, seq_len, dtype=torch.float).unsqueeze(1)
        # 3. Even indices 0, 2, 4, ... along the column direction (the 2i in the paper)
        _2i = torch.arange(0, d_model, 2, dtype=torch.float)
        # 4. Sine on the even columns, cosine on the odd columns
        #    (0::2 means "every second column starting at 0")
        pe[:, 0::2] = torch.sin(position / 10000 ** (_2i / d_model))
        pe[:, 1::2] = torch.cos(position / 10000 ** (_2i / d_model))

        # Add a batch dimension at position 0:
        # (seq_len, d_model) -> (1, seq_len, d_model)
        pe = pe.unsqueeze(0)

        # https://velog.io/@nawnoes/pytorch-%EB%AA%A8%EB%8D%B8%EC%9D%98-%ED%8C%8C%EB%9D%BC%EB%AF%B8%ED%84%B0%EB%A1%9C-%EB%93%B1%EB%A1%9D%ED%95%98%EC%A7%80-%EC%95%8A%EA%B8%B0-%EC%9C%84%ED%95%9C-registerbuffer
        # A buffer is a tensor that is not a learnable parameter of the model,
        # but whose state should still be saved and loaded together with the model.
        # register_buffer makes sure 'pe' is stored in and restored from the
        # model's state_dict (and moved to the right device with the model).
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add the positional encoding for each position to the input embeddings.
        # The encoding is fixed, so no gradients need to flow through it.
        x = x + (self.pe[:, :x.shape[1], :]).requires_grad_(False)
        return self.dropout(x)
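A quick sanity check (the sizes below are arbitrary) confirms that adding the positional encoding preserves the input shape:

# Hypothetical check: the positional encoding keeps the (batch, seq_len, d_model) shape
pos_enc = PositionalEncoding(d_model=512, seq_len=50, dropout=0.1)
x = torch.zeros(2, 10, 512)   # (batch, seq_len, d_model)
print(pos_enc(x).shape)       # torch.Size([2, 10, 512])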

 

3. Implementing Layer Normalization

#๋ ˆ์ด์–ด ์ •๊ทœํ™”   
class LayerNormalization(nn.Module):
    
    def __init__(self, eps: float = 10**-6) -> None : 
        super().__init__()
        self.eps = eps
        self.alpha = nn.Parameter(torch.ones(1)) #Multiplied
        self.bias = nn.Parameter(torch.zeros(1)) #Added
        
    def forward(self,x):
        mean = x.mean(dim =-1, keepdim=True)
        std = x.std(dim= -1, keepdim =True)
        return self.alpha * (x-mean) / (std+ self.eps) + self.bias
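Since alpha starts at 1 and bias at 0, the output of a freshly constructed layer should have roughly zero mean and unit standard deviation along the last dimension (a rough check with arbitrary sizes):

# Hypothetical check: normalized features have ~0 mean and ~1 std along the last dim
norm = LayerNormalization()
out = norm(torch.randn(2, 10, 512))
print(out.mean(dim=-1).abs().max())   # close to 0
print(out.std(dim=-1).mean())         # close to 1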

 

4. Implementing the Feed-Forward Block

# Feed forward
class FeedForwardBlock(nn.Module):

    def __init__(self, d_model: int, d_ff: int, dropout: float) -> None:
        super().__init__()
        # Inner (upward-projection) size of the feed-forward network (d_ff = 2048 in the paper)
        self.linear_1 = nn.Linear(d_model, d_ff)
        self.dropout = nn.Dropout(dropout)
        self.linear_2 = nn.Linear(d_ff, d_model)  # project back down to d_model (512)

    def forward(self, x):
        # (Batch, seq_len, d_model) -> (Batch, seq_len, d_ff) -> (Batch, seq_len, d_model)
        return self.linear_2(self.dropout(torch.relu(self.linear_1(x))))
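This is the position-wise network FFN(x) = max(0, xW1 + b1)W2 + b2 from the paper, applied independently at every position; a quick shape check with arbitrary sizes:

# Hypothetical check: the feed-forward block keeps the (batch, seq_len, d_model) shape
ff = FeedForwardBlock(d_model=512, d_ff=2048, dropout=0.1)
print(ff(torch.randn(2, 10, 512)).shape)   # torch.Size([2, 10, 512])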

 

5. Implementing Multi-Head Attention

https://www.youtube.com/watch?v=ISNdQcPhsts

# Multi-Head Attention
class MultiHeadAttentionBlock(nn.Module):

    # h: number of heads
    def __init__(self, d_model: int, h: int, dropout: float) -> None:
        super().__init__()
        self.d_model = d_model
        self.h = h
        # Raise an AssertionError (stop) if d_model is not divisible by h
        assert d_model % h == 0, 'd_model is not divisible by h'

        self.d_k = d_model // h
        self.w_q = nn.Linear(d_model, d_model)  # Wq
        self.w_k = nn.Linear(d_model, d_model)  # Wk
        self.w_v = nn.Linear(d_model, d_model)  # Wv

        # Wo, applied after the heads are concatenated
        self.w_o = nn.Linear(d_model, d_model)  # Wo
        self.dropout = nn.Dropout(dropout)

    # A static method behaves like a function declared outside the class:
    # it can be called without creating an instance,
    # e.g. MultiHeadAttentionBlock.attention(...)
    # Useful when the computation does not depend on any instance state.
    @staticmethod
    def attention(query, key, value, mask, dropout: nn.Dropout):
        d_k = query.shape[-1]  # query: (Batch, h, seq_len, d_k)

        # (Batch, h, Seq_len, d_k) -> (Batch, h, Seq_len, Seq_len)
        # Dot product between queries and keys, scaled by sqrt(d_k)
        attention_scores = (query @ key.transpose(-2, -1)) / math.sqrt(d_k)
        # If a mask is given, fill the masked positions (where the mask is 0)
        # with a very large negative value (-1e9) so they vanish after the softmax
        if mask is not None:
            attention_scores.masked_fill_(mask == 0, -1e9)
        # Softmax turns the attention scores into a probability distribution,
        # giving the attention weight for each position
        attention_scores = attention_scores.softmax(dim=-1)  # (Batch, h, seq_len, seq_len)
        # Apply dropout to the attention weights if provided
        if dropout is not None:
            attention_scores = dropout(attention_scores)
        # Return the attention-weighted values and the attention weights
        return (attention_scores @ value), attention_scores

    def forward(self, q, k, v, mask):
        # 1. Project Q, K, V
        query = self.w_q(q)  # (Batch, seq_len, d_model) -> (Batch, seq_len, d_model)
        key = self.w_k(k)
        value = self.w_v(v)

        # 2. Split Q, K, V into h heads
        # (Batch, seq_len, d_model) -> (Batch, Seq_len, h, d_k) -> (Batch, h, Seq_len, d_k)
        query = query.view(query.shape[0], query.shape[1], self.h, self.d_k).transpose(1, 2)
        key = key.view(key.shape[0], key.shape[1], self.h, self.d_k).transpose(1, 2)
        value = value.view(value.shape[0], value.shape[1], self.h, self.d_k).transpose(1, 2)

        x, self.attention_scores = MultiHeadAttentionBlock.attention(query, key, value, mask, self.dropout)

        # 3. Concatenate the heads
        # (Batch, h, Seq_len, d_k) -> (Batch, Seq_len, h, d_k) -> (Batch, Seq_len, d_model)
        # https://ebbnflow.tistory.com/351
        # contiguous(): ensures the tensor's values are laid out sequentially in memory
        # (needed before view after a transpose)
        x = x.transpose(1, 2).contiguous().view(x.shape[0], -1, self.h * self.d_k)  # -1 infers the remaining dimension

        # 4. (Batch, Seq_len, d_model) -> (Batch, seq_len, d_model)
        return self.w_o(x)
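As a rough self-attention check (arbitrary sizes; the all-ones mask here is just an illustrative padding mask that hides nothing):

# Hypothetical check: self-attention keeps the (batch, seq_len, d_model) shape
mha = MultiHeadAttentionBlock(d_model=512, h=8, dropout=0.1)
x = torch.randn(2, 10, 512)
mask = torch.ones(2, 1, 1, 10)     # (batch, 1, 1, seq_len) padding mask, nothing hidden
print(mha(x, x, x, mask).shape)    # torch.Size([2, 10, 512])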

 

6. Implementing the Residual Connection

# Residual (skip) connection
class ResidualConnection(nn.Module):

    def __init__(self, dropout: float) -> None:
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.norm = LayerNormalization()

    # sublayer: the layer being wrapped (self-attention or feed-forward),
    # passed in as a callable and applied to the normalized input (pre-norm style)
    def forward(self, x, sublayer):
        return x + self.dropout(sublayer(self.norm(x)))
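Because sublayer is just a callable, an attention block can be wrapped by passing a lambda that fixes its q, k, v and mask arguments, which is exactly how the encoder and decoder blocks below use it (arbitrary sizes, no mask in this sketch):

# Hypothetical usage: wrap a self-attention call in a residual connection
residual = ResidualConnection(dropout=0.1)
mha = MultiHeadAttentionBlock(d_model=512, h=8, dropout=0.1)
x = torch.randn(2, 10, 512)
print(residual(x, lambda y: mha(y, y, y, None)).shape)   # torch.Size([2, 10, 512])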

7. Implementing the EncoderBlock

class EncoderBlock(nn.Module):

    def __init__(self, self_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        self.self_attention_block = self_attention_block
        self.feed_forward_block = feed_forward_block
        # Two residual (skip) connections: one around self-attention, one around the feed-forward block
        self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(2)])

    def forward(self, x, src_mask):
        # Skip connection around the self-attention sublayer
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, src_mask))
        # Skip connection around the feed-forward sublayer
        x = self.residual_connections[1](x, self.feed_forward_block)
        return x

8. Implementing the Encoder

class Encoder(nn.Module):
    
    def __init__(self, layers : nn.ModuleList) -> None : 
        super().__init__()
        self.layers = layers
        self.norm = LayerNormalization()
        
    def forward(self,x,mask):
        for layer in self.layers:
            x = layer(x,mask)
        return self.norm(x)

 

9. Implementing the DecoderBlock

# DecoderBlock
class DecoderBlock(nn.Module):

    def __init__(self, self_attention_block: MultiHeadAttentionBlock, cross_attention_block: MultiHeadAttentionBlock, feed_forward_block: FeedForwardBlock, dropout: float) -> None:
        super().__init__()
        self.self_attention_block = self_attention_block
        self.cross_attention_block = cross_attention_block
        self.feed_forward_block = feed_forward_block
        # Three residual connections: self-attention, cross-attention, feed-forward
        self.residual_connections = nn.ModuleList([ResidualConnection(dropout) for _ in range(3)])

    # tgt_mask: causal mask that hides the words after the current position in the decoder
    # src_mask: binary mask over the encoder output, 0 at padding tokens and 1 at real tokens
    #           (padding lets sequences share a common length while the mask keeps the
    #            attention from wasting computation on the padded positions)
    def forward(self, x, encoder_output, src_mask, tgt_mask):
        x = self.residual_connections[0](x, lambda x: self.self_attention_block(x, x, x, tgt_mask))
        # Cross-attention: queries come from the decoder, keys/values from the encoder output
        x = self.residual_connections[1](x, lambda x: self.cross_attention_block(x, encoder_output, encoder_output, src_mask))
        x = self.residual_connections[2](x, self.feed_forward_block)
        return x
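For reference, the causal tgt_mask described above can be built with torch.triu; this small helper is not part of the post's code, just an illustrative sketch:

# Hypothetical helper: causal mask of shape (1, seq_len, seq_len);
# True means "visible", False means "hidden" (a future position)
def causal_mask(seq_len: int):
    upper = torch.triu(torch.ones(1, seq_len, seq_len), diagonal=1).int()
    return upper == 0

print(causal_mask(4))
# tensor([[[ True, False, False, False],
#          [ True,  True, False, False],
#          [ True,  True,  True, False],
#          [ True,  True,  True,  True]]])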

10. Implementing the Decoder

#Decoder
class Decoder(nn.Module):
    
    def __init__(self,layers: nn.ModuleList) -> None:
        super().__init__()
        self.layers = layers
        self.norm  = LayerNormalization()
        
    def forward(self,x,encoder_output,src_mask,tgt_mask):
        for layer in self.layers:
            x = layer(x,encoder_output,src_mask,tgt_mask)
        return self.norm(x)

11. Implementing the ProjectionLayer

#ProjectionLayer
class ProjectionLayer(nn.Module):
    
    def __init__(self,d_model : int, vocab_size : int) -> None:
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)
        
    def forward(self,x):
        #(Batch, seq_len,d_model) -> (Batch,seq_len,vocab_size)
        return torch.log_softmax(self.proj(x), dim=-1)
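Since the projection already applies log_softmax, its output is log-probabilities over the vocabulary (so it would pair with nn.NLLLoss rather than nn.CrossEntropyLoss during training). A quick check with an arbitrary vocabulary size:

# Hypothetical check: projection maps d_model to log-probabilities over the vocabulary
proj = ProjectionLayer(d_model=512, vocab_size=1000)
logp = proj(torch.randn(2, 10, 512))
print(logp.shape)                   # torch.Size([2, 10, 1000])
print(logp.exp().sum(dim=-1)[0, 0]) # ~1.0, since these are log-probabilities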

12. Implementing the Transformer

#Transformer
class Transformer(nn.Module):
    
    def __init__(self,encoder :Encoder, decoder : Decoder, src_embed : InputEmbeddings, tgt_embed : InputEmbeddings, src_pos : PositionalEncoding, tgt_pos :PositionalEncoding, projection_layer : ProjectionLayer) -> None: 
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = src_embed
        self.tgt_embed = tgt_embed
        self.src_pos = src_pos
        self.tgt_pos = tgt_pos
        self.projection_layer = projection_layer
        
        
    def encode(self, src,src_mask):
        src = self.src_embed(src)
        src = self.src_pos(src)
        return self.encoder(src,src_mask)
    
    def decode(self,encoder_output,src_mask,tgt,tgt_mask):
        tgt = self.tgt_embed(tgt)
        tgt = self.tgt_pos(tgt)
        return self.decoder(tgt, encoder_output, src_mask, tgt_mask)
    
    def project(self,x):
        return self.projection_layer(x)
    
def build_transformer(src_vocab_size : int, tgt_vocab_size : int, src_seq_len : int, tgt_seq_len : int, d_model : int=512, N:int = 6, h : int = 8, dropout : float=0.1, d_ff : int=2048 ) -> Transformer:
    #Create Embedding layers
    src_embed = InputEmbeddings(d_model, src_vocab_size)   
    tgt_embed = InputEmbeddings(d_model, tgt_vocab_size)
    
    # Create positional encoding layers
    src_pos = PositionalEncoding(d_model,src_seq_len,dropout)
    tgt_pos = PositionalEncoding(d_model, tgt_seq_len, dropout)
    
    #Create encoder blocks
    encoder_blocks=[]
    for _ in range(N):
        encoder_self_attention_block = MultiHeadAttentionBlock(d_model,h,dropout)
        feed_forward_block = FeedForwardBlock(d_model,d_ff, dropout)
        encoder_block = EncoderBlock(encoder_self_attention_block, feed_forward_block, dropout)
        encoder_blocks.append(encoder_block)
        
    #Create decoder blocks
    decoder_blocks=[]
    for _ in range(N):
        decoder_self_attention_block = MultiHeadAttentionBlock(d_model,h,dropout)
        decoder_cross_attention_block = MultiHeadAttentionBlock(d_model,h,dropout)
        feed_forward_block = FeedForwardBlock(d_model,d_ff, dropout)
        decoder_block = DecoderBlock(decoder_self_attention_block, decoder_cross_attention_block, feed_forward_block, dropout)
        decoder_blocks.append(decoder_block)

    #Create encoder and decoder
    encoder = Encoder(nn.ModuleList(encoder_blocks))
    decoder = Decoder(nn.ModuleList(decoder_blocks))
    
    #Create projection layer
    projection_layer = ProjectionLayer(d_model, tgt_vocab_size)
    
    #Create the transformer
    transformer = Transformer(encoder, decoder, src_embed,tgt_embed, src_pos, tgt_pos, projection_layer)
    
    # Initialize the parameters (Xavier uniform for the weight matrices)
    for p in transformer.parameters():
        if p.dim() >1 : 
            nn.init.xavier_uniform_(p)
            
    return transformer
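Putting it all together, a minimal end-to-end sketch (the vocabulary sizes, sequence lengths, and dummy token IDs are arbitrary, and masks are omitted; real data and masking are planned for the next post):

# Hypothetical end-to-end check with dummy token IDs and no masking
model = build_transformer(src_vocab_size=1000, tgt_vocab_size=1000,
                          src_seq_len=50, tgt_seq_len=50)

src = torch.randint(0, 1000, (2, 10))   # (batch, src_seq_len)
tgt = torch.randint(0, 1000, (2, 10))   # (batch, tgt_seq_len)

encoder_output = model.encode(src, src_mask=None)
decoder_output = model.decode(encoder_output, None, tgt, None)
print(model.project(decoder_output).shape)   # torch.Size([2, 10, 1000])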

 

References

https://www.youtube.com/watch?v=ISNdQcPhsts

https://code-angie.tistory.com/9

 


https://code-angie.tistory.com/7#3-position-wise-fully-connected-feed-forward-network

 


๊ฐ์‚ฌํ•ฉ๋‹ˆ๋‹ค.
