0. Abstract
Figure 1: Given any two unordered image collections X and Y , our algorithm learns to automatically “translate” an image from one into the other and vice versa: (left) Monet paintings and landscape photos from Flickr; (center) zebras and horses from ImageNet; (right) summer and winter Yosemite photos from Flickr. Example application (bottom): using a collection of paintings of famous artists, our method learns to render natural photographs into the respective styles.
Image-to-image translation is a class of vision and graphics problems where the goal is to learn the mapping between an input image and an output image using a training set of aligned image pairs. However, for many tasks, paired training data will not be available. We present an approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples. Our goal is to learn a mapping G : X → Y such that the distribution of images from G(X) is indistinguishable from the distribution Y using an adversarial loss. Because this mapping is highly under-constrained, we couple it with an inverse mapping F : Y → X and introduce a cycle consistency loss to enforce F(G(X)) ≈ X (and vice versa). Qualitative results are presented on several tasks where paired training data does not exist, including collection style transfer, object transfiguration, season transfer, photo enhancement, etc. Quantitative comparisons against several prior methods demonstrate the superiority of our approach.
1. Introduction
Figure 2: Paired training data (left) consists of training examples $\{x_i, y_i\}_{i=1}^{N}$, where the correspondence between $x_i$ and $y_i$ exists [22]. We instead consider unpaired training data (right), consisting of a source set $\{x_i\}_{i=1}^{N}$ ($x_i \in X$) and a target set $\{y_j\}_{j=1}^{M}$ ($y_j \in Y$), with no information provided as to which $x_i$ matches which $y_j$.
What did Claude Monet see as he placed his easel by the bank of the Seine near Argenteuil on a lovely spring day in 1873 (Figure 1, top-left)? A color photograph, had it been invented, may have documented a crisp blue sky and a glassy river reflecting it. Monet conveyed his impression of this same scene through wispy brush strokes and a bright palette.
What if Monet had happened upon the little harbor in Cassis on a cool summer evening (Figure 1, bottom-left)? A brief stroll through a gallery of Monet paintings makes it possible to imagine how he would have rendered the scene: perhaps in pastel shades, with abrupt dabs of paint, and a somewhat flattened dynamic range. We can imagine all this despite never having seen a side-by-side example of a Monet painting next to a photo of the scene he painted. Instead, we have knowledge of the set of Monet paintings and of the set of landscape photographs. We can reason about the stylistic differences between these two sets, and thereby imagine what a scene might look like if we were to “translate” it from one set into the other.
In this paper, we present a method that can learn to do the same: capturing special characteristics of one image collection and figuring out how these characteristics could be translated into the other image collection, all in the absence of any paired training examples. This problem can be more broadly described as image-to-image translation [22], converting an image from one representation of a given scene, x, to another, y, e.g., grayscale to color, image to semantic labels, edge-map to photograph. Years of research in computer vision, image processing, computational photography, and graphics have produced powerful translation systems in the supervised setting, where example image pairs $\{x_i, y_i\}_{i=1}^{N}$ are available (Figure 2, left), e.g., [11, 19, 22, 23, 28, 33, 45, 56, 58, 62]. However, obtaining paired training data can be difficult and expensive. For example, only a couple of datasets exist for tasks like semantic segmentation (e.g., [4]), and they are relatively small. Obtaining input-output pairs for graphics tasks like artistic stylization can be even more difficult since the desired output is highly complex, typically requiring artistic authoring. For many tasks, like object transfiguration (e.g., zebra↔horse, Figure 1, top-middle), the desired output is not even well-defined.
We therefore seek an algorithm that can learn to translate between domains without paired input-output examples (Figure 2, right). We assume there is some underlying relationship between the domains – for example, that they are two different renderings of the same underlying scene – and seek to learn that relationship. Although we lack supervision in the form of paired examples, we can exploit supervision at the level of sets: we are given one set of images in domain X and a different set in domain Y. We may train a mapping G : X → Y such that the output ŷ = G(x), x ∈ X, is indistinguishable from images y ∈ Y by an adversary trained to classify ŷ apart from y. In theory, this objective can induce an output distribution over ŷ that matches the empirical distribution p_data(y) (in general, this requires G to be stochastic) [16]. The optimal G thereby translates the domain X to a domain Ŷ distributed identically to Y. However, such a translation does not guarantee that an individual input x and output y are paired up in a meaningful way – there are infinitely many mappings G that will induce the same distribution over ŷ. Moreover, in practice, we have found it difficult to optimize the adversarial objective in isolation: standard procedures often lead to the well-known problem of mode collapse, where all input images map to the same output image and the optimization fails to make progress [15].
These issues call for adding more structure to our objective. Therefore, we exploit the property that translation should be “cycle consistent”, in the sense that if we translate, e.g., a sentence from English to French, and then translate it back from French to English, we should arrive back at the original sentence [3]. Mathematically, if we have a translator G : X → Y and another translator F : Y → X, then G and F should be inverses of each other, and both mappings should be bijections. We apply this structural assumption by training both the mapping G and F simultaneously, and adding a cycle consistency loss [64] that encourages F(G(x)) ≈ x and G(F(y)) ≈ y. Combining this loss with adversarial losses on domains X and Y yields our full objective for unpaired image-to-image translation.
We apply our method to a wide range of applications, including collection style transfer, object transfiguration, season transfer and photo enhancement. We also compare against previous approaches that rely either on hand-defined factorizations of style and content, or on shared embedding functions, and show that our method outperforms these baselines. We provide both PyTorch and Torch implementations. Check out more results at our website.
2. Related Work
Figure 3: (a) Our model contains two mapping functions G : X → Y and F : Y → X, and associated adversarial discriminators DY and DX. DY encourages G to translate X into outputs indistinguishable from domain Y, and vice versa for DX and F. To further regularize the mappings, we introduce two cycle consistency losses that capture the intuition that if we translate from one domain to the other and back again we should arrive at where we started: (b) forward cycle-consistency loss: x → G(x) → F(G(x)) ≈ x, and (c) backward cycle-consistency loss: y → F(y) → G(F(y)) ≈ y.
Generative Adversarial Networks (GANs)
[16, 63] have achieved impressive results in image generation [6, 39], image editing [66], and representation learning [39, 43, 37]. Recent methods adopt the same idea for conditional image generation applications, such as text2image [41], image inpainting [38], and future prediction [36], as well as to other domains like videos [54] and 3D data [57]. The key to GANs’ success is the idea of an adversarial loss that forces the generated images to be, in principle, indistinguishable from real photos. This loss is particularly powerful for image generation tasks, as this is exactly the objective that much of computer graphics aims to optimize. We adopt an adversarial loss to learn the mapping such that the translated images cannot be distinguished from images in the target domain.
Image-to-Image Translation
The idea of image-to-image translation goes back at least to Hertzmann et al.’s Image Analogies [19], who employ a non-parametric texture model [10] on a single input-output training image pair. More recent approaches use a dataset of input-output examples to learn a parametric translation function using CNNs (e.g., [33]). Our approach builds on the “pix2pix” framework of Isola et al. [22], which uses a conditional generative adversarial network [16] to learn a mapping from input to output images. Similar ideas have been applied to various tasks such as generating photographs from sketches [44] or from attribute and semantic layouts [25]. However, unlike the above prior work, we learn the mapping without paired training examples.
Unpaired Image-to-Image Translation
Several other methods also tackle the unpaired setting, where the goal is to relate two data domains: X and Y. Rosales et al. [42] propose a Bayesian framework that includes a prior based on a patch-based Markov random field computed from a source image and a likelihood term obtained from multiple style images. More recently, CoGAN [32] and cross-modal scene networks [1] use a weight-sharing strategy to learn a common representation across domains. Concurrent to our method, Liu et al. [31] extends the above framework with a combination of variational autoencoders [27] and generative adversarial networks [16]. Another line of concurrent work [46, 49, 2] encourages the input and output to share specific “content” features even though they may differ in “style”. These methods also use adversarial networks, with additional terms to enforce the output to be close to the input in a predefined metric space, such as class label space [2], image pixel space [46], and image feature space [49]. Unlike the above approaches, our formulation does not rely on any task-specific, predefined similarity function between the input and output, nor do we assume that the input and output have to lie in the same low-dimensional embedding space. This makes our method a general-purpose solution for many vision and graphics tasks. We directly compare against several prior and contemporary approaches in Section 5.1.
Cycle Consistency
The idea of using transitivity as a way to regularize structured data has a long history. In visual tracking, enforcing simple forward-backward consistency has been a standard trick for decades [24, 48]. In the language domain, verifying and improving translations via “back translation and reconciliation” is a technique used by human translators [3] (including, humorously, by Mark Twain [51]), as well as by machines [17]. More recently, higher-order cycle consistency has been used in structure from motion [61], 3D shape matching [21], cosegmentation [55], dense semantic alignment [65, 64], and depth estimation [14]. Of these, Zhou et al. [64] and Godard et al. [14] are most similar to our work, as they use a cycle consistency loss as a way of using transitivity to supervise CNN training. In this work, we are introducing a similar loss to push G and F to be consistent with each other. Concurrent with our work, in these same proceedings, Yi et al. [59] independently use a similar objective for unpaired image-to-image translation, inspired by dual learning in machine translation [17].
Neural Style Transfer
[13, 23, 52, 12] is another way to perform image-to-image translation, which synthesizes a novel image by combining the content of one image with the style of another image (typically a painting) based on matching the Gram matrix statistics of pre-trained deep features. Our primary focus, on the other hand, is learning the mapping between two image collections, rather than between two specific images, by trying to capture correspondences between higher-level appearance structures. Therefore, our method can be applied to other tasks, such as painting→photo, object transfiguration, etc. where single sample transfer methods do not perform well. We compare these two methods in Section 5.2.
3. Formulation
Figure 4: The input images x, output images G(x) and the reconstructed images F(G(x)) from various experiments. From top to bottom: photo↔Cézanne, horses↔zebras, winter→summer Yosemite, aerial photos↔Google maps.
Our goal is to learn mapping functions between two domains X and Y given training samples $\{x_i\}_{i=1}^{N}$ where $x_i \in X$ and $\{y_j\}_{j=1}^{M}$ where $y_j \in Y$. We denote the data distribution as $x \sim p_{\mathrm{data}}(x)$ and $y \sim p_{\mathrm{data}}(y)$. As illustrated in Figure 3 (a), our model includes two mappings G : X → Y and F : Y → X. In addition, we introduce two adversarial discriminators DX and DY, where DX aims to distinguish between images {x} and translated images {F(y)}; in the same way, DY aims to discriminate between {y} and {G(x)}. Our objective contains two types of terms: adversarial losses [16] for matching the distribution of generated images to the data distribution in the target domain; and cycle consistency losses to prevent the learned mappings G and F from contradicting each other.
3.1. Adversarial Loss
We apply adversarial losses [16] to both mapping functions. For the mapping function G : X → Y and its discriminator DY, we express the objective as:

$$\mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) = \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log(1 - D_Y(G(x)))] \tag{1}$$

where G tries to generate images G(x) that look similar to images from domain Y, while DY aims to distinguish between translated samples G(x) and real samples y. G aims to minimize this objective against an adversary D that tries to maximize it, i.e., $\min_G \max_{D_Y} \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y)$. We introduce a similar adversarial loss for the mapping function F : Y → X and its discriminator DX as well: i.e., $\min_F \max_{D_X} \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X)$.
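As a concrete reference (not part of the paper itself), here is a minimal PyTorch sketch of Equation 1, assuming hypothetical generator and discriminator modules `G` and `D_Y` where `D_Y` returns raw logits; note that the paper later replaces this log-likelihood form with a least-squares loss (Section 4.2).

```python
import torch
import torch.nn.functional as F_nn  # aliased to avoid clashing with the mapping F

def adversarial_losses(G, D_Y, real_x, real_y):
    """Minimal sketch of Eq. 1 for the mapping G: X -> Y and discriminator D_Y.

    Binary cross-entropy with logits implements log D_Y(y) and
    log(1 - D_Y(G(x))) in a numerically stable way.
    """
    fake_y = G(real_x)

    # Discriminator side: label real y as 1 and translated G(x) as 0.
    logits_real = D_Y(real_y)
    logits_fake = D_Y(fake_y.detach())  # detach so D's step does not backprop into G
    d_loss = (F_nn.binary_cross_entropy_with_logits(logits_real, torch.ones_like(logits_real))
              + F_nn.binary_cross_entropy_with_logits(logits_fake, torch.zeros_like(logits_fake)))

    # Generator side: try to make D_Y label G(x) as real
    # (the usual non-saturating surrogate for min_G log(1 - D_Y(G(x)))).
    logits_gen = D_Y(fake_y)
    g_loss = F_nn.binary_cross_entropy_with_logits(logits_gen, torch.ones_like(logits_gen))
    return g_loss, d_loss
```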
3.2. Cycle Consistency Loss
Adversarial training can, in theory, learn mappings G and F that produce outputs identically distributed as target domains Y and X respectively (strictly speaking, this requires G and F to be stochastic functions) [15]. However, with large enough capacity, a network can map the same set of input images to any random permutation of images in the target domain, where any of the learned mappings can induce an output distribution that matches the target distribution. Thus, adversarial losses alone cannot guarantee that the learned function can map an individual input $x_i$ to a desired output $y_i$. To further reduce the space of possible mapping functions, we argue that the learned mapping functions should be cycle-consistent: as shown in Figure 3 (b), for each image x from domain X, the image translation cycle should be able to bring x back to the original image, i.e., x → G(x) → F(G(x)) ≈ x. We call this forward cycle consistency. Similarly, as illustrated in Figure 3 (c), for each image y from domain Y, G and F should also satisfy backward cycle consistency: y → F(y) → G(F(y)) ≈ y. We incentivize this behavior using a cycle consistency loss:

$$\mathcal{L}_{\mathrm{cyc}}(G, F) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\|G(F(y)) - y\|_1] \tag{2}$$
In preliminary experiments, we also tried replacing the L1 norm in this loss with an adversarial loss between F(G(x)) and x, and between G(F(y)) and y, but did not observe improved performance.
The behavior induced by the cycle consistency loss can be observed in Figure 4: the reconstructed images F(G(x)) end up matching closely to the input images x.
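For concreteness, a minimal PyTorch rendering of Equation 2, assuming hypothetical generator modules `G` and `F_net` (with `F_net` playing the role of F); `l1_loss` averages the absolute differences over pixels, matching the L1 terms in Eq. 2 up to a constant scaling.

```python
import torch.nn.functional as F_nn

def cycle_consistency_loss(G, F_net, real_x, real_y):
    """Sketch of Eq. 2: L1 reconstruction error of both translation cycles."""
    # Forward cycle: x -> G(x) -> F(G(x)) should come back to x.
    forward_cyc = F_nn.l1_loss(F_net(G(real_x)), real_x)
    # Backward cycle: y -> F(y) -> G(F(y)) should come back to y.
    backward_cyc = F_nn.l1_loss(G(F_net(real_y)), real_y)
    return forward_cyc + backward_cyc
```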
3.3. Full Objective
Our full objective is:

$$\mathcal{L}(G, F, D_X, D_Y) = \mathcal{L}_{\mathrm{GAN}}(G, D_Y, X, Y) + \mathcal{L}_{\mathrm{GAN}}(F, D_X, Y, X) + \lambda \mathcal{L}_{\mathrm{cyc}}(G, F) \tag{3}$$

where λ controls the relative importance of the two objectives. We aim to solve:

$$G^{*}, F^{*} = \arg\min_{G, F} \max_{D_X, D_Y} \mathcal{L}(G, F, D_X, D_Y) \tag{4}$$

Notice that our model can be viewed as training two “autoencoders” [20]: we learn one autoencoder F ∘ G : X → X jointly with another G ∘ F : Y → Y. However, these autoencoders each have special internal structures: they map an image to itself via an intermediate representation that is a translation of the image into another domain. Such a setup can also be seen as a special case of “adversarial autoencoders” [34], which use an adversarial loss to train the bottleneck layer of an autoencoder to match an arbitrary target distribution. In our case, the target distribution for the X → X autoencoder is that of the domain Y.
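Reusing the two hypothetical helpers sketched above, the generator-side part of Equation 3 could be assembled as follows (λ = 10 is the value used later in the experiments); the discriminator terms are optimized separately in the max step of Equation 4.

```python
def generator_objective(G, F_net, D_X, D_Y, real_x, real_y, lam=10.0):
    """Sketch of the terms of Eq. 3 that update G and F (the min step of Eq. 4)."""
    g_adv, _ = adversarial_losses(G, D_Y, real_x, real_y)      # L_GAN(G, D_Y, X, Y)
    f_adv, _ = adversarial_losses(F_net, D_X, real_y, real_x)  # L_GAN(F, D_X, Y, X)
    cyc = cycle_consistency_loss(G, F_net, real_x, real_y)     # L_cyc(G, F)
    return g_adv + f_adv + lam * cyc
```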
In Section 5.1.4, we compare our method against ablations of the full objective, including the adversarial loss LGAN alone and the cycle consistency loss Lcyc alone, and empirically show that both objectives play critical roles in arriving at high-quality results. We also evaluate our method with only cycle loss in one direction and show that a single cycle is not sufficient to regularize the training for this under-constrained problem.
4. Implementation
4.1 Network Architecture
We adopt the architecture for our generative networks from Johnson et al. [23] who have shown impressive results for neural style transfer and super-resolution. This network contains three convolutions, several residual blocks [18], two fractionally-strided convolutions with stride 1/2, and one convolution that maps features to RGB. We use 6 blocks for 128×128 images and 9 blocks for 256×256 and higher-resolution training images. Similar to Johnson et al. [23], we use instance normalization [53]. For the discriminator networks we use 70×70 PatchGANs [22, 30, 29], which aim to classify whether 70×70 overlapping image patches are real or fake. Such a patch-level discriminator architecture has fewer parameters than a full-image discriminator and can work on arbitrarily-sized images in a fully convolutional fashion [22].
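To make this description concrete, below is a minimal PyTorch sketch of a ResNet-based generator and a 70×70 PatchGAN discriminator consistent with the text above; the reflection padding, tanh output, and exact layer widths follow common CycleGAN implementations and should be read as assumptions (the precise configuration is given in the appendix).

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    """Residual block: two 3x3 convolutions with instance normalization."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(channels, channels, 3),
            nn.InstanceNorm2d(channels), nn.ReLU(inplace=True),
            nn.ReflectionPad2d(1), nn.Conv2d(channels, channels, 3),
            nn.InstanceNorm2d(channels),
        )

    def forward(self, x):
        return x + self.block(x)


class ResnetGenerator(nn.Module):
    """Three convolutions, n_blocks residual blocks, two fractionally-strided
    convolutions, and a final convolution back to RGB
    (6 blocks for 128x128 inputs, 9 blocks for 256x256 inputs)."""
    def __init__(self, n_blocks=9):
        super().__init__()
        layers = [nn.ReflectionPad2d(3), nn.Conv2d(3, 64, 7),
                  nn.InstanceNorm2d(64), nn.ReLU(inplace=True)]
        for in_c, out_c in [(64, 128), (128, 256)]:          # downsampling
            layers += [nn.Conv2d(in_c, out_c, 3, stride=2, padding=1),
                       nn.InstanceNorm2d(out_c), nn.ReLU(inplace=True)]
        layers += [ResnetBlock(256) for _ in range(n_blocks)]
        for in_c, out_c in [(256, 128), (128, 64)]:          # stride-1/2 upsampling
            layers += [nn.ConvTranspose2d(in_c, out_c, 3, stride=2,
                                          padding=1, output_padding=1),
                       nn.InstanceNorm2d(out_c), nn.ReLU(inplace=True)]
        layers += [nn.ReflectionPad2d(3), nn.Conv2d(64, 3, 7), nn.Tanh()]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)


class PatchDiscriminator(nn.Module):
    """70x70 PatchGAN: a fully convolutional classifier that outputs a grid of
    real/fake scores, one per overlapping 70x70 patch."""
    def __init__(self):
        super().__init__()
        def block(in_c, out_c, stride, norm=True):
            layers = [nn.Conv2d(in_c, out_c, 4, stride=stride, padding=1)]
            if norm:
                layers += [nn.InstanceNorm2d(out_c)]
            return layers + [nn.LeakyReLU(0.2, inplace=True)]
        self.model = nn.Sequential(
            *block(3, 64, 2, norm=False), *block(64, 128, 2),
            *block(128, 256, 2), *block(256, 512, 1),
            nn.Conv2d(512, 1, 4, stride=1, padding=1))

    def forward(self, x):
        return self.model(x)
```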
4.2 Training details
We apply two techniques from recent works to stabilize our model training procedure. First, for $\mathcal{L}_{\mathrm{GAN}}$ (Equation 1), we replace the negative log likelihood objective by a least-squares loss [35]. This loss is more stable during training and generates higher quality results. In particular, for a GAN loss $\mathcal{L}_{\mathrm{GAN}}(G, D, X, Y)$, we train the G to minimize $\mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[(D(G(x)) - 1)^2]$ and train the D to minimize $\mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[(D(y) - 1)^2] + \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[D(G(x))^2]$. Second, to reduce model oscillation [15], we follow Shrivastava et al.’s strategy [46] and update the discriminators using a history of generated images rather than the ones produced by the latest generators. We keep an image buffer that stores the 50 previously created images. For all the experiments, we set λ = 10 in Equation 3. We use the Adam solver [26] with a batch size of 1. All networks were trained from scratch with a learning rate of 0.0002. We keep the same learning rate for the first 100 epochs and linearly decay the rate to zero over the next 100 epochs. Please see the appendix (Section 7) for more details about the datasets, architectures, and training procedures.
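A sketch of that least-squares replacement, under the same assumptions as the earlier helpers (the discriminator returns raw, unbounded scores):

```python
import torch
import torch.nn.functional as F_nn

def lsgan_losses(G, D, real_x, real_y):
    """Least-squares GAN objective [35]: G minimizes E[(D(G(x)) - 1)^2];
    D minimizes E[(D(y) - 1)^2] + E[D(G(x))^2]."""
    fake_y = G(real_x)

    pred_gen = D(fake_y)
    g_loss = F_nn.mse_loss(pred_gen, torch.ones_like(pred_gen))

    pred_real = D(real_y)
    pred_fake = D(fake_y.detach())
    d_loss = (F_nn.mse_loss(pred_real, torch.ones_like(pred_real))
              + F_nn.mse_loss(pred_fake, torch.zeros_like(pred_fake)))
    return g_loss, d_loss
```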
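The image history buffer can be implemented as a small pool that, part of the time, hands the discriminator a previously generated image instead of the newest one. A minimal sketch assuming a batch size of 1; the 50% swap probability mirrors commonly used implementations and is an assumption here, while the pool size of 50 comes from the text above.

```python
import random
import torch

class ImageHistoryBuffer:
    """Stores up to `pool_size` previously generated images for discriminator updates."""
    def __init__(self, pool_size=50):
        self.pool_size = pool_size
        self.images = []

    def query(self, image: torch.Tensor) -> torch.Tensor:
        # Until the pool is full, store the new image and use it directly.
        if len(self.images) < self.pool_size:
            self.images.append(image.detach().clone())
            return image
        # Afterwards, with probability 0.5 return an older image and
        # replace it in the pool with the new one.
        if random.random() < 0.5:
            idx = random.randrange(self.pool_size)
            old = self.images[idx]
            self.images[idx] = image.detach().clone()
            return old
        return image
```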
5. Results
We first compare our approach against recent methods for unpaired image-to-image translation on paired datasets where ground truth input-output pairs are available for evaluation. We then study the importance of both the adversarial loss and the cycle consistency loss and compare our full method against several variants. Finally, we demonstrate the generality of our algorithm on a wide range of applications where paired data does not exist. For brevity, we refer to our method as CycleGAN. The PyTorch and Torch code, models, and full results can be found at our website.
5.1. Evaluation
Using the same evaluation datasets and metrics as “pix2pix” [22], we compare our method against several baselines both qualitatively and quantitatively. The tasks include semantic labels↔photo on the Cityscapes dataset [4], and map↔aerial photo on data scraped from Google Maps. We also perform an ablation study on the full loss function.
5.1.1 Evaluation Metrics
AMT perceptual studies. On the map↔aerial photo task, we run “real vs fake” perceptual studies on Amazon Mechanical Turk (AMT) to assess the realism of our outputs. We follow the same perceptual study protocol from Isola et al. [22], except we only gather data from 25 participants per algorithm we tested. Participants were shown a sequence of pairs of images, one a real photo or map and one fake (generated by our algorithm or a baseline), and asked to click on the image they thought was real. The first 10 trials of each session were practice and feedback was given as to whether the participant’s response was correct or incorrect. The remaining 40 trials were used to assess the rate at which each algorithm fooled participants. Each session only tested a single algorithm, and participants were only allowed to complete a single session. The numbers we report here are not directly comparable to those in [22] as our ground truth images were processed slightly differently, and the participant pool we tested may be differently distributed from those tested in [22] (due to running the experiment at a different date and time). Therefore, our numbers should only be used to compare our current method against the baselines (which were run under identical conditions), rather than against [22].
FCN score. Although perceptual studies may be the gold standard for assessing graphical realism, we also seek an automatic quantitative measure that does not require human experiments. For this, we adopt the “FCN score” from [22], and use it to evaluate the Cityscapes labels→photo task. The FCN metric evaluates how interpretable the generated photos are according to an off-the-shelf semantic segmentation algorithm (the fully-convolutional network, FCN, from [33]). The FCN predicts a label map for a generated photo. This label map can then be compared against the input ground truth labels using standard semantic segmentation metrics described below. The intuition is that if we generate a photo from a label map of “car on the road”, then we have succeeded if the FCN applied to the generated photo detects “car on the road”.
Semantic segmentation metrics. To evaluate the performance of photo→labels, we use the standard metrics from the Cityscapes benchmark [4], including per-pixel accuracy, per-class accuracy, and mean class Intersection-Over-Union (Class IOU) [4].
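For reference, all three segmentation metrics can be computed from a class-by-class confusion matrix; below is a small NumPy sketch (a generic helper, not the paper's released evaluation code), where `conf[i, j]` counts pixels of true class i predicted as class j.

```python
import numpy as np

def segmentation_scores(conf: np.ndarray):
    """Per-pixel accuracy, per-class accuracy, and mean class IoU."""
    tp = np.diag(conf).astype(float)  # correctly classified pixels per class
    with np.errstate(divide="ignore", invalid="ignore"):
        per_pixel_acc = tp.sum() / conf.sum()
        per_class_acc = np.nanmean(tp / conf.sum(axis=1))
        iou = tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)
    return per_pixel_acc, per_class_acc, np.nanmean(iou)
```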
<Paper Review>
1. Intro
- A GAN has the characteristic that it may end up generating only a single image for any input.
- With unpaired data, x and y are not matched, so when some input image comes in, we do not know which ground-truth image should be produced.
- In other words, without a matching y we simply want to change the characteristics of the input x into those of domain Y. In that case, the network can end up returning the same identical output no matter which input comes in.
⇒ This means the y produced from input x may not be a meaningful y that actually corresponds to x.
⇒ Put differently, even though each x carries different information, the mapping returns the same single output, so the information in x gets altered or lost.
⇒ An additional constraint is therefore needed ⇒ use of the cycle consistency loss.
2. Related Work
- CycleGAN requires that G(x) can be reconstructed back into the original image x.
- That is, the content of the original image is preserved while only the domain-related characteristics are changed (this addresses the limitation raised in the Intro) → previously the information in x could be lost, and to prevent this the model is trained to reconstruct in a way that preserves the information in x.
- Think of it as building two GANs, G and F ⇒ two generators and two discriminators.
- G and F are inverses of each other.
- DX, DY ⇒ their role is to judge whether a given image looks plausible as an image from domain X (respectively Y).
- Goal: F(G(x)) ≈ x and G(F(y)) ≈ y (trained in a form that allows recovery of the original image).
- Original image ⇒ image in domain Y (e.g., converted to a zebra) ⇒ back to the original image.
3. Formulation
- L_GAN: the term that makes the output a plausible image of the target domain (same form as the conditional GAN loss) ⇒ G: so that x can be turned into y.
- L_cyc: the term that makes it possible to return to the original image ⇒ x → back to the original (forward) ⇒ y → back to the original (backward).
- That is, G(F(y)): the difference between the reconstructed y image and the real y.
- That is, F(G(x)): the difference between the reconstructed image and the real input x.
4. Implementation
4.1 Network Architecture
- Uses residual blocks + instance normalization.
- Uses a discriminator that judges real vs. fake at the patch level within an image (PatchGAN) ⇒ Advantage: prediction is carried out per N×N patch, and since the patch size N is much smaller than the full image, the discriminator has fewer parameters and a faster running time.
4.2 Training details
- Least-squares loss: an MSE-based loss is used instead of the usual cross-entropy-based loss ⇒ training becomes more stable and images closer to the real image distribution can be generated.
- Replay buffer: the discriminators are updated using a history of previously generated images ⇒ the 50 most recent images produced by the generators are stored and used for the discriminator updates.
⇒ This is meant to reduce model oscillation and enables more stable training.
5. Experiments / Limitations
- The L_identity term that appears here is a loss added so that, during CycleGAN training, a scene with the sun up is not needlessly turned into a scene where the sun has set (and vice versa).
- That is, it is a loss added to preserve the color composition between the input and the output (a sketch of this term follows after these notes).
- Performance is poor on cases such as a person riding a zebra, because no such training images exist.
- It cannot change the shape of objects.
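As referenced above, the identity term is usually written as L_identity = E[‖G(y) − y‖₁] + E[‖F(x) − x‖₁]; here is a minimal PyTorch sketch in the same style as the earlier helpers (hypothetical modules `G` and `F_net`).

```python
import torch.nn.functional as F_nn

def identity_loss(G, F_net, real_x, real_y):
    """Identity mapping loss: a generator fed an image already from its target
    domain should leave it (and in particular its colors) unchanged."""
    return F_nn.l1_loss(G(real_y), real_y) + F_nn.l1_loss(F_net(real_x), real_x)
```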