HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances (2024)

Supreeth Narasimhaswamy*¹, Uttaran Bhattacharya², Xiang Chen²,
Ishita Dasgupta², Saayan Mitra², and Minh Hoai¹

¹Stony Brook University, USA    ²Adobe Research, USA

*Work started when Supreeth was an intern at Adobe Research.

Abstract

Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses, shapes, incorrect numbers of fingers, and physically implausible finger orientations. To generate images with realistic hands, we propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings in the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and MANO-Hand parameters from input text prompts, and a Text-Guided Hand-Params-to-Image diffusion model to synthesize images by conditioning on the prompts and the hand parameters generated by the previous component. We incorporate multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations, and articulations, for robust learning and reliable performance during inference. We conduct extensive quantitative and qualitative experiments and perform user studies to demonstrate the efficacy of our method in generating images with high-quality hands. Project page: https://supreethn.github.io/research/handiffuser/index.html

1 Introduction


Text-to-Image (T2I) generative models have shown impressive advancements in recent years. Generative models such as Stable Diffusion [56], Imagen [58], and GLIDE [45] can generate high-quality, photorealistic images. However, these methods often struggle to synthesize high-quality and realistic hands. The generated hands often have improbable poses, irregular shapes, an incorrect number of fingers, and poor hand-object interactions (Fig. 1).

Generating images with high-quality hands is challenging because hands occupy only a small part of the image yet are highly articulated. They have many degrees of freedom: fingers can bend to various extents relatively independently of one another. Hands also occur in various shapes, sizes, and orientations, and can be occluded by other body parts. Further, hands often interact with objects and can adopt a wide range of grasps depending on an object's size, shape, and affordance. Capturing such a vast range of articulations and interactions directly from text inputs therefore remains challenging. Despite having billions of parameters and being trained on millions of images, T2I models struggle to generate realistic hands.

A central challenge in hand image generation is learning diverse hand poses and configurations at scale. Existing hand representations based on keypoint skeletons and parametric shape models [34, 57] are useful for generative tasks in pose animation [66] and hand-object interaction [12]. These representations provide a grounded understanding of plausible hand shapes and postures, especially in relation to the rest of the body and different interacting objects. However, the steps necessary to incorporate these hand representations into T2I pipelines, both learning them from text prompts and mapping them into the pixel space of images, remain open problems. These problems are exacerbated by naturally constructed prompts, which often imply rather than specify hand postures and articulations (e.g., all prompts in Fig. 1). Prompt engineering [33, 16] focused on hand descriptions can potentially improve generation quality, but it comes at the cost of distilling and learning appropriate prompts from large-scale data, with the risk of learning spurious relationships between the prompt and the hands or between the hands and the rest of the image.

In this paper, we propose a learning-based model to generate images containing realistic hands in an end-to-end fashion from text prompts. Our model, called HanDiffuser, consists of two key trainable components. The first component, Text-to-Hand-Params (T2H), generates parameters of a hand model [34, 57] conditioned on the input text prompt. The second component, Text-Guided Hand-Params-to-Image (T-H2I), uses the hand parameters and the input text prompt as conditions to generate images. By conditioning the image generation on accurate hand models, HanDiffuser can generate high-quality hands with plausible poses, shapes, and finger articulations. Specifically, we consider three aspects of hand representation, each serving a unique purpose: the spatial locations of hand joints to capture the hand pose, the joint rotations to capture finger orientations and articulations, and the hand vertices to capture the overall hand shape. We design a novel Text+Hand Encoder by extending the CLIP encoder [54] to obtain joint embeddings for these three representations together with the text. We use the proposed joint embeddings to condition the image generation, allowing us to generate images conditioned on both the hand parameters and the text.

We train the two components of HanDiffuser independently. We train T2H using around 450K text and 3D human pairs and fine-tune T-H2I using around 900K text and image pairs. Once trained, we use the two components end-to-end in a single inference pipeline to generate images from text prompts. We conduct extensive experiments and user studies to show the effectiveness of HanDiffuser in generating images with high-quality hands.

In short, the contributions of our paper are:

  • HanDiffuser, a generative model to synthesize images with high-quality hands by conditioning on text and hand embeddings. It has two novel components: Text-to-Hand-Params and Text-Guided Hand-Params-to-Image.

  • Text-to-Hand-Params, a diffusion model to generate SMPL-Body and MANO-Hand parameters from text inputs. The generated MANO-Hands are used to further condition the image generation.

  • Text-Guided Hand-Params-to-Image, a diffusion model to generate images with high-quality hands by conditioning on hand and text embeddings. We design hand embeddings to capture hand shape, pose, and finger orientations and articulations.


2 Related Work

We briefly summarize related work on text-to-image generation, concurrent work on text-to-human generation, and commonly used hand representations.

Text-to-Image Generation. Text-guided image generation is a well-studied problem, with approaches ranging from GANs [68, 78, 70, 65], autoregressive models [37, 55], and VQ-VAE transformers [13] to state-of-the-art diffusion models [61, 21, 11, 22]. Text-to-image diffusion models often bootstrap the generative pipeline with pre-trained language models, such as BERT [10] or CLIP [54], to learn efficiently from the text information [73, 38, 17, 45, 4]. More recently, Stable Diffusion [56] performs diffusion in a latent image space to generate high-quality images at low computational cost. Imagen [58], by contrast, diffuses the pixels directly in a hierarchical fashion. ControlNet [74] provides additional controllability in the image generation process through conditioning signals ranging from sketches to pose priors. Recent commercial products for text-to-image generation include Midjourney [39], DALL-E 3 [46], and Firefly [1]. While advances in this area have been rapid and significant, generating highly articulated hands remains prone to unrealistic artifacts.

Text-to-Human Generation. Alongside image generation, there has been considerable progress in generating human pose and motion from text prompts. Recent generative methods typically use skeletal joint formats, such as OpenPose [5], or combined joint and mesh formats, such as SMPL [34], to represent the human body. They train on large-scale pose and motion datasets, including KIT [51], AMASS [36], BABEL [53], and HumanML3D [18]. To efficiently map text prompts to motion sequences, text-to-motion generation methods learn combined representations of language and pose using techniques ranging from recurrent neural networks [2, 30], hierarchical pose embeddings [3, 14], and VQ-VAE transformers [49, 50] to motion diffusion models [75, 66, 28, 7]. Some approaches also generate 3D meshes on top of pose synthesis to produce fully rendered humans [23, 71]. Separate from motion synthesis, there are works on generating parametric pose models from text [8, 47, 9]. However, these methods focus on the human body and ignore the hand regions; as a result, they cannot generate articulated hands. One method [43] generates plausible hands using ControlNet, but it requires a hand skeleton or mesh as an additional input. A concurrent work [35] proposes an inpainting approach to refine hands: given a generated image, it reconstructs 3D hand meshes and then refines the hand regions. Consequently, the quality of the reconstructed hand mesh, and of the final refined hand, depends on the quality of the initially generated hand. In contrast, our method first generates hand mesh parameters from the prompt and then conditions the image generation on these intermediate hand parameters. Moreover, [35] ignores the hand-object interactions in the initial image and might not preserve hand-object occlusions and interactions when refining hands.

Hand Representations. Available datasets on hand configurations, gestures, and hand-object interactions offer hand representations in a variety of formats, including bounding boxes, silhouettes, and depth maps [40, 29, 31, 42, 41, 25, 60, 44], as well as keypoints and parametric models [57, 27, 26, 12]. These representations are useful for multiple hand-centric tasks, including detection [67], gesture and pose recognition [76], motion generation [77, 64], and hand-object interaction [24, 63, 15]. Our work combines keypoint-based and parametric representations to efficiently encode diverse hand shapes and highly articulated finger movements.

3 HanDiffuser

Fig. 2 illustrates the proposed HanDiffuser architecture. Given a text input, HanDiffuser first uses a novel Text-to-Hand-Params diffusion model to generate the parameters of the human body and hand models. The second component, the Text-Guided Hand-Params-to-Image diffusion model, generates the output image by conditioning on the hand model and the text. This section describes the Text-to-Hand-Params and Text-Guided Hand-Params-to-Image models in detail, following a brief introduction to human body models and Stable Diffusion.

3.1 Preliminaries

SMPL-H. Our Text-to-Hand-Params model generates parameters of human body and hand models from text inputs. We use SMPL [34] and MANO [57] as our body and hand models, respectively. SMPL is a differentiable function $\mathcal{M}_b(\theta_b, \beta_b)$ that takes a pose parameter $\theta_b \in \mathbb{R}^{69}$ and a shape parameter $\beta_b \in \mathbb{R}^{10}$, and returns the body mesh $\mathcal{M}_b \in \mathbb{R}^{6890 \times 3}$ with 6890 vertices. Similarly, MANO is a differentiable function $\mathcal{M}_h(\theta_h, \beta_h, s)$ that takes the hand pose parameter $\theta_h \in \mathbb{R}^{48}$, the hand shape parameter $\beta_h \in \mathbb{R}^{10}$, and the hand side $s \in \{\text{left}, \text{right}\}$, and returns the hand mesh $\mathcal{M}_h \in \mathbb{R}^{778 \times 3}$ with 778 vertices. The 3D hand joint locations $J_h \in \mathbb{R}^{k \times 3} = \mathcal{W}_h \mathcal{M}_h$ can be regressed from the vertices using a pre-trained linear regressor $\mathcal{W}_h$. The SMPL-H model combines the body, left-hand, and right-hand models into a single differentiable function $\mathcal{M}(\theta, \beta)$ with pose parameters $\theta = (\theta_b, \theta_{lh}, \theta_{rh})$ and shape parameters $\beta$. The pose parameters $\theta_b$, $\theta_{lh}$, and $\theta_{rh}$ capture the root-relative joint rotations for the body, left hand, and right hand, respectively. The shape parameter $\beta$ captures the scale of the person.
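As a concrete illustration, the following minimal sketch shows how MANO hand vertices and joints can be recovered from pose and shape parameters using the third-party smplx package; the model path and the zero-valued parameters are placeholders, and this snippet is not part of the method itself.

```python
import torch
import smplx  # third-party package providing SMPL/SMPL-H/MANO implementations

MANO_MODEL_DIR = "models/mano"  # placeholder path to the downloaded MANO model files

# Right-hand MANO model; use_pca=False keeps the full axis-angle finger pose.
mano = smplx.create(MANO_MODEL_DIR, model_type="mano",
                    is_rhand=True, use_pca=False, batch_size=1)

beta_h = torch.zeros(1, 10)        # hand shape parameters
global_orient = torch.zeros(1, 3)  # wrist (root) rotation in axis-angle
finger_pose = torch.zeros(1, 45)   # 15 finger joints x 3 axis-angle values

out = mano(betas=beta_h, global_orient=global_orient, hand_pose=finger_pose)
vertices = out.vertices            # (1, 778, 3) hand mesh M_h
joints = out.joints                # (1, num_joints, 3) 3D hand joints J_h regressed from M_h
```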

Stable Diffusion. Our Text-Guided Hand-Params-to-Image model is built upon Stable Diffusion [56]. Stable Diffusion is a latent diffusion model consisting of an auto-encoder, a U-Net for noise estimation, and a CLIP text encoder. The encoder $\mathcal{E}$ encodes an image $x$ into a latent representation $z = \mathcal{E}(x)$ that the diffusion process operates on, and the decoder $\mathcal{D}$ reconstructs the image $\hat{x} = \mathcal{D}(z)$ from the latent $z$. The U-Net is conditioned on the denoising step $t$ and the text embedding $\tau_{text}(text)$, where $\tau_{text}$ is a CLIP [54] text encoder that projects a sequence of tokenized text into an embedding space. To jointly condition the image generation on hand parameters and the text, we replace the text encoder $\tau_{text}(text)$ with a novel Text+Hand encoder $\tau_{text+h}(text, hand)$ that embeds the text and hand parameters into a common embedding space.

3.2 Text-to-Hand-Params Diffusion

The Text-to-Hand-Params diffusion model takes text as input and generates the pose parameters $\theta = (\theta_b, \theta_{lh}, \theta_{rh})$ and shape parameters $\beta$ of the SMPL-H model by conditioning on the text.

We define $x := (\theta, \beta)$ and model the forward diffusion process by iteratively adding Gaussian noise to $x$ for $T$ time steps:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(\sqrt{\alpha_t}\, x_{t-1},\, (1 - \alpha_t) I\right), \qquad (1)$$

where $\alpha_t \in (0, 1)$ are constant hyper-parameters.

We model the text-conditioned SMPL-H generation distribution $p(x_0 \mid c)$ as the reverse diffusion process that gradually denoises $x_T$. Following [66], we learn the denoising by directly predicting $\hat{x}_0 = G(x_t, t, c)$ with a model $G$. We train the reverse diffusion using the objective:

$$\mathcal{L}_1 = \mathbb{E}_{x_0 \sim q(x_0 \mid c),\, t \sim [1, T]} \left\| x_0 - G(x_t, t, c) \right\|_2^2. \qquad (2)$$

We obtain the conditional text embedding $c$ by encoding the text using CLIP [54]. We implement $G$ using an encoder-only transformer architecture similar to MDM [66].
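For concreteness, a minimal sketch of one training step of this $x_0$-prediction objective is given below, assuming a PyTorch implementation; the denoiser G, the CLIP text encoder, and the noise schedule are stand-ins for the components described above rather than the exact training code.

```python
import torch
import torch.nn.functional as F

def t2h_training_step(G, clip_text_encoder, x0, text, alphas_cumprod):
    """One training step of Text-to-Hand-Params (Eqs. 1-2), as a sketch.

    G:                 stand-in denoiser predicting x0_hat from (x_t, t, c).
    clip_text_encoder: stand-in for the frozen CLIP text encoder.
    x0:                (B, d) flattened SMPL-H pose/shape parameters (theta, beta).
    alphas_cumprod:    (T,) cumulative products of the alpha_t noise schedule.
    """
    B, T = x0.shape[0], alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=x0.device)       # random diffusion steps
    a_bar = alphas_cumprod[t].unsqueeze(-1)                # \bar{alpha}_t per sample

    # Closed-form forward diffusion implied by Eq. (1):
    # x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * noise
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

    c = clip_text_encoder(text)                            # text condition c
    x0_hat = G(x_t, t, c)                                  # predict x0 directly (MDM-style)
    return F.mse_loss(x0_hat, x0)                          # Eq. (2)
```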

Given a text prompt at inference time, we conditionally sample $x = (\theta, \beta)$. We use the shape and pose parameters to obtain the joints $J_{lh}, J_{rh}$ and vertices $\mathcal{M}_{lh}, \mathcal{M}_{rh}$ of the left and right hands using the MANO hand model. We also choose camera parameters and project $J_{lh}, J_{rh}$ into the image space to obtain the corresponding image-space joint locations $J^{2D}_{lh}, J^{2D}_{rh}$. We use the joint rotations $\theta_{lh}, \theta_{rh}$, the hand vertices $\mathcal{M}_{lh}, \mathcal{M}_{rh}$, and the spatial joint locations $J^{2D}_{lh}, J^{2D}_{rh}$ to condition the image generation in the next stage.
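The projection step can be illustrated with a simple pinhole-camera sketch; the focal length and principal point below are illustrative placeholders, not the camera parameters used in the paper.

```python
import torch

def project_hand_joints(J_3d, f=1000.0, cx=256.0, cy=256.0):
    """Project 3D hand joints (N, 3), given in camera coordinates, to 2D
    image-space locations (N, 2) with a pinhole camera.  The focal length f
    and the principal point (cx, cy) are illustrative placeholders."""
    x, y, z = J_3d[:, 0], J_3d[:, 1], J_3d[:, 2].clamp(min=1e-6)
    u = f * x / z + cx
    v = f * y / z + cy
    return torch.stack([u, v], dim=-1)
```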

3.3 Text-Guided Hand-Params-to-Image Diffusion

The Text-Guided Hand-Params-to-Image diffusion model is built upon Stable Diffusion [56] and conditions the image generation on the text and the hand parameters generated by the Text-to-Hand-Params model. Specifically, Text-Guided Hand-Params-to-Image uses a novel Text+Hand encoder $\tau_{text+h}$ to first obtain joint embeddings of the text and hand parameters, and then uses these joint embeddings to condition the image generation. We provide more details below.

Text+Hand Encoder. Given the text, along with the spatial joint locations $J^{2D}_h$, vertices $\mathcal{M}_h$, and joint rotations $\theta_h$ of the hands, our goal is to produce $D$-dimensional embeddings that encode both the text and the hand parameters, where $D$ is the CLIP [54] token embedding dimension. To encode hand joint locations in the image space, we follow [6, 69] and introduce additional positional tokens: we quantize the image height and width uniformly into $N_{bins}$ bins, which allows us to approximate and tokenize any normalized spatial coordinate into one of $N_{bins}$ tokens. We then encode the text tokens and the hand joint spatial tokens into $D$ dimensions using $f_{text+J^{2D}_h}$. Specifically, we construct $f_{text+J^{2D}_h}$ by adding an $N_{bins} \times D$ embedding layer to the existing CLIP token embedder and fine-tuning it during training. To encode hand vertices, we transform them into basis point set (BPS) [52] representations and pass them through $f_{\mathcal{M}_h}$, a multi-layer perceptron (MLP) consisting of fully-connected linear and ReLU layers. Similarly, we encode the 6D hand joint rotations $\theta_h$ using an MLP $f_{\theta_h}$ with the same structure. Finally, we concatenate the embeddings of the text, spatial hand joints, hand vertices, and hand joint rotations to produce the joint text and hand embeddings.
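A minimal sketch of this encoder is shown below, assuming a PyTorch implementation. The CLIP token embedder, the embedding dimension, the BPS feature size, and the rotation dimensionality are placeholder assumptions, and the sketch uses a separate embedding layer for the positional tokens, whereas the actual model extends the CLIP embedding table.

```python
import torch
import torch.nn as nn

class TextHandEncoder(nn.Module):
    """Illustrative sketch of the Text+Hand encoder tau_{text+h}."""

    def __init__(self, clip_token_embedder, D=768, n_bins=1000, bps_dim=1024, rot_dim=96):
        super().__init__()
        self.text_embedder = clip_token_embedder          # stand-in: tokens -> (B, L, D)
        self.n_bins = n_bins
        self.bin_embed = nn.Embedding(n_bins, D)          # extra positional tokens
        mlp = lambda d_in: nn.Sequential(                 # three-layer MLP, as in the paper
            nn.Linear(d_in, D), nn.ReLU(), nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))
        self.f_mesh = mlp(bps_dim)                        # f_{M_h}: BPS-encoded vertices -> D
        self.f_rot = mlp(rot_dim)                         # f_{theta_h}: 6D joint rotations -> D

    def forward(self, text_tokens, joints_2d, bps_vertices, joint_rot_6d):
        # joints_2d: (B, J, 2), normalized to [0, 1); quantize each coordinate into a bin token.
        bins = (joints_2d.clamp(0.0, 1.0 - 1e-6) * self.n_bins).long()   # (B, J, 2)
        joint_emb = self.bin_embed(bins).flatten(1, 2)                   # (B, 2J, D)
        text_emb = self.text_embedder(text_tokens)                       # (B, L, D)
        mesh_emb = self.f_mesh(bps_vertices).unsqueeze(1)                # (B, 1, D)
        rot_emb = self.f_rot(joint_rot_6d).unsqueeze(1)                  # (B, 1, D)
        # Concatenate along the token axis to form the joint text+hand conditioning sequence.
        return torch.cat([text_emb, joint_emb, mesh_emb, rot_emb], dim=1)
```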

Diffusion. We instantiate Text-Guided Hand-Params-to-Image using Stable Diffusion [56] and train it with the following objective:

$$\mathcal{L}_2 = \mathbb{E}_{\mathcal{E}(x),\, \epsilon \sim \mathcal{N}(0, 1),\, t,\, y} \left\| \epsilon - F\left(z_t, t, \tau_{text+h}(y)\right) \right\|_2^2. \qquad (3)$$

In the above equation, the condition $y = (\text{text}, J^{2D}_h, \mathcal{M}_h, \theta_h)$ denotes the combination of the text and the hand parameters, which include the spatial joint locations, vertices, and joint rotations. The function $F$ is the denoising U-Net that predicts the noise, and $\tau_{text+h}$ is the trainable Text+Hand encoder. We refer readers to [56] for more details regarding Eq. (3).
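A corresponding sketch of one noise-prediction training step for Eq. (3) is given below, again with stand-in modules; it is meant only to make the conditioning on $y$ explicit, not to reproduce the actual Stable Diffusion training code.

```python
import torch
import torch.nn.functional as F

def th2i_training_step(unet, vae_encoder, tau_text_h, image, y, alphas_cumprod):
    """One noise-prediction training step of Text-Guided Hand-Params-to-Image
    (Eq. 3), as a sketch with stand-in modules.
    y = (text, J^2D_h, M_h, theta_h) is the combined text and hand condition."""
    z0 = vae_encoder(image)                                 # latent z = E(x)
    B, T = z0.shape[0], alphas_cumprod.shape[0]
    t = torch.randint(0, T, (B,), device=z0.device)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    noise = torch.randn_like(z0)
    z_t = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * noise  # noised latent
    cond = tau_text_h(*y)                                   # joint text+hand embeddings
    noise_hat = unet(z_t, t, cond)                          # U-Net F predicts the added noise
    return F.mse_loss(noise_hat, noise)                     # Eq. (3)
```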

Table 1:
Method | FID ↓ | KID ↓ | FID-H ↓ | KID-H ↓ | Hand Conf. ↑
Stable Diffusion | 29.005 | 9.63×10⁻³ | 34.372 | 4.63×10⁻² | 0.887
Stable Diffusion Fine-tuned | 20.056 | 7.91×10⁻³ | 31.219 | 3.09×10⁻² | 0.913
ControlNet | 18.694 | 5.93×10⁻³ | 28.091 | 2.19×10⁻² | 0.969
HanDiffuser w/o 2D hand joints | 16.839 | 5.21×10⁻³ | 29.902 | 2.46×10⁻² | 0.953
HanDiffuser w/o 3D joint rotation and vertices | 14.586 | 4.14×10⁻³ | 28.186 | 2.21×10⁻² | 0.961
HanDiffuser (proposed) | 13.918 | 4.07×10⁻³ | 27.550 | 2.11×10⁻² | 0.978

Generating SMPL-H vs. Skeletons in Text-to-Hand-Params Diffusion. We design the first component of HanDiffuser to generate pose and shape parameters of SMPL-H instead of keypoints or skeletons, since SMPL-H encodes topological and geometric priors about humans and carries richer information than skeletons. SMPL-H parameters also tend to be more robust to noise than skeletons: we can still obtain plausible poses from noisy SMPL-H parameters, whereas noisy joint locations lead to implausible poses. Since we only generate the parameters of the SMPL-H mesh (51 joint rotations and 10 shape parameters), the Text-to-Hand-Params component is computationally lighter than the second component, Text-Guided Hand-Params-to-Image.


4 Experiments

This section describes the datasets used to train HanDiffuser, the implementation details, and the evaluation metrics. We also present quantitative and qualitative results and user studies to show the efficacy of HanDiffuser in generating images with high-quality hands.

4.1 Datasets

We train the two components of HanDiffuser using our own curated datasets. We start with 930K paired texts and images, curate them to remove inappropriate and harmful content, and validate the quality of the images through independent content creators. We randomly split the dataset to obtain 900K training and 30K test text-image pairs. We further preprocess the dataset to obtain SMPL-H parameters: we use [62] to obtain SMPL parameters for the body and [57] to obtain MANO parameters for the hands, and we reject estimated SMPL bodies and MANO hands with low confidence scores. Finally, we curate two datasets from the estimated hand and body parameters. The first dataset consists of (text, SMPL-H) tuples; we keep such tuples only for images where we can reliably estimate SMPL-H parameters. This dataset has 450K tuples and is used to train the first component of HanDiffuser, Text-to-Hand-Params. The second dataset consists of (text, image, SMPL-H) triplets, and we keep all 930K triplets. We use this dataset to train the second component of HanDiffuser, Text-Guided Hand-Params-to-Image. During training, we condition the image generation on hand parameters only when the SMPL-H parameters are reliably estimated.

4.2 Implementation Details

We train the Text-to-Hand-Params diffusion model to generate the SMPL-H pose $\theta$ and shape $\beta$ by conditioning on the text. We encode the text using a frozen CLIP-ViT-B/32 model [54]. We train the Text-to-Hand-Params model with classifier-free guidance [20] by randomly setting 10% of the text conditions to be empty. We train this model for 100 epochs on a single A100 GPU with a batch size of 64. We use 1000 diffusion steps and a guidance scale of $s = 2.5$ during inference. We fine-tune Text-Guided Hand-Params-to-Image starting from the Stable Diffusion v1.4 checkpoint. To implement the Text+Hand encoder $\tau_{text+h}$, we start with the CLIP ViT-L/14 model and introduce $N_{bins} = 1000$ additional positional tokens for spatial hand joints. We use simple three-layer MLPs $f_{\mathcal{M}_h}$ and $f_{\theta_h}$ to encode hand vertices and joint rotations, respectively. We fine-tune Text-Guided Hand-Params-to-Image, including the Text+Hand encoder $\tau_{text+h}$, for 20 epochs on eight A100 GPUs using a batch size of 8 and the AdamW optimizer with a constant learning rate of $10^{-4}$. We perform inference with 50 PLMS [32] steps using a classifier-free guidance [20] scale of 4.0.
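The classifier-free guidance used at inference can be summarized by the following sketch, where the U-Net and the condition embeddings are stand-ins; the guidance scales correspond to the settings reported above.

```python
import torch

@torch.no_grad()
def guided_noise_estimate(unet, z_t, t, cond_emb, uncond_emb, guidance_scale=4.0):
    """Classifier-free guidance at inference, sketched with stand-in modules:
    the model is evaluated with the (text+hand) condition and with the empty
    condition, and the two noise estimates are blended with the guidance
    scale (4.0 for T-H2I, 2.5 for T2H, as reported above)."""
    eps_cond = unet(z_t, t, cond_emb)        # conditional estimate
    eps_uncond = unet(z_t, t, uncond_emb)    # unconditional (empty-condition) estimate
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```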

HanDiffuser Inference. Given a text input, we first sample SMPL-H parameters using our trained Text-to-Hand-Params model. We then extract the MANO hand parameters from SMPL-H, choose camera parameters randomly with some constraints to keep the hands reasonably visible in the image, and obtain the spatial hand joint locations. Finally, we use these spatial hand joints, the MANO parameters, and the text to conditionally sample an image from our trained Text-Guided Hand-Params-to-Image model.

4.3 Evaluation Metrics

We assess the quality of images generated by HanDiffuser using the Fréchet Inception Distance (FID) and Kernel Inception Distance (KID) [19, 48]. Since FID and KID measure the overall quality of the image, we also compute FID-H and KID-H to measure the quality of the hand regions only. We do this by first extracting crops using hand bounding boxes and then computing FID and KID on these hand crops. We also measure hand quality using the average hand detection confidence score: we run an off-the-shelf hand detector [72] on the generated images and record the detection confidence scores. Higher confidence scores mean that the hand detector is more confident that a region is a hand, indicating higher-quality hand generations.
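One possible way to compute FID-H on hand crops is sketched below using the torchmetrics implementation of FID; this is an illustration of the metric, not necessarily the exact evaluation code used for the results in this paper.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

def fid_on_hand_crops(real_crops: torch.Tensor, fake_crops: torch.Tensor) -> torch.Tensor:
    """Sketch of FID-H: FID computed only on hand regions.  real_crops and
    fake_crops are uint8 tensors of shape (N, 3, H, W) holding hand
    bounding-box crops resized to a common resolution."""
    fid = FrechetInceptionDistance(feature=2048)
    fid.update(real_crops, real=True)    # accumulate statistics of real hand crops
    fid.update(fake_crops, real=False)   # accumulate statistics of generated hand crops
    return fid.compute()
```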

4.4 Quantitative Results and Ablation Studies

We compare the proposed HanDiffuser with three other methods and report the results in Table 1. First, we use an off-the-shelf Stable Diffusion [56] model pre-trained on the LAION-5B [59] dataset. LAION-5B is a general-purpose dataset of text-image pairs that does not focus on humans, so a Stable Diffusion model trained on it does not perform well on human-centric images. Second, we fine-tune Stable Diffusion on our dataset; this improves over the pre-trained model, but the performance is still low compared to the proposed HanDiffuser. Third, we experiment with ControlNet [74], a popular latent diffusion model that uses spatial control images to condition the image generation process. We train a ControlNet architecture on our dataset using hand-pose skeleton images as controls. However, unlike HanDiffuser, which generates images directly from text input, ControlNet requires an additional hand-pose skeleton image as input during inference. To address this, we directly use the ground-truth skeleton images from the test data as control images. Even with these ground-truth control images, ControlNet does not perform as well as HanDiffuser.

It is important to note that the FID-H, KID-H, and hand confidence scores reported for Stable Diffusion and Stable Diffusion Fine-tuned in Table 1 are optimistic performance measures. To evaluate these two methods, we first run a hand detector [72] to obtain hand crops. This approach is biased towards rejecting bad hand generations, since the hand detector cannot localize low-quality hands, leaving unrealistic-looking generated hands out of the evaluation. In contrast, the corresponding metrics for ControlNet and HanDiffuser in Table 1 are true performance measures, since both methods generate images conditioned on hands, allowing us to crop every generated hand precisely.

We also study the benefits of the different hand representations used to condition the hand generation in HanDiffuser. First, we evaluate HanDiffuser after omitting the spatial hand joint locations $J^{2D}_h$ from the hand embeddings. Second, we evaluate HanDiffuser after omitting the hand joint rotations $\theta_h$ and hand vertices $\mathcal{M}_h$ from the hand embeddings. We report these results in the fourth and fifth rows of Table 1. They show that all three hand representations help in generating high-quality hands.


4.5 Qualitative Results and Failure Cases

We show qualitative results in Fig. 3, comparing Stable Diffusion, ControlNet [74], the proposed HanDiffuser without hand joint rotation and vertex embeddings, and the full HanDiffuser. Stable Diffusion does not generate realistic hands even after fine-tuning on human-centric data: it produces incorrect numbers of fingers, poor hand-object interactions, and implausible finger orientations and hand shapes. ControlNet generates better-looking results but requires hand skeleton control images as additional input. HanDiffuser generates hands with plausible poses by conditioning the image generation on spatial hand joint locations. Further conditioning on hand joint rotations and vertices enables HanDiffuser to generate high-quality, detailed hands with plausible orientations and shapes.

Fig. 4 shows a few SMPL-H results generated from text inputs by our Text-to-Hand-Params model. While we only use the hand parameters from these SMPL-H outputs, Text-to-Hand-Params can be used directly in other applications that require generating SMPL-H models from text. Fig. 5 shows how Text-Guided Hand-Params-to-Image maps these SMPL-H results to generated images. Fig. 6 shows some failure cases of HanDiffuser.

4.6 User Studies

We evaluate the quality of both our generated images and the intermediate outputs of our approach through two user studies. We evaluate the generated images on two aspects. The first is (A) plausibility, which considers how natural the hands look, for example, in terms of hand shapes, finger orientations, the number of fingers and hands, and how clearly the hands are in focus in the image. The second is (B) relevance, which considers how natural the hand poses or gestures appear given the prompt, for example, when holding objects or gesticulating conventionally (unless otherwise specified in the prompt).

We evaluate the intermediate SMPL-H outputs used to generate the images on three aspects: (A) plausibility of the pose, (B) relevance to the prompt, and (C) consistency with the generated image.


Setup. We compare three methods in the user study evaluating the generated images: fine-tuned Stable Diffusion (SD-FT), HanDiffuser trained with only 2D hand joints (HanDiffuser-2D), and HanDiffuser trained with all its components. We show participants 20 sets of images. Each set consists of a unique prompt randomly selected from the test partition of LAION-5B [59] and the images generated by the three methods for that prompt. We arrange the three images within each set in a random order not known to the participants. For each image in each set, we ask participants to respond to two questions: “How is the visual quality of the hands?” (image plausibility) and “How well do the hands follow the prompt?” (image relevance). We collect responses to the two questions on a 5-point Likert scale with the following choices: “Poor (e.g., too many or severe mistakes)”, “Bad (e.g., some aspects reasonable but still many or severe mistakes)”, “Fair (e.g., some aspects are plausible but some mistakes visible)”, “Good (e.g., most aspects are plausible but a few mistakes visible)”, and “Excellent (e.g., everything looks good, no visible mistakes)”. Note that we perform the user study on methods that require only text prompts to generate images at test time. We therefore exclude methods such as ControlNet [74], which additionally requires pose information to generate similar images. Moreover, the overall performance of ControlNet is at the same level as HanDiffuser-2D (Table 1, rows 3 and 4) even when we manually provide the ground-truth poses, leading to no meaningful differences between their responses in a pilot study.

Table 2:
Image Aspect | SD-FT | HanDiffuser-2D | HanDiffuser (proposed)
Plausibility ↑ | 2.74 ± 0.08 | 3.30 ± 0.11 | 3.51 ± 0.11
Relevance ↑ | 3.83 ± 0.12 | 4.11 ± 0.17 | 4.23 ± 0.18

For the user study evaluating the intermediate SMPL-H outputs, we show participants 9 random (prompt, SMPL-H pose, generated image) triplets, where the prompts are randomly selected from the test partition of LAION-5B [59] and the SMPL-H poses and generated images come from our approach. For each triplet, we ask participants to respond to three questions: “How plausible is the pose?” (SMPL-H plausibility), “How relevant are the hands in the pose given the prompt?” (SMPL-H relevance), and “How consistent are the hands in the pose with the hands in the image?” (SMPL-H consistency). We collect responses to the three questions on the same 5-point Likert scale as above. To evaluate consistency, we additionally ask participants to focus primarily on the hand configurations and gestures described or implied in the text, and to ignore distractors such as the quality of any facial expressions (or lack thereof), any component of the 3D pose that is not visible in the image, and differences in body orientation between the pose and the image.


Results. Our user study evaluating the generated images was completed by 35 participants, resulting in a total of 700 responses over the 20 image sets. We did not observe any notable response differences across genders and age groups. We report the distribution of scores for the two aspects across all responses in Fig. 7 and summarize the mean and standard deviation of the scores for each of the three methods in Table 2. To compute these values, we assign the numbers 1 through 5 to the response choices Poor through Excellent, so higher scores indicate better performance. HanDiffuser outperforms the other methods on both aspects. Looking at the distribution of image plausibility scores (Fig. 7(a)), the mode of SD-FT is “Fair”, while the modes of both HanDiffuser versions are one point higher, at “Good”. Overall, 55% of HanDiffuser scores are “Good” or better, compared to 47% of HanDiffuser-2D scores and 27% of SD-FT scores. Looking at the distribution of image relevance scores (Fig. 7(b)), the modes of all three methods are “Excellent”, indicating their efficacy in generating hand appearances aligned with the text prompts. Among the three methods, we note a relatively higher proportion of good responses for the HanDiffuser variants: 78% of HanDiffuser scores are “Good” or better, compared to 75% of HanDiffuser-2D scores and 65% of SD-FT scores. Looking at the mean scores across all 700 responses (Table 2), we note marked improvements for HanDiffuser. Its image plausibility scores are 0.77 points (or 15% on the 5-point scale) higher than SD-FT and 0.21 points (or 4%) higher than HanDiffuser-2D. Correspondingly, its image relevance scores are 0.40 points (or 8%) higher than SD-FT and 0.12 points (or 2%) higher than HanDiffuser-2D.

Our user study evaluating the intermediate SMPL-H poses was completed by 18 participants, resulting in a total of 171 responses over the 9 triplets. We did not observe any notable response differences across genders and age groups. We report the distribution of scores for the three aspects across all responses in Fig. 8 and summarize the mean and standard deviation of the scores for each aspect in Table 3. To compute these values, we assign the numbers 1 through 5 to the response choices Poor through Excellent, so higher scores indicate better performance.

Table 3:
Plausibility ↑ | Relevance ↑ | Consistency ↑
4.00 ± 0.81 | 3.83 ± 0.98 | 3.67 ± 1.04

5 Conclusions and Limitations

We have presented HanDiffuser, an end-to-end model to generate images with realistic hand appearances from text prompts. Our model explicitly learns hand embeddings based on hand shapes, poses, and finger-level articulations, and combines them with text embeddings to generate images with high-quality hands. We demonstrate the state-of-the-art performance of our method on a large-scale text-to-image benchmark both quantitatively, through multiple evaluation metrics, and qualitatively, through user studies.

In the future, we plan to extend our model to more unexplored territories of hand generation. These include images containing multiple people, complex hand-object interactions, prompts describing highly specialized hand activities (e.g., origami), the same person handling multiple objects simultaneously, hand-hand interactions between two or more people, and non-anthropomorphic hands (e.g., a dog using a computer). A concurrent future direction is to make the hand generation pipeline style- and shape-aware, such that it consistently generates the same hands when asked to generate the same person in different images.

Acknowledgements. This project was partially supported by US National Science Foundation Award DUE-2055406.

References

  • Adobe [2023] Adobe. Firefly, https://www.adobe.com/sensei/generative-ai/firefly.html, 2023.
  • Ahn et al. [2018] Hyemin Ahn, Timothy Ha, Yunho Choi, Hwiyeon Yoo, and Songhwai Oh. Text2action: Generative adversarial synthesis from language to action. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 5915–5920, 2018.
  • Ahuja and Morency [2019] Chaitanya Ahuja and Louis-Philippe Morency. Language2pose: Natural language grounded pose forecasting. In 2019 International Conference on 3D Vision (3DV), pages 719–728, 2019.
  • Alembics [2023] Alembics. Disco-Diffusion, https://github.com/alembics/disco-diffusion, 2023.
  • Cao et al. [2019] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
  • Chen et al. [2022] Ting Chen, Saurabh Saxena, Lala Li, David J. Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In International Conference on Learning Representations (ICLR), 2022.
  • Dabral et al. [2023] Rishabh Dabral, Muhammad Hamza Mughal, Vladislav Golyanik, and Christian Theobalt. MoFusion: A framework for denoising-diffusion-based motion synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9760–9770, 2023.
  • Delmas et al. [2022] Ginger Delmas, Philippe Weinzaepfel, Thomas Lucas, Francesc Moreno-Noguer, and Grégory Rogez. PoseScript: 3D human poses from natural language. In ECCV, 2022.
  • Delmas et al. [2023] Ginger Delmas, Philippe Weinzaepfel, Francesc Moreno-Noguer, and Grégory Rogez. PoseFix: Correcting 3D human poses with natural language. In ICCV, 2023.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, 2019. Association for Computational Linguistics.
  • Dhariwal and Nichol [2021] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. In Advances in Neural Information Processing Systems, pages 8780–8794. Curran Associates, Inc., 2021.
  • Fan et al. [2023] Zicong Fan, Omid Taheri, Dimitrios Tzionas, Muhammed Kocabas, Manuel Kaufmann, Michael J. Black, and Otmar Hilliges. ARCTIC: A dataset for dexterous bimanual hand-object manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12943–12954, 2023.
  • Gafni et al. [2022] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-based text-to-image generation with human priors. In Computer Vision – ECCV 2022, pages 89–106, Cham, 2022. Springer Nature Switzerland.
  • Ghosh et al. [2021] Anindita Ghosh, Noshaba Cheema, Cennet Oguz, Christian Theobalt, and Philipp Slusallek. Synthesis of compositional animations from textual descriptions. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1396–1406, 2021.
  • Ghosh et al. [2023] Anindita Ghosh, Rishabh Dabral, Vladislav Golyanik, Christian Theobalt, and Philipp Slusallek. IMoS: Intent-driven full-body motion synthesis for human-object interactions. Computer Graphics Forum, 42(2):1–12, 2023.
  • Gu et al. [2023] Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, and Philip Torr. A systematic survey of prompt engineering on vision-language foundation models. arXiv preprint arXiv:2307.12980, 2023.
  • Gu et al. [2022] Shuyang Gu, Dong Chen, Jianmin Bao, Fang Wen, Bo Zhang, Dongdong Chen, Lu Yuan, and Baining Guo. Vector quantized diffusion model for text-to-image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10696–10706, 2022.
  • Guo et al. [2022] Chuan Guo, Shihao Zou, Xinxin Zuo, Sen Wang, Wei Ji, Xingyu Li, and Li Cheng. Generating diverse and natural 3D human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5152–5161, 2022.
  • Heusel et al. [2017] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  • Ho and Salimans [2021] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pages 6840–6851. Curran Associates, Inc., 2020.
  • Ho et al. [2022] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. J. Mach. Learn. Res., 23(1), 2022.
  • Hong et al. [2022] Fangzhou Hong, Mingyuan Zhang, Liang Pan, Zhongang Cai, Lei Yang, and Ziwei Liu. AvatarCLIP: Zero-shot text-driven generation and animation of 3D avatars. ACM Transactions on Graphics (TOG), 41(4):1–19, 2022.
  • Hu et al. [2022] Hezhen Hu, Weilun Wang, Wengang Zhou, and Houqiang Li. Hand-object interaction image generation. In Advances in Neural Information Processing Systems, pages 23805–23817. Curran Associates, Inc., 2022.
  • Huang et al. [2022] Mingzhen Huang, Supreeth Narasimhaswamy, Saif Vazir, Haibin Ling, and Minh Hoai. Forward propagation, backward regression and pose association for hand tracking in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Jian et al. [2023] Juntao Jian, Xiuping Liu, Manyi Li, Ruizhen Hu, and Jian Liu. AffordPose: A large-scale dataset of hand-object interactions with affordance-driven hand pose. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14713–14724, 2023.
  • Kapitanov et al. [2022] Alexander Kapitanov, Andrey Makhlyarchuk, and Karina Kvanchiani. HaGRID - hand gesture recognition image dataset. arXiv preprint arXiv:2206.08219, 2022.
  • Kim et al. [2023] Jihoon Kim, Jiseob Kim, and Sungjoon Choi. FLAME: Free-form language-based motion synthesis and editing. Proceedings of the AAAI Conference on Artificial Intelligence, 37(7):8255–8263, 2023.
  • Koller et al. [2015] Oscar Koller, Jens Forster, and Hermann Ney. Continuous sign language recognition: Towards large vocabulary statistical recognition systems handling multiple signers. Computer Vision and Image Understanding, 141:108–125, 2015.
  • Lin et al. [2018] Angela S. Lin, Lemeng Wu, Rodolfo Corona, Kevin Tai, Qixing Huang, and Raymond J. Mooney. Generating animated videos of human activities from natural language descriptions. Proceedings of the Visually Grounded Interaction and Language Workshop at NeurIPS, 2018(1), 2018.
  • Liu et al. [2020] Jun Liu, Amir Shahroudy, Mauricio Perez, Gang Wang, Ling-Yu Duan, and Alex C. Kot. NTU RGB+D 120: A large-scale benchmark for 3D human activity understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10):2684–2701, 2020.
  • Liu et al. [2022] Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In International Conference on Learning Representations (ICLR), 2022.
  • Liu and Chilton [2022] Vivian Liu and Lydia B. Chilton. Design guidelines for prompt engineering text-to-image generative models. In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 2022. Association for Computing Machinery.
  • Loper et al. [2015] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. In ACM Trans. Graphics (Proc. SIGGRAPH Asia), 2015.
  • Lu et al. [2023] Wenquan Lu, Yufei Xu, Jing Zhang, Chaoyue Wang, and Dacheng Tao. HandRefiner: Refining malformed hands in generated images by diffusion-based conditional inpainting, 2023.
  • Mahmood et al. [2019] Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. AMASS: Archive of motion capture as surface shapes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • Mansimov et al. [2016] Elman Mansimov, Emilio Parisotto, Jimmy Ba, and Ruslan Salakhutdinov. Generating images from captions with attention. In International Conference on Learning Representations (ICLR), 2016.
  • Meng et al. [2022] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
  • Midjourney [2023] Midjourney. https://www.midjourney.com, 2023.
  • Narasimhaswamy et al. [2019] Supreeth Narasimhaswamy, Zhengwei Wei, Yang Wang, Justin Zhang, and Minh Hoai. Contextual attention for hand detection in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019.
  • Narasimhaswamy etal. [2020]Supreeth Narasimhaswamy, Trung Nguyen, and Minh Hoai.Detecting hands and recognizing physical contact in the wild.In Advances in Neural Information Processing Systems, 2020.
  • Narasimhaswamy etal. [2022]Supreeth Narasimhaswamy, Thanh Nguyen, Mingzhen Huang, and Minh Hoai.Whose hands are these? hand detection and hand-body association in the wild.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Narasimhaswamy etal. [2023]Supreeth Narasimhaswamy, Uttaran Bhattacharya, Xiang Chen, Ish*ta Dasgupta, and Saayan Mitra.Text-to-hand-image generation using pose- and mesh-guided diffusion.In IEEE/CVF International Conference on Computer Vision (ICCV), International Workshop on Observing and Understanding Hands in Action, 2023.
  • Narasimhaswamy etal. [2024]Supreeth Narasimhaswamy, Huy Nguyen, Lihan Huang, and Minh Hoai.Hoist-former: Hand-held objects identification, segmentation, and tracking in the wild.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  • Nichol etal. [2022]Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen.Glide: Towards photorealistic image generation and editing with text-guided diffusion models.In International Conference on Machine Learning(ICML), 2022.
  • OpenAI [2023]OpenAI.Dall-E 3, https://openai.com/dall-e-3, 2023.
  • Oreshkin etal. [2022]BorisN Oreshkin, Florent Bocquelet, FelixG Harvey, Bay Raitt, and Dominic Laflamme.Protores: Proto-residual network for pose authoring via learned inverse kinematics.In The Tenth International Conference on Learning Representations (ICLR), 2022.
  • Parmar etal. [2022]Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu.On aliased resizing and surprising subtleties in GAN evaluation.In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, 2022.
  • Petrovich etal. [2021]Mathis Petrovich, MichaelJ. Black, and Gül Varol.Action-conditioned 3d human motion synthesis with transformer vae.In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 10985–10995, 2021.
  • Petrovich etal. [2022]Mathis Petrovich, MichaelJ. Black, and Gül Varol.TEMOS: Generating diverse human motions from textual descriptions.In European Conference on Computer Vision (ECCV), 2022.
  • Plappert etal. [2016]Matthias Plappert, Christian Mandery, and Tamim Asfour.The kit motion-language dataset.Big data, 4(4):236–252, 2016.
  • Prokudin etal. [2019]Sergey Prokudin, Christoph Lassner, and Javier Romero.Efficient learning on point clouds with basis point sets.In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2019.
  • Punnakkal etal. [2021]AbhinandaR. Punnakkal, Arjun Chandrasekaran, Nikos Athanasiou, Alejandra Quiros-Ramirez, and MichaelJ. Black.Babel: Bodies, action and behavior with english labels.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 722–731, 2021.
  • Radford etal. [2021]Alec Radford, JongWook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever.Learning transferable visual models from natural language supervision.In International Conference on Machine Learning (ICML), 2021.
  • Ramesh etal. [2021]Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever.Zero-shot text-to-image generation.In Proceedings of the 38th International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  • Rombach etal. [2022]Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer.High-resolution image synthesis with latent diffusion models.In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
  • Romero etal. [2017]Javier Romero, Dimitrios Tzionas, and MichaelJ. Black.Embodied hands: Modeling and capturing hands and bodies together.ACM Transactions on Graphics, (Proc. SIGGRAPH Asia), 2017.
  • Saharia etal. [2022]Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, EmilyL Denton, Kamyar Ghasemipour, Raphael GontijoLopes, Burcu KaragolAyan, Tim Salimans, Jonathan Ho, DavidJ Fleet, and Mohammad Norouzi.Photorealistic text-to-image diffusion models with deep language understanding.In Advances in Neural Information Processing Systems(NeurIPS), 2022.
  • Schuhmann etal. [2022]Christoph Schuhmann, Romain Beaumont, Richard Vencu, CadeW Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, SrivatsaR Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev.LAION-5b: An open large-scale dataset for training next generation image-text models.In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • Shilkrot etal. [2019]Roy Shilkrot, Supreeth Narasimhaswamy, Saif Vazir, and Minh Hoai.WorkingHands: A hand-tool assembly dataset for image segmentation and activity mining.In Proceedings of British Machine Vision Conference, 2019.
  • Sohl-Dickstein etal. [2015]Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli.Deep unsupervised learning using nonequilibrium thermodynamics.In Proceedings of the 32nd International Conference on Machine Learning, pages 2256–2265, Lille, France, 2015. PMLR.
  • Sun etal. [2021]Yu Sun, Qian Bao, Wu Liu, Yili Fu, Black MichaelJ., and Tao Mei.Monocular, One-stage, Regression of Multiple 3D People.In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
  • Taheri etal. [2022]Omid Taheri, Vasileios Choutas, MichaelJ. Black, and Dimitrios Tzionas.Goal: Generating 4d whole-body motion for hand-object grasping.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13263–13273, 2022.
  • Taheri etal. [2023]Omid Taheri, Yi Zhou, Dimitrios Tzionas, Yang Zhou, Duygu Ceylan, Soren Pirk, and MichaelJ Black.Grip: Generating interaction poses using latent consistency and spatial cues.arXiv preprint arXiv:2308.11617, 2023.
  • Tao etal. [2022]Ming Tao, Hao Tang, Fei Wu, Xiao-Yuan Jing, Bing-Kun Bao, and Changsheng Xu.Df-gan: A simple and effective baseline for text-to-image synthesis.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16515–16525, 2022.
  • Tevet etal. [2023]Guy Tevet, Sigal Raab, Brian Gordon, Yoni Shafir, Daniel Cohen-or, and AmitHaim Bermano.Human motion diffusion model.In The Eleventh International Conference on Learning Representations (ICLR), 2023.
  • Ultralytics [2023]Ultralytics.YOLOv8, https://github.com/ultralytics/ultralytics, 2023.
  • Xu etal. [2018]Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He.Attngan: Fine-grained text to image generation with attentional generative adversarial networks.In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • Yang etal. [2023]Zhengyuan Yang, Jianfeng Wang, Zhe Gan, Linjie Li, Kevin Lin, Chenfei Wu, Nan Duan, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang.Reco: Region-controlled text-to-image generation.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14246–14255, 2023.
  • Ye etal. [2021]Hui Ye, Xiulong Yang, Martin Takac, Rajshekhar Sunderraman, and Shihao Ji.Improving text-to-image synthesis using contrastive learning.The 32nd British Machine Vision Conference (BMVC), 2021.
  • Youwang etal. [2022]Kim Youwang, Kim Ji-Yeon, and Tae-Hyun Oh.Clip-actor: Text-driven recommendation and stylization for animating human meshes.In European Conference on Computer Vision (ECCV), 2022.
  • Zhang etal. [2020]Fan Zhang, Valentin Bazarevsky, Andrey Vakunov, Andrei Tkachenka, George Sung, Chuo-Ling Chang, and Matthias Grundmann.Mediapipe hands: On-device real-time hand tracking.arXiv preprint arXiv:2006.10214, 2020.
  • Zhang etal. [2021]Han Zhang, JingYu Koh, Jason Baldridge, Honglak Lee, and Yinfei Yang.Cross-modal contrastive learning for text-to-image generation.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 833–842, 2021.
  • Zhang etal. [2023]Lvmin Zhang, Anyi Rao, and Maneesh Agrawala.Adding conditional control to text-to-image diffusion models.In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3836–3847, 2023.
  • Zhang etal. [2022]Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu.Motiondiffuse: Text-driven human motion generation with diffusion model.arXiv preprint arXiv:2208.15001, 2022.
  • Zheng etal. [2023]Jiangbin Zheng, Yile Wang, Cheng Tan, Siyuan Li, Ge Wang, Jun Xia, Yidong Chen, and StanZ. Li.Cvt-slr: Contrastive visual-textual transformation for sign language recognition with variational alignment.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23141–23150, 2023.
  • Zhou etal. [2022]Keyang Zhou, BharatLal Bhatnagar, JanEric Lenssen, and Gerard Pons-Moll.Toch: Spatio-temporal object-to-hand correspondence for motion refinement.In European Conference on Computer Vision (ECCV), 2022.
  • Zhu etal. [2019]Minfeng Zhu, Pingbo Pan, Wei Chen, and Yi Yang.Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis.In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

6 Additional Results

We show additional qualitative comparisons between the Stable Diffusion[56] baseline and the proposed \modelname in Fig.9 and Fig.10.

We also show additional qualitative results and intermediate outputs from the two components of \modelname in Fig.11 and Fig.12.

[Figures 9–12: additional qualitative results and intermediate outputs of \modelname; images omitted.]
