DreamID-Omni

Unified Framework for Controllable Human-Centric Audio-Video Generation

* Equal contribution, Project lead, § Corresponding author

This work is purely academic and non-commercial.
Demo reference images/videos are from public domains or AI-generated.
If you have copyright concerns, please contact us and we will remove the relevant content.

Research Paper DreamID-Omni GitHub DreamID-V GitHub

A short demo video by DreamID-Omni. Please turn on the sound while watching.

Human-Reference Audio-Video Generation (R2AV)
Given reference images and voice timbres, generate synchronized video and audio content. Please turn on the sound while watching. Hover over a video to reveal its text prompt.
A modern gym with bright overhead lights. sub1 is a woman with thick dark wavy hair tied into a practical ponytail. She wears a fitted black sports top, lightweight jacket unzipped. sub1 Lift the dumbbells in the gym and then lower them, then talking. sub1 keeps eye contact between breaths, speaks with determination, and says, "I am very tired now, but I don't intend to stop.".
Timbre:
A lantern-lit riverside kitchen stall at night with light rain and wind; steam rises from a wok and red lanterns sway behind him. sub1 is a man with short dark hair. He wears a dark red fitted jacket with black bracers and a thin pendant necklace. sub1 stir-fries in a wok, flips the food, then turns his head toward the camera. sub1 looks directly at the camera, speaks firmly, and says,"Trust my cooking skills. You will definitely be amazed.".
Timbre:
A lively open-kitchen café at night; stove flames flare, steam rises, and warm pendant lights swing slightly as staff move behind her. sub1 is a young woman with thick dark wavy hair and a side part. She wears a fitted black top under a light apron, a thin gold chain necklace, and small stud earrings. sub1 tastes the sauce with a spoon, then turns her face toward the camera while still holding the spoon, her expression shifting from focused to conflicted. sub1 maintains eye contact, swallows as if choosing her words, and says, "I keep telling myself I am fine,but some nights it feels like I am just performing calm.".
Timbre:
The scene is set in a clothing store with artificial lighting. Racks of clothes are visible in the background, creating a retail environment. The shot is an upper-body close-up. sub1 is wearing a blue tank top, colorful bracelets on his wrist, and a necklace with multiple beads. sub1 holds up a pink tank top on a hanger and speaks directly to the camera, his expression engaged as if providing a review or presentation. With an engaging expression, sub1 looks at the camera and says, "This tank top is exactly what you need for this summer. Place your order now!".
Timbre:
A well-lit room serving as a streaming or recording studio, illuminated with vibrant neon lights in shades of purple and blue. A curtain and a lamp are visible in the background, creating a cozy yet energetic atmosphere. The shot is an upper-body close-up. sub1, a man with a beard and short dark hair, wears his blue collared shirt and a pair of red headphones. sub1 is seated in front of a professional microphone, leaning in slightly. sub1's expression is engaged and friendly, indicating sub1 is actively speaking or singing. sub1 looks directly at the camera with a welcoming expression, speaking clearly into the microphone, and says, "Hey everyone, welcome back to the channel! It's great to have you all here today.".
Timbre:
The scene is set on open plains stretching into the distance, illuminated by the warm glow of the setting sun. The static camera shot is an upper-body close-up. sub1 is a rugged man with a reddish mustache and goatee, a tanned, weathered face, and a defiant smirk. sub1 wears a brown leather jacket over a slightly dusty beige shirt. sub1 faces slightly to the left, holding a revolver in his right hand. As the video progresses, sub1 raises the revolver towards the camera, his smirk unwavering. While raising the revolver, sub1 looks towards the camera and speaks with a defiant tone, "Sorry, I'm going to end your life right now.".
Timbre:
A medium shot set against a dark, indistinct background. The primary light source is a flickering torch, casting warm, golden light and creating dynamic shadows. sub1, whose features match the reference image, is wearing a straw hat. sub1 holds a lit torch, which illuminates his face as he speaks directly to the camera. As the torch flame flickers, sub1 looks forward with a steady gaze and says, "Don't be afraid of the darkness. There are always ways to illuminate it.".
Timbre:
The setting is a playful room with colorful toys scattered on a soft rug. Sunlight streams through a nearby window, creating a lighthearted and fun atmosphere. sub1 is a baby with blonde hair and blue eyes, wearing a bright superhero cape that flutters in the light. sub1 stands confidently with arms raised in a powerful pose. sub1 has a determined look, with wide eyes and lips pursed in concentration, as if ready to take on a challenge in an adorable attempt at bravery. With a concentrated and heroic expression, sub1 looks forward as if truly ready to save the day and says, "I'm a superhero, let me handle this urgent mess now! leave it to me.".
Timbre:
The scene is set inside a moving vintage car, with a steering wheel and rearview mirror visible. The atmosphere is tense and serious, suggesting a high-stakes situation or an important mission. sub1 is a man with fair skin and graying hair. sub1 wears a black fedora hat, dark sunglasses, and a black suit with a white shirt underneath. sub1 sits in the driver's seat, looking towards the left of the frame, off-camera. expression is serious and determined as sub1 speaks. With a serious and determined expression, sub1 looks to the left of the frame and speaks, "Stick to the plan. keep to our original schedule firmly. We will complete it.".
Timbre:
A high-end fashion magazine's editorial office, with mood boards and designer sketches on the walls. sub1 is the editor-in-chief, is on the left. sub2 is a senior stylist, is on the right. sub1 Hand over a document to sub2 and speaking. sub2 took the document and listens attentively, nodding in agreement. sub1 looks serious and suggests, "We should start planning the cover shoot soon." sub2 nods intently and replies, " I've already been thinking about some options." as she smiles slightly.
Timbre 1:
Timbre 2:
A sunlit, modern cafe with minimalist decor. A slow pan reveals a bustling street outside the large window. sub1 is on the left, wearing a simple dark sweater. sub2 is on the right, wearing a light-colored polo shirt. They are sitting across from each other at a small wooden table. sub1 holds a coffee cup and gestures thoughtfully while speaking. sub2 leans forward, listening intently with an engaged expression, then replies. sub1 looks at sub2 with a focused gaze and says, "Simplicity is the ultimate sophistication." sub2 nods eagerly and replies, "I completely agree, it's about the core idea." while a look of inspiration dawns on his face.
Timbre 1:
Timbre 2:
A university alumni event, with banners and other attendees mingling in the softly blurred background. The shot is a medium close-up. sub1, a professor emeritus, is on the left. sub2, a student wearing the graduation cap and gown, is on the right. Both are framed from the chest up, faces each occupying about 1/5 of the screen. sub1 began to shake hands with sub2, sub1 is speaking,and sub2 listens attentively and then smiles radiantly as she responds. sub1 offers a gentle smile and asks, "Congratulations on your graduation!" sub2 nods happily and replies, "Thank you so much!"
Timbre 1:
Timbre 2:
A cozy living room in the afternoon, with warm sunlight filtering through a window. sub1 is on the left, wearing a casual grey knit sweater. sub2 is on the right, in a simple white blouse. They are sitting comfortably on a sofa. sub1 is looking at sub2 and speaking with a gentle smile. sub2 listens intently, her expression warm and receptive, before laughing softly and replying. sub1 looks at sub2 and says warmly, "That's a very funny story." sub2 laughs and responds, "I have many more to share."
Timbre 1:
Timbre 2:
A bright, sunny day on a mountain hiking trail. Lush green trees and a clear blue sky are visible in the background. sub1 is on the left, wearing a blue hiking shirt. sub2 is on the right, wearing a dark athletic tank top. sub1 looks at sub2 with a puzzled and slightly worried expression. sub2 maintains a calm, steady gaze, looking forward. sub1 furrows brow and asks, "Are you sure this is the right path?" sub2 gives a slight, reassuring nod and replies firmly, "Absolutely. The viewpoint is just ahead."
Timbre 1:
Timbre 2:
A quiet, dimly lit study or private lounge area with warm ambient lighting. sub1 stands in the center wearing her black dress with the gold bow. sub2 stands to her left in his tuxedo. sub3 stands to her right in his leather jacket. sub1 looks excitedly between the two men. sub2 listens attentively with a gentle smile. sub3 maintains a serious but interested demeanor, nodding slightly. sub1 smiles broadly and says, "I discovered something fascinating." sub2 leans in slightly and responds, "I'd love to hear more." sub3 looks at sub1 intently and says, "Please share. This could be valuable." while nodding.
Timbre 1:
Timbre 2:
Timbre 3:
Human-Reference Video Editing (RV2AV)
Edit the identity and voice in a source video based on reference images and timbres. Please turn on the sound while watching. Hover over a video to reveal its text prompt.
Original video
The scene is set in a brightly lit, modern office environment. Large, out-of-focus windows are visible in the background, suggesting a high-rise building.sub1 is a man wearing thick-rimmed glasses. sub1's attire consists of a brown plaid suit jacket, a black vest, a white collared shirt, and a dark red and black patterned tie sub1 is actively speaking, with sub1's mouth open.sub1 speaks with a formal and deliberate tone. sub1 explains: "since their impending merger with BMC.".
Timbre:
Original video
The scene takes place in a modern indoor setting, characterized by cool, blue - toned lighting that creates a serious atmosphere. The background is softly out of focus. The main subject is sub1 wearing a professional dark blue blazer over a simple white t - shirt. The shot is a medium close - up focused on sub1 as she is in the middle of a heated conversation. Her facial expressions are highly emotive. The scene opens with a close - up on sub1, her expression one of serious concern as her lip movements match her spoken words Her emotion then abruptly escalates into pure, wide - eyed disbelief and anger.her expression a mask of outrage. "in today, You used your child as bait for a monster.".
Timbre:
Original video
The scene is a close-up shot, framed over the right shoulder of an unseen person. The background is dimly lit and out of focus, with some blurred lights suggesting an indoor. sub1 is wearing a white, ribbed turtleneck sweater and has a lanyard around sub1's neck.sub1 is actively speaking to the person in front of sub1, maintaining direct and unwavering eye contact. In a tight, over-the-shoulder shot, sub1 with a serious expression looks intently at the person sub1 is speaking to. sub1 states with conviction: "dried blood on the outside of that bag." sub1's brow slightly furrowed, concluding with absolute certainty, "I'm certain of it." The combination of sub1's unwavering stare, determined tone."
Timbre:
Original video
The scene is set outdoors during the day, with soft, natural lighting. The background is slightly out of focus, featuring a wooden structure and greenery, suggesting an outdoor market or park setting. sub1 is a woman with brown hair, wearing a dark red jacket over a patterned top. sub1 is actively speaking. Her facial expression is serious and intent, and her lip movements are clear and correspond to her speech. The camera focuses in a close-up on sub1, capturing her serious and direct expression. she said "I do need you to take a closer look at Buck."
Timbre:
Original video
The scene is a medium close - up shot set in what appears to be a dimly lit, professional office. sub1 is a man with short hair. wearing blue T-shirt. sub1's expression is calm but serious. sub1 is speaking directly to someone,with sub1's lip movements perfectly synchronized with sub1's speech. sub1 with a professional demeanor looks directly forward and speaks in a measured, calm tone. sub1 says, "I'm just introducing the idea. Something for you to think about."
Timbre:
Original video
The scene is a close-up shot set within a futuristic interior, likely a spaceship corridor or room. sub1 has brown hair. wearing a sports short-sleeved shirt. sub1 is actively speaking, facial expressions shifting to convey a sense of genuine sincerity and apology. sub1 initially looks slightly down before raising sub1's gaze to make direct eye contact with someone presumably just off - camera. sub1's lip movements are distinct and match the audible speech. The camera holds a close - up on sub1 as sub1 speaks with a sincere and apologetic expression. he says, "I know, I I really am genuinely sorry, Kel, that sucks."
Timbre:
Human-Reference Audio-Driven Video Animation (RA2V)
Given a reference image and driving audio, generate a synchronized video animation. Please turn on the sound while watching. Hover over a video to reveal its text prompt.
A warm, soft light illuminates sub1 from the front. The main subject sub1 has short, curly, light brown hair and a light beard. sub1 is dressed in a black suit, a white collared shirt, and a dark, subtly patterned tie. sub1 is standing and speaking directly into a black microphone positioned in front of him. sub1 looks directly forward and speaks with a serious and respectful tone into the microphone. "Today he receives the silver star for bravery and valor."
Audio:
The scene is set in a dimly lit, enclosed space. The background is out of focus, suggesting a large, industrial environment. sub1, a middle - aged Caucasian man with short, thinning hair. He is wearing a blue and grey digital camouflage uniform and a grey neck gaiter. Small blue wired earbuds are visible in his ears. His face is slightly flushed and appears sweaty, with an intense expression. "Nice work. Tell DCA, get a fire team."
Audio:
The main subject is sub1, who has long, straight, dark hair parted in the middle. sub1 is wearing a light grey, long - sleeved crewneck shirt. sub1's expression is serious, concerned, and focused, said"really bad guy, someone who might be threatening girls with scissors or a knife."
Audio:
The scene is set in a dimly lit room. sub1 has long blonde hair and is wearing a light-colored collared shirt. sub1's facial expression is serious and concerned, with sub1's eyebrows slightly furrowed. sub1 is looking down, sub1's gaze fixed on something just below the frame, presumably a computer screen, sub1 said "increasingly powerful bursts of aggression, uh, persecution, anxiety."
Audio:
"The scene is filmed outdoors The background is softly blurred with indistinct greenery. The primary subject is sub1 (the man with shoulder - length dark brown hair, green eyes, and a tense, concerned facial expression). He is wearing a beige jacket over a white, patterned collared shirt. On the right side of the frame, only the back of the head and shoulder of the person he is addressing are visible. sub1 is looking intently at the other person. He looks directly at the person in front of him, he speaks, "about you. About how you're changing."
Audio:
"The scene takes place in a brightly lit, spacious indoor setting.sub1 is wearing a light blue, unbuttoned collared shirt over a white ribbed tank top, with a thin silver chain around his neck. He states, "cash. He was supposed to come back the next day for his shirt." He continues, "But get this. He never showed up. It was his wedding shirt."
Audio:
Our Framework
Overview of the DreamID-Omni framework. We integrate reference-based generation (R2AV), editing (RV2AV), and animation (RA2V) using a Symmetric Conditional DiT trained with a multi-task progressive training strategy. Structured Caption and Syn-RoPE ensure robust dual-level disentanglement in multi-person scenarios.
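The demo prompts on this page appear to share a common layout: a scene description, then per-person appearance and action sentences tagged `sub1`, `sub2`, and so on, then each person's quoted speech. The sketch below assembles a prompt in that layout and lists the subject tags a prompt mentions. It is only an illustration under stated assumptions: `Subject`, `build_prompt`, and `subject_ids` are hypothetical names that are not part of the DreamID-Omni code, and whether this layout matches the Structured Caption format used in training is an assumption.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Subject:
    """One referenced person. Field names are hypothetical, for illustration only."""
    appearance: str  # e.g. "is on the left, wearing a simple dark sweater"
    action: str      # e.g. "holds a coffee cup and gestures while speaking"
    speech: str      # the spoken line, without surrounding quotes


def _sentence(text: str) -> str:
    """Strip whitespace and add a final period unless the text already ends a sentence."""
    text = text.strip()
    return text if text.endswith((".", "!", "?", '"')) else text + "."


def build_prompt(scene: str, subjects: List[Subject]) -> str:
    """Assemble scene, appearances, actions, then quoted speech, tagging people as sub1, sub2, ..."""
    tagged = [(f"sub{i}", s) for i, s in enumerate(subjects, start=1)]
    sentences = [scene]
    sentences += [f"{tag} {s.appearance}" for tag, s in tagged]
    sentences += [f"{tag} {s.action}" for tag, s in tagged]
    sentences += [f'{tag} says, "{s.speech.strip()}"' for tag, s in tagged]
    return " ".join(_sentence(x) for x in sentences)


def subject_ids(prompt: str) -> List[int]:
    """Return the sorted subject indices mentioned in a prompt (sub1 -> 1)."""
    return sorted({int(n) for n in re.findall(r"\bsub(\d+)\b", prompt)})


# Example usage with text borrowed from the cafe demo prompt.
prompt = build_prompt(
    "A sunlit, modern cafe with minimalist decor",
    [
        Subject("is on the left, wearing a simple dark sweater",
                "holds a coffee cup and gestures thoughtfully while speaking",
                "Simplicity is the ultimate sophistication."),
        Subject("is on the right, wearing a light-colored polo shirt",
                "leans forward, listening intently",
                "I completely agree, it's about the core idea."),
    ],
)
print(prompt)
print(subject_ids(prompt))
```

A check such as comparing `len(subject_ids(prompt))` with the number of supplied timbre clips could catch a mismatch between the people named in a prompt and the reference voices before generation is started.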
R2AV Comparison
Compared with other methods, ours shows strong text following, subject preservation, and binding of each identity to its timbre in multi-person scenarios.
A sunlit, modern cafe with minimalist decor. A slow pan reveals a bustling street outside the large window. sub1 is on the left, wearing a simple dark sweater. sub2 is on the right, wearing a light-colored polo shirt. They are sitting across from each other at a small wooden table. sub1 holds a coffee cup and gestures thoughtfully while speaking. sub2 leans forward, listening intently with an engaged expression, then replies. sub1 looks at sub2 with a focused gaze and says, "Simplicity is the ultimate sophistication." sub2 nods eagerly and replies, "I completely agree, it's about the core idea." while a look of inspiration dawns on his face.
Ours
Timbre 1:
Timbre 2:
Wan2.6
Qwen-Image+Ovi
Qwen-Image+LTX-2
Phantom
VACE
A bright, comfortable meeting space with soft lighting. sub1 stands on the left wearing a white lace top. sub2 is positioned on the right, smiling with her hand near her chin. sub3 stands centrally between them, wearing a purple scarf and gold locket. The group is engaged in a serious discussion. sub1 looks at the others with urgency, sub2 nods with a thoughtful smile, and sub3 looks forward decisively. sub1 looks at her companions and says, "We need to decide soon." turning slightly, sub2 smiles warmly and replies, "Option A gives us flexibility." nodding in agreement, sub3 states firmly, "I agree. Let's go with that." as the group settles on the decision.
Ours
T1:
T2:
T3:
Wan2.6
Qwen-Image+Ovi
Qwen-Image+LTX-2
Phantom
VACE
RV2AV Comparison
Please turn on the sound while watching. Compared with other methods, ours shows strong text following, subject preservation, and audio-visual synchronization.
Original video
The scene is set in a brightly lit, modern office environment. Large, out-of-focus windows are visible in the background, suggesting a high-rise building.sub1 is a man wearing thick-rimmed glasses. sub1's attire consists of a brown plaid suit jacket, a black vest, a white collared shirt, and a dark red and black patterned tie sub1 is actively speaking, with sub1's mouth open.sub1 speaks with a formal and deliberate tone. sub1 explains: "since their impending merger with BMC.".
Timbre:
Ours
HunyuanCustom
VACE
Original video
The scene is set outdoors during the day, with soft, natural lighting. The background is slightly out of focus, featuring a wooden structure and greenery, suggesting an outdoor market or park setting. sub1 is a woman with brown hair, wearing a dark red jacket over a patterned top. sub1 is actively speaking. Her facial expression is serious and intent, and her lip movements are clear and correspond to her speech. The camera focuses in a close-up on sub1, capturing her serious and direct expression. she said "I do need you to take a closer look at Buck."
Timbre:
Ours
HunyuanCustom
VACE
RA2V Comparison
Please turn on the sound while watching. Compared with other methods, ours shows strong text following, subject preservation, and audio-visual synchronization.
Ours
The scene is set in a dimly lit, dark environment.sub1, with long, wavy brown hair, a man with dark hair, wearing a grey collared shirt with a visible blue name tag, is partially in focus. sub1 is actively speaking, looking slightly upward as if addressing someone or a group. The man behind her remains still, listening intently.sub1 said: "How about how I feel now, seeing my name on a list of people who should be dead?" The man in the background listens with a grim expression, reinforcing the seriousness of the revelation.
Audio:
HunyuanCustom
Humo
Ours
The scene is set in a dimly lit. On the left, sub1 with styled, highlighted blondish-brown hair and facial scruff is wearing a simple black t-shirt. On the right, seen from over his shoulder and partially out of focus, is a younger man wearing a black and white patterned shirt. sub1 is actively engaged in a conversation, looking at the younger man. he speaks "Jackson, I do not do that."
Audio:
HunyuanCustom
Humo

Ethical Considerations

The reference images and audio used in these demo videos are sourced from public domains or generated by models, and are intended solely to demonstrate the capabilities of this research. If you have any concerns, please contact us (guo-x24@mails.tsinghua.edu.cn) and we will remove the content promptly.