AI Art Weekly #119
Hello, my fellow dreamers, and welcome to issue #119 of AI Art Weekly! 👋
After last week’s Gemini 2 multimodality, OpenAI finally released GPT-4o’s native image generation in ChatGPT, a feature they first previewed almost a year ago.
GPT-4o, the model itself, can now generate images. This is not to be confused with DALL-E, a separate image model which GPT-4o only called in the background up until now.
And oh boy, is it good. Slow and heavily restricted, but very good. It instantly caused a “ghiblify” trend on X after launch. Never before have I seen the entire feed filled with a single trend.
4o is capable of so much more though. Making comics, movie posters, YouTube thumbnails, turning illustrations photorealistic, transparent illustrations, ads, and basically any image task one can imagine that until now required separate models, workflows, and pipelines. Most of them are now just one prompt away.
Buckle up, because this is the power of multimodality.
Support the newsletter and unlock the full potential of AI-generated art with my curated collection of 275+ high-quality Midjourney SREF codes and 2000+ creative prompts.
News & Papers
3D
DIDiffGes: Decoupled Semi-Implicit Diffusion Models for Real-time Gesture Generation from Speech
DIDiffGes can generate high-quality gestures from speech in just 10 sampling steps.

DIDiffGes example
PGC: Physics-Based Gaussian Cloth from a Single Pose
PGC can create simulation-ready garments from a single static pose. It captures fine details using 3D Gaussian splats and allows accurate garment simulation on new poses with minimal data.

PGC example
PhysGen3D: Crafting a Miniature Interactive World from a Single Image
PhysGen3D can turn a single image into an interactive 3D scene. It allows users to control object speed and material properties while ensuring the scene looks realistic and behaves like it would in real life.

PhysGen3D example
Liv3Stroke: Recovering Dynamic 3D Sketches from Videos
Liv3Stroke can reconstruct dynamic 3D sketches from videos using flexible 3D strokes.

Liv3Stroke example
Dance Like a Chicken: Low-Rank Stylization for Human Motion Diffusion
LoRA-MDM can generate stylized human motions in different styles, like “Chicken,” by using a few reference samples with a motion diffusion model. It allows for style blending and motion editing while keeping a good balance between text fidelity and style consistency.

Dance Like a Chicken example
DeClotH: Decomposable 3D Cloth and Human Body Reconstruction from a Single Image
DeClotH can reconstruct 3D cloth and human bodies from a single image.

DeClotH example
LookCloser: Frequency-aware Radiance Field for Tiny-Detail Scene
LookCloser can capture the full structure and fine details of scenes in one NeRF model. It allows for real-time rendering and interaction.

LookCloser example
TaoAvatar: Real-Time Lifelike Full-Body Talking Avatars for Augmented Reality via 3D Gaussian Splatting
TaoAvatar can create high-quality, real-time full-body talking avatars from multiple camera angles. It runs at 90 frames per second on devices like the Apple Vision Pro and allows detailed control over facial expressions and body movements, making it great for e-commerce live streaming and holographic communication.

TaoAvatar example
SV4D 2.0: Enhancing Spatio-Temporal Consistency in Multi-View Video Diffusion for High-Quality 4D Generation
SV4D 2.0 can generate high-quality 4D models and videos from a reference video.

SV4D 2.0 example
DiffPortrait360: Consistent Portrait Diffusion for 360 View Synthesis
DiffPortrait360 can create high-quality 360-degree views of human heads from single images.

DiffPortrait360 example
Image
LeX-Art: Rethinking Text Generation via Scalable High-Quality Data Synthesis
LeX-Art can generate high-quality text-image pairs with better text rendering and design. It uses a prompt enrichment model called LeX-Enhancer and two optimized models, LeX-FLUX and LeX-Lumina, to improve color, position, and font accuracy.

A picture of a neatly arranged stack of wooden blocks with colorful designs, with the text on it: "Learn", "Play", "Grow", "Build", "Create", "Explore", "Think", "Solve", "Imagine"
SISO: Single Image Iterative Subject-driven Generation and Editing
SISO can generate and edit images using just one subject image without any training. It improves image quality, keeps the subject clear, and preserves the background better than other methods.

SISO editing example
MagicColor: Multi-Instance Sketch Colorization
MagicColor can automatically colorize multi-instance sketches while keeping colors consistent across objects using reference images.

MagicColor example
Video
AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset
AccVideo can speed up video diffusion models by reducing the number of steps needed for video creation. It achieves an 8.5x faster generation speed compared to HunyuanVideo, producing high-quality videos at 720x1280 resolution and 24fps, which makes text-to-video generation way more efficient.

AccVideo example
PP-VCtrl: Enabling Versatile Controls for Video Diffusion Models
PP-VCtrl can turn text-to-video models into customizable video generators. It uses control signals like Canny edges and segmentation masks to improve video quality and control without retraining the models, making it great for character animation and video editing.

PP-VCtrl example
Video Motion Graphs
Video Motion Graphs can generate realistic human motion videos by combining clips from a reference video with music or motion tags. It uses HMInterp for smooth transitions and high-quality video, improving how we create multi-modal human motion videos. Pretty smart.

Video Motion Graphs example
Synthetic Video Enhances Physical Fidelity in Video Synthesis
This method improves the realism of video generation by training on synthetic videos that follow real-world physics. It reduces unwanted artifacts and ensures 3D consistency, making generated videos more lifelike.

A woman squats, launches into a backflip, and lands gracefully on her feet on sunny grassland
FullDiT: Multi-Task Video Generative Foundation Model with Full Attention
FullDiT can generate videos with control over camera angles, identities, depth, and text. It uses a unified full-attention mechanism to improve video content creation and achieve top results in multi-task video generation.

FullDiT example
Mask²DiT: Dual Mask-based Diffusion Transformer for Multi-Scene Long Video Generation
Mask²DiT can generate long videos with multiple scenes by aligning video segments with text descriptions.

This video demonstrates auto-regressive scene extension, where the model generates the third 6-second scene conditioned on the first two 6-second scenes (12s in total) as context.
HunyuanPortrait: Implicit Condition Control for Enhanced Portrait Animation
HunyuanPortrait can animate characters from a single portrait image by using facial expressions and head poses from video clips. It achieves lifelike animations with high consistency and control, effectively separating appearance and motion.

HunyuanPortrait example
DisentTalk: Cross-lingual Talking Face Generation via Semantic Disentangled Diffusion Model
DisentTalk can generate high-quality talking face animations from audio. It ensures accurate lip synchronization and natural expressions while maintaining smooth movement over time.

DisentTalk examples
Audio
MusiCoT: Analyzable Chain-of-Musical-Thought Prompting for High-Fidelity Music Generation
MusiCoT can generate high-quality music by first outlining its structure and then producing audio tokens. It supports variable-length audio inputs and performs well in both objective and subjective measures, effectively addressing copying issues.

MusiCoT flow
Also interesting
There is just so much 4o can do, so here are a few more examples for inspiration.
@venturetwins turned The Office into a Ghibli-style anime using GPT-4o and Hedra.
@perrymetzger made a fake O’Reilly coding book cover about “vibe coding”.
4o is pretty good at text rendering. @rayisdoingfilm made some cool fake posters for the “Reality Control Division”.

“Touch of Shoggoth” created with GPT-4o
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. If you like what I do, you can support me by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying my Midjourney prompt collection on PROMPTCACHE 🚀
- Buying access to AI Art Weekly Premium 👑
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!
– dreamingtulpa