AI Art Weekly #76
Hello there, my fellow dreamers, and welcome to issue #76 of AI Art Weekly! 👋
Just this morning I skimmed through a whopping 230+ papers and projects, and let me tell you, we aren’t prepared for the exponentials that are about to hit us. Robots, AI superchips, LLM-powered operating systems, an explosion in 3D content: things are ramping up hard.
This issue is the biggest one yet. It’s about 4 times as big as usual. So if you can, please consider supporting this newsletter by buying me a coffee or becoming a monthly supporter.
Due to the sheer amount of content this week, I’ve split the news section into categories. In this issue we cover:
- 3D generation and texturing: 13 different methods for text and image-to-3D, InTeX, TexDreamer, GaussianFlow
- Image generation: SD3 Turbo, LightIt, OMG, YOSO, FouriScale, Desigen
- Image editing: StyleSketch, Wear-Any-Way, DiffCriticEdit, Magic Fixup, DesignEdit, ReNoise
- Video generation: AnimateDiff-Lightning, StyleCineGAN, Time Reversal, Mora, VSTAR
- Video editing: FRESCO, AnyV2V, MOTIA
- and more!
Want me to keep up with AI for you? Well, that requires a lot of coffee. If you like what I do, please consider buying me a cup so I can stay awake and keep doing what I do 🙏
Cover Challenge 🎨
For next week’s cover I’m looking for Ostara submissions! The reward is again $50 plus a rare role in our Discord community that lets you vote in the finals. The rulebook can be found here and images can be submitted here.
News & Papers
3D generation and texturing
Text-to-3D and Image-to-3D
As I said in the intro, 3D content is about to explode. Just this week we had 13 papers on text and image-to-3D object reconstruction alone. As they’re all somewhat similar, I’m not going to dissect them all. Instead, I’ll just list them here:
- SV3D: Stability AI released a new model for high-resolution, image-to-3D reconstruction.
- LATTE3D: NVIDIA’s new text-to-3D method that robustly generates high-quality textured meshes from text in just 400ms.
- Isotropic3D: Image-to-3D Generation Based on a Single CLIP Embedding.
- MVControl: Text-to-3D with ControlNet like conditioning (canny, depth, scribble, etc.).
- Make-Your-3D: Image-to-3D with the ability to control generation with a text prompt.
- MVEdit: Supports text-to-3D, image-to-3D, and 3D-to-3D with texture generation.
- VFusion3D: Image-to-3D from Video Diffusion Models.
- GVGEN: Text-to-3D Generation with Volumetric Representation.
- GRM: High-quality, efficient text-to-3D and image-to-3D in 100ms.
- FDGaussian: Image-to-3D with Gaussian Splatting.
- Ultraman: Image-to-3D with a focus on human avatars.
- Sculpt3D: More text-to-3D.
- ComboVerse: More image-to-3D.
InTeX: Interactive Text-to-Texture Synthesis via Unified Depth-aware Inpainting
Now that we have a gazillion options to generate 3D objects, we might want to have more control over the textures. InTeX helps with that by generating and inpainting textures from text.
TexDreamer: Towards Zero-Shot High-Fidelity 3D Human Texture Generation
And another one! TexDreamer is a high-fidelity 3D human texture generation model that supports both text and image inputs.
GaussianFlow: Splatting Gaussian Dynamics for 4D Content Creation
Image-to-3D is cool. But what about video-to-4D? GaussianFlow can generate 4D Gaussian Splatting fields from monocular videos (like Sora).
Image generation
Stable Diffusion 3 Turbo
Stable Diffusion 3 hasn’t even been released yet, and Stability has already announced its Turbo version. This is SD3 but faster: think SDXL quality in just 4 steps.
LightIt: Illumination Modeling and Control for Diffusion Models
LightIt is a method for explicit illumination control in image generation. It’s the first method that enables the generation of images with controllable, consistent lighting, and it performs on par with specialized state-of-the-art relighting methods.
OMG: Occlusion-friendly Personalized Multi-concept Generation In Diffusion Models
OMG is a framework for multi-concept image generation that supports character and style LoRAs. As an alternative to LoRAs, it also supports InstantID for multi-ID generation.
YOSO: You Only Sample Once
Image models are becoming faster, bigger, and better. YOSO is a new method that can finetune pretrained diffusion models to generate high-fidelity images in a single step.
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
FouriScale can generate high-resolution images from pre-trained diffusion models at arbitrary sizes and aspect ratios, without any additional training.
Desigen: A Pipeline for Controllable Design Template Generation
Unlimited design templates unlocked. Desigen is a pipeline for automatic template creation which generates background images as well as harmonious layout elements over the background. This could be used to generate design templates for websites, presentations, social media posts and more.
Image editing
StyleSketch: Stylized Face Sketch Extraction via Generative Prior with Limited Data
StyleSketch is a method for extracting high-resolution stylized sketches from a face image. Pretty cool!
Wear-Any-Way: Manipulable Virtual Try-on via Sparse Correspondence Alignment
Wear-Any-Way is a new virtual try-on framework that lets users precisely manipulate the wearing style of garments. Users can drag sleeves to roll them up, open coats, and control the style of tucks, among other things.
Diffusion Models are Geometry Critics: Single Image 3D Editing Using Pre-Trained Diffusion Priors
DiffCriticEdit enables 3D manipulations on images, such as object rotation and translation.
Magic Fixup: Streamlining Photo Editing by Watching Dynamic Videos
Adobe’s Magic Fixup lets you edit images with a cut-and-paste approach that cleans up edits automatically. I can see this being super useful for generating animation frames for tools like AnimateDiff. But it’s not clear yet if or when this will hit Photoshop.
DesignEdit: Multi-Layered Latent Decomposition and Fusion for Unified & Accurate Image Editing
DesignEdit is another image editing method, but from Microsoft. It can remove objects, edit typography, swap, relocate, resize, add and flip multiple objects, pan and zoom images, remove decorations from images, and edit posters.
ReNoise: Real Image Inversion Through Iterative Noising
ReNoise can be used to reconstruct an input image that can be edited using text prompts.
Video generation
AnimateDiff-Lightning
After SDXL Lightning, ByteDance has now released AnimateDiff-Lightning, a text-to-video model that can generate videos more than ten times faster than the original AnimateDiff.
StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN
StyleCineGAN is a method that can generate high-resolution looping cinemagraphs automatically from a still landscape image using a pre-trained StyleGAN.
Time Reversal: Explorative Inbetweening of Time and Space
Time Reversal makes it possible to generate in-between frames from two input images. In particular, this enables looping cinemagraphs as well as camera and subject motion videos.
Mora: Enabling Generalist Video Generation via A Multi-Agent Framework
Mora is an open-source attempt at replicating the capabilities of OpenAI’s Sora video model across various tasks, such as text-to-video generation, image-to-video generation, extending generated videos, video-to-video editing, connecting videos, and simulating digital worlds. The results are still far from Sora’s, but it’s a start!
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
VSTAR is a method that enables text-to-video models to generate longer videos with dynamic visual evolution in a single pass, without any finetuning.
Video editing
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
FRESCO combines ControlNet with Ebsynth for zero-shot video translation that focuses on preserving the spatial and temporal consistency of the input frames.
AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks
AnyV2V can edit a source video along with additional control (such as text prompts, subjects, or styles). Looks like one of the best Gen-1 alternatives yet.
Be-Your-Outpainter: Mastering Video Outpainting through Input-Specific Adaptation
MOTIA is a high-quality flexible video outpainting method. But no code yet 😭
Also interesting
- SceneScript: an AI model and method to understand and describe 3D spaces
- Arc2Face: A Foundation Model of Human Faces
- ScoreHMR: Score-Guided Diffusion for 3D Human Recovery
- Speech-driven Personalized Gesture Synthetics: Harnessing Automatic Fuzzy Feature Inference
- Portrait4D-v2: Pseudo Multi-View Data Creates Better 4D Head Synthesizer
- SCP-Diff: Photo-Realistic Semantic Image Synthesis with Spatial-Categorical Joint Prior
- GeoWizard: Unleashing the Diffusion Priors for 3D Geometry Estimation from a Single Image
- A one-step image-to-image method based on SD-Turbo that enables sketch2image, day2night, and more. There’s a Hugging Face demo.
- StreamMultiDiffusion: real-time interactive multi-text-to-image generation from user-assigned regional text prompts.
- A free tool for generating textures via Automatic1111 Stable Diffusion. Supports UV map preservation, blending layers by brush, 3D inpainting, and more.
- The code and model weights for interpolating between two images with DynamiCrafter have been released.
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. Please consider supporting this newsletter by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying a physical art print to hang on your wall
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you next week!
– dreamingtulpa