AI Art Weekly #122
Hello, my fellow dreamers, and welcome to issue #122 of AI Art Weekly! 👋
AI progress isn’t slowing down. In the last two weeks, we saw quite a few smaller and larger methods and models get open-sourced: 16 out of 21 covered in this issue, as a matter of fact, which is probably the highest ratio since I started tracking papers. It’s always great to see code actually getting released!
Meanwhile, I’ve started a new “ChatGPT” section on Promptcache in which I plan to add prompt ideas for creative multimodal image tasks that could be used for working on websites, designs, games, etc. I’ve added 8 so far, but I have a list of 50 ideas lying around which I’ll add as I go.
The next issue will again be in two weeks as I’m going to take some family time. Enjoy the weekend, everybody! ✌️
Support the newsletter and unlock the full potential of AI-generated art with my curated collection of 275+ high-quality Midjourney SREF codes and 2000+ creative prompts.
News & Papers
Highlights
MAGI-1
MAGI-1 is a new autoregressive video model that looks like it surpasses Wan-2.1 in quality. It supports:
- High-Resolution Video: Generates videos at 720p resolution by default, with a 2x decoder variant supporting 1440p for sharper, cinematic-quality visuals suitable for professional content creation.
- Up to 16-Second Clips: Produces video clips up to 16 seconds long at 24 FPS, with chunk-wise generation allowing seamless extension for longer narratives or interactive media.
- Video Continuation (V2V): Extends existing video clips by predicting subsequent frames, maintaining motion continuity and context, ideal for storytelling or game cinematics.
- Real-Time Streaming: Delivers video chunks in real-time, enabling live applications like interactive broadcasts or virtual environments.
- Smooth Transitions with Second-by-Second Prompts: Supports fine-grained control via text prompts for each 1-second chunk, allowing precise scene changes (e.g. “man smiling” to “man juggling”) and controllable shot transitions while preserving object identity or scene layout (see the sketch below).
It runs on an RTX 4090 (4.5B model) or 8xH100 (24B model) with optimized memory use (21.94 GB peak for the 4.5B model). The code and weights for MAGI-1 are available on Hugging Face.
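To make the chunk-wise prompting more concrete, here is a minimal sketch of feeding one prompt per 1-second chunk. The `MagiPipeline` class, its arguments, and the model id are hypothetical placeholders rather than the project’s actual inference API, so treat this as an illustration and check the MAGI-1 repo for the real entry points.

```python
# Illustrative sketch of MAGI-1-style chunk-wise generation with one prompt per second.
# "MagiPipeline", its arguments, and the model id are hypothetical placeholders, not the actual API.
from magi import MagiPipeline  # hypothetical import; see the MAGI-1 repo for real inference scripts

pipe = MagiPipeline.from_pretrained("sand-ai/MAGI-1", variant="4.5B")  # 4.5B variant fits on an RTX 4090

# One prompt per 1-second chunk (24 frames each at 24 FPS), up to 16 seconds total.
chunk_prompts = [
    "a man smiling at the camera",
    "the man starts juggling three oranges",
    "close-up of the oranges spinning mid-air",
]

video = pipe.generate(
    prompts=chunk_prompts,    # fine-grained, per-chunk control over the scene
    resolution=(1280, 720),   # 720p by default; a 2x decoder variant reaches 1440p
    fps=24,
)
video.save("juggling.mp4")
```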

MAGI-1 example
Nari Dia-1.6B
Dia-1.6B is a new text-to-speech model that reportedly outperforms ElevenLabs in realistic dialogue generation. It supports:
- Realistic Dialogue: Generates natural-sounding conversations from text scripts with [S1] and [S2] tags for multiple speakers (see the sketch below).
- Non-Verbal Sounds: Produces sounds like laughter, coughs, sighs, and more using tags such as (laughs) and (coughs).
- Voice Cloning: Replicates a speaker’s voice from an audio prompt for consistent tone and emotion.
- Real-Time Audio: Generates audio in real time on enterprise GPUs (about 40 tokens/s on an A4000, with ~86 tokens corresponding to 1 second of audio).
- English-Only: Currently supports English dialogue generation.
It runs on CUDA-supported GPUs with ~10 GB of VRAM, with CPU support planned. The code and weights for Dia-1.6B are available on Hugging Face. Examples can be found here.
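To show how the speaker and non-verbal tags look in a script, here is a short sketch following the quickstart in the Dia repository; the import path and call signature may have changed since release, so double-check against the README before relying on it.

```python
# Sketch of Dia-1.6B dialogue generation with speaker and non-verbal tags.
# Based on the repo's quickstart at the time of writing; verify the current API in the README.
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] switch speakers; parenthesized tags such as (laughs) or (sighs) add non-verbal sounds.
text = (
    "[S1] Have you tried the new video models yet? "
    "[S2] I have. (laughs) My GPU still hasn't forgiven me. "
    "[S1] (sighs) Same here."
)

audio = model.generate(text)            # returns the waveform as a NumPy array
sf.write("dialogue.wav", audio, 44100)  # write 44.1 kHz audio to disk
```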

Dia prompt example
3D
TAPIP3D: Tracking Any Point in Persistent 3D Geometry
TAPIP3D can track 3D points in videos.

TAPIP3D example
CoMotion can detect and track 3D poses of multiple people using just one camera. It works well in crowded places and can keep track of movements over time with high accuracy.

CoMotion example
PARTFIELD: Learning 3D Feature Fields for Part Segmentation and Beyond
PartField can segment 3D shapes into parts without relying on templates or text labels.

PARTFIELD example
HoloPart: Generative 3D Part Amodal Segmentation
HoloPart can break down 3D shapes into complete and meaningful parts, even if they are hidden. It also supports numerous downstream applications such as Geometry Editing, Geometry Processing, Material Editing and Animation.

HoloPart example
HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation
HiScene can generate high-quality 3D scenes from 2D images by treating them as layered objects. It allows for interactive editing and effectively manages occlusions and shadows using a video-diffusion technique.

HiScene example
Art3D: Training-Free 3D Generation from Flat-Colored Illustration
Art3D can turn flat 2D designs into 3D images. It uses pre-trained 2D image models and a realism check to improve the 3D effect across different art styles.

Art3D example
Text
Describe Anything: Detailed Localized Image and Video Captioning
Describe Anything can generate detailed descriptions for specific areas in images and videos using points, boxes, scribbles, or masks.
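Region prompts here are just points, boxes, scribbles, or masks over the image. As a small, runnable illustration, the snippet below converts a bounding box into the binary-mask form; wiring the image and mask into Describe Anything itself is left to the project’s own inference code, since its exact API isn’t shown here.

```python
# Turn an (x1, y1, x2, y2) box into a binary region mask that a localized-captioning
# model could consume. This only prepares the prompt; it does not call Describe Anything.
import numpy as np
from PIL import Image

def box_to_mask(size: tuple[int, int], box: tuple[int, int, int, int]) -> Image.Image:
    """Create a white-on-black mask of the given (width, height) covering the box."""
    w, h = size
    x1, y1, x2, y2 = box
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[y1:y2, x1:x2] = 255
    return Image.fromarray(mask)

image = Image.open("street_scene.jpg")               # any local image (hypothetical filename)
mask = box_to_mask(image.size, (120, 80, 340, 400))  # the region you want described
mask.save("region_mask.png")                         # hand image + mask to the inference script
```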

Describe Anything example
Image
Step1X-Edit can perform advanced image editing tasks by processing reference images and user instructions.

Step1X-Edit example
InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework
InstantCharacter can generate high-quality images of personalized characters from a single reference image with FLUX. It supports different styles and poses, ensuring identity consistency and allowing for text-based edits.

InstantCharacter example
SCW-VTON can fit in-shop clothing to a person’s image while keeping their pose consistent. It improves the shape of the clothing and reduces distortions in visible limb areas, making virtual try-on results look more realistic.

Shape-Guided Clothing Warping for Virtual Try-On example
IMAGGarment-1: Fine-Grained Garment Generation for Controllable Fashion Design
IMAGGarment-1 can generate high-quality garments with control over shape, color, and logo placement.

IMAGGarment-1 example
Cobra: Efficient Line Art COlorization with BRoAder References
Cobra can efficiently colorize line art by utilizing over 200 reference images.

Cobra example
TryOffDiff: Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off
TryOffDiff can generate high-quality images of clothing from photos of people wearing them.

Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off example
Video
SkyReels-V2 can generate infinite-length videos by combining a Diffusion Forcing framework with Multi-modal Large Language Models and Reinforcement Learning.

the first 5 seconds of a 30-second SkyReels-V2 video example
Ev-DeblurVSR: Event-Enhanced Blurry Video Super-Resolution
Ev-DeblurVSR can turn low-resolution and blurry videos into high-resolution ones.

Event-Enhanced Blurry Video Super-Resolution example
FramePack: Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
FramePack aims to make video generation feel like image generation. It can generate single video frames in 1.5 seconds with 13B models on an RTX 4090, and it also supports full 30 fps generation with 13B models on a 6 GB laptop GPU, although obviously slower.

Packing Input Frame Context in Next-Frame Prediction Models for Video Generation example
UniAnimate-DiT can generate high-quality animations from human images. It uses the Wan2.1 model and a lightweight pose encoder to create smooth and visually appealing results, while also upscaling animations from 480p to 720p.

UniAnimate-DiT example
NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors
NormalCrafter can generate consistent surface normals from video sequences. It uses video diffusion models and Semantic Feature Regularization to ensure accurate normal estimation while keeping details clear across frames.

NormalCrafter example
3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models
3DV-TON can generate high-quality videos for trying on clothes using 3D models. It handles complex clothing patterns and different body poses well, and it has a strong masking method to reduce errors.

3DV-TON example
RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild
RealisDance-DiT can generate high-quality character animations from images and pose sequences. It effectively handles challenges like character-object interactions and complex gestures while using minimal changes to the Wan-2.1 video model and is part of the Uni3C method.

RealisDance-DiT example
Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation
Uni3C is a video generation method that adds support for both precise camera control and human motion control.

Uni3C example

Enjoy the weekend!
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. If you like what I do, you can support me by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying my Midjourney prompt collection on PROMPTCACHE 🚀
- Buying access to AI Art Weekly Premium 👑
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you in two weeks!
– dreamingtulpa