AI Art Weekly #122
Hello, my fellow dreamers, and welcome to issue #122 of AI Art Weekly! 👋
AI progress isn’t slowing down. In the last two weeks, we saw quite a few smaller and larger methods and models get open-sourced: 16 out of 21 covered in this issue, as a matter of fact, which is probably the highest ratio since I started tracking papers. It’s always great to see code actually getting released!
Meanwhile, I’ve started a new “ChatGPT” section on Promptcache in which I plan to add prompt ideas for creative multimodal image tasks that could be used for working on websites, designs, games, etc. I’ve added 8 so far, but I have a list of 50 ideas lying around which I’ll add as I go.
The next issue will again be in two weeks as I’m going to take some family time. Enjoy the weekend, everybody! ✌️
Support the newsletter and unlock the full potential of AI-generated art with my curated collection of 275+ high-quality Midjourney SREF codes and 2000+ creative prompts.
News & Papers
Highlights
MAGI-1
MAGI-1 is a new autoregressive video model that looks like it surpasses Wan-2.1 in quality. It supports:
- High-Resolution Video: Generates videos at 720p resolution by default, with a 2x decoder variant supporting 1440p for sharper, cinematic-quality visuals suitable for professional content creation.
- Up to 16-Second Clips: Produces video clips up to 16 seconds long at 24 FPS, with chunk-wise generation allowing seamless extension for longer narratives or interactive media.
- Video Continuation (V2V): Extends existing video clips by predicting subsequent frames, maintaining motion continuity and context, ideal for storytelling or game cinematics.
- Real-Time Streaming: Delivers video chunks in real-time, enabling live applications like interactive broadcasts or virtual environments.
- Smooth Transitions with Second-by-Second Prompts: Supports fine-grained control via text prompts for each 1-second chunk, allowing precise scene changes (e.g. “man smiling” to “man juggling”) and controllable shot transitions while preserving object identity or scene layout (see the sketch below).
It runs on an RTX 4090 (4.5B model) or 8xH100 (24B model) with optimized memory use (21.94 GB peak for the 4.5B model). The code and weights for MAGI-1 are available on Hugging Face.
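To make the chunk-wise prompting more concrete, here is a minimal sketch of feeding one prompt per 1-second chunk. The `MagiPipeline` class, its arguments, and the model id are hypothetical placeholders rather than the project’s actual inference API, so treat this as an illustration and check the MAGI-1 repo for the real entry points.

```python
# Illustrative sketch of MAGI-1-style chunk-wise generation with one prompt per second.
# "MagiPipeline", its arguments, and the model id are hypothetical placeholders, not the actual API.
from magi import MagiPipeline  # hypothetical import; see the MAGI-1 repo for real inference scripts

pipe = MagiPipeline.from_pretrained("sand-ai/MAGI-1", variant="4.5B")  # 4.5B variant fits on an RTX 4090

# One prompt per 1-second chunk (24 frames each at 24 FPS), up to 16 seconds total.
chunk_prompts = [
    "a man smiling at the camera",
    "the man starts juggling three oranges",
    "close-up of the oranges spinning mid-air",
]

video = pipe.generate(
    prompts=chunk_prompts,    # fine-grained, per-chunk control over the scene
    resolution=(1280, 720),   # 720p by default; a 2x decoder variant reaches 1440p
    fps=24,
)
video.save("juggling.mp4")
```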

MAGI-1 example
Nari Dia-1.6B
Dia-1.6B is a new text-to-speech model that reportedly outperforms ElevenLabs in realistic dialogue generation. It supports:
- Realistic Dialogue: Generates natural-sounding conversations from text scripts with [S1] and [S2] tags for multiple speakers (see the sketch below).
- Non-Verbal Sounds: Produces sounds like laughter, coughs, sighs, and more using tags such as (laughs) and (coughs).
- Voice Cloning: Replicates a speaker’s voice from an audio prompt for consistent tone and emotion.
- Real-Time Audio: Generates audio in real time on enterprise GPUs (about 40 tokens/s on an A4000, with ~86 tokens corresponding to 1 second of audio).
- English-Only: Currently supports English dialogue generation.
It runs on CUDA-supported GPUs with ~10 GB of VRAM, with CPU support planned. The code and weights for Dia-1.6B are available on Hugging Face. Examples can be found here.
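To show how the speaker and non-verbal tags look in a script, here is a short sketch following the quickstart in the Dia repository; the import path and call signature may have changed since release, so double-check against the README before relying on it.

```python
# Sketch of Dia-1.6B dialogue generation with speaker and non-verbal tags.
# Based on the repo's quickstart at the time of writing; verify the current API in the README.
import soundfile as sf
from dia.model import Dia

model = Dia.from_pretrained("nari-labs/Dia-1.6B")

# [S1]/[S2] switch speakers; parenthesized tags such as (laughs) or (sighs) add non-verbal sounds.
text = (
    "[S1] Have you tried the new video models yet? "
    "[S2] I have. (laughs) My GPU still hasn't forgiven me. "
    "[S1] (sighs) Same here."
)

audio = model.generate(text)            # returns the waveform as a NumPy array
sf.write("dialogue.wav", audio, 44100)  # write 44.1 kHz audio to disk
```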

Dia prompt example
3D
TAPIP3D: Tracking Any Point in Persistent 3D Geometry
TAPIP3D can track 3D points in videos.

TAPIP3D example
CoMotion can detect and track 3D poses of multiple people using just one camera. It works well in crowded places and can keep track of movements over time with high accuracy.

CoMotion example
PARTFIELD: Learning 3D Feature Fields for Part Segmentation and Beyond
PartField can segment 3D shapes into parts without relying on templates or text labels.

PARTFIELD example
HoloPart: Generative 3D Part Amodal Segmentation
HoloPart can break down 3D shapes into complete and meaningful parts, even if they are hidden. It also supports numerous downstream applications such as Geometry Editing, Geometry Processing, Material Editing and Animation.

HoloPart example
HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation
HiScene can generate high-quality 3D scenes from 2D images by treating them as layered objects. It allows for interactive editing and effectively manages occlusions and shadows using a video-diffusion technique.

HiScene example
Art3D: Training-Free 3D Generation from Flat-Colored Illustration
Art3D can turn flat 2D designs into 3D images. It uses pre-trained 2D image models and a realism check to improve the 3D effect across different art styles.

Art3D example
Text
Describe Anything: Detailed Localized Image and Video Captioning
Describe Anything can generate detailed descriptions for specific areas in images and videos using points, boxes, scribbles, or masks.
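Region prompts here are just points, boxes, scribbles, or masks over the image. As a small, runnable illustration, the snippet below converts a bounding box into the binary-mask form; wiring the image and mask into Describe Anything itself is left to the project’s own inference code, since its exact API isn’t shown here.

```python
# Turn an (x1, y1, x2, y2) box into a binary region mask that a localized-captioning
# model could consume. This only prepares the prompt; it does not call Describe Anything.
import numpy as np
from PIL import Image

def box_to_mask(size: tuple[int, int], box: tuple[int, int, int, int]) -> Image.Image:
    """Create a white-on-black mask of the given (width, height) covering the box."""
    w, h = size
    x1, y1, x2, y2 = box
    mask = np.zeros((h, w), dtype=np.uint8)
    mask[y1:y2, x1:x2] = 255
    return Image.fromarray(mask)

image = Image.open("street_scene.jpg")               # any local image (hypothetical filename)
mask = box_to_mask(image.size, (120, 80, 340, 400))  # the region you want described
mask.save("region_mask.png")                         # hand image + mask to the inference script
```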

Describe Anything example
Image
Step1X-Edit can perform advanced image editing tasks by processing reference images and user instructions.

Step1X-Edit example
InstantCharacter: Personalize Any Characters with a Scalable Diffusion Transformer Framework
InstantCharacter can generate high-quality images of personalized characters from a single reference image with FLUX. It supports different styles and poses, ensuring identity consistency and allowing for text-based edits.

InstantCharacter example
SCW-VTON can fit in-shop clothing to a person’s image while keeping their pose consistent. It improves the shape of the clothing and reduces distortions in visible limb areas, making virtual try-on results look more realistic.

Shape-Guided Clothing Warping for Virtual Try-On example
IMAGGarment-1: Fine-Grained Garment Generation for Controllable Fashion Design
IMAGGarment-1 can generate high-quality garments with control over shape, color, and logo placement.

IMAGGarment-1 example
Cobra: Efficient Line Art COlorization with BRoAder References
Cobra can efficiently colorize line art by utilizing over 200 reference images.

Cobra example
TryOffDiff: Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off
TryOffDiff can generate high-quality images of clothing from photos of people wearing them.

Enhancing Person-to-Person Virtual Try-On with Multi-Garment Virtual Try-Off example
Video
SkyReels-V2 can generate infinite-length videos by combining a Diffusion Forcing framework with Multi-modal Large Language Models and Reinforcement Learning.

the first 5 seconds of a 30-second SkyReels-V2 video example
Ev-DeblurVSR: Event-Enhanced Blurry Video Super-Resolution
Ev-DeblurVSR can turn low-resolution and blurry videos into high-resolution ones.

Event-Enhanced Blurry Video Super-Resolution example
FramePack: Packing Input Frame Context in Next-Frame Prediction Models for Video Generation
FramePack aims to make video generation feel like image generation. It can generate single video frames in 1.5 seconds with 13B models on an RTX 4090, and it also supports full 30 fps generation with 13B models on a 6 GB laptop GPU, although obviously slower.

Packing Input Frame Context in Next-Frame Prediction Models for Video Generation example
UniAnimate-DiT can generate high-quality animations from human images. It uses the Wan2.1 model and a lightweight pose encoder to create smooth and visually appealing results, while also upscaling animations from 480p to 720p.

UniAnimate-DiT example
NormalCrafter: Learning Temporally Consistent Normals from Video Diffusion Priors
NormalCrafter can generate consistent surface normals from video sequences. It uses video diffusion models and Semantic Feature Regularization to ensure accurate normal estimation while keeping details clear across frames.

NormalCrafter example
3DV-TON: Textured 3D-Guided Consistent Video Try-on via Diffusion Models
3DV-TON can generate high-quality videos for trying on clothes using 3D models. It handles complex clothing patterns and different body poses well, and it has a strong masking method to reduce errors.

3DV-TON example
RealisDance-DiT: Simple yet Strong Baseline towards Controllable Character Animation in the Wild
RealisDance-DiT can generate high-quality character animations from images and pose sequences. It effectively handles challenges like character-object interactions and complex gestures while using minimal changes to the Wan-2.1 video model and is part of the Uni3C method.

RealisDance-DiT example
Uni3C: Unifying Precisely 3D-Enhanced Camera and Human Motion Controls for Video Generation
Uni3C is a video generation method that adds support for both precise camera control and human motion control.

Uni3C example

Enjoy the weekend!
And that, my fellow dreamers, concludes yet another AI Art Weekly issue. If you like what I do, you can support me by:
- Sharing it 🙏❤️
- Following me on Twitter: @dreamingtulpa
- Buying me a coffee (I could seriously use it, putting these issues together takes me 8-12 hours every Friday 😅)
- Buying my Midjourney prompt collection on PROMPTCACHE 🚀
- Buying access to AI Art Weekly Premium 👑
Reply to this email if you have any feedback or ideas for this newsletter.
Thanks for reading and talk to you in two weeks!
– dreamingtulpa