AI Toolbox
A curated collection of 759 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

Matryoshka Diffusion Models can generate high-quality images and videos using a NestedUNet architecture that denoises inputs at multiple resolutions jointly. This enables strong results at resolutions up to 1024x1024 pixels and strong zero-shot generalization.
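For intuition, here is a minimal PyTorch sketch of that nested idea, assuming a toy two-branch model (not the paper's implementation): the low-resolution branch denoises first, and its features are upsampled into the high-resolution branch so both scales are denoised in one forward pass.

```python
# Toy sketch of nested multi-resolution denoising; all module names and
# sizes here are illustrative assumptions, not Matryoshka Diffusion's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNestedDenoiser(nn.Module):
    def __init__(self, ch: int = 32):
        super().__init__()
        self.low = nn.Conv2d(3, ch, 3, padding=1)        # inner (low-res) branch
        self.low_out = nn.Conv2d(ch, 3, 3, padding=1)
        self.high = nn.Conv2d(3 + ch, ch, 3, padding=1)  # outer branch sees low-res features
        self.high_out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, x_low: torch.Tensor, x_high: torch.Tensor):
        # Denoise the low-resolution input first.
        h_low = F.silu(self.low(x_low))
        eps_low = self.low_out(h_low)
        # Upsample low-res features and fuse them into the high-res branch,
        # so both resolutions are denoised jointly.
        h_up = F.interpolate(h_low, size=x_high.shape[-2:], mode="nearest")
        h_high = F.silu(self.high(torch.cat([x_high, h_up], dim=1)))
        return eps_low, self.high_out(h_high)

model = TinyNestedDenoiser()
eps64, eps256 = model(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 256, 256))
print(eps64.shape, eps256.shape)  # noise predictions at both resolutions
```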
DiffComplete can complete 3D shapes from incomplete scans using a diffusion-based method.
Puppet-Master can create realistic motion in videos from a single image using simple drag controls. It uses a fine-tuned video diffusion model with an all-to-first attention mechanism to produce high-quality videos.
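A rough sketch of what an all-to-first attention step could look like, assuming simple (frames, tokens, dim) tensors; this illustrates the idea, not Puppet-Master's actual code:

```python
# Every frame's queries attend only to the first frame's keys/values,
# which keeps later frames anchored to the reference image.
import torch
import torch.nn.functional as F

def all_to_first_attention(q, k, v):
    """q, k, v: (frames, tokens, dim). Each frame attends to frame 0."""
    f, t, d = q.shape
    k0 = k[0].expand(f, t, d)  # broadcast first frame's keys to all frames
    v0 = v[0].expand(f, t, d)  # ... and its values
    return F.scaled_dot_product_attention(q, k0, v0)

q, k, v = (torch.randn(8, 64, 32) for _ in range(3))
print(all_to_first_attention(q, k, v).shape)  # torch.Size([8, 64, 32])
```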
Generative Camera Dolly can regenerate a video from any chosen perspective. Still very early, but imagine being able to change any shot or angle in a video after it’s been recorded!
Sprite-Decompose can break down animated graphics into sprites using videos and box outlines.
MILS can generate captions for images, videos, and audio without any training. It achieves top performance in zero-shot captioning and improves text-to-image generation, allowing for creative uses across different media types.
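The training-free recipe can be pictured as a propose-score-feedback loop. The skeleton below is hypothetical: `propose_captions` and `clip_score` are placeholder names for an LLM call and an off-the-shelf similarity scorer, not real MILS APIs.

```python
# Hedged sketch of a generator-scorer loop in the spirit of MILS:
# an LLM proposes candidate captions, a scorer (e.g. CLIP similarity)
# ranks them against the image, and the best ones are fed back.
def propose_captions(feedback: list[str], n: int = 8) -> list[str]:
    raise NotImplementedError  # call your LLM of choice here

def clip_score(image, caption: str) -> float:
    raise NotImplementedError  # e.g. cosine similarity of CLIP embeddings

def mils_style_caption(image, rounds: int = 5, keep: int = 3) -> str:
    best: list[tuple[float, str]] = []
    for _ in range(rounds):
        candidates = propose_captions([c for _, c in best])
        scored = sorted(((clip_score(image, c), c) for c in candidates), reverse=True)
        best = scored[:keep]  # feed the highest-scoring captions back
    return best[0][1]
```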
IPAdapter-Instruct can efficiently combine natural-image conditioning with “Instruct” prompts! It enables users to switch between various interpretations of the same image, such as style transfer and object extraction.
MeshAvatar can generate high-quality triangular human avatars from multi-view videos. The avatars can be edited, manipulated, and relit.
MeshAnything V2 can generate 3D meshes from point clouds, meshes, images, text and more.
Lumina-mGPT can create photorealistic images from text and handle different visual and language tasks! It uses a multimodal autoregressive transformer, making it possible to control image generation, perform segmentation, estimate depth, and answer visual questions over multiple turns.
Feature Splatting can manipulate both the appearance and the physical properties of objects in a 3D scene using text prompts.
VAR-CLIP can create detailed fantasy images that closely match text descriptions by combining Visual Auto-Regressive modeling with CLIP! It uses CLIP text embeddings to guide image generation and trains on a large image-text dataset to keep results well aligned with prompts.
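As a rough illustration of conditioning an autoregressive image-token model on a text embedding (heavily simplified: VAR itself predicts next scales rather than single next tokens, and every dimension here is a made-up assumption):

```python
# Toy text-conditioned autoregressive model: a projected CLIP text
# embedding is prepended as a prefix "token" so image-token prediction
# is conditioned on the prompt. Not VAR-CLIP's actual architecture.
import torch
import torch.nn as nn

class TextConditionedAR(nn.Module):
    def __init__(self, vocab: int = 1024, dim: int = 64):
        super().__init__()
        self.text_proj = nn.Linear(512, dim)   # 512 = CLIP text embedding size
        self.tok = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, text_emb, image_tokens):
        prefix = self.text_proj(text_emb).unsqueeze(1)         # (B, 1, dim)
        x = torch.cat([prefix, self.tok(image_tokens)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.shape[1])
        return self.head(self.body(x, mask=mask))              # next-token logits

model = TextConditionedAR()
logits = model(torch.randn(2, 512), torch.randint(0, 1024, (2, 16)))
print(logits.shape)  # torch.Size([2, 17, 1024])
```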
CityGaussian can render large-scale 3D scenes in real-time using a divide-and-conquer training approach and Level-of-Detail strategy. It achieves high-quality rendering at an average speed of 36 FPS on an A100 GPU.
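The Level-of-Detail idea can be sketched in a few lines: draw far-away scene blocks from a coarser, more compressed Gaussian set. The thresholds and block abstraction below are illustrative assumptions, not CityGaussian's actual values.

```python
# Minimal distance-based LoD pick: nearby blocks get full-detail
# Gaussians, distant blocks a coarser set.
import math

def pick_lod(block_center, camera_pos, thresholds=(50.0, 150.0)):
    """Return 0 (full detail), 1, or 2 (coarsest) from camera distance."""
    d = math.dist(block_center, camera_pos)
    for lod, t in enumerate(thresholds):
        if d < t:
            return lod
    return len(thresholds)

print(pick_lod((0, 0, 0), (10, 0, 0)))   # 0: near block, full detail
print(pick_lod((0, 0, 0), (400, 0, 0)))  # 2: distant block, coarsest set
```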
Perm can generate and manipulate 3D hairstyles. It enables applications such as 3D hair parameterization, hairstyle interpolation, single-view hair reconstruction, and hair-conditioned image generation.
SV4D 2.0 can generate high-quality 4D models and videos from a reference video.
SEG can improve image generation for SDXL by smoothing the self-attention energy landscape! By blurring the attention queries with a Gaussian kernel, it boosts quality without relying on a larger guidance scale, leading to better results with fewer side effects.
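A minimal sketch of the core trick, assuming a 1D Gaussian blur over the query tokens (the paper operates on 2D self-attention; the names and kernel settings here are illustrative):

```python
# Blur the self-attention *queries* with a Gaussian kernel before
# computing attention, which smooths the attention energy landscape.
import torch
import torch.nn.functional as F

def gaussian_kernel1d(size: int = 9, sigma: float = 2.0) -> torch.Tensor:
    x = torch.arange(size, dtype=torch.float32) - (size - 1) / 2
    k = torch.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def blurred_query_attention(q, k, v, sigma: float = 2.0):
    """q, k, v: (batch, tokens, dim). Queries are blurred along tokens."""
    b, t, d = q.shape
    kern = gaussian_kernel1d(sigma=sigma).to(q).view(1, 1, -1)
    kern = kern.expand(d, 1, -1).contiguous()  # depthwise 1D convolution
    q_blur = F.conv1d(q.transpose(1, 2), kern, padding=kern.shape[-1] // 2, groups=d)
    return F.scaled_dot_product_attention(q_blur.transpose(1, 2), k, v)

q, k, v = (torch.randn(1, 64, 32) for _ in range(3))
print(blurred_query_attention(q, k, v).shape)  # torch.Size([1, 64, 32])
```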
SMooDi can generate stylized motion from text prompts and style motion sequences.
Interactive3D can generate high-quality 3D objects that users can easily modify. It allows for adding and removing parts, dragging objects, and changing shapes.
XHand can generate high-fidelity hand shapes and textures in real-time, enabling expressive hand avatars for virtual environments.
DreamMover can generate high-quality intermediate images and short videos from image pairs with large motion. It uses a flow estimator based on diffusion models to keep details and ensure consistency between frames and input images.
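For intuition, a toy version of flow-based in-betweening looks like this; DreamMover's actual pipeline estimates the flow with a diffusion-model-based estimator, whereas here the flow is simply assumed to be given:

```python
# Warp both endpoint images toward time t along a flow field and blend.
# The warp is an approximate backward-sampling warp; everything here is
# an illustrative sketch, not the paper's pipeline.
import torch
import torch.nn.functional as F

def warp(img: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """img: (1, C, H, W), flow: (1, 2, H, W) in pixels."""
    _, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float() + flow[0].permute(1, 2, 0)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid[..., 0] = 2 * grid[..., 0] / (w - 1) - 1
    grid[..., 1] = 2 * grid[..., 1] / (h - 1) - 1
    return F.grid_sample(img, grid.unsqueeze(0), align_corners=True)

def interpolate_pair(img_a, img_b, flow_a_to_b, t: float = 0.5):
    fwd = warp(img_a, t * flow_a_to_b)        # move A forward by t
    bwd = warp(img_b, (t - 1) * flow_a_to_b)  # move B backward by 1 - t
    return (1 - t) * fwd + t * bwd            # linear blend of the two warps

mid = interpolate_pair(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
                       torch.zeros(1, 2, 64, 64))
print(mid.shape)  # torch.Size([1, 3, 64, 64])
```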