AI Toolbox
A curated collection of 758 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

Customizing Motion can learn motion patterns from input videos and generalize them to new and unseen contexts.
MEMO can generate talking videos from images and audio. It keeps the person’s identity consistent and matches lip movements to the audio, producing natural expressions.
MV-Adapter can generate images from multiple views while keeping them consistent across views. It enhances text-to-image models like Stable Diffusion XL, supporting both text and image inputs, and achieves high-resolution outputs at 768x768.
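For intuition: the usual trick behind this kind of cross-view consistency is to let the attention layers see tokens from all views at once. Here's a minimal, generic PyTorch sketch of that idea (not MV-Adapter's actual code):

```python
import torch
import torch.nn as nn

class MultiViewAttention(nn.Module):
    """Self-attention where tokens from all views attend jointly,
    a common way to keep multi-view generations consistent.
    (Generic sketch, not MV-Adapter's implementation.)"""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, views, tokens, dim)
        b, v, n, d = x.shape
        x = x.reshape(b, v * n, d)    # flatten views into one long sequence
        out, _ = self.attn(x, x, x)   # every token sees every view
        return out.reshape(b, v, n, d)

# Example: features for 4 views, 64 tokens each
feats = torch.randn(1, 4, 64, 320)
print(MultiViewAttention(320)(feats).shape)  # torch.Size([1, 4, 64, 320])
```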
CAVIS can do instance segmentation on videos. It tracks objects better and improves instance matching accuracy, resulting in more accurate and stable segmentations.
VideoRepair can improve text-to-video generation by finding and fixing small mismatches between text prompts and videos.
Trellis 3D generates high-quality 3D assets in formats like Radiance Fields, 3D Gaussians, and meshes. It supports text and image conditioning, offering flexible output format selection and local 3D editing capabilities.
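If you want to try it, the project's README shows a pipeline roughly like the one below; treat the exact module, model, and key names as assumptions that may have changed:

```python
# Usage sketch loosely based on the TRELLIS README; exact names may differ.
from PIL import Image
from trellis.pipelines import TrellisImageTo3DPipeline

pipeline = TrellisImageTo3DPipeline.from_pretrained("microsoft/TRELLIS-image-large")
pipeline.cuda()

image = Image.open("input.png")
outputs = pipeline.run(image, seed=1)

# One run, several output formats to choose from:
gaussians = outputs["gaussian"][0]        # 3D Gaussians
radiance  = outputs["radiance_field"][0]  # Radiance Field
mesh      = outputs["mesh"][0]            # mesh
```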
Anagram-MTL can generate visual anagrams that change appearance with transformations like flipping or rotating.
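Diffusion-based anagram methods (like the Visual Anagrams work this builds on) boil down to averaging noise predictions across the transformed views at every denoising step. A rough sketch, with `model` standing in for any text-conditioned denoiser:

```python
import torch

def anagram_noise_estimate(model, x_t, t, prompts, views):
    """One denoising step for a visual anagram (sketch).

    views: list of (transform, inverse) pairs, e.g. identity and a flip.
    Each prompt describes what the image should show under its view.
    Per-view noise estimates are mapped back to the canonical frame
    and averaged, so a single image satisfies all views at once.
    """
    estimates = []
    for prompt, (tf, inv_tf) in zip(prompts, views):
        eps = model(tf(x_t), t, prompt)   # denoise the transformed image
        estimates.append(inv_tf(eps))     # map the estimate back
    return torch.stack(estimates).mean(dim=0)

# Example views: identity, and flipping the image upside down
views = [
    (lambda x: x, lambda x: x),
    (lambda x: torch.flip(x, dims=[-2]), lambda x: torch.flip(x, dims=[-2])),
]
```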
Dessie can estimate the 3D shape and pose of horses from single images. It also works with other large animals like zebras and cows.
Negative Token Merging can improve image diversity by pushing apart similar features during the reverse diffusion process. It reduces visual similarity with copyrighted content by 34.57% and works well with Stable Diffusion as well as Flux.
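The core idea is easy to sketch: match each token to its most similar counterpart in another image of the batch and extrapolate away from it instead of merging toward it. A rough PyTorch reading of the method (not the official implementation):

```python
import torch
import torch.nn.functional as F

def push_apart_tokens(src, tgt, alpha=0.9):
    """Sketch of negative token merging between two batch items.

    src, tgt: (num_tokens, dim) feature tokens from two images.
    For each source token, find its most similar target token by
    cosine similarity, then push the source feature away from it,
    reducing visual similarity between the two generations.
    """
    src_n = F.normalize(src, dim=-1)
    tgt_n = F.normalize(tgt, dim=-1)
    sim = src_n @ tgt_n.T              # (n_src, n_tgt) cosine similarities
    nearest = sim.argmax(dim=-1)       # most similar target token per source
    matched = tgt[nearest]             # (n_src, dim)
    # Extrapolate away from the match instead of merging toward it.
    return src + alpha * (src - matched)
```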
FlowEdit can edit images using only text prompts with Flux and Stable Diffusion 3.
L4GM is a 4D Large Reconstruction Model that can turn a single-view video into an animated 3D object.
D3GA is the first 3D controllable model for human bodies rendered with Gaussian splats in real-time. This lets us turn ourselves or others with a multi-cam setup into a Gaussian splat which can be animated, and even allows decomposing the avatar into its different clothing layers.
Ever tried to inpaint smaller objects and details into an image? It can be kind of hit or miss. SOEDiff has been trained specifically to handle these cases and does a pretty good job at it.
Material Anything can generate realistic materials for 3D objects, including those without textures. It adapts to different lighting and uses confidence masks to improve material quality, ensuring outputs are ready for UV mapping.
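A confidence mask here is essentially a per-pixel blend weight: keep the prediction where the model is confident and fall back to a refined estimate elsewhere. A toy sketch, assuming hypothetical maps as float arrays in [0, 1]:

```python
import numpy as np

def blend_by_confidence(pred, fallback, confidence):
    """Per-pixel blend of two material maps by a confidence mask.

    pred, fallback: (H, W, C) material maps (e.g. albedo) in [0, 1].
    confidence:     (H, W) mask in [0, 1]; 1 = trust the prediction.
    """
    conf = confidence[..., None]              # broadcast over channels
    return conf * pred + (1.0 - conf) * fallback

# Hypothetical inputs for illustration
albedo_pred = np.random.rand(512, 512, 3)
albedo_ref  = np.random.rand(512, 512, 3)
conf        = np.random.rand(512, 512)
blended = blend_by_confidence(albedo_pred, albedo_ref, conf)
```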
Inverse Painting can generate time-lapse videos of the painting process from a target artwork. It uses a diffusion-based renderer to learn from real artists’ techniques, producing realistic results across different artistic styles.
MegaFusion can extend existing diffusion models for high-resolution image generation. It achieves images up to 2048x2048 with only 40% of the original computational cost by enhancing denoising processes across different resolutions.
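The general recipe for this kind of thing is "truncate and relay": do most of the denoising at the model's native resolution, upsample the partially denoised latent, then finish the remaining steps at the target resolution. A schematic sketch with a hypothetical `denoise_step` callable (not MegaFusion's exact algorithm):

```python
import torch.nn.functional as F

def coarse_to_fine_denoise(denoise_step, x, timesteps, relay_at, scale=2):
    """Schematic coarse-to-fine sampling.

    denoise_step(x, t) -> x: one reverse-diffusion step (hypothetical).
    Most steps run on the small latent; at `relay_at` the latent is
    upsampled and the remaining steps refine it at high resolution,
    which is far cheaper than sampling at full resolution throughout.
    """
    for i, t in enumerate(timesteps):
        if i == relay_at:
            x = F.interpolate(x, scale_factor=scale, mode="bilinear")
        x = denoise_step(x, t)
    return x
```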
CAT4D can create dynamic 4D scenes from single videos. It uses a multi-view video diffusion model to generate videos from different angles, allowing for strong 4D reconstruction and high-quality images.
SuperMat can quickly break down images of materials into three important maps: albedo, metallic, and roughness. It does this in about 3 seconds while keeping high quality, making it efficient for 3D object material estimation.
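Those three maps slot straight into a standard PBR workflow; glTF, for instance, packs occlusion, roughness, and metallic into the R, G, and B channels of a single texture. A small packing sketch:

```python
import numpy as np
from PIL import Image

def pack_orm(roughness, metallic, occlusion=None):
    """Pack grayscale maps into a glTF-style ORM texture:
    R = occlusion, G = roughness, B = metallic."""
    h, w = roughness.shape
    occ = occlusion if occlusion is not None else np.ones((h, w), np.float32)
    orm = np.stack([occ, roughness, metallic], axis=-1)
    return Image.fromarray((orm * 255).astype(np.uint8))

# Hypothetical predicted maps for illustration
rough = np.random.rand(256, 256).astype(np.float32)
metal = np.zeros((256, 256), np.float32)   # a dielectric material
pack_orm(rough, metal).save("material_orm.png")
```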
SelfSplat can create 3D models from multiple images without needing known camera poses. It uses self-supervised methods for depth and pose estimation, resulting in high-quality appearance and geometry from real-world data.
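Self-supervised depth and pose training typically rests on a photometric reprojection loss: warp one frame into the other using the predicted depth and relative pose, then penalize the color difference. A minimal sketch of the standard recipe (not necessarily SelfSplat's exact formulation):

```python
import torch
import torch.nn.functional as F

def photometric_loss(img_src, img_tgt, depth_tgt, pose_tgt_to_src, K):
    """Warp img_src into the target view and compare with img_tgt.

    depth_tgt:       (B, 1, H, W) predicted depth of the target frame.
    pose_tgt_to_src: (B, 4, 4) predicted relative camera pose.
    K:               (B, 3, 3) camera intrinsics.
    """
    b, _, h, w = depth_tgt.shape
    # Pixel grid in homogeneous coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float()  # (3, H, W)
    pix = pix.reshape(1, 3, -1).expand(b, -1, -1)

    # Back-project to 3D, move to the source frame, re-project.
    cam = torch.linalg.inv(K) @ pix * depth_tgt.reshape(b, 1, -1)
    cam_h = torch.cat([cam, torch.ones(b, 1, h * w)], dim=1)     # homogeneous
    src = K @ (pose_tgt_to_src @ cam_h)[:, :3]
    uv = src[:, :2] / src[:, 2:].clamp(min=1e-6)

    # Normalize to [-1, 1] for grid_sample and warp the source image.
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], -1) * 2 - 1
    warped = F.grid_sample(img_src, grid.reshape(b, h, w, 2), align_corners=True)
    return (warped - img_tgt).abs().mean()
```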
DreamMix is an inpainting method based on the Fooocus model that can add objects from reference images and change their features using text.