AI Toolbox
A curated collection of 758 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

DAWN can generate talking-head videos from a single portrait and an audio clip. It produces lip movements and head poses quickly, making it effective for creating long video sequences.
DimensionX can generate photorealistic 3D and 4D scenes from a single image using controllable video diffusion.
SG-I2V can control object and camera motion in image-to-video generation using bounding boxes and trajectories.
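A sketch of how such control signals might be specified. The `sg_i2v` call and all of its argument names are hypothetical, for illustration only; consult the repository for the actual API.

```python
# Hypothetical usage sketch for SG-I2V-style motion control.
# The module name `sg_i2v` and all argument names below are
# assumptions, not the repo's confirmed interface.
from PIL import Image

image = Image.open("scene.png")

# One bounding box per controlled object, plus a per-frame trajectory
# of box centers describing where the object should move.
object_box = (120, 80, 260, 220)  # (x0, y0, x1, y1) in pixels
trajectory = [(190, 150), (230, 150), (270, 155), (310, 160)]  # per-frame centers

# video = sg_i2v.generate(image=image, boxes=[object_box],
#                         trajectories=[trajectory], num_frames=16)
```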
RayGauss can synthesize realistic novel views of 3D scenes using Gaussian-based ray casting. It renders high-quality images at 25 frames per second and avoids the rendering artifacts common in older methods.
CLoSD can control characters in physics-based simulations using text prompts. It can navigate to goals, strike objects, and switch between sitting and standing, all guided by simple instructions.
GIMM is a video frame interpolation method that uses generalizable implicit motion modelling to predict the motion between two input frames.
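For intuition, here is a generic flow-based interpolation sketch in plain PyTorch: given flows predicted by a motion model for an intermediate time t, both endpoint frames are backward-warped toward t and blended. This illustrates the mechanism that motion-modelling interpolators build on, not GIMM's exact architecture.

```python
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a (B,C,H,W) frame by a (B,2,H,W) flow field."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame)   # (2,H,W), x first
    coords = grid.unsqueeze(0) + flow                        # shifted sampling points
    # Normalize coordinates to [-1, 1] for grid_sample.
    coords_x = 2 * coords[:, 0] / (w - 1) - 1
    coords_y = 2 * coords[:, 1] / (h - 1) - 1
    grid_n = torch.stack((coords_x, coords_y), dim=-1)       # (B,H,W,2)
    return F.grid_sample(frame, grid_n, align_corners=True)

def interpolate(f0, f1, flow_t0, flow_t1, t=0.5):
    """Blend both endpoint frames after warping them toward time t;
    the flows would come from a learned motion model such as GIMM's."""
    return (1 - t) * warp(f0, flow_t0) + t * warp(f1, flow_t1)
```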
Regional-Prompting-FLUX adds regional prompting capabilities to diffusion transformers like FLUX. It effectively manages complex prompts and works well with tools like LoRA and ControlNet.
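A hypothetical sketch of supplying per-region prompts; the pipeline call and argument names are assumptions for illustration, not the repo's confirmed API.

```python
# Hypothetical regional-prompting sketch. The `regional_prompts`
# argument and the pipeline call are assumptions, not confirmed API.
regional_prompts = {
    # region given as (x0, y0, x1, y1) fractions of the canvas
    (0.0, 0.0, 0.5, 1.0): "a knight in silver armor, detailed",
    (0.5, 0.0, 1.0, 1.0): "a dragon perched on a cliff, sunset",
}
base_prompt = "fantasy illustration, cinematic lighting"

# image = pipeline(prompt=base_prompt,
#                  regional_prompts=regional_prompts,
#                  height=1024, width=1024).images[0]
```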
AutoVFX can automatically create realistic visual effects in videos from a single video and natural language instructions.
Adaptive Caching can speed up video generation with Diffusion Transformers by caching important calculations. It can achieve up to 4.7 times faster video creation at 720p without losing quality.
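A minimal sketch of the underlying idea in plain PyTorch, not the paper's code: a block caches its residual output and reuses it on the next denoising step when its input has barely changed. The toy MLP block and the tolerance value are illustrative assumptions.

```python
import torch

class CachedBlock(torch.nn.Module):
    """Toy transformer-style block that reuses its last residual output
    when its input barely changed between denoising steps -- the core
    idea behind cache-based DiT acceleration."""

    def __init__(self, dim: int, tol: float = 1e-2):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim), torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim))
        self.tol = tol            # illustrative threshold, not the paper's value
        self._last_in = None
        self._last_res = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self._last_in is not None:
            change = (x - self._last_in).abs().mean() / (x.abs().mean() + 1e-8)
            if change < self.tol:              # input is nearly unchanged:
                return x + self._last_res      # reuse the cached residual
        res = self.mlp(x)                      # otherwise recompute and cache
        self._last_in, self._last_res = x.detach(), res.detach()
        return x + res
```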
ZIM can generate precise matte masks from segmentation labels, enabling zero-shot image matting.
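ZIM itself maps segmentation labels to mattes end-to-end, but the snippet below illustrates the classic label-to-trimap step that matting pipelines build on; the band width is an illustrative assumption.

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def mask_to_trimap(mask: np.ndarray, band: int = 10) -> np.ndarray:
    """Turn a hard segmentation mask into a matting trimap:
    255 = definite foreground, 0 = background, 128 = unknown band
    around the boundary where the soft matte must be estimated."""
    fg = binary_erosion(mask > 0, iterations=band)
    unknown = binary_dilation(mask > 0, iterations=band) & ~fg
    trimap = np.zeros(mask.shape, dtype=np.uint8)
    trimap[fg] = 255
    trimap[unknown] = 128
    return trimap
```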
Face Anon can anonymize faces in images while keeping original facial expressions and head positions. It uses diffusion models to achieve high-quality image results and can also perform face swapping tasks.
CityGaussianV2 can reconstruct large-scale scenes from multi-view RGB images with high accuracy.
Self-Supervised Any-Point Tracking by Contrastive Random Walks can track any point in a video using a self-supervised global matching transformer.
MOFT is a training-free video motion interpreter and controller. It extracts motion information from video diffusion models and guides the motion of generated videos without any retraining.
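A generic sketch of the mechanism that such training-free methods rely on: capturing intermediate activations of a pretrained denoiser with forward hooks. The toy 3D conv stack here stands in for a video diffusion U-Net; MOFT's specific feature selection and motion-channel filtering are described in the paper.

```python
import torch

features = {}

def save_output(name):
    """Forward hook that stashes a module's output for later inspection."""
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

denoiser = torch.nn.Sequential(          # stand-in for a video diffusion U-Net
    torch.nn.Conv3d(4, 8, 3, padding=1),
    torch.nn.SiLU(),
    torch.nn.Conv3d(8, 4, 3, padding=1))
denoiser[1].register_forward_hook(save_output("mid"))

latents = torch.randn(1, 4, 8, 32, 32)   # (B, C, frames, H, W)
denoiser(latents)
motion_features = features["mid"]        # temporal variation lives in here
```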
PF3plat can generate photorealistic images and accurate camera positions from uncalibrated image collections.
ScalingConcept can enhance or suppress existing concepts in images and audio without adding new elements. It can generate poses, enhance object stitching and reduce fuzziness in anime productions.
NoPoSplat can reconstruct 3D Gaussian scenes from unposed multi-view images. It achieves real-time reconstruction and high-quality images, especially when there are few input views.
ControlAR adds spatial controls such as edge maps, depth maps, and segmentation masks to autoregressive image models like LlamaGen.
State-of-the-art diffusion models are typically trained on square images. FiT is a transformer architecture designed for generating images with unrestricted resolutions and aspect ratios (similar to what Sora does). By treating an image as a sequence of variable-length tokens, it enables a flexible training strategy that adapts to diverse aspect ratios during both training and inference, improving resolution generalization and eliminating the biases introduced by image cropping.
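As a concrete illustration of the idea, here is a minimal sketch in plain PyTorch (not FiT's actual code) of patchifying images of mixed aspect ratios into variable-length token sequences and packing them into one padded batch with an attention mask:

```python
import torch

def patchify(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Flatten a (C,H,W) image of any aspect ratio into a sequence of
    patch tokens, as a flexible-resolution transformer would consume."""
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0
    return (img.unfold(1, patch, patch)            # split height into patches
               .unfold(2, patch, patch)            # split width into patches
               .permute(1, 2, 0, 3, 4)
               .reshape(-1, c * patch * patch))    # (N_tokens, C*p*p)

def pack(seqs, max_len):
    """Pad variable-length token sequences into one batch plus an
    attention mask, so mixed aspect ratios can share a training batch."""
    dim = seqs[0].shape[1]
    batch = torch.zeros(len(seqs), max_len, dim)
    mask = torch.zeros(len(seqs), max_len, dtype=torch.bool)
    for i, s in enumerate(seqs):
        batch[i, : len(s)] = s
        mask[i, : len(s)] = True
    return batch, mask

wide = patchify(torch.randn(3, 128, 256))      # 8 x 16 = 128 tokens
tall = patchify(torch.randn(3, 256, 128))      # 16 x 8 = 128 tokens
square = patchify(torch.randn(3, 224, 224))    # 14 x 14 = 196 tokens
batch, mask = pack([wide, tall, square], max_len=196)
```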
From Text to Pose to Image can generate high-quality images from text prompts by first creating poses and then using them to guide image generation. This method improves control over human poses and enhances image fidelity in diffusion models.
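A hedged sketch of the two-stage idea, using an off-the-shelf OpenPose ControlNet from diffusers for the image stage; `sample_pose_image` is a hypothetical stand-in for the paper's text-to-pose model, which is not sketched here.

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Stage 2 of the text -> pose -> image idea, sketched with a standard
# OpenPose ControlNet. `sample_pose_image(prompt)` below is a
# hypothetical stand-in for the paper's stage-1 text-to-pose model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

prompt = "a dancer mid-leap on a rooftop at dusk"
pose_image = sample_pose_image(prompt)   # hypothetical text-to-pose stage
image = pipe(prompt, image=pose_image, num_inference_steps=30).images[0]
```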