AI Toolbox
A curated collection of 759 free cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.

AnimateDiff-Lightning can generate videos over ten times faster than AnimateDiff. It uses progressive adversarial diffusion distillation to combine multiple diffusion models into one motion module, improving style compatibility and achieving top performance in few-step video generation.
HoloDreamer can generate enclosed 3D scenes from text descriptions. It does so by first creating a high-quality equirectangular panorama and then rapidly reconstructing the 3D scene using 3D Gaussian Splatting.
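For intuition on why a single equirectangular panorama is a good starting point for enclosed-scene reconstruction: every pixel in that format corresponds to a unique viewing direction from the panorama's center. A minimal sketch of that pixel-to-direction mapping (the function name and conventions are illustrative, not HoloDreamer's code):

```python
import math

def equirect_to_direction(u, v, width, height):
    """Map an equirectangular pixel (u, v) to a unit 3D view direction.

    Longitude spans [-pi, pi) across the width and latitude spans
    [pi/2, -pi/2] down the height, so each pixel is a ray from the
    panorama's center -- the property 3D reconstruction relies on.
    """
    lon = (u / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v / height) * math.pi
    x = math.cos(lat) * math.sin(lon)
    y = math.sin(lat)
    z = math.cos(lat) * math.cos(lon)
    return (x, y, z)
```

The center pixel maps to the forward direction (0, 0, 1), and every output is a unit vector, so the panorama covers the full enclosing sphere.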
InTeX can enable interactive text-to-texture synthesis for 3D content creation. It allows users to repaint specific areas and edit textures precisely, while a depth-aware inpainting model reduces 3D inconsistencies and speeds up generation.
StyleSketch is a method for extracting high-resolution stylized sketches from a face image. Pretty cool!
Controllable Text-to-3D Generation via Surface-Aligned Gaussian Splatting can create high-quality 3D content from text prompts. It uses edge, depth, normal, and scribble maps in a multi-view diffusion model, enhancing 3D shapes with a unique hybrid guidance method.
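The edge maps used as one of those guidance signals are typically simple gradient-magnitude images. A minimal sketch of a Sobel edge map in plain NumPy, just to illustrate the kind of conditioning input involved (this is generic image processing, not the paper's pipeline):

```python
import numpy as np

def sobel_edge_map(gray):
    """Compute a Sobel gradient-magnitude edge map from a 2D grayscale array.

    Edge maps like this are one conditioning signal (alongside depth,
    normal, and scribble maps) that can steer controllable generation.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T  # vertical-gradient kernel is the transpose
    h, w = gray.shape
    padded = np.pad(gray, 1, mode="edge")  # replicate borders
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3]
            gx[i, j] = (patch * kx).sum()
            gy[i, j] = (patch * ky).sum()
    return np.hypot(gx, gy)
```

A flat image yields an all-zero map, while an intensity step produces a strong response along its boundary.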
Desigen can generate high-quality design templates, including background images and layout elements. It uses advanced diffusion models for better control and has been tested on over 40,000 advertisement banners, achieving results similar to human designers.
StyleGaussian enables instant style transfer of any image’s style to a 3D scene at 10 fps while preserving strict multi-view consistency.
DragAnything can control the motion of any object in videos by letting users draw trajectory lines. It allows for separate motion control of multiple objects, including backgrounds.
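A drawn trajectory is just a polyline, and turning it into a motion signal means giving each video frame a target position along the stroke. A small sketch of that resampling step, by arc length (illustrative preprocessing, not DragAnything's implementation):

```python
import math

def resample_trajectory(points, num_frames):
    """Resample a user-drawn polyline into one target position per frame.

    `points` is a list of (x, y) control points along the stroke; the
    output spaces `num_frames` positions evenly along the path's arc
    length, the kind of per-frame motion target a trajectory-conditioned
    video model could consume.
    """
    # Cumulative arc length at each control point.
    dists = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        dists.append(dists[-1] + math.hypot(x1 - x0, y1 - y0))
    total = dists[-1]
    out = []
    for f in range(num_frames):
        target = total * f / (num_frames - 1) if num_frames > 1 else 0.0
        # Locate the segment containing this arc-length position.
        i = 1
        while i < len(dists) - 1 and dists[i] < target:
            i += 1
        seg = dists[i] - dists[i - 1]
        t = (target - dists[i - 1]) / seg if seg > 0 else 0.0
        (x0, y0), (x1, y1) = points[i - 1], points[i]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out
```

For multiple objects, each drawn stroke would be resampled independently, which is what makes separate per-object motion control possible.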
DEADiff can synthesize images that combine the style of a reference image with text prompts. It uses a Q-Former mechanism to separate style and meaning.
VideoElevator is a training-free, plug-and-play method that enhances the temporal consistency of text-to-video models and adds more photo-realistic detail to their output by leveraging text-to-image models.
ELLA is a lightweight approach that equips existing CLIP-based diffusion models with LLMs to improve prompt understanding and enable long, dense text comprehension for text-to-image models.
SplattingAvatar can generate photorealistic real-time human avatars using a mix of Gaussian Splatting and triangle mesh geometry. It achieves over 300 FPS on modern GPUs and 30 FPS on mobile devices, allowing for detailed appearance modeling and various animation techniques.
The PixArt model family got a new addition with PixArt-Σ. The model is capable of directly generating images at 4K resolution. Compared to its predecessor, PixArt-α, it offers images of higher fidelity and improved alignment with text prompts.
UniCtrl can improve the quality and consistency of videos made by text-to-video models. It enhances how frames connect and move together without needing extra training, making videos look better and more diverse in motion.
TripoSR can generate high-quality 3D meshes from a single image in under 0.5 seconds.
ResAdapter can generate images with any resolution and aspect ratio for diffusion models. It works with various personalized models and processes images efficiently, using only 0.5M parameters while keeping the original style.
ViewDiff is a method that can generate high-quality, multi-view consistent images of a real-world 3D object in authentic surroundings from a single text prompt or a single posed image.
While LCM and Turbo have unlocked near real-time image diffusion, the quality is still a bit lacking. TCD on the other hand manages to generate images with both clarity and detailed intricacy without compromising on speed.
OHTA can create detailed and usable hand avatars from just one image. It allows for text-to-avatar conversion and editing of hand textures and shapes, using data-driven hand priors to improve accuracy with limited input.
SongComposer can generate both lyrics and melodies using symbolic song representations. It aligns lyrics and melodies precisely and outperforms advanced models like GPT-4 in creating songs.