AI Toolbox
A curated collection of 759 free, cutting-edge AI papers with code and tools for text, image, video, 3D, and audio generation and manipulation.





GEM3D is a deep, topology-aware generative model of 3D shapes. The method is able to generate diverse and plausible 3D shapes from user-modeled skeletons, making it possible to draw the rough structure of an object and have the model fill in the rest.
Multi-LoRA Composition focuses on the integration of multiple Low-Rank Adaptations (LoRAs) to create highly customized and detailed images. The approach is able to generate images with multiple elements without fine-tuning and without losing detail or image quality.
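To get a feel for the general idea, here is a minimal sketch of stacking several LoRAs at inference time using diffusers' multi-adapter support (the naive merging baseline, not the paper's decoding-time composition strategies); the model ID and LoRA paths are placeholders.

```python
# Minimal sketch: combine two independently trained LoRAs at inference time
# with diffusers' multi-adapter support (naive weight merging -- the baseline
# the paper improves on with decoding-time composition). Model ID and LoRA
# paths are placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Load a character LoRA and a style LoRA under separate adapter names.
pipe.load_lora_weights("path/to/character_lora", adapter_name="character")
pipe.load_lora_weights("path/to/style_lora", adapter_name="style")

# Activate both with per-adapter weights -- no fine-tuning needed.
pipe.set_adapters(["character", "style"], adapter_weights=[0.8, 0.7])

image = pipe("a portrait of the character in the reference style").images[0]
image.save("multi_lora.png")
```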
MeshFormer can generate high-quality 3D textured meshes from just a few 2D images in seconds.
SPA-RP can create 3D textured meshes and estimate camera positions from one or a few 2D images. It uses 2D diffusion models to quickly understand 3D space, achieving high-quality results in about 20 seconds.
SCG can be used by musicians to compose and improvise new piano pieces. It allows musicians to guide music generation by using rules like following a simple I-V chord progression in C major. Pretty cool.
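As a rough illustration of rule-guided generation (not SCG's actual sampler or API), the sketch below scores toy chord candidates against an I-V progression in C major and keeps the best one.

```python
# Illustrative sketch of rule-guided generation: sample candidate chord
# progressions and keep the one that best satisfies a symbolic rule
# (alternating I and V in C major). The generator is a toy stand-in.
import random

C_MAJOR_I = {"C", "E", "G"}   # tonic triad
C_MAJOR_V = {"G", "B", "D"}   # dominant triad

def rule_score(progression):
    """Reward chords that alternate between I and V in C major."""
    score = 0
    for i, chord in enumerate(progression):
        target = C_MAJOR_I if i % 2 == 0 else C_MAJOR_V
        score += len(chord & target)
    return score

def sample_candidate(length=4):
    notes = ["C", "D", "E", "F", "G", "A", "B"]
    return [set(random.sample(notes, 3)) for _ in range(length)]

# Generate many candidates and follow the rule-preferred one.
candidates = [sample_candidate() for _ in range(200)]
best = max(candidates, key=rule_score)
print("chosen progression:", [sorted(c) for c in best])
```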
[FlashTex](https://flashtex.github.io) can texture an input 3D mesh given a user-provided text prompt. These generated textures can also be relit properly in different lighting environments.
Visual Style Prompting can generate images with a specific style from a reference image. Compared to other methods like IP-Adapter and LoRAs, Visual Style Prompting is better at retaining the style of the reference image while avoiding style leakage from text prompts.
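The rough idea behind reference-based style injection can be sketched at the tensor level: let the generated image's self-attention queries attend to keys and values derived from the reference image's features. The shapes and the choice of layers are illustrative, not the paper's exact implementation.

```python
# Rough sketch of reference-based style injection: in selected self-attention
# layers, the generated image's queries attend to keys/values computed from
# the reference image's features. Shapes and layer selection are illustrative.
import torch

def style_swapped_attention(q_gen, k_ref, v_ref):
    """Self-attention where K/V come from the style reference features.

    q_gen: (batch, tokens, dim) queries from the image being generated
    k_ref, v_ref: (batch, tokens, dim) keys/values from the reference image
    """
    d = q_gen.shape[-1]
    attn = torch.softmax(q_gen @ k_ref.transpose(-1, -2) / d**0.5, dim=-1)
    return attn @ v_ref

# Toy example with random features standing in for U-Net activations.
q = torch.randn(1, 64, 320)
k_ref = torch.randn(1, 64, 320)
v_ref = torch.randn(1, 64, 320)
out = style_swapped_attention(q, k_ref, v_ref)
print(out.shape)  # torch.Size([1, 64, 320])
```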
Vevo can imitate voices without needing specific training data. It can change accents and emotions while keeping output high quality, using a self-supervised method that separates different speech features.
Argus3D can generate 3D meshes from images and text prompts, as well as unique textures for its generated shapes. Just imagine composing a 3D scene and filling it with objects by pointing at a spot and describing in natural language what you want to place there.
AudioEditing introduces two methods for editing audio. The first technique allows for text-based editing, while the second discovers semantically meaningful editing directions without supervision.
Magic-Me can generate identity-specific videos from a few reference images while keeping the person’s features clear.
Continuous 3D Words is a control method that can modify attributes in images with a slider-based approach. This allows for finer control over, for instance, illumination, non-rigid shape changes (like wings), and camera orientation.
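As a conceptual sketch (a simplification, not the paper's learned mapping network), a slider value can be turned into a conditioning vector by interpolating between two learned endpoint embeddings:

```python
# Conceptual sketch of a slider-controlled attribute: map a scalar in [0, 1]
# to a conditioning vector by interpolating between two learned endpoint
# embeddings (e.g., "wings folded" vs. "wings spread"). This module and its
# use inside a diffusion model are hypothetical simplifications.
import torch
import torch.nn as nn

class AttributeSlider(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.low = nn.Parameter(torch.randn(dim))   # embedding at slider = 0
        self.high = nn.Parameter(torch.randn(dim))  # embedding at slider = 1

    def forward(self, t: float) -> torch.Tensor:
        t = torch.tensor(float(t)).clamp(0.0, 1.0)
        return torch.lerp(self.low, self.high, t)

slider = AttributeSlider()
cond = slider(0.3)   # 30% of the way between the two attribute states
print(cond.shape)    # torch.Size([768])
```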
GALA3D is a text-to-3D method that can generate complex scenes with multiple objects and control their placement and interaction. The method uses large language models to generate initial layout descriptions and then optimizes the 3D scene with conditioned diffusion to make it more realistic.
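The layout stage can be pictured as asking an LLM for a coarse scene description in a structured format; the JSON schema and canned response below are illustrative assumptions, not GALA3D's actual interface.

```python
# Sketch of the layout stage only: an LLM is prompted to return a coarse
# scene layout as JSON (object name, position, scale), which would then seed
# per-object 3D generation and layout refinement. Schema and the canned
# LLM response are illustrative assumptions.
import json
from dataclasses import dataclass

@dataclass
class LayoutBox:
    name: str
    center: tuple   # (x, y, z) in scene coordinates
    scale: tuple    # (w, h, d)

LAYOUT_PROMPT = (
    "Return a JSON list of objects for the scene 'a cozy reading corner', "
    "each with 'name', 'center' [x,y,z] and 'scale' [w,h,d]."
)

# Stand-in for the LLM's reply.
llm_response = """
[
  {"name": "armchair", "center": [0.0, 0.0, 0.0], "scale": [1.0, 1.1, 1.0]},
  {"name": "floor lamp", "center": [0.8, 0.0, -0.2], "scale": [0.3, 1.6, 0.3]},
  {"name": "side table", "center": [-0.7, 0.0, 0.1], "scale": [0.5, 0.6, 0.5]}
]
"""

layout = [LayoutBox(o["name"], tuple(o["center"]), tuple(o["scale"]))
          for o in json.loads(llm_response)]
for box in layout:
    print(box)
```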
LGM can generate high-resolution 3D models from text prompts or single-view images. It uses a fast multi-view Gaussian representation, producing models in under 5 seconds while maintaining high quality.
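For intuition, here is what a batch of 3D Gaussian parameters looks like, the kind of representation such a feed-forward model predicts from a few posed views; the tensor layout and count are illustrative, not LGM's exact output head.

```python
# Minimal sketch of per-point 3D Gaussian parameters: position, scale,
# rotation (quaternion), opacity and RGB color. Layout and count are
# illustrative, not LGM's exact output head.
import torch

num_gaussians = 65_536
gaussians = {
    "xyz":      torch.zeros(num_gaussians, 3),              # centers
    "scale":    torch.full((num_gaussians, 3), 0.01),       # per-axis extent
    "rotation": torch.tensor([1.0, 0.0, 0.0, 0.0]).repeat(num_gaussians, 1),
    "opacity":  torch.full((num_gaussians, 1), 0.5),
    "rgb":      torch.rand(num_gaussians, 3),
}

# A feed-forward model maps a few posed input views to one flat parameter
# vector per Gaussian, which a splatting renderer then rasterizes.
flat = torch.cat(
    [gaussians[k] for k in ["xyz", "scale", "rotation", "opacity", "rgb"]],
    dim=-1,
)
print(flat.shape)  # torch.Size([65536, 14])
```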
ConsistI2V is an image-to-video method with enhanced visual consistency. Compared to other methods, it better maintains the subject, background, and style of the first frame and ensures a fluid, logical progression, while also supporting long video generation and camera motion control.
Direct-a-Video can individually or jointly control camera movement and object motion in text-to-video generations. This means you can generate a video and tell the model to move the camera from left to right, zoom in or out, and move objects around in the scene.
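A sketch of what such decoupled controls could look like as plain data: global camera pan/zoom parameters plus per-object box trajectories. The field names are illustrative, not the paper's API.

```python
# Sketch of decoupled controls as plain data: global camera motion parameters
# plus per-object box trajectories (start box -> end box in normalized image
# coordinates). Field names are illustrative, not the paper's API.
from dataclasses import dataclass, field

@dataclass
class CameraMotion:
    pan_x: float = 0.0    # -1 = pan left, +1 = pan right
    pan_y: float = 0.0    # -1 = pan down, +1 = pan up
    zoom: float = 1.0     # >1 zooms in over the clip, <1 zooms out

@dataclass
class ObjectMotion:
    prompt: str
    start_box: tuple      # (x0, y0, x1, y1), normalized to [0, 1]
    end_box: tuple

@dataclass
class VideoDirection:
    prompt: str
    camera: CameraMotion = field(default_factory=CameraMotion)
    objects: list = field(default_factory=list)

direction = VideoDirection(
    prompt="a sailboat crossing a lake at sunset",
    camera=CameraMotion(pan_x=0.5, zoom=1.2),
    objects=[ObjectMotion("sailboat", (0.1, 0.4, 0.3, 0.7), (0.6, 0.4, 0.8, 0.7))],
)
print(direction)
```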
Video-LaVIT is a multi-modal video-language method that can comprehend and generate image and video content and supports long video generation.
InterScene is a novel framework that enables physically simulated characters to perform long-term interaction tasks in diverse, cluttered, and unseen scenes. Another step closer to completely dynamic game worlds and simulations. Check out the impressive demo below.
AToM is a text-to-mesh framework that can generate high-quality textured 3D meshes from text prompts in less than a second. The method is optimized across multiple prompts and can create diverse objects it wasn't trained on.
Last year we got real-time diffusion for images; this year we'll get it for video! AnimateLCM can generate high-fidelity videos with minimal steps. The model also supports image-to-video generation as well as adapters like ControlNet. It's not available yet, but once it hits, expect way more AI-generated video content.