IC-LoRA — In-Context LoRA — is a LoRA type for LTX Video 2.3 that gives you fine-grained video-to-video control at inference time. Instead of only conditioning the model through text, IC-LoRA lets you supply a reference video as a conditioning input, anchoring the output to the reference's structure, depth, pose, style, or motion. The result is significantly more controlled video generation than text-only prompting allows — useful for depth adaptation, pose transfer, style preservation, and scene consistency across multiple clips. This guide covers what IC-LoRA is, how it differs from standard LTX Video LoRA training, who needs it, and the current state of no-code IC-LoRA trainers.
What Is IC-LoRA?
IC-LoRA stands for In-Context LoRA. It was developed by Lightricks as part of the LTX Video 2 (LTX-2) ecosystem and is included in the official ltx-trainer package. Conceptually, IC-LoRA is a LoRA adapter trained to accept a reference video as a conditioning input at inference time. When generating video with an IC-LoRA model, you provide both a text prompt and a reference video clip. The IC-LoRA adapter conditions the output to follow the reference video's spatial structure — depth layout, edge geometry, human pose, or camera motion — while generating new content described by the text prompt.
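To make the inputs concrete, the sketch below names the three pieces an IC-LoRA generation needs: a prompt, a reference video, and the trained adapter. The ICLoRAGenerationRequest class, its field names, and the file paths are hypothetical placeholders for illustration, not the actual LTX Video 2.3 inference API.

```python
from dataclasses import dataclass

# Hypothetical request shape, for illustration only. This is not the actual
# LTX Video 2.3 inference API; it just names the three inputs an IC-LoRA
# generation needs.
@dataclass
class ICLoRAGenerationRequest:
    prompt: str                  # text prompt describing the new content
    reference_video: str         # path or URL of the conditioning clip (structure source)
    ic_lora_path: str            # trained IC-LoRA adapter (.safetensors)
    ic_lora_scale: float = 1.0   # how strongly the reference structure constrains the output

request = ICLoRAGenerationRequest(
    prompt="a bronze statue walking through a rain-soaked plaza at night",
    reference_video="reference_clips/walk_cycle.mp4",
    ic_lora_path="adapters/ic_lora_pose.safetensors",
)
```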
Lightricks ships several pre-trained IC-LoRA checkpoints: IC-LoRA-Detailer (for texture and surface enhancement on reference video), Union-Control (combining multiple control signals), and others targeting specific conditioning types like depth, canny edge, and pose. Community and developer-trained IC-LoRA adapters can target custom control signals relevant to specific production workflows.
How IC-LoRA Differs From Standard LoRA Training
Standard LTX Video LoRA training — as used for character LoRA, style LoRA, motion LoRA, and world LoRA — embeds learned behavior into the model weights during training. At inference time, you activate the LoRA by including its trigger word in the text prompt. The model generates video following your prompt, modified by the learned behavior encoded in the LoRA weights. This works well for consistent character motion, trained motion styles, and world-building contexts, but it cannot incorporate structure or spatial information from a reference video at generation time.
IC-LoRA adds an inference-time video conditioning pathway. The trained IC-LoRA adapter learns how to extract structural, spatial, or stylistic features from a reference video and inject them into the generation process — not by memorizing specific clips during training, but by learning a general video conditioning mechanism. A trained IC-LoRA can accept any reference video at inference time and use its spatial information to guide the output structure. You do not need to train a new IC-LoRA for every new reference video once the conditioning type is established.
The practical difference: standard LoRA encodes a specific learned behavior (this character, this motion style, this visual world). IC-LoRA learns a conditioning pathway (take any reference video and transfer its depth structure, or its edge geometry, or its pose information into a new generation). The two approaches can be combined — a character LoRA and an IC-LoRA-Depth adapter applied together let you generate video of a specific character following the spatial layout of a reference clip.
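A minimal sketch of that combination, assuming an LTX pipeline that exposes the standard diffusers multi-LoRA interface, might look like the following. The model ID, adapter file names, and weights are placeholders, and whether LTX Video 2.3 checkpoints and IC-LoRA adapters load through this exact pipeline is an assumption; how the reference conditioning video is supplied at generation time depends on the serving stack and is omitted here.

```python
import torch
from diffusers import LTXPipeline

# Assumption: an LTX pipeline exposing the standard diffusers multi-LoRA interface.
# The model ID and adapter file names are placeholders for illustration.
pipe = LTXPipeline.from_pretrained("Lightricks/LTX-Video", torch_dtype=torch.bfloat16)

# The character LoRA decides who appears; the IC-LoRA depth adapter provides the
# conditioning pathway that ties the layout to a reference clip's depth structure.
pipe.load_lora_weights("adapters", weight_name="character_mara.safetensors",
                       adapter_name="character")
pipe.load_lora_weights("adapters", weight_name="ic_lora_depth.safetensors",
                       adapter_name="ic_depth")

# Blend both adapters; the reference conditioning video itself is passed at
# generation time by the serving stack and is not shown here.
pipe.set_adapters(["character", "ic_depth"], adapter_weights=[0.8, 1.0])
```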
Use Cases for IC-LoRA
Depth-guided video generation: Use a reference video's depth map to control the spatial layout and camera geometry of a new generation. Useful for cinematic re-composition, parallax effects, and consistent spatial framing across multiple generated clips without manual scene reconstruction.
Pose transfer: Apply a reference video's human pose sequence to generate new video of a different person or character performing the same motion. Enables motion transfer from reference footage to generated characters without traditional motion capture or retargeting workflows.
Edge and structure conditioning: Use canny edge detection from a reference video to constrain the structural layout of the generated output. Allows generating stylistically different video that preserves the spatial composition of a reference — useful for archviz walk-through generation and stylized re-rendering of reference footage.
Video restoration and enhancement: Train an IC-LoRA on paired degraded-to-clean video examples. At inference, supply degraded input video as the reference conditioning and generate an enhanced, restored version following the text prompt's quality description.
Scene consistency across shots: Use a reference clip's spatial and lighting structure as IC-LoRA conditioning to generate multiple new shots that share consistent scene geometry and depth layout — useful for episodic or serialized generated content requiring visual continuity.
Training an IC-LoRA: Technical Requirements
IC-LoRA training requires a dataset of paired videos: each training example consists of a conditioning reference video and a corresponding target video. The model learns the relationship between the reference video's extracted features (depth maps, edge maps, pose keypoints, or raw video frames) and the target video it should generate. IC-LoRA training typically calls for around 15 or more high-quality video pairs, more than standard LoRA training usually needs, because the conditioning pathway has to generalize across varied reference inputs.
Quality matters substantially more than quantity. The paired examples must have consistent spatial correspondence between the conditioning signal and the target video — a depth map that accurately represents the target clip's scene geometry, for example. Misaligned pairs produce a weakly conditioned IC-LoRA that loses spatial fidelity at inference time.
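One common way to guarantee that correspondence is to derive the conditioning video directly from the target clip. The sketch below builds a canny-edge conditioning video from a target video using OpenCV; the threshold values and file layout are illustrative choices, not requirements of the trainer.

```python
import cv2

# Build a canny-edge conditioning video from a target clip so the pair shares
# exact frame-for-frame spatial correspondence. Thresholds and file paths are
# illustrative choices.
def make_edge_conditioning(target_path: str, out_path: str,
                           low: int = 100, high: int = 200) -> None:
    cap = cv2.VideoCapture(target_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        edges = cv2.Canny(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), low, high)
        writer.write(cv2.cvtColor(edges, cv2.COLOR_GRAY2BGR))  # keep 3 channels for the writer
    cap.release()
    writer.release()

make_edge_conditioning("pairs/target_001.mp4", "pairs/condition_001.mp4")
```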
Training configuration for IC-LoRA differs from standard LoRA. The learning rate is typically lower and training steps higher to teach the conditioning pathway without collapsing into direct video copying. Rank settings in the 32 to 64 range are common starting points. Auto-captioning the target videos remains important so text-conditioning stays functional alongside IC conditioning at inference time.
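As a rough illustration of those knobs, here is a minimal configuration sketch. The keys and values are assumptions made for this example and do not reflect the ltx-trainer's actual configuration schema; treat it as a starting point to adapt rather than a recipe.

```python
# Illustrative starting point only. These keys and values are assumptions made
# for this example and do not reflect the ltx-trainer's actual config schema.
ic_lora_config = {
    "rank": 64,                    # 32-64 is a common starting range
    "learning_rate": 5e-5,         # lower than a typical standard-LoRA run
    "train_steps": 4000,           # more steps to learn the conditioning pathway
    "conditioning_type": "depth",  # depth, canny, pose, or raw video pairs
    "auto_caption_targets": True,  # keep text conditioning functional alongside IC conditioning
}
```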
The LTX-2 trainer supports IC-LoRA training via the standard training pipeline with IC-specific configuration. Training runs take 60 to 120 minutes depending on dataset size and configuration, producing a standard .safetensors file compatible with LTX Video 2.3 inference endpoints.
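Because the output is a standard .safetensors file, it can be sanity-checked with the safetensors library before uploading it to an inference endpoint. The file path below is a placeholder.

```python
from safetensors.torch import load_file

# Open the finished adapter and confirm it contains LoRA weight tensors
# before uploading it anywhere. The file path is a placeholder.
state_dict = load_file("output/ic_lora_depth.safetensors")
print(f"{len(state_dict)} tensors in the adapter")
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```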
Current State of No-Code IC-LoRA Tools
As of April 2026, WaveSpeedAI has a functional IC-LoRA trainer at wavespeed.ai that accepts a ZIP file of paired training videos. It is the most accessible option currently available — upload your dataset ZIP, configure basic settings, and receive a .safetensors file. However, the workflow still requires you to understand what paired video datasets are, how to prepare conditioning signal videos (depth maps, edge maps, or raw video pairs), and how to structure a valid training ZIP. There is no guided wizard explaining what each step means for non-technical users.
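If you are preparing an upload by hand, the dataset can be packed with nothing more than the Python standard library. The conditioning/ and target/ folder names and matched file names below are an assumed layout for illustration; check WaveSpeedAI's documentation for the structure its trainer actually expects.

```python
import zipfile
from pathlib import Path

# Pack paired videos into a single ZIP for upload. The folder layout and
# matched file names are an assumption for illustration, not a documented spec.
def build_training_zip(pairs_dir: str, zip_path: str) -> None:
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for sub in ("conditioning", "target"):
            for video in sorted(Path(pairs_dir, sub).glob("*.mp4")):
                zf.write(video, arcname=f"{sub}/{video.name}")

build_training_zip("my_ic_lora_pairs", "ic_lora_dataset.zip")
```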
The official LTX-2 trainer on GitHub supports IC-LoRA training for developers comfortable with Python environments, configuration files, and GPU resource management. This is a full-control option but requires significant technical setup.
Grix LoRA Trainer is building IC-LoRA training support into its no-code wizard interface. The training wizard at grixai.com/lora/train uses a four-step Recipe → Dataset → Config → Launch flow. IC-LoRA will appear as a recipe type with plain-language explanations of what conditioning type to choose, how to prepare paired videos, and what the training output will do at inference time. No Python, no ZIP file preparation, no configuration file editing required.
IC-LoRA vs. Standard Image-to-Video: Key Distinction
IC-LoRA is often confused with standard image-to-video generation, where a reference image or video is used as the starting frame for the output. These are different techniques with different outputs and different use cases.
Standard image-to-video (LTX Video's default start-frame conditioning): The reference input becomes the literal first frame or starting visual state. The video animates forward from that starting point. The reference appears unchanged at frame 0 and the scene evolves from it following the text prompt.
IC-LoRA conditioning: A reference video provides structural information (depth, edges, pose, or other spatial signals) that influences the spatial layout of the generated output. The reference video does not appear in the output; instead, its spatial structure is extracted and used to guide the geometry and composition of the newly generated video. The generated content takes its visual appearance from the text prompt and its spatial structure from the reference.
For controlling spatial layout, camera geometry, human pose, or depth structure in generated video without the reference appearing in the output, IC-LoRA is the correct approach. For animating forward from a specific starting visual state, standard image-to-video is correct.
Getting Started
To train an IC-LoRA for LTX Video 2.3 today, the most accessible option is WaveSpeedAI's IC-LoRA trainer — upload a ZIP of your paired conditioning and target videos, configure training settings, and receive a .safetensors file. This requires understanding how to prepare paired video datasets, which control signal type (depth, canny, pose, or raw video) matches your use case, and how to structure a valid training ZIP. For developers, the official LTX-2 trainer on GitHub provides full control via Python.
When IC-LoRA training is available through Grix LoRA Trainer, the workflow will be: select IC-LoRA as the recipe type, upload your paired video dataset through the guided dataset step with plain-language instructions, review pre-configured training settings with AI sidekick explanations, and launch. The resulting .safetensors LoRA file is immediately testable in Grix Studio by providing a reference conditioning video alongside your generation prompt.
Frequently Asked Questions
What does IC-LoRA stand for? IC-LoRA stands for In-Context LoRA. It is a LoRA training approach developed by Lightricks for the LTX Video 2 ecosystem. "In-Context" refers to the model learning to condition generation on a reference video context at inference time.
How is IC-LoRA different from standard LTX Video LoRA? Standard LoRA embeds a specific learned style, character, or motion into weights during training and activates via a trigger word at inference. IC-LoRA trains a video conditioning pathway — at inference, you provide a reference video, and the model extracts its spatial structure (depth, edges, pose) to guide the generated output's geometry while the text prompt governs visual content.
Can I train an IC-LoRA without coding? Yes, with some effort. WaveSpeedAI offers a no-code IC-LoRA trainer at wavespeed.ai that accepts a ZIP of paired training videos. It still requires understanding how to prepare paired video datasets. A fully guided wizard with plain-language explanations is in development at Grix LoRA Trainer.
How many video pairs do I need to train an IC-LoRA? Around 15 or more high-quality paired videos is the typical starting point, more than standard LoRA training usually needs. Quality and accurate spatial correspondence between conditioning and target videos matter more than dataset size.
What is the difference between IC-LoRA and image-to-video generation? Image-to-video uses a reference image as the literal first frame and animates forward from it. IC-LoRA takes a reference video and extracts its spatial structure (depth, pose, edges) to guide the layout of a newly generated video without the reference appearing in the output. IC-LoRA is for structural control; image-to-video is for animating from a starting visual state.