This work investigates the use of latent diffusion models for real-time visual servoing of unmanned aerial vehicles (UAVs) operating in unstructured environments. Traditional visual servoing approaches rely on continuous target visibility and handcrafted control pipelines, which limit adaptability and robustness. In contrast, diffusion models can compactly encode target and environmental information within a latent space, enabling control strategies that are more efficient and flexible than handcrafted pipelines dependent on uninterrupted target visibility.
However, the iterative denoising procedure at the core of diffusion models makes inference computationally expensive, posing a challenge for real-time deployment on resource-constrained aerial platforms. To overcome this limitation, this research explores multiple strategies for accelerating diffusion-based inference, reducing latency and improving control responsiveness.
The ultimate objective is a computationally efficient, end-to-end UAV control framework that generates low-level commands directly from image inputs, advancing the autonomy and robustness of UAV visual servoing in dynamic, unstructured settings.