MMD-LoRA: Integrating LoRA and Contrastive Learning for Depth Estimation

Abstract:

The authors introduce Multi-Modality Driven Low-Rank Adaptation (MMD-LoRA), which uses low-rank adaptation matrices for efficient fine-tuning from the source domain to the target domain, addressing the Adverse Condition Depth Estimation (ACDE) problem. It consists of two core components: Prompt-Driven Domain Alignment (PDDA) and Visual-Text Consistency Contrastive Learning (VTCCL). Extensive experiments highlight its robustness and efficiency in adapting to various adverse environments.

Ā©ļøć€Deep Blue AI怑Compiled

Paper Title: Multi-Modality Driven LoRA for Adverse Condition Depth Estimation
Authors: Guanglei Yang, Rui Tian, Yongqiang Zhang, Zhun Zhong, Yongqiang Li, Wangmeng Zuo

Paper Link: https://arxiv.org/pdf/2412.20162

ā– 1. Introduction

Autonomous driving systems must cope with a wide range of real-world conditions. A central challenge is handling the many corner-case scenarios in which driving safety becomes critical under adverse conditions such as nighttime, fog, rain, and snow. These conditions not only limit the vehicle's ability to perceive its environment but also increase the risk of accidents, making robust depth estimation in such scenarios essential for safe autonomous driving.

One of the main difficulties in addressing this issue is the lack of high-quality real-world images from these adverse conditions, and collecting such data is a challenging task. Additionally, the high cost of annotating these real-world images makes it impractical to rely solely on traditional data collection and annotation methods. Thus, alternative methods are sought without the need for large-scale annotated data.

To this end, Adverse Condition Depth Estimation (ACDE), which estimates depth under unseen weather conditions without relying on large numbers of annotated samples, has become a research hotspot. Although previous methods such as md4all have made significant progress, they primarily rely on generative models to convert images captured in clear weather into images depicting adverse weather, and these generative approaches require sufficient target-domain images to train well-performing models (e.g., ForkGAN). Other methods use learnable parameters for feature enhancement to adapt to the target domain, which increases model complexity and tuning effort. Furthermore, unlike CLIP-based methods, depth estimation models lack sufficient alignment between the text and visual spaces, hindering coherent understanding under adverse conditions.

In this paper, the authors propose MMD-LoRA, which combines low-rank adaptation (LoRA) with contrastive learning to address the Adverse Condition Depth Estimation (ACDE) task. Specifically, the Multi-Modality Driven Low-Rank Adaptation (MMD-LoRA) is designed to bridge the domain gap between the source and target domains and to resolve the misalignment between visual and textual representations. Its core innovation lies in two components: Prompt-Driven Domain Alignment (PDDA) and Visual-Text Consistency Contrastive Learning (VTCCL).

The specific contributions are as follows:

  • The authors propose MMD-LoRA, a new ACDE method that effectively addresses domain gaps and multi-modal misalignment by combining low-rank adaptation (LoRA) technology with contrastive learning.

  • The authors also introduce Prompt-Driven Domain Alignment (PDDA), which employs trainable low-rank adaptation matrices in the image encoder, guided by text embeddings, to capture accurate target-domain visual features without requiring additional target images. Meanwhile, Visual-Text Consistency Contrastive Learning (VTCCL) achieves robust multi-modal alignment by separating the embeddings of different weather conditions while bringing similar embeddings closer together, thereby reinforcing consistent representations.

  • Extensive experiments show that MMD-LoRA excels in depth estimation under adverse environmental conditions on two popular benchmarks (including the nuScenes dataset and the Oxford RobotCar dataset).


ā–²Figure 1 | Comparison of depth estimation results of LoRA-based and enhancement-based methods

ā– 2. Related Work

ā– 2.1. Depth Estimation Under Adverse Conditions

Adverse weather can cause measurement errors in LiDAR sensors, particularly reflections from water on roads during rain and the effects of textureless regions under nighttime illumination. These factors hinder accurate pixel correspondences for depth estimation. So far, only a limited number of studies have explored depth estimation under adverse weather conditions.

Recently, progress on depth estimation under adverse weather has been made through image-enhancement-based and style-transfer-based methods. Image-enhancement-based methods focus only on problems such as inadequate lighting and reflections, and they often fail to provide a unified framework with more robust and general solutions. To address this limitation, style-transfer-based methods construct a unified framework covering various adverse weather conditions. For instance, md4all diversifies source-domain images by using generative models such as ForkGAN to convert clear-weather images into adverse-weather ones. Similarly, Tosi et al. used advanced text-to-image diffusion models to generate new user-defined scenes together with their associated depth information.

ā– 2.2. Zero-Shot Depth Estimation

Zero-shot depth estimation is a significantly challenging task that requires a depth estimator trained on source-domain images to generalize effectively to unknown target domains at inference time. For example, ZoeDepth achieves impressive generalization by pre-training on multiple datasets, combining relative and metric depth, and fine-tuning with metric depth information through a lightweight decoder. Ranftl et al. proposed a robust training objective that is invariant to depth range and scale and combines data from different sources to improve generalization. More recently, Depth Anything improved generalization by expanding the training set to approximately 62 million images. Despite these efforts to enhance zero-shot inference, there remains an urgent need for more high-quality synthetic and real-world images.

ā– 2.3. Multi-Modal Alignment Strategy

Multi-modal alignment enhances a model's scene-perception ability and captures fine-grained representations of real-world scenes. For instance, Radford et al. pioneered the use of natural language as a supervisory signal for image representations, achieving alignment between visual and text encoders. Yu et al. developed an instance-language matching network that uses visual prompt learning and cross-attention within the CLIP backbone to match instance and text embeddings. Zhou et al. introduced pseudo-labeling and self-training to achieve pixel-text alignment for semantic segmentation without annotations.

Unlike these pre-aligned CLIP-based methods, depth estimation models lack sufficient alignment between multi-modal features, hindering coherent understanding under adverse conditions. The misalignment between text encoders and image encoders inevitably undermines LoRA’s generalization ability under adverse conditions, leading to suboptimal results.

ā– 3. Method

ā– 3.1. Prompt-Driven Domain Alignment

In the pre-training step, the image encoder of the benchmark depth estimator, equipped with MMD-LoRA, captures accurate target-domain visual representations under the supervision of the PDDA alignment loss. Meanwhile, VTCCL separates the representations of different weather conditions and clusters similar representations together, further enhancing MMD-LoRA's generalization across various adverse conditions.
In the training step, the trained MMD-LoRA is injected as low-rank decomposition matrices into the ā€˜q’, ā€˜k’, ā€˜v’, and ā€˜proj’ layers of the image encoder's self-attention modules, and the depth estimator is then further optimized. Given a source-domain image, fine-grained weather conditions are perceived by randomly cropping it into $N$ patches, where $N$ is the number of crops. The method combines the source-domain text description $T_s$ with unseen target-domain text descriptions $T_t^i$, where $i$ and $M$ denote the index of the target domain and the total number of target domains, respectively. For example, $T_s$ and $T_t^i$ can be ā€œimages taken during daytimeā€ and ā€œimages taken during nighttimeā€ or ā€œimages taken during rainy days.ā€ To diversify the target-domain descriptions during the PDDA process, semantic concepts of different adverse weather conditions are combined, e.g., ā€œimages taken during rainy nights.ā€ A pre-trained CLIP text encoder and a pre-trained image encoder are used to extract the text embeddings and visual representations. LoRA, with its inherent parameter efficiency, is then used to capture the visual representations of the unseen target domains. Following the existing training paradigm for depth estimation, the depth estimator uses image captions to capture semantic information of complex road scenes.
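To make the prompt side of PDDA concrete, the following is a minimal sketch of how the source and target text descriptions could be encoded with a pre-trained CLIP text encoder and turned into the text differences used later for alignment. It assumes the OpenAI CLIP package; the prompt wording, variable names, and the ViT-B/32 backbone are illustrative assumptions, not the authors' code.

```python
import torch
import clip  # assumption: the OpenAI CLIP package (github.com/openai/CLIP)

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)  # ViT-B/32 is an illustrative choice

source_prompt = "images taken during daytime"
target_prompts = [
    "images taken during nighttime",
    "images taken during rainy days",
    "images taken during rainy nights",  # combined semantic concept for diversification
]

with torch.no_grad():
    tokens = clip.tokenize([source_prompt] + target_prompts).to(device)
    text_emb = model.encode_text(tokens)                       # (1 + M, dim)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # normalize embeddings

# Text differences between each target domain and the source domain; these
# later guide the visual differences produced by the LoRA branch (PDDA).
delta_L = text_emb[1:] - text_emb[0:1]                         # (M, dim)
```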
Pre-trained image encoders in existing depth estimators typically capture visual representations only under clear weather conditions, which limits their generalization ability in unseen target domains. To overcome this limitation, the aim is to capture more robust visual representations and adapt the benchmark depth estimator to depth estimation under adverse conditions with minimal changes. Inspired by the strong generalization ability of low-rank adaptation (LoRA) in natural language processing, trainable low-rank decomposition matrices (referred to as MMD-LoRA) are introduced into specific layers of the image encoder. This integration significantly improves the model's generalization across different adverse weather scenarios. Rather than updating the entire original image encoder, the modifications are limited to injecting trainable low-rank matrices, as defined in Equation 1:

$$W = W_0 + \Delta W = W_0 + BA \tag{1}$$

where $W_0$ denotes the frozen weight matrix of the pre-trained image encoder, and $\Delta W$ is constructed from the trainable low-rank decomposition matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r \ll \min(d, k)$, supporting the transfer from clear weather to adverse weather conditions. The trainable matrices $B$ and $A$ are designed to capture the visual representations of the various unseen target domains.
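As a concrete illustration of Equation 1, below is a minimal PyTorch sketch of a LoRA-augmented linear layer of the kind that could wrap the ā€˜q’, ā€˜k’, ā€˜v’, and ā€˜proj’ projections of the image encoder's self-attention. The class name, initialization choices, and rank default are assumptions for illustration, not the paper's implementation.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W0 plus a trainable low-rank update BA (cf. Eq. 1)."""
    def __init__(self, base: nn.Linear, r: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # W0 stays frozen
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.empty(r, k))  # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))  # B in R^{d x r}, zero-init so Delta W starts at 0
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))

    def forward(self, x):
        # W0 x + B A x
        return self.base(x) + x @ self.A.T @ self.B.T

# Hypothetical usage: wrap each self-attention projection, e.g.
#   attn.q = LoRALinear(attn.q, r=8); attn.k = LoRALinear(attn.k, r=8); ...
```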
To train these two matrices (i.e., $B$ and $A$), PDDA utilizes text embeddings to guide the estimation of semantically relevant visual representations in unseen target domains. Specifically, during the pre-training step the image encoder captures the source-domain visual representation $V_s$, while the text encoder produces the source-domain text embedding $L_s$ and the target-domain text embeddings $L_t^i$. The visual representation of an unseen target domain is then computed with the LoRA-augmented image encoder, as in Equation 2:

$$\hat{V}_t^i = E_I\big(x_s;\, W_0 + BA\big) \tag{2}$$
where the injected $\Delta W = BA$ denotes the trainable low-rank decomposition matrices (i.e., MMD-LoRA) integrated into the ā€˜q’, ā€˜k’, ā€˜v’, and ā€˜proj’ self-attention layers of the image encoder, and $E_I$ and $x_s$ denote the image encoder and the source-domain image. Since the text and visual spaces are aligned, the source-target difference in language should match the source-target difference in images; the text difference ΔL therefore helps estimate the visual difference ΔV between the source and target domains. An alignment loss supervises MMD-LoRA so that it produces accurate unseen target-domain visual representations, as shown in Equation 3:

$$\mathcal{L}_{\text{align}} = \mathcal{L}_{\cos}\!\big(\Delta V, \Delta L\big) + \mathcal{L}_{1}\!\big(\hat{V}_t^i, V_s\big) \tag{3}$$

where $\mathcal{L}_{\cos}$ denotes the cosine similarity loss and $\mathcal{L}_{1}$ denotes the L1 regularization that prevents the predicted target-domain representations from drifting too far from the source-domain visual representations. $\Delta V = \hat{V}_t^i - V_s$ is the visual difference obtained via Equation 2, and $\Delta L = L_t^i - L_s$ is the text difference obtained from the text encoder. Note that the text encoder and image encoder themselves remain fixed; only the low-rank matrices are trained.
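A minimal sketch of how such an alignment objective could look in PyTorch, under the assumption that Equation 3 is a cosine term on the difference vectors plus an L1 term between the predicted target and source features (the weighting and function names are illustrative):

```python
import torch
import torch.nn.functional as F

def pdda_alignment_loss(v_src, v_tgt_pred, delta_L, l1_weight: float = 1.0):
    """Sketch of a PDDA-style alignment loss (cf. Eq. 3); the exact weighting is an assumption.
    v_src:      source-domain visual features from the frozen encoder         (B, dim)
    v_tgt_pred: predicted target-domain features from the LoRA branch (Eq. 2) (B, dim)
    delta_L:    text difference between target and source prompts             (dim,)
    """
    delta_V = v_tgt_pred - v_src
    # push the visual difference to point in the same direction as the text difference
    cos_term = 1.0 - F.cosine_similarity(delta_V, delta_L.expand_as(delta_V), dim=-1).mean()
    # keep the predicted target features from drifting too far from the source features
    l1_term = F.l1_loss(v_tgt_pred, v_src)
    return cos_term + l1_weight * l1_term
```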

ā– 3.2. Visual-Text Consistency Contrastive Learning

Based on the process above, PDDA enables the image encoder equipped with MMD-LoRA to capture accurate target-domain visual features for depth estimation without needing access to target-domain images. However, existing depth estimation models lack sufficient alignment between multi-modal features, which inevitably interferes with MMD-LoRA's generalization ability and leads to suboptimal results. To address this, the authors introduce Visual-Text Consistency Contrastive Learning (VTCCL) to achieve robust multi-modal alignment and further enhance MMD-LoRA's generalization under adverse weather conditions.
Specifically, for each of the $N$ crops of an image, the visual representation and the text embedding of every weather condition are computed with the LoRA-augmented image encoder and the CLIP text encoder, yielding $V_s$, $\hat{V}_t^i$, $L_s$, and $L_t^i$. Following the standard contrastive-learning paradigm, Equation 4 iteratively pushes apart the representations (visual and text) of different weather conditions while clustering similar representations together, further enhancing MMD-LoRA's generalization under adverse conditions:

$$\mathcal{L}_{\text{VTCCL}} = -\sum_{i=0}^{M} \lambda_i \log \frac{\exp\!\big(\langle V^i, L^i\rangle / \tau\big)}{\sum_{j=0}^{M} \exp\!\big(\langle V^i, L^j\rangle / \tau\big)} \tag{4}$$

with $V^0 = V_s$, $L^0 = L_s$ and $V^i = \hat{V}_t^i$, $L^i = L_t^i$ for $i \geq 1$.
where $i$, $N$, and $M$ denote the index of the adverse weather condition, the number of image crops, and the total number of adverse weather conditions, respectively; $\lambda_i$ is the contrastive-learning weight coefficient that controls the contribution of each weather condition in Equation 4, and $\tau$ is the temperature parameter. For brevity, the summation over the visual and text representations of all image crops is omitted.
For example, the general text description for the source domain is ā€œimages taken during daytime,ā€ while a set of text descriptions such as ā€œimages taken during nighttimeā€ and ā€œimages taken during rainy daysā€ represents the adverse target domains; in this case $M$ is set to 2 for simplicity. These text descriptions are processed by the pre-trained CLIP text encoder to obtain the source- and target-domain text embeddings $L_s$ and $L_t^i$.
Equation 4 applies two contrastive learning terms: one focusing on clear weather conditions and the other handling adverse weather (e.g., nighttime, rainy days). For the contrastive learning term for clear weather conditions, the source domain visual representation is selected as the anchor. The source domain text embedding is treated as a positive sample, while the target domain text embedding is treated as a negative sample. Similarity is computed using the inner product in the projection space to measure the distance between visual representations and text embeddings. In the contrastive learning for adverse weather, a similar approach is adopted. For example, the nighttime domain visual representation is selected as the anchor, with the corresponding nighttime domain text embedding designated as the positive sample, while the source domain text embedding is designated as the negative sample. This process is iteratively repeated for all target domain conditions (including nighttime and rainy days).
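The following is a minimal InfoNCE-style sketch of such a weighted visual-text contrastive objective, written under the assumption that each condition's visual feature is the anchor, its own text embedding is the positive, and the other conditions' text embeddings are the negatives; the function name and exact weighting scheme are illustrative, not the paper's code.

```python
import torch
import torch.nn.functional as F

def vtccl_loss(visual, text, lambdas, tau: float = 0.07):
    """Weighted InfoNCE-style sketch of a VTCCL objective (cf. Eq. 4).
    visual:  (M+1, dim) visual features; index 0 = source (clear), 1..M = adverse conditions
    text:    (M+1, dim) CLIP text embeddings in the same order
    lambdas: (M+1,) tensor of per-condition weight coefficients (lambda_0 ... lambda_M)
    """
    v = F.normalize(visual, dim=-1)
    t = F.normalize(text, dim=-1)
    logits = v @ t.T / tau                                   # anchor-to-text similarities
    targets = torch.arange(v.size(0), device=v.device)       # positive = same-condition text
    per_anchor = F.cross_entropy(logits, targets, reduction="none")
    return (lambdas * per_anchor).sum()

# Hypothetical usage with the nuScenes weight ratio reported in Section 4.2:
#   lambdas = torch.tensor([1.0, 0.1, 1.0])   # lambda_0 : lambda_1 : lambda_2
```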
Finally, the overall loss for the pre-training phase combines the alignment objective of PDDA and the contrastive objective of VTCCL, as in Equation 5:

$$\mathcal{L}_{\text{pre}} = \mathcal{L}_{\text{align}} + \mathcal{L}_{\text{VTCCL}} \tag{5}$$

ā– 3.3. Training Architecture

After the pre-training step, the trained MMD-LoRA is injected into the ā€˜q’, ā€˜k’, ā€˜v’, and ā€˜proj’ layers of the self-attention modules in the depth estimator's image encoder, and the benchmark depth estimator (e.g., EVP) is further optimized using real depth maps. Specifically, as illustrated in Figure 2, the benchmark depth estimator uses image captions to capture semantic information of complex road scenes, while the image encoder containing the trained MMD-LoRA captures both source-domain and unseen target-domain visual representations. The various visual representations and the text embeddings of the image captions provide clear guidance for the depth estimation task through cross-attention maps. The decoder features of the denoising U-Net are then fed into the Inverse Multi-Attentive Feature Refinement (IMAFR) module to obtain enhanced visual representations, and the depth decoder outputs the predicted depth maps.
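A minimal sketch of this injection step, reusing the LoRALinear module from the earlier sketch: the pre-trained LoRA matrices are loaded into the encoder's attention projections and frozen, so that only the rest of the depth estimator is optimized according to EVP's own training recipe. The function name and the assumption that attention blocks expose q/k/v/proj as nn.Linear attributes are illustrative, not EVP's actual API.

```python
import torch.nn as nn

def inject_frozen_mmd_lora(image_encoder: nn.Module, lora_state: dict, r: int = 8):
    """Wrap the q/k/v/proj projections with LoRALinear (defined above), load the
    pre-trained LoRA weights, and freeze them ("frozen MMD-LoRA" in the training step)."""
    for module in list(image_encoder.modules()):          # snapshot to avoid mutating while iterating
        for name in ("q", "k", "v", "proj"):
            layer = getattr(module, name, None)
            if isinstance(layer, nn.Linear):
                setattr(module, name, LoRALinear(layer, r=r))
    image_encoder.load_state_dict(lora_state, strict=False)   # restore trained A and B
    for n, p in image_encoder.named_parameters():
        if n.endswith(".A") or n.endswith(".B"):
            p.requires_grad = False                            # keep MMD-LoRA frozen
```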


ā–²Figure 2 | Overview of the MMD-LoRA framework

ā– 4. Experiments

ā– 4.1. Datasets and Evaluation Metrics

The experiments use two datasets: nuScenes and Oxford RobotCar. nuScenes is a challenging large-scale dataset containing 1,000 scenes with diverse weather conditions and LiDAR data. The authors adopt the split recommended by md4all, with 15,129 training images and 6,019 validation images. Notably, the training images contain only clear daytime scenes with no adverse weather, while the validation images are divided into clear, nighttime, and rainy daytime conditions. The Oxford RobotCar dataset contains a mix of daytime and nighttime scenes. The authors likewise adopt the md4all split, with 16,563 training images (clear weather only, no nighttime) and 1,411 validation images (both clear weather and nighttime).

ā– 4.2. Experimental Details

MMD-LoRA is trained with PDDA and VTCCL during the pre-training step, and the benchmark depth estimator (i.e., EVP) is then trained with the frozen MMD-LoRA during the training step. In the pre-training step, each image is randomly cropped into 15 patches of 400 Ɨ 400 pixels, and MMD-LoRA is trained for 4,000 iterations with the AdamW optimizer, a base learning rate of 0.001, and a batch size of 4. MMD-LoRA is applied to the weight types Wq, Wk, Wv, and Wproj with rank r = 8 to adapt to the unseen target domains (i.e., adverse weather). In VTCCL, the contrastive-learning weight coefficients are set to λ0 : λ1 : λ2 = 1 : 0.1 : 1 for nuScenes and λ0 : λ1 = 1 : 0.05 for Oxford RobotCar. During the training step, the EVP depth estimator is chosen as the benchmark, and all other training settings follow EVP.
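For reference, these reported hyperparameters can be collected into a plain configuration dictionary; the key names below are illustrative, not taken from the paper's code.

```python
# Pre-training hyperparameters of MMD-LoRA as reported above (key names are assumptions).
pretrain_cfg = {
    "crop_size": (400, 400),
    "crops_per_image": 15,
    "optimizer": "AdamW",
    "base_lr": 1e-3,
    "iterations": 4000,
    "batch_size": 4,
    "lora_target_weights": ["Wq", "Wk", "Wv", "Wproj"],
    "lora_rank": 8,
    "vtccl_lambdas": {
        "nuScenes": [1.0, 0.1, 1.0],     # lambda_0 : lambda_1 : lambda_2
        "OxfordRobotCar": [1.0, 0.05],   # lambda_0 : lambda_1
    },
}
```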

ā– 4.3. Experimental Comparisons

The comparison results on the nuScenes dataset are shown in Table 1. MMD-LoRA outperforms all previous methods and achieves the best results on most metrics. Compared with the recent state-of-the-art md4all-DD, MMD-LoRA shows clear advantages: it improves d1 under clear weather by 8.43 points (from 88.03% to 96.46%), by 4.63 points at nighttime (from 75.33% to 79.96%), and by 12.55 points under rainy conditions (from 82.82% to 95.37%). These comparisons demonstrate the effectiveness of MMD-LoRA. Unlike md4all and Tosi et al., which rely on generative models to convert clear-weather images into adverse-weather ones and then capture target-domain visual representations, MMD-LoRA captures target-domain visual features more directly and robustly. As shown in Figure 3, MMD-LoRA identifies the key elements of scenes under both nighttime and clear weather, with a higher degree of distinction between adjacent objects or parts of the same object; for example, it correctly captures the entire obstacle, clearly distinguishes the two columns, and recovers the ā€œholeā€ in the car.


ā–²Table 1 | Comparison of experimental results on nuScenes


ā–²Figure 3 | Qualitative comparison of results (nuScenes)

To demonstrate the robustness of MMD-LoRA, the performance of MMD-LoRA was evaluated on the RobotCar dataset. The results are shown in Table 2, where MMD-LoRA also surpasses all previous methods. Compared to previous state-of-the-art methods, the MMD-LoRA method successfully improved d1 during the day from 87.17% to 92.56%, and during the night from 83.68% to 89.33%. As shown in Figure 4, MMD-LoRA clearly estimates the depth of standing pillars and heads in nighttime and clear weather images in the first and second rows. The comparisons on the RobotCar dataset further validate the effectiveness of MMD-LoRA in depth estimation under adverse conditions. Tables 1 and 2 also show that MMD-LoRA can obtain visual representations in real scenarios during rainy days and nighttime, which are robust to any unseen target domains.


ā–²Table 2 | Comparison of experimental results on RobotCar


ā–²Figure 4 | Qualitative comparison of results (RobotCar)

ā– 4.4. Ablation Experiments

As shown in Table 3, although EVP already performs well under adverse conditions, MMD-LoRA with PDDA still surpasses EVP by 0.39% and 5.97% in d1 during daytime and nighttime, respectively. The improvement comes from the fact that MMD-LoRA with PDDA captures the visual features of the target domain without any additional target-domain images. As shown in the second and third columns of Figure 5, EVP tends to produce depth artifacts when estimating depth at object boundaries; compared with EVP, MMD-LoRA with PDDA yields clear boundaries and eliminates background noise. Additionally, when VTCCL is added, MMD-LoRA reaches 96.46% in d1, with error metrics that are comparable to or significantly better than using PDDA alone. This is because VTCCL achieves robust multi-modal alignment and thus enhances consistent representations.


ā–²Table 3 | Ablation results (nuScenes)

As shown in the third and fourth columns of Figure 5, VTCCL further yields clearer contours and pixel-level accurate depth estimation, with clear object boundaries (e.g., tree boundaries, complete trucks), and avoids the depth artifacts indicated by the red boxes.


ā–²Figure 5 | Ablation results of MMD-LoRA with PDDA and VTCCL

Table 4 compares learning-based enhancement methods with the learnable-parameter approach (i.e., MMD-LoRA) for depth estimation under adverse conditions. The enhancement-based and LoRA-based approaches improve nighttime performance by 1.60% and 5.47% and rainy-daytime performance by 0.22% and 0.85%, respectively, both clearly exceeding the baseline. Compared with the learnable enhancement methods, the LoRA-based approach achieves superior depth estimation with almost no increase in parameter count (only 0.035M additional parameters). The authors also ran ablations to find a suitable rank r for the chosen self-attention layers of the image encoder. Table 5 compares depth estimation performance under different ranks r. With r = 8, MMD-LoRA achieves excellent d1 and comparable error metrics. Notably, depth estimation performance does not always improve as the rank r increases: setting r > 8 may cause the predicted unseen target-domain representations to overfit the corresponding text descriptions and deviate from real-world features.


ā–²Table 4 | Depth estimation results: enhancement-based vs. LoRA-based methods


ā–²Table 5 | Depth estimation results under different rank values r

ā– 5. Conclusion

In conclusion, existing adverse condition depth estimation methods often require additional target images and depth estimators, yet still fail to achieve sufficient alignment between multi-modal features. The authors propose MMD-LoRA, which addresses these limitations through parameter-efficient fine-tuning and a contrastive learning paradigm. Specifically, Prompt-Driven Domain Alignment (PDDA) effectively captures visual features of unseen target domains using trainable low-rank adaptation matrices guided by text embeddings, and Visual-Text Consistency Contrastive Learning (VTCCL) separates the embeddings of different weather conditions while pulling similar embeddings closer together, ensuring robust multi-modal alignment and enhancing MMD-LoRA's generalization in diverse adverse scenarios. Comprehensive quantitative and qualitative experiments validate the effectiveness of MMD-LoRA. Future work will explore extending the method to video-based applications.

However, MMD-LoRA relies on predefined text descriptions for the source and target domains, which are assumed to be known in advance in most applications. Additionally, the experiments assume that conditions within each weather category are fairly consistent (e.g., the amount of rainfall on rainy days or the level of visibility at night), an assumption that may not hold when weather conditions vary even slightly.

Ref:

Multi-Modality Driven LoRA for Adverse Condition Depth Estimation. arXiv:2412.20162.

Translated by | Babata
Reviewed by | apr

