@ZhangMenghe
Last active April 3, 2019 00:30
[NfP] Unsupervised Learning of Depth, Normal and Ego-motion from Video

Tags: Unsupervised Learning, Monocular Video, KITTI

  • An unsupervised learning framework for the task of monocular depth and camera-motion estimation from unstructured video sequences.

Background

Warping-based view synthesis

  • Create novel views of a specific subject from images taken from different viewpoints.
  • One route is to explicitly reconstruct an accurate 3D model.
  • Alternatively, methods are forced to learn intermediate predictions of geometry and/or correspondences.
  • View synthesis itself is mostly a graphics problem and can work in an end-to-end learning-based framework; however, the geometric correspondences are then lost.
  • Related papers:
    • Image-based rendering using image-based priors (Fitzgibbon et al., 2005)
    • DeepStereo (learning-based)
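The warping at the heart of these methods can be sketched as a per-pixel reprojection: back-project a target pixel using its (predicted) depth, transform the 3D point by the (predicted) relative camera pose, and project it into the source view. A minimal NumPy sketch, with function and variable names of my own choosing (not from any of the papers):

```python
import numpy as np

def reproject(pt, depth, K, K_inv, T):
    """Map a target pixel to its source-view location: p_s ~ K T D(p_t) K^{-1} p_t.

    pt       : (u, v) pixel in the target view
    depth    : predicted depth at that pixel
    K, K_inv : 3x3 camera intrinsics and its inverse
    T        : 4x4 relative pose (target frame -> source frame)
    """
    u, v = pt
    # Back-project the pixel to a 3-D point in the target camera frame.
    p_cam = depth * (K_inv @ np.array([u, v, 1.0]))
    # Transform the point into the source camera frame.
    p_src = (T @ np.append(p_cam, 1.0))[:3]
    # Project with the intrinsics; divide by the last coordinate to get pixels.
    uvw = K @ p_src
    return uvw[:2] / uvw[2]

# Toy example: 10 cm sideways camera translation, pixel at the principal point.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
T[0, 3] = 0.1
p = reproject((320, 240), 2.0, K, np.linalg.inv(K), T)  # -> (345.0, 240.0)
```

Sampling the source image at `p` (bilinearly, so the operation stays differentiable) and comparing with the target pixel gives the photometric supervision signal; pixels whose `p` falls outside the source image bounds are exactly those the fly-out mask mentioned below excludes.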

Unsupervised learning from videos

  • Pretext tasks learn visual features from video data that can later be re-purposed for other vision tasks.
  • Example: exploiting geometric constraints in an auto-encoder framework.

Summary of Related Papers

  • Zhou et al., CVPR 2017
  • This is the original idea of bringing view synthesis into an unsupervised learning framework. To remain unsupervised, they "warp" the source image back to the target view (after predicting the transformation between frames) and compare it with the original target frame.
  • Other loss terms include:
    • multi-scale and smoothness losses on the depth map: counteract low-texture regions and poorly constrained far-away estimates
    • an explainability mask
  • Add-on: pre-compute instance segmentation masks for moving objects (vehicles). The static background is fit only by the ego-motion model, while each object is first fit by the ego-motion model E and then by an object-motion model (same structure, applied to each object individually).
  • Add-on: normal constraints for depth. Pixels on the same surface are assumed to share the same normal. One constraint enforces the orthogonality relation between depth and normals. They propose inferring the surface normal directly from the depth map by taking cross products of the vectors formed by each pixel's 8 neighbors, then recomputing the depth map from the normal map (a standard method), since no ground-truth normal maps are available.
  • Edge-awareness: image gradients + an edge network.
  • Fly-out mask: after warping, pixels that fall outside the image boundaries should be excluded from training.
  • CVPR 2007
  • A really important algorithm that outputs a dense set of rectangular patches covering the surfaces visible in the input images.
  • It was proposed for reconstructing a single model with calibrated multi-view stereopsis.
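The normal-from-depth idea noted above (cross products of vectors to neighboring points) can be sketched as follows. This minimal version back-projects the depth map to a point cloud and uses a single right/down neighbor pair per pixel, whereas the paper averages over the 8-neighbor pairs; names are mine:

```python
import numpy as np

def normals_from_depth(depth, K_inv):
    """Per-pixel surface normals inferred from a depth map.

    Back-project each pixel to a 3-D camera-frame point, then take the cross
    product of the vectors to the right and lower neighbours. Returns an
    (h-1, w-1, 3) array of unit normals (one per interior pixel pair).
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).astype(float)
    pts = depth[..., None] * (pix @ K_inv.T)   # (h, w, 3) camera-frame points
    dx = pts[:-1, 1:] - pts[:-1, :-1]          # vector to the right neighbour
    dy = pts[1:, :-1] - pts[:-1, :-1]          # vector to the lower neighbour
    n = np.cross(dx, dy)
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# Sanity check: a fronto-parallel plane (constant depth, identity intrinsics)
# should yield the same normal, (0, 0, 1), everywhere.
n = normals_from_depth(np.full((4, 4), 2.0), np.eye(3))
```

The sign convention of the normal (toward or away from the camera) depends on the order of the cross product and varies between implementations.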
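The edge-aware smoothness idea mentioned above (depth smoothness down-weighted by image gradients) is commonly written as |∂d| · exp(−|∂I|), summed over the x and y directions. A hedged NumPy sketch of that common form, not necessarily the exact loss used in any of the papers discussed:

```python
import numpy as np

def edge_aware_smoothness(depth, image):
    """First-order depth smoothness, down-weighted at image edges.

    Depth gradients are penalised, but the penalty decays exponentially
    where the (grayscale) image has strong gradients, so depth is allowed
    to change sharply at likely object boundaries.
    """
    dx_d = np.abs(depth[:, 1:] - depth[:, :-1])   # horizontal depth gradient
    dy_d = np.abs(depth[1:, :] - depth[:-1, :])   # vertical depth gradient
    dx_i = np.abs(image[:, 1:] - image[:, :-1])   # horizontal image gradient
    dy_i = np.abs(image[1:, :] - image[:-1, :])   # vertical image gradient
    return np.mean(dx_d * np.exp(-dx_i)) + np.mean(dy_d * np.exp(-dy_i))

# A perfectly flat depth map incurs zero smoothness penalty.
loss = edge_aware_smoothness(np.full((4, 4), 1.0), np.arange(16.0).reshape(4, 4))
```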