A video captured with a hand-held device (e.g., a cell-phone or a portable camcorder) often appears remarkably shaky and undi-rected. Digital videostabilization improves the video quality by re-moving unwanted camera motion. We implement a video stabilization method [Liu et al. 2013] which models cam-era motion with a bundle of (multiple) camera paths. The model is based on a mesh-based, spatially-variant motion representation and an adaptive, space-time path optimization. Also, the motion representation allows users to fundamentally handle parallax and rolling shutter effects while it does not require long feature trajectories or sparse 3D reconstruction. Furthermore, we speed up this method by parallelizing some parts with CUDA support.
A Single Global Path
Bundled Paths
$$E_d(\hat{V}) = \sum_p \left\lVert\hat{V}_p w_p - \hat{p}\right\rVert^2$$
The feature p can be represented by a 2D bilinear interpolation of the four vertices \(V_p\) of the enclosing grid cell \(p = V_p w_p\) , where \(w_p\) are interpolation weight. We expect that the corresponding feature \(\hat{p}\) can be represented by the same weights of the warped grid vertices \(\hat{V}_p\) . We decide to get p by using optical flow in opencv.
(opencv) cuda feature detect + brute force matching | (opencv) feature tracking |
61ms/frame | 17ms/frame |
$$E_s(\hat{V}) = \sum_\hat{v} \left\lVert\hat{v} - \hat{v}_1 -sR_{90}(\hat{v}_0 - \hat{v}_1)\right\rVert, R_{90} = \begin{bmatrix}0 & 1\\-1 & 0\end{bmatrix},$$
where \(s = \left\lVert v - v_1\right\rVert / \left\lVert v_0 - v_1\right\rVert\) is a known scalar computed form the initial mesh.
$$E(\hat{V}) = E_d(\hat{V}) + \alpha E_s(\hat{V}), $$
where \(\alpha\) is an important weight to control the amount of regularization and we set \(\alpha = 3\). The bigger \(\alpha\) we choose, the more rectangular the grid is. The above method is called ASAP(as similar as possible). Since the final energy is quadratic, the warped mesh \(\hat{V}\) can be solved by a sparse linear system solver. To solve the system, we have tried solve function in opencv, and the speed is very slow. Another method we have tried is cusolver, and the precision is bad. Finally, we choose Newton's method:$$V_1 = V_0 - \gamma\nabla E(V_0)$$ to iteratively upadte \(V\) until final \(\hat{V}\) .
(solve) LU decomposition | (solve) QR decomposition | Newton's method |
320ms | 50ms | 0.156ms |
$$\hat{V}_i = F_i(t) V_i,$$
where \(V_i\) and \(\hat{V}_i\) are the four vertices before and after the warping. Thus, local homography \(F_i(t)\) in the grid cell i of frame t can be estimated. To facilitate the warping estimation, we use global homography \(\bar{F}(t)\) , which is computed by findHomography in opencv, to bring matching features closer.
If we use \(C_i(t) = F_i(t)F_i(t-1) \cdot\cdot\cdot F_i(0)\)(used in the reference paper) to repesent camera path, each grid cell will be discontinuess. Treating \(V_c\) as feature points, we can require: $$V_c(t) = G_c(t) w_c,$$
where \(G_c\) is four vertices of new grid that covers \(V_c\) and \(w_c\) is the corresponding weight. By applying same \(w_c\) to previous frame, we can get \(V_c(t)\) position according to frame(0): $$V_c(t-1) = G_c(t-1) w_c$$
While smooting the camera path, we need to consider multiple factors: removing jitters, avoiding excessive cropping, and minimizing geometrical distortions. Given an original path \(C = \{C(t)\}\) , we seek an optimized path \(P = \{P(t)\}\) by minimizing the following function: $$O(\{P(t)\}) = \sum_t(\left\lVert P(t) - C(t)\right\rVert^2 + \lambda_t \sum_{r\in\Omega_t} w_{t,r} (C) \cdot \left\lVert P(t) - P(r) \right\rVert^2), $$
where \(\Omega_t\) are the neighborhood at frame t. The othor terms are:
Our solution is updated by a Jacobi-based iteration: $$P^{(\xi+1)} (t) = \frac{1}{\gamma} (C(t) + \mathop{\sum_{\gamma\in\Omega_t}}_{\gamma\neq t} 2\lambda_t w_{t,r} P^{(\xi)} (r)), $$
where \(\gamma = 1 + 2\lambda_t \sum_{\gamma\in\Omega_t, \gamma\neq t} w_{t,r}\)
$$w_{t,r} = G_t(\left\lVert r - t \right\rVert) \cdot G_m(\left\lVert C(r) - C(t) \right\rVert)$$
We use \(G_t()\) to suppress high frquency jitters, while using \(G_m()\) to preserves the camera motion in low-frequency. In our implementation, we set \(\Omega_t\) to 60 neighboring frames.
\(\lambda_t\) is used to balance data term and smooth term. The bigger it is, the higher corpping ration is. \(\lambda_t\) is initialized to 5. For any frame that does not satisfy the user requirements (cropping ratio or distortion is smaller than a pre-defined threshold), we decrease its parameter \(\lambda_t\) by a step \(1/10\lambda_t\) and re-run the optimization. We not only decrease \(\lambda_t\) at specific frame, but also take average in the neighbor of specific frame to smooth in temporality.
Our motion model generates a bundle of camera paths. If these paths are optimized independently, neighboring paths could be less consistent, which may generate distortion in the final rendered video. Hence, we do a space-time optimization of all paths by min-imizing the following objective function: $$\sum_i O(\{P_i(t)\}) + \sum_t \sum_{j \in N(i)} \left\lVert P_i(t) - P_j(t) \right\rVert^2, $$
where \(N(i)\) includes eight neighbors of the grid cell i. The Jacobi-based iteration will become: $$P^{(\xi+1)}_i (t) = \frac{1}{\gamma'} (C_i(t) + \mathop{\sum_{\gamma\in\Omega_t}}_{\gamma\neq t} 2\lambda_t w_{t,r} P^{(\xi)}_i (r) + \mathop{\sum_{j\in N(i)}}_{j\neq i} 2P^{(\xi)}_j (t))$$
where \( \gamma' = 2\lambda_t \sum_{r \in \Omega_t , r \neq t} w_{t,r} + 2N(i) - 1\) . We iterate 20 times to optimize camera paths.
First, we warp the original frame to C. Then, we calculate a smooth path P. Finally, we warp P to the final result. Following are two methods:
Using homography to warp each grid cell in original frame, then blend them into an image C. To warp image P to result frame, we apply the same method again. The bad thing is that there exists some seam in the result frame.
(opencv) blending | CUDA |
1320ms/frame | 16ms/frame |
original video
bundled path
our result
Under 640 x 360 resolution can achieve 25.54 fps
Under 1280 x 720 resolution is 7 times faster than paper
paper | ours | |
extracting features | 300ms | 33ms |
estimating motion | 50ms | 0.1ms |
rendering the final result | 30ms | 21ms |
total time | 400ms | 57ms |
fps | 2.5fps | 17.5fps |
陳學儀
陳卓晗
楊騏瑄