Video frame interpolation typically involves two steps: motion estimation and pixel synthesis. Such a two-step approach heavily depends on the quality of motion estimation. This paper presents a robust video frame interpolation method that combines these two steps into a single process. Specifically, our method considers pixel synthesis for the interpolated frame as local convolution over two input frames. The convolution kernel captures both the local motion between the input frames and the coefficients for pixel synthesis. Our method employs a deep fully convolutional neural network to estimate a spatially-adaptive convolution kernel for each pixel. This deep neural network can be directly trained end to end using widely available video data without any difficult-to-obtain ground-truth data like optical flow. Our experiments show that the formulation of video interpolation as a single convolution process allows our method to gracefully handle challenges like occlusion, blur, and abrupt brightness change and enables high-quality video frame interpolation.
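The idea of synthesizing each output pixel by a local convolution over the two input frames can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function name `synthesize_frame`, the grayscale frames, the edge padding, and the kernel layout (one K×K kernel per input frame for every output pixel) are all assumptions made for illustration. In the method described above, the per-pixel kernels would be estimated by the neural network; here they are supplied by hand.

```python
import numpy as np


def synthesize_frame(frame1, frame2, kernels):
    """Illustrative per-pixel adaptive convolution (hypothetical helper).

    frame1, frame2: (H, W) grayscale input frames.
    kernels: (H, W, 2, K, K) array, K odd. kernels[y, x, 0] weights the
             patch of frame1 centered at (y, x); kernels[y, x, 1] weights
             the co-located patch of frame2. This layout is an assumption.
    """
    h, w = frame1.shape
    k = kernels.shape[-1]
    r = k // 2
    # Replicate border pixels so every output pixel has a full K x K patch.
    p1 = np.pad(frame1, r, mode="edge")
    p2 = np.pad(frame2, r, mode="edge")
    out = np.empty((h, w), dtype=np.float64)
    for y in range(h):
        for x in range(w):
            patch1 = p1[y:y + k, x:x + k]
            patch2 = p2[y:y + k, x:x + k]
            # One weighted sum covers both motion and blending: the position
            # of the non-zero weights selects where to sample, and their
            # values set how the two frames are mixed.
            out[y, x] = (np.sum(kernels[y, x, 0] * patch1)
                         + np.sum(kernels[y, x, 1] * patch2))
    return out


# Usage: kernels with weight 0.5 at each center simply average the frames.
frame1 = np.arange(20, dtype=np.float64).reshape(4, 5)
frame2 = frame1 + 10.0
avg_kernels = np.zeros((4, 5, 2, 3, 3))
avg_kernels[:, :, 0, 1, 1] = 0.5
avg_kernels[:, :, 1, 1, 1] = 0.5
averaged = synthesize_frame(frame1, frame2, avg_kernels)
```

Placing the weight off-center instead makes the same operation sample a displaced pixel, which is how a single kernel can encode local motion without an explicit optical-flow step.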


This is the author manuscript of a paper submitted to the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
