Over the past decade, several approaches have been proposed to learn disentangled representations for video prediction. However, reported experiments are mostly based on standard benchmark datasets such as Moving MNIST and Bouncing Balls. In this work, we address the problem of learning disentangled representations for video prediction in an industrial environment. To this end, we use a decompositional disentangled variational auto-encoder, a deep generative model that aims to decompose and recognize overlapping boxes on a pallet. Specifically, this approach disentangles each frame into a temporally invariant component (box appearance) and a temporally varying component (box location). We evaluate this approach on a new dataset containing 40,000 video sequences. The experimental results demonstrate that the model learns both to decompose the scene into individual bounding boxes and to reconstruct them, without explicit supervision.
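To make the disentanglement concrete, the sketch below shows one common way such a split can be realized: a VAE-style encoder with two heads, one producing a content code that is averaged over time (appearance) and one producing a per-frame pose code (location). This is a minimal illustrative sketch, not the paper's implementation; all module names, latent sizes, and the input shape are assumptions.

```python
import torch
import torch.nn as nn


class DisentangledFrameEncoder(nn.Module):
    """Hypothetical encoder: splits each frame's latent code into a
    time-invariant content part and a time-varying pose part."""

    def __init__(self, frame_dim=64 * 64, content_dim=32, pose_dim=8):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(frame_dim, 256), nn.ReLU())
        # Two heads, each emitting (mean, log-variance) for its latent.
        self.content_head = nn.Linear(256, 2 * content_dim)
        self.pose_head = nn.Linear(256, 2 * pose_dim)

    @staticmethod
    def reparameterize(stats):
        # Standard VAE reparameterization trick.
        mean, log_var = stats.chunk(2, dim=-1)
        return mean + torch.randn_like(mean) * torch.exp(0.5 * log_var)

    def forward(self, frames):
        # frames: (batch, time, height*width) flattened grayscale video.
        b, t = frames.shape[:2]
        h = self.backbone(frames.flatten(0, 1))  # encode every frame
        content = self.reparameterize(self.content_head(h))
        pose = self.reparameterize(self.pose_head(h))
        # Average the content code over time so appearance is shared across
        # the sequence; keep a distinct pose code per frame so location varies.
        content = content.view(b, t, -1).mean(dim=1)
        pose = pose.view(b, t, -1)
        return content, pose


# Usage: 4 sequences of 10 frames of 64x64 pixels (illustrative shapes).
encoder = DisentangledFrameEncoder()
video = torch.rand(4, 10, 64 * 64)
content, pose = encoder(video)
print(content.shape, pose.shape)  # torch.Size([4, 32]) torch.Size([4, 10, 8])
```

In this design, the averaging step is what enforces temporal invariance of the appearance code, while the pose head remains free to track per-frame box locations; a decoder would then combine the two codes to reconstruct each frame.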