In this paper, we address the Fine-grAined human-Centric Tracklet Segmentation (FACTS) problem, in which 12 human parts, e.g., face, pants, and left leg, are segmented. To reduce the heavy and tedious labeling effort, FACTS requires only one labeled frame per video during training. The small size of human parts and the scarcity of labels make FACTS very challenging. Because adjacent video frames are continuous and people usually do not change clothes within a short time, we explicitly exploit pixel-level and frame-level context in the proposed Temporal context Segmentation Network (TSN). On the one hand, optical flow is computed online to propagate pixel-level segmentation results to neighboring frames. On the other hand, frame-level classification likelihood vectors are also propagated to nearby frames. By fully exploiting the pixel-level and frame-level context, TSN indirectly uses the large number of unlabeled frames during training and produces smooth segmentation results during inference. Experimental results on four video datasets show the superiority of TSN over state-of-the-art methods. The newly annotated datasets can be downloaded from http://liusi-group.com/projects/FACTS for further study.


You can download the Indoor, Outdoor, iLIDS-Parsing, and Daily datasets from Baidu Drive (pwd: 63qi) or Google Drive.


The architecture of the proposed network. The input is the frame triplet {I_{t−l}, I_{t−s}, I_t} and the output is the segmentation result of I_t. The triplet is sequentially fed into a feature extraction module and then into three parallel modules. The optical flow and frame parsing modules mine the pixel-level context and refine the parsing results pixel by pixel. The frame classification module generates a reliable likelihood vector by regularizing consecutive frames to share similar global labels. The pixel-level confidence map and the frame-level likelihood vector are fused to produce the final output.
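
The two kinds of context in the caption above can be illustrated with a minimal NumPy sketch. It is not the TSN implementation: the function names `warp_by_flow` and `fuse`, the nearest-neighbor backward warping, and the multiplicative fusion rule are all assumptions chosen for illustration, and the page does not specify how the network actually fuses the two outputs.

```python
import numpy as np

def warp_by_flow(conf, flow):
    """Backward-warp a (K, H, W) confidence map with a (2, H, W) flow field.

    Assumption: flow[0] / flow[1] hold the horizontal / vertical displacement
    (in pixels) from each pixel of the target frame to its source location in
    the neighboring frame. Nearest-neighbor sampling keeps the sketch short.
    """
    _, H, W = conf.shape
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_x = np.clip(np.rint(xs + flow[0]).astype(int), 0, W - 1)
    src_y = np.clip(np.rint(ys + flow[1]).astype(int), 0, H - 1)
    return conf[:, src_y, src_x]

def fuse(pixel_conf, frame_likelihood, eps=1e-8):
    """Reweight per-pixel class scores by a frame-level likelihood vector.

    Assumption: a simple multiplicative fusion followed by per-pixel
    renormalization over the K classes (hypothetical, for illustration only).
    """
    weighted = pixel_conf * frame_likelihood[:, None, None]
    total = np.maximum(weighted.sum(axis=0, keepdims=True), eps)
    return weighted / total

# Toy usage: K = 2 classes on a 1 x 3 frame.
neighbor_conf = np.array([[[0.1, 0.5, 0.9]],
                          [[0.9, 0.5, 0.1]]])
flow = np.zeros((2, 1, 3))
flow[0] = 1.0  # every target pixel looks one pixel to the right
warped = warp_by_flow(neighbor_conf, flow)

likelihood = np.array([0.9, 0.1])  # frame-level evidence favors class 0
fused = fuse(warped, likelihood)
labels = fused.argmax(axis=0)      # per-pixel part labels
```

In this sketch, a class that the frame-level vector considers unlikely is suppressed at every pixel, which is one plausible way frame-level context could smooth pixel-level predictions.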


“l” and “s” denote using long-range and short-range nearby frames, respectively, for pixel-level context propagation, while “c” means that the confidence is used in the fusion of the warped optical flows. The techniques corresponding to “l”, “s”, and “c” are discussed in our CVPR paper [1]. “f” refers to the method that uses frame-level context, and “uf” means using the unsupervised fine-tuned optical flow.


Note: the neck area is labeled as background in the ground truth, and only one target human is segmented.


  • [1] Si Liu, Changhu Wang, Ruihe Qian, Han Yu, Renda Bao, Yao Sun. Surveillance Video Parsing with Single Frame Supervision. CVPR 2017.
  • [2] Han Yu, Guanghui Ren, Ruihe Qian, Yao Sun, Changhu Wang, Hanqing Lu, Si Liu. RSVP: A Real-Time Surveillance Video Parsing System with Single Frame Supervision. ACM MM 2017.