[Achanta et al., 2012] Achanta, R., Shaji, A., Smith, K., Lucchi, A., Fua, P., and Süsstrunk, S. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282.
[Arbelaez et al., 2010] Arbelaez, P., Maire, M., Fowlkes, C., and Malik, J. (2010). Contour detection and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(5):898–916.
[Ba et al., 2016] Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
[Badrinarayanan et al., 2017] Badrinarayanan, V., Kendall, A., and Cipolla, R. (2017). SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495.
[Boykov et al., 2001] Boykov, Y., Veksler, O., and Zabih, R. (2001). Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(11):1222–1239.
[Boykov and Jolly, 2001] Boykov, Y. Y. and Jolly, M.-P. (2001). Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In International Conference on Computer Vision, volume 1, pages 105–112. IEEE.
[Bradski, 2000] Bradski, G. (2000). The OpenCV Library. Dr. Dobb’s Journal of Software Tools.
[Bridle et al., 1992] Bridle, J. S., Heading, A. J., and MacKay, D. J. (1992). Unsupervised classifiers, mutual information and 'phantom targets'. In Advances in Neural Information Processing Systems, pages 1096–1101.
[Bruna et al., 2013] Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. (2013). Spectral networks and locally connected networks on graphs. Technical report.
[Chen et al., 2014] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2014). Semantic image segmentation with deep convolutional nets and fully connected CRFs. Technical report.
[Chen et al., 2017a] Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. (2017a). DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848.
[Chen et al., 2017b] Chen, L.-C., Papandreou, G., Schroff, F., and Adam, H. (2017b). Rethinking atrous convolution for semantic image segmentation. Technical report.
[Chen et al., 2018] Chen, L.-C., Zhu, Y., Papandreou, G., Schroff, F., and Adam, H. (2018). Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision, pages 801–818.
[Ciresan et al., 2012] Ciresan, D., Giusti, A., Gambardella, L., and Schmidhuber, J. (2012). Deep neural networks segment neuronal membranes in electron microscopy images. Advances in Neural Information Processing Systems, 25:2843–2851.
[Cordts et al., 2016] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. (2016). The Cityscapes dataset for semantic urban scene understanding. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223.
[Defferrard et al., 2016] Defferrard, M., Bresson, X., and Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852.
[Dosovitskiy et al., 2015] Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., Van Der Smagt, P., Cremers, D., and Brox, T. (2015). FlowNet: Learning optical flow with convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 2758–2766.
[Farabet et al., 2012] Farabet, C., Couprie, C., Najman, L., and LeCun, Y. (2012). Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929.
[Felzenszwalb and Huttenlocher, 2004] Felzenszwalb, P. F. and Huttenlocher, D. P. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2):167–181.
[Fu et al., 2018] Fu, H., Gong, M., Wang, C., Batmanghelich, K., and Tao, D. (2018). Deep ordinal regression network for monocular depth estimation. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 2002–2011.
[Gadde et al., 2016] Gadde, R., Jampani, V., Kiefel, M., Kappler, D., and Gehler, P. V. (2016). Superpixel convolutional networks using bilateral inceptions. In European Conference on Computer Vision, pages 597–613. Springer.
[Godard et al., 2017] Godard, C., Mac Aodha, O., and Brostow, G. J. (2017). Unsupervised monocular depth estimation with left-right consistency. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279.
[Godard et al., 2019] Godard, C., Mac Aodha, O., Firman, M., and Brostow, G. J. (2019). Digging into self-supervised monocular depth estimation. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 3828–3838.
[Goodfellow et al., 2014] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems, 27:2672–2680.
[Gould et al., 2009] Gould, S., Fulton, R., and Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In International Conference on Computer Vision, pages 1–8. IEEE.
[Grady, 2006] Grady, L. (2006). Random walks for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(11):1768–1783.
[Gupta et al., 2013] Gupta, S., Arbelaez, P., and Malik, J. (2013). Perceptual organization and recognition of indoor scenes from RGB-D images. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 564–571.
[Hammond et al., 2011] Hammond, D. K., Vandergheynst, P., and Gribonval, R. (2011). Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150.
[He et al., 2017] He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In International Conference on Computer Vision, pages 2961–2969.
[He et al., 2012] He, K., Sun, J., and Tang, X. (2012). Guided image filtering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(6):1397–1409.
[He et al., 2016a] He, K., Zhang, X., Ren, S., and Sun, J. (2016a). Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
[He et al., 2016b] He, K., Zhang, X., Ren, S., and Sun, J. (2016b). Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630– 645. Springer.
[He et al., 2004] He, X., Zemel, R. S., and Carreira-Perpiñán, M. Á. (2004). Multiscale conditional random fields for image labeling. In The IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages II–II. IEEE.
[He et al., 2006] He, X., Zemel, R. S., and Ray, D. (2006). Learning and incorporating top-down cues in image segmentation. In European Conference on Computer Vision, pages 338–351. Springer.
[Hoogeboom et al., 2019] Hoogeboom, E., Berg, R. v. d., and Welling, M. (2019). Emerging convolutions for generative normalizing flows. arXiv preprint arXiv:1901.11137.
[Iizuka et al., 2017] Iizuka, S., Simo-Serra, E., and Ishikawa, H. (2017). Globally and locally consistent image completion. ACM Transactions on Graphics, 36(4):1–14.
[Ilg et al., 2017] Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., and Brox, T. (2017). FlowNet 2.0: Evolution of optical flow estimation with deep networks. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 2462–2470.
[Ioffe and Szegedy, 2015] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Technical report.
[Isola et al., 2017] Isola, P., Zhu, J.-Y., Zhou, T., and Efros, A. A. (2017). Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134.
[Jampani et al., 2018] Jampani, V., Sun, D., Liu, M.-Y., Yang, M.-H., and Kautz, J. (2018). Superpixel sampling networks. In European Conference on Computer Vision, pages 352–368. Springer.
[Johnson et al., 2018] Johnson, J., Gupta, A., and Fei-Fei, L. (2018). Image generation from scene graphs. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 1219–1228.
[Kanezaki, 2018] Kanezaki, A. (2018). Unsupervised image segmentation by backpropagation. In International Conference on Acoustics, Speech and Signal Processing, pages 1543–1547. IEEE.
[Kendall et al., 2015] Kendall, A., Badrinarayanan, V., and Cipolla, R. (2015). Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. arXiv preprint arXiv:1511.02680.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. International Conference on Learning Representations.
[Kipf and Welling, 2016] Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. Technical report.
[Kirillov et al., 2019a] Kirillov, A., Girshick, R., He, K., and Dollár, P. (2019a). Panoptic feature pyramid networks. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 6399–6408.
[Kirillov et al., 2019b] Kirillov, A., He, K., Girshick, R., Rother, C., and Dollár, P. (2019b). Panoptic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 9404–9413.
[Knyazev et al., 2019] Knyazev, B., Lin, X., Amer, M. R., and Taylor, G. W. (2019). Image classification with hierarchical multigraph networks. Technical report.
[Kohli et al., 2009] Kohli, P., Torr, P. H., et al. (2009). Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302–324.
[Krähenbühl and Koltun, 2011] Krähenbühl, P. and Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. Advances in Neural Information Processing Systems, 24:109–117.
[Krizhevsky et al., 2012] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105.
[Krizhevsky et al., 2017] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2017). ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90.
[Kwak et al., 2017] Kwak, S., Hong, S., and Han, B. (2017). Weakly supervised semantic segmentation using superpixel pooling network. In Thirty-First AAAI Conference on Artificial Intelligence.
[Leung and Malik, 2001] Leung, T. and Malik, J. (2001). Representing and recognizing the visual appearance of materials using three-dimensional textons. International Journal of Computer Vision, 43(1):29–44.
[Levin et al., 2007] Levin, A., Lischinski, D., and Weiss, Y. (2007). A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228–242.
[Li and Yu, 2015] Li, G. and Yu, Y. (2015). Visual saliency based on multiscale deep features. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 5455–5463.
[Li et al., 2020] Li, X., You, A., Zhu, Z., Zhao, H., Yang, M., Yang, K., and Tong, Y. (2020). Semantic flow for fast and accurate scene parsing. Technical report.
[Lin et al., 2017] Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017). Feature pyramid networks for object detection. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125.
[Liu et al., 2011] Liu, M.-Y., Tuzel, O., Ramalingam, S., and Chellappa, R. (2011). Entropy rate superpixel segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 2097–2104. IEEE.
[Liu et al., 2016] Liu, Y.-J., Yu, C.-C., Yu, M.-J., and He, Y. (2016). Manifold SLIC: A fast method to compute content-sensitive superpixels. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 651–659.
[Long et al., 2015] Long, J., Shelhamer, E., and Darrell, T. (2015). Fully convolutional networks for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440.
[Matsuo and Aoki, 2015] Matsuo, K. and Aoki, Y. (2015). Depth image enhancement using local tangent plane approximations. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 3574–3583.
[McCallum and Sutton, 2005] McCallum, A. and Sutton, C. (2005). Piecewise training of undirected models. In Conference on Uncertainty in Artificial Intelligence.
[Mester et al., 2011] Mester, R., Conrad, C., and Guevara, A. (2011). Multichannel segmentation using contour relaxation: fast super-pixels and temporal propagation. In Scandinavian Conference on Image Analysis, pages 250–261. Springer.
[Mnih and Hinton, 2010] Mnih, V. and Hinton, G. E. (2010). Learning to detect roads in high-resolution aerial images. In European Conference on Computer Vision, pages 210–223. Springer.
[Mnih and Hinton, 2012] Mnih, V. and Hinton, G. E. (2012). Learning to label aerial images from noisy data. In The International Conference on Machine Learning, pages 567–574.
[Monti et al., 2017] Monti, F., Boscaini, D., Masci, J., Rodola, E., Svoboda, J., and Bronstein, M. M. (2017). Geometric deep learning on graphs and manifolds using mixture model cnns. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 5115–5124.
[Morgan and Bourlard, 1990] Morgan, N. and Bourlard, H. (1990). Generalization and parameter estimation in feedforward nets: Some experiments. In Advances in Neural Information Processing Systems, pages 630–637.
[Nah et al., 2017] Nah, S., Hyun Kim, T., and Mu Lee, K. (2017). Deep multi-scale convolutional neural network for dynamic scene deblurring. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 3883–3891.
[Noh et al., 2015] Noh, H., Hong, S., and Han, B. (2015). Learning deconvolution network for semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 1520–1528.
[Park et al., 2019] Park, T., Liu, M.-Y., Wang, T.-C., and Zhu, J.-Y. (2019). Semantic image synthesis with spatially-adaptive normalization. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 2337–2346.
[Paszke et al., 2019] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035.
[Perazzi et al., 2016] Perazzi, F., Pont-Tuset, J., McWilliams, B., Van Gool, L., Gross, M., and Sorkine-Hornung, A. (2016). A benchmark dataset and evaluation method- ology for video object segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 724–732.
[Pinheiro and Collobert, 2014] Pinheiro, P. and Collobert, R. (2014). Recurrent convo- lutional neural networks for scene labeling. In International Conference on Machine Learning, pages 82–90. PMLR.
[Porter and Duff, 1984] Porter, T. and Duff, T. (1984). Compositing digital images. In The 11th Annual Conference on Computer Graphics and Interactive Techniques, pages 253–259.
[Ren and Malik, 2003] Ren, X. and Malik, J. (2003). Learning a classification model for segmentation. In International Conference on Computer Vision, volume 1, pages 10–17.
[Ronneberger et al., 2015] Ronneberger, O., Fischer, P., and Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer.
[Rother et al., 2004] Rother, C., Kolmogorov, V., and Blake, A. (2004). "GrabCut": Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3):309–314.
[Saeedan et al., 2018] Saeedan, F., Weber, N., Goesele, M., and Roth, S. (2018). Detail-preserving pooling in deep networks. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 9108–9116.
[Saito et al., 2016] Saito, S., Yamashita, T., and Aoki, Y. (2016). Multiple object extraction from aerial imagery with convolutional neural networks. Electronic Imaging, 2016(10):1–9.
[Sawicki, 2007] Sawicki, M. (2007). Filming the fantastic: a guide to visual effects cinematography. Taylor & Francis.
[Schonfeld et al., 2020] Schonfeld, E., Schiele, B., and Khoreva, A. (2020). A U-Net based discriminator for generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 8207–8216.
[Shi and Malik, 2000] Shi, J. and Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905.
[Shotton et al., 2006] Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2006). TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In European Conference on Computer Vision, pages 1–15. Springer.
[Shotton et al., 2009a] Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2009a). TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81(1):2–23.
[Shotton et al., 2009b] Shotton, J., Winn, J., Rother, C., and Criminisi, A. (2009b). TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81(1):2–23.
[Shuman et al., 2013] Shuman, D. I., Narang, S. K., Frossard, P., Ortega, A., and Vandergheynst, P. (2013). The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98.
[Silberman et al., 2012] Silberman, N., Hoiem, D., Kohli, P., and Fergus, R. (2012). Indoor segmentation and support inference from RGBD images. In European Conference on Computer Vision, pages 746–760. Springer.
[Strang, 2006] Strang, G. (2006). Linear algebra and its applications. Thomson, Brooks/Cole, Belmont, CA.
[Stutz et al., 2018] Stutz, D., Hermans, A., and Leibe, B. (2018). Superpixels: An evaluation of the state-of-the-art. Computer Vision and Image Understanding, 166:1–27.
[Sun et al., 2018] Sun, D., Yang, X., Liu, M.-Y., and Kautz, J. (2018). PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 8934–8943.
[Sun et al., 2004] Sun, J., Jia, J., Tang, C.-K., and Shum, H.-Y. (2004). Poisson matting. In ACM SIGGRAPH 2004 Papers, pages 315–321.
[Suzuki, 2020] Suzuki, T. (2020). Superpixel segmentation via convolutional neural networks with regularized information maximization. In International Conference on Acoustics, Speech and Signal Processing, pages 2573–2577. IEEE.
[Suzuki, 2021] Suzuki, T. (2021). Implicit integration of superpixel segmentation into fully convolutional networks. arXiv preprint arXiv:2103.03435.
[Suzuki et al., 2018] Suzuki, T., Akizuki, S., Kato, N., and Aoki, Y. (2018). Superpixel convolution for segmentation. In International Conference on Image Processing, pages 3249–3253. IEEE.
[Suzuki and Aoki, 2018] Suzuki, T. and Aoki, Y. (2018). Graph convolutional neural networks on superpixels for segmentation. IEICE Transactions on Information and Systems (Japanese Edition) D, 101(8):1120–1129.
[Suzuki and Aoki, 2020] Suzuki, T. and Aoki, Y. (2020). Unsupervised superpixel segmentation via convolutional neural network. IEICE Transactions on Information and Systems (Japanese Edition) D, 103(10):702–711.
[Takayama et al., 2016] Takayama, S., Suzuki, T., Aoki, Y., Isobe, S., and Masuda, M. (2016). Tracking people in dense crowds using supervoxels. In International Conference on Signal-Image Technology & Internet-Based Systems, pages 532–537. IEEE.
[Tao et al., 2018] Tao, X., Gao, H., Shen, X., Wang, J., and Jia, J. (2018). Scale-recurrent network for deep image deblurring. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 8174–8182.
[Tasli et al., 2013] Tasli, H. E., Cigla, C., Gevers, T., and Alatan, A. A. (2013). Super pixel extraction via convexity induced boundary adaptation. In IEEE International Conference on Multimedia and Expo, pages 1–6. IEEE.
[Torralba et al., 2004] Torralba, A., Murphy, K. P., and Freeman, W. T. (2004). Sharing features: efficient boosting procedures for multiclass object detection. In The IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages II–II. IEEE.
[Tu et al., 2018] Tu, W.-C., Liu, M.-Y., Jampani, V., Sun, D., Chien, S.-Y., Yang, M.-H., and Kautz, J. (2018). Learning superpixels with segmentation-aware affinity loss. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 568–576.
[Uijlings et al., 2013] Uijlings, J. R., Van De Sande, K. E., Gevers, T., and Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171.
[Ulyanov et al., 2016] Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2016). Instance normalization: The missing ingredient for fast stylization. Technical report.
[Ulyanov et al., 2018] Ulyanov, D., Vedaldi, A., and Lempitsky, V. (2018). Deep image prior. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 9446–9454.
[Van den Bergh et al., 2012] Van den Bergh, M., Boix, X., Roig, G., de Capitani, B., and Van Gool, L. (2012). SEEDS: Superpixels extracted via energy-driven sampling. In European Conference on Computer Vision, pages 13–26. Springer.
[van der Walt et al., 2014] van der Walt, S., Schönberger, J. L., Nunez-Iglesias, J., Boulogne, F., Warner, J. D., Yager, N., Gouillart, E., Yu, T., and the scikit-image contributors (2014). scikit-image: image processing in Python. PeerJ, 2:e453.
[Vaswani et al., 2017] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
[Veksler et al., 2010] Veksler, O., Boykov, Y., and Mehrani, P. (2010). Superpixels and supervoxels in an energy optimization framework. In European Conference on Computer Vision, pages 211–224. Springer.
[Voigtlaender et al., 2019] Voigtlaender, P., Krause, M., Osep, A., Luiten, J., Sekar, B. B. G., Geiger, A., and Leibe, B. (2019). MOTS: Multi-object tracking and segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 7942–7951.
[Von Luxburg, 2007] Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416.
[Weikersdorfer et al., 2012] Weikersdorfer, D., Gossow, D., and Beetz, M. (2012). Depth-adaptive superpixels. In International Conference on Pattern Recognition, pages 2087–2090. IEEE.
[Wu and He, 2018] Wu, Y. and He, K. (2018). Group normalization. In European Conference on Computer Vision, pages 3–19.
[Xu et al., 2018] Xu, N., Yang, L., Fan, Y., Yue, D., Liang, Y., Yang, J., and Huang, T. (2018). YouTube-VOS: A large-scale video object segmentation benchmark. arXiv preprint arXiv:1809.03327.
[Yang et al., 2020] Yang, F., Sun, Q., Jin, H., and Zhou, Z. (2020). Superpixel segmentation with fully convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 13964–13973.
[Yao et al., 2015] Yao, J., Boben, M., Fidler, S., and Urtasun, R. (2015). Real-time coarse-to-fine topologically preserving segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 2947–2955.
[Yeh et al., 2017] Yeh, R. A., Chen, C., Yian Lim, T., Schwing, A. G., Hasegawa-Johnson, M., and Do, M. N. (2017). Semantic image inpainting with deep generative models. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 5485–5493.
[Yu and Koltun, 2015] Yu, F. and Koltun, V. (2015). Multi-scale context aggregation by dilated convolutions. Technical report.
[Zhang et al., 2019] Zhang, L., Li, X., Arnab, A., Yang, K., Tong, Y., and Torr, P. H. (2019). Dual graph convolutional network for semantic segmentation. Technical report.
[Zhao et al., 2017] Zhao, H., Shi, J., Qi, X., Wang, X., and Jia, J. (2017). Pyramid scene parsing network. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890.
[Zheng et al., 2015] Zheng, S., Jayasumana, S., Romera-Paredes, B., Vineet, V., Su, Z., Du, D., Huang, C., and Torr, P. H. (2015). Conditional random fields as recurrent neural networks. In International Conference on Computer Vision, pages 1529–1537.
[Zhi et al., 2019] Zhi, S., Bloesch, M., Leutenegger, S., and Davison, A. J. (2019). SceneCode: Monocular dense semantic reconstruction using learned encoded scene representations. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 11776–11785.
[Zhou et al., 2016] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., and Torralba, A. (2016). Semantic understanding of scenes through the ADE20K dataset. Technical report.
[Zhu et al., 2019] Zhu, X., Hu, H., Lin, S., and Dai, J. (2019). Deformable ConvNets v2: More deformable, better results. In The IEEE Conference on Computer Vision and Pattern Recognition, pages 9308–9316.