[1] Sikandar Amin, Mykhaylo Andriluka, Marcus Rohrbach, and Bernt Schiele. Multi-view pictorial structures for 3D human pose estimation. BMVC 2013 - Electronic Proceedings of the British Machine Vision Conference 2013, 2013.
[2] Abhijat Biswas, Henny Admoni, and Aaron Steinfeld. Fast On-Board 3D Torso Pose Recovery and Forecasting.
[3] João Carreira and Andrew Zisserman. Quo Vadis, action recognition? A new model and the Kinetics dataset. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Vol. 2017-Janua, pp. 4724–4733, 2017.
[4] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded Pyramid Network for Multi-Person Pose Estimation.
[5] Vasileios Choutas, Philippe Weinzaepfel, Jerome Revaud, and Cordelia Schmid. PoTion: Pose MoTion Representation for Action Recognition. pp. 7024–7033, 2018.
[6] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN.
[7] Junting Dong, Wen Jiang, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Fast and Robust Multi-Person 3D Pose Estimation from Multiple Views. 2019.
[8] Christoph Feichtenhofer, Axel Pinz, and Andrew Zisserman. Convolutional Two-Stream Network Fusion for Video Action Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1933–1941, 2016.
[9] Bernard Ghanem, Juan Carlos Niebles, Cees Snoek, Fabian Caba Heilbron, Humam Alwassel, Victor Escorcia, Ranjay Krishna, Shyamal Buch, and Cuong Duc Dao. The ActivityNet Large-Scale Activity Recognition Challenge 2018 Summary. CoRR, Vol. abs/1808.0, 2018.
[10] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A Large Video Database for Human Motion Recognition. 2011 International Conference on Computer Vision, pp. 2556–2563, 2011.
[11] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6546–6555, 2018.
[12] Yun He, Soma Shirakabe, Yutaka Satoh, and Hirokatsu Kataoka. Human Action Recognition without Human.
[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
[14] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6M: Large Scale Datasets and Predictive Methods for 3D Human Sensing in Natural Environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 36, pp. 1325–1339, 2013.
[15] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end Recovery of Human Shape and Pose.
[16] Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end Recovery of Human Shape and Pose. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7122–7131, 2018.
[17] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics Human Action Video Dataset. 2017.
[18] Ilya Kostrikov and Juergen Gall. Depth sweep regression forests for estimating 3D human pose from images. BMVC 2014 - Proceedings of the British Machine Vision Conference 2014, pp. 1–13, 2014.
[19] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics, Vol. 34, No. 6, 2015.
[20] Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin, Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, and Christian Theobalt. VNect. ACM Transactions on Graphics, Vol. 36, No. 4, pp. 1–14, 2017.
[21] Francesc Moreno-Noguer. 3D Human Pose Estimation from a Single Image via Distance Matrix Regression. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[22] Paritosh Parmar and Brendan Tran Morris. Learning to Score Olympic Events. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Vol. 2017-July, pp. 76–84, 2017.
[23] Georgios Pavlakos, Xiaowei Zhou, Konstantinos G. Derpanis, and Kostas Daniilidis. Harvesting multiple views for marker-less 3D human pose annotations. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Vol. 2017-Janua, pp. 1253–1262, 2017.
[24] Dario Pavllo, Christoph Feichtenhofer, David Grangier, and Michael Auli. 3D human pose estimation in video with temporal convolutions and semi-supervised training. 2018.
[25] A. J. Piergiovanni and Michael S. Ryoo. Fine-grained activity recognition in baseball videos. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, Vol. 2018-June, pp. 1821–1830, 2018.
[26] Hamed Pirsiavash, Carl Vondrick, and Antonio Torralba. Assessing the Quality of Actions. In Computer Vision – ECCV 2014, Vol. 8694, 2014.
[27] Leonid Pishchulin, Mykhaylo Andriluka, and Bernt Schiele. Fine-grained activity recognition with holistic and pose based features. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Vol. 8753, pp. 678–689, 2014.
[28] Haibo Qiu, Chunyu Wang, Jingdong Wang, Naiyan Wang, and Wenjun Zeng. Cross View Fusion for 3D Human Pose Estimation. 2019.
[29] Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, Vol. 22, No. 3, pp. 400–407, 1951.
[30] Haşim Sak, Andrew Senior, and Françoise Beaufays. Long Short-Term Memory Recurrent Neural Network Architectures for Large Scale Acoustic Modeling. IEEE Access, Vol. 6, pp. 15733–15742, 2018.
[31] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Proceedings of the IEEE International Conference on Computer Vision, Vol. 2017-Octob, pp. 618–626, 2017.
[32] Karen Simonyan and Andrew Zisserman. Two-Stream Convolutional Networks for Action Recognition in Videos. Advances in Neural Information Processing Systems 27, pp. 568–576, 2014.
[33] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild. CoRR, No. November, 2012.
[34] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. 2015.
[35] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Vol. 2015 Inter, pp. 4489–4497, 2015.
[36] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local Neural Networks. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7794–7803, 2018.
[37] Matthew D. Zeiler. ADADELTA: An Adaptive Learning Rate Method. CoRR, Vol. abs/1212.5, 2012.
[38] Christian Zimmermann, Tim Welschehold, Christian Dornhege, Wolfram Burgard, and Thomas Brox. 3D Human Pose Estimation in RGBD Images for Robotic Task Learning. Proceedings - IEEE International Conference on Robotics and Automation, pp. 1986–1992, 2018.