{"status":"ok","message-type":"work","message-version":"1.0.0","message":{"indexed":{"date-parts":[[2026,5,2]],"date-time":"2026-05-02T15:08:50Z","timestamp":1777734530292,"version":"3.51.4"},"reference-count":75,"publisher":"MDPI AG","issue":"23","license":[{"start":{"date-parts":[[2020,12,4]],"date-time":"2020-12-04T00:00:00Z","timestamp":1607040000000},"content-version":"vor","delay-in-days":0,"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/creativecommons.org\/licenses\/by\/4.0\/"}],"content-domain":{"domain":[],"crossmark-restriction":false},"short-container-title":["Sensors"],"abstract":"<jats:p>Detecting key frames in videos is a common problem in many applications such as video classification, action recognition and video summarization. These tasks can be performed more efficiently using only a handful of key frames rather than the full video. Existing key frame detection approaches are mostly designed for supervised learning and require manual labelling of key frames in a large corpus of training data to train the models. Labelling requires human annotators from different backgrounds to annotate key frames in videos which is not only expensive and time consuming but also prone to subjective errors and inconsistencies between the labelers. To overcome these problems, we propose an automatic self-supervised method for detecting key frames in a video. Our method comprises a two-stream ConvNet and a novel automatic annotation architecture able to reliably annotate key frames in a video for self-supervised learning of the ConvNet. The proposed ConvNet learns deep appearance and motion features to detect frames that are unique. The trained network is then able to detect key frames in test videos. 
Extensive experiments on UCF101 human action and video summarization VSUMM datasets demonstrate the effectiveness of our proposed method.<\/jats:p>","DOI":"10.3390\/s20236941","type":"journal-article","created":{"date-parts":[[2020,12,4]],"date-time":"2020-12-04T11:59:00Z","timestamp":1607083140000},"page":"6941","update-policy":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/doi.org\/10.3390\/mdpi_crossmark_policy","source":"Crossref","is-referenced-by-count":37,"title":["Self-Supervised Learning to Detect Key Frames in Videos"],"prefix":"10.3390","volume":"20","author":[{"ORCID":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/orcid.org\/0000-0003-4184-1300","authenticated-orcid":false,"given":"Xiang","family":"Yan","sequence":"first","affiliation":[{"name":"School of Physics and Optoelectronic Engineering, Xidian University, Xi\u2019an 710071, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"ORCID":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/orcid.org\/0000-0002-7448-2327","authenticated-orcid":false,"given":"Syed Zulqarnain","family":"Gilani","sequence":"additional","affiliation":[{"name":"School of Science, Edith Cowan University, Joondalup 6027, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Mingtao","family":"Feng","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xidian University, Xi\u2019an 710071, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Liang","family":"Zhang","sequence":"additional","affiliation":[{"name":"School of Computer Science and Technology, Xidian University, Xi\u2019an 710071, China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Hanlin","family":"Qin","sequence":"additional","affiliation":[{"name":"School of Physics and Optoelectronic Engineering, Xidian University, Xi\u2019an 710071, 
China"}],"role":[{"role":"author","vocabulary":"crossref"}]},{"given":"Ajmal","family":"Mian","sequence":"additional","affiliation":[{"name":"Computer Science and Software Engineering, University of Western Australia, Crawley 6009, Australia"}],"role":[{"role":"author","vocabulary":"crossref"}]}],"member":"1968","published-online":{"date-parts":[[2020,12,4]]},"reference":[{"key":"ref_1","doi-asserted-by":"crossref","first-page":"1923","DOI":"10.1109\/TCYB.2017.2718579","article-title":"Key frame extraction in the summary space","volume":"48","author":"Li","year":"2017","journal-title":"IEEE Trans. Cybern."},{"key":"ref_2","doi-asserted-by":"crossref","first-page":"1","DOI":"10.1145\/3355390","article-title":"Video description: A survey of methods, datasets, and evaluation metrics","volume":"52","author":"Aafaq","year":"2019","journal-title":"ACM Comput. Surv. (CSUR)"},{"key":"ref_3","doi-asserted-by":"crossref","unstructured":"Aafaq, N., Akhtar, N., Liu, W., Gilani, S.Z., and Mian, A. (2019, January 16\u201320). Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.01277"},{"key":"ref_4","doi-asserted-by":"crossref","unstructured":"Mahasseni, B., Lam, M., and Todorovic, S. (2017, January 21\u201326). Unsupervised video summarization with adversarial LSTM networks. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.318"},{"key":"ref_5","doi-asserted-by":"crossref","first-page":"577","DOI":"10.1109\/TCSVT.2019.2890899","article-title":"Novel Key-frames Selection Framework for Comprehensive Video Summarization","volume":"30","author":"Huang","year":"2019","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_6","doi-asserted-by":"crossref","unstructured":"Acuna, D., Ling, H., Kar, A., and Fidler, S. 
(2018, January 18\u201323). Efficient interactive annotation of segmentation datasets with polygon-rnn++. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.","DOI":"10.1109\/CVPR.2018.00096"},{"key":"ref_7","doi-asserted-by":"crossref","first-page":"197","DOI":"10.1016\/j.neucom.2019.07.108","article-title":"Video summarization via block sparse dictionary selection","volume":"378","author":"Ma","year":"2020","journal-title":"Neurocomputing"},{"key":"ref_8","doi-asserted-by":"crossref","unstructured":"Ji, Z., Zhao, Y., Pang, Y., Li, X., and Han, J. (2020). Deep Attentive Video Summarization With Distribution Consistency Learning. IEEE Trans. Neural Netw. Learn. Syst.","DOI":"10.1109\/TNNLS.2020.2991083"},{"key":"ref_9","doi-asserted-by":"crossref","unstructured":"Kulhare, S., Sah, S., Pillai, S., and Ptucha, R. (2016, January 4\u20138). Key frame extraction for salient activity recognition. Proceedings of the 2016 23rd International Conference on Pattern Recognition (ICPR), Cancun, Mexico.","DOI":"10.1109\/ICPR.2016.7899739"},{"key":"ref_10","doi-asserted-by":"crossref","unstructured":"Wu, Z., Xiong, C., Ma, C.Y., Socher, R., and Davis, L.S. (2019, January 16\u201320). AdaFrame: Adaptive Frame Selection for Fast Video Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00137"},{"key":"ref_11","unstructured":"Korbar, B., Tran, D., and Torresani, L. (November, January 27). SCSampler: Sampling Salient Clips From Video for Efficient Action Recognition. Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Korea."},{"key":"ref_12","doi-asserted-by":"crossref","unstructured":"Kumar, G.N., and Reddy, V. (2019). Key Frame Extraction Using Rough Set Theory for Video Retrieval. 
Soft Computing and Signal Processing, Springer.","DOI":"10.1007\/978-981-13-3393-4_76"},{"key":"ref_13","doi-asserted-by":"crossref","unstructured":"Hsiao, M., Westman, E., Zhang, G., and Kaess, M. (June, January 29). Keyframe-based dense planar SLAM. Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore.","DOI":"10.1109\/ICRA.2017.7989597"},{"key":"ref_14","unstructured":"Sheng, L., Xu, D., Ouyang, W., and Wang, X. (November, January 27). Unsupervised Collaborative Learning of Keyframe Detection and Visual Odometry Towards Monocular Deep SLAM. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_15","unstructured":"Lin, X., Sun, D., Lin, T.Y., Eustice, R.M., and Ghaffari, M. (2019). A Keyframe-based Continuous Visual SLAM for RGB-D Cameras via Nonparametric Joint Geometric and Appearance Representation. arXiv."},{"key":"ref_16","doi-asserted-by":"crossref","first-page":"14465","DOI":"10.1007\/s11042-018-6826-3","article-title":"An automatic video annotation framework based on two level keyframe extraction mechanism","volume":"78","author":"Aote","year":"2019","journal-title":"Multimed. Tools Appl."},{"key":"ref_17","doi-asserted-by":"crossref","first-page":"2337","DOI":"10.1109\/TCSVT.2018.2867934","article-title":"Generating realistic videos from keyframes with concatenated GANs","volume":"29","author":"Wen","year":"2019","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_18","unstructured":"Simonyan, K., and Zisserman, A. (2014, January 8\u201313). Two-stream convolutional networks for action recognition in videos. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_19","unstructured":"Ye, J., Janardan, R., and Li, Q. (2005, January 5\u20138). Two-dimensional linear discriminant analysis. 
Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada."},{"key":"ref_20","unstructured":"Soomro, K., Zamir, A.R., and Shah, M. (2012). UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv."},{"key":"ref_21","doi-asserted-by":"crossref","first-page":"56","DOI":"10.1016\/j.patrec.2010.08.004","article-title":"VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method","volume":"32","author":"Lopes","year":"2011","journal-title":"Pattern Recognit. Lett."},{"key":"ref_22","unstructured":"Doermann, D., and Mihalcik, D. (2000, January 3\u20137). Tools and techniques for video performance evaluation. Proceedings of the 15th International Conference on Pattern Recognition. ICPR-2000, Barcelona, Spain."},{"key":"ref_23","doi-asserted-by":"crossref","unstructured":"Yuen, J., Russell, B., Liu, C., and Torralba, A. (October, January 29). Labelme video: Building a video database with human annotations. Proceedings of the 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan.","DOI":"10.1109\/ICCV.2009.5459289"},{"key":"ref_24","doi-asserted-by":"crossref","first-page":"2093","DOI":"10.1109\/TMM.2019.2895511","article-title":"End-to-End Automatic Image Annotation Based on Deep CNN and Multi-Label Data Augmentation","volume":"21","author":"Ke","year":"2019","journal-title":"IEEE Trans. Multimed."},{"key":"ref_25","unstructured":"Feng, S., Manmatha, R., and Lavrenko, V. (July, January 27). Multiple bernoulli relevance models for image and video annotation. Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Washington, DC, USA."},{"key":"ref_26","doi-asserted-by":"crossref","unstructured":"Gao, L., Song, J., Nie, F., Yan, Y., Sebe, N., and Tao Shen, H. (2015, January 7\u201312). Optimal graph learning with partial tags and multiple features for image and video annotation. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7299066"},{"key":"ref_27","doi-asserted-by":"crossref","first-page":"4999","DOI":"10.1109\/TIP.2016.2601260","article-title":"Optimized graph learning using partial tags and multiple features for image and video annotation","volume":"25","author":"Song","year":"2016","journal-title":"IEEE Trans. Image Process."},{"key":"ref_28","doi-asserted-by":"crossref","first-page":"3210","DOI":"10.1109\/TIP.2018.2814344","article-title":"Self-supervised video hashing with hierarchical binary auto-encoder","volume":"27","author":"Song","year":"2018","journal-title":"IEEE Trans. Image Process."},{"key":"ref_29","doi-asserted-by":"crossref","unstructured":"Berg, A., Johnander, J., Durand de Gevigney, F., Ahlberg, J., and Felsberg, M. (2019, January 27\u201328). Semi-automatic annotation of objects in visual-thermal video. Proceedings of the IEEE International Conference on Computer Vision Workshops, Seoul, Korea.","DOI":"10.1109\/ICCVW.2019.00277"},{"key":"ref_30","doi-asserted-by":"crossref","unstructured":"Porzi, L., Hofinger, M., Ruiz, I., Serrat, J., Bul\u00f2, S.R., and Kontschieder, P. (2019). Learning Multi-Object Tracking and Segmentation from Automatic Annotations. arXiv.","DOI":"10.1109\/CVPR42600.2020.00688"},{"key":"ref_31","doi-asserted-by":"crossref","unstructured":"Gygli, M., and Ferrari, V. (2019). Efficient Object Annotation via Speaking and Pointing. arXiv.","DOI":"10.1007\/s11263-019-01255-4"},{"key":"ref_32","doi-asserted-by":"crossref","unstructured":"Caba Heilbron, F., Escorcia, V., Ghanem, B., and Carlos Niebles, J. (2015, January 7\u201312). Activitynet: A large-scale video benchmark for human activity understanding. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA.","DOI":"10.1109\/CVPR.2015.7298698"},{"key":"ref_33","doi-asserted-by":"crossref","unstructured":"Sigurdsson, G.A., Varol, G., Wang, X., Farhadi, A., Laptev, I., and Gupta, A. (2016, January 11\u201314). Hollywood in homes: Crowdsourcing data collection for activity understanding. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46448-0_31"},{"key":"ref_34","doi-asserted-by":"crossref","unstructured":"Gao, J., Sun, C., Yang, Z., and Nevatia, R. (2017, January 22\u201329). Tall: Temporal activity localization via language query. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.","DOI":"10.1109\/ICCV.2017.563"},{"key":"ref_35","doi-asserted-by":"crossref","unstructured":"Gygli, M., Grabner, H., Riemenschneider, H., and Van Gool, L. (2014, January 6\u201312). Creating summaries from user videos. Proceedings of the European Conference on Computer Vision, Zurich, Switzerland.","DOI":"10.1007\/978-3-319-10584-0_33"},{"key":"ref_36","unstructured":"Song, Y., Vallmitjana, J., Stent, A., and Jaimes, A. (2015, January 7\u201312). Tvsum: Summarizing web videos using titles. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA."},{"key":"ref_37","unstructured":"Wolf, W. (1996, January 9). Key frame selection by motion analysis. Proceedings of the 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, Atlanta, GA, USA."},{"key":"ref_38","doi-asserted-by":"crossref","first-page":"91","DOI":"10.1023\/B:VISI.0000029664.99615.94","article-title":"Distinctive image features from scale-invariant keypoints","volume":"60","author":"Lowe","year":"2004","journal-title":"Int. J. Comput. 
Vis."},{"key":"ref_39","doi-asserted-by":"crossref","first-page":"729","DOI":"10.1109\/TCSVT.2012.2214871","article-title":"Keypoint-based keyframe selection","volume":"23","author":"Guan","year":"2013","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_40","unstructured":"Zhuang, Y., Rui, Y., Huang, T.S., and Mehrotra, S. (1998, January 7). Adaptive key frame extraction using unsupervised clustering. Proceedings of the 1998 International Conference on Image Processing. ICIP98 (Cat. No.98CB36269), Chicago, IL, USA."},{"key":"ref_41","doi-asserted-by":"crossref","first-page":"1097","DOI":"10.1109\/TMM.2005.858392","article-title":"Detection and representation of scenes in videos","volume":"7","author":"Rasheed","year":"2005","journal-title":"IEEE Trans. Multimed."},{"key":"ref_42","doi-asserted-by":"crossref","first-page":"82","DOI":"10.1109\/TCSVT.2005.856896","article-title":"Information theory-based shot cut\/fade detection and video summarization","volume":"16","author":"Cernekova","year":"2006","journal-title":"IEEE Trans. Circuits Syst. Video Technol."},{"key":"ref_43","doi-asserted-by":"crossref","first-page":"424","DOI":"10.1016\/j.neucom.2018.11.038","article-title":"Fast and robust dynamic hand gesture recognition via key frames extraction and feature fusion","volume":"331","author":"Tang","year":"2019","journal-title":"Neurocomputing"},{"key":"ref_44","doi-asserted-by":"crossref","first-page":"770","DOI":"10.1016\/j.patrec.2012.12.009","article-title":"Spatio-temporal feature-based keyframe detection from video shots using spectral clustering","volume":"34","author":"Bandera","year":"2013","journal-title":"Pattern Recognit. Lett."},{"key":"ref_45","doi-asserted-by":"crossref","first-page":"66","DOI":"10.1109\/TMM.2011.2166951","article-title":"Towards scalable summarization of consumer videos via sparse dictionary selection","volume":"14","author":"Cong","year":"2012","journal-title":"IEEE Trans. 
Multimed."},{"key":"ref_46","doi-asserted-by":"crossref","first-page":"522","DOI":"10.1016\/j.patcog.2014.08.002","article-title":"Video summarization via minimum sparse reconstruction","volume":"48","author":"Mei","year":"2015","journal-title":"Pattern Recognit."},{"key":"ref_47","doi-asserted-by":"crossref","unstructured":"Meng, J., Wang, H., Yuan, J., and Tan, Y.P. (2016, January 27\u201330). From keyframes to key objects: Video summarization by representative object proposal selection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.118"},{"key":"ref_48","doi-asserted-by":"crossref","unstructured":"Lai, K.T., Yu, F.X., Chen, M.S., and Chang, S.F. (2014, January 23\u201328). Video event detection by inferring temporal instance labels. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA.","DOI":"10.1109\/CVPR.2014.288"},{"key":"ref_49","doi-asserted-by":"crossref","unstructured":"Zhou, L., and Nagahashi, H. (2017, January 24\u201326). Real-time Action Recognition Based on Key Frame Detection. Proceedings of the 9th International Conference on Machine Learning and Computing, Singapore.","DOI":"10.1145\/3055635.3056569"},{"key":"ref_50","doi-asserted-by":"crossref","unstructured":"Yang, H., Wang, B., Lin, S., Wipf, D., Guo, M., and Guo, B. (2015, January 7\u201313). Unsupervised extraction of video highlights via robust recurrent auto-encoders. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.","DOI":"10.1109\/ICCV.2015.526"},{"key":"ref_51","doi-asserted-by":"crossref","unstructured":"Kar, A., Rai, N., Sikka, K., and Sharma, G. (2017, January 21\u201326). AdaScan: Adaptive Scan Pooling in Deep Convolutional Neural Networks for Human Action Recognition in Videos. 
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.604"},{"key":"ref_52","doi-asserted-by":"crossref","unstructured":"Bhardwaj, S., Srinivasan, M., and Khapra, M.M. (2019, January 16\u201320). Efficient Video Classification Using Fewer Frames. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00044"},{"key":"ref_53","doi-asserted-by":"crossref","first-page":"319","DOI":"10.1016\/j.ins.2017.12.020","article-title":"A salient dictionary learning framework for activity video summarization via key-frame extraction","volume":"432","author":"Mademlis","year":"2018","journal-title":"Inf. Sci."},{"key":"ref_54","unstructured":"GogiReddy, H.S.S.R., and Sinha, N. (October, January 29). Video Key Frame Detection Using Block Sparse Coding. Proceedings of the 3rd International Conference on Computer Vision and Image Processing, Jabalpur, India."},{"key":"ref_55","doi-asserted-by":"crossref","unstructured":"Kwak, I., Guo, J.Z., Hantman, A., Kriegman, D., and Branson, K. (2020, January 1\u20135). Detecting the starting frame of actions in video. Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA.","DOI":"10.1109\/WACV45572.2020.9093405"},{"key":"ref_56","doi-asserted-by":"crossref","unstructured":"Ren, J., Shen, X., Lin, Z., and Mech, R. (2020, January 1\u20135). Best Frame Selection in a Short Video. Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA.","DOI":"10.1109\/WACV45572.2020.9093615"},{"key":"ref_57","doi-asserted-by":"crossref","first-page":"147","DOI":"10.1016\/j.neucom.2018.03.077","article-title":"Deep key frame extraction for sport training","volume":"328","author":"Jian","year":"2019","journal-title":"Neurocomputing"},{"key":"ref_58","unstructured":"Simonyan, K., and Zisserman, A. (2014). 
Very deep convolutional networks for large-scale image recognition. arXiv."},{"key":"ref_59","unstructured":"Feichtenhofer, C., Pinz, A., and Wildes, R. (2016, January 5\u201310). Spatiotemporal residual networks for video action recognition. Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain."},{"key":"ref_60","doi-asserted-by":"crossref","unstructured":"Feichtenhofer, C., Pinz, A., and Wildes, R.P. (2017, January 21\u201326). Spatiotemporal multiplier networks for video action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA.","DOI":"10.1109\/CVPR.2017.787"},{"key":"ref_61","doi-asserted-by":"crossref","first-page":"420","DOI":"10.1007\/s11263-019-01225-w","article-title":"Deep Insights into Convolutional Networks for Video Recognition","volume":"128","author":"Feichtenhofer","year":"2019","journal-title":"Int. J. Comput. Vis."},{"key":"ref_62","unstructured":"Jiang, B., Wang, M., Gan, W., Wu, W., and Yan, J. (November, January 27). STM: SpatioTemporal and motion encoding for action recognition. Proceedings of the IEEE International Conference on Computer Vision, Seoul, Korea."},{"key":"ref_63","doi-asserted-by":"crossref","unstructured":"Li, C., Zhong, Q., Xie, D., and Pu, S. (2019, January 16\u201320). Collaborative Spatiotemporal Feature Learning for Video Action Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.","DOI":"10.1109\/CVPR.2019.00806"},{"key":"ref_64","doi-asserted-by":"crossref","unstructured":"Prince, S.J., and Elder, J.H. (2007, January 14\u201321). Probabilistic linear discriminant analysis for inferences about identity. 
Proceedings of the 2007 11th IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil.","DOI":"10.1109\/ICCV.2007.4409052"},{"key":"ref_65","doi-asserted-by":"crossref","unstructured":"Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014, January 3\u20137). Caffe: Convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM International conference on Multimedia, Orlando, FL, USA.","DOI":"10.1145\/2647868.2654889"},{"key":"ref_66","doi-asserted-by":"crossref","unstructured":"Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., and Fei-Fei, L. (2009, January 20\u201326). Imagenet: A large-scale hierarchical image database. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2009, Miami Beach, FL, USA.","DOI":"10.1109\/CVPR.2009.5206848"},{"key":"ref_67","unstructured":"Zach, C., Pock, T., and Bischof, H. (2007). A duality based approach for realtime TV-L1 optical flow. Pattern Recognition, Springer."},{"key":"ref_68","unstructured":"Gong, B., Chao, W.L., Grauman, K., and Sha, F. (2014, January 8\u201313). Diverse sequential subset selection for supervised video summarization. Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada."},{"key":"ref_69","doi-asserted-by":"crossref","unstructured":"Zhang, K., Chao, W.L., Sha, F., and Grauman, K. (2016, January 11\u201314). Video summarization with long short-term memory. Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands.","DOI":"10.1007\/978-3-319-46478-7_47"},{"key":"ref_70","doi-asserted-by":"crossref","first-page":"1212","DOI":"10.1016\/j.jvcir.2013.08.003","article-title":"Video key frame extraction through dynamic Delaunay clustering with a structural constraint","volume":"24","author":"Kuanar","year":"2013","journal-title":"J. Vis. Commun. 
Image Represent."},{"key":"ref_71","doi-asserted-by":"crossref","unstructured":"Zhang, K., Chao, W.L., Sha, F., and Grauman, K. (2016, January 27\u201330). Summary transfer: Exemplar-based subset selection for video summarization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA.","DOI":"10.1109\/CVPR.2016.120"},{"key":"ref_72","doi-asserted-by":"crossref","unstructured":"Fu, T.J., Tai, S.H., and Chen, H.T. (2019, January 7\u201311). Attentive and adversarial learning for video summarization. Proceedings of the 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), Waikoloa Village, HI, USA.","DOI":"10.1109\/WACV.2019.00173"},{"key":"ref_73","doi-asserted-by":"crossref","unstructured":"Ji, Z., Xiong, K., Pang, Y., and Li, X. (2019). Video summarization with attention-based encoder-decoder networks. IEEE Trans. Circuits Syst. Video Technol.","DOI":"10.1109\/TCSVT.2019.2904996"},{"key":"ref_74","doi-asserted-by":"crossref","first-page":"3652","DOI":"10.1109\/TIP.2017.2695887","article-title":"A general framework for edited video and raw video summarization","volume":"26","author":"Li","year":"2017","journal-title":"IEEE Trans. Image Process."},{"key":"ref_75","doi-asserted-by":"crossref","first-page":"2782","DOI":"10.1109\/TPAMI.2013.65","article-title":"Temporal localization of actions with actoms","volume":"35","author":"Gaidon","year":"2013","journal-title":"IEEE Trans. Pattern Anal. Mach. 
Intell."}],"container-title":["Sensors"],"original-title":[],"language":"en","link":[{"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/www.mdpi.com\/1424-8220\/20\/23\/6941\/pdf","content-type":"unspecified","content-version":"vor","intended-application":"similarity-checking"}],"deposited":{"date-parts":[[2025,10,11]],"date-time":"2025-10-11T10:41:41Z","timestamp":1760179301000},"score":1,"resource":{"primary":{"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/www.mdpi.com\/1424-8220\/20\/23\/6941"}},"subtitle":[],"short-title":[],"issued":{"date-parts":[[2020,12,4]]},"references-count":75,"journal-issue":{"issue":"23","published-online":{"date-parts":[[2020,12]]}},"alternative-id":["s20236941"],"URL":"https:\/\/summer-heart-0930.chufeiyun1688.workers.dev:443\/https\/doi.org\/10.3390\/s20236941","relation":{},"ISSN":["1424-8220"],"issn-type":[{"value":"1424-8220","type":"electronic"}],"subject":[],"published":{"date-parts":[[2020,12,4]]}}}