Pattern Recognition and Computer Vision 2025
Course Overview
This course introduces the fundamental theory and applications of pattern recognition and computer vision, covering image processing, feature extraction, classification, and detection algorithms. In addition, eight topic lectures survey the frontiers of computer vision, including research problems, datasets, and state-of-the-art algorithms. This page hosts the materials for the eight topic lectures; for any questions, contact wt [@] smail.nju.edu.cn.
Topic 1: Introduction to Deep Learning
- Date: February 27, 2025
- TA: 杨旖纯
- Slides: Lecture 1 slides (link)
- Recommended resources (a minimal training-loop sketch follows this list):
- Stanford CS231n: Deep Learning for Computer Vision, Fei-Fei Li
- Deep Learning (Goodfellow, Bengio, and Courville; the "flower book"), Chapters 5-8
- Machine Learning (course videos on Bilibili), Hung-yi Lee
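To ground these introductory resources, here is a minimal sketch of the idea they all begin with: training a linear softmax classifier with gradient descent on the cross-entropy loss. The toy data, sizes, and learning rate are illustrative assumptions, not course material.

```python
import numpy as np

# Toy data: N points, D features, C classes (randomly generated for illustration).
rng = np.random.default_rng(0)
N, D, C = 100, 5, 3
X = rng.normal(size=(N, D))
y = rng.integers(0, C, size=N)

W = np.zeros((D, C))  # weights
b = np.zeros(C)       # biases
lr = 0.1              # learning rate (assumed)

for step in range(200):
    logits = X @ W + b                             # (N, C) class scores
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)      # softmax
    loss = -np.log(probs[np.arange(N), y]).mean()  # cross-entropy loss

    dlogits = probs.copy()                         # gradient w.r.t. logits
    dlogits[np.arange(N), y] -= 1.0
    dlogits /= N
    W -= lr * (X.T @ dlogits)                      # gradient descent step
    b -= lr * dlogits.sum(axis=0)

print(f"final training loss: {loss:.3f}")
```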
Topic 2: CNN and Transformer Architectures
- Date: March 13, 2025
- TA: 吴涛
- Slides: Lecture 2 slides (link)
- Recommended reading (a minimal residual-block and attention sketch follows this list):
- Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” Advances in neural information processing systems 25 (2012).
- He, Kaiming, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2016.
- Liu, Zhuang, et al. “A ConvNet for the 2020s.” Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022.
- Vaswani, Ashish, et al. “Attention is all you need.” Advances in neural information processing systems 30 (2017).
- Dosovitskiy, Alexey, et al. “An image is worth 16x16 words: Transformers for image recognition at scale.” arXiv preprint arXiv:2010.11929 (2020).
- Touvron, Hugo, et al. “Training data-efficient image transformers & distillation through attention.” International conference on machine learning. PMLR, 2021.
- Liu, Ze, et al. “Swin transformer: Hierarchical vision transformer using shifted windows.” Proceedings of the IEEE/CVF international conference on computer vision. 2021.
- Wang, Wenhai, et al. “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions.” Proceedings of the IEEE/CVF international conference on computer vision. 2021.
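The readings above center on two building blocks that each fit in a few lines. Below is a minimal PyTorch sketch, with assumed layer sizes and the helper names BasicResBlock and attention chosen for illustration, of the identity-shortcut residual block from He et al. and the scaled dot-product attention underlying Vaswani et al. and ViT.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicResBlock(nn.Module):
    """3x3-3x3 residual block with an identity shortcut (He et al., 2016)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)  # residual connection: F(x) + x

def attention(q, k, v):
    """Scaled dot-product attention (Vaswani et al., 2017)."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return scores.softmax(dim=-1) @ v

x = torch.randn(2, 64, 32, 32)                  # assumed feature-map shape
print(BasicResBlock(64)(x).shape)               # torch.Size([2, 64, 32, 32])
tokens = torch.randn(2, 197, 64)                # ViT-style token sequence
print(attention(tokens, tokens, tokens).shape)  # torch.Size([2, 197, 64])
```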
Topic 3: Video Understanding: From Video Foundation Model to Fine-grained Understanding
- Date: March 27, 2025
- TA: 许一凡
- Slides: Lecture 3 slides (link)
- Recommended reading (a minimal masked-autoencoding sketch follows this list):
- [1] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale [ICLR 2021]
- [2] Is Space-Time Attention All You Need for Video Understanding? [ICML 2021]
- [3] Improving Language Understanding by Generative Pre-Training [2018]
- [4] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [2019]
- [5] Masked Autoencoders Are Scalable Vision Learners [CVPR 2022]
- [6] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training [NeurIPS 2022]
- [7] Learning Transferable Visual Models From Natural Language Supervision [ICML 2021]
- [8] Unmasked Teacher: Towards Training-Efficient Video Foundation Models [ICCV 2023]
- [9] InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [ECCV 2024]
- [10] Visual Instruction Tuning [NeurIPS 2023]
- [11] Long-CLIP: Unlocking the Long-Text Capability of CLIP [ECCV 2024]
- [12] AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [2024]
- [13] Tarsier: Recipes for Training and Evaluating Large Video Description Models [2024]
- [14] CaReBench: A Fine-Grained Benchmark for Video Captioning and Retrieval [2025]
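Several readings above ([5], [6], [8]) rest on masked autoencoding, whose core mechanism is simply dropping a random subset of patch or tube tokens before encoding. Here is a minimal PyTorch sketch of that masking step; the token count, embedding size, and 90% mask ratio are illustrative assumptions.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float):
    """Keep a random subset of tokens, as in MAE/VideoMAE pre-training.

    tokens: (B, N, D) patch or tube embeddings. Returns the kept tokens and a
    binary mask (1 = masked) used later for the reconstruction loss.
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                 # one random score per token
    ids_keep = noise.argsort(dim=1)[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, ids_keep, 0.0)          # 0 = kept, 1 = masked
    return kept, mask

x = torch.randn(2, 1568, 768)                # assumed video tube tokens
kept, mask = random_masking(x, mask_ratio=0.9)
print(kept.shape, mask.sum(dim=1))           # (2, 156, 768); 1412 masked each
```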
Topic 4: Video Understanding: Action and Grounding
- Date: April 10, 2025
- TA: 杨珉
- Slides: Lecture 4 slides (link)
- Recommended reading (a minimal temporal-NMS sketch follows this list):
- GitHub paper collections:
- https://github.com/zhenyingfang/Awesome-Temporal-Action-Detection-Temporal-Action-Proposal-Generation
- https://github.com/nus-cvml/awesome-temporal-action-segmentation
- https://github.com/ZhenZHAO/awesome-video-moment-retrieval
- https://github.com/terry-r123/Awesome-Captioning
- Paper search: https://arxiv.org/
- BMN: https://arxiv.org/abs/1907.09702
- BasicTAD: https://arxiv.org/abs/2205.02717
- ActionFormer: https://arxiv.org/abs/2202.07925
- MS-TCN: https://arxiv.org/abs/1903.01945
- ASFormer: https://arxiv.org/abs/2110.08568
- QVHighlights: https://arxiv.org/abs/2107.09609
- Mr BLIP: https://arxiv.org/abs/2406.18113
- DCEV: https://arxiv.org/abs/1705.00754
- PDVC: https://arxiv.org/abs/2108.07781
- Vid2Seq: https://arxiv.org/abs/2302.14115
- VTimeLLM: https://arxiv.org/abs/2311.18445
- UnLoc: https://arxiv.org/abs/2308.11062
- UniMD: https://arxiv.org/abs/2404.04933
- Temporal2Seq: https://arxiv.org/abs/2409.18478
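Most temporal action detectors in this list (e.g. BMN, BasicTAD, ActionFormer) score dense candidate segments and then deduplicate overlapping ones. A minimal sketch of that shared post-processing step, greedy non-maximum suppression over 1D (start, end, score) proposals, follows; the toy proposals and the 0.5 IoU threshold are illustrative assumptions.

```python
def temporal_iou(a, b):
    """IoU of two temporal segments given as (start, end, ...) tuples."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_nms(proposals, iou_thresh=0.5):
    """Greedy NMS: keep a proposal unless it overlaps a higher-scoring one."""
    proposals = sorted(proposals, key=lambda p: p[2], reverse=True)
    kept = []
    for p in proposals:
        if all(temporal_iou(p, q) < iou_thresh for q in kept):
            kept.append(p)
    return kept

# Toy (start_s, end_s, score) proposals; the middle one is suppressed.
props = [(10.0, 20.0, 0.9), (11.0, 21.0, 0.8), (40.0, 50.0, 0.7)]
print(temporal_nms(props))  # [(10.0, 20.0, 0.9), (40.0, 50.0, 0.7)]
```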
Topic 5: Detection and Tracking in the Wild
- Date: April 24, 2025
- TA: 高若朋
- Slides: Lecture 5 slides (link)
- Recommended reading (a minimal IoU-association sketch follows this list):
- [1] Learning Transferable Visual Models From Natural Language Supervision. [ICML 2021]
- [2] Rich feature hierarchies for accurate object detection and semantic segmentation. [CVPR 2014]
- [3] Fast R-CNN. [ICCV 2015]
- [4] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. [NeurIPS 2015]
- [5] You Only Look Once: Unified, Real-Time Object Detection. [CVPR 2016]
- [6] End-to-End Object Detection with Transformers. [ECCV 2020]
- [7] Deformable DETR: Deformable Transformers for End-to-End Object Detection. [ICLR 2021]
- [8] DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. [ICLR 2022]
- [9] DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. [CVPR 2022]
- [10] DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection. [ICLR 2023]
- [11] Simple Online and Realtime Tracking. [ICIP 2016]
- [12] Simple Online and Realtime Tracking with a Deep Association Metric. [arXiv 2017]
- [13] ByteTrack: Multi-Object Tracking by Associating Every Detection Box. [ECCV 2022]
- [14] Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking. [CVPR 2023]
- [15] Hybrid-SORT: Weak Cues Matter for Online Multi-Object Tracking. [AAAI 2024]
- [16] BoT-SORT: Robust Associations Multi-Pedestrian Tracking. [arXiv 2022]
- [17] Quo Vadis: Is Trajectory Forecasting the Key Towards Long-Term Multi-Object Tracking? [NeurIPS 2022]
- [18] DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction. [CVPR 2024]
- [19] TrackFormer: Multi-Object Tracking with Transformers. [CVPR 2022]
- [20] MOTR: End-to-End Multiple-Object Tracking with Transformers. [ECCV 2022]
- [21] MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking. [ICCV 2023]
- [22] CO-MOT: Boosting End-to-End Transformer-based Multi-Object Tracking via Coopetition Label Assignment and Shadow Sets. [ICLR 2025]
- [23] Multiple Object Tracking as ID Prediction. [CVPR 2025]
- [24] Detecting Twenty-thousand Classes using Image-Level Supervision. [ECCV 2022]
- [25] TransVG: End-to-End Visual Grounding with Transformers. [ICCV 2021]
- [26] Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. [ECCV 2024]
- [27] Segment Anything. [ICCV 2023] Demo URL: Segment Anything Meta AI
- [28] High Performance Visual Tracking with Siamese Region Proposal Network. [CVPR 2018]
- [29] MixFormer: End-to-End Tracking with Iterative Mixed Attention. [CVPR 2022]
- [30] Autoregressive Visual Tracking. [CVPR 2023]
- [31] Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers. [CVPR 2024]
- [32] SAM 2: Segment Anything in Images and Videos. [tech report 2024] Demo: SAM 2 Demo By Meta FAIR
- [33] Visual Recognition by Request. [CVPR 2023]
- [34] YOLO-World: Realtime Open-Vocabulary Object Detection. [CVPR 2024]
- [35] Matching Anything by Segment Anything. [CVPR 2024]
- [36] Tracking Everything Everywhere All at Once. [ICCV 2023]
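The tracking-by-detection methods above ([11] to [18]) share one core step: associating new detection boxes to existing tracks by IoU with the Hungarian algorithm, as in SORT. Below is a minimal sketch of that association step using SciPy; the box coordinates and the 0.3 IoU gate are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """Pairwise IoU between (x1, y1, x2, y2) boxes; shape (tracks, dets)."""
    out = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            x1, y1 = max(t[0], d[0]), max(t[1], d[1])
            x2, y2 = min(t[2], d[2]), min(t[3], d[3])
            inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
            union = ((t[2] - t[0]) * (t[3] - t[1])
                     + (d[2] - d[0]) * (d[3] - d[1]) - inter)
            out[i, j] = inter / union if union > 0 else 0.0
    return out

def associate(tracks, dets, iou_gate=0.3):
    """Hungarian matching on IoU, as in SORT; returns (track, det) pairs."""
    iou = iou_matrix(tracks, dets)
    rows, cols = linear_sum_assignment(-iou)   # maximize total IoU
    return [(r, c) for r, c in zip(rows, cols) if iou[r, c] >= iou_gate]

tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]    # boxes predicted from tracks
dets = [(21, 21, 31, 31), (1, 1, 11, 11)]      # new detections this frame
print(associate(tracks, dets))                 # [(0, 1), (1, 0)]
```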
Topic 6: Advances in Video MLLM
- Date: May 8, 2025
- TA: 曾祥宇
- Slides: Lecture 6 slides (link)
- Recommended reading:
- [1] BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models (PMLR23)
- [2] Visual Instruction Tuning (LLaVA, NeurIPS23)
- [3] VideoChat: Chat-Centric Video Understanding
- [4] MVBench: A Comprehensive Multi-modal Video Understanding Benchmark (CVPR24)
- [5] Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis (CVPR25)
- [6] MLVU: Benchmarking Multi-task Long Video Understanding (CVPR25)
- [7] LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models (ECCV24)
- [8] VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos (CVPR25)
- [9] VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling
- [10] Qwen2.5-VL Technical Report
- [11] TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding (CVPR24)
- [12] TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning (ICLR25)
- [13] Number it: Temporal Grounding Videos like Flipping Manga (CVPR25)
- [14] Task Preference Optimization: Improving Multimodal Large Language Models with Vision Task Alignment (CVPR25)
- [15] VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
- [16] VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR24)
- [17] Dispider: Enabling Video LLMs with Active Real-Time Interaction via Disentangled Perception, Decision, and Reaction (CVPR25)
- [18] LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale (CVPR25)
- [19] Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge (ICLR25)
- [20] Online Video Understanding: OVBench and VideoChat-Online (CVPR25)
- [21] SpaceR: Reinforcing MLLMs in Video Spatial Reasoning
- [22] Emerging Properties in Unified Multimodal Pretraining
Topic 7: Advances in Visual Generation
- Date: May 22, 2025
- TA: 陈阳
- Slides: Lecture 7 slides (link)
- Recommended reading (a minimal diffusion training-step sketch follows this list):
- MIT 6.S978: Deep Generative Models, Kaiming He, https://mit-6s978.github.io/
- Machine Learning, Hung-yi Lee, https://speech.ee.ntu.edu.tw/~hylee/ml/2025-spring.php
- Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR).
- Goodfellow, I., Pouget-Abadie, J., Mirza, M., et al. (2014). Generative Adversarial Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2672–2680.
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. In NeurIPS, 6840–6851.
- Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers (DiT). In ICCV.
- Rombach, R., Blattmann, A., Lorenz, D., et al. (2022). High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR (pp. 10684–10695).
- Zhang, L., Rao, A., & Agrawala, M. (2023). Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet). In ICCV.
- Ye, H., et al. (2023). IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models. arXiv preprint arXiv:2308.06721.
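The diffusion readings above share one training recipe (Ho et al., 2020): corrupt a clean sample at a random timestep via the closed-form forward process, then regress the added noise with an MSE loss. Here is a minimal PyTorch sketch of one such training step; the tiny MLP denoiser and the linear schedule constants are illustrative stand-ins for the U-Net (Ho et al.) or Transformer (DiT) backbones the papers actually use.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # linear noise schedule (assumed)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative \bar{alpha}_t

# Stand-in denoiser predicting the noise from (x_t, t); real models are far larger.
model = nn.Sequential(nn.Linear(16 + 1, 64), nn.ReLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x0 = torch.randn(32, 16)                       # a batch of "clean" samples
t = torch.randint(0, T, (32,))                 # one random timestep per sample
eps = torch.randn_like(x0)                     # the noise to be predicted
a = alpha_bar[t].unsqueeze(-1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps     # closed-form forward process

pred = model(torch.cat([x_t, t.float().unsqueeze(-1) / T], dim=-1))
loss = ((pred - eps) ** 2).mean()              # epsilon-prediction MSE loss
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```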
Topic 8: Advances in Skill Learning and Embodied Intelligence
- Date: June 5, 2025
- TA: 吴益露
- Slides: Lecture 8 slides (link)
- Recommended reading:
- Weakly-Supervised Action Segmentation and Alignment via Transcript-Aware Union-of-Subspaces Learning (ICCV2021)
- Set-Supervised Action Learning in Procedural Task Videos via Pairwise Order Consistency (CVPR2022)
- Procedure Planning in Instructional Videos (ECCV2020)
- P3IV: Probabilistic Procedure Planning from Instructional Videos with Weak Supervision (CVPR2022)
- PDPP: Projected Diffusion for Procedure Planning in Instructional Videos (CVPR2023)
- Pretrained Language Models as Visual Planners for Human Assistance (ICCV2023)
- Open-Event Procedure Planning in Instructional Videos
- Learning Human Skill Generators at Key-Step Levels
- End-to-End Learning of Visual Representations from Uncurated Instructional Videos (CVPR2020)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models
- Learning Generalizable Robotic Reward Functions from “In-The-Wild” Human Videos
- Human2Robot: Learning Robot Actions from Paired Human-Robot Videos
- Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation