About Me

Hi! This is Zejun Li. I am currently a Ph.D. candidate at Fudan University, where I am fortunate to be advised by Prof. Zhongyu Wei. I am a member of the Fudan Data Intelligence and Social Computing (Fudan DISC) lab and the Fudan NLP group. Previously, I had the opportunity to visit the UCSB (University of California, Santa Barbara) NLP group, where I worked with Prof. William Wang.

My research focuses on multi-modal learning across vision and language, particularly:

  • Construction and Evaluation of Large Multi-modal Models (LMMs)
  • Exploring Visually-Enhanced Reasoning Abilities in LMMs
  • Vision-Language Pre-training

📝 Publications

Preprint

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

Zejun Li*, Yingxiu Zhao*, Jiwen Zhang, Siyuan Wang, Yang Yao, Runzhou Zhao, Jun Song, Bo Zheng, Zhongyu Wei.

[GitHub Project (Coming Soon)]

  • We present Mixture-of-Visual-Thoughts, an adaptive reasoning paradigm that unifies different thinking modes within one model and guides it to select the appropriate mode based on context.
  • We introduce an adaptive learning framework, including an RL algorithm, AdaGRPO, tailored to induce context-adaptive mode selection capabilities (a toy sketch of the group-relative idea this family of algorithms builds on is shown below).
  • We construct AdaVaR-3B and AdaVaR-7B, which achieve consistent improvements across various scenarios!
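For readers less familiar with GRPO-style training, below is a minimal, hypothetical sketch of the group-relative advantage computation this family of RL algorithms builds on, with each sampled rollout tagged by the reasoning mode the model chose. It is not the AdaGRPO algorithm from the paper; the function names and the `mode` field are illustrative assumptions.

```python
# Toy sketch only: plain GRPO-style group-relative advantages over rollouts
# tagged with a (hypothetical) reasoning-mode label. This is NOT AdaGRPO.
from statistics import mean, pstdev

def group_relative_advantages(rollouts, eps=1e-6):
    """rollouts: list of dicts like {"mode": "visual", "reward": 1.0}."""
    rewards = [r["reward"] for r in rollouts]
    mu, sigma = mean(rewards), pstdev(rewards)
    return [
        {"mode": r["mode"], "advantage": (r["reward"] - mu) / (sigma + eps)}
        for r in rollouts
    ]

# One question, several rollouts sampled in different reasoning modes.
rollouts = [
    {"mode": "text-only", "reward": 0.0},
    {"mode": "text-only", "reward": 1.0},
    {"mode": "visual", "reward": 1.0},
    {"mode": "visual", "reward": 1.0},
]
print(group_relative_advantages(rollouts))
```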
NAACL 2025

VoCoT: Unleashing Visually-Grounded Multi-Step Reasoning in Large Multi-Modal Models

Zejun Li*, Ruipu Luo*, Jiwen Zhang, Minghui Qiu, Xuanjing Huang, Zhongyu Wei.

[GitHub Project] [Data] [Model]

  • We present VoCoT, a multi-modal interleaved reasoning CoT format for LMMs.
  • We construct an SFT dataset to enable LMMs to reason with VoCoT!
  • We construct VolCano, a VoCoT-enhanced LMM, which is able to perform reasoning with visually-grounded and highly explainable thoughts!
Preprint

AutoJudger: An Agent-Driven Framework for Efficient Benchmarking of MLLMs

Xuanwen Ding*, Chengjun Pan*, Zejun Li*, Jiwen Zhang*, Siyuan Wang, Zhongyu Wei.

[GitHub Project]

  • We formulate the task of multi-modal efficient benchmarking: selecting representative subsets from benchmarks for efficient evaluation.
  • We propose an agent-driven framework, AutoJudger, which leverages an interviewer agent to dynamically select suitable questions for the subject (the evaluated model); a simplified sketch of the adaptive-selection idea is given below.
  • With AutoJudger, only 2%-5% of the data (<10% of the computational cost) yields >90% consistency with full-scale evaluation on most benchmarks.
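The sketch below illustrates the general idea of adaptive question selection for efficient benchmarking, in the spirit of computerized adaptive testing. It is a simplified stand-in, not the AutoJudger interviewer agent; `evaluate_model`, the question fields, and the ability-update rule are all hypothetical.

```python
# Simplified sketch of adaptive question selection (NOT the AutoJudger agent):
# repeatedly ask the unseen question whose difficulty is closest to the
# subject's current estimated ability, then nudge the ability estimate.

def select_subset(questions, evaluate_model, budget=20):
    """questions: list of dicts like {"id": 3, "difficulty": 0.7}, difficulty in [0, 1]."""
    ability, asked, results = 0.5, set(), []
    for _ in range(min(budget, len(questions))):
        # pick the most informative remaining question for the current ability estimate
        q = min(
            (q for q in questions if q["id"] not in asked),
            key=lambda q: abs(q["difficulty"] - ability),
        )
        asked.add(q["id"])
        correct = evaluate_model(q)  # 1 if the evaluated model answers correctly, else 0
        results.append((q["id"], correct))
        # crude update: harder questions answered correctly raise the estimate more
        ability += 0.1 * (correct - (1.0 - q["difficulty"]))
        ability = min(max(ability, 0.0), 1.0)
    return results
```

The model is then scored only on the selected subset instead of the full benchmark.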
Preprint

Continuous or Discrete, That Is the Question: A Survey on Large Multi-Modal Models from the Perspective of Input-Output Space Extension

Zejun Li*, Jiwen Zhang*, Dianyi Wang*, Ye Wang*, Xuanjing Huang, Zhongyu Wei.

[Awesome LMMs Reviewed]

  • We review and summarize existing LMMs from a unified and straightforward perspective: input-output space extension.
  • We discuss various model architectures, training datasets, and evaluation strategies, aiming to provide a comprehensive overview for readers.
  • Our survey also discusses how to extend the input-output space to embodied environments, leaving room for future development.
ACM MM 2024

ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks

Zejun Li*, Ye Wang*, Mengfei Du*, Qingwen Liu*, Binhao Wu*, Jiwen Zhang*, et al.

[Benchmark Page]

  • We propose an effective method to transform task-oriented benchmarks into formats compatible with evaluating LMMs based on open-ended generation; a toy example of such a re-formulation is sketched below.
  • Please also see EmbSpatial-Bench, a benchmark dedicated to diagnosing a major limitation of current LMMs, namely spatial reasoning.
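To make the re-formulation idea concrete, here is a toy, hypothetical example of turning a task-oriented item (e.g., a classification question with a fixed label set) into an instruction-style multiple-choice prompt that an LMM can answer via open-ended generation; the template and field names are illustrative, not the exact ReForm-Eval ones.

```python
# Toy re-formulation: classification-style item -> multiple-choice prompt.
# Template and field names are hypothetical, not the ReForm-Eval templates.
import string

def reformulate(question, options, answer_idx):
    letters = string.ascii_uppercase
    choices = "\n".join(f"({letters[i]}) {opt}" for i, opt in enumerate(options))
    prompt = f"{question}\n{choices}\nAnswer with the option letter only."
    return prompt, letters[answer_idx]

prompt, gold = reformulate(
    "What is the main object in the image?",
    ["a dog", "a cat", "a bicycle"],
    answer_idx=1,
)
print(prompt)  # prompt shown to the LMM alongside the image
print(gold)    # expected answer letter: "B"
```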
ACL 2023

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training

Zejun Li, Zhihao Fan, Jingjing Chen, Qi Zhang, Xuanjing Huang, Zhongyu Wei

[GitHub] [Pretrained Model]

  • We propose a method to transfer English vision-language models to multilingual scenarios without the need for multilingual image-text data.
  • Leveraging English text as anchors, our model is trained with English image-text pairs and English-to-others translation pairs.
  • Multi-modal ability learned from English image-text data can be transferred across languages with our method (a toy sketch of the anchoring idea is shown below).
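For intuition, the sketch below shows one way the "English as anchor" idea can be written as two contrastive losses that share the English text representation: one aligns images with English captions, the other aligns English captions with their translations. It is an illustrative assumption, not the paper's exact training objective; the encoders and batch structure are hypothetical.

```python
# Illustrative sketch (not the paper's exact objective): English text anchors
# both the image-text and the cross-lingual alignment losses.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i]."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def anchored_loss(image_emb, en_text_emb, translated_text_emb):
    loss_vl = info_nce(image_emb, en_text_emb)             # image <-> English caption
    loss_xl = info_nce(en_text_emb, translated_text_emb)   # English <-> translation
    return loss_vl + loss_xl
```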
ACM MM 2022

MVPTR: Multi-Level Semantic Alignment for Vision-Language Pre-Training via Multi-Stage Learning

Zejun Li, Huaixiao Tou, Jingjing Chen, Zhongyu Wei, Xuanjing Huang.

[GitHub] [Pretrained Model]

  • MVPTR is a vision-language pre-trained model with the ability to jointly model multi-level semantic alignment.
  • By integrating multi-grained information and multi-stage learning, MVPTR demonstrates superior performance on downstream tasks.

📖 Education

  • 2020.09 - 2026.06 (expected), Ph.D. Candidate in Statistics (Statistical Machine Learning), Fudan University, Shanghai, China.
  • 2016.09 - 2020.06, Bachelor of Computer Science, Fudan University, Shanghai, China.

💬 Teaching and Service

Teaching

  • Head TA for “Introduction to Artificial Intelligence”, Fudan University, Spring 2020, Spring 2021, and Fall 2022.
  • Organizer of the “NLP Training Camp of Fudan DISC”, Fudan University, Spring 2023.

Program Committee

  • Area Chair (Multi-Modality track): EMNLP 2024, NAACL 2025.
  • Reviewer: ACL, EMNLP, AAAI, IJCAI, CL, ACM MM, TKDD, NLPCC, CCL.

Invited Talks

  • “Evolution of the Vision-Language Pre-training Framework”, at the CSSNLP 2022 Student Seminar.
  • “How to Efficiently Conduct Team-based Research?”, at the SMP 2023 Student Seminar.
  • “Rethinking Research Direction Selection in Your Graduate Career”, at the YSSNLP 2024 Student Seminar.

💻 Internships

  • 2025.05 - present, research intern at the T-Star Lab, Taobao & Tmall Group, Alibaba, Beijing, China.
    • T-Star Lab is a premier internship program for top talent within the Taobao & Tmall Group.
    • We explore a novel adaptive visual reasoning paradigm, Mixture-of-Visual-Thoughts.
    • By unifying different reasoning modes within a single model and guiding it to adaptively select the appropriate mode based on the question, our AdaVaR model achieves consistent improvements across various domains.
  • 2021.09 - 2022.09, computer vision algorithm intern at the ByteDance E-Commerce Group, Shanghai, China.
    • We constructed a vision-language pre-trained model to encode video representations for Douyin Shop.
    • Our model was applied to video rating and recommendation, ultimately helping to improve content moderation, distribution efficiency, and recommendation click-through rates.
    • The research findings are summarized in our ACM MM 2022 paper.
  • 2019.06 - 2019.09, machine learning algorithm intern at Alibaba's Koubei Group, Hangzhou, China.
    • We developed an algorithm for distributing coupons to users based on their historical behaviors.
    • By combining deep models with behavioral features, we learned representations of both users and coupons, improving the coupon usage rate at the same cost.