agcfg

non-profit

Activity Feed

agcfg's activity

yilunzhao

authored 2 papers 2 days ago

HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

Paper • 2412.21199 • Published 25 days ago • 13

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Paper • 2501.12380 • Published 3 days ago • 73

entropyhu

authored a paper 2 days ago

MMVU: Measuring Expert-Level Multi-Discipline Video Understanding

Paper • 2501.12380 • Published 3 days ago • 73

yilunzhao

authored a paper 10 days ago

ChemAgent: Self-updating Library in Large Language Models Improves Chemical Reasoning

Paper • 2501.06590 • Published 13 days ago • 8

yilunzhao

authored 13 papers about 2 months ago

Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

Paper • 2311.09184 • Published Nov 15, 2023 • 1

Investigating Data Contamination in Modern Benchmarks for Large Language Models

Paper • 2311.09783 • Published Nov 16, 2023 • 2

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning

Paper • 2311.10537 • Published Nov 16, 2023 • 3

ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples

Paper • 2210.12374 • Published Oct 22, 2022

Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies

Paper • 2305.12586 • Published May 21, 2023

Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

Paper • 2408.11878 • Published Aug 20, 2024 • 54

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

Paper • 2212.07981 • Published Dec 15, 2022

ReIFE: Re-evaluating Instruction-Following Evaluation

Paper • 2410.07069 • Published Oct 9, 2024

TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models

Paper • 2410.23266 • Published Oct 30, 2024 • 20

M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models

Paper • 2411.04075 • Published Nov 6, 2024 • 16

yilunzhao

authored 3 papers about 1 year ago

DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data

Paper • 2311.09805 • Published Nov 16, 2023 • 3

QTSumm: A New Benchmark for Query-Focused Table Summarization

Paper • 2305.14303 • Published May 23, 2023

Large Language Models are Effective Table-to-Text Generators, Evaluators, and Feedback Providers

Paper • 2305.14987 • Published May 24, 2023 • 1

Team members 3
