CELLO: Causal Evaluation of Large Vision-Language Models (2024.06.27)
Meiqi Chen, Bo Peng, Yan Zhang, Chaochao Lu
PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation (2024.06.26)
Christoph Leiter, Steffen Eger
Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models (2024.06.24)
Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, etc
OR-Bench: An Over-Refusal Benchmark for Large Language Models (2024.05.31)
Justin Cui, Wei-Lin Chiang, I. Stoica, Cho-Jui Hsieh . - 【arXiv.org】
TimeChara: Evaluating Point-in-Time Character Hallucination of Role-Playing Large Language Models (2024.05.28)
Jaewoo Ahn, Taehyun Lee, Junyoung Lim, Jin-Hwa Kim, Sangdoo Yun, etc . - 【arXiv.org】
Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models (2024.05.23)
Abhishek Kumar, Sarfaroz Yunusov, Ali Emami . - 【arXiv.org】
HW-GPT-Bench: Hardware-Aware Architecture Benchmark for Language Models (2024.05.16)
R. Sukthanker, Arber Zela, B. Staffler, Jorg K. H. Franke, Frank Hutter . - 【arXiv.org】
Multimodal LLMs Struggle with Basic Visual Network Analysis: a VNA Benchmark (2024.05.10)
Evan M. Williams, K. Carley . - 【arXiv.org】
Vibe-Eval: A hard evaluation suite for measuring progress of multimodal language models (2024.05.03)
Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, etc . - 【arXiv.org】
Causal Evaluation of Language Models (2024.05.01)
Sirui Chen, Bo Peng, Meiqi Chen, Ruiqi Wang, Mengying Xu, etc . - 【arXiv.org】
IndicGenBench: A Multilingual Benchmark to Evaluate Generation Capabilities of LLMs on Indic Languages (2024.04.25)
Harman Singh, Nitish Gupta, Shikhar Bharadwaj, Dinesh Tewari, Partha Talukdar . - 【arXiv.org】
MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI (2024.04.24)
Kaining Ying, Fanqing Meng, Jin Wang, Zhiqiang Li, Han Lin, etc . - 【arXiv.org】
Evaluating LLMs at Detecting Errors in LLM Responses (2024.04.04)
Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, etc
Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers (2024.04.04)
Yuan Wang, Xuyang Wu, Hsin-Tai Wu, Zhiqiang Tao, Yi Fang
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models (2024.03.29)
Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, etc . - 【arXiv.org】
ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models (2024.03.08)
Jio Oh, Soyeon Kim, Junseok Seo, Jindong Wang, Ruochen Xu, etc
Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation (2024.03.05)
Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, etc
Beyond Specialization: Assessing the Capabilities of MLLMs in Age and Gender Estimation (2024.03.04)
Maksim Kuprashevich, Grigorii Alekseenko, Irina Tolstykh
A Cognitive Evaluation Benchmark of Image Reasoning and Description for Large Vision Language Models (2024.02.28)
Xiujie Song, Mengyue Wu, Ke Zhu, Chunhao Zhang, Yanyi Chen
Evaluating Very Long-Term Conversational Memory of LLM Agents (2024.02.27)
A. Maharana, Dong-Ho Lee, S. Tulyakov, Mohit Bansal, Francesco Barbieri, etc . - 【arXiv.org】
Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs (2024.02.21)
Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Hansheng Fang, Aishan Liu, etc . - 【arXiv.org】
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization (2024.02.20)
Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong, Jon Burnsky, Jake W. Vincent, etc
How Well Can LLMs Negotiate? NegotiationArena Platform and Analysis (2024.02.08)
Federico Bianchi, P. Chia, Mert Yüksekgönül, Jacopo Tagliabue, Daniel Jurafsky, etc . - 【arXiv.org】
Can Large Language Models Understand Context? (2024.02.01)
Yilun Zhu, Joel Ruben Antony Moniz, Shruti Bhargava, Jiarui Lu, Dhivya Piraviperumal, etc . - 【arXiv.org】
Evaluating Large Language Models for Generalization and Robustness via Data Compression (2024.02.01)
Yucheng Li, Yunhao Guo, Frank Guerin, Chenghua Lin . - 【arXiv.org】
PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models (2024.01.26)
Haochen Tan, Zhijiang Guo, Zhan Shi, Lu Xu, Zhili Liu, etc . - 【arXiv.org】
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks (2024.01.24)
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, etc . - 【arXiv.org】
Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4 (2023.12.26)
S. Bsharat, Aidar Myrzakhan, Zhiqiang Shen . - 【arXiv.org】
TouchStone: Evaluating Vision-Language Models by Language Models (2023.08.31)
Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xing Zhang, etc . - 【arXiv.org】
Shepherd: A Critic for Language Model Generation (2023.08.08)
Tianlu Wang, Ping Yu, Xiaoqing Tan, Sean O'Brien, Ramakanth Pasunuru, etc . - 【arXiv.org】
Self-consistency for open-ended generations (2023.07.11)
Siddhartha Jain, Xiaofei Ma, Anoop Deoras, Bing Xiang . - 【arXiv.org】
Jailbroken: How Does LLM Safety Training Fail? (2023.07.05)
Alexander Wei, Nika Haghtalab, J. Steinhardt . - 【arXiv.org】
Towards Measuring the Representation of Subjective Global Opinions in Language Models (2023.06.28)
Esin Durmus, Karina Nguyen, Thomas Liao, Nicholas Schiefer, Amanda Askell, etc . - 【arXiv.org】
On the Reliability of Watermarks for Large Language Models (2023.06.07)
John Kirchenbauer, Jonas Geiping, Yuxin Wen, Manli Shu, Khalid Saifullah, etc . - 【arXiv.org】
SETI: Systematicity Evaluation of Textual Inference (2023.05.24)
Xiyan Fu, Anette Frank
From Words to Wires: Generating Functioning Electronic Devices from Natural Language Descriptions (2023.05.24)
Peter Jansen
Testing the General Deductive Reasoning Capacity of Large Language Models Using OOD Examples (2023.05.24)
Abulhair Saparov, Richard Yuanzhe Pang, Vishakh Padmakumar, Nitish Joshi, Seyed Mehran Kazemi, etc
EvEval: A Comprehensive Evaluation of Event Semantics for Large Language Models (2023.05.24)
Zhengwei Tao, Zhi Jin, Xiaoying Bai, Haiyan Zhao, Yanlin Feng, etc
Eliciting the Translation Ability of Large Language Models via Multilingual Finetuning with Translation Instructions (2023.05.24)
Jiahuan Li, Hao Zhou, Shujian Huang, Shanbo Chen, Jiajun Chen
HuatuoGPT, towards Taming Language Model to Be a Doctor (2023.05.24)
Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, etc
Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark For Large Language Models (2023.05.24)
Daman Arora, Himanshu Gaurav Singh, Mausam
Is GPT-4 a Good Data Analyst? (2023.05.24)
Liying Cheng, Xingxuan Li, Lidong Bing
ImageNetVC: Zero-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories (2023.05.24)
Heming Xia, Qingxiu Dong, Lei Li, Jingjing Xu, Ziwei Qin, etc
Sentiment Analysis in the Era of Large Language Models: A Reality Check (2023.05.24)
Wenxuan Zhang, Yue Deng, Bing Liu, Sinno Jialin Pan, Lidong Bing
A RelEntLess Benchmark for Modelling Graded Relations between Named Entities (2023.05.24)
Asahi Ushio, Jose Camacho-Collados, S. Schockaert
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models (2023.05.24)
Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, etc
GPTAraEval: A Comprehensive Evaluation of ChatGPT on Arabic NLP (2023.05.24)
Md Tawkat Islam Khondaker, Abdul Waheed, El Moatez Billah Nagoudi, Muhammad Abdul-Mageed
Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback (2023.05.24)
Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, etc
PESCO: Prompt-enhanced Self Contrastive Learning for Zero-shot Text Classification (2023.05.24)
Yau-Shian Wang, Ta-Chung Chi, Ruohong Zhang, Yiming Yang
Evaluating NLG Evaluation Metrics: A Measurement Theory Perspective (2023.05.24)
Ziang Xiao, Susu Zhang, Vivian Lai, Q. Vera Liao
ByteSized32: A Corpus and Challenge Task for Generating Task-Specific World Models Expressed as Text Games (2023.05.24)
Ruoyao Wang, Graham Todd, Eric Yuan, Ziang Xiao, Marc-Alexandre Côté, etc
Estimating Large Language Model Capabilities without Labeled Test Data (2023.05.24)
Harvey Yiyun Fu, Qinyuan Ye, Albert Xu, Xiang Ren, Robin Jia
Faithful Low-Resource Data-to-Text Generation through Cycle Training (2023.05.24)
Zhuoer Wang, Marcus Collins, Nikhita Vedula, Simone Filice, Shervin Malmasi, etc
Large Language Models as Counterfactual Generator: Strengths and Weaknesses (2023.05.24)
Yongqi Li, Mayi Xu, Xin Miao, Shen Zhou, T. Qian
ChatGPT and Simple Linguistic Inferences: Blind Spots and Blinds (2023.05.24)
Victoria Basmov, Yoav Goldberg, Reut Tsarfaty
Measuring the Knowledge Acquisition-Utilization Gap in Pretrained Language Models (2023.05.24)
Amirhossein Kazemnejad, Mehdi Rezagholizadeh, Prasanna Parthasarathi, Sarath Chandar
Using Natural Language Explanations to Rescale Human Judgments (2023.05.24)
Manya Wadhwa, Jifan Chen, Junyi Jessy Li, Greg Durrett
Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models (2023.05.24)
Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin Choi, etc
ECHo: Event Causality Inference via Human-centric Reasoning (2023.05.24)
Yuxi Xie, Guanzhen Li, Min-Yen Kan
In-Context Demonstration Selection with Cross Entropy Difference (2023.05.24)
Dan Iter, Reid Pryzant, Ruochen Xu, Shuohang Wang, Yang Liu, etc
I Spy a Metaphor: Large Language Models and Diffusion Models Co-Create Visual Metaphors (2023.05.24)
Tuhin Chakrabarty, Arkadiy Saakyan, Olivia Winn, Artemis Panagopoulou, Yue Yang, etc
Gender Biases in Automatic Evaluation Metrics: A Case Study on Image Captioning (2023.05.24)
Haoyi Qiu, Zi-Yi Dou, Tianlu Wang, Asli Celikyilmaz, Nanyun Peng
Instructions as Backdoors: Backdoor Vulnerabilities of Instruction Tuning for Large Language Models (2023.05.24)
Jiashu Xu, Mingyu Derek Ma, Fei Wang, Chaowei Xiao, Muhao Chen
Can Transformers Learn to Solve Problems Recursively? (2023.05.24)
Shizhuo Dylan Zhang, Curt Tigges, Stella Rose Biderman, M. Raginsky, T. Ringer
Enabling Large Language Models to Generate Text with Citations (2023.05.24)
Tianyu Gao, Howard Yen, Jiatong Yu, Danqi Chen
Attentiveness to Answer Choices Doesn't Always Entail High QA Accuracy (2023.05.24)
Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord, Peter Clark, Ashish Sabharwal
WikiChat: A Few-Shot LLM-Based Chatbot Grounded with Wikipedia (2023.05.23)
Sina J. Semnani, Violet Z. Yao, Heidi C. Zhang, M. Lam
Efficient Open Domain Multi-Hop Question Answering with Few-Shot Data Synthesis (2023.05.23)
Mingda Chen, Xilun Chen, Wen-tau Yih
Learn from Mistakes through Cooperative Interaction with Study Assistant (2023.05.23)
Danqing Wang, Lei Li
RET-LLM: Towards a General Read-Write Memory for Large Language Models (2023.05.23)
Ali Modarressi, Ayyoob Imani, Mohsen Fayyaz, Hinrich Schütze
ZeroSCROLLS: A Zero-Shot Benchmark for Long Text Understanding (2023.05.23)
Uri Shaham, Maor Ivgi, Avia Efrat, Jonathan Berant, Omer Levy
Evaluating Factual Consistency of Summaries with Large Language Models (2023.05.23)
Shiqi Chen, Siyang Gao, Junxian He
Polyglot or Not? Measuring Multilingual Encyclopedic Knowledge Retrieval from Foundation Language Models (2023.05.23)
Tim Schott, Daniel Furman, Shreshta Bhat
Detecting and Mitigating Indirect Stereotypes in Word Embeddings (2023.05.23)
Erin George, Joyce Chew, Deanna Needell
PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents (2023.05.23)
Simeng Sun, Yang Liu, Shuohang Wang, Chenguang Zhu, Mohit Iyyer
Unraveling ChatGPT: A Critical Analysis of AI-Generated Goal-Oriented Dialogues and Annotations (2023.05.23)
Tiziano Labruna, Sofia Brenna, Andrea Zaninello, Bernardo Magnini
Sources of Hallucination by Large Language Models on Inference Tasks (2023.05.23)
Nick McKenna, Tianyi Li, Liang Cheng, Mohammad Javad Hosseini, Mark Johnson, etc
LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond (2023.05.23)
Philippe Laban, Wojciech Kryściński, Divyansh Agarwal, Alexander R. Fabbri, Caiming Xiong, etc
Automatic Model Selection with Large Language Models for Reasoning (2023.05.23)
Xu Zhao, Yuxi Xie, Kenji Kawaguchi, Junxian He, Qizhe Xie
Cascaded Beam Search: Plug-and-Play Terminology-Forcing For Neural Machine Translation (2023.05.23)
Frédéric Odermatt, Béni Egressy, Roger Wattenhofer
CREATOR: Disentangling Abstract and Concrete Reasonings of Large Language Models through Tool Creation (2023.05.23)
Cheng Qian, Chi Han, Yi R. Fung, Yujia Qin, Zhiyuan Liu, etc
Deduction under Perturbed Evidence: Probing Student Simulation Capabilities of Large Language Models (2023.05.23)
Shashank Sonkar, Richard Baraniuk
Prompt position really matters in few-shot and zero-shot NLU tasks (2023.05.23)
Junyu Mao, Stuart E. Middleton, Mahesan Niranjan
CGCE: A Chinese Generative Chat Evaluation Benchmark for General and Financial Domains (2023.05.23)
Xuanyu Zhang, Bingbing Li, Qing Yang
Dancing Between Success and Failure: Edit-level Simplification Evaluation using SALSA (2023.05.23)
David Heineman, Yao Dou, Mounica Maddela, Wei Xu
Pre-training Language Models for Comparative Reasoning (2023.05.23)
Mengxia Yu, Zhihan Zhang, Wenhao Yu, Meng Jiang
Having Beer after Prayer? Measuring Cultural Bias in Large Language Models (2023.05.23)
Tarek Naous, Michael J. Ryan, Wei Xu
On Robustness of Finetuned Transformer-based NLP Models (2023.05.23)
Pavan Kalyan Reddy Neerudu, Subba Reddy Oota, Mounika Marreddy, Venkateswara Rao Kagita, Manish Gupta
Is Information Extraction Solved by ChatGPT? An Analysis of Performance, Evaluation Criteria, Robustness and Errors (2023.05.23)
Ridong Han, T. Peng, Chaohao Yang, Benyou Wang, Lu Liu, etc
Exploring Contrast Consistency of Open-Domain Question Answering Systems on Minimally Edited Questions (2023.05.23)
Zhihan Zhang, Wenhao Yu, Zheng Ning, Mingxuan Ju, Meng Jiang
Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction (2023.05.23)
Yew Ken Chia, Hui Chen, Wei Han, Guizhen Chen, Sharifah Mahani Aljunied, etc
TaDSE: Template-aware Dialogue Sentence Embeddings (2023.05.23)
Minsik Oh, Jiwei Li, Guoyin Wang
WebIE: Faithful and Robust Information Extraction on the Web (2023.05.23)
Chenxi Whitehouse, Clara Vania, Alham Fikri Aji, Christos Christodoulopoulos, Andrea Pierleoni
Exploring Representational Disparities Between Multilingual and Bilingual Translation Models (2023.05.23)
Neha Verma, Kenton Murray, Kevin Duh
How Old is GPT?: The HumBEL Framework for Evaluating Language Models using Human Demographic Data (2023.05.23)
Anthony Sicilia, Jennifer C. Gates, Malihe Alikhani
Out-of-Distribution Generalization in Text Classification: Past, Present, and Future (2023.05.23)
Linyi Yang, Y. Song, Xuan Ren, Chenyang Lyu, Yidong Wang, etc
Target-Agnostic Gender-Aware Contrastive Learning for Mitigating Bias in Multilingual Machine Translation (2023.05.23)
Minwoo Lee, Hyukhun Koh, Kang-il Lee, Dongdong Zhang, Minsung Kim, etc
LLM-powered Data Augmentation for Enhanced Crosslingual Performance (2023.05.23)
Chenxi Whitehouse, Monojit Choudhury, Alham Fikri Aji
FActScore: Fine-grained Atomic Evaluation of Factual Precision in Long Form Text Generation (2023.05.23)
Sewon Min, Kalpesh Krishna, Xinxi Lyu, M. Lewis, Wen-tau Yih, etc
On Learning to Summarize with Large Language Models as References (2023.05.23)
Yixin Liu, Alexander R. Fabbri, Pengfei Liu, Dragomir Radev, Arman Cohan
Reducing Sensitivity on Speaker Names for Text Generation from Dialogues (2023.05.23)
Qi Jia, Haifeng Tang, Kenny Q. Zhu
Self-Critique Prompting with Large Language Models for Inductive Instructions (2023.05.23)
Rui Wang, Hongru Wang, Fei Mi, Yi Chen, Ruifeng Xu, etc
ChipGPT: How far are we from natural language hardware design (2023.05.23)
Kaiyan Chang, Ying Wang, Haimeng Ren, Mengdi Wang, Shengwen Liang, etc
Prompt-Based Monte-Carlo Tree Search for Goal-Oriented Dialogue Policy Planning (2023.05.23)
Xiao Yu, Maximillian Chen, Zhou Yu
Improving Self-training for Cross-lingual Named Entity Recognition with Contrastive and Prototype Learning (2023.05.23)
Ran Zhou, Xin Li, Lidong Bing, E. Cambria, Chun Miao
Better Low-Resource Entity Recognition Through Translation and Annotation Fusion (2023.05.23)
Yang Chen, Vedaant Shah, Alan Ritter
Cross-functional Analysis of Generalisation in Behavioural Learning (2023.05.22)
Pedro Henrique Luz de Araujo, Benjamin Roth
Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations (2023.05.22)
Chenglei Si, Dan Friedman, Nitish Joshi, Shi Feng, Danqi Chen, etc
LM vs LM: Detecting Factual Errors via Cross Examination (2023.05.22)
Roi Cohen, May Hamri, Mor Geva, A. Globerson
Jesus Solano, Oana-Maria Camburu, Pasquale Minervini
SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation (2023.05.22)
Elizabeth Clark, Shruti Rijhwani, Sebastian Gehrmann, Joshua Maynez, Roee Aharoni, etc
Kanbun-LM: Reading and Translating Classical Chinese in Japanese Methods by Language Models (2023.05.22)
Hao Wang, Hirofumi Shimizu, Daisuke Kawahara
Beyond Labels: Empowering Human with Natural Language Explanations through a Novel Active-Learning Architecture (2023.05.22)
Bingsheng Yao, Ishan Jindal, Lucian Popa, Yannis Katsis, Sayan Ghosh, etc
llm-japanese-dataset v0: Construction of Japanese Chat Dataset for Large Language Models and its Methodology (2023.05.22)
Masanori Hirano, Masahiro Suzuki, Hiroki Sakaji
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback (2023.05.22)
Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, etc
Multilingual Holistic Bias: Extending Descriptors and Patterns to Unveil Demographic Biases in Languages at Scale (2023.05.22)
M. Costa-jussà, Pierre Yves Andrews, Eric J. M. Smith, Prangthip Hansanti, Christophe Ropers, etc
Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design (2023.05.22)
Ibrahim M. Alabdulmohsin, Xiaohua Zhai, Alexander Kolesnikov, L. Beyer
MultiTabQA: Generating Tabular Answers for Multi-Table Question Answering (2023.05.22)
Vaishali Pal, Andrew Yates, E. Kanoulas, M. de Rijke
Cross-lingual Transfer Can Worsen Bias in Sentiment Analysis (2023.05.22)
Seraphina Goldfarb-Tarrant, Björn Ross, Adam Lopez
Model Analysis & Evaluation for Ambiguous Question Answering (2023.05.21)
Konstantinos Papakostas, Irene Papadopoulou
TheoremQA: A Theorem-driven Question Answering dataset (2023.05.21)
Wenhu Chen, Ming Yin, Max Ku, Yixin Wan, Xueguang Ma, etc
Evaluating the Performance of Large Language Models on GAOKAO Benchmark (2023.05.21)
Xiaotian Zhang, Chun-yan Li, Yi Zong, Zhengyu Ying, Liang He, etc
Can NLP Models Correctly Reason Over Contexts that Break the Common Assumptions? (2023.05.20)
Neeraj Varshney, Mihir Parmar, Nisarg Patel, Divij Handa, Sayantan Sarkar, etc
Evaluation of medium-large Language Models at zero-shot closed book generative question answering (2023.05.19)
René Peinl, Johannes Wirth
Separating form and meaning: Using self-consistency to quantify task understanding across multiple senses (2023.05.19)
Xenia Ohmer, Elia Bruni, Dieuwke Hupkes . - 【arXiv.org】
OPT-R: Exploring the Role of Explanations in Finetuning and Prompting for Reasoning Skills of Large Language Models (2023.05.19)
Badr AlKhamissi, Siddharth Verma, Ping Yu, Zhijing Jin, Asli Celikyilmaz, etc
Examining the Inter-Consistency of Large Language Models: An In-depth Analysis via Debate (2023.05.19)
Kai Xiong, Xiao Ding, Yixin Cao, Ting Liu, Bing Qin . - 【arXiv.org】
TELeR: A General Taxonomy of LLM Prompts for Benchmarking Complex Tasks (2023.05.19)
Shubhra Kanti Karmaker Santu, Dongji Feng . - 【arXiv.org】
Efficient Prompting via Dynamic In-Context Learning (2023.05.18)
Wangchunshu Zhou, Yuchen Jiang, Ryan Cotterell, Mrinmaya Sachan . - 【arXiv.org】
Qualifying Chinese Medical Licensing Examination with Knowledge Enhanced Generative Pre-training Model (2023.05.17)
Jiageng Wu, X. Wu, Zhaopeng Qiu, Minghui Li, Yefeng Zheng, etc . - 【arXiv.org】
Can Language Models Solve Graph Problems in Natural Language? (2023.05.17)
Heng Wang, Shangbin Feng, Tianxing He, Zhaoxuan Tan, Xiaochuang Han, etc . - 【arXiv.org】
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models (2023.04.13)
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, etc
GPTEval: NLG Evaluation using GPT-4 with Better Human Alignment (2023.03.29)
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, etc
How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks (2023.03.01)
Xuanting Chen, Junjie Ye, Can Zu, Nuo Xu, Rui Zheng, etc . - 【ArXiv】
Bounding the Capabilities of Large Language Models in Open Text Generation with Prompt Constraints (2023.02.17)
Albert Lu, Hongxin Zhang, Yanzhe Zhang, Xuezhi Wang, Diyi Yang . - 【ArXiv】
Evaluating the Robustness of Discrete Prompts (2023.02.11)
Yoichi Ishibashi, D. Bollegala, Katsuhito Sudoh, Satoshi Nakamura . - 【ArXiv】
Controlling for Stereotypes in Multimodal Language Model Evaluation (2023.02.03)
Manuj Malik, Richard Johansson . - 【BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP】
Large Language Models Can Be Easily Distracted by Irrelevant Context (2023.01.31)
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, etc . - 【ArXiv】
Emergent Analogical Reasoning in Large Language Models (2022.12.19)
Taylor W. Webb, K. Holyoak, Hongjing Lu . - 【ArXiv】
Discovering Language Model Behaviors with Model-Written Evaluations (2022.12.19)
Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, etc . - 【ArXiv】
Constitutional AI: Harmlessness from AI Feedback (2022.12.15)
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, John Kernion, etc . - 【ArXiv】
On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning (2022.12.15)
Omar Shaikh, Hongxin Zhang, William B. Held, Michael Bernstein, Diyi Yang . - 【ArXiv】
Demystifying Prompts in Language Models via Perplexity Estimation (2022.12.08)
Hila Gonen, Srini Iyer, Terra Blevins, Noah A. Smith, Luke Zettlemoyer . - 【ArXiv】
Solving math word problems with process- and outcome-based feedback (2022.11.25)
J. Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, etc . - 【ArXiv】
Holistic Evaluation of Language Models (2022.11.16)
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, etc . - 【ArXiv】
Can language models handle recursively nested grammatical structures? A case study on comparing models and humans (2022.10.27)
Andrew Kyle Lampinen . - 【ArXiv】
Prompting GPT-3 To Be Reliable (2022.10.17)
Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, etc . - 【ArXiv】
An Interpretability Evaluation Benchmark for Pre-trained Language Models (2022.07.28)
Ya-Ming Shen, Lijie Wang, Ying Chen, Xinyan Xiao, Jing Liu, etc . - 【ArXiv】
Re-Examining Calibration: The Case of Question Answering (2022.05.25)
Chenglei Si, Chen Zhao, Sewon Min, Jordan L. Boyd-Graber . - 【Conference on Empirical Methods in Natural Language Processing】
Evaluating the Impact of Model Scale for Compositional Generalization in Semantic Parsing (2022.05.24)
Linlu Qiu, Peter Shaw, Panupong Pasupat, Tianze Shi, Jonathan Herzig, etc . - 【Conference on Empirical Methods in Natural Language Processing】
The Unreliability of Explanations in Few-shot Prompting for Textual Reasoning (2022.05.06)
Xi Ye, Greg Durrett
Training Verifiers to Solve Math Word Problems (2021.10.27)
Karl Cobbe, V. Kosaraju, Mohammad Bavarian, Jacob Hilton, Reiichiro Nakano, etc . - 【ArXiv】
BBQ: A hand-built bias benchmark for question answering (2021.10.15)
Alicia Parrish, Angelica Chen, Nikita Nangia, Vishakh Padmakumar, Jason Phang, etc . - 【Findings】
BARTScore: Evaluating Generated Text as Text Generation (2021.06.22)
Weizhe Yuan, Graham Neubig, Pengfei Liu . - 【Neural Information Processing Systems】
Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning (2021.06.13)
Bill Yuchen Lin, Seyeon Lee, Xiaoyang Qiao, Xiang Ren . - 【Annual Meeting of the Association for Computational Linguistics】
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity (2021.04.18)
Yao Lu, Max Bartolo, Alastair Moore, S. Riedel, Pontus Stenetorp . - 【Annual Meeting of the Association for Computational Linguistics】