EVALUATING LARGE LANGUAGE MODELS AT EVALUATING INSTRUCTION FOLLOWING

Machine Learning/MLLM 2024. 11. 28. 19:58

0. Abstract

Instruction Following Assessment

A metric that gauges how closely generated text adheres to the given instruction

LLMBar

419 pairs of outputs
1. one adhering to the instruction
2. the other diverging from it, yet possibly possessing deceptive qualities that mislead an LLM evaluator (e.g. a more engaging tone)

Experiment

Evaluators (LLM + prompt combinations) exhibit different performance, and even the highest-scoring ones have substantial room for improvement

Novel suite of prompting strategies

1. INTRODUCTION

Human evaluation is regarded as the gold standard for assessing the conversational ability of LLMs
- not scalable & not reproducible
- *The perils of using Mechanical Turk to evaluate open-ended text generation (Karpinska et al., 2021)*
- LLM evaluators have emerged
LLM Evaluator = strong base LLM + prompting strategy
- Typically selects the preferred output between the outputs generated by two models
- A meta-evaluation benchmark is needed to judge whether LLM evaluators can be trusted and which evaluator to use

How should we construct a good meta-evaluation benchmark?

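Previous work: random sampling + crowdsourced labels
This strategy overlooks the inherent subjectivity of human preference
Figure 1 (top): the label reflects a personal preference for the longer output even though there is no clear quality difference
- This shows up as low agreement between human annotators (AlpacaFarm 66%, MT-bench 63%)
Such labels can hardly be said to evaluate objective properties such as instruction following or factual correctness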

Instruction Following (Objective criterion)

Instruction following: the ability to accurately interpret an instruction and faithfully follow its stated requirements
It is tied to desirable LLM properties such as helpfulness
- *(Askell et al., 2021) A general language assistant as a laboratory for alignment*
Properties such as an engaging tone can easily be acquired through imitation learning
- (Gudibande et al., 2023) The false promise of imitating proprietary llms
However, even today's LLMs struggle to follow instructions
- (Wu et al., 2023b) Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks
Figure 1 (bottom): Instruction Following vs Superficial Quality
- The output on the right follows the instruction well, but both LLM evaluators and humans tend to prefer the left one because of its engaging tone

✅ Instruction Following

Objective
Correctness
Adheres to the instruction
executing desired tasks

❌ Superficial Quality

Subjective
polished, engaging tone
better format
humans and LLMs are biased toward it

If we do not rigorously analyze the capability of LLM evaluators to distinguish between the true ability of instruction following and superficial clues, there is a risk of advancing models that excel in mimicking effective assistants rather than executing desired tasks.

→ The LLM used as an evaluator should be one that can tell whether an output actually follows the instruction or merely looks good on the surface

LLMBar (419 pairs)

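Tests whether an LLM evaluator makes the same choice as the label, and thereby whether it clears the bar
Differences from previous meta-evaluations
- Every instance was reviewed by the authors
- Focuses on instruction following and emphasizes objective preference / 94% annotator agreement
- Provides a NATURAL set and an ADVERSARIAL set
  - NATURAL: follows a real-world distribution
  - ADVERSARIAL: adversarially crafted instances that confound less adept evaluators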

Experiment

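5 LLMs: GPT-4, ChatGPT, LLaMA-2-Chat, PaLM2, Falcon
On the ADVERSARIAL set, ChatGPT, LLaMA-2-Chat, and Falcon all perform below chance
- Even GPT-4 falls well short of humans
Various prompting strategies are tested
- Combining prompting strategies substantially improves evaluator performance
- The most effective strategy improves GPT-4's performance by 10%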

2. LLMBAR: A META-EVALUATION BENCHMARK

Composition
- $I$: input instruction
- $O_1$, $O_2$: outputs
- $p \in \{1, 2\}$: gold preference label

{
    "input": "Infer the implied meaning of the following sentence: She is not what she used to be.",
    "output_1": "She is not as she once was.",
    "output_2": "She has changed substantially over time.",
    "label": 2
}
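
As a rough illustration of this format, the sketch below parses one instance and checks whether an evaluator's choice matches the gold label. The `Instance` class and the `parse_instance` and `is_correct` helpers are names chosen here for illustration; they are not taken from LLMBar's released code.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One LLMBar-style instance: instruction I, outputs O_1 and O_2, gold label p."""
    instruction: str
    output_1: str
    output_2: str
    label: int  # 1 or 2


def parse_instance(raw: str) -> Instance:
    """Parse a JSON string that uses the field names shown in the example above."""
    data = json.loads(raw)
    if data["label"] not in (1, 2):
        raise ValueError("label must be 1 or 2")
    return Instance(data["input"], data["output_1"], data["output_2"], data["label"])


def is_correct(instance: Instance, predicted: int) -> bool:
    """An evaluator is correct when its preference equals the gold label."""
    return predicted == instance.label


raw = '''{
    "input": "Infer the implied meaning of the following sentence: She is not what she used to be.",
    "output_1": "She is not as she once was.",
    "output_2": "She has changed substantially over time.",
    "label": 2
}'''
example = parse_instance(raw)
```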

2.1 THE NATURAL SET

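Output pairs are randomly sampled from AlpacaFarm and LLMEval2
The original labels often reflect subjective preference rather than an objective quality difference → filtered
- Instances without an objective preference are modified or discarded
- Every remaining instance has an objectively better output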

2.2 THE ADVERSARIAL SET

Designed to stress test LLM evaluators

Generate challenging candidate instances
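$O_1$ follows the instruction, while $O_2$ does not follow $I$ but appears to be of higher quality
Adversarial filtering
Eight preference labels are collected from 4 GPT evaluators from AlpacaFarm × 2 presentation orders ($O_1O_2$, $O_2O_1$); instances where the majority of labels match the gold label are excluded, followed by manual filtering and modification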
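
The filtering rule can be sketched as a simple vote count. This is a minimal sketch under assumptions: the function name is hypothetical, the labels are assumed to be already mapped back to the original $O_1$/$O_2$ indexing, and "majority" is read as strictly more than half, so a 4-4 tie keeps the instance. The notes do not say how ties were handled.

```python
from typing import Sequence


def survives_adversarial_filter(predicted_labels: Sequence[int], gold_label: int) -> bool:
    """Keep a candidate only if the evaluator panel does NOT mostly agree with the gold label."""
    correct_votes = sum(1 for p in predicted_labels if p == gold_label)
    # Assumption: "majority" means strictly more than half of the votes.
    majority_correct = correct_votes > len(predicted_labels) / 2
    return not majority_correct


# 4 evaluators x 2 presentation orders = 8 labels per candidate (toy values).
easy_case = [1, 1, 1, 1, 1, 1, 2, 1]  # 7 of 8 match gold label 1, so it is filtered out
hard_case = [2, 2, 1, 2, 2, 1, 2, 2]  # 2 of 8 match gold label 1, so it is kept
```

Candidates that survive this step would still go through the manual filtering and modification mentioned above.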

2.3 Four Different Strategies to collect candidate instances.

$I$ is sampled from Alpaca, OpenAssistant, and ShareGPT
$O_1$ is generated by an instruction-tuned LLaMA-7B or taken from the dataset

Neighbor Instruction (NEIGHBOR) [134]

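For a given $I \in D$, retrieve a similar but different instruction $I'$ from the same dataset $D$
- Cosine similarity is measured with the sentence embedding model INSTRUCTOR
weaker model + $I$ → $O_1$
stronger model + $I'$ → $O_2$
- $O_2$ looks better on the surface but does not follow instruction $I$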
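
The retrieval step might look roughly like the sketch below. It uses hand-written toy vectors instead of real INSTRUCTOR embeddings, and the `find_neighbor` helper and its `max_sim` cutoff for skipping near-duplicates are assumptions made for illustration, not details given in the notes.

```python
import math
from typing import Dict, List, Optional


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def find_neighbor(query: str, embeddings: Dict[str, List[float]],
                  max_sim: float = 0.99) -> Optional[str]:
    """Return the instruction most similar to `query`, skipping itself and near-duplicates."""
    best, best_sim = None, -1.0
    for candidate, vec in embeddings.items():
        if candidate == query:
            continue
        sim = cosine_similarity(embeddings[query], vec)
        # Hypothetical cutoff: "similar but different" is approximated by sim < max_sim.
        if best_sim < sim < max_sim:
            best, best_sim = candidate, sim
    return best


# Toy embeddings standing in for the output of a sentence embedding model.
toy_embeddings = {
    "Write a poem about the sea.": [0.9, 0.1, 0.0],
    "Write a story about the sea.": [0.8, 0.3, 0.0],
    "Sort this list of numbers.": [0.0, 0.1, 0.9],
}
neighbor = find_neighbor("Write a poem about the sea.", toy_embeddings)
```

Here the retrieved $I'$ is the story instruction: a strong model answering it would produce a fluent text about the sea that nonetheless is not a poem.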

GPT-4 Instructions (GPTINST) [92]

$I'$ is generated with GPT-4 / $O_2$ is generated with ChatGPT
- Prompt
- Given a user input (called “given input”), please generate a new user input (called “generated input”) such that: (1) The generated input is highly relevant to but different from the given input. (2) The correct response to the generated input superficially resembles the correct response to the given input as much as possible. (3) But actually, the correct response to the generated input should not be a correct response to the given input. Given input: {Instruction}
- The $I'$ generated by GPT-4 tended to replace specific phrases in $I$ with related words

GPT-4 Unhelpful Outputs (GPTOUT) [47]

GPT-4 is asked directly to produce an output $O_2$ for $I$ that looks good on the surface but is unhelpful or incorrect
- prompt
- ## Instruction: You are an assistant that seems to correctly respond to the input, but in reality, your response is not genuinely helpful. Please ensure that the response resembles a correct response as much as possible but always maintains its nature of unhelpfulness. Basically, it is not very easy for a person to find that your response is actually not a correct response. Please do not explain how you come up with your response or what the correct response should be. Please just give the required response without any extra words. ## Input: {Instruction}
- This is a challenging task even for GPT-4; in most cases $O_2$ is either correct or obviously wrong
- Limitation: outputs generated by GPT-4 may give GPT-4-based evaluators an advantage

Manual Construction [46]

Constructed manually

3. PROMPTING STRATEGIES FOR LLM EVALUATORS

Introduces prompting strategies validated on LLMBar
Includes 3 new strategies

Vanilla

(Dubois et al., 2023) Alpacafarm: A simulation framework for methods that learn from human feedback

Simply outputs the preference
Zero-shot by default; few-shot was also tried but made little difference

Chain-of-Thoughts (CoT)

The LLM first generates a concise rationale before producing its preference

Self-Generated Reference

(Zheng et al., 2023) Judging llm-as-a-judge with mt-bench and chatbot arena

The evaluator first generates its own output for $I$, which is then used as a reference during the comparison.

ChatEval

(Chan et al., 2023) Chateval: Towards better llm-based evaluators through multi-agent debate

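Multiple LLM evaluators are each personalized with a different role prompt
They debate their preferences with one another
Each evaluator gives its final preference based on the context of the debate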

Rules

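The prompt explicitly lists general rules the evaluator should follow when comparing.
- “Prioritize evaluating whether the output honestly executes the instruction.”
Rules improves evaluator accuracy in almost all cases and can easily be applied to other prompting strategies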

Self-Generated Metrics (Metrics)

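Making explicit the criteria that define a good output for a specific instruction helps evaluation
To this end, the LLM is asked to generate evaluation metrics tailored to the instruction, which are then used during evaluation
This steers the evaluator toward specific aspects of instruction following
- Combines naturally with Self-Generated Reference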
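
One way to picture how Rules, Metrics, and Reference fit together is a three-stage pipeline. The sketch below is hypothetical throughout: `LLM` stands in for any text-in, text-out model call, and all prompt wording (apart from the rule quoted in the Rules section) is invented for illustration rather than taken from the paper.

```python
from typing import Callable

LLM = Callable[[str], str]  # hypothetical stand-in for a model call

RULES = "Prioritize evaluating whether the output honestly executes the instruction."


def build_eval_prompt(llm: LLM, instruction: str, output_1: str, output_2: str) -> str:
    """Assemble a comparison prompt that combines Rules, Metrics, and Reference."""
    # Stage 1 (Metrics): the LLM writes instruction-specific evaluation criteria.
    metrics = llm(f"List the criteria a good output must meet for this instruction:\n{instruction}")
    # Stage 2 (Reference): the LLM writes its own output for the instruction.
    reference = llm(instruction)
    # Stage 3: everything is placed in the final comparison query.
    return "\n\n".join([
        f"Rules: {RULES}",
        f"Instruction: {instruction}",
        f"Metrics: {metrics}",
        f"Reference output: {reference}",
        f"Output (a): {output_1}",
        f"Output (b): {output_2}",
        "Which output is better? Answer with (a) or (b).",
    ])
```

Because each strategy only adds a section to the prompt, dropping or adding one (for example, Rules alone) is a one-line change in this sketch.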

Swap and Synthesize (Swap)

(Du et al., 2023) Improving factuality and reasoning in language models through multiagent debate.

Addresses positional bias
First obtain the evaluator's preference with CoT under both orders, $O_1, O_2$ and $O_2, O_1$, then synthesize the two CoTs into a final decision

4. EXPERIMENTS

RQ 1. How do different LLMs and prompting strategies affect performance on LLMBar?
RQ 2. How does LLMBar differ from other meta-evaluation datasets?

4.1 Experimental Setup

Proprietary Models: GPT-4, ChatGPT
Open-source Models: LLaMA-2-70B-Chat, Falcon-180B-Chat

4.2 Human Agreement on LLMBar

80 instances are sampled from LLMBar and assigned to two of the paper's authors (excluding the authors who curated the data)
Agreement rate of 94% (NATURAL 90%, ADVERSARIAL 95%)
- FairEval 71.7% / MT-Bench 63%

4.3 LLM Evaluator Performance on LLMBAR

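Each evaluator is queried twice
- Acc.: average accuracy
- Agr.: positional agreement rate → the fraction of instances where the label stays consistent after swapping the presentation order of the two outputs
Falcon has severe positional bias; with CoT its agreement rate is 12%, so it is excluded from the graph
LLM evaluators perform far worse than humans
- On the ADVERSARIAL set, ChatGPT, LLaMA2, and Falcon perform below chance
- GPT-4 reaches high accuracy, but its 82.8% average accuracy is still 10% lower than humans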
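
The two metrics can be computed as in the sketch below. It assumes that Acc. is averaged over the two presentation orders and that a label returned under the swapped order refers to positions, so it must be mapped back before comparison; the function names and toy results are made up for illustration.

```python
from typing import List, Tuple


def unswap(label: int) -> int:
    """Map a positional label given under the order (O_2, O_1) back to the original indexing."""
    return 3 - label


def acc_and_agr(results: List[Tuple[int, int, int]]) -> Tuple[float, float]:
    """Each result is (gold, prediction in original order, raw prediction in swapped order)."""
    correct = 0
    consistent = 0
    for gold, pred_orig, pred_swapped_raw in results:
        pred_swapped = unswap(pred_swapped_raw)
        correct += int(pred_orig == gold) + int(pred_swapped == gold)
        consistent += int(pred_orig == pred_swapped)
    n = len(results)
    return correct / (2 * n), consistent / n


results = [
    (1, 1, 2),  # prefers O_1 in both orders: correct twice, consistent
    (2, 2, 2),  # always picks the second position: correct once, inconsistent
    (1, 2, 1),  # prefers O_2 in both orders: wrong twice, consistent
    (2, 1, 1),  # always picks the first position: correct once, inconsistent
]
acc, agr = acc_and_agr(results)
```

An evaluator that always picks the same position, like the second and fourth toy rows, scores 50% accuracy with 0% agreement, which is why a very low Agr. signals positional bias rather than real preference.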

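Combining Rules, Metrics, and Reference (Metrics+Reference*) yields consistent improvements
- Each strategy helps on its own, and combining them improves performance by about 10%
- Contrary to common belief, CoT does not improve LLM evaluators on ADVERSARIAL
  - The reasoning generated with CoT often shows a strong bias toward the superficially better output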

4.4 Comparison to other meta-evaluations of LLM Evaluators

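LLMBar shows clear gaps between LLMs, and the difference between the vanilla prompt and the other prompting strategies is also pronounced

→ LLMBar is better at measuring how well an evaluator discerns instruction following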

5. CONCLUSION

LLMBar focuses on ‘objective’ quality differences between outputs
However, other properties such as factual correctness and non-toxicity should also be considered
