reproduction issue with qwen2.5-7B-Instruct #7

Open
hzy312 opened this issue Dec 17, 2024 · 6 comments

hzy312 commented Dec 17, 2024

When I use GPT-4o, it generates a critique first and then gives the final prediction, like this:

{"id": "gsm8k-0", "generator": "Qwen2-7B-Instruct", "problem": "Sue lives in a fun neighborhood. One weekend, the neighbors decided to play a prank on Sue. On Friday morning, the neighbors placed 18 pink plastic flamingos out on Sue's front yard. On Saturday morning, the neighbors took back one third of the flamingos, painted them white, and put these newly painted white flamingos back out on Sue's front yard. Then, on Sunday morning, they added another 18 pink plastic flamingos to the collection. At noon on Sunday, how many more pink plastic flamingos were out than white plastic flamingos?", "steps": ["To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.", "On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, \(1/3 \times 18 = 6\) flamingos are taken back. So, they have \(18 - 6 = 12\) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has \(12 + 6 = 18\) pink flamingos and 6 white flamingos.", "On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has \(18 + 18 = 36\) pink flamingos and still 6 white flamingos.", "To find the difference, subtract the number of white flamingos from the number of pink flamingos: \(36 - 6 = 30\). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. 
The answer is \(\boxed{30}\)."], "final_answer_correct": false, "label": 1, "generated_critique": "Let's review the solution paragraph by paragraph:\n\n<paragraph_0>\nTo find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.\n</paragraph_0>\nThis paragraph correctly states the initial number of pink flamingos. No errors here.\n\n<paragraph_1>\nOn Saturday, they take back one third of the flamingos. Since there were 18 flamingos, \(1/3 \times 18 = 6\) flamingos are taken back. So, they have \(18 - 6 = 12\) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has \(12 + 6 = 18\) pink flamingos and 6 white flamingos.\n</paragraph_1>\nThis paragraph correctly calculates the number of flamingos taken back, painted, and returned. It also correctly states the number of pink and white flamingos by the end of Saturday. No errors here.\n\n<paragraph_2>\nOn Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has \(18 + 18 = 36\) pink flamingos and still 6 white flamingos.\n</paragraph_2>\nThis paragraph correctly adds the additional pink flamingos on Sunday and states the correct totals. No errors here.\n\n<paragraph_3>\nTo find the difference, subtract the number of white flamingos from the number of pink flamingos: \(36 - 6 = 30\). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is \(\boxed{30}\).\n</paragraph_3>\nThis paragraph correctly calculates the difference between the number of pink and white flamingos. 
No errors here.\n\nSince there are no errors in any of the paragraphs, the final answer is \(\boxed{-1}\).", "prediction": -1, "match": false}

But when I use Qwen2.5-7B-Instruct, it generates the answer directly, without any critique, and the final F1 score is very different from the score in the original paper. The trajectory looks like this:

{"id": "gsm8k-0", "generator": "Qwen2-7B-Instruct", "problem": "Sue lives in a fun neighborhood. One weekend, the neighbors decided to play a prank on Sue. On Friday morning, the neighbors placed 18 pink plastic flamingos out on Sue's front yard. On Saturday morning, the neighbors took back one third of the flamingos, painted them white, and put these newly painted white flamingos back out on Sue's front yard. Then, on Sunday morning, they added another 18 pink plastic flamingos to the collection. At noon on Sunday, how many more pink plastic flamingos were out than white plastic flamingos?", "steps": ["To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.", "On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, \(1/3 \times 18 = 6\) flamingos are taken back. So, they have \(18 - 6 = 12\) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has \(12 + 6 = 18\) pink flamingos and 6 white flamingos.", "On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has \(18 + 18 = 36\) pink flamingos and still 6 white flamingos.", "To find the difference, subtract the number of white flamingos from the number of pink flamingos: \(36 - 6 = 30\). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is \(\boxed{30}\)."], "final_answer_correct": false, "label": 1, "generated_critique": "\boxed{1}", "prediction": 1, "match": true}

This is odd, because I am using the same code as in your repo.

transformers==4.47.0
vllm==0.6.4.post1
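For reference, the two records above differ only in how much text precedes the final `\boxed{...}`. Here is a minimal sketch of how a prediction could be read from `generated_critique`, assuming the last boxed integer is taken as the answer. The function name `extract_boxed_index` is hypothetical, and the repo's actual parsing code is not shown in this thread.

```python
import re
from typing import Optional


def extract_boxed_index(critique: str) -> Optional[int]:
    """Return the integer inside the last \\boxed{...} in the text, or None."""
    # A critique may quote the solution's own \boxed{30} before giving its
    # verdict, so only the final match is treated as the prediction.
    matches = re.findall(r"\\boxed\{(-?\d+)\}", critique)
    return int(matches[-1]) if matches else None


# Direct answer, as in the Qwen2.5-7B-Instruct record.
direct = r"\boxed{1}"
# Critique followed by a verdict, as in the GPT-4o record (shortened here).
with_critique = r"The answer is \(\boxed{30}\). ... the final answer is \(\boxed{-1}\)."

print(extract_boxed_index(direct))
print(extract_boxed_index(with_critique))
```

Taking the last match matters for the critique-style output, because the quoted solution paragraph contains its own boxed number.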

Collaborator

chujiezheng commented Dec 17, 2024

Yes, I noticed the same behavior with Qwen2.5-7B-Instruct. It is probably because the model is not capable enough.

I have updated the metric calculation code:

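import numpy as np

# acc1: accuracy (%) on error_data; acc2: accuracy (%) on correct_data
acc1 = np.mean([e['match'] for e in error_data]) * 100
acc2 = np.mean([e['match'] for e in correct_data]) * 100
# F1 is the harmonic mean of the two accuracies
f1 = 2 * acc1 * acc2 / (acc1 + acc2)
print(f'{config} error acc: {acc1:.1f}, correct acc: {acc2:.1f}, f1: {f1:.1f}')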

Could you share your calculated scores?
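For anyone who wants to try the metric outside the repo, here is a self-contained sketch of the same calculation on toy data. It assumes each record is a dict with a boolean `match` field, as in the JSON records in this thread. The helper name `harmonic_f1`, the toy lists, and the zero guard are illustrative additions, not part of the repo's code.

```python
from statistics import mean


def harmonic_f1(error_data, correct_data):
    """Return (error acc, correct acc, F1), all in percent."""
    acc1 = mean(1.0 if e["match"] else 0.0 for e in error_data) * 100
    acc2 = mean(1.0 if e["match"] else 0.0 for e in correct_data) * 100
    # Guard against division by zero when both accuracies are 0.
    f1 = 2 * acc1 * acc2 / (acc1 + acc2) if (acc1 + acc2) > 0 else 0.0
    return acc1, acc2, f1


# Toy data: 2 of 4 erroneous samples judged correctly, 3 of 4 correct samples.
error_data = [{"match": m} for m in (True, False, False, True)]
correct_data = [{"match": m} for m in (True, True, True, False)]

acc1, acc2, f1 = harmonic_f1(error_data, correct_data)
print(f"error acc: {acc1:.1f}, correct acc: {acc2:.1f}, f1: {f1:.1f}")
```

Because F1 is a harmonic mean, it is pulled toward the lower of the two accuracies, so a model that does well on only one subset still gets a low score.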

Author

hzy312 commented Dec 17, 2024

For Qwen2.5-7B-Instruct with greedy decoding:
{"correct_acc": 0.46113989637305697, "error_acc": 0.4396135265700483, "f1": 0.45011949091313286}

Then I modified the prompt from
"Please put your final answer (i.e., the index) in \boxed{{}}."
to
"Please critique first and put your final answer (i.e., the index) in \boxed{{}}."

With this change the model generates a chain of thought (CoT) first, and the accuracy split changes completely, although F1 is nearly unchanged:
{"correct_acc": 0.7979274611398963, "error_acc": 0.3140096618357488, "f1": 0.45066744704319833}
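A note on the doubled braces in the two instructions: in a Python format string, `{{}}` renders as a literal `{}`, so the model sees `\boxed{}`. Here is a minimal sketch of the prompt change, assuming a format-string template. The template layout and the name `build_prompt` are hypothetical, and only the two instruction sentences come from this thread.

```python
ORIGINAL_INSTRUCTION = "Please put your final answer (i.e., the index) in \\boxed{{}}."
CRITIQUE_FIRST_INSTRUCTION = (
    "Please critique first and put your final answer (i.e., the index) in \\boxed{{}}."
)


def build_prompt(problem: str, instruction: str) -> str:
    # Hypothetical template: the problem text followed by the instruction.
    template = "{problem}\n\n" + instruction
    # str.format fills {problem} and collapses {{}} into a literal {}.
    return template.format(problem=problem)


print(build_prompt("<problem and solution paragraphs>", ORIGINAL_INSTRUCTION))
print(build_prompt("<problem and solution paragraphs>", CRITIQUE_FIRST_INSTRUCTION))
```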

The modified version at least reproduces the trend in your paper. With the original prompt, however, the trend is different: the correct and error accuracies are almost the same, whereas in Table 9 the correct accuracy is much higher than the error accuracy.
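As a sanity check, the two reported F1 values can be recomputed from the reported accuracies, assuming F1 is the harmonic mean of the two accuracies as in the metric code earlier in this thread. The run labels below are descriptive names chosen for this sketch.

```python
# Scores copied from the two runs reported above (0-1 scale).
runs = {
    "original prompt": {
        "correct_acc": 0.46113989637305697,
        "error_acc": 0.4396135265700483,
        "f1": 0.45011949091313286,
    },
    "critique-first prompt": {
        "correct_acc": 0.7979274611398963,
        "error_acc": 0.3140096618357488,
        "f1": 0.45066744704319833,
    },
}


def harmonic_mean(a: float, b: float) -> float:
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


for name, r in runs.items():
    recomputed = harmonic_mean(r["correct_acc"], r["error_acc"])
    gap = r["correct_acc"] - r["error_acc"]
    print(f"{name}: f1={recomputed:.4f} (reported {r['f1']:.4f}), "
          f"correct - error = {gap:+.4f}")
```

The gap between correct and error accuracy is the quantity that changes between the two prompts, while F1 stays near 0.45 in both runs.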

Author

hzy312 commented Dec 17, 2024

These results are on GSM8K.

Collaborator

I used vllm==0.6.3.post1 and 2 GPUs (tp=2) for Qwen2.5-7B-Instruct inference. Could you try with this config?
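A rough sketch of what that configuration could look like with vLLM's offline API, assuming the Hugging Face model ID `Qwen/Qwen2.5-7B-Instruct` and greedy decoding via `temperature=0.0`. The `max_tokens` value and the helper names are assumptions, and the repo's own inference script (including any chat-template handling) may differ.

```python
def build_engine_kwargs(model: str = "Qwen/Qwen2.5-7B-Instruct", tp: int = 2) -> dict:
    # tensor_parallel_size=2 shards the model across 2 GPUs (tp=2).
    return {"model": model, "tensor_parallel_size": tp}


def build_sampling_kwargs(max_tokens: int = 2048) -> dict:
    # temperature=0.0 selects greedy decoding; max_tokens is an assumed limit.
    return {"temperature": 0.0, "max_tokens": max_tokens}


def generate(prompts):
    # Imported lazily so the helpers above can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_kwargs())
    outputs = llm.generate(prompts, SamplingParams(**build_sampling_kwargs()))
    return [o.outputs[0].text for o in outputs]


print(build_engine_kwargs())
print(build_sampling_kwargs())
```

Calling `generate([...])` requires vLLM and two visible GPUs. The prompts passed in would need to be formatted already.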

Author

hzy312 commented Dec 17, 2024

> I used vllm==0.6.3.post1 and 2 gpus (tp=2) for Qwen2.5-7B-Instruct inference - could you try with this config?

I tried it, and I get this:
{"correct_acc": 0.48704663212435234, "error_acc": 0.4251207729468599, "f1": 0.4539816695022227}

The performance depends on the prompt, the vLLM version, and the number of GPUs.
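Since the scores vary with these factors, it may help to store them next to each result. Here is a minimal sketch, assuming the package versions are read with the standard-library `importlib.metadata`. The record layout and the names `package_version` and `run_record` are hypothetical.

```python
import json
from importlib.metadata import PackageNotFoundError, version


def package_version(name: str):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


def run_record(scores: dict, prompt_variant: str, num_gpus: int) -> dict:
    # Bundle the scores with the settings that were observed to affect them.
    return {
        "scores": scores,
        "prompt_variant": prompt_variant,
        "num_gpus": num_gpus,
        "versions": {pkg: package_version(pkg) for pkg in ("vllm", "transformers")},
    }


record = run_record(
    {"correct_acc": 0.48704663212435234,
     "error_acc": 0.4251207729468599,
     "f1": 0.4539816695022227},
    prompt_variant="original",
    num_gpus=2,
)
print(json.dumps(record, indent=2))
```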
