reproduction issue with qwen2.5-7B-Instruct #7
Comments
Yes, I noticed the same behavior with Qwen2.5-7B-Instruct, probably because the model is not capable enough. I have updated the metric calculation code: Lines 127 to 130 in 519df16
Could you share your calculated scores?
For Qwen2.5-7B-Instruct with greedy decoding: Then I modified the prompt from It now generates the CoT first, and the performance changes completely:
The modified version at least reproduces the trend in your paper. With the original prompt, however, the trend is different: the correct and error subsets score almost the same, whereas in Table 9 the correct subset scores much higher than the error subset.
These results are on GSM8K.
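To make the "correct" versus "error" comparison concrete, here is a minimal sketch of how the two subset accuracies and a combined F1 could be computed from records shaped like the ones in this thread. It assumes that `label == -1` marks a solution with no erroneous step and that F1 is the harmonic mean of the two subset accuracies; the function names are hypothetical and this is not necessarily the repo's metric code.

```python
def subset_accuracy(records, want_error):
    """Percentage of matching predictions within one subset.

    want_error=True selects records with an erroneous step (label != -1);
    want_error=False selects records with no error (label == -1, an assumption).
    """
    subset = [r for r in records if (r["label"] != -1) == want_error]
    if not subset:
        return 0.0
    return 100.0 * sum(bool(r["match"]) for r in subset) / len(subset)


def harmonic_f1(acc_error, acc_correct):
    """Harmonic mean of the two subset accuracies (assumed F1 definition)."""
    total = acc_error + acc_correct
    return 0.0 if total == 0 else 2 * acc_error * acc_correct / total


# Toy records, only to show the call pattern; not real results.
records = [
    {"label": 1, "match": True},
    {"label": 1, "match": False},
    {"label": -1, "match": True},
]
acc_error = subset_accuracy(records, want_error=True)
acc_correct = subset_accuracy(records, want_error=False)
f1 = harmonic_f1(acc_error, acc_correct)
```

Reporting both subset accuracies next to the F1 makes it easier to see whether a prompt change shifts one subset or both.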
I used vllm==0.6.3.post1 and 2 GPUs (tp=2) for Qwen2.5-7B-Instruct inference. Could you try with this config?
I tried, and I get this: The performance depends on the prompt, the vLLM version, and the number of GPUs.
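Since the scores vary with the environment, it may help to record package versions alongside each run. A minimal sketch using only the standard library follows; the helper name `package_versions` is hypothetical, and the package list is just the two packages named in this thread.

```python
from importlib import metadata


def package_versions(names):
    """Return {package: installed version, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print the versions to store next to the evaluation results.
print(package_versions(["vllm", "transformers"]))
```

The GPU count (tensor-parallel size) is not captured by this and would need to be noted separately.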
When I use GPT-4o, it generates a critique and then the final prediction, like this:
{"id": "gsm8k-0", "generator": "Qwen2-7B-Instruct", "problem": "Sue lives in a fun neighborhood. One weekend, the neighbors decided to play a prank on Sue. On Friday morning, the neighbors placed 18 pink plastic flamingos out on Sue's front yard. On Saturday morning, the neighbors took back one third of the flamingos, painted them white, and put these newly painted white flamingos back out on Sue's front yard. Then, on Sunday morning, they added another 18 pink plastic flamingos to the collection. At noon on Sunday, how many more pink plastic flamingos were out than white plastic flamingos?", "steps": ["To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.", "On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, \(1/3 \times 18 = 6\) flamingos are taken back. So, they have \(18 - 6 = 12\) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has \(12 + 6 = 18\) pink flamingos and 6 white flamingos.", "On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has \(18 + 18 = 36\) pink flamingos and still 6 white flamingos.", "To find the difference, subtract the number of white flamingos from the number of pink flamingos: \(36 - 6 = 30\). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. 
The answer is \(\boxed{30}\)."], "final_answer_correct": false, "label": 1, "generated_critique": "Let's review the solution paragraph by paragraph:\n\n<paragraph_0>\nTo find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.\n</paragraph_0>\nThis paragraph correctly states the initial number of pink flamingos. No errors here.\n\n<paragraph_1>\nOn Saturday, they take back one third of the flamingos. Since there were 18 flamingos, \(1/3 \times 18 = 6\) flamingos are taken back. So, they have \(18 - 6 = 12\) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has \(12 + 6 = 18\) pink flamingos and 6 white flamingos.\n</paragraph_1>\nThis paragraph correctly calculates the number of flamingos taken back, painted, and returned. It also correctly states the number of pink and white flamingos by the end of Saturday. No errors here.\n\n<paragraph_2>\nOn Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has \(18 + 18 = 36\) pink flamingos and still 6 white flamingos.\n</paragraph_2>\nThis paragraph correctly adds the additional pink flamingos on Sunday and states the correct totals. No errors here.\n\n<paragraph_3>\nTo find the difference, subtract the number of white flamingos from the number of pink flamingos: \(36 - 6 = 30\). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is \(\boxed{30}\).\n</paragraph_3>\nThis paragraph correctly calculates the difference between the number of pink and white flamingos. 
No errors here.\n\nSince there are no errors in any of the paragraphs, the final answer is \(\boxed{-1}\).", "prediction": -1, "match": false}
But when I use Qwen2.5-7B-Instruct, it generates the answer directly, and the final F1 score is totally different from the score in the original paper. The trajectory looks like this:
{"id": "gsm8k-0", "generator": "Qwen2-7B-Instruct", "problem": "Sue lives in a fun neighborhood. One weekend, the neighbors decided to play a prank on Sue. On Friday morning, the neighbors placed 18 pink plastic flamingos out on Sue's front yard. On Saturday morning, the neighbors took back one third of the flamingos, painted them white, and put these newly painted white flamingos back out on Sue's front yard. Then, on Sunday morning, they added another 18 pink plastic flamingos to the collection. At noon on Sunday, how many more pink plastic flamingos were out than white plastic flamingos?", "steps": ["To find out how many more pink plastic flamingos were out than white plastic flamingos at noon on Sunday, we can break down the problem into steps. First, on Friday, the neighbors start with 18 pink plastic flamingos.", "On Saturday, they take back one third of the flamingos. Since there were 18 flamingos, \(1/3 \times 18 = 6\) flamingos are taken back. So, they have \(18 - 6 = 12\) flamingos left in their possession. Then, they paint these 6 flamingos white and put them back out on Sue's front yard. Now, Sue has the original 12 pink flamingos plus the 6 new white ones. Thus, by the end of Saturday, Sue has \(12 + 6 = 18\) pink flamingos and 6 white flamingos.", "On Sunday, the neighbors add another 18 pink plastic flamingos to Sue's front yard. By the end of Sunday morning, Sue has \(18 + 18 = 36\) pink flamingos and still 6 white flamingos.", "To find the difference, subtract the number of white flamingos from the number of pink flamingos: \(36 - 6 = 30\). Therefore, at noon on Sunday, there were 30 more pink plastic flamingos out than white plastic flamingos. The answer is \(\boxed{30}\)."], "final_answer_correct": false, "label": 1, "generated_critique": "\boxed{1}", "prediction": 1, "match": true}
It's weird, since I use the same code as in your repo.
transformers==4.47.0
vllm==0.6.4.post1
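The two trajectories above differ in whether any reasoning precedes the boxed index. A minimal sketch for checking this over a results file is below; it assumes the prediction is the last `\boxed{<integer>}` in `generated_critique`, and both helper names are hypothetical rather than taken from the repo.

```python
import re

# Matches \boxed{N} where N is an integer, possibly negative.
BOXED_INT = re.compile(r"\\boxed\{(-?\d+)\}")


def extract_prediction(critique):
    """Return the last boxed integer in the critique, or None if absent."""
    matches = BOXED_INT.findall(critique)
    return int(matches[-1]) if matches else None


def has_reasoning(critique):
    """True if the critique contains any text besides boxed integers."""
    return bool(BOXED_INT.sub("", critique).strip())


direct = "\\boxed{1}"
with_cot = "The answer is \\boxed{30}. No errors here, so the final answer is \\boxed{-1}."
```

Counting how many critiques fail `has_reasoning` under each prompt would show how often the model skips the critique entirely.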