Update evaluation examples with clearer instructions and improved configurations
- Revised the instructions in multiple JSON examples so that each task is stated more clearly and specifically.
- Adjusted the configurations of Chrome, GIMP, LibreOffice, and VLC tasks to match their requirements and expected behaviors more precisely.
- Together, these changes make the evaluation examples easier for users to follow.
a month ago
fix: refine task evaluators for Chrome and Thunderbird (#445)
- chrome/480bcfea: mark task as infeasible
- thunderbird/15c3b339: relax evaluator to only require email and password
(remove Anonym Tester and IMAP rules; instruction says stay on page)
- thunderbird/dd84e895: require every message starred in Bills folder
(SQL now checks starred count = total count, not just sum(1) > 0)
5 days ago
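The difference between the old and new check for thunderbird/dd84e895 can be illustrated with a minimal sketch. The table `messages(folder, starred)` and both helper names are hypothetical stand-ins, not the schema or code the evaluator actually uses, and the empty-folder guard is an added assumption.

```python
import sqlite3


def any_starred(conn: sqlite3.Connection, folder: str) -> bool:
    """Old-style check: passes as soon as one message in the folder is starred."""
    row = conn.execute(
        "SELECT SUM(1) > 0 FROM messages WHERE folder = ? AND starred = 1",
        (folder,),
    ).fetchone()
    # SUM over zero rows is NULL, which sqlite3 returns as None.
    return bool(row[0])


def all_starred(conn: sqlite3.Connection, folder: str) -> bool:
    """New-style check: passes only if starred count equals total count."""
    total, starred = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(starred), 0) FROM messages WHERE folder = ?",
        (folder,),
    ).fetchone()
    # Assumption (not from the commit): an empty folder does not count as a pass.
    return total > 0 and starred == total


# Hypothetical data: three messages in Bills, only one starred.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (folder TEXT, starred INTEGER)")
conn.executemany(
    "INSERT INTO messages VALUES (?, ?)",
    [("Bills", 1), ("Bills", 0), ("Bills", 0)],
)
print(any_starred(conn, "Bills"), all_starred(conn, "Bills"))  # True False
```

With one of three messages starred, the old check accepts the state and the new one rejects it, which is the false positive the commit describes closing.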
added missing instruction information (#440)
13 days ago
ver Jan27th (#423)
The instructions did not name the new sheet, which caused false-negative
evaluation results, so the required sheet name is now stated in the
instructions.
a month ago
Seed VLC auto-close task with play-and-exit enabled
a month ago
Make PPTX run-count comparison configurable for task b8adbc24 (#443)
Add an `examine_run_count` flag to `compare_pptx_files` (defaulting to true) and gate run-count mismatch checks for both text paragraphs and table cells. Disable this check in `b8adbc24-cef2-4b15-99d5-ecbe7ff445eb.json` to prevent false negatives from non-semantic LibreOffice run segmentation differences.
7 days ago
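The effect of gating the run-count check can be sketched on plain data. This is not the real `compare_pptx_files`: `compare_paragraphs` is a hypothetical helper that models each paragraph as a list of run strings, and only the `examine_run_count` flag name and its default of true come from the commit message.

```python
from typing import List

Paragraph = List[str]  # a paragraph modeled as the list of its text runs


def compare_paragraphs(
    original: List[Paragraph],
    modified: List[Paragraph],
    examine_run_count: bool = True,
) -> bool:
    """Return True if both paragraph lists carry the same text.

    When examine_run_count is True, the number of runs per paragraph
    must also match; when False, only the joined text is compared.
    """
    if len(original) != len(modified):
        return False
    for runs_a, runs_b in zip(original, modified):
        if "".join(runs_a) != "".join(runs_b):
            return False
        if examine_run_count and len(runs_a) != len(runs_b):
            return False
    return True


# Same visible text, but split into a different number of runs.
expected = [["Hello ", "world"]]
actual = [["Hello world"]]
print(compare_paragraphs(expected, actual))                           # False
print(compare_paragraphs(expected, actual, examine_run_count=False))  # True
```

Keeping the flag on by default preserves the stricter behavior for every other task, so only the example that opts out is affected.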
Mirror external setup/eval download links to HF cache
12 days ago
Dunjie Lu: Merge pull request #452 from xlang-ai/dev_djlu/gpt54_agent
optimize gpt5.4 prompt (cda933f)
feat: standardize configuration fields across all evaluation examples
- Add `fixed_ip` field to all 369 JSON files in examples directory
- Set to `true` for 8 files listed in google_chrome.json multi_apps
- Set to `false` for remaining 361 files
- Add `possibility_of_env_change` field to 363 JSON files missing this field
- Set to "low" for newly added fields
- Preserve existing values (4 medium, 2 high) for 6 files that already had this field
This ensures consistent configuration schema across all evaluation examples
while maintaining backward compatibility with existing settings.
8 months ago
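The field standardization described above can be sketched as a small, non-destructive update. The function name `standardize_example`, the use of an `id` key to identify an example, and the `fixed_ip_ids` set are assumptions made for illustration; only the field names and the default values come from the commit message.

```python
from typing import Any, Dict, Set


def standardize_example(
    example: Dict[str, Any], fixed_ip_ids: Set[str]
) -> Dict[str, Any]:
    """Return a copy of an example config with both fields present.

    fixed_ip is True only for examples whose id is in fixed_ip_ids
    (hypothetically, the ids listed under google_chrome.json multi_apps).
    possibility_of_env_change defaults to "low" but keeps any existing value.
    """
    updated = dict(example)
    updated["fixed_ip"] = example.get("id") in fixed_ip_ids
    updated.setdefault("possibility_of_env_change", "low")
    return updated


# Hypothetical inputs: one example that needs a fixed IP, one with an existing rating.
fixed = {"example-a"}
print(standardize_example({"id": "example-a"}, fixed))
print(standardize_example({"id": "example-b", "possibility_of_env_change": "high"}, fixed))
```

Using `setdefault` is what keeps the six pre-existing "medium" and "high" values intact while filling in "low" everywhere else.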