OS-World/xiangyi-li · BenchFlow

chrome
-
fix: refine task evaluators for Chrome and Thunderbird (#445)
- chrome/480bcfea: mark task as infeasible
- thunderbird/15c3b339: relax evaluator to only require email and password (remove Anonym Tester and IMAP rules; instruction says stay on page)
- thunderbird/dd84e895: require every message starred in Bills folder (SQL now checks starred count = total count, not just sum(1) > 0)
5 days ago
gimp
-
Update evaluation examples with clearer instructions and improved configurations
- Revised instructions in multiple JSON examples to enhance clarity and specificity, ensuring users understand the tasks better.
- Adjusted configurations for tasks involving Chrome, GIMP, LibreOffice, and VLC to reflect more precise requirements and expected behaviors.
- These changes aim to improve the usability and effectiveness of the evaluation examples, making them more intuitive for users.
a month ago
libreoffice_calc
-
ver Jan27th (#423)
A missing concrete sheet name in the instructions was causing false-negative evaluation results, so the required name of the new sheet has been added to the instructions.
a month ago
libreoffice_impress
-
Make PPTX run-count comparison configurable for task b8adbc24 (#443)
Add an `examine_run_count` flag to `compare_pptx_files` (defaulting to true) and gate run-count mismatch checks for both text paragraphs and table cells. Disable this check in `b8adbc24-cef2-4b15-99d5-ecbe7ff445eb.json` to prevent false negatives from non-semantic LibreOffice run segmentation differences.
7 days ago
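The run-count gate described in the libreoffice_impress commit can be illustrated with a minimal sketch. This is not the repository's `compare_pptx_files`; the helper name `paragraphs_match` and the representation of a paragraph as a list of run strings are assumptions made for illustration. Only the `examine_run_count` flag name and its default of true come from the commit message.

```python
from typing import List


def paragraphs_match(
    runs_a: List[str],
    runs_b: List[str],
    examine_run_count: bool = True,
) -> bool:
    """Hypothetical helper: compare two paragraphs given as lists of run texts.

    When examine_run_count is True, a differing number of runs is treated
    as a mismatch. When it is False, only the concatenated text is compared,
    so differences in how the text is segmented into runs are ignored.
    """
    if examine_run_count and len(runs_a) != len(runs_b):
        return False
    return "".join(runs_a) == "".join(runs_b)


# Same visible text, segmented into a different number of runs.
original = ["Hello ", "world"]
resaved = ["Hello world"]

strict = paragraphs_match(original, resaved)
relaxed = paragraphs_match(original, resaved, examine_run_count=False)
```

With the flag left at its default the two paragraphs are reported as different, while disabling it accepts them, which is the kind of false negative the commit says it avoids for task b8adbc24.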
libreoffice_writer
-
feat: standardize configuration fields across all evaluation examples
- Add `fixed_ip` field to all 369 JSON files in examples directory
  - Set to `true` for 8 files listed in google_chrome.json multi_apps
  - Set to `false` for remaining 361 files
- Add `possibility_of_env_change` field to 363 JSON files missing this field
  - Set to "low" for newly added fields
  - Preserve existing values (4 medium, 2 high) for 6 files that already had this field

This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
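The field standardization in the libreoffice_writer commit could be sketched roughly as follows. The function name `standardize`, the assumption that each example carries an `id` key, and the sample IDs are all hypothetical; only the two field names and the default/preserve rules come from the commit message.

```python
from typing import Any, Dict, Set


def standardize(config: Dict[str, Any], fixed_ip_ids: Set[str]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of one example config with both fields set.

    fixed_ip is True only for IDs in fixed_ip_ids (an assumed stand-in for the
    8 files named in the commit) and False otherwise.
    possibility_of_env_change defaults to "low" but an existing value is kept.
    """
    updated = dict(config)
    updated["fixed_ip"] = config.get("id") in fixed_ip_ids
    updated.setdefault("possibility_of_env_change", "low")
    return updated


# Illustrative inputs; these IDs are made up.
needs_fixed_ip = {"task-a"}
plain = standardize({"id": "task-b"}, needs_fixed_ip)
pinned = standardize({"id": "task-a", "possibility_of_env_change": "high"}, needs_fixed_ip)
```

Using `setdefault` is what keeps values such as "medium" or "high" intact for files that already had the field, matching the backward-compatibility note in the commit.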
multi_apps
-
Mirror external setup/eval download links to HF cache
12 days ago
os
-
added missing instruction information (#440)
13 days ago
thunderbird
-
fix: refine task evaluators for Chrome and Thunderbird (#445)
- chrome/480bcfea: mark task as infeasible
- thunderbird/15c3b339: relax evaluator to only require email and password (remove Anonym Tester and IMAP rules; instruction says stay on page)
- thunderbird/dd84e895: require every message starred in Bills folder (SQL now checks starred count = total count, not just sum(1) > 0)
5 days ago
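The difference between the two SQL checks mentioned for thunderbird/dd84e895 can be shown with a small sketch. The `messages` table, its columns, and both helper functions are assumptions for illustration and do not reflect Thunderbird's actual database schema or the repository's evaluator. Treating an empty folder as a failure is also an assumption.

```python
import sqlite3
from typing import Iterable, Tuple


def build_db(rows: Iterable[Tuple[str, int]]) -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical messages table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (folder TEXT, starred INTEGER)")
    conn.executemany("INSERT INTO messages VALUES (?, ?)", rows)
    return conn


def any_starred(conn: sqlite3.Connection, folder: str) -> bool:
    """Weaker check: passes as soon as one message in the folder is starred."""
    row = conn.execute(
        "SELECT sum(1) > 0 FROM messages WHERE folder = ? AND starred = 1",
        (folder,),
    ).fetchone()
    # sum(1) over zero rows is NULL, which Python sees as None (falsy).
    return bool(row[0])


def all_starred(conn: sqlite3.Connection, folder: str) -> bool:
    """Stricter check: starred count must equal the total message count."""
    total, starred = conn.execute(
        "SELECT count(*), coalesce(sum(starred), 0) FROM messages WHERE folder = ?",
        (folder,),
    ).fetchone()
    return total > 0 and starred == total


# One of two messages in Bills is starred.
partial = build_db([("Bills", 1), ("Bills", 0)])
# Both messages in Bills are starred.
complete = build_db([("Bills", 1), ("Bills", 1)])
```

On the `partial` database the weaker check passes while the stricter one fails, which is the gap the commit describes closing.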
vlc
-
Seed VLC auto-close task with play-and-exit enabled
a month ago
vs_code
-
Mirror external setup/eval download links to HF cache
12 days ago