xiangyi-li/OS-World
mirrored 15 minutes ago
  • chrome
  • gimp
  • libreoffice_calc
  • libreoffice_impress
  • libreoffice_writer
  • multi_apps
  • os
  • thunderbird
  • vlc
  • vs_code
Update evaluation examples with clearer instructions and improved configurations
- Revised instructions in multiple JSON examples to enhance clarity and specificity, ensuring users understand the tasks better.
- Adjusted configurations for tasks involving Chrome, GIMP, LibreOffice, and VLC to reflect more precise requirements and expected behaviors.
- These changes aim to improve the usability and effectiveness of the evaluation examples, making them more intuitive for users.
a month ago
fix: refine task evaluators for Chrome and Thunderbird (#445)
- chrome/480bcfea: mark task as infeasible
- thunderbird/15c3b339: relax evaluator to only require email and password (remove Anonym Tester and IMAP rules; instruction says stay on page)
- thunderbird/dd84e895: require every message starred in Bills folder (SQL now checks starred count = total count, not just sum(1) > 0)
5 days ago
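The thunderbird/dd84e895 change can be illustrated on a toy table. The sketch below uses an in-memory SQLite database with a hypothetical `messages(folder, starred)` schema; Thunderbird's real message store is laid out differently, and the actual evaluator query is not reproduced here. The function names and the rule that an empty folder fails are also assumptions. It only contrasts an "at least one starred" check with an "every message starred" check.

```python
import sqlite3

def make_db(rows):
    """Build an in-memory DB with a hypothetical messages(folder, starred) table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (folder TEXT, starred INTEGER)")
    conn.executemany("INSERT INTO messages VALUES (?, ?)", rows)
    return conn

def any_starred(conn, folder):
    """Looser check: passes as soon as one message in the folder is starred."""
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE folder = ? AND starred = 1",
        (folder,),
    ).fetchone()
    return n > 0

def all_starred(conn, folder):
    """Stricter check: starred count must equal total count.

    Treating an empty folder as a failure is an assumption of this sketch.
    """
    total, starred = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(starred), 0) FROM messages WHERE folder = ?",
        (folder,),
    ).fetchone()
    return total > 0 and starred == total

# One Bills message left unstarred vs. every Bills message starred.
partial = make_db([("Bills", 1), ("Bills", 0), ("Inbox", 0)])
complete = make_db([("Bills", 1), ("Bills", 1), ("Inbox", 0)])
```

With these two databases, `any_starred` accepts the partially starred folder while `all_starred` rejects it, which is the kind of false positive the stricter comparison is meant to rule out.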
added missing instruction information (#440)
13 days ago
ver Jan27th (#423): a missing concrete sheet name in the instructions was causing false-negative evaluation results, so the instructions now state the required name of the new sheet.
a month ago
Seed VLC auto-close task with play-and-exit enabled
a month ago
Make PPTX run-count comparison configurable for task b8adbc24 (#443)
Add an `examine_run_count` flag to `compare_pptx_files` (defaulting to true) and gate run-count mismatch checks for both text paragraphs and table cells. Disable this check in `b8adbc24-cef2-4b15-99d5-ecbe7ff445eb.json` to prevent false negatives from non-semantic LibreOffice run segmentation differences.
7 days ago
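To show why a run-count check can produce false negatives, here is a minimal sketch of the gating idea. It is not the repository's `compare_pptx_files`: the helper `compare_paragraphs` is hypothetical and works on plain Python lists (each paragraph is a list of run strings) instead of PPTX files. Only the flag name `examine_run_count` and its default of true come from the commit message.

```python
def compare_paragraphs(expected, actual, examine_run_count=True):
    """Compare paragraphs, each given as a list of run strings.

    Hypothetical stand-in for the text-paragraph part of a PPTX comparison.
    """
    if len(expected) != len(actual):
        return False
    for exp_runs, act_runs in zip(expected, actual):
        # Optional structural check: same number of runs per paragraph.
        if examine_run_count and len(exp_runs) != len(act_runs):
            return False
        # Semantic check: the concatenated text must match either way.
        if "".join(exp_runs) != "".join(act_runs):
            return False
    return True

expected = [["Quarterly ", "results"]]
actual = [["Quarterly results"]]  # same text, different run segmentation
```

With the default flag the two paragraphs are reported as different even though their text is identical; passing `examine_run_count=False` compares text only, while a real text difference still fails.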
evaluation_examples/examples
Mirror external setup/eval download links to HF cache
12 days ago
Dunjie Lu: Merge pull request #452 from xlang-ai/dev_djlu/gpt54_agent (optimize gpt5.4 prompt), commit cda933f
feat: standardize configuration fields across all evaluation examples
- Add `fixed_ip` field to all 369 JSON files in examples directory
  - Set to `true` for 8 files listed in google_chrome.json multi_apps
  - Set to `false` for remaining 361 files
- Add `possibility_of_env_change` field to 363 JSON files missing this field
  - Set to "low" for newly added fields
  - Preserve existing values (4 medium, 2 high) for 6 files that already had this field
This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
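The field standardization described in that commit can be sketched as a small pure function. This is an illustration under assumptions, not the script that was actually run: the function name `standardize`, its parameters, and the example IDs are hypothetical; only the two field names and the "low" default come from the commit message.

```python
def standardize(config, fixed_ip_ids, example_id):
    """Return a copy of a task config with both fields present.

    fixed_ip_ids: set of example IDs that should get fixed_ip = True
    (hypothetical input; the commit derives it from google_chrome.json).
    """
    out = dict(config)
    # fixed_ip is written for every example: True only for the listed IDs.
    out["fixed_ip"] = example_id in fixed_ip_ids
    # Existing possibility_of_env_change values are preserved;
    # "low" is filled in only when the field is missing.
    out.setdefault("possibility_of_env_change", "low")
    return out

# Hypothetical example IDs, for illustration only.
listed = {"task-a"}
a = standardize({"id": "task-a"}, listed, "task-a")
b = standardize({"id": "task-b", "possibility_of_env_change": "high"}, listed, "task-b")
```

Using `setdefault` keeps any value a file already carries, which matches the commit's stated goal of preserving the existing medium and high entries.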