xiangyi-li/OS-World
mirrored 8 minutes ago
  • README.md (1.42 kB)
  • examples (-)
  • examples_windows (-)
  • settings (-)
  • test_all.json (16.5 kB)
  • test_infeasible.json (1.42 kB)
  • test_nogdrive.json (16.9 kB)
  • test_small.json (1.93 kB)
Server setup readme revision (#108)
  * Initialize
  * add note for resolution
  * Organize
  * draft version and todos
  * ver Nov24th supplemented socat installation and switching off automatic suspend and screen-off
  * Finish Tianbao todos
  * Finish Tianbao todos
  * Fix typos
  * update font install
  * Finish Xiaochuan's Part
  * Finish Xiaochuan's Part update
  * Update README.md
  * Fix format
Co-authored-by: zdy023 <zdy004007@126.com>
Co-authored-by: tsuky_chen <3107760494@qq.com>
Co-authored-by: Jason Lee <lixiaochuan20@gmail.com>
Co-authored-by: Siheng Zhao <77528902+sihengz02@users.noreply.github.com>
a year ago
  1. evaluation_examples
Dunjie Lu: Merge pull request #452 from xlang-ai/dev_djlu/gpt54_agent, optimize gpt5.4 prompt (cda933f)
fix some multi_apps task (#243)
  * fix chrome
  * fix: fix proxy setup
  * feat&fix: add proxy support in setup and remove hardcoded proxy from example
  * fix tasks
  * fix chrome finished
  * fix
  * clean chrome_fix code
  * clean chrome_fix code
  * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774
  * fix multiapps
  * fix chrome 2888b4e6-5b47-4b57-8bf5-c73827890774
  * fix some multi_apps tasks
  * fix some multi_apps tasks
Co-authored-by: adlsdztony <zzl0712@connect.hku.hk>
8 months ago
Restore vscode task 982d12a5 and re-include in test sets
15 days ago
Restore vscode task 982d12a5 and re-include in test sets
15 days ago
fix: refine task evaluators for Chrome and Thunderbird (#445)
  - chrome/480bcfea: mark task as infeasible
  - thunderbird/15c3b339: relax evaluator to only require email and password (remove Anonym Tester and IMAP rules; instruction says stay on page)
  - thunderbird/dd84e895: require every message starred in Bills folder (SQL now checks starred count = total count, not just sum(1) > 0)
5 days ago
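The thunderbird/dd84e895 change above tightens the check from "at least one starred message" to "every message starred". A minimal sketch of the difference follows, assuming a hypothetical `messages(folder, starred)` table in an in-memory SQLite database. The table, column names, and helper functions are illustrative assumptions, not the evaluator's actual query or Thunderbird's real schema.

```python
import sqlite3

# Hypothetical schema for illustration only; the real evaluator queries
# Thunderbird's own database, whose layout is not shown in this listing.
SCHEMA = "CREATE TABLE messages (id INTEGER PRIMARY KEY, folder TEXT, starred INTEGER)"


def make_demo_db(flags):
    """Build an in-memory DB with one 'Bills' message per starred flag."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO messages (folder, starred) VALUES ('Bills', ?)",
        [(int(bool(f)),) for f in flags],
    )
    return conn


def any_starred(conn, folder):
    """Looser check: passes as soon as one message in the folder is starred."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM messages WHERE folder = ? AND starred = 1",
        (folder,),
    ).fetchone()
    return count > 0


def all_starred(conn, folder):
    """Stricter check: starred count must equal total count.

    Treating an empty folder as a failure is a choice made for this sketch.
    """
    total, starred = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(starred), 0) FROM messages WHERE folder = ?",
        (folder,),
    ).fetchone()
    return total > 0 and starred == total


if __name__ == "__main__":
    partial = make_demo_db([True, True, False])
    print(any_starred(partial, "Bills"), all_starred(partial, "Bills"))  # True False
```

With two of three messages starred, the looser check passes while the stricter one fails, which is the gap the commit message describes.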
Fix chrome dark-mode task evaluation for appearance settings
24 days ago
Mirror external setup/eval download links to HF cache
12 days ago
Fix Duplicate ids; Remove unused JSON files across multiple applications
a year ago