OS-World/xiangyi-li · BenchFlow



shenyq
Feature/task refine Fix task configurations and evaluators for improved robustness (#420)

* Relax reference similarity threshold for APA formatting task

  Reduce the reference_base_result threshold from 0.93/0.92 to 0.6 for task 2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e (APA 7th edition reference formatting).

  The original threshold was overly strict and failed submissions with minor formatting variations that are acceptable in APA style (e.g., spacing around journal volume numbers). The new threshold of 0.6 maintains validation of core content accuracy (authors, dates, titles, journals) while accepting reasonable formatting variants, better reflecting real-world APA formatting standards.

  Changes:
  - Ubuntu version: 0.93 -> 0.6
  - Windows version: 0.92 -> 0.6

* Fix XPath selectors for TripAdvisor hotel search task

  Replace brittle absolute XPath expressions with robust relative XPath selectors for task b7895e80-f4d1-4648-bee0-4eb45a6f1fa8 (TripAdvisor hotel search in NYC).

  The original absolute XPath paths (e.g., /html/body/div[1]/main/div[3]/...) are fragile and break when the website's DOM structure changes. The new relative XPath expressions use stable attributes like data-automation and aria-label, making the evaluator resilient to minor UI updates.

  Changes:
  - Replaced absolute paths with attribute-based selectors
    - Check-in date: //button[@data-automation='checkin']//...
    - Check-out date: //button[@data-automation='checkout']//...
    - City header: //h2[@data-automation='header_geo_title']
    - Room/guest selector: //button[@data-automation='roomsandguests']//...
    - Price sorting: //button[contains(@aria-label,'PRICE_LOW_TO_HIGH')]//...
  - Updated expected value for adult field in first test case to match actual output format: "Rooms/Guests1 Room, 2 Guests"

  This ensures the evaluation will work correctly even when TripAdvisor updates their page layout or CSS classes.

* Fix expected output format for file compression task

  Correct the expected success message format for task 37887e8c-da15-4192-923c-08fa390a176d (compress files modified 30 days ago).

  The evaluation script eval_20250703.sh outputs "SUCCESS: The task was completed correctly." with "SUCCESS" in all uppercase, but the expected value was using "Success" with only the first letter capitalized. This case mismatch caused the evaluator to incorrectly fail valid submissions.

  Change:
  - Expected message: "Success: ..." -> "SUCCESS: ..."

  This ensures the string matching in check_include_exclude() correctly identifies successful task completion.

* Add workspace trust settings to VS Code theme change task

  Update expected configuration for task 982d12a5-beab-424f-8d38-d2a48429e511 (change VS Code color theme) to include workspace trust security settings.

  When VS Code launches, it automatically adds workspace trust configuration to settings.json to handle security prompts. The original expected output only included the color theme setting, causing the evaluator to fail even when the task was completed correctly.

  Changes:
  - Added three workspace trust settings to expected configuration:
    - security.workspace.trust.enabled: false
    - security.workspace.trust.startupPrompt: "never"
    - security.workspace.trust.emptyWindow: false
  - Retained the required workbench.colorTheme setting

  These settings prevent workspace trust dialogs from interfering with automated testing and ensure the evaluator correctly matches the actual settings.json content generated during task execution.

* Improve GIMP character extraction task evaluation robustness

  Enhance task e8172110-ec08-421b-a6f5-842e6451911f (extract pixel art character in GIMP) with more flexible evaluation and reference implementation.

  Changes:

  1. Add reference Python implementation:
     - Provide complete example script for automated character extraction
     - Implements background removal using color detection and threshold
     - Includes helper functions for checkerboard background preview
     - Serves as learning reference while maintaining task difficulty

  2. Add new evaluation function check_structure_sim_with_threshold():
     - Customizable SSIM threshold (default 0.85, more lenient than 0.9)
     - Automatic image resizing when dimensions differ using LANCZOS
     - Comprehensive logging for debugging (comparison details, SSIM scores)
     - Handles edge cases: file not found, size mismatch, conversion errors

  3. Update task configuration:
     - Switch from check_structure_sim to check_structure_sim_with_threshold
     - Set SSIM threshold to 0.85 for both GIMP and code outputs
     - Add options parameter to pass threshold to evaluator

  Rationale:
  - Manual GIMP extraction and Python script may produce slightly different pixel-level results due to different image processing methods (anti-aliasing, edge handling)
  - 0.85 threshold accepts minor pixel differences while ensuring visual similarity
  - Auto-resize prevents failures from slight dimension mismatches during cropping
  - Reference script helps users understand automation requirements without compromising the learning objective

  This makes evaluation more realistic while maintaining visual quality standards.

* Delete evaluation_examples/examples/vs_code/982d12a5-beab-424f-8d38-d2a48429e511.json

  Just fixed it in a different way in the last commit. Thanks!

---------

Co-authored-by: Tianbao Xie <47296835+Timothyxxx@users.noreply.github.com>

70df33c
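The case mismatch described for the file compression task can be illustrated with a small sketch. The function below is a hypothetical stand-in written for this note, not the repository's check_include_exclude(); it assumes the check is a plain case-sensitive substring test over "include" and "exclude" lists, which is what the commit message implies but does not show.

```python
from typing import Dict, List


def include_exclude_score(output: str, rules: Dict[str, List[str]]) -> float:
    """Hypothetical helper: 1.0 if every 'include' string occurs in the
    output and no 'exclude' string does, else 0.0 (case-sensitive)."""
    include = rules.get("include", [])
    exclude = rules.get("exclude", [])
    ok = all(s in output for s in include) and all(s not in output for s in exclude)
    return 1.0 if ok else 0.0


# Message quoted in the commit as the output of eval_20250703.sh.
script_output = "SUCCESS: The task was completed correctly."

# Only the prefixes are compared here, because the commit elides the rest
# of the old expected string.
old_rules = {"include": ["Success:"]}
new_rules = {"include": ["SUCCESS:"]}

print(include_exclude_score(script_output, old_rules))  # 0.0
print(include_exclude_score(script_output, new_rules))  # 1.0
```

Under that assumption, a correct run fails against the old expectation purely because of letter case, which matches the failure the commit describes.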


{
  "id": "2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e",
  "snapshot": "libreoffice_calc",
  "instruction": "Could you please take a moment to review the 'case study' file located within the 'student work' folder? I'm particularly interested in ensuring that the references section at the end of the document adheres to the APA 7th edition formatting guidelines. Please make the necessary adjustments if the current formatting does not align with APA 7 standards or contains errors.",
  "source": "authors",
  "config": [
    {
      "type": "command",
      "parameters": {
        "command": [
          "mkdir",
          "-p",
          "/home/user/Desktop/students work/",
          "/home/user/Desktop/Lec powerpoint/",
          "/home/user/Desktop/Grammar test/",
          "/home/user/Desktop/Grammar rules PDF/",
          "/home/user/Desktop/FDI/"
        ]
      }
    },
    {
      "type": "download",
      "parameters": {
        "files": [
          {
            "path": "/home/user/Desktop/students work/Zheng He .docx",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/Zheng%20He%20.docx"
          },
          {
            "path": "/home/user/Desktop/students work/The literature reviews of weekly readings.docx",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/The%20literature%20reviews%20of%20weekly%20readings.docx"
          },
          {
            "path": "/home/user/Desktop/students work/The British Justice System.docx",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/The%20British%20Justice%20System.docx"
          },
          {
            "path": "/home/user/Desktop/students work/quiz2.docx",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/quiz2.docx"
          },
          {
            "path": "/home/user/Desktop/students work/quiz.docx",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/quiz.docx"
          },
          {
            "path": "/home/user/Desktop/students work/Q1&2&3.docx",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/Q1%262%263.docx"
          },
          {
            "path": "/home/user/Desktop/students work/Photo Ethics in Journalism.docx",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/Photo%20Ethics%20in%20Journalism.docx"
          },
          {
            "path": "/home/user/Desktop/students work/cassie.docx",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/cassie.docx"
          },
          {
            "path": "/home/user/Desktop/students work/case study.docx",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/case%20study.docx"
          },
          {
            "path": "/home/user/Desktop/Grammar rules PDF/irregularrules02.pdf",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/irregularrules02.pdf"
          },
          {
            "path": "/home/user/Desktop/Grammar rules PDF/irregularrules01.pdf",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/irregularrules01.pdf"
          },
          {
            "path": "/home/user/Desktop/Grammar rules PDF/fragrules.pdf",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/fragrules.pdf"
          },
          {
            "path": "/home/user/Desktop/Grammar rules PDF/csfsrules.pdf",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/csfsrules.pdf"
          },
          {
            "path": "/home/user/Desktop/Public Lecture Teaching Plan.docx",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/Public%20Lecture%20Teaching%20Plan.docx"
          },
          {
            "path": "/home/user/Desktop/Course Timetable.xlsx",
            "url": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/Course%20Timetable.xlsx"
          }
        ]
      }
    }
  ],
  "trajectory": "trajectories/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e",
  "related_apps": [],
  "evaluator": {
    "postconfig": [
      {
        "type": "activate_window",
        "parameters": {
          "window_name": "case study.docx - LibreOffice Writer",
          "strict": true
        }
      },
      {
        "type": "sleep",
        "parameters": {
          "seconds": 0.5
        }
      },
      {
        "type": "execute",
        "parameters": {
          "command": [
            "python",
            "-c",
            "import pyautogui; import time; pyautogui.hotkey('ctrl', 's'); time.sleep(0.5); "
          ]
        }
      }
    ],
    "func": "compare_references",
    "expected": {
      "type": "cloud_file",
      "path": "https://huggingface.co/datasets/xlangai/ubuntu_osworld_file_cache/resolve/main/multi_apps/2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e/case%20study%20gold.docx",
      "dest": "case study gold.docx"
    },
    "result": {
      "type": "vm_file",
      "path": "/home/user/Desktop/students work/case study.docx",
      "dest": "case study.docx"
    },
    "options": {
      "content_only": true,
      "reference_base_result": 0.6
    }
  },
  "proxy": false,
  "fixed_ip": false,
  "possibility_of_env_change": "low"
}
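
To make the effect of the "reference_base_result" option concrete, here is a minimal sketch of a threshold-based reference comparison. It is an assumption-laden illustration, not the repository's compare_references: it uses the standard-library difflib instead of whatever similarity measure the evaluator actually applies, it treats the option as a simple pass/fail cutoff on the mean similarity, and the function names and placeholder reference strings are invented for this example.

```python
from difflib import SequenceMatcher
from typing import List


def mean_reference_similarity(result: List[str], gold: List[str]) -> float:
    """Average character-level similarity (0.0 to 1.0) of paired entries.

    Hypothetical helper: returns 0.0 when the lists cannot be paired.
    """
    if not gold or len(result) != len(gold):
        return 0.0
    ratios = [SequenceMatcher(None, r, g).ratio() for r, g in zip(result, gold)]
    return sum(ratios) / len(ratios)


def passes_threshold(result: List[str], gold: List[str], threshold: float = 0.6) -> bool:
    """Assumed reading of the option: pass when mean similarity >= threshold."""
    return mean_reference_similarity(result, gold) >= threshold


# Placeholder entries (not real references). They differ only by a space
# before the issue number, the kind of variation the relaxed threshold
# is meant to tolerate.
gold = ["Author, A. (2020). Title of article. Journal Name, 12(3), 45-67."]
submitted = ["Author, A. (2020). Title of article. Journal Name, 12 (3), 45-67."]

print(mean_reference_similarity(submitted, gold))
print(passes_threshold(submitted, gold, threshold=0.6))
```

The actual evaluator may rescale or otherwise transform the score relative to the configured value, so treat this only as a way to reason about how strict a given cutoff is.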