OS-World/xiangyi-li · BenchFlow



Tianbao Xie
Organize run scripts into structured directories (#424)

* Organize run scripts into structured directories

  Move all run_*.py and run_*.sh scripts from the root directory into a new scripts/ directory with the following structure:
  - scripts/python/ - Contains all Python run scripts (29 files)
  - scripts/bash/ - Contains all bash scripts (2 files)

  This improves repository organization and makes it easier to locate and manage model run scripts.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix import paths and update documentation for reorganized scripts

  Changes:
  - Added sys.path configuration to all Python scripts in scripts/python/ to enable imports from project root
  - Updated README.md with new script paths (scripts/python/run_multienv.py)
  - Enhanced scripts/README.md with detailed usage instructions and technical details about path resolution
  - All scripts now work correctly when run from project root directory

  Technical details:
  - Each script now includes: sys.path.insert(0, os.path.join(os.path.dirname(__file__), "../.."))
  - This allows scripts to import lib_run_single, desktop_env, and mm_agents modules
  - Scripts must be run from OSWorld root directory (not from scripts/ subdirectory)

  Tested: python scripts/python/run_multienv.py --help works correctly

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add manual examination tool and remove deprecated main.py

  Changes:
  - Added scripts/python/manual_examine.py for manual task verification
    - Fixed imports with sys.path configuration
    - Allows manual execution and verification of benchmark tasks
    - Records screenshots, videos, and evaluation results
  - Added scripts/bash/run_manual_examine.sh with example task IDs
  - Updated README.md with manual examination section
  - Updated scripts/README.md with manual examination documentation
  - Removed main.py (replaced by manual_examine.py)

  The manual examination tool provides:
  - Manual task execution in the environment
  - Task correctness verification
  - Execution recording with screenshots and videos
  - Examination of specific problematic tasks

  Usage:
    python scripts/python/manual_examine.py \
        --domain libreoffice_impress \
        --example_id a669ef01-ded5-4099-9ea9-25e99b569840 \
        --headless \
        --observation_type screenshot \
        --result_dir ./results_human_examine

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Update show_result.py with detailed scores and argument parsing

  Changes:
  - Added argparse for command-line argument parsing
  - Added --detailed flag to show compact "score/total" format per domain
  - Removed hardcoded example paths
  - Added comprehensive docstring for get_result function
  - Added parameter descriptions and help text
  - Updated README.md with detailed usage examples

  New features:
  - Standard mode: Shows per-domain success rates and statistics
  - Detailed mode (--detailed): Shows compact "score/total" format
  - All parameters now configurable via command line
  - Better error handling for missing domains in category statistics

  Usage examples:
    python show_result.py
    python show_result.py --model gpt-4o --detailed
    python show_result.py --result_dir ./custom_results --action_space computer_13

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Add note about bash scripts and community contributions

  Added a note in scripts/README.md explaining that:
  - Many bash scripts were not preserved during reorganization
  - More bash scripts will be gradually added in future updates
  - Community contributions are welcome

  This provides transparency about the current state and encourages community participation in expanding the bash scripts collection.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Merge lib_run_single files into unified lib_run_single.py

  Changes:
  - Merged lib_run_single_mobileagent_v3.py into lib_run_single.py
    - Added run_single_example_mobileagent_v3() function
  - Merged lib_run_single_os_symphony.py into lib_run_single.py
    - run_single_example_os_symphony() was already present
  - Removed lib_run_single_mobileagent_v3.py
  - Removed lib_run_single_os_symphony.py
  - Updated scripts/python/run_multienv_mobileagent_v3.py to use unified lib_run_single

  Benefits:
  - Single source of truth for all run_single_example functions
  - Easier maintenance and consistency
  - Reduced code duplication
  - All specialized agent functions in one place

  All run_single_example functions now available in lib_run_single.py:
  - run_single_example (default)
  - run_single_example_human
  - run_single_example_agi
  - run_single_example_openaicua
  - run_single_example_opencua
  - run_single_example_autoglm
  - run_single_example_mano
  - run_single_example_uipath
  - run_single_example_os_symphony
  - run_single_example_evocua
  - run_single_example_mobileagent_v3 (newly merged)

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Consolidate setup guidelines and remove empty CONTRIBUTION.md

  Changes:
  - Created unified SETUP_GUIDELINE.md merging:
    - ACCOUNT_GUIDELINE.md (Google account setup)
    - PROXY_GUIDELINE.md (Proxy configuration)
    - PUBLIC_EVALUATION_GUIDELINE.md (AWS platform setup)
  - Removed CONTRIBUTION.md (empty file)
  - Removed individual guideline files
  - Updated all references in README.md to point to SETUP_GUIDELINE.md

  Benefits:
  - Single comprehensive guide for all setup needs
  - Better organization with clear table of contents
  - Easier to maintain and update
  - Reduced file clutter in repository root

  The new SETUP_GUIDELINE.md includes:
  1. Google Account Setup - OAuth2.0 configuration for Google Drive tasks
  2. Proxy Configuration - For users behind firewalls or GFW
  3. Public Evaluation Platform - AWS-based parallel evaluation setup

  All sections are properly cross-referenced and include detailed step-by-step instructions with screenshots and troubleshooting tips.

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>

75fd8c0
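
The show_result.py file that follows reads scores from a tree of the form <result_dir>/<action_space>/<observation_type>/<model>/<domain>/<example_id>/result.txt. The sketch below illustrates that layout and the compact "score/total" summary on a throwaway directory. It is only an illustration: collect_scores and build_demo_tree are hypothetical helpers, and the task ids and scores are made up.

```python
import os
import tempfile
from collections import defaultdict


def collect_scores(target_dir):
    """Return {domain: [score, ...]} for every result.txt under target_dir."""
    scores = defaultdict(list)
    for domain in sorted(os.listdir(target_dir)):
        domain_path = os.path.join(target_dir, domain)
        if not os.path.isdir(domain_path):
            continue
        for example_id in sorted(os.listdir(domain_path)):
            result_path = os.path.join(domain_path, example_id, "result.txt")
            if os.path.isfile(result_path):
                with open(result_path) as f:
                    scores[domain].append(float(f.read().strip()))
    return dict(scores)


def build_demo_tree(root):
    """Create a tiny, made-up result tree and return the model-level directory."""
    target_dir = os.path.join(root, "pyautogui", "screenshot", "gpt-4o")
    demo = {"chrome": {"task-a": "1.0", "task-b": "0.0"}, "vlc": {"task-c": "0.5"}}
    for domain, examples in demo.items():
        for example_id, score in examples.items():
            example_dir = os.path.join(target_dir, domain, example_id)
            os.makedirs(example_dir)
            with open(os.path.join(example_dir, "result.txt"), "w") as f:
                f.write(score)
    return target_dir


with tempfile.TemporaryDirectory() as root:
    demo_scores = collect_scores(build_demo_tree(root))

# One "score/total" entry per domain, as in the --detailed mode described above.
for domain, values in demo_scores.items():
    print(domain, f"{sum(values):.2f}/{len(values)}")
```

Directories without a result.txt are skipped, so unfinished tasks do not count toward the total.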


import os
import argparse


def get_result(action_space, use_model, observation_type, result_dir, show_detailed_scores=False):
    """
    Calculate and display evaluation results from OSWorld benchmark runs.

    Args:
        action_space (str): Action space used (e.g., "pyautogui", "computer_13")
        use_model (str): Model name used for evaluation (e.g., "gpt-4o", "claude-3")
        observation_type (str): Observation type used (e.g., "screenshot", "a11y_tree")
        result_dir (str): Root directory containing results
        show_detailed_scores (bool): If True, show detailed scores per domain in format "score/total"

    Returns:
        list: List of all individual task results, or None if no results found
    """
    target_dir = os.path.join(result_dir, action_space, observation_type, use_model)
    if not os.path.exists(target_dir):
        print("New experiment, no result yet.")
        return None

    all_result = []
    domain_result = {}
    all_result_for_analysis = {}

    for domain in os.listdir(target_dir):
        domain_path = os.path.join(target_dir, domain)
        if os.path.isdir(domain_path):
            for example_id in os.listdir(domain_path):
                example_path = os.path.join(domain_path, example_id)
                if os.path.isdir(example_path):
                    result_path = os.path.join(example_path, "result.txt")
                    if os.path.isfile(result_path):
                        # Read the file once and close it promptly.
                        with open(result_path, "r") as f:
                            result = f.read().strip()

                        # Parse the score once so the per-domain and overall lists
                        # always agree. Non-numeric content is treated as a boolean
                        # ("True" -> 1.0, anything else -> 0.0) without calling eval().
                        try:
                            score = float(result)
                        except ValueError:
                            score = 1.0 if result.lower() == "true" else 0.0

                        domain_result.setdefault(domain, []).append(score)
                        all_result_for_analysis.setdefault(domain, {})[example_id] = score
                        all_result.append(score)

    if show_detailed_scores:
        # Print detailed scores in format "score/total" for each domain
        result_order = ["chrome", "gimp", "libreoffice_calc", "libreoffice_impress",
                       "libreoffice_writer", "multi_apps", "os", "thunderbird", "vlc", "vs_code"]
        output_row = []
        for d in result_order:
            if d in domain_result:
                # Two-decimal formatting keeps this consistent with the "0.00/0" fallback
                output_row.append(f"{sum(domain_result[d]):.2f}/{len(domain_result[d])}")
            else:
                output_row.append("0.00/0")
        print(" ".join(output_row))
    else:
        # Print standard per-domain statistics
        for domain in domain_result:
            print("Domain:", domain, "Runs:", len(domain_result[domain]), "Success Rate:",
                  sum(domain_result[domain]) / len(domain_result[domain]) * 100, "%")

    print(">>>>>>>>>>>>>")

    # Print category-level statistics (only when every domain of a category has results)
    categories = {
        "Office": ["libreoffice_calc", "libreoffice_impress", "libreoffice_writer"],
        "Daily": ["vlc", "thunderbird", "chrome"],
        "Professional": ["gimp", "vs_code"],
    }
    for category, domains in categories.items():
        if all(d in domain_result for d in domains):
            scores = [score for d in domains for score in domain_result[d]]
            print(category, "Success Rate:", sum(scores) / len(scores) * 100, "%")

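    # Serialize with json so that all_result.json is valid JSON; writing str(dict)
    # produces a Python repr with single quotes that JSON parsers reject.
    import json  # imported here because only this block needs it
    with open(os.path.join(target_dir, "all_result.json"), "w") as f:
        json.dump(all_result_for_analysis, f)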

    if not all_result:
        print("New experiment, no result yet.")
        return None
    else:
        print("Runs:", len(all_result), "Current Success Rate:",
              round(sum(all_result) / len(all_result) * 100, 2), "%",
              f"{round(sum(all_result), 2)}", "/", str(len(all_result)))
        return all_result


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description="Calculate and display OSWorld evaluation results"
    )
    parser.add_argument(
        "--action_space",
        type=str,
        default="pyautogui",
        help="Action space used (e.g., 'pyautogui', 'computer_13')"
    )
    parser.add_argument(
        "--model",
        type=str,
        default="gpt-4o",
        help="Model name used for evaluation (e.g., 'gpt-4o', 'claude-3')"
    )
    parser.add_argument(
        "--observation_type",
        type=str,
        default="screenshot",
        help="Observation type used (e.g., 'screenshot', 'a11y_tree', 'som')"
    )
    parser.add_argument(
        "--result_dir",
        type=str,
        default="./results",
        help="Root directory containing results (default: ./results)"
    )
    parser.add_argument(
        "--detailed",
        action="store_true",
        help="Show detailed scores per domain in format 'score/total'"
    )

    args = parser.parse_args()

    get_result(
        args.action_space,
        args.model,
        args.observation_type,
        args.result_dir,
        args.detailed
    )
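
A result.txt file may hold a plain number or a boolean literal such as "True". The sketch below shows one way to turn that text into a score without evaluating arbitrary code. parse_score is a hypothetical helper, not part of the script, and counting unreadable content as 0.0 is an assumption.

```python
import ast


def parse_score(text):
    """Convert the text of a result.txt file into a float score."""
    text = text.strip()
    try:
        return float(text)
    except ValueError:
        pass
    try:
        # literal_eval only accepts Python literals, so no code is executed.
        return float(ast.literal_eval(text))
    except (ValueError, SyntaxError, TypeError):
        # Assumption: content that cannot be read as a score counts as a failure.
        return 0.0


for raw in ["1.0", "True", "False", " 0.5\n", "garbage"]:
    print(repr(raw), "->", parse_score(raw))
```

Whether an unreadable file should count as a failure or be reported separately is a policy choice; the fallback here is the most conservative one.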