Organize run scripts into structured directories (#424)
* Organize run scripts into structured directories
Move all run_*.py and run_*.sh scripts from the root directory into a new scripts/ directory with the following structure:
- scripts/python/ - Contains all Python run scripts (29 files)
- scripts/bash/ - Contains all bash scripts (2 files)
This improves repository organization and makes it easier to locate and manage model run scripts.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Fix import paths and update documentation for reorganized scripts
Changes:
- Added sys.path configuration to all Python scripts in scripts/python/
  to enable imports from the project root
- Updated README.md with new script paths (scripts/python/run_multienv.py)
- Enhanced scripts/README.md with detailed usage instructions and
  technical details about path resolution
- All scripts now work correctly when run from project root directory
Technical details:
- Each script now includes: sys.path.insert(0, os.path.join(os.path.dirname(__file__), "../.."))
- This allows scripts to import lib_run_single, desktop_env, and mm_agents modules
- Scripts must be run from the OSWorld root directory (not from the scripts/ subdirectory)
Tested: python scripts/python/run_multienv.py --help works correctly
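For illustration, the path setup described above can be sketched as follows. This is a minimal sketch, not the code in the scripts: the helper names project_root_from and add_project_root are hypothetical, and it wraps the path in os.path.abspath, which the one-liner quoted above does not do.

```python
import os
import sys


def project_root_from(script_path: str) -> str:
    """Return the directory two levels above the script's own directory.

    For scripts/python/run_multienv.py this resolves to the repository root.
    (Hypothetical helper; the scripts use an inline sys.path.insert call.)
    """
    script_dir = os.path.dirname(os.path.abspath(script_path))
    return os.path.normpath(os.path.join(script_dir, "..", ".."))


def add_project_root(script_path: str) -> str:
    """Put the project root at the front of sys.path if it is not already listed."""
    root = project_root_from(script_path)
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

A script would call add_project_root(__file__) before importing lib_run_single, desktop_env, or mm_agents. This only affects imports; relative paths inside the scripts still depend on the working directory, which is why they are run from the root.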
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add manual examination tool and remove deprecated main.py
Changes:
- Added scripts/python/manual_examine.py for manual task verification
  - Fixed imports with sys.path configuration
  - Allows manual execution and verification of benchmark tasks
  - Records screenshots, videos, and evaluation results
- Added scripts/bash/run_manual_examine.sh with example task IDs
- Updated README.md with manual examination section
- Updated scripts/README.md with manual examination documentation
- Removed main.py (replaced by manual_examine.py)
The manual examination tool provides:
- Manual task execution in the environment
- Task correctness verification
- Execution recording with screenshots and videos
- Examination of specific problematic tasks
Usage:
python scripts/python/manual_examine.py \
    --domain libreoffice_impress \
    --example_id a669ef01-ded5-4099-9ea9-25e99b569840 \
    --headless \
    --observation_type screenshot \
    --result_dir ./results_human_examine
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Update show_result.py with detailed scores and argument parsing
Changes:
- Added argparse for command-line argument parsing
- Added --detailed flag to show compact "score/total" format per domain
- Removed hardcoded example paths
- Added comprehensive docstring for get_result function
- Added parameter descriptions and help text
- Updated README.md with detailed usage examples
New features:
- Standard mode: Shows per-domain success rates and statistics
- Detailed mode (--detailed): Shows compact "score/total" format
- All parameters now configurable via command line
- Better error handling for missing domains in category statistics
Usage examples:
python show_result.py
python show_result.py --model gpt-4o --detailed
python show_result.py --result_dir ./custom_results --action_space computer_13
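As a rough sketch of how the flags above could be wired up with argparse: the flag names come from the usage examples, but the default values, help strings, and the format_domain helper are assumptions made for illustration and may not match show_result.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the four flags named in the usage examples.

    All default values here are assumed, not taken from show_result.py.
    """
    parser = argparse.ArgumentParser(description="Summarize benchmark results.")
    parser.add_argument("--result_dir", default="./results",
                        help="directory containing result files (assumed default)")
    parser.add_argument("--model", default="gpt-4o",
                        help="model name to report on (assumed default)")
    parser.add_argument("--action_space", default="pyautogui",
                        help="action space of the run (assumed default)")
    parser.add_argument("--detailed", action="store_true",
                        help='show compact "score/total" per domain')
    return parser


def format_domain(domain: str, score: float, total: int, detailed: bool) -> str:
    """Hypothetical formatter: compact score/total, or a success rate."""
    if detailed:
        return f"{domain}: {score}/{total}"
    rate = 100.0 * score / total if total else 0.0
    return f"{domain}: {rate:.1f}% ({total} tasks)"
```

With this sketch, build_parser().parse_args(["--model", "gpt-4o", "--detailed"]) yields detailed=True and leaves the other options at their assumed defaults.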
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Add note about bash scripts and community contributions
Added a note in scripts/README.md explaining that:
- Many bash scripts were not preserved during reorganization
- More bash scripts will be gradually added in future updates
- Community contributions are welcome
This provides transparency about the current state and encourages
community participation in expanding the bash scripts collection.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Merge lib_run_single files into unified lib_run_single.py
Changes:
- Merged lib_run_single_mobileagent_v3.py into lib_run_single.py
  - Added run_single_example_mobileagent_v3() function
- Merged lib_run_single_os_symphony.py into lib_run_single.py
  - run_single_example_os_symphony() was already present
- Removed lib_run_single_mobileagent_v3.py
- Removed lib_run_single_os_symphony.py
- Updated scripts/python/run_multienv_mobileagent_v3.py to use unified lib_run_single
Benefits:
- Single source of truth for all run_single_example functions
- Easier maintenance and consistency
- Reduced code duplication
- All specialized agent functions in one place
All run_single_example functions now available in lib_run_single.py:
- run_single_example (default)
- run_single_example_human
- run_single_example_agi
- run_single_example_openaicua
- run_single_example_opencua
- run_single_example_autoglm
- run_single_example_mano
- run_single_example_uipath
- run_single_example_os_symphony
- run_single_example_evocua
- run_single_example_mobileagent_v3 (newly merged)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Consolidate setup guidelines and remove empty CONTRIBUTION.md
Changes:
- Created unified SETUP_GUIDELINE.md merging:
  - ACCOUNT_GUIDELINE.md (Google account setup)
  - PROXY_GUIDELINE.md (Proxy configuration)
  - PUBLIC_EVALUATION_GUIDELINE.md (AWS platform setup)
- Removed CONTRIBUTION.md (empty file)
- Removed individual guideline files
- Updated all references in README.md to point to SETUP_GUIDELINE.md
Benefits:
- Single comprehensive guide for all setup needs
- Better organization with clear table of contents
- Easier to maintain and update
- Reduced file clutter in repository root
The new SETUP_GUIDELINE.md includes:
1. Google Account Setup - OAuth 2.0 configuration for Google Drive tasks
2. Proxy Configuration - For users behind firewalls or the Great Firewall (GFW)
3. Public Evaluation Platform - AWS-based parallel evaluation setup
All sections are properly cross-referenced and include detailed
step-by-step instructions with screenshots and troubleshooting tips.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>