Mirror external setup/eval download links to HF cache
12 days ago
Feature/task refine Fix task configurations and evaluators for improved robustness (#420)
* Relax reference similarity threshold for APA formatting task
Reduce the reference_base_result threshold from 0.93/0.92 to 0.6 for task
2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e (APA 7th edition reference formatting).
The original threshold was overly strict and failed submissions with minor
formatting variations that are acceptable in APA style (e.g., spacing around
journal volume numbers). The new threshold of 0.6 maintains validation of
core content accuracy (authors, dates, titles, journals) while accepting
reasonable formatting variants, better reflecting real-world APA formatting
standards.
Changes:
- Ubuntu version: 0.93 -> 0.6
- Windows version: 0.92 -> 0.6
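
The effect of lowering the threshold can be sketched as follows. The evaluator's actual similarity metric is not shown in this commit, so difflib's SequenceMatcher is used here purely as a stand-in, and the function names are hypothetical:

```python
from difflib import SequenceMatcher


def similarity(expected: str, actual: str) -> float:
    """Return a ratio in [0, 1]; a stand-in for the evaluator's real metric."""
    return SequenceMatcher(None, expected, actual).ratio()


def passes(expected: str, actual: str, threshold: float) -> bool:
    """Accept the submission when the similarity reaches the threshold."""
    return similarity(expected, actual) >= threshold


# Hypothetical reference strings, differing only in spacing around the volume.
expected = "Doe, J. (2020). A title. Some Journal, 12(3), 45-67."
actual = "Doe, J. (2020). A title. Some Journal, 12 (3), 45-67."

print(similarity(expected, actual))
print(passes(expected, actual, threshold=0.6))
```

A lower threshold widens the set of accepted variants; unrelated text still scores near zero and is rejected.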
* Fix XPath selectors for TripAdvisor hotel search task
Replace brittle absolute XPath expressions with robust relative XPath
selectors for task b7895e80-f4d1-4648-bee0-4eb45a6f1fa8 (TripAdvisor
hotel search in NYC).
The original absolute XPath paths (e.g., /html/body/div[1]/main/div[3]/...)
are fragile and break when the website's DOM structure changes. The new
relative XPath expressions use stable attributes like data-automation
and aria-label, making the evaluator resilient to minor UI updates.
Changes:
- Replaced absolute paths with attribute-based selectors
- Check-in date: //button[@data-automation='checkin']//...
- Check-out date: //button[@data-automation='checkout']//...
- City header: //h2[@data-automation='header_geo_title']
- Room/guest selector: //button[@data-automation='roomsandguests']//...
- Price sorting: //button[contains(@aria-label,'PRICE_LOW_TO_HIGH')]//...
- Updated expected value for adult field in first test case to match
actual output format: "Rooms/Guests1 Room, 2 Guests"
This ensures the evaluation will work correctly even when TripAdvisor
updates their page layout or CSS classes.
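
The difference between the two selector styles can be illustrated with a minimal sketch. The markup below is invented for illustration and is not TripAdvisor's real DOM; only the data-automation='checkin' attribute comes from the change list above. The standard library's ElementTree supports a limited XPath subset (no contains()), so only the attribute-equality selector is shown:

```python
import xml.etree.ElementTree as ET

# Invented markup: the same button before and after a layout change that
# inserts a banner <div> and removes one wrapper inside <main>.
OLD_LAYOUT = (
    "<html><body><div><main><div/><div/><div>"
    "<button data-automation='checkin'><span>Mar 1</span></button>"
    "</div></main></div></body></html>"
)
NEW_LAYOUT = (
    "<html><body><div><div class='banner'/></div><div><main><div/><div>"
    "<button data-automation='checkin'><span>Mar 1</span></button>"
    "</div></main></div></body></html>"
)

# Position-based path, relative to the <html> root element.
ABSOLUTE = "./body/div[1]/main/div[3]/button/span"
# Attribute-based path that ignores where the button sits in the tree.
RELATIVE = ".//button[@data-automation='checkin']/span"


def text_at(markup: str, path: str):
    """Return the text of the first node matching path, or None."""
    node = ET.fromstring(markup).find(path)
    return None if node is None else node.text


for name, markup in (("old", OLD_LAYOUT), ("new", NEW_LAYOUT)):
    print(name, text_at(markup, ABSOLUTE), text_at(markup, RELATIVE))
```

The position-based path only resolves against the old layout, while the attribute-based path resolves against both.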
* Fix expected output format for file compression task
Correct the expected success message format for task
37887e8c-da15-4192-923c-08fa390a176d (compress files modified 30 days ago).
The evaluation script eval_20250703.sh outputs "SUCCESS: The task was
completed correctly." with "SUCCESS" in all uppercase, but the expected
value was using "Success" with only the first letter capitalized. This
case mismatch caused the evaluator to incorrectly fail valid submissions.
Change:
- Expected message: "Success: ..." -> "SUCCESS: ..."
This ensures the string matching in check_include_exclude() correctly
identifies successful task completion.
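
Why the case mismatch matters can be shown with a minimal sketch. The real check_include_exclude() is not reproduced in this commit; the function below is a hypothetical simplification that assumes plain case-sensitive substring matching:

```python
def check_include_exclude_sketch(result: str, include=(), exclude=()) -> float:
    """Hypothetical simplification: every include string must appear in
    result, and no exclude string may appear. Matching is case-sensitive."""
    if any(text not in result for text in include):
        return 0.0
    if any(text in result for text in exclude):
        return 0.0
    return 1.0


script_output = "SUCCESS: The task was completed correctly."

# Old expectation (capitalized only) versus the corrected one (uppercase).
print(check_include_exclude_sketch(script_output, include=["Success"]))
print(check_include_exclude_sketch(script_output, include=["SUCCESS"]))
```

Under this assumption the old expectation never matches the script's output, which is consistent with the failures described above.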
* Add workspace trust settings to VS Code theme change task
Update expected configuration for task 982d12a5-beab-424f-8d38-d2a48429e511
(change VS Code color theme) to include workspace trust security settings.
When VS Code launches, it automatically adds workspace trust configuration
to settings.json to handle security prompts. The original expected output
only included the color theme setting, causing the evaluator to fail even
when the task was completed correctly.
Changes:
- Added three workspace trust settings to expected configuration:
  - security.workspace.trust.enabled: false
  - security.workspace.trust.startupPrompt: "never"
  - security.workspace.trust.emptyWindow: false
- Retained the required workbench.colorTheme setting
These settings prevent workspace trust dialogs from interfering with
automated testing and ensure the evaluator correctly matches the actual
settings.json content generated during task execution.
* Improve GIMP character extraction task evaluation robustness
Enhance task e8172110-ec08-421b-a6f5-842e6451911f (extract pixel art
character in GIMP) with more flexible evaluation and reference implementation.
Changes:
1. Add reference Python implementation:
   - Provide complete example script for automated character extraction
   - Implements background removal using color detection and threshold
   - Includes helper functions for checkerboard background preview
   - Serves as learning reference while maintaining task difficulty
2. Add new evaluation function check_structure_sim_with_threshold():
   - Customizable SSIM threshold (default 0.85, more lenient than 0.9)
   - Automatic image resizing when dimensions differ using LANCZOS
   - Comprehensive logging for debugging (comparison details, SSIM scores)
   - Handles edge cases: file not found, size mismatch, conversion errors
3. Update task configuration:
   - Switch from check_structure_sim to check_structure_sim_with_threshold
   - Set SSIM threshold to 0.85 for both GIMP and code outputs
   - Add options parameter to pass threshold to evaluator
Rationale:
- Manual GIMP extraction and Python script may produce slightly different
pixel-level results due to different image processing methods (anti-aliasing,
edge handling)
- 0.85 threshold accepts minor pixel differences while ensuring visual similarity
- Auto-resize prevents failures from slight dimension mismatches during cropping
- Reference script helps users understand automation requirements without
compromising the learning objective
This makes evaluation more realistic while maintaining visual quality standards.
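
The thresholded comparison can be sketched in pure Python. This is not the code of check_structure_sim_with_threshold(): it assumes a single-window (global) SSIM over flattened grayscale pixels rather than the windowed SSIM an image library would compute, and it omits the LANCZOS resize and file handling. All names are hypothetical:

```python
from statistics import fmean


def global_ssim(x, y, data_range=255.0):
    """Single-window SSIM over two equally sized grayscale pixel sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and the same size")
    # Standard stabilizing constants with K1 = 0.01 and K2 = 0.03.
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = fmean((a - mean_x) ** 2 for a in x)
    var_y = fmean((b - mean_y) ** 2 for b in y)
    cov = fmean((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    numerator = (2 * mean_x * mean_y + c1) * (2 * cov + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def passes_ssim(x, y, threshold=0.85):
    """Return 1.0 when the SSIM score reaches the threshold, else 0.0."""
    return 1.0 if global_ssim(x, y) >= threshold else 0.0


reference = [0, 255, 0, 255]
slightly_off = [2, 250, 1, 253]   # small pixel-level differences
inverted = [255, 0, 255, 0]       # structurally opposite image

print(passes_ssim(reference, slightly_off))
print(passes_ssim(reference, inverted))
```

Small pixel-level deviations keep the score close to 1, whereas a structurally different image produces a negative covariance term and falls well below the threshold.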
* Delete evaluation_examples/examples/vs_code/982d12a5-beab-424f-8d38-d2a48429e511.json
Just fixed it in a different way in the last commit. Thanks!
---------
Co-authored-by: Tianbao Xie <47296835+Timothyxxx@users.noreply.github.com>
2 months ago
Dunjie Lu: Merge pull request #452 from xlang-ai/dev_djlu/gpt54_agent
optimize gpt5.4 prompt (cda933f)
refactor: update URLs in multiple JSON files to ensure proper encoding of special characters
9 months ago
Fix Duplicate ids; Remove unused JSON files across multiple applications
a year ago
feat: Migrate OSWorld files to HuggingFace cache with comprehensive documentation
- Add detailed README for file cache repository
- Implement migration script with retry logic and browser simulation
- Support automatic file type detection and deduplication
- Ensure reliable hosting for OSWorld evaluation files
9 months ago
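
The retry and deduplication ideas named in this commit can be sketched as follows. The migration script itself is not shown here, so every name below is hypothetical; the sketch assumes exponential backoff on OSError and deduplication by SHA-256 content hash, and leaves out browser simulation and file type detection:

```python
import hashlib
import time


def with_retry(fetch, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch() until it succeeds, backing off exponentially on OSError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


def store_deduplicated(cache: dict, content: bytes) -> str:
    """Store content under its SHA-256 digest; identical bytes share one entry."""
    key = hashlib.sha256(content).hexdigest()
    cache.setdefault(key, content)
    return key


# Usage with a fake download that fails twice before succeeding.
calls = {"count": 0}


def flaky_download() -> bytes:
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError("temporary network failure")
    return b"file contents"


cache = {}
data = with_retry(flaky_download, sleep=lambda seconds: None)
first_key = store_deduplicated(cache, data)
second_key = store_deduplicated(cache, b"file contents")
print(calls["count"], first_key == second_key, len(cache))
```

Passing sleep as a parameter keeps the backoff testable without real delays.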