Fix plugin-task evaluation
22 days ago
feat: add X11 image handling and enhanced OCR processing
- Introduced a new function `read_x11_image` to read and convert X11 (XWD) format images to PIL Image, supporting both 24-bit and 32-bit formats.
- Enhanced the `compare_image_text` function to include checks for X11 image formats, with multiple conversion attempts using PIL, a custom reader, and netpbm tools.
- Improved error handling and logging for OCR processing, providing detailed feedback on conversion attempts and potential issues with X11 images.
- Maintained existing logic while expanding functionality for better image processing reliability.
8 months ago
fix: improve EPUB processing by checking for file existence before reading
- Added checks for the presence of "toc.ncx" and "content.opf" in the EPUB file before attempting to process them.
- Introduced debug logging to notify when these files are not found, enhancing error handling and traceability.
- Maintained existing logic while improving robustness of the EPUB processing function.
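The existence check described above can be sketched with the standard `zipfile` module. The function name `read_epub_members` and its return shape are hypothetical; only the two member names and the use of debug logging come from the commit message.

```python
import logging
import zipfile

logger = logging.getLogger(__name__)


def read_epub_members(epub_path, wanted=("toc.ncx", "content.opf")):
    """Return {member_name: bytes} for the wanted members present in the archive."""
    found = {}
    with zipfile.ZipFile(epub_path) as archive:
        names = set(archive.namelist())
        for member in wanted:
            if member not in names:
                # Log and skip instead of letting archive.read() raise KeyError.
                logger.debug("%s not found in EPUB archive", member)
                continue
            found[member] = archive.read(member)
    return found
```

Missing members are simply absent from the result, so the caller can continue with whatever the archive does contain.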
7 months ago
Make PPTX run-count comparison configurable for task b8adbc24 (#443)
Add an `examine_run_count` flag to `compare_pptx_files` (defaulting to true) and gate run-count mismatch checks for both text paragraphs and table cells. Disable this check in `b8adbc24-cef2-4b15-99d5-ecbe7ff445eb.json` to prevent false negatives from non-semantic LibreOffice run segmentation differences.
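A minimal sketch of how such a flag can gate the check, using plain lists of run texts instead of real PPTX objects. The helper name `compare_paragraph_runs` is hypothetical, and the commit does not describe how text is compared once the count check is skipped; this sketch assumes the joined text is compared.

```python
def compare_paragraph_runs(original_runs, modified_runs, examine_run_count=True):
    """Compare two paragraphs, each given as a list of run texts."""
    # The run-count check is only applied when the flag is enabled.
    if examine_run_count and len(original_runs) != len(modified_runs):
        return False
    # Assumption: with the check disabled, only the concatenated text matters.
    return "".join(original_runs) == "".join(modified_runs)
```

For example, `["Hello ", "world"]` and `["Hello world"]` contain the same text but differ in segmentation, so they only compare equal when `examine_run_count=False`.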
7 days ago
fix: Enhance error handling and logging across multiple evaluators
- Added logging for file retrieval and error handling in file.py, improving robustness during file operations.
- Implemented checks for file existence and parsing errors in general.py, enhancing reliability in JSON/YAML processing.
- Improved table comparison logic in table.py with detailed error logging for sheet loading and cell value reading.
- Enhanced metrics evaluation in slides.py with additional checks for paragraph and run counts, ensuring thorough comparison.
- Updated utils.py to include file existence checks and detailed error logging during cell value reading.
8 months ago
Merge pull request #452 from xlang-ai/dev_djlu/gpt54_agent (Dunjie Lu)
optimize gpt5.4 prompt (cda933f)
Fix VLC auto-close task: evaluate play-and-exit instead of infeasible
a month ago
Add VM machine retrieval and enhance Chrome page info handling
- Added `get_vm_machine` method to `PythonController` for retrieving the VM's machine type.
- Updated `get_page_info` function in `chrome.py` to utilize the new VM machine retrieval for determining the appropriate Chrome launch command.
- Improved error handling and logging for page loading and connection attempts.
- Adjusted JSON comparison logic in `vscode.py` for better subset checking and handling of expected results.
- Fixed minor typos in JSON example files for clarity.
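The subset checking mentioned for `vscode.py` can be illustrated with a small recursive helper. This is only a sketch of the general technique; the name `is_subset` is hypothetical and the actual comparison logic in `vscode.py` may differ.

```python
def is_subset(expected, actual):
    """Return True if every key/value in `expected` also appears in `actual`."""
    if isinstance(expected, dict):
        return isinstance(actual, dict) and all(
            key in actual and is_subset(value, actual[key])
            for key, value in expected.items()
        )
    # Non-dict values (strings, numbers, lists) must match exactly.
    return expected == actual
```

Extra keys in `actual` are ignored, so settings written by the application itself do not cause a mismatch.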
2 months ago
Calc eval fix (#273)
* ver Jun17th
updating annotations
* ver Jun17th
corrected annotation of 1d17
added check for cell merge
* ver Jun17th
updated several annotations
* ver Jun20th
fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08
* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.
* ver Jun21st
updating calc evals
* ver Jun22nd
fixed an impress task
* ver Jun22ndv2
adjusted several calc tasks
* Clean scaffolds
* ver Jul18th
added two try-excepts to handle possible formula parsing and calculation
failures
* ver Jul19th
added support for cellIs and some other new types of conditional
formatting for calc evaluation
---------
Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
8 months ago
Task fix batch (#383)
* update 873cafdd-a581-47f6-8b33-b9696ddb7b05 task eval
* c1fa57f3-c3db-4596-8f09-020701085416 fix, add tolerance to url matching
* 8df7e444-8e06-4f93-8a1a-c5c974269d82 add clearer instructions on the filename for compression
* add address string normalization for 6f4073b8-d8ea-4ade-8a18-c5d1d5d5aa9a
---------
Co-authored-by: Jiaqi <dengjiaqi@moonshot.cn>
4 months ago
Feature/task refine: Fix task configurations and evaluators for improved robustness (#420)
* Relax reference similarity threshold for APA formatting task
Reduce the reference_base_result threshold from 0.93/0.92 to 0.6 for task
2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e (APA 7th edition reference formatting).
The original threshold was overly strict and failed submissions with minor
formatting variations that are acceptable in APA style (e.g., spacing around
journal volume numbers). The new threshold of 0.6 maintains validation of
core content accuracy (authors, dates, titles, journals) while accepting
reasonable formatting variants, better reflecting real-world APA formatting
standards.
Changes:
- Ubuntu version: 0.93 -> 0.6
- Windows version: 0.92 -> 0.6
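To illustrate how a similarity threshold gates a pass/fail decision, here is a sketch that uses `difflib.SequenceMatcher` from the standard library as a stand-in metric. The evaluator's actual similarity measure is not stated in the commit, and the function names and sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(candidate, reference):
    """Return a 0..1 similarity ratio (stand-in for the evaluator's metric)."""
    return SequenceMatcher(None, candidate, reference).ratio()


def passes(candidate, reference, threshold=0.6):
    """Accept the candidate when its similarity reaches the threshold."""
    return similarity(candidate, reference) >= threshold


reference = "Smith, J. (2020). A title. Journal of Things, 12(3), 45-67."
candidate = "Smith, J. (2020). A title. Journal of Things, 12 (3), 45-67."
accepted = passes(candidate, reference)
```

Lowering the threshold widens the set of accepted formatting variants, while unrelated text still scores far below it.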
* Fix XPath selectors for TripAdvisor hotel search task
Replace brittle absolute XPath expressions with robust relative XPath
selectors for task b7895e80-f4d1-4648-bee0-4eb45a6f1fa8 (TripAdvisor
hotel search in NYC).
The original absolute XPath paths (e.g., /html/body/div[1]/main/div[3]/...)
are fragile and break when the website's DOM structure changes. The new
relative XPath expressions use stable attributes like data-automation
and aria-label, making the evaluator resilient to minor UI updates.
Changes:
- Replaced absolute paths with attribute-based selectors:
  - Check-in date: //button[@data-automation='checkin']//...
  - Check-out date: //button[@data-automation='checkout']//...
  - City header: //h2[@data-automation='header_geo_title']
  - Room/guest selector: //button[@data-automation='roomsandguests']//...
  - Price sorting: //button[contains(@aria-label,'PRICE_LOW_TO_HIGH')]//...
- Updated expected value for adult field in first test case to match
  actual output format: "Rooms/Guests1 Room, 2 Guests"
This ensures the evaluation will work correctly even when TripAdvisor
updates their page layout or CSS classes.
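The difference between the two selector styles can be shown on a toy document with `xml.etree.ElementTree`, which supports a limited XPath subset (attribute equality and positional predicates, but not `contains()`). The markup, the `lookup` helper, and the date text are made up for illustration; the real evaluator runs against the live page with a full XPath engine.

```python
import xml.etree.ElementTree as ET

ABSOLUTE = "./body/div[1]/main/div[3]/button/span"
RELATIVE = ".//button[@data-automation='checkin']/span"

OLD_LAYOUT = (
    "<html><body><div><main><div/><div/>"
    "<div><button data-automation='checkin'><span>Fri, Jan 5</span></button></div>"
    "</main></div></body></html>"
)
# Same content, but one sibling <div/> has been removed from <main>.
NEW_LAYOUT = (
    "<html><body><div><main><div/>"
    "<div><button data-automation='checkin'><span>Fri, Jan 5</span></button></div>"
    "</main></div></body></html>"
)


def lookup(page, path):
    """Return the text of the first node matching `path`, or None if absent."""
    node = ET.fromstring(page).find(path)
    return None if node is None else node.text


old_absolute = lookup(OLD_LAYOUT, ABSOLUTE)  # found: position still matches
new_absolute = lookup(NEW_LAYOUT, ABSOLUTE)  # None: div[3] no longer exists
new_relative = lookup(NEW_LAYOUT, RELATIVE)  # found: attribute is unchanged
```

The positional path breaks as soon as a sibling is added or removed, while the attribute-based path survives the layout change.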
* Fix expected output format for file compression task
Correct the expected success message format for task
37887e8c-da15-4192-923c-08fa390a176d (compress files modified 30 days ago).
The evaluation script eval_20250703.sh outputs "SUCCESS: The task was
completed correctly." with "SUCCESS" in all uppercase, but the expected
value was using "Success" with only the first letter capitalized. This
case mismatch caused the evaluator to incorrectly fail valid submissions.
Change:
- Expected message: "Success: ..." -> "SUCCESS: ..."
This ensures the string matching in check_include_exclude() correctly
identifies successful task completion.
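The case mismatch is easy to reproduce with a plain substring check. The helper below is a hypothetical stand-in for what `check_include_exclude()` is described as doing (case-sensitive string matching); its name, signature, and rule format are assumptions, not the project's actual code.

```python
def include_exclude_score(result, rules):
    """Score 1.0 if all 'include' strings occur in result and no 'exclude' string does."""
    include = rules.get("include", [])
    exclude = rules.get("exclude", [])
    ok = all(s in result for s in include) and all(s not in result for s in exclude)
    return 1.0 if ok else 0.0


script_output = "SUCCESS: The task was completed correctly."

# Python's `in` operator is case-sensitive, so "Success" does not match "SUCCESS".
old_score = include_exclude_score(
    script_output, {"include": ["Success: The task was completed correctly."]}
)
new_score = include_exclude_score(
    script_output, {"include": ["SUCCESS: The task was completed correctly."]}
)
```

Here `old_score` is 0.0 and `new_score` is 1.0, even though the script output is identical in both cases.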
* Add workspace trust settings to VS Code theme change task
Update expected configuration for task 982d12a5-beab-424f-8d38-d2a48429e511
(change VS Code color theme) to include workspace trust security settings.
When VS Code launches, it automatically adds workspace trust configuration
to settings.json to handle security prompts. The original expected output
only included the color theme setting, causing the evaluator to fail even
when the task was completed correctly.
Changes:
- Added three workspace trust settings to expected configuration:
  - security.workspace.trust.enabled: false
  - security.workspace.trust.startupPrompt: "never"
  - security.workspace.trust.emptyWindow: false
- Retained the required workbench.colorTheme setting
These settings prevent workspace trust dialogs from interfering with
automated testing and ensure the evaluator correctly matches the actual
settings.json content generated during task execution.
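A small sketch of why the original expectation failed, assuming the evaluator compares the whole `settings.json` content for equality. The theme name "Visual Studio Dark" is illustrative only, since the commit does not state the target theme; the three workspace trust keys and values are taken from the list above.

```python
import json

# Hypothetical settings.json content after the task has been completed.
actual = json.loads("""
{
  "workbench.colorTheme": "Visual Studio Dark",
  "security.workspace.trust.enabled": false,
  "security.workspace.trust.startupPrompt": "never",
  "security.workspace.trust.emptyWindow": false
}
""")

expected_before = {"workbench.colorTheme": "Visual Studio Dark"}
expected_after = {
    **expected_before,
    "security.workspace.trust.enabled": False,
    "security.workspace.trust.startupPrompt": "never",
    "security.workspace.trust.emptyWindow": False,
}

matches_before = actual == expected_before  # False: extra keys break equality
matches_after = actual == expected_after    # True: all four keys agree
```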
* Improve GIMP character extraction task evaluation robustness
Enhance task e8172110-ec08-421b-a6f5-842e6451911f (extract pixel art
character in GIMP) with more flexible evaluation and reference implementation.
Changes:
1. Add reference Python implementation:
   - Provide complete example script for automated character extraction
   - Implements background removal using color detection and threshold
   - Includes helper functions for checkerboard background preview
   - Serves as learning reference while maintaining task difficulty
2. Add new evaluation function check_structure_sim_with_threshold():
   - Customizable SSIM threshold (default 0.85, more lenient than 0.9)
   - Automatic image resizing when dimensions differ using LANCZOS
   - Comprehensive logging for debugging (comparison details, SSIM scores)
   - Handles edge cases: file not found, size mismatch, conversion errors
3. Update task configuration:
   - Switch from check_structure_sim to check_structure_sim_with_threshold
   - Set SSIM threshold to 0.85 for both GIMP and code outputs
   - Add options parameter to pass threshold to evaluator
Rationale:
- Manual GIMP extraction and Python script may produce slightly different
pixel-level results due to different image processing methods (anti-aliasing,
edge handling)
- 0.85 threshold accepts minor pixel differences while ensuring visual similarity
- Auto-resize prevents failures from slight dimension mismatches during cropping
- Reference script helps users understand automation requirements without
compromising the learning objective
This makes evaluation more realistic while maintaining visual quality standards.
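The thresholding idea can be sketched with a simplified SSIM computed over the whole image as a single window. This is not `check_structure_sim_with_threshold()` itself: the function names are hypothetical, the images are flat lists of grayscale values, and the real evaluator presumably relies on an imaging library for a windowed SSIM and for the LANCZOS resize, which is omitted here.

```python
def global_ssim(a, b, data_range=255.0):
    """Single-window SSIM over two equal-length lists of grayscale values."""
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    n = len(a)
    # Standard SSIM stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


def structure_sim_score(a, b, threshold=0.85):
    """Return 1.0 when the SSIM reaches the threshold, else 0.0."""
    return 1.0 if global_ssim(a, b) >= threshold else 0.0
```

Identical images score an SSIM of 1 and pass at any threshold up to that value, while an inverted image yields a negative covariance term and fails.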
* Delete evaluation_examples/examples/vs_code/982d12a5-beab-424f-8d38-d2a48429e511.json
Just fixed it in a different way in the last commit. Thanks!
---------
Co-authored-by: Tianbao Xie <47296835+Timothyxxx@users.noreply.github.com>
2 months ago