OS-World/xiangyi-li · BenchFlow

__init__.py
4.08 kB
Fix VLC auto-close task: evaluate play-and-exit instead of infeasible
a month ago
basic_os.py
1.81 kB
Code clean
2 years ago
chrome.py
21.2 kB
Fix plugin-task evaluation
22 days ago
docs.py
42 kB
feat: add X11 image handling and enhanced OCR processing
- Introduced a new function `read_x11_image` to read and convert X11 (XWD) format images to PIL Image, supporting both 24-bit and 32-bit formats.
- Enhanced the `compare_image_text` function to include checks for X11 image formats, with multiple conversion attempts using PIL, a custom reader, and netpbm tools.
- Improved error handling and logging for OCR processing, providing detailed feedback on conversion attempts and potential issues with X11 images.
- Maintained existing logic while expanding functionality for better image processing reliability.
8 months ago
general.py
25.2 kB
fix: Enhance error handling and logging across multiple evaluators
- Added logging for file retrieval and error handling in file.py, improving robustness during file operations.
- Implemented checks for file existence and parsing errors in general.py, enhancing reliability in JSON/YAML processing.
- Improved table comparison logic in table.py with detailed error logging for sheet loading and cell value reading.
- Enhanced metrics evaluation in slides.py with additional checks for paragraph and run counts, ensuring thorough comparison.
- Updated utils.py to include file existence checks and detailed error logging during cell value reading.
8 months ago
gimp.py
34.6 kB
Feature/task refine Fix task configurations and evaluators for improved robustness (#420)
* Relax reference similarity threshold for APA formatting task
  Reduce the reference_base_result threshold from 0.93/0.92 to 0.6 for task 2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e (APA 7th edition reference formatting).
  The original threshold was overly strict and failed submissions with minor formatting variations that are acceptable in APA style (e.g., spacing around journal volume numbers). The new threshold of 0.6 maintains validation of core content accuracy (authors, dates, titles, journals) while accepting reasonable formatting variants, better reflecting real-world APA formatting standards.
  Changes:
  - Ubuntu version: 0.93 -> 0.6
  - Windows version: 0.92 -> 0.6
* Fix XPath selectors for TripAdvisor hotel search task
  Replace brittle absolute XPath expressions with robust relative XPath selectors for task b7895e80-f4d1-4648-bee0-4eb45a6f1fa8 (TripAdvisor hotel search in NYC).
  The original absolute XPath paths (e.g., /html/body/div[1]/main/div[3]/...) are fragile and break when the website's DOM structure changes. The new relative XPath expressions use stable attributes like data-automation and aria-label, making the evaluator resilient to minor UI updates.
  Changes:
  - Replaced absolute paths with attribute-based selectors
    - Check-in date: //button[@data-automation='checkin']//...
    - Check-out date: //button[@data-automation='checkout']//...
    - City header: //h2[@data-automation='header_geo_title']
    - Room/guest selector: //button[@data-automation='roomsandguests']//...
    - Price sorting: //button[contains(@aria-label,'PRICE_LOW_TO_HIGH')]//...
  - Updated expected value for adult field in first test case to match actual output format: "Rooms/Guests1 Room, 2 Guests"
  This ensures the evaluation will work correctly even when TripAdvisor updates their page layout or CSS classes.
* Fix expected output format for file compression task
  Correct the expected success message format for task 37887e8c-da15-4192-923c-08fa390a176d (compress files modified 30 days ago).
  The evaluation script eval_20250703.sh outputs "SUCCESS: The task was completed correctly." with "SUCCESS" in all uppercase, but the expected value was using "Success" with only the first letter capitalized. This case mismatch caused the evaluator to incorrectly fail valid submissions.
  Change:
  - Expected message: "Success: ..." -> "SUCCESS: ..."
  This ensures the string matching in check_include_exclude() correctly identifies successful task completion.
* Add workspace trust settings to VS Code theme change task
  Update expected configuration for task 982d12a5-beab-424f-8d38-d2a48429e511 (change VS Code color theme) to include workspace trust security settings.
  When VS Code launches, it automatically adds workspace trust configuration to settings.json to handle security prompts. The original expected output only included the color theme setting, causing the evaluator to fail even when the task was completed correctly.
  Changes:
  - Added three workspace trust settings to expected configuration:
    - security.workspace.trust.enabled: false
    - security.workspace.trust.startupPrompt: "never"
    - security.workspace.trust.emptyWindow: false
  - Retained the required workbench.colorTheme setting
  These settings prevent workspace trust dialogs from interfering with automated testing and ensure the evaluator correctly matches the actual settings.json content generated during task execution.
* Improve GIMP character extraction task evaluation robustness
  Enhance task e8172110-ec08-421b-a6f5-842e6451911f (extract pixel art character in GIMP) with more flexible evaluation and reference implementation.
  Changes:
  1. Add reference Python implementation:
     - Provide complete example script for automated character extraction
     - Implements background removal using color detection and threshold
     - Includes helper functions for checkerboard background preview
     - Serves as learning reference while maintaining task difficulty
  2. Add new evaluation function check_structure_sim_with_threshold():
     - Customizable SSIM threshold (default 0.85, more lenient than 0.9)
     - Automatic image resizing when dimensions differ using LANCZOS
     - Comprehensive logging for debugging (comparison details, SSIM scores)
     - Handles edge cases: file not found, size mismatch, conversion errors
  3. Update task configuration:
     - Switch from check_structure_sim to check_structure_sim_with_threshold
     - Set SSIM threshold to 0.85 for both GIMP and code outputs
     - Add options parameter to pass threshold to evaluator
  Rationale:
  - Manual GIMP extraction and Python script may produce slightly different pixel-level results due to different image processing methods (anti-aliasing, edge handling)
  - 0.85 threshold accepts minor pixel differences while ensuring visual similarity
  - Auto-resize prevents failures from slight dimension mismatches during cropping
  - Reference script helps users understand automation requirements without compromising the learning objective
  This makes evaluation more realistic while maintaining visual quality standards.
* Delete evaluation_examples/examples/vs_code/982d12a5-beab-424f-8d38-d2a48429e511.json
  Just fixed it in a different way in the last commit. Thanks!
---------
Co-authored-by: Tianbao Xie <47296835+Timothyxxx@users.noreply.github.com>
2 months ago
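The threshold-based check described in the gimp.py commit above can be sketched roughly as follows. This is only an illustration, not the repository's `check_structure_sim_with_threshold`: it computes a single-window (global) SSIM over flat grayscale pixel lists in pure Python instead of a windowed SSIM over image files, it skips the LANCZOS resize step, and the names `global_ssim` and `passes_threshold` are hypothetical. Only the 0.85 default comes from the commit message.

```python
def global_ssim(a, b, max_value=255.0):
    """Single-window SSIM over two equal-length grayscale pixel sequences.

    Assumption: a simplified stand-in for a windowed SSIM implementation.
    """
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(a)
    # Standard SSIM stabilising constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * max_value) ** 2
    c2 = (0.03 * max_value) ** 2
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def passes_threshold(a, b, threshold=0.85):
    """Return True when the similarity score reaches the threshold."""
    return global_ssim(a, b) >= threshold


if __name__ == "__main__":
    original = [0, 255, 0, 255]
    inverted = [255, 0, 255, 0]
    print(passes_threshold(original, original))  # identical pixels
    print(passes_threshold(original, inverted))  # inverted pixels
```

Making the threshold a parameter lets one task accept small pixel-level differences while other tasks keep a stricter default.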
libreoffice.py
1.21 kB
Code clean
2 years ago
others.py
3.85 kB
fix: improve EPUB processing by checking for file existence before reading
- Added checks for the presence of "toc.ncx" and "content.opf" in the EPUB file before attempting to process them.
- Introduced debug logging to notify when these files are not found, enhancing error handling and traceability.
- Maintained existing logic while improving robustness of the EPUB processing function.
7 months ago
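A minimal sketch of the existence check described in the others.py commit above, assuming the EPUB is treated as a ZIP archive and the two files are looked up by exact member name. The function name `read_epub_members` and the in-memory input are hypothetical and are not taken from others.py.

```python
import io
import logging
import zipfile

logger = logging.getLogger(__name__)


def read_epub_members(epub_bytes, wanted=("toc.ncx", "content.opf")):
    """Read only the wanted members that actually exist in the archive."""
    found = {}
    with zipfile.ZipFile(io.BytesIO(epub_bytes)) as archive:
        names = set(archive.namelist())
        for name in wanted:
            if name in names:
                found[name] = archive.read(name)
            else:
                # Log and skip instead of raising KeyError on a missing member.
                logger.debug("%s not found in EPUB archive", name)
    return found


if __name__ == "__main__":
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("toc.ncx", "<ncx/>")
    print(sorted(read_epub_members(buffer.getvalue())))
```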
pdf.py
801 B
Code clean
2 years ago
slides.py
55.1 kB
Make PPTX run-count comparison configurable for task b8adbc24 (#443)
Add an `examine_run_count` flag to `compare_pptx_files` (defaulting to true) and gate run-count mismatch checks for both text paragraphs and table cells.
Disable this check in `b8adbc24-cef2-4b15-99d5-ecbe7ff445eb.json` to prevent false negatives from non-semantic LibreOffice run segmentation differences.
7 days ago
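The effect of gating the run-count check can be illustrated with a small sketch. It models a paragraph as a plain list of run strings rather than a python-pptx object, and `paragraphs_match` is a hypothetical helper, not part of `compare_pptx_files`. Only the flag name `examine_run_count` and its default of true come from the commit message.

```python
def paragraphs_match(runs_a, runs_b, examine_run_count=True):
    """Compare two paragraphs given as lists of run text strings."""
    # The concatenated text must always agree.
    if "".join(runs_a) != "".join(runs_b):
        return False
    # Run segmentation is only compared when the flag is enabled.
    if examine_run_count and len(runs_a) != len(runs_b):
        return False
    return True


if __name__ == "__main__":
    one_run = ["Hello world"]
    two_runs = ["Hello ", "world"]
    print(paragraphs_match(one_run, two_runs))
    print(paragraphs_match(one_run, two_runs, examine_run_count=False))
```

With the flag disabled, two paragraphs whose text is identical but split into different runs are treated as equal, which is the false negative the commit describes.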
table.py
30.4 kB
Task fix batch (#383)
* update 873cafdd-a581-47f6-8b33-b9696ddb7b05 task eval
* c1fa57f3-c3db-4596-8f09-020701085416 fix, add tolerance to url matching
* 8df7e444-8e06-4f93-8a1a-c5c974269d82 add more clear instruction to the filename for compress
* add address string normalization for 6f4073b8-d8ea-4ade-8a18-c5d1d5d5aa9a
---------
Co-authored-by: Jiaqi <dengjiaqi@moonshot.cn>
4 months ago
thunderbird.py
6.63 kB
Code clean
2 years ago
utils.py
35.8 kB
Calc eval fix (#273)
* ver Jun17th updating annotations
* ver Jun17th corrected annotation of 1d17 added check for cell merge
* ver Jun17th updated several annotations
* ver Jun20th fixed set-up config of 2bd59342-0664-4ccb-ba87-79379096cc08
* fix: Enhance instructions in LibreOffice Calc examples for clarity and specificity, including details on using Pivot Tables, column placements, and revenue calculations.
* ver Jun21st updating calc evals
* ver Jun22nd fixed an impress task
* ver Jun22ndv2 adjusted several calc tasks
* Clean scalfolds
* ver Jul18th added two try-excepts to handle possible formula parsing and calculation failures
* ver Jul19th added supports for cellIs and some other new types of conditional formatting for calc evaluation
---------
Co-authored-by: BowenBryanWang <bryanwang.nlp@connect.hku.hk>
Co-authored-by: yuanmengqi <yuanmengqi@mail.ustc.edu.cn>
8 months ago
vlc.py
21 kB
Fix VLC auto-close task: evaluate play-and-exit instead of infeasible
a month ago
vscode.py
13.9 kB
Add VM machine retrieval and enhance Chrome page info handling
- Added `get_vm_machine` method to `PythonController` for retrieving the VM's machine type.
- Updated `get_page_info` function in `chrome.py` to utilize the new VM machine retrieval for determining the appropriate Chrome launch command.
- Improved error handling and logging for page loading and connection attempts.
- Adjusted JSON comparison logic in `vscode.py` for better subset checking and handling of expected results.
- Fixed minor typos in JSON example files for clarity.
2 months ago
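The subset-style JSON comparison mentioned for `vscode.py` could look roughly like the sketch below. It assumes "subset" means every expected key must appear in the actual settings with an equal value, while extra keys are ignored. `is_json_subset` is a hypothetical name, the theme value is purely illustrative, and the real comparison in vscode.py may differ.

```python
def is_json_subset(expected, actual):
    """Return True if every expected key/value is present in actual."""
    if isinstance(expected, dict):
        if not isinstance(actual, dict):
            return False
        return all(
            key in actual and is_json_subset(value, actual[key])
            for key, value in expected.items()
        )
    # Lists and scalars are compared by plain equality in this sketch.
    return expected == actual


if __name__ == "__main__":
    expected = {"workbench.colorTheme": "Visual Studio Dark"}  # illustrative value
    actual = {
        "workbench.colorTheme": "Visual Studio Dark",
        "security.workspace.trust.enabled": False,
        "security.workspace.trust.startupPrompt": "never",
    }
    print(is_json_subset(expected, actual))
```

A subset check of this kind tolerates settings that the application writes on its own, such as workspace trust keys, without listing them in the expected result.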