OS-World/xiangyi-li · BenchFlow

030eeff7-b492-4218-b312-701ec99ee0cc.json
1.43 kB
feat: enhance evaluator configuration for Chrome with post-execution commands - Added postconfig commands to multiple JSON files for Chrome evaluation examples. - Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing. - Updated logging messages in the AWS manager to improve clarity and user experience. These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
8 months ago
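The post-execution steps this commit describes (terminate Chrome, relaunch it with remote debugging, pause for timing) could be sketched as below. This is only an illustration: the step types `launch` and `sleep`, the `parameters` layout, the port number, and the helper name `build_chrome_postconfig` are assumptions, not taken from the repository. Only `pkill` and Chrome's `--remote-debugging-port` flag are known to exist.

```python
import json


def build_chrome_postconfig(debug_port=1337, wait_seconds=3):
    """Return a hypothetical list of post-execution steps for a Chrome task."""
    return [
        # Stop any running Chrome so the relaunch starts from a clean process.
        {"type": "launch", "parameters": {"command": ["pkill", "chrome"]}},
        {"type": "sleep", "parameters": {"seconds": wait_seconds}},
        # Relaunch Chrome with the DevTools remote debugging port open.
        {
            "type": "launch",
            "parameters": {
                "command": ["google-chrome", f"--remote-debugging-port={debug_port}"]
            },
        },
        # Give the browser time to start before the evaluator connects.
        {"type": "sleep", "parameters": {"seconds": wait_seconds}},
    ]


postconfig = build_chrome_postconfig()
print(json.dumps(postconfig, indent=2))
```

The sleep steps stand in for the "sleep intervals for timing" mentioned in the commit; suitable durations would depend on the environment.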
06fe7178-4491-4589-810f-2e2bc9502122.json
1.46 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
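A minimal sketch of the standardization this commit describes, assuming each example is a JSON object with an `id` key and that the set of fixed-IP task IDs is supplied by the caller. The function name `standardize_fields` and the sample IDs are hypothetical.

```python
def standardize_fields(task, fixed_ip_ids):
    """Return a copy of a task config with both standardized fields present."""
    updated = dict(task)
    # `fixed_ip` is true only for tasks in the fixed-IP list, false otherwise.
    updated["fixed_ip"] = task.get("id") in fixed_ip_ids
    # Add the default only when the field is missing; existing values survive.
    updated.setdefault("possibility_of_env_change", "low")
    return updated


fixed_ip_ids = {"task-b"}  # hypothetical IDs, for illustration only
plain = standardize_fields({"id": "task-a"}, fixed_ip_ids)
preset = standardize_fields(
    {"id": "task-b", "possibility_of_env_change": "medium"}, fixed_ip_ids
)
print(plain)
print(preset)
```

Using `setdefault` is what keeps the pre-existing "medium" and "high" values untouched while filling in "low" everywhere else.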
0d8b7de3-e8de-4d86-b9fd-dd2dce58a217.json
1.58 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
12086550-11c0-466b-b367-1d9e75b3910e.json
1.17 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
121ba48f-9e17-48ce-9bc6-a4fb17a7ebba.json
1.29 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
1704f00f-79e6-43a7-961b-cedd3724d5fd.json
2.4 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
2888b4e6-5b47-4b57-8bf5-c73827890774.json
1.58 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
2ad9387a-65d8-4e33-ad5b-7580065a27ca.json
1.51 kB
feat: enhance evaluator configuration for Chrome with post-execution commands - Added postconfig commands to multiple JSON files for Chrome evaluation examples. - Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing. - Updated logging messages in the AWS manager to improve clarity and user experience. These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
8 months ago
2ae9ba84-3a0d-4d4c-8338-3a1478dc5fe3.json
1.53 kB
feat: enhance evaluator configuration for Chrome with post-execution commands - Added postconfig commands to multiple JSON files for Chrome evaluation examples. - Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing. - Updated logging messages in the AWS manager to improve clarity and user experience. These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
8 months ago
3299584d-8f11-4457-bf4c-ce98f7600250.json
1.6 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
35253b65-1c19-4304-8aa4-6884b8218fc0.json
1.26 kB
Refine task evaluators and configurations for higher reliability (#422) * fix chrome & thunderbird & vlc & impress * Delete evaluation_examples/examples/chrome/44ee5668-ecd5-4366-a6ce-c1c9b8d4e938.json * Update 15c3b339-88f7-4a86-ab16-e71c58dcb01e.json updated instructions synchronously with the new evaluation approach --------- Co-authored-by: Tianbao Xie <47296835+Timothyxxx@users.noreply.github.com> Co-authored-by: Danyang Zhang <zdy004007@126.com>
a month ago
368d9ba4-203c-40c1-9fa3-da2f1430ce63.json
1.78 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
3720f614-37fd-4d04-8a6b-76f54f8c222d.json
581 B
Update evaluation examples with clearer instructions and improved configurations - Revised instructions in multiple JSON examples to enhance clarity and specificity, ensuring users understand the tasks better. - Adjusted configurations for tasks involving Chrome, GIMP, LibreOffice, and VLC to reflect more precise requirements and expected behaviors. - These changes aim to improve the usability and effectiveness of the evaluation examples, making them more intuitive for users.
a month ago
44ee5668-ecd5-4366-a6ce-c1c9b8d4e938.json
9.56 kB
Restore chrome example 44ee5668 (#427)
a month ago
47543840-672a-467d-80df-8f7c3b9788c9.json
3.18 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
480bcfea-d68f-4aaa-a0a9-2589ef319381.json
900 B
fix: refine task evaluators for Chrome and Thunderbird (#445) - chrome/480bcfea: mark task as infeasible - thunderbird/15c3b339: relax evaluator to only require email and password (remove Anonym Tester and IMAP rules; instruction says stay on page) - thunderbird/dd84e895: require every message starred in Bills folder (SQL now checks starred count = total count, not just sum(1) > 0)
5 days ago
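The difference between the old and new Thunderbird check (at least one starred message versus every message starred) can be illustrated with an in-memory SQLite table. The `messages` table and its columns are hypothetical and do not reflect Thunderbird's actual schema or the evaluator's real query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE messages (id INTEGER PRIMARY KEY, folder TEXT, starred INTEGER)"
)
conn.executemany(
    "INSERT INTO messages (folder, starred) VALUES (?, ?)",
    [("Bills", 1), ("Bills", 0), ("Bills", 1)],
)


def any_starred(db, folder):
    """Loose check: passes as soon as one message in the folder is starred."""
    row = db.execute(
        "SELECT COALESCE(SUM(starred), 0) FROM messages WHERE folder = ?", (folder,)
    ).fetchone()
    return row[0] > 0


def all_starred(db, folder):
    """Strict check: the starred count must equal the total count."""
    total, starred = db.execute(
        "SELECT COUNT(*), COALESCE(SUM(starred), 0) FROM messages WHERE folder = ?",
        (folder,),
    ).fetchone()
    return total > 0 and starred == total


loose_before = any_starred(conn, "Bills")
strict_before = all_starred(conn, "Bills")
conn.execute("UPDATE messages SET starred = 1 WHERE folder = 'Bills'")
strict_after = all_starred(conn, "Bills")
print(loose_before, strict_before, strict_after)
```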
59155008-fe71-45ec-8a8f-dc35497b6aa8.json
1.28 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
6766f2b8-8a72-417f-a9e5-56fcaa735837.json
1.81 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
6c4c23a1-42a4-43cc-9db1-2f86ff3738cc.json
1.84 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
7a5a7856-f1b6-42a4-ade9-1ca81ca0f263.json
1.77 kB
feat: enhance evaluator configuration for Chrome with post-execution commands - Added postconfig commands to multiple JSON files for Chrome evaluation examples. - Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing. - Updated logging messages in the AWS manager to improve clarity and user experience. These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
8 months ago
7b6c7e24-c58a-49fc-a5bb-d57b80e5b4c3.json
1.91 kB
feat: enhance evaluator configuration for Chrome with post-execution commands - Added postconfig commands to multiple JSON files for Chrome evaluation examples. - Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing. - Updated logging messages in the AWS manager to improve clarity and user experience. These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
8 months ago
7f52cab9-535c-4835-ac8c-391ee64dc930.json
2.02 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
82279c77-8fc6-46f6-9622-3ba96f61b477.json
1.49 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
82bc8d6a-36eb-4d2d-8801-ef714fb1e55a.json
1.72 kB
Allow ARN as valid Stockholm code for chrome 82bc8 task
a month ago
93eabf48-6a27-4cb6-b963-7d5fe1e0d3a9.json
2.34 kB
Fix chrome dark-mode task evaluation for appearance settings
24 days ago
9656a811-9b5b-4ddf-99c7-5117bcef0626.json
2.32 kB
Add safe browsing feature to Chrome evaluator - Implemented `get_enable_safe_browsing` function to retrieve safe browsing settings based on the operating system. - Updated the `__init__.py` to include the new function. - Modified JSON examples to reflect the change from enabling enhanced safety browsing to enabling safe browsing. - Added necessary commands in the JSON examples for setting up preferences for safe browsing.
5 months ago
99146c54-4f37-4ab8-9327-5f3291665e1e.json
1.44 kB
feat: enhance evaluator configuration for Chrome with post-execution commands - Added postconfig commands to multiple JSON files for Chrome evaluation examples. - Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing. - Updated logging messages in the AWS manager to improve clarity and user experience. These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
8 months ago
9f3f70fc-5afc-4958-a7b7-3bb4fcb01805.json
1.61 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
9f935cce-0a9f-435f-8007-817732bfc0a5.json
1.27 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
a728a36e-8bf1-4bb6-9a03-ef039a5233f0.json
1.32 kB
Refactor evaluator functions in JSON examples to use URL pattern matching. Update expected URL formats to regex patterns for better validation in chrome evaluation examples.
5 months ago
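As a rough illustration of why a regex pattern validates URLs more flexibly than an exact string, the sketch below matches a whole URL against a pattern. The pattern, the `example.com` URLs, and the helper `url_matches` are assumptions for illustration and are not the evaluator's actual function or expected values.

```python
import re


def url_matches(pattern, url):
    """Return True when the entire URL matches the regular expression."""
    return re.fullmatch(pattern, url) is not None


# Hypothetical pattern: optional "www.", optional trailing slash, optional query.
pattern = r"https?://(www\.)?example\.com/search/?(\?.*)?"

print(url_matches(pattern, "https://www.example.com/search?q=hotels"))
print(url_matches(pattern, "https://example.com/search/"))
print(url_matches(pattern, "https://example.com/other"))
```

An exact-string comparison would reject the first two URLs unless each variant were listed separately.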
a96b564e-dbe9-42c3-9ccf-b4498073438a.json
1.28 kB
Update evaluation examples with clearer instructions and improved configurations - Revised instructions in multiple JSON examples to enhance clarity and specificity, ensuring users understand the tasks better. - Adjusted configurations for tasks involving Chrome, GIMP, LibreOffice, and VLC to reflect more precise requirements and expected behaviors. - These changes aim to improve the usability and effectiveness of the evaluation examples, making them more intuitive for users.
a month ago
ae78f875-5b98-4907-bbb5-9c737fc68c03.json
655 B
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
af630914-714e-4a24-a7bb-f9af687d3b91.json
1.58 kB
feat: enhance evaluator configuration for Chrome with post-execution commands - Added postconfig commands to multiple JSON files for Chrome evaluation examples. - Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing. - Updated logging messages in the AWS manager to improve clarity and user experience. These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
8 months ago
b070486d-e161-459b-aa2b-ef442d973b92.json
2.02 kB
Fix the config error (chrome/b070486d-e161-459b-aa2b-ef442d973b92) (#446)

The length of the evaluator `func` list must equal the lengths of the result getter and metric lists; otherwise an AssertionError is raised at line 604 of `desktop_env.py`:

```python
assert (not isinstance(self.evaluator["func"], list)
        or (len(self.metric) == len(self.result_getter) == len(self.expected_getter) == len(
            self.metric_options)))
```
7 days ago
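The length rule from that assertion can be checked ahead of time with a small helper. This is a sketch that mirrors the quoted condition; the function name `lengths_consistent`, its parameters, and the sample values are hypothetical rather than part of `desktop_env.py`.

```python
def lengths_consistent(func, result_getters, expected_getters, metric_options):
    """Mirror the assertion: a list-valued func needs equally long companion lists."""
    if not isinstance(func, list):
        # A single (non-list) evaluator function is not subject to the length check.
        return True
    return (
        len(func)
        == len(result_getters)
        == len(expected_getters)
        == len(metric_options)
    )


ok = lengths_consistent(["metric_a", "metric_b"], [{}, {}], [{}, {}], [{}, {}])
bad = lengths_consistent(["metric_a", "metric_b"], [{}], [{}, {}], [{}, {}])
single = lengths_consistent("metric_a", {}, {}, {})
print(ok, bad, single)
```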
b4f95342-463e-4179-8c3f-193cd7241fb2.json
1.47 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
b7895e80-f4d1-4648-bee0-4eb45a6f1fa8.json
3.56 kB
Feature/task refine Fix task configurations and evaluators for improved robustness (#420)

* Relax reference similarity threshold for APA formatting task

Reduce the reference_base_result threshold from 0.93/0.92 to 0.6 for task 2c1ebcd7-9c6d-4c9a-afad-900e381ecd5e (APA 7th edition reference formatting). The original threshold was overly strict and failed submissions with minor formatting variations that are acceptable in APA style (e.g., spacing around journal volume numbers). The new threshold of 0.6 maintains validation of core content accuracy (authors, dates, titles, journals) while accepting reasonable formatting variants, better reflecting real-world APA formatting standards.

Changes:
- Ubuntu version: 0.93 -> 0.6
- Windows version: 0.92 -> 0.6

* Fix XPath selectors for TripAdvisor hotel search task

Replace brittle absolute XPath expressions with robust relative XPath selectors for task b7895e80-f4d1-4648-bee0-4eb45a6f1fa8 (TripAdvisor hotel search in NYC). The original absolute XPath paths (e.g., /html/body/div[1]/main/div[3]/...) are fragile and break when the website's DOM structure changes. The new relative XPath expressions use stable attributes like data-automation and aria-label, making the evaluator resilient to minor UI updates.

Changes:
- Replaced absolute paths with attribute-based selectors
- Check-in date: //button[@data-automation='checkin']//...
- Check-out date: //button[@data-automation='checkout']//...
- City header: //h2[@data-automation='header_geo_title']
- Room/guest selector: //button[@data-automation='roomsandguests']//...
- Price sorting: //button[contains(@aria-label,'PRICE_LOW_TO_HIGH')]//...
- Updated expected value for adult field in first test case to match actual output format: "Rooms/Guests1 Room, 2 Guests"

This ensures the evaluation will work correctly even when TripAdvisor updates their page layout or CSS classes.

* Fix expected output format for file compression task

Correct the expected success message format for task 37887e8c-da15-4192-923c-08fa390a176d (compress files modified 30 days ago). The evaluation script eval_20250703.sh outputs "SUCCESS: The task was completed correctly." with "SUCCESS" in all uppercase, but the expected value was using "Success" with only the first letter capitalized. This case mismatch caused the evaluator to incorrectly fail valid submissions.

Change:
- Expected message: "Success: ..." -> "SUCCESS: ..."

This ensures the string matching in check_include_exclude() correctly identifies successful task completion.

* Add workspace trust settings to VS Code theme change task

Update expected configuration for task 982d12a5-beab-424f-8d38-d2a48429e511 (change VS Code color theme) to include workspace trust security settings. When VS Code launches, it automatically adds workspace trust configuration to settings.json to handle security prompts. The original expected output only included the color theme setting, causing the evaluator to fail even when the task was completed correctly.

Changes:
- Added three workspace trust settings to expected configuration:
  - security.workspace.trust.enabled: false
  - security.workspace.trust.startupPrompt: "never"
  - security.workspace.trust.emptyWindow: false
- Retained the required workbench.colorTheme setting

These settings prevent workspace trust dialogs from interfering with automated testing and ensure the evaluator correctly matches the actual settings.json content generated during task execution.

* Improve GIMP character extraction task evaluation robustness

Enhance task e8172110-ec08-421b-a6f5-842e6451911f (extract pixel art character in GIMP) with more flexible evaluation and reference implementation.

Changes:
1. Add reference Python implementation:
   - Provide complete example script for automated character extraction
   - Implements background removal using color detection and threshold
   - Includes helper functions for checkerboard background preview
   - Serves as learning reference while maintaining task difficulty
2. Add new evaluation function check_structure_sim_with_threshold():
   - Customizable SSIM threshold (default 0.85, more lenient than 0.9)
   - Automatic image resizing when dimensions differ using LANCZOS
   - Comprehensive logging for debugging (comparison details, SSIM scores)
   - Handles edge cases: file not found, size mismatch, conversion errors
3. Update task configuration:
   - Switch from check_structure_sim to check_structure_sim_with_threshold
   - Set SSIM threshold to 0.85 for both GIMP and code outputs
   - Add options parameter to pass threshold to evaluator

Rationale:
- Manual GIMP extraction and Python script may produce slightly different pixel-level results due to different image processing methods (anti-aliasing, edge handling)
- 0.85 threshold accepts minor pixel differences while ensuring visual similarity
- Auto-resize prevents failures from slight dimension mismatches during cropping
- Reference script helps users understand automation requirements without compromising the learning objective

This makes evaluation more realistic while maintaining visual quality standards.

* Delete evaluation_examples/examples/vs_code/982d12a5-beab-424f-8d38-d2a48429e511.json

Just fixed it in a different way in the last commit. Thanks!

---------

Co-authored-by: Tianbao Xie <47296835+Timothyxxx@users.noreply.github.com>
2 months ago
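The fragility of absolute paths compared with attribute-based selectors can be demonstrated on two toy documents. The markup below is invented for illustration and is not TripAdvisor's real DOM; only the `data-automation='checkin'` attribute name comes from the commit message. Python's `xml.etree.ElementTree` supports a limited XPath subset, which is enough for attribute-equality predicates.

```python
import xml.etree.ElementTree as ET

# Two hypothetical layouts of the same page: v2 wraps the content in extra elements.
page_v1 = (
    "<html><body><div><main>"
    "<button data-automation='checkin'><span>Jan 5</span></button>"
    "</main></div></body></html>"
)
page_v2 = (
    "<html><body><div><div><main><section>"
    "<button data-automation='checkin'><span>Jan 5</span></button>"
    "</section></main></div></div></body></html>"
)

ABSOLUTE = "./body/div/main/button/span"  # depends on the exact nesting
RELATIVE = ".//button[@data-automation='checkin']/span"  # depends on a stable attribute


def find_text(markup, path):
    """Return the text of the first node matching path, or None if nothing matches."""
    node = ET.fromstring(markup).find(path)
    return None if node is None else node.text


print(find_text(page_v1, ABSOLUTE), find_text(page_v2, ABSOLUTE))
print(find_text(page_v1, RELATIVE), find_text(page_v2, RELATIVE))
```

The absolute path stops matching once the layout gains a wrapper element, while the attribute-based selector finds the button in both versions.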
bb5e4c0d-f964-439c-97b6-bdb9747de3f4.json
1.5 kB
feat: enhance evaluator configuration for Chrome with post-execution commands - Added postconfig commands to multiple JSON files for Chrome evaluation examples. - Included commands to terminate existing Chrome processes, launch Chrome with remote debugging, and introduce sleep intervals for timing. - Updated logging messages in the AWS manager to improve clarity and user experience. These changes enhance the automation and usability of the evaluation examples while preserving existing logic.
8 months ago
c1fa57f3-c3db-4596-8f09-020701085416.json
1.29 kB
Task fix batch (#383) * update 873cafdd-a581-47f6-8b33-b9696ddb7b05 task eval * c1fa57f3-c3db-4596-8f09-020701085416 fix, add tolerance to url matching * 8df7e444-8e06-4f93-8a1a-c5c974269d82 add more clear instruction to the filename for compress * add address string normalization for 6f4073b8-d8ea-4ade-8a18-c5d1d5d5aa9a --------- Co-authored-by: Jiaqi <dengjiaqi@moonshot.cn>
4 months ago
cabb3bae-cccb-41bd-9f5d-0f3a9fecd825.json
1.81 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
da46d875-6b82-4681-9284-653b0c7ae241.json
2.63 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
e1e75309-3ddb-4d09-92ec-de869c928143.json
1.45 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
f0b971a1-6831-4b9b-a50e-22a6e47f45ba.json
1.32 kB
Enhance PPTX comparison logic and update evaluation examples - Introduced a new helper function `nonempty_runs` to filter out formatting-only runs in PPTX comparisons, improving accuracy in slide evaluations. - Updated the logic for comparing runs in paragraphs to handle cases where both paragraphs are empty, ensuring consistent evaluation results. - Revised JSON examples to clarify instructions and enhance evaluator configurations, including adding post-execution commands for script downloads and permissions. These changes improve the robustness of the PPTX file comparison and ensure that evaluation examples are more precise and functional.
2 months ago
f3b19d1e-2d48-44e9-b4e1-defcae1a0197.json
1.21 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
f5d96daf-83a8-4c86-9686-bada31fc66ab.json
1.48 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
f79439ad-3ee8-4f99-a518-0eb60e5652b0.json
1.8 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago
fc6d8143-9452-4171-9459-7f515143419a.json
1.71 kB
feat: standardize configuration fields across all evaluation examples - Add `fixed_ip` field to all 369 JSON files in examples directory - Set to `true` for 8 files listed in google_chrome.json multi_apps - Set to `false` for remaining 361 files - Add `possibility_of_env_change` field to 363 JSON files missing this field - Set to "low" for newly added fields - Preserve existing values (4 medium, 2 high) for 6 files that already had this field This ensures consistent configuration schema across all evaluation examples while maintaining backward compatibility with existing settings.
8 months ago