I spent today experimenting with OpenEvolve (the open-source counterpart to DeepMind’s AlphaEvolve) on a RHEL 9 laptop using the OpenAI API. To get it running, I fixed a small hashing issue—swapping hashlib.md5 for hashlib.sha256 due to Red Hat constraints—and added support for gpt-5, modeling it after the other o-series entries.
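The hashing change can be as small as swapping the constructor. A minimal sketch, assuming the Red Hat constraint is a FIPS-style policy that blocks MD5 and that the hash only serves as a non-cryptographic identifier; `short_id` is a hypothetical helper, not a function taken from OpenEvolve:

```python
import hashlib


def short_id(text: str, length: int = 16) -> str:
    """Return a stable hex identifier for text, using SHA-256 instead of MD5."""
    # hashlib.sha256 is available on FIPS-restricted systems, unlike plain hashlib.md5.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Truncate only for readability; the full digest is 64 hex characters.
    return digest[:length]


if __name__ == "__main__":
    print(short_id("def two_sum(nums, target): ..."))
```

On Python 3.9 and later, `hashlib.md5(data, usedforsecurity=False)` may be another option where the platform permits it, but switching to SHA-256 avoids the question entirely.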
For a test task, I used the classic Two Sum Python problem (scan a list and find two numbers that add up to a target). With the default settings, OpenEvolve failed to reach the optimal solution and declared the O(n²) brute-force approach as the best.
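For reference, the two approaches can be sketched as follows. These are textbook versions written for illustration, not the exact programs OpenEvolve produced:

```python
def two_sum_brute_force(nums, target):
    """O(n^2): try every pair of indices until one sums to target."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return [i, j]
    return []


def two_sum_hash_map(nums, target):
    """O(n): one pass, remembering the index of each value seen so far."""
    seen = {}
    for i, num in enumerate(nums):
        complement = target - num
        if complement in seen:
            return [seen[complement], i]
        seen[num] = i
    return []


if __name__ == "__main__":
    print(two_sum_brute_force([2, 7, 11, 15], 9))
    print(two_sum_hash_map([2, 7, 11, 15], 9))
```

Both return the same indices on small inputs, which is why a test suite made only of short lists cannot tell them apart on speed.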
Digging into the logs (and asking an AI assistant) revealed that evaluator.py defines the optimization objective. The defaults overweight correctness and rely on overly simple tests, so a slow but correct solution already scores near the top. I expanded the test suite with edge cases and larger inputs, and rebalanced the objective to 50/50 for correctness and efficiency (instead of 70/30). Finally, I enabled full rewrites instead of the default diff-based evolution.
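A rough sketch of what such a rebalanced evaluator could look like. It assumes an `evaluate(program_path)` entry point that returns a dictionary of metrics and a candidate program exposing `two_sum`; the test cases, the `score` and `load_program` helpers, and the 5-second time budget are illustrative assumptions. The budget was picked only because the metrics in the log further down are consistent with efficiency = 1 - total_time / 5, not because the actual evaluator is shown anywhere:

```python
import importlib.util
import time

# Hypothetical test cases: (nums, target, expected indices in ascending order).
TEST_CASES = [
    ([2, 7, 11, 15], 9, [0, 1]),
    ([3, 3], 6, [0, 1]),                          # duplicate values
    ([-1, -2, -3, -4], -6, [1, 3]),               # negative numbers
    (list(range(1, 2001)), 3999, [1998, 1999]),   # larger input, answer at the very end
]


def load_program(program_path):
    """Import the candidate program from a file path."""
    spec = importlib.util.spec_from_file_location("candidate", program_path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def score(two_sum, time_budget=5.0, w_correct=0.5, w_eff=0.5):
    """Run all test cases and combine correctness and efficiency 50/50."""
    passed = 0
    start = time.perf_counter()
    for nums, target, expected in TEST_CASES:
        try:
            result = two_sum(list(nums), target)
            if sorted(result) == expected:
                passed += 1
        except Exception:
            # A crashing or malformed candidate simply fails this test.
            pass
    total_time = time.perf_counter() - start

    correctness = passed / len(TEST_CASES)
    # Linear penalty: 1.0 for instant runs, 0.0 once the time budget is used up.
    efficiency = max(0.0, 1.0 - total_time / time_budget)
    return {
        "correctness": correctness,
        "efficiency": efficiency,
        "total_time": total_time,
        "combined_score": w_correct * correctness + w_eff * efficiency,
    }


def evaluate(program_path):
    """Entry point: load the candidate file and score its two_sum function."""
    return score(load_program(program_path).two_sum)
```

Under this formula, a correctness of 1.0 and a total_time of 0.002288 s give 0.5 × 1.0 + 0.5 × 0.99954 ≈ 0.99977, which agrees with the combined_score in the log. Note that the weights only matter if the tests include inputs large enough for an O(n²) solution to lose measurable time.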
After rerunning, OpenEvolve discovered the optimal O(n) solution in a few iterations. My takeaway: the hard part is crafting the evaluator—the tests and weightings heavily steer outcomes. I also noticed it starts from an initial program rather than generating from scratch, which I hadn’t expected.
Overall, in 1–2 hours I got OpenEvolve working and producing useful results. I’m pleased with the progress and plan to keep exploring; it’s a very promising open-source project.



Excerpt from a log file:
Program ID: cc0f70ed-2f47-43e2-9587-0b90b1321546
Checkpoint: examples/my_leetcode_program/openevolve_output/checkpoints/checkpoint_30
Island: 2
Generation: 3
Parent ID: 43636357-2231-4937-94e6-51d8f30e1e32
Metrics:
combined_score: 0.9997711896896362
correctness: 1.0
efficiency: 0.9995423793792725
total_time: 0.0022881031036376953
Code:
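def two_sum(nums, target):
    """
    Given an array of integers nums and an integer target,
    return indices of the two numbers such that they add up to target.
    """
    # Dictionary to map numbers to their indices
    num_to_index = {}
    # NOTE: the loop below is missing from the pasted log; it is reconstructed
    # from the O(n) hash-map approach described in the model response further down.
    for i, num in enumerate(nums):
        complement = target - num
        if complement in num_to_index:
            return [num_to_index[complement], i]
        num_to_index[num] = i
    return []


if __name__ == "__main__":
    # Example usage
    nums = [2, 7, 11, 15]
    target = 9
    result = two_sum(nums, target)
    print(f"Result: {result}")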
Prompts:
Full_rewrite_user
system:
You are an expert competitive programmer. Optimize the given code for correctness and efficiency.
user:
Current Program Information
Last Execution Output
test_results
execution_time
Program Evolution History
Previous Attempts
Attempt 3
Attempt 2
Attempt 1
Top Performing Programs
Program 1 (Score: 0.9999)
Key features: Performs well on combined_score (0.9999), Performs well on correctness (1.0000), Performs well on efficiency (0.9998), Performs well on total_time (0.0008)
Program 2 (Score: 0.9999)
Key features: Performs well on combined_score (0.9999), Performs well on correctness (1.0000), Performs well on efficiency (0.9998), Performs well on total_time (0.0009)
Program 3 (Score: 0.9999)
Key features: Performs well on combined_score (0.9999), Performs well on correctness (1.0000), Performs well on efficiency (0.9998), Performs well on total_time (0.0010)
Diverse Programs
Program D1 (Score: 0.9999)
Key features: Alternative approach to combined_score, Alternative approach to correctness
Inspiration Programs
These programs represent diverse approaches and creative solutions that may inspire new ideas:
Inspiration 1 (Score: 0.9999, Type: High-Performer)
Unique approach: Modification: Full rewrite, Excellent combined_score (1.000), Excellent correctness (1.000)
Inspiration 2 (Score: 0.9999, Type: High-Performer)
Unique approach: Modification: Full rewrite, Excellent combined_score (1.000), Excellent correctness (1.000)
Inspiration 3 (Score: 0.9999, Type: High-Performer)
Unique approach: Modification: Full rewrite, Excellent combined_score (1.000), Excellent correctness (1.000)
Current Program
Task
Rewrite the program to improve its FITNESS SCORE.
The system maintains diversity across these dimensions: complexity, diversity
Different solutions with similar fitness but different features are valuable.
Provide the complete new program code.
IMPORTANT: Make sure your rewritten program maintains the same inputs and outputs
as the original program, but with improved internal implementation.
# Your rewritten program here
responses:
The current program is already highly optimized in terms of correctness and efficiency for the "Two Sum" problem. It uses a hash map to achieve a time complexity of O(n), which is optimal for this problem. Given the constraints and the existing performance, there isn't much room for improvement in terms of raw speed or complexity.
However, we can explore slight variations or stylistic improvements that might be beneficial in other contexts, such as readability or maintainability. Here's a version that focuses on using Python's dictionary methods more explicitly and ensuring the code is concise and clear:
Key Points:
This solution maintains the same performance characteristics as the original while ensuring that the code is straightforward and easy to read.
Artifacts:
{"test_results": "Passed 15/15 tests\nTest 0: PASS (0.000004s)\nTest 1: PASS (0.000003s)\nTest 2: PASS (0.000001s)\nTest 3: PASS (0.000002s)\nTest 4: PASS (0.000002s)\nTest 5: PASS (0.000001s)\nTest 6: PASS (0.000001s)\nTest 7: PASS (0.000003s)\nTest 8: PASS (0.000001s)\nTest 9: PASS (0.000089s)\nTest 10: PASS (0.000043s)\nTest 11: PASS (0.000005s)\nTest 12: PASS (0.000824s)\nTest 13: PASS (0.000918s)\nTest 14: PASS (0.000391s)", "execution_time": "0.002288 seconds"}