I spent today experimenting with OpenEvolve (the open-source counterpart to DeepMind’s AlphaEvolve) on a RHEL 9 laptop using the OpenAI API. To get it running, I fixed a small hashing issue (swapping hashlib.md5 for hashlib.sha256 due to Red Hat constraints) and added support for gpt-5, modeling it after the existing o-series entries.
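For illustration, the hashing swap looks roughly like the sketch below. The helper name content_id is hypothetical and not taken from OpenEvolve's code. It only shows the general pattern of replacing an MD5 digest with SHA-256 where the hash serves as a non-cryptographic identifier.

```python
import hashlib


def content_id(text: str) -> str:
    """Return a stable hex identifier for a piece of text.

    Hypothetical helper for illustration only. The original pattern
    would have been hashlib.md5(text.encode()).hexdigest(), which can
    raise ValueError on systems where MD5 is disabled by policy.
    """
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # Same input always yields the same 64-character hex digest.
    print(content_id("def two_sum(nums, target): ..."))
```

On Python 3.9 and later, hashlib.md5(data, usedforsecurity=False) may be an alternative when the hash is not used for security, but whether it is permitted depends on the system's crypto policy. Note that SHA-256 digests are longer (64 hex characters instead of 32), so any previously stored MD5-based identifiers would no longer match.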
For a test task, I used the classic Two Sum Python problem (scan a list and find two numbers that add up to a target). With the default settings, OpenEvolve failed to reach the optimal solution and declared the O(n²) brute-force approach as the best.
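For reference, here is a minimal sketch of the two approaches in question: the O(n²) brute-force scan and the O(n) hash-map version usually considered optimal. The function names are my own, and this is not the code OpenEvolve generated.

```python
from typing import Optional, Sequence, Tuple


def two_sum_brute(nums: Sequence[int], target: int) -> Optional[Tuple[int, int]]:
    """O(n^2): test every pair of indices (i, j) with i < j."""
    for i in range(len(nums)):
        for j in range(i + 1, len(nums)):
            if nums[i] + nums[j] == target:
                return (i, j)
    return None


def two_sum_hash(nums: Sequence[int], target: int) -> Optional[Tuple[int, int]]:
    """O(n): remember each value's index and look up the complement."""
    seen = {}  # value -> index where it was last seen
    for j, value in enumerate(nums):
        i = seen.get(target - value)
        if i is not None:
            return (i, j)
        seen[value] = j
    return None


if __name__ == "__main__":
    print(two_sum_brute([2, 7, 11, 15], 9))
    print(two_sum_hash([2, 7, 11, 15], 9))
```

Both return a pair of indices or None. When several pairs sum to the target, the two functions may return different (equally valid) pairs, which matters if a test suite checks for one specific answer.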
Digging into the logs (and checking with AI) revealed that evaluator.py sets the optimization objective. The defaults overweight correctness and rely on overly simple tests. I expanded the test suite with edge cases and larger inputs, and rebalanced the objective from 70/30 to 50/50 between correctness and efficiency. Last but not least, I enabled full rewrite (not the default diff-based ev