Was surprised and somewhat disappointed that the article doesn’t appear to evaluate how well the models work when running in the harnesses optimized for the other models. Do they still do better than with the baseline harness? Does each model do worse with a harness optimized (by this process) for another model than it does with the harness optimized for itself?
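To make the comparison I'm asking for concrete, here is a rough sketch of the cross-evaluation, not anything taken from the article. It assumes you already have some `score(model, harness)` function (hypothetical) that returns a benchmark score, and that each optimized harness is named after the model it was optimized for, with `"baseline"` for the unoptimized one. The numbers in the usage part are made-up placeholders, not results.

```python
from typing import Callable, Dict, List


def cross_evaluate(models: List[str], score: Callable[[str, str], float]) -> Dict[str, dict]:
    """Compare each model's score under the baseline harness, its own
    optimized harness, and the harnesses optimized for the other models."""
    report = {}
    for model in models:
        baseline = score(model, "baseline")
        own = score(model, model)  # harness optimized for this same model
        foreign = {other: score(model, other) for other in models if other != model}
        report[model] = {
            "baseline": baseline,
            "own": own,
            "foreign": foreign,
            # Question 1: does every foreign-optimized harness still beat the baseline?
            "foreign_beats_baseline": all(s > baseline for s in foreign.values()),
            # Question 2: is the model's own harness better than every foreign one?
            "own_beats_foreign": all(own > s for s in foreign.values()),
        }
    return report


# Placeholder scores for illustration only; keys are (model, harness).
toy_scores = {
    ("model_a", "baseline"): 0.40,
    ("model_a", "model_a"): 0.60,
    ("model_a", "model_b"): 0.50,
    ("model_b", "baseline"): 0.30,
    ("model_b", "model_b"): 0.55,
    ("model_b", "model_a"): 0.25,
}

report = cross_evaluate(["model_a", "model_b"], lambda m, h: toy_scores[(m, h)])
for name, row in report.items():
    print(name, row["foreign_beats_baseline"], row["own_beats_foreign"])
```

With real scores plugged in, the two boolean columns answer the two questions directly: whether the optimization transfers across models at all, and whether it is model-specific.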
monkmartinez 18 hours ago [-]
Not really an article, but yeah, I was hoping they would go into the underlying mechanism a bit deeper. This paper could be confirmation of what localllamaians have been saying for months: keep your harness surface small, and allow the model to use the harness to build _your workflow_.
I have been doing a LOT of work around this with Qwen3.6 and it's been super fun. There are some neat benchmarks that help guide, but nothing beats reading the output... and there is a lot of output to read when trying different quants, etc. Which leads me to...
The other thing I have learned is the "harness" is only as good as the model tuning that goes into it. If your prompt(s) are buggered from the beginning, you are going to have a bad time. The prompt structure and special tokens can be a PITA or really help depending on how much you know.
I don't know how agentic harnesses can work without being optimized for the models running within them. This is the biggest insight into working with agents for me. The first things I have always looked at are the prompts and parameters... everything else is orchestration to me.
clickety_clack 3 hours ago [-]
Where would I find a good write-up on where to start with this?
behnamoh 21 hours ago [-]
What else is new? Put it in emacs and let the model improve the harness over time.
7e 20 hours ago [-]
Pretty obvious stuff; see Terminator for the conclusion (SkyNet). Or the Matrix. We really need more work on model alignment, trustworthiness, and control.
https://github.com/skorotkiewicz/nano-agent/blob/main/pi_ext...