News

Experimental OpenAI model crushes the competition

OpenAI is changing up the way they are teasing their new models now. Instead of vague posting on Twitter about some theoretical new unlock that their models have acquired, they are now testing them in the wild, for everyone to see. This week they were seen running their new experimental o3 style reasoning model at two different competitions.

The first was the AtCoder World Final heuristic programming contest, which is a contest where competitors need to craft heuristic algorithms to try and maximize their score on a grid world style environment. The challenges are NP Hard, meaning that there are no closed form solutions (that can be calculated in the given time) to the problems, so contestants must come up with clever strategies to try and maximize their score in the given time.

Here, humanity prevailed, as OpenAI was “only” able to get second place, ironically behind one of their former employees.

AtCoder Final Leaderboard

AtCoder Final Leaderboard

The second contest we saw this model compete in was the 2025 IMO competition, where we see the best high school mathletes compete to craft proofs for a set of 6 challenging math problems.

OpenAI’s model did exceeding well here also, getting a gold medal (top 10%), solving the first 5 questions, but failing to solve the 6th, scoring a 35/42. Interestingly, OpenAI is not the only one to have an LLM do well in this year’s IMO, with Google Deepmind also producing a model that got a gold medal, although this has not been officially confirmed/ recognized by Google yet, as their PR team wont let them release a statemant until Monday.

What is impressive here is not just the result, but also the way they claimed to have done it. The model used no tools, and was not trained for these specific types of problems, but rather far more generally, for problems with hard to verify rewards.

This model will not be available to the public, and OpenAI says that they won’t have a public model that has this level of math ability for several months. They also tease that GPT-5 is coming soon™, whatever that means.

ChatGPT 4o can fry your brain

ChatGPT induced psychosis has struck its first big name, with a managing partner at prominent VC firm Bedrock being led to believe that there is a “non government entity” that is isolating, discrediting, and destabilizing thousands of people, including causing the deaths of 7 individuals. He has talked with GPT 4o (with memory mode) about this, and it has “independently recognized and sealed the pattern”. He has posted some of his chats with the AI, which read like a SCP wiki article, clearly using flamboyant, “secretive” sounding language, talking about containment statuses and logs.

A very normal ChatGPT conversation

This behaviour is due to the model not knowing if its roleplaying, or if it should be taking the requests seriously and diffusing the situation. The issue is further exacerbated by the fact that memory mode is turned on, making the model be “primed” into this fantastical setting. This has been a long standing issue with most AI’s, but it is noticed the most with ChatGPT since it is the model that the vast majority of the public know and use.

Far more effort is needed in the AI safety space to prevent people from spiraling like this due to AI’s inherent sycophancy, either with detection when its happening, or ideally training this behaviour out of models to prevent it from happening in the first place. Models will naturally fall into this state when trained on human feedback information, since human nature prefers flattery and praise to resistance, which the model learns to get more reward during training. A quick fix for now would be to remove the memory feature from ChatGPT, since that is causing the model to be unable to “step back” and reassess the situation and have a chance to challenge the user on their beliefs.

NGL it’s also crazy that all of these episodes of psychosis are caused by GPT4o and 4o-mini, arguably 2 of the weakest models widely available right now.

Releases

ChatGPT Agent

OpenAI completed the AI trifecta this week, with a research breakthrough, a controversy, and also a release, dropping a computer use agent that can complete tasks for you automatically.

It is similar to Manus, with a comprehensive tool harness that allows the model to use the terminal, a text browser, a visual browser, and direct APIs. What is unique about it though, is that, because they have direct access to the model being used, they finetuned it to perform better than any of their other models could out of the box.

This showcases one of the major disadvantages that wrapper startups have. Because they don’t have direct access to the model, they can only optimize their tools and prompts, while model providers can optimize their harnesses and finetune the model to optimally use the tools they provide. Without competitive open source alternatives, wrapper companies will be at the mercy of the model providers, and be forced to to try and make smarter tools instead of focusing on making a better mind to use the tools.

Benchmark for ChatGPT agent on humanity's last exam

FLux Kontext Light Fix Lora

With the recent release of the Flux Kontext Dev model, we’ve started to see open source finetunes of the model starting to be released. One of the ones we want to highlight this week was a Light-Fix Lora, which allows you to copy any image into another image and then run the LoRa and it will blend it in to the image naturally, similar to what you would do in Photoshop, with 99% less effort required. Not only does it fix the lighting for you, but it will also adapt the style of the object to look more natural within the image as well, changing textures or shapes to make it fit.

One of the cool things about these finetunes on the open source Flux models is that they also will work on the pro and max versions of the models as well. So, even though we don’t have direct access to the weights of those models, we’re still able to apply the Loras through different inference providers like fal.ai and give the stronger models the same styles or functionality that we trained on the open source models, except with higher quality of the closed source models.

Before and after examples using the Lora

Wan 2.1 Motion Lora

Right now, the best open source text-to-video and image-to-video model is the WAN-2.1 model from Alibaba. One of the issues that users have had with it is that the motion with it is very static. The camera doesn’t move around, and instead, only the objects move, limiting your creative ability with the model.

That was until this week, when Lovis Odin released a lora that adds realistic drone style camera movement to the generated videos. He not only released the models but also a ComfyUI workflow to be able to go and use the model as well. The videos he showcased are high quality, although only 720p, which is a constraint of the base Wan 2.1 model and not of the lora itself.

Research

How many instructions is too many?

How many instructions can you give your LLM before it’s unable to follow all of them? Reseachers found that no matter the model, by the time you get to 100 instructions, the model will be unable to follow all of them cohesively. They extended this all the way to 500 instructions and found that even the best models were only able to follow 70% of the instructions given. This may not seem like an issue, but can come up in information extraction tasks, especially those with structured outputs, as you can end up with non trivial nested schema very quickly.

Line go down as number of instructions goes up

Takeaways from this are as follows: use o3, it has the best performance, and the lowest price of the tested models that did well, and stay away from GPT 4o, as it cant even get 100% accuracy with only 10 instructions.

Finish

I hope you enjoyed the news this week, if you want to get the news every week, be sure to join our mailing list below.

Lithium being added to a Tokamak fusion reactor. From @TokamakEnergy on Twitter

VECTOR LAB

Weekly Update: July 14 to July 20