LATEST AI MODELS 2025
BY ANDREW MEAD

What LLMs You Should Use

GPT-5, Claude Opus, or open source: which is the way to go?

Written in collaboration with Harvard Business School, INSEAD, and Sundai Club

When looking at the AI space right now, it can be intimidating to keep up with all the models being released and to keep track of which is best for what. In this article we want to demystify the AI ecosystem a bit and give you our picks for the top models across a variety of use cases.

If you want to try these models, we suggest using LMArena or OpenRouter to access the wide variety of models that are available today.

If you are reading this in the future and want to know what the current best models are, I would recommend checking Artificial Analysis, as they have most of the major models benchmarked there and are very up to date. Note that high benchmarks don’t always translate to stellar real-world performance, so be sure to test the model on your use case before deploying it to production.

Day-to-Day Use

For day-to-day use, we’re going to be looking at the models in terms of how good they are to use given the provider’s UI. Imagine this as the best general use AI product out there right now.

There is one clear winner here: OpenAI. Using GPT-5 in the browser on the ChatGPT website is one of the cleanest experiences you will get right now. The automatic enabling of web search and other tools like image generation, plus the ability to connect directly to services like Microsoft 365 or any of the GSuite products, makes this the best service to use.

You can make custom GPTs to cater them to your specific context or understanding. You also get code agents through the Codex functionality, image generation with GPT Image, and video generation with Sora, all available from the comfort of your own browser.

If you had $20 and could only pay for one service, I would recommend getting a ChatGPT account (for general use, that is; if you are looking primarily to code, Cursor would be my weapon of choice, but you will learn more about that in the coming weeks).

API Pricing

See full pricing here

Model | $ per million input tokens | $ per million output tokens
GPT-5 | $1.25 | $10
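
To put these per-token prices in concrete terms, here is a quick back-of-the-envelope calculator (the token counts in the example are just illustrative):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 price_in: float, price_out: float) -> float:
    """Cost in dollars, given per-million-token prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# A typical chat turn on GPT-5: ~2,000 tokens in, ~500 tokens out
print(f"${request_cost(2_000, 500, 1.25, 10):.4f}")  # -> $0.0075
```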

Multimodal

For multimodal models, there’s one clear winner here as well: Google’s Gemini family of models.

The Gemini models are the only mainstream models that can handle text, image, video, and audio inputs. With their extremely long 1-million-token context window, they can process videos up to 45 minutes long with audio included, and over 8 hours of audio-only input. They also top pretty much all the benchmarks for image and video understanding.
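
As a rough sketch of what video understanding looks like in practice with Google’s google-genai Python SDK (the file name is a placeholder, and you should double-check the current SDK docs):

```python
import time
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

# Upload the video through the Files API, then wait for processing to finish
video = client.files.upload(file="meeting.mp4")  # placeholder file name
while video.state.name == "PROCESSING":
    time.sleep(5)
    video = client.files.get(name=video.name)

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[video, "Summarize this meeting and list the action items."],
)
print(response.text)
```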

They also happen to be some of the best price-to-performance models out there, especially Gemini 2.5 Flash, which is priced at only $2.50 per million output tokens (4x cheaper than GPT-5).

You can test these models now for free in Google AI Studio, which gives you a large amount of control to tinker with the models and see what they can do.

API Pricing

See full pricing here

Model | $ per million input tokens | $ per million output tokens
Gemini 2.5 Flash | $0.30 | $2.50
Gemini 2.5 Pro | $1.25 | $10

Coding and Agentic Tasks

Once again, we have a pretty clear winner for code writing and other agentic tasks: Claude 4.

Claude has been the number one name in the game when it comes to coding and agentic tasks for over a year now, and that hasn’t changed with Claude 4 Sonnet and Opus.

It is the de facto model used by Cursor and also powers the top CLI coding tool, Claude Code.

I recommend using Sonnet for most tasks, as it will be good enough and is 5x cheaper than Opus. If money doesn’t matter or if you have a particularly hard task, then you can try Opus, which was recently bumped up to version 4.1 and is a small improvement.
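
If you want to call Sonnet from code rather than through Claude Code or Cursor, a minimal sketch with Anthropic’s Python SDK looks like this (the model ID shown is my best guess; verify it against Anthropic’s current model list):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed ID; check Anthropic's docs
    max_tokens=1024,
    messages=[{"role": "user",
               "content": "Write a Python function that flattens a nested list."}],
)
print(message.content[0].text)
```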

One notable mention here is from the open source community represented by Z.ai’s GLM 4.5 model. This is one of the first models that is able to go blow for blow with Sonnet 4 in my testing and also has the added benefit of being almost 10 times cheaper than Sonnet. It sometimes falls a little bit behind on more complicated tasks, but for day-to-day use, I see little difference.

API Pricing

See full Claude pricing here.

GLM 4.5 model pricing taken from OpenRouter.

OpenRouter is a platform that allows you to use both closed and open source models all from one place (one URL and API key to access all of them). The open source models are hosted by various inference provider companies like TogetherAI and Chutes, as well as first-party providers like Z.ai.

OpenRouter also provides information about each provider like reliability, latency (time to first token), and throughput (how fast the model is).
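
Because OpenRouter exposes an OpenAI-compatible endpoint, switching between closed and open models is a one-line change. A minimal sketch (the model slugs are assumptions; browse openrouter.ai/models for the exact ones):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter API key
)

# Model slugs below are assumptions; check openrouter.ai/models
for model in ["anthropic/claude-sonnet-4", "z-ai/glm-4.5"]:
    completion = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": "Refactor this into a list comprehension: "
                              "for x in xs: ys.append(x * 2)"}],
    )
    print(model, "->", completion.choices[0].message.content)
```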

For this chart, we are using the pricing from Z.ai’s first-party listing.

Model | $ per million input tokens | $ per million output tokens
GLM 4.5 | $0.60 | $2.20
Claude Sonnet 4 | $3 | $15
Claude Opus 4.1 | $15 | $75

Hosted AI (Bedrock, Azure, etc.)

AWS Bedrock allows you to run Claude (Sonnet 4, Opus 4.1) and a variety of other open source models in your own VPC and pay per token. The pricing for Amazon Nova is VERY competitive if the model quality is good enough for you (and it’s pretty good). Claude token prices are exactly the same as going through Anthropic directly. The catch is that not all models are available in every AWS region.
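
To give a feel for what this looks like in code, here is a minimal sketch using boto3’s Converse API (the model ID is an assumption; check the Bedrock model catalog and your region’s availability):

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

response = client.converse(
    # Assumed model ID; look up the exact one in the Bedrock console
    modelId="anthropic.claude-sonnet-4-20250514-v1:0",
    messages=[{"role": "user",
               "content": [{"text": "Summarize our VPC options."}]}],
    inferenceConfig={"maxTokens": 512},
)
print(response["output"]["message"]["content"][0]["text"])
```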

Azure AI Foundry gives you per-token access to GPT-5, but at a higher price than OpenAI directly. Other models are compute-based, which means that depending on your use case it could be very cost-effective versus AWS (batched runs where you can shut the system down afterward) or much more expensive (intermittent queries where the compute needs to be on constantly).

Open Source

If privacy is of utmost concern to you, then you could self-host your own open source models.

There are two different paths you could go down for open source models. You could host one of the larger open source models on something like an 8xH100 node (not cheap, ~$15/hr), which would be easy to set up but pricey to run in the long term. Or you could take a smaller open source model and fine-tune it yourself on your particular task (although you would still need to spend ~$2/hr in compute to host it once you are done training).
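
If you do go down the self-hosting path, vLLM is the usual serving layer. A minimal sketch, using a small Qwen3 checkpoint as a stand-in (a full-size GLM 4.5 or Kimi K2 would need the multi-GPU node mentioned above):

```python
from vllm import LLM, SamplingParams

# Small model as a stand-in; for larger checkpoints, raise tensor_parallel_size
# to spread the weights across the GPUs in your node
llm = LLM(model="Qwen/Qwen3-8B", tensor_parallel_size=1)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(
    ["Explain retrieval-augmented generation in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```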

I don’t recommend either of these paths if you can avoid it; instead, use something like OpenAI’s secure Azure endpoint if security is a concern.

Fine-tuning also tends to be a massive time sink, as fine-tuning models is very difficult. You should expect to spend at least a couple of months and thousands of dollars before you have a model and dataset that you are satisfied with. Usually, I say you should spend more time on prompt engineering a pre-existing LLM like GPT-5, or on adding/improving your RAG pipeline, instead.

That being said, if you do want to use open source models, here are your options.

Ready to go out of the box

The two best open source options right now are Kimi K2 and GLM 4.5. These models are both made by Chinese labs and perform highly across most benchmarks, trading blows with the likes of OpenAI and Anthropic for the top.

For fine-tuning

For fine-tuning, the Qwen3 series of models is definitely the best right now. They come in a wide variety of sizes, ranging from 600 million parameters all the way up to 235 billion, and are very receptive to fine-tuning; most of the top research papers right now use them as the base for their fine-tuning and reinforcement learning experiments.
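
As a sketch of what the starting point of such a fine-tune looks like with Hugging Face’s transformers and peft libraries (the model ID and LoRA hyperparameters here are illustrative, not a recipe):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Smallest Qwen3 checkpoint as an example; larger sizes follow the same pattern
model_name = "Qwen/Qwen3-0.6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA freezes the base weights and trains small adapter matrices,
# which is far cheaper than a full fine-tune
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# From here you would train on your dataset, e.g. with TRL's SFTTrainer
```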

Other models to consider

Here are some additional models that didn’t make the list, but that you could also consider for your deployments. They don’t stand out versus the competition, but there isn’t necessarily anything wrong with them either.

  1. Mistral Medium and Large

  2. xAI’s Grok 4

  3. DeepSeek V3.1

  4. Amazon Nova

Not worth it

You may be wondering why some models that you’ve heard of haven’t been mentioned, so we will list them here along with the reasons why we don’t recommend them.

Llama

Meta’s Llama series of models has been completely outdone by the Qwen3 series. The models are no longer near the top and now have very limited support in the open source community, especially the latest Llama 4 models.

GPT-oss

GPT-oss was trained only on synthetic data, leaving it with very little world knowledge and high hallucination rates. This makes it very brittle to use, especially outside the math, science, and general reasoning domains.
