Forecasting GPT-4

Gaze not into the model, for it learns from human feedback.

ML researchers joke that each month brings new surprises. I’d like to be less surprised in the coming month, so I’ve tried to predict what GPT-4 will bring (or PaLM-2, Meerkat, …).

More than two years after GPT-3, only three companies have the secret sauce for large language models (LLMs) and write about them: OpenAI, DeepMind, and Google. I expect the next big model to come from one of the trio - most obviously OpenAI, which conspicuously missed its spring / summer GPT announcement this year.

Interesting predictions:

Boring predictions (I only cover some of these):

I wrote the first draft of this post on September 12th; Google published improved scaling laws the next day, September 13th.

Figure: timeline of selected OpenAI model announcements.

OpenAI has announced a new GPT model every year, but not yet in 2022.

DeepMind's timeline here starts much later than the other two.

❖ ❖ ❖

Interesting predictions

Wait, this isn't a 2k context window!

The Pile includes specialized datasets, unlike GPT-3's pretraining corpora.

❖ ❖ ❖

Boring predictions

❖ ❖ ❖

Predicting performance

The world’s best (disclosed) LLM is locked away by Google. Nor has anyone announced a model that combines the well-known separate improvements. The uncertainty around how the best, or best possible, models behave today ripples into the future: the next best model jumps off from a model we don’t have access to. Do iterative techniques combine, brick by brick, to build higher? Or does each new trick substitute for another?

My predictions above are about technical changes, not qualitative differences in experience or quantitative differences in performance. I expect the biggest qualitative difference will be that the model is much easier to interact with - that it seems to always guess what you want, even if you haven’t been very clear about your intent. Most people want similar things from the model, so scaling human feedback will get you very far. Prompt engineering is a hack! It means your model isn’t ready for users.

On benchmark performance numbers: I’ll save the hot takes for another day.

❖ ❖ ❖

Staying up to date

I track the latest LLMs (and other big models), along with info about released weights, code, data, and other useful data points. Follow @model_tracker for updates.

❖ ❖ ❖


Thanks for reading the first post on my personal blog! I welcome discussion and constructive feedback, whether about the contents of this post or the writing. Safety note: I don’t think this type of forecasting worsens racing. Let me know if you disagree.

References above are not comprehensive. If you feel an important paper should be included, feel free to tweet it @thomasiliao.

Thanks to Rohan Taori, Charlie Snell, and asara for comments.