On November 6th, OpenAI held its developer day, at which it introduced GPT-4 Turbo along with a few other new tools. There has been a lot of buzz online about this release, but what does it mean for the legal world?

What was announced?

A first and obvious improvement is that the model’s knowledge of the world has been updated to April 2023. This was badly needed, because the old GPT-4’s training data stopped in September 2021, so it knew nothing about legislation, case law, political developments or world events after that date. While OpenAI (and the Bing search engine) could search the internet and combine a subset of the search results with GPT-4’s memory, this was always a stop-gap solution that only partially improved the results. In fact, including such search results was often even detrimental to the quality of the answers.

A second improvement is that GPT-4 Turbo is faster, two to three times cheaper, and can process up to 128,000 so-called tokens. Each token roughly corresponds to about three quarters of an English word, so GPT-4 Turbo can process up to roughly 100,000 words. Be aware that the question and the answer together must fit within those 100,000 words.
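
To get a feel for that word/token ratio, you can count tokens yourself with OpenAI’s open-source tiktoken library. The snippet below is only a minimal illustration; the exact ratio varies per text, and the cl100k_base encoding is the one used by GPT-4-class models.

```python
# Rough illustration of the token/word ratio: GPT-4-class models use the
# cl100k_base encoding, and English text usually yields slightly more tokens
# than words.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

text = "The contractor shall indemnify the client against all third-party claims."
tokens = encoding.encode(text)

print(len(text.split()), "words")
print(len(tokens), "tokens")
```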

A third improvement is the launch of custom GPTs and the accompanying “Assistants” API. Both provide a simple environment with easy tools to upload information and create customized chat robots. Custom GPTs can then also be offered for sale on OpenAI’s shop.

Initial technical impressions

A few days have passed, and many enthusiasts have been experimenting with GPT-4 Turbo. The general feeling is that GPT-4 Turbo is, overall, a clear step forward. Particularly the more recent world knowledge, the larger context window (16x that of the standard version of the old GPT-4) and the reduced costs are very welcome.

However, even leaving aside the two outages that have already occurred in a single week (probably a result of the high usage rates and/or the larger token limits now allowed), not everything is perfect. The reasoning capabilities of GPT-4 Turbo seem to have deteriorated somewhat. For example, on the SAT reading benchmark, the old GPT-4 scored 770, while the new GPT-4 Turbo scores around 730-740.

What probably happened behind the scenes is that OpenAI compressed and optimized GPT-4’s “memory” in many different ways, to allow for faster and cheaper computation while at the same time supporting the higher token counts. The technicalities of this process are outside the scope of this contribution, but you can roughly think of it as truncating the billions of numbers stored in GPT-4’s memory: a number such as 34789.987654 is reduced to 34789.9, which occupies less memory and is faster to handle, at the cost of being a bit less precise.
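
The snippet below is only a toy illustration of that memory/precision trade-off, using 64-bit versus 16-bit floating point numbers; OpenAI has not disclosed what it actually did, so this is merely the general idea.

```python
# Toy illustration (not OpenAI's actual method): storing the same numbers in
# fewer bits makes them smaller and cheaper to process, at the cost of a small
# rounding error.
import numpy as np

weights_full = np.array([34789.987654, 0.000123456, -1.618033988], dtype=np.float64)
weights_small = weights_full.astype(np.float16)  # far fewer bits per number

print(weights_full.nbytes, "bytes vs", weights_small.nbytes, "bytes")
print(weights_small)  # note the lost precision, e.g. 34789.987654 becomes ~34784.0
```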

From a technical perspective, developers also welcome the GPT “Assistants” and the few other technical tools that were introduced. So far, the overall feeling among developers has been that it is very simple to start working with large language models (LLMs) through the APIs of Microsoft, OpenAI, Google or any of the other vendors. A clever programmer can therefore literally build an initial version of a chatbot in less than an hour.
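
To illustrate how low that barrier is: the sketch below is roughly all the code needed for a bare-bones terminal chatbot, assuming the openai Python package (v1.x) and an API key; the model name and system prompt are merely examples.

```python
# Minimal chatbot sketch against OpenAI's chat completions API.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

history = [{"role": "system", "content": "You are a helpful legal assistant."}]

while True:
    question = input("You: ")
    history.append({"role": "user", "content": question})
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",  # GPT-4 Turbo's identifier at launch
        messages=history,
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    print("Bot:", answer)
```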

However, after the initial enthusiasm, developers discovered that many devils lurk in the details. An LLM vendor’s API is a very bare-bones tool and must be combined with many other tools to create useful applications, for example to do groundwork such as chopping texts into smaller pieces to make them digestible by the APIs. Those additional tools — e.g., LangChain, LlamaIndex, as well as many vector databases — are very much in flux, and several of them were created in a hurry, so they are not always pleasant to use. New developer tools seem to be launched every day (after all, selling shovels during a gold rush is the surest way to make money), and it is difficult for developers to keep track of which tools are great and which are snake oil.
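
As a concrete example of that “groundwork”: the sketch below splits a long document into overlapping chunks small enough for an API call. It is a deliberately naive version (character counts as a crude proxy for tokens, and the file name is hypothetical); dedicated tools such as LangChain’s text splitters do the same job with more finesse.

```python
# Split a long document into overlapping chunks that fit within an API limit.
def chunk_text(text: str, max_chars: int = 4000, overlap: int = 400) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap  # overlap avoids cutting facts in half
    return chunks

# Hypothetical input file, purely for illustration.
with open("long_contract.txt", encoding="utf-8") as f:
    pieces = chunk_text(f.read())
print(len(pieces), "chunks")
```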

The new tool environment that OpenAI is now offering through its Assistants will therefore be welcomed by many developers, as OpenAI can be seen as the “standard”.


Downsides

At the same time, however, there are some worries.

A first worry is that, with its new announcements, OpenAI essentially killed many “one-trick pony” startups that offered development tools for programmers.

Among the most obvious victims are the vendors of simple tools to convert texts into semantic vectors. As described in our Fall 2023 update, those semantic vectors are probably the key to handling large amounts of text — e.g., case law, internal memos, contract templates, etc. — that are simply too large for an LLM to handle, even the new GPT-4 Turbo. Developers can now simply upload their texts into OpenAI’s new Assistants environment and have GPT-4 Turbo automatically extract the relevant data, so none of those ecosystem tools is needed anymore, as long as developers stay within the walled garden of OpenAI’s environment.
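
A minimal sketch of that workflow with the new Assistants API is shown below. The file name, instructions and question are hypothetical, and the API was still in beta at the time of writing, so the details may well change.

```python
# Upload a document and attach it to an assistant that uses the built-in
# retrieval tool, so GPT-4 Turbo can answer questions about it.
from openai import OpenAI

client = OpenAI()

file = client.files.create(file=open("case_law_summary.pdf", "rb"),
                           purpose="assistants")

assistant = client.beta.assistants.create(
    name="Case-law helper",
    instructions="Answer questions using only the uploaded documents.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[file.id],
)

thread = client.beta.threads.create()
client.beta.threads.messages.create(thread_id=thread.id, role="user",
                                    content="Summarise the key holding.")
run = client.beta.threads.runs.create(thread_id=thread.id,
                                      assistant_id=assistant.id)
```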

To be clear: for developers who want more than the relatively simple environment offered by OpenAI, the more serious tools (e.g., vector databases such as Pinecone, Weaviate and Milvus) remain necessary. The victims will therefore mostly be startups that focused on easy tools filling some of the obvious gaps in the ecosystem.

In the IT world, it is quite common for incumbents to kill startups by taking their best ideas and incorporating them into their own environment, nicely integrated with the rest of the incumbent’s tools. Both Apple and Microsoft have been severely criticized for this tactic (“Sherlocking” has even become a verb to describe Apple’s version of it), but despite the associated competition law worries, they continue doing so.

It is therefore not surprising that OpenAI is taking the same approach. Also, as outlined above, the current ecosystem of LLM-related tools is quite chaotic and in flux, so some standard-setting is in a certain way welcome. Nevertheless, combined with the new GPTs shop, the road towards “platformisation” that OpenAI is now taking is a mixed bag.

GPTs shop

As OpenAI is making it so easy to create custom GPTs — no coding skills necessary! — it is almost certain that many startups will offer a “Legal GPT” on OpenAI’s shop. At the time of writing, at least one is already available.

We expect this number to go up enormously. As every lawyer who has experimented with LLMs knows, the legal content produced by GPT-4 seems quite good, at least at first glance. When you dig deeper, or try to use GPT-4 for niche legal areas or in small (non-Anglo-Saxon) jurisdictions, you realise that the generated legal content is often mediocre, or even completely wrong, because of hallucinations and the erroneous mixing of legal rules between jurisdictions.

But that first impression is what counts, which explains why we have witnessed many legal startups with very limited knowledge of the law nevertheless launch legal technology products by “piggybacking” on LLMs such as GPT-4. (In fact, for those startups, the fewer lawyers are involved, the better it seems, because lawyers would probably start nagging about the risks posed by the poor quality of the output.)

With OpenAI’s announcements, the bar for launching a legal tech product is lowered even further, so you can expect a smörgåsbord of GPT-4 (Turbo) backed startups to be launched in the coming months, probably focusing on a specific area of law (e.g., family law) or a specific type of legal document (e.g., contracts), through the simple act of packaging some clever prompts into a custom GPT.

Impact on legal teams

Does GPT-4 Turbo impact the Million Dollar Question on every legal team’s mind: “how can we store our own content within an LLM?”

The answer is probably no. I had personally expected GPT-5 to launch in, say, February 2024. The launch of GPT-4 Turbo (which can be seen as a GPT-4.5) was therefore a small surprise, because a new version of GPT was launched sooner than I had expected. At the same time, it is not a big surprise, in the sense that OpenAI also launched an interim “Turbo” version of GPT-3 (i.e., ChatGPT) before launching GPT-4 a few months later. My personal bet is therefore that GPT-5 will likely launch somewhere in Q2 2024.

Taking a step back, our assessment is that GPT-4 Turbo does not fundamentally change the problems associated with the old GPT-4. Sure, its token capacity is much higher, but even 100,000 words is quite limited for general legal purposes: it is about twice the size of the text of the EU GDPR, and about twelve times the size of the US Constitution. It will therefore still not be possible to take the lazy approach of “dumping” gigabytes of legal content into GPT-4 in order to have it automatically write memos or contracts for you.

What also has not been solved by GPT-4 Turbo is the “attention span” problem, where LLMs tend to lose their attention when processing large amounts of text. As nicely summarised by a recent experiment, GPT-4 Turbo does a good job of keeping its attention up to about 50,000 words. Above that threshold, however, facts mentioned in the input text were forgotten, particularly if those facts were positioned between the introductory section and the middle of the long document.
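
The sketch below shows the general shape of such a “needle in a haystack” test. It is our own simplified illustration, not the cited experiment itself; the model name, filler text and hidden fact are placeholders, and each call at this scale costs on the order of a dollar.

```python
# Hide one fact at different depths inside a long distractor text, then check
# whether the model can still retrieve it.
from openai import OpenAI

client = OpenAI()
needle = "The secret clause number is 42b."
filler = "Lorem ipsum dolor sit amet. " * 10000  # long, but within the context window

for depth in (0.1, 0.5, 0.9):  # near the start, the middle and the end
    pos = int(len(filler) * depth)
    document = filler[:pos] + needle + filler[pos:]
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[{"role": "user",
                   "content": document + "\n\nWhat is the secret clause number?"}])
    print(depth, "->", response.choices[0].message.content)
```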

One should also keep in mind that it is very costly to use large input texts. If, for example, you submit a text of 80,000 words and receive an answer of another 10,000 words, this will cost you roughly $1 per question. With every new question you ask about the same text, another $1 will be charged, because GPT-4 Turbo “forgets” earlier inputs (even when submitted within the same session) and therefore has to redo the costly processing for each and every question.
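
The back-of-the-envelope calculation below shows where such an estimate comes from, assuming GPT-4 Turbo’s launch pricing of $0.01 per 1,000 input tokens and $0.03 per 1,000 output tokens (prices change regularly, so check OpenAI’s current price list) and the rough rule of thumb of about 4/3 tokens per English word.

```python
# Back-of-the-envelope cost estimate for one large question.
input_words, output_words = 80_000, 10_000
input_tokens = input_words * 4 / 3    # ~106,667 tokens
output_tokens = output_words * 4 / 3  # ~13,333 tokens

cost = input_tokens / 1000 * 0.01 + output_tokens / 1000 * 0.03
print(f"~${cost:.2f} per question")  # roughly in the $1-1.5 range
```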

Our earlier recommendation therefore still stands: use retrieval-augmented generation (“RAG”), with semantic vectors as a prefiltering step, so that only a subset of the data is submitted to the LLM. Also interesting to note is that our prediction that “fine-tuning” is not very relevant for the legal world, despite initial hopes to the contrary, has been confirmed by OpenAI’s new announcements. The press release states that OpenAI is “creating an experimental access program for GPT-4 fine-tuning. Preliminary results indicate that GPT-4 fine-tuning requires more work to achieve meaningful improvements over the base model compared to the substantial gains realized with GPT-3.5 fine-tuning”. This confirms the earlier analysis that fine-tuning is primarily intended for teaching LLMs how to generate output (e.g., “lawyer-style wording”), and not so much for teaching new facts or for uploading existing legal memos into GPT’s memory.

Yet another sign that fine-tuning is not the way to go, and that training your own LLM is out of reach for almost every legal team, is that OpenAI announced that organisations can now apply to get a truly custom GPT-4 trained for them. However, not many organisations will be admitted, as you need “billions of tokens at a minimum”, and the special program is “expensive”, so it probably costs millions of euros/dollars. Bearing in mind that this training process will need to be repeated frequently, it is out of reach for almost every legal team.

In other words, the best way forward for most legal teams wanting to reap the benefits of LLMs is to duly organise and tag their legal content, so that a clean subset of relevant legal data can be submitted to the LLM. From that perspective, the larger token window and lower cost of GPT-4 Turbo are very much welcome.
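
To make the recommended RAG pattern concrete, the sketch below embeds a few document chunks, embeds the question, keeps only the closest chunk and sends just that subset to GPT-4 Turbo. The chunk texts and question are hypothetical, and a real system would use a proper vector database rather than an in-memory list.

```python
# Minimal RAG sketch: semantic vectors as a prefiltering step.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    result = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([d.embedding for d in result.data])

chunks = ["Clause 12: liability is capped at ...", "Clause 19: termination ..."]
chunk_vectors = embed(chunks)

question = "What is the liability cap?"
q_vector = embed([question])[0]

# Cosine similarity, then keep only the best-matching chunk as context.
scores = chunk_vectors @ q_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(q_vector))
context = chunks[int(np.argmax(scores))]

answer = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user",
               "content": f"Context:\n{context}\n\nQuestion: {question}"}])
print(answer.choices[0].message.content)
```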

Where are we heading next?

In our earlier blog post we analysed how, since the launch of GPT-4 in March, there have been many exciting developments in the LLM landscape, but most of these developments took place in areas surrounding GPT-4 (e.g., the many open-source LLMs) and not so much in GPT-4 itself. GPT-4 Turbo is at the same time a significant step forward and a warning that we should not expect truly significant breakthroughs in the short term. The fundamental limitations of the current technology — hallucinations, relative slowness, relatively limited token length — remain for the foreseeable future, with only incremental improvements.

For the legal world, the GPT Assistants are interesting, but will probably not make a big splash. On the one hand, they allow legal teams to create their own legal chatbot even more easily, because no coding skills are required. On the other hand, given OpenAI’s burned reputation when it comes to confidentiality, due to its reuse of prompts in the consumer version of ChatGPT, it remains to be seen how many legal teams will want to upload their most sensitive data (e.g., frequently used internal memos) to OpenAI’s walled garden. It is remarkable how this confidentiality issue remains a huge worry for almost every legal team we talk to, even though it can easily be circumvented through the use of the enterprise version of OpenAI’s environment, or (as we do internally here at ClauseBase) through Microsoft’s Azure version of GPT-4.
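
For completeness, the sketch below shows what calling GPT-4 through Microsoft’s Azure OpenAI service looks like with the same Python client; the endpoint, key, API version and deployment name are placeholders that depend on your own Azure subscription.

```python
# Calling GPT-4 via Azure OpenAI instead of OpenAI directly.
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://your-resource.openai.azure.com",  # placeholder
    api_key="...",                                            # placeholder
    api_version="2023-07-01-preview",
)

response = client.chat.completions.create(
    model="your-gpt4-deployment",  # the deployment name chosen in Azure
    messages=[{"role": "user", "content": "Summarise this internal memo: ..."}],
)
print(response.choices[0].message.content)
```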

Even when reuse of data is not a real issue, the platformisation and walled garden of OpenAI should not be underestimated. OpenAI is most likely following the typical lock-in strategy, where a large vendor initially attracts customers with a good product, low prices and an interesting development environment. Over time, once the product has become the de facto standard and customers find it difficult to move to another platform, prices will be increased. Another worry is that an Amazon-like tactic could be used, where — even when data is not itself directly reused — OpenAI analyses the sales data of its GPTs shop to understand which products are popular. When a certain GPT is successful, OpenAI can then easily create its own version, displacing marketplace sellers. Despite the very low barrier to entry, this is perhaps something to think about for legal startups that want to offer their own custom GPT (“Corporate Law GPT”!).