Osborne Clarke’s AI engineer, Christian Braun, recently ran tests comparing the firm’s in-house OC-GPT model, built on GPT-4o, against the well-known legal AI tool Harvey. Braun suggested that OC-GPT not only matched but surpassed Harvey’s performance in several areas.
This bold claim has raised eyebrows across the legal tech industry, especially given the advanced fine-tuning and detailed system prompting that Harvey uses.
Testing in a Secure Azure Environment
Braun explained that the tests were conducted in a secure Azure environment using OC-GPT, which is based on GPT-4o. The tests aimed to replicate Harvey’s BigLaw Bench results by using identical sample questions. According to Braun, the model not only met Harvey’s outcomes but, in some cases, exceeded them.
The tests incorporated prompting techniques such as chain-of-thought and few-shot prompting, and responses containing hallucinations were excluded from the accuracy scoring.
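To illustrate what those techniques involve, here is a minimal Python sketch of few-shot and chain-of-thought prompting against an Azure OpenAI GPT-4o deployment. The endpoint, deployment name, and the sample legal questions are placeholders for illustration, not OC’s actual configuration.

```python
# Minimal sketch of few-shot + chain-of-thought prompting against an
# Azure OpenAI GPT-4o deployment. The endpoint, deployment name, and
# sample questions are placeholders -- not Osborne Clarke's actual setup.
import os
from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

# Few-shot examples: show the model the answer style and format we expect.
FEW_SHOT = [
    {"role": "user", "content": "Does clause 4.2 cap the supplier's liability?"},
    {"role": "assistant", "content": "Yes. Clause 4.2 caps liability at the fees "
     "paid in the preceding 12 months, excluding losses caused by wilful misconduct."},
]

def ask(question: str) -> str:
    messages = [
        # Chain-of-thought instruction: ask for explicit reasoning steps
        # before the conclusion, grounded only in the supplied material.
        {"role": "system", "content": (
            "You are a legal research assistant. Reason step by step through "
            "the relevant provisions before stating your conclusion, and rely "
            "only on material provided in the question."
        )},
        *FEW_SHOT,
        {"role": "user", "content": question},
    ]
    response = client.chat.completions.create(
        model="gpt-4o",          # name of the Azure deployment (placeholder)
        messages=messages,
        temperature=0,           # deterministic output for benchmarking runs
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(ask("Can the customer terminate for convenience under clause 9?"))
```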
The Importance of Prompt Engineering
Braun emphasized the critical role that prompt engineering plays in achieving optimal results. He encouraged law firms to educate their staff on prompt engineering to fully unlock the potential of AI systems.
He also questioned whether Harvey’s results could be improved with better prompts, sparking further interest in the comparative capabilities of these two models.
What Are the Differences in Evaluation Methods?
Despite OC’s striking claims, there are real questions about whether the tests were truly comparable. Harvey’s team used a far more comprehensive evaluation process: rather than simply checking for a correct answer, they assessed how well the AI helped a lawyer complete a task.
They employed bespoke rubrics to assess the model’s effectiveness, penalizing common failure modes such as incorrect tone, irrelevant material, and hallucinations. This tougher scrutiny may explain why Harvey’s results are considered the more rigorous measure, despite OC’s higher headline scores.
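Harvey has not published these rubrics, so the following Python sketch is purely hypothetical: it shows one way a rubric with weighted criteria and penalties for failure modes could be scored. All criteria, weights, and penalty values are invented for illustration.

```python
# Hypothetical sketch of rubric-based scoring with penalties for common
# failure modes (tone, irrelevance, hallucination). The criteria, weights,
# and penalty values are illustrative -- Harvey's actual rubrics are not public.
from dataclasses import dataclass, field

@dataclass
class Rubric:
    # Positive criteria: description -> weight (fraction of a perfect score).
    criteria: dict[str, float] = field(default_factory=dict)
    # Penalties: failure mode -> points deducted when a grader flags it.
    penalties: dict[str, float] = field(default_factory=dict)

def score(rubric: Rubric, met: set[str], flagged: set[str]) -> float:
    """Return a 0-1 score: credit for criteria met, minus penalties, floored at 0."""
    earned = sum(w for name, w in rubric.criteria.items() if name in met)
    deducted = sum(p for name, p in rubric.penalties.items() if name in flagged)
    return max(0.0, earned - deducted)

contract_review = Rubric(
    criteria={
        "identifies governing law clause": 0.3,
        "flags unilateral termination right": 0.4,
        "recommends concrete next step": 0.3,
    },
    penalties={
        "hallucinated authority": 0.5,
        "irrelevant material": 0.2,
        "inappropriate tone": 0.1,
    },
)

# A grader (human or model-assisted) marks which criteria were met and
# which failure modes appeared in the answer being scored.
print(score(contract_review,
            met={"identifies governing law clause", "flags unilateral termination right"},
            flagged={"irrelevant material"}))   # -> 0.5
```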
Critical Differences in Judgment
While OC stated that hallucinations were excluded from their evaluation, they didn’t explain whether they considered more nuanced criteria such as tone or task relevance. These factors are vital in legal work, and omitting them could make OC’s results less reliable.
It’s also unclear whether OC’s tests involved the same number of cases as Harvey’s. Harvey publicly released only six examples, but its internal assessments involved dozens more. If OC replicated only the published examples, that limited sample could have skewed the results.
The Need for Standardized AI Testing
This debate highlights the growing need for standardized AI testing in the legal field. If two organizations can run what they believe to be the same test and achieve different outcomes, the industry must develop common standards.
Without shared guidelines, legal AI tools risk being judged inconsistently, leading to potentially misleading claims about their capabilities.
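One possible starting point is a shared, machine-readable format for benchmark cases, so that every team runs identical inputs and scoring rules. The Python sketch below is an assumption-laden illustration of such a format; the field names are invented, not an established industry schema.

```python
# Illustrative sketch of a shared, machine-readable benchmark case format,
# so that different teams run identical inputs and scoring rules. The field
# names are assumptions, not an existing industry standard.
import json
from dataclasses import dataclass, asdict

@dataclass
class BenchmarkCase:
    case_id: str
    task: str                   # e.g. "contract_review", "legal_research"
    prompt: str                 # the exact input given to every model
    rubric_criteria: dict       # criterion -> weight
    failure_penalties: dict     # failure mode -> deduction

case = BenchmarkCase(
    case_id="contract-001",
    task="contract_review",
    prompt="Summarise the termination rights in the attached MSA.",
    rubric_criteria={"covers termination for cause": 0.5,
                     "covers termination for convenience": 0.5},
    failure_penalties={"hallucinated clause": 0.5, "irrelevant material": 0.2},
)

# Serialising cases to JSON lets any vendor or law firm replay the same test.
print(json.dumps(asdict(case), indent=2))
```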
Conclusion: A Call for Common AI Benchmarks
Braun’s claim that OC-GPT surpasses Harvey is an intriguing one, but without standardized testing it is difficult to fully trust the results. The legal tech industry must prioritize the creation of shared evaluation methods to ensure transparency and accuracy in claims about AI performance.
Key Takeaways
- Osborne Clarke claims its GPT-4o-based OC-GPT outperformed Harvey in several tests.
- Prompt engineering plays a crucial role in AI effectiveness.
- There’s a need for standardized AI evaluation methods in the legal field.
In the end, the real question remains: did OC-GPT truly surpass Harvey, or were the tests simply not comparable?