OpenAI recently released a new agentic application called Deep Research. The tool is available exclusively to Pro users on the $200 monthly subscription. It uses their upcoming o3 reasoning model, which is not yet available via API. According to OpenAI’s blog, the Deep Research agent system scores 26.6% on the Humanity’s Last Exam benchmark. However, measuring an agent system directly against standalone language models is not a like-for-like comparison. A more suitable baseline would have been similar research tools such as Perplexity or Google’s Gemini Deep Research.

In addition to the Humanity’s Last Exam benchmark results, OpenAI shared their performance on the GAIA benchmark. GAIA is a public benchmark designed to evaluate AI systems on real-world questions, and the Deep Research agentic system has achieved a new state of the art (SOTA), leading the external leaderboard.

Today, Hugging Face launched an open source initiative to replicate OpenAI’s Deep Research capabilities. It’s worth noting that although Google released its experimental Deep Research model in Gemini in December 2024, there were no significant replication attempts at that time.
According to the Hugging Face team’s blog, they built their prototype in under 24 hours and improved on the previous state of the art on GAIA, advancing from Magentic-One’s 46% to 54% on the validation set.