Sensational episode for you today with the illustrious A.I. author, educator and entrepreneur Sinan Ozdemir on how LLM benchmarks are lying to you... and what you can do about it.
Sinan:
Is Founder and CTO of LoopGenius, a generative A.I. startup.
Authored several excellent books, including, most recently, the bestselling "Quick Start Guide to Large Language Models".
Hosts the "Practically Intelligent" podcast.
Was previously adjunct faculty at The Johns Hopkins University and now teaches several times a month on the O'Reilly platform.
Is a serial A.I. entrepreneur, including having founded a Y Combinator-backed generative A.I. startup way back in 2015 that was later acquired.
Holds a Master’s in Pure Math from Johns Hopkins.
Today’s episode skews slightly toward our more technical listeners, but Sinan excels at explaining complex concepts clearly, so it may appeal to any listener of this podcast.
In today’s episode, Sinan details:
Why the A.I. benchmarks everyone relies on might be lying to you.
How the leading A.I. labs are gaming the benchmark system.
Tricks for effectively evaluating LLMs’ capabilities on your own use cases.
What the future of benchmarking will involve, including how to benchmark agentic and multimodal models.
How a simple question about watermelon seeds reveals the 40% failure rate of even today’s most advanced A.I. models.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.