We've cracked one million listens per quarter for the first time! No doubt buoyed by the mainstream A.I. fascination, but also thanks to our outstanding recent guests, our show had 1.06 million listens in Q1 2023 🍾
The chart shows episode downloads (on podcasting platforms) plus views (on YouTube) for each quarter since I took over as host of The SuperDataScience Podcast in January 2021.
Thank you for listening and providing thoughtful feedback on how we can improve the show. We have fantastic topics lined up for the coming weeks so I'm hopeful we can continue this growth trend in Q2. We're already off to a good start as the past week was — by some margin — the best week for listens in the show's history.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
GPT-4 Has Arrived
SuperDataScience episode #666: an appropriate number for an algorithm that has folks (quixotically) signing a letter to pause all A.I. development. In this first episode of the GPT-4 trilogy, I introduce GPT-4's staggering capabilities in ten minutes.
A Leap in AI Safety and Accuracy
GPT-4 marks a significant advance over its predecessor, GPT-3.5, in terms of both safety and factual accuracy. It is reportedly 82% less likely to respond with disallowed content and 40% more likely to produce factually correct responses. Despite improvements, challenges like sociodemographic biases and hallucinations persist, although they are considerably reduced.
Academic and Professional Exam Performance
The prowess of GPT-4 becomes evident when revisiting queries initially tested on GPT-3.5. Its ability to summarize complex academic content accurately and its human-like response quality are striking. In one test, GPT-4’s output was mistaken for human writing by GPTZero, an AI detection tool, underscoring its sophistication. On the Uniform Bar Exam, GPT-4 scored in the 90th percentile, a massive leap from GPT-3.5's 10th percentile.
Multimodality
GPT-4 introduces multimodality, handling both language and visual inputs. This capability allows for innovative interactions, like recipe suggestions based on fridge contents or transforming drawings into functional websites. This visual aptitude notably boosted its performance in exams like the Biology Olympiad, where GPT-4 scored in the 99th percentile.
The model also demonstrates proficiency in numerous languages, including low-resource ones, outperforming other major models in most languages tested. This linguistic versatility extends to its translation capabilities between these languages.
The Secret Behind GPT-4’s Success
While OpenAI has not disclosed the exact number of model parameters in GPT-4, it's speculated that they significantly exceed GPT-3's 175 billion. This increase, coupled with more and better-curated training data, and the ability to handle vastly more context (up to 32,000 tokens), are likely contributors to GPT-4's enhanced performance.
Reinforcement Learning from Human Feedback (RLHF)
GPT-4 incorporates RLHF, a method that refines the model's output based on human feedback, allowing it to align more closely with desired responses. This approach has already proven effective in previous models like InstructGPT.
GPT-4 represents a monumental step in AI development, balancing unprecedented capabilities with improved safety measures. Its impact is far-reaching, offering new possibilities in various fields and highlighting the importance of responsible AI development and use. As we continue to explore its potential, the conversation around AI safety and ethics becomes increasingly vital.
The SuperDataScience GPT-4 trilogy is comprised of:
• #666 (today): an introductory overview by yours truly
• #667 (Tuesday): world-leading A.I.-monetization expert Vin Vashishta joins me to detail how you can leverage GPT-4 to your commercial advantage
• #668 (next Friday): world-leading A.I.-safety expert Jeremie Harris joins me to detail the (existential!) risks of GPT-4 and the models it paves the way for
MIT Study: ChatGPT Dramatically Increases Productivity
With all of this ChatGPT and GPT-4 news, I was wondering whether these generative A.I. tools actually result in the productivity gains everyone supposes them to. Well, wonder no more…
Astonishing CICERO negotiates and builds trust with humans using natural language
Meta AI's CICERO algorithm, which negotiates and builds trust with humans to perform in the top decile at the game of Diplomacy, is (in my view) the most astounding A.I. feat yet. Hear all about it from Alexander.
As published in the prestigious academic journal Science in November, CICERO is capable of using natural-language conversation to coordinate with humans, develop strategic alliances, and ultimately win in Diplomacy, an extremely complex board game.
Excelling in a game with incomplete information and vastly more possible states of play than games previously conquered by A.I. like chess and go would be a wild feat in and of itself, but CICERO’s generative capacity to converse and negotiate in real-time with six other human players in order to strategize victoriously is the truly mind-boggling capability.
To detail for you how the game of Diplomacy works, why Meta chose to tackle this game with A.I., and how they developed a model that competes in the top decile of human Diplomacy players without any other players even catching a whiff that CICERO could possibly be a machine, my guest in today's episode is Alexander Holden Miller, a co-author of the CICERO paper.
Alex:
• Has been working in Meta AI’s Fundamental AI Research group, FAIR, for nearly eight years.
• Currently serves as a Senior Research Engineering Manager within FAIR.
• Has supported researchers working in most ML sub-domains but has been especially involved in conversational A.I. research and more recently reinforcement learning and planning.
• Holds a degree in Computer Science from Cornell University.
Designing Machine Learning Systems
Mega-bestselling author of the "Designing ML Systems" book, Chip Huyen, joined me to cover her top tips on, well, designing ML systems! ...as well as her burgeoning real-time ML startup. Can you tell we had a ton of fun?
Chip:
• Is Co-Founder of Claypot AI, a platform for real-time machine learning.
• Authored the book “Designing Machine Learning Systems”, which was published by O'Reilly Media and based on the Stanford University course she created and taught on the same topic.
• Also created and taught Stanford's “TensorFlow for Deep Learning” course.
• Previously worked as ML Engineer at data-centric development platform Snorkel AI and as a Senior Deep Learning Engineer at the chip giant NVIDIA.
• Runs an MLOps community on Discord with over 14k members.
• Her helpful posts have earned her over 160k followers on LinkedIn.
Today’s episode will probably appeal most to technical listeners like data scientists and ML engineers, but anyone involved in (or thinking of being involved in) the deployment of ML into real-life systems will learn a ton.
In this episode, Chip details:
• Her top tips for designing production-ready ML applications.
• Why iteration is key to successfully deploying ML models.
• What real-time ML is and the kinds of applications it’s critical for.
• Why Large Language Models like ChatGPT and other GPT series architectures involve limited data science ingenuity but do involve enormous ML engineering challenges.
Five Ways to Use ChatGPT for Data Science
Back in Episode #646, we focused on how anyone can extract commercial value from ChatGPT today, whether ye be a technical data science practitioner or not. In today’s episode, it’s exclusively the technical practitioners’ turn: I’ve got five specific ways that ChatGPT can be used for data science.
Use case #1 is code generation. While ChatGPT was designed primarily as a tool for generating natural language (while, in contrast, OpenAI’s Codex algorithm was designed explicitly for generating code — you can hear all about it in Episode #584), the friendly, conversational UI of ChatGPT nevertheless comes in handy for rapidly generating code. And it can do so in all of the primary software languages for data science, including Python, R, and SQL. ChatGPT’s code is not always going to be perfect, but for quick ideas on how you could be extracting features from your data, implementing an algorithm, or creating a data visualization, ChatGPT is a great tool for getting started.
Use case #2 is translating code between programming languages. Not only can ChatGPT convert your natural-language input into code, it can also translate between programming languages. So if you, for example, are expert at Python but unfamiliar with an R code snippet you found online that you’d like to understand and implement in Python, you could ask ChatGPT to convert the R code into Python for you. Because ChatGPT has training data from many different programming languages, you can now convert perhaps any unfamiliar code you come across into a familiar target programming language of your choice.
Use case #3 is code troubleshooting. Not only can ChatGPT help you with generating code, you can also use it to explain errors that you’re coming across and to suggest how to fix them. You can even ask ChatGPT to rewrite your code for you to remove the bug.
Use case #4 is providing library suggestions. In Python or R, there are countless open-source libraries of code available to you. With ChatGPT, you can now quickly identify which library or libraries are best-suited to a particular task you’d like to perform with your code.
Finally, use case #5 is article summarization. A seemingly endless number of fascinating articles on machine learning innovations are published on arXiv each week. Poring over each of the articles that interests you is likely to be impossible, but with ChatGPT you can instantly have articles summarized and key information extracted, making it much easier for you to stay on top of the latest data science developments.
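To make the five use cases concrete, here is a minimal sketch of prompt templates you could paste into ChatGPT. The wording of each template and the names `PROMPT_TEMPLATES` and `build_prompt` are my own illustrative assumptions, not anything prescribed by OpenAI or covered in the episode, and the snippet only builds the prompt text; it does not call any API.

```python
# Hypothetical prompt templates, one per use case described above.
# The phrasing is an assumption for illustration; adapt it to your own needs.
PROMPT_TEMPLATES = {
    "generate": "Write {language} code that does the following: {task}",
    "translate": "Translate this {source_language} code into {language}:\n\n{code}",
    "troubleshoot": (
        "This {language} code raises the error below. "
        "Explain the error and suggest a fix.\n\n"
        "Code:\n{code}\n\nError:\n{error}"
    ),
    "libraries": "Which {language} libraries are best suited to this task, and why? Task: {task}",
    "summarize": "Summarize the key points of this article in five bullet points:\n\n{text}",
}


def build_prompt(use_case: str, **fields: str) -> str:
    """Fill in the template for one use case; raises KeyError if a field is missing."""
    template = PROMPT_TEMPLATES[use_case]
    return template.format(**fields)


if __name__ == "__main__":
    # Use case #2: ask for an R snippet to be converted into Python.
    print(build_prompt(
        "translate",
        source_language="R",
        language="Python",
        code="x <- c(1, 2, 3)",
    ))
```

Keeping the templates in one dictionary makes it easy to tweak the wording per use case; whatever ChatGPT returns should still be reviewed and tested before you rely on it.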
Open-Source Tools for Natural Language Processing
In today's episode, the brilliant Vincent Warmerdam regales us with invaluable ideas and open-source software libraries for developing A.I. (particularly Natural Language Processing) applications. Enjoy!
Vincent:
• Is an ML Engineer at Explosion, the German software company that specializes in developer tools for A.I. and NLP such as spaCy and Prodigy.
• Is renowned for several open-source tools of his own, including Doubtlab.
• Is behind an educational platform called Calmcode that has over 600 short and conspicuously enjoyable video tutorials about software engineering concepts.
• Was Co-Founder and Chair of PyData Amsterdam.
• Has delivered countless amusing and insightful PyData talks.
• Holds a Masters in Econometrics and Operations Research from Vrije Universiteit Amsterdam (VU Amsterdam).
Today’s episode will appeal primarily to technical listeners as it focuses on ideas and open-source software libraries that are indispensable for data scientists, particularly those developing A.I. or NLP applications.
In this episode, Vincent details:
• The prompt recipes he developed to enable OpenAI GPT architectures to perform tremendously helpful NLP tasks such as data labeling.
• The super-popular open-source libraries he’s developed on his own as well as with Explosion.
• The software tools he uses daily including several invaluable open-source packages made by other folks.
• How both linguistics and operations research are extremely useful fields to be a better NLP practitioner and ML practitioner, respectively.
SuperDataScience Podcast Audience Growth
Since I started hosting the SuperDataScience Podcast in Q1 of 2021, our audience has quadrupled, with episode downloads (plus YouTube views) now approaching one million per quarter. Thank you for listening!
I'm only a small part of the team required to release the high-quality episodes we do 104 times every year. The world-class people making the machine hum along behind the scenes are:
• Ivana Zibert: Podcast Manager
• Natalie Ziajski: Sales, Marketing, and my personal Operations Manager
• Mario Pombo: Audio & Video Production
• Serg Masís: Research
• Dr. Zara Karschay: Writer
• Sylvia Ogweng: Writer
• Kirill Eremenko: Founder, Co-Owner, Former Host
These people all rock and you rock for your support too! Armed with your invaluable ongoing feedback on episodes, I hope I can continue to learn what resonates most with you and that this growth can keep going. It's a great honor to serve you, our wonderful guests, and our episode sponsors.
How to Build Data and ML Products Users Love
What makes people latch onto data products and come back for more? In today's episode, Brian T. O'Neill unveils the processes and teams that make data and A.I. products engaging and sticky for users.
Brian:
• Founded and runs Designing for Analytics, a consultancy that specializes in designing analytics and ML products so that they are adopted.
• Hosts the "Experiencing Data" podcast, an entertaining show that covers how to use product-development methodologies and UX design to drive meaningful user and business outcomes with data.
In today's episode, Brian details:
• What data product management is.
• Why so many data projects fail.
• How to develop machine learning-powered products that users love.
• The teams and skill sets required to develop successful data products.
How to Learn Data Engineering
As data sets continue to grow exponentially, Data Engineering skills become increasingly essential — standalone or as part of Data Scientists' expertise. In today's episode, Andreas Kretz details how to Learn Data Engineering.
Andreas:
• Is the Founder of Learn Data Engineering, a platform through which he’s taught over a thousand students the theory and practice of data engineering.
• Has provided countless more folks with data engineering tips and tricks through his YouTube channel, which has over 10,000 subscribers.
• Worked for ten years at the German industrial giant Bosch, including as a data engineering team lead and data lab team lead.
• Holds a Computer Science degree from the Technical University of Applied Sciences Würzburg-Schweinfurt (THWS).
• With over 100,000 followers on LinkedIn, has twice been recognized as a Top Voice for Data Science and Analytics on the platform.
Today’s episode will appeal primarily to technical listeners, particularly data scientists who are keen to develop ever-more-critical data engineering skills.
In this episode, Andreas details:
• What data engineering is and how it relates to adjacent fields like data science, software engineering, and machine learning engineering.
• Why data engineering skills become increasingly essential to data scientists and data analysts with each passing year.
• What sets Senior Data Engineers apart from junior ones.
• His general process for tackling data engineering problems.
• The must-know data-engineering tools of today as well as the emerging ones you shouldn’t miss.
A.I. Talent and the Red-Hot A.I. Skills
What skills and traits do the best A.I. talent have? And how do you attract the best A.I. talent to your firm? Jaclyn Rice Nelson of Tribe AI, the world's most prestigious ML collective, fills us in in today's episode.
Jaclyn:
• Is Co-Founder/CEO of Tribe AI, a "collective" of ML engineers and data scientists that drop into companies to accelerate their A.I. capabilities.
• Previously worked in senior roles at Google and CapitalG, Alphabet's growth equity fund.
In today's episode, she details:
• What characterizes the very best A.I. talent.
• What skills you should learn today to be tomorrow’s top A.I. talent.
• How to attract the top engineers and data scientists to your firm.
• The specific category of A.I. project that her clients are suddenly demanding tons of help with.
AI ROI: How to get a profitable return on an AI-project investment
Relative to other kinds of software R&D, A.I. projects are typically expensive. Today's sage guest, Keith McCormick, details approaches for ensuring that A.I. projects not only are transparent, but that they are profitable too.
Keith:
• Is Executive Data Scientist in Residence at Pandata LLC, a consulting firm focused on transparent, human-centered A.I.
• Is predictive analytics instructor at UC Irvine.
• Has created 20 LinkedIn Learning courses on machine learning and A.I. with, in aggregate, hundreds of thousands of students.
• Authored four books, recurringly focused on statistics with SPSS Modeler.
Today’s episode should appeal to anyone who’s eager to get a return on an investment in an A.I. project, whether you have a technical or non-technical background.
In today’s episode, Keith details:
• His straightforward approach to ensuring that A.I. projects are successful.
• How A.I. projects need to be set up and managed in order to get a profitable return on the project.
• The corporate roles that need to be in place in order for a data science team to complete projects that drive value.
• What A.I. transparency is and how it relates to the field of Explainable A.I.
• How data scientists who have advanced software-writing skills could benefit from the use of low-code/no-code tools.
A.I. Speech for the Speechless
Thanks to a new lip-reading A.I., non-verbal medical patients can now "speak" to their clinicians and loved ones…
Introduction to Machine Learning
After a multi-year hiatus, Hadelin and Kirill — the most popular data science instructors on Udemy, with 2+ million students — have released a new ML course. In this episode, they introduce what ML is from scratch.
Kirill Eremenko:
• Is Founder and CEO of SuperDataScience, an e-learning platform.
• Founded the SuperDataScience Podcast in 2016 and hosted the show until he passed me the reins two years ago.
Hadelin de Ponteves:
• Was a data engineer at Google before becoming a content creator.
• In 2020, took a break from Data Science content to produce and star in a Bollywood film featuring "Miss Universe" Harnaaz Sandhu.
Together, Kirill and Hadelin:
• Have created dozens of data science courses.
• Are the most popular data science instructors on the Udemy platform, with over two million students.
• Recently published, after a multi-year hiatus from creating courses, a new course called “Machine Learning in Python: Level 1”.
This episode serves as an introduction to machine learning so will primarily appeal to folks who aren’t already expert at ML — that said, I’ve been doing ML for over 15 years and still learned a few critical new pieces of information during filming so this episode could serve as a fun, light-hearted refresher for experts.
In this episode, Kirill and Hadelin introduce ML concepts such as:
• Supervised vs unsupervised learning
• Classification errors
• Logistic regression
• Feature scaling
• The Adjusted R-Squared metric
• The assumptions of linear regression
• The Elbow Method
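Two of the concepts in this list boil down to short formulas, so here is a minimal plain-Python sketch of them. It uses the standard textbook definitions of Adjusted R-Squared and of standardization (one common form of feature scaling); the function names are my own and the code is not taken from Kirill and Hadelin's course.

```python
from typing import List


def adjusted_r_squared(r_squared: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1).

    Penalizes R^2 for the number of features p, so adding an
    uninformative feature can lower the score.
    """
    denominator = n_samples - n_features - 1
    if denominator <= 0:
        raise ValueError("n_samples must be greater than n_features + 1")
    return 1.0 - (1.0 - r_squared) * (n_samples - 1) / denominator


def standardize(values: List[float]) -> List[float]:
    """Feature scaling by standardization: subtract the mean, divide by the standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5
    if std == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(v - mean) / std for v in values]


if __name__ == "__main__":
    # An R^2 of 0.90 with 101 samples and 5 features shrinks slightly once adjusted.
    print(adjusted_r_squared(0.90, n_samples=101, n_features=5))
    # After standardization the feature has mean 0 and unit standard deviation.
    print(standardize([1.0, 2.0, 3.0]))
```

In practice you would likely reach for a library such as scikit-learn for scaling; the point here is only to show what the arithmetic looks like.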
Is Data Science Still Sexy?
Had far too much fun filming today's episode with Prof. Tom Davenport, many-time author of bestselling books on analytics and coiner of data science as "sexiest job of the century". A decade on, does he still think so?
Tom:
• Has published over 20 books, such as the bestselling "Competing on Analytics", "The A.I. Advantage", and "Analytics at Work".
• Has penned 300+ articles in publications like the Harvard Business Review and writes regular columns for Forbes and The Wall Street Journal.
• Is President's Distinguished Professor of IT and Management at Babson College.
• Is Visiting Professor at the Saïd Business School, University of Oxford.
• Is Senior Advisor to the A.I. practice for the global professional services giant Deloitte.
• With nearly 300k followers, he’s recognized as a LinkedIn Top Voice.
Today’s episode is well-suited to technical and non-technical listeners alike. Every part of it should be appealing to anyone who’s keen to hear about the leading edge of commercial applications of A.I.
In this episode, Prof. Davenport details:
• The discrete A.I. maturity levels of organizations.
• How organizations become A.I. fueled.
• Which jobs are susceptible to replacement by A.I.
• Which jobs are ripe for augmenting with A.I.
• What roles other than data scientist are required to deploy effective machine learning models.
• What the future of data science will look like and, having coined data science as “the sexiest job of the 21st century” a decade ago, whether he still thinks it is today.
Machine Learning for Video Games
Carly Taylor — Lead ML Engineer for the "Call of Duty" franchise — joined me for today's fun, super informative episode on low-latency software engineering, real-time ML, and the future of gaming.
Carly:
• Grew rapidly from a Sr Data Scientist role to simultaneously holding "Expert ML Engineer" and "Sr Mgr — Security Strategy" titles since joining Activision two years ago.
• At Activision, specifically works on Call of Duty, one of the top-grossing video game franchises of all time, with over $30 billion in sales and 250m global users annually.
• Prior to Activision, rapidly grew from Analyst to Data Scientist roles.
• Has amassed a LinkedIn following of 75k+ by regularly posting fruitful tips on breaking into a data science career and progressing within it.
• Advocates for women in STEM, tech, and gaming careers.
• Offers 1:1 career consulting to anyone who desires it.
• Holds a Masters in Computational Chemistry from the University of Colorado and completed the Galvanize Data Science Immersive program.
Today’s episode certainly has technical tidbits throughout that will be useful to hands-on practitioners, but much of the wide-ranging conversation will be fascinating to any listener, particularly if you have an interest in video games, the so-called metaverse, or real-time machine learning.
In this episode, Carly details:
• What the future of gaming holds.
• Why low latency is critical for an optimal gaming experience and the tools that online engineers use to make it happen.
• Her favorite operating systems, software packages, and keyboards.
• How to transition effectively from a quantitative academic background into data science.
• How to file a patent.
• Why she’s called the “Rebel Data Scientist”.
A Framework for Big Life Decisions
The biggest decisions we make involve trade-offs between professional opportunity (money) and our personal life (love). Today, Stanford labor economist Prof. Myra Strober provides a framework for making a big choice.
A.I. for Medicine
Machine learning is ushering in a new era of medicine, e.g., by predicting the shape of therapeutic drugs and assisting in their design. Witty Prof. Charlotte Deane of the University of Oxford and Exscientia explains how.
Charlotte:
• Is a global-leading expert on using ML for designing therapeutic drugs.
• Has been faculty at the University of Oxford for over 20 years, where she serves as Professor of Structural Bioinformatics and heads the 25-person Protein Informatics Lab.
• Is Chief Scientist Biologics A.I. at Exscientia, a NASDAQ-listed pharmatech company that uses computational approaches to drive drug development in a fraction of the time of traditional drug companies.
• Was COVID-Response Director for UK Research and Innovation, resulting in Queen Elizabeth II honoring her as a Member of the Most Excellent Order of the British Empire.
Today’s episode should appeal to technical and non-technical folks alike as it features an absolutely brilliant scientist and communicator describing how we can use A.I. to speed the discovery of new molecules that help our body fight off ailments as diverse as viruses and cancer.
In this episode, Prof. Deane details:
• How your immune system works.
• What biologics are and why they’re such an important class of drugs.
• What’s holding back the widespread use of precision medicines that are pinpoint-customized to a specific tumor in a specific person.
• What the celebrated AlphaFold algorithm does exquisitely and where it (and all other computational models of protein folding) still need to improve.
• How she used data to marshal the UK’s scientific response to Covid.
• How data and machine learning will transform drug development over the coming years.
Continuous Calendar for 2023
Well, another year, another continuous calendar from us here at SuperDataScience!
Data Science Trends for 2023
Happy New Year! To kick it off, the entrepreneur, futurist, and mega-popular Machine Learning instructor Sadie St. Lawrence joins me to predict the biggest data science trends of 2023 🍾
We start the episode off by looking back at how our predictions for 2022 panned out from a year ago and then we dive into our predictions for the year ahead. Specific trends we discuss include:
• Data as a product
• Multimodal models
• Decentralization of enterprise data
• A.I. policy
• Environmental sustainability
This episode will appeal to technical and non-technical folks alike — anyone who’d like to understand the trends that will shape the field of data science and the broader world not only in 2023 but also in the years beyond.
Sadie:
• Has created data science and ML courses enjoyed by 350k+ students.
• Is Founder and CEO of Women In Data, a community of over 20k women across 17 countries.
• Serves on multiple start-up boards.
• Hosts the Data Bytes podcast.