Designing Machine Learning Systems

Added on March 14, 2023 by Jon Krohn.

Mega-bestselling author of the "Designing ML Systems" book, Chip Huyen, joined me to cover her top tips on, well, designing ML systems! ...as well as her burgeoning real-time ML startup. Can you tell we had a ton of fun?

Chip:
• Is Co-Founder of Claypot AI, a platform for real-time machine learning.
• Authored the book “Designing Machine Learning Systems”, which was published by O'Reilly Media and based on the Stanford University course she created and taught on the same topic.
• Also created and taught Stanford's “TensorFlow for Deep Learning” course.
• Previously worked as ML Engineer at data-centric development platform Snorkel AI and as a Senior Deep Learning Engineer at the chip giant NVIDIA.
• Runs an MLOps community on Discord with over 14k members.
• Her helpful posts have earned her over 160k followers on LinkedIn.

Today’s episode will probably appeal most to technical listeners like data scientists and ML engineers, but anyone involved in (or thinking of being involved in) the deployment of ML into real-life systems will learn a ton.

In this episode, Chip details:
• Her top tips for designing production-ready ML applications.
• Why iteration is key to successfully deploying ML models.
• What real-time ML is and the kinds of applications it’s critical for.
• Why Large Language Models like ChatGPT and other GPT series architectures involve limited data science ingenuity but do involve enormous ML engineering challenges.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

Five Ways to Use ChatGPT for Data Science

Added on March 10, 2023 by Jon Krohn.

Back in Episode #646, we focused on how anyone can extract commercial value from ChatGPT today, whether ye be a technical data science practitioner or not. Today’s episode is exclusively for the technical practitioners: I’ve got five specific ways that ChatGPT can be used for data science.

Use case #1 is code generation. ChatGPT was designed primarily as a tool for generating natural language (OpenAI’s Codex algorithm, in contrast, was designed explicitly for generating code; you can hear all about it in Episode #584), but its friendly, conversational UI nevertheless comes in handy for rapidly generating code. And it can do so in all of the primary software languages for data science, including Python, R, and SQL. ChatGPT’s code is not always going to be perfect, but for quick ideas on how you could be extracting features from your data, implementing an algorithm, or creating a data visualization, ChatGPT is a great tool for getting started.

Use case #2 is translating code between programming languages. Not only can ChatGPT convert your natural-language input into code, it can also translate between programming languages. So if you, for example, are expert at Python but unfamiliar with an R code snippet you found online that you’d like to understand and implement in Python, you could ask ChatGPT to convert the R code into Python for you. Because ChatGPT has training data from many different programming languages, you can now convert perhaps any unfamiliar code you come across into a familiar target programming language of your choice.
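
To make this concrete, here is a small hand-written illustration of the kind of R-to-Python translation you might ask for. The R snippet and the variable names are hypothetical, and this is not actual ChatGPT output; any translation the tool gives you should be checked in the same way.

```python
# Hypothetical R snippet found online (kept here as comments for reference):
#   x   <- c(2, 4, 6, 8)
#   big <- x[x > 3]      # logical indexing: keep values greater than 3
#   mean(big)

from statistics import mean

x = [2, 4, 6, 8]

# A list comprehension plays the role of R's logical indexing x[x > 3].
big = [value for value in x if value > 3]

# statistics.mean is the standard-library counterpart of R's mean().
result = mean(big)
print(result)
```

Running both versions on the same small input and comparing the results is a quick way to verify that a translation preserves the original behavior.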

Use case #3 is code troubleshooting. Not only can ChatGPT help you generate code, it can also explain errors that you’re coming across and suggest how to fix them. You can even ask ChatGPT to rewrite your code for you with the bugs fixed.

Use case #4 is providing library suggestions. In Python or R, there are countless open-source libraries of code available to you. With ChatGPT, you can now quickly identify which library or libraries are best-suited to a particular task you’d like to perform with your code.

Finally, use case #5 is article summarization. A seemingly endless number of fascinating articles on machine learning innovations are published on arXiv each week. Poring over each of the articles that interests you is likely to be impossible, but with ChatGPT you can instantly have articles summarized and key information extracted, making it much easier for you to stay on top of the latest data science developments.

Open-Source Tools for Natural Language Processing

Added on March 7, 2023 by Jon Krohn.

In today's episode, the brilliant Vincent Warmerdam regales us with invaluable ideas and open-source software libraries for developing A.I. (particularly Natural Language Processing) applications. Enjoy!

Vincent:
• Is an ML Engineer at Explosion, the German software company that specializes in developer tools for A.I. and NLP such as spaCy and Prodigy.
• Is renowned for several open-source tools of his own, including Doubtlab.
• Is behind an educational platform called Calmcode that has over 600 short and conspicuously enjoyable video tutorials about software engineering concepts.
• Was Co-Founder and Chair of PyData Amsterdam.
• Has delivered countless amusing and insightful PyData talks.
• Holds a Masters in Econometrics and Operations Research from Vrije Universiteit Amsterdam (VU Amsterdam).

Today’s episode will appeal primarily to technical listeners, as it focuses on ideas and open-source software libraries that are indispensable for data scientists, particularly those developing A.I. or NLP applications.

In this episode, Vincent details:
• The prompt recipes he developed to enable OpenAI GPT architectures to perform tremendously helpful NLP tasks such as data labeling.
• The super-popular open-source libraries he’s developed on his own as well as with Explosion.
• The software tools he uses daily including several invaluable open-source packages made by other folks.
• How both linguistics and operations research are extremely useful fields to be a better NLP practitioner and ML practitioner, respectively.

SuperDataScience Podcast Audience Growth

Added on March 6, 2023 by Jon Krohn.

Since I started hosting the SuperDataScience Podcast in Q1 of 2021, our audience has quadrupled, with episode downloads (plus YouTube views) now approaching one million per quarter. Thank you for listening!

I'm only a small part of the team required to release the high-quality episodes we do 104 times every year. The world-class people making the machine hum along behind the scenes are:

• Ivana Zibert: Podcast Manager
• Natalie Ziajski: Sales, Marketing, and my personal Operations Manager
• Mario Pombo: Audio & Video Production
• Serg Masís: Research
• Dr. Zara Karschay: Writer
• Sylvia Ogweng: Writer
• Kirill Eremenko: Founder, Co-Owner, Former Host

These people all rock and you rock for your support too! Armed with your invaluable ongoing feedback on episodes, I hope I can continue to learn what resonates most with you and that this growth can keep going. It's a great honor to serve you, our wonderful guests, and our episode sponsors.

Hg Capital's "Digital Forum"

Added on March 6, 2023 by Jon Krohn.

At Hg Capital's "Digital Forum" in London, I delivered a keynote on "Getting Value from A.I." — my slides and the slickly-edited video production on YouTube are available now.

With a focus on B2B SaaS applications, over 45 minutes I covered:
1. What Deep Learning A.I. is and How it Works
2. Tasks that are Replaceable with A.I. vs Tasks that can be Augmented
3. How to Effectively Implement A.I. Research into Production

The audience engagement was terrific and the on-stage Q&A carried on afterward for an energizing 30 additional minutes. It felt like we could have kept on going much longer!

How to Build Data and ML Products Users Love

Added on March 3, 2023 by Jon Krohn.

What makes people latch onto data products and come back for more? In today's episode, Brian T. O'Neill unveils the processes and teams that make data and A.I. products engaging and sticky for users.

Brian:
• Founded and runs Designing for Analytics, a consultancy that specializes in designing analytics and ML products so that they are adopted.
• Hosts the "Experiencing Data" podcast, an entertaining show that covers how to use product-development methodologies and UX design to drive meaningful user and business outcomes with data.

In today's episode, Brian details:
• What data product management is.
• Why so many data projects fail.
• How to develop machine learning-powered products that users love.
• The teams and skill sets required to develop successful data products.

NLP with ChatGPT (and other LLMs)

Added on March 2, 2023 by Jon Krohn.

Over 1,400 people registered for yesterday's "NLP with ChatGPT (and other LLMs)" conference that I hosted on the O'Reilly Media platform. Kudos to speakers Sinan, Melanie and Shaan for making it a smashing success 🎉

This screenshot is a taste of what it looked like from inside the broadcasting platform, captained flawlessly by producers Joan Baker and Nurul Ishak, PMP.

The presenters each spent 30 minutes presenting on their topics and then engaged in riveting Q&A with the highly engaged attendees:
• Sinan Ozdemir: The A.I. entrepreneur and author introduced the theory behind Transformer Architectures and LLMs like BERT, T5, and GPT.
• Melanie Subbiah: A first author on the original GPT-3 paper, she led interactive demos of the broad range of capabilities of LLMs like ChatGPT.
• Shaan Khosla: A data scientist on my team at Nebula.io, he detailed practical tips on training, validating, and productionizing LLMs hands-on in Python.

I've heard word that, unusually for a live event in O'Reilly, the footage of this conference will be made available as a video within the platform. Stay tuned for details!

Getting Value From A.I.

Added on March 1, 2023 by Jon Krohn.

My keynote on "Getting Value from A.I." — which I delivered at Hg Capital's "Digital Forum" in London — is now live on YouTube!

With a focus on B2B SaaS applications, over 45 minutes I covered:
1. What Deep Learning A.I. is and How it Works
2. Tasks that are Replaceable with A.I. vs Tasks that can be Augmented
3. How to Effectively Implement A.I. Research into Production

The audience engagement was terrific and the on-stage Q&A carried on afterward for an energizing 20 additional minutes. All of this is captured in the slickly-edited video production.

How to Learn Data Engineering

Added on February 28, 2023 by Jon Krohn.

As data sets continue to grow exponentially, Data Engineering skills become increasingly essential — standalone or as part of Data Scientists' expertise. In today's episode, Andreas Kretz details how to Learn Data Engineering.

Andreas:
• Is the Founder of Learn Data Engineering, a platform through which he’s taught over a thousand students the theory and practice of data engineering.
• Has provided countless more folks with data engineering tips and tricks through his YouTube channel, which has over 10,000 subscribers.
• Worked for ten years at the German industrial giant Bosch, including as a data engineering team lead and data lab team lead.
• Holds a Computer Science degree from the Technical University of Applied Sciences Würzburg-Schweinfurt (THWS).
• With over 100,000 followers on LinkedIn, has twice been recognized as a Top Voice for Data Science and Analytics on the platform.

Today’s episode will appeal primarily to technical listeners, particularly data scientists who are keen to develop ever-more-critical data engineering skills.

In this episode, Andreas details:
• What data engineering is and how it relates to adjacent fields like data science, software engineering, and machine learning engineering.
• Why data engineering skills become increasingly essential to data scientists and data analysts with each passing year.
• What sets Senior Data Engineers apart from junior ones.
• His general process for tackling data engineering problems.
• The must-know data-engineering tools of today as well as the emerging ones you shouldn’t miss.

A.I. Talent and the Red-Hot A.I. Skills

Added on February 24, 2023 by Jon Krohn.

What skills and traits do the best A.I. talent have? And how do you attract the best A.I. talent to your firm? Jaclyn Rice Nelson of Tribe AI, the world's most prestigious ML collective, fills us in in today's episode.

Jaclyn:
• Is Co-Founder/CEO of Tribe AI, a "collective" of ML engineers and data scientists who drop into companies to accelerate their A.I. capabilities.
• Previously worked in senior roles at Google and CapitalG, Alphabet's growth equity fund.

In today's episode, she details:
• What characterizes the very best A.I. talent.
• What skills you should learn today to be tomorrow’s top A.I. talent.
• How to attract the top engineers and data scientists to your firm.
• The specific category of A.I. project that her clients are suddenly demanding tons of help with.

Digital Conference: NLP with ChatGPT (and other Large Language Models)

Added on February 22, 2023 by Jon Krohn.

Large Language Models have revolutionized the field of Natural Language Processing, powering mind-blowing tools such as ChatGPT and DALL-E. On March 1, we'll cover these exciting models from development to deployment.

The half-day online conference is brought to you by Pearson and will take place within the O'Reilly Media platform on March 1st from noon to 3pm ET.

The presenters are at the absolute vanguard on their topics:
• Sinan Ozdemir: The A.I. entrepreneur and author introduces the theory behind Transformer Architectures and LLMs like BERT, GPT, and T5.
• Melanie Subbiah: A first author on the original GPT-3 paper, she'll lead interactive demos of the broad range of LLM capabilities.
• Shaan Khosla: A data scientist on my team at Nebula.io, he'll detail practical tips on training, validating, and productionizing LLMs.

If you don't have access to the O'Reilly online platform through your employer or school, you can use my special code "SDSPOD23" to get a 30-day trial and enjoy the conference for free!

AI ROI: How to get a profitable return on an AI-project investment

Added on February 21, 2023 by Jon Krohn.

Relative to other kinds of software R&D, A.I. projects are typically expensive. Today's sage guest, Keith McCormick, details approaches for ensuring that A.I. projects not only are transparent, but that they are profitable too.

Keith:
• Is Executive Data Scientist in Residence at Pandata LLC, a consulting firm focused on transparent, human-centered A.I.
• Is a predictive analytics instructor at UC Irvine.
• Has created 20 LinkedIn Learning courses on machine learning and A.I. with, in aggregate, hundreds of thousands of students.
• Authored four books, recurringly focused on statistics with SPSS Modeler.

Today’s episode should appeal to anyone who’s eager to get a return on an investment in an A.I. project, whether you have a technical or non-technical background.

In today’s episode, Keith details:
• His straightforward approach to ensuring that A.I. projects are successful.
• How A.I. projects need to be set up and managed in order to get a profitable return on the project.
• The corporate roles that need to be in place in order for a data science team to complete projects that drive value.
• What A.I. transparency is and how it relates to the field of Explainable A.I.
• How data scientists who have advanced software-writing skills could benefit from the use of low-code/no-code tools.

Mike Wimmer: The 14-Year-Old A.I. Entrepreneur

Added on February 17, 2023 by Jon Krohn.

His CS degree (with a perfect GPA!) completed, 14-year-old sensation Mike Wimmer is now turning his attention to socially-impactful A.I. projects (such as conserving the world's coral reefs) and his tech startups.

Meet the (very funny!) teenager in today's episode and hear about:
• How he got started in A.I. at such a young age.
• How he's using A.I. to detect invasive species with remote-operated vehicles.
• His preferred software stack for building machine learning applications.
• His vision for an automated future with tons of people inspired to create socially-impactful solutions.

Efficiently Glean-ing Insights from Vast Data Warehouses

Added on February 14, 2023 by Jon Krohn.

In today's episode, the super-sharp and super-technical founder Carlos Aguilar details how he's built a platform to democratize data analytics and efficiently Glean insights from across vast data warehouses 📊

Carlos:
• Is Founder and CEO of Glean, a New York-based start-up that has raised $7m in venture capital to democratize data analytics and data insights.
• Spent six years as the VP of Data Insights Engineering at Flatiron Health, a cancer data platform that was acquired by Swiss pharma giant Roche.
• Spent five years engineering robots at Amazon.
• Holds a Master’s in mechanical and aerospace engineering from Cornell University.

Today’s episode includes some technical content near the beginning that will appeal most to data science and software engineering practitioners, but we do break down the main takeaways from those technical discussions so that any interested listener can partake. The remainder of the episode will appeal to anyone who’s keen to hear from an extremely intelligent technical founder on how to successfully launch a data-centric start-up.

In this episode, Carlos details:
• The software stack Glean built their platform with to enable their customers to quickly obtain slick data visualizations and insights from their vast data warehouses.
• How to grow an entrepreneurial idea into an investable company.
• The essential characteristics to look for in your founding team.
• How to ensure your early clients are continuously delighted by your evolving software platform.
• How he used genetic algorithms to enable a robotic arm to paint beautiful, creative art onto real-world canvases 🎨

A.I. Speech for the Speechless

Added on February 10, 2023 by Jon Krohn.

Thanks to a new lip-reading A.I., non-verbal medical patients can now "speak" to their clinicians and loved ones…

The Intentional Use of Color in Data Communication

Added on February 7, 2023 by Jon Krohn.

In today's episode, Kate Strachnyi, author of the new book ColorWise, opened my eyes to the vastly underutilized power of the intentional use of color in data visualizations. Now you can harness its power too!

Kate:
• Is a multi-time data science book author.
• Her latest book, ColorWise, was published by O'Reilly Media: It’s a beautiful, comprehensive guide to the effective use of color when communicating data visually.
• Founded the DATAcated Circle, a community of data professionals committed to engaging and learning together.
• Is a megastar on LinkedIn where she has over 170k followers and was twice recognized as a LinkedIn Top Voice for Data Science & Analytics.
• Is big into long-distance running; her longest to date was a 50-mile (!!!) ultramarathon in New York’s Central Park.

Today’s episode should appeal to technical and non-technical folks alike because I suspect that pretty well any listener of this show presents data and could benefit from learning how to do so more effectively with the intentional use of color.

In today’s episode, Kate details:
• Why the intentional use of color matters.
• What thought process you should follow to select a color scheme for a visualization.
• Special considerations for color choice, such as accessibility, cultural understanding, and human psychology.
• How to effectively use multiple visualizations together in a document, presentation or dashboard.
• Her favorite data viz tools.

Want a free digital copy of ColorWise? The first five people to comment on this post that they want one, get one! Thanks to O'Reilly's Suzanne Huston for offering this to our listeners.

If you miss out on one of the five copies, you can use my special code "SDSPOD23" to get a free 30-day trial of the O'Reilly platform and read the book there.

SparseGPT: Remove 100 Billion Parameters but Retain 100% Accuracy

Added on February 3, 2023 by Jon Krohn.

Today’s episode isn’t specifically about GPT-3, however. It’s about the issue of how massive these large language models are and how we can prune these models to compress them.
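
To make the idea of pruning concrete, here is a minimal sketch of unstructured magnitude pruning: zeroing out the weights with the smallest absolute values. This is a deliberately simple stand-in and not the SparseGPT method itself, which is considerably more sophisticated. The function name `magnitude_prune` and the toy weight matrix are assumptions made for illustration.

```python
def magnitude_prune(weights, sparsity):
    """Return a copy of `weights` with the smallest-magnitude entries set to zero.

    weights:  a weight matrix as a list of rows (lists of floats)
    sparsity: fraction of entries to zero out, between 0 and 1
    """
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be between 0 and 1")

    # Rank every (row, column) position by the absolute value of its weight.
    positions = [(i, j) for i, row in enumerate(weights) for j in range(len(row))]
    positions.sort(key=lambda ij: abs(weights[ij[0]][ij[1]]))

    # Zero out the lowest-ranked fraction; the input matrix is left untouched.
    n_prune = int(len(positions) * sparsity)
    pruned = [list(row) for row in weights]
    for i, j in positions[:n_prune]:
        pruned[i][j] = 0.0
    return pruned


# Toy example (hypothetical values): prune half of a 2x3 layer.
layer = [[0.8, -0.05, 0.3],
         [0.01, -0.9, 0.2]]
pruned_layer = magnitude_prune(layer, sparsity=0.5)
print(pruned_layer)  # [[0.8, 0.0, 0.3], [0.0, -0.9, 0.0]]
```

A matrix with many zeros can be stored and multiplied more cheaply, which is the sense in which pruning compresses a model. How much accuracy survives depends on the pruning method and the model.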

Introduction to Machine Learning

Added on January 31, 2023 by Jon Krohn.

After a multi-year hiatus, Hadelin and Kirill — the most popular data science instructors on Udemy, with 2+ million students — have released a new ML course. In this episode, they introduce what ML is from scratch.

Kirill Eremenko:
• Is Founder and CEO of SuperDataScience, an e-learning platform.
• Founded the SuperDataScience Podcast in 2016 and hosted the show until he passed me the reins two years ago.

Hadelin de Ponteves:
• Was a data engineer at Google before becoming a content creator.
• In 2020, took a break from Data Science content to produce and star in a Bollywood film featuring "Miss Universe" Harnaaz Sandhu.

Together, Kirill and Hadelin:
• Have created dozens of data science courses.
• Are the most popular data science instructors on the Udemy platform, with over two million students.
• Recently published, after a multi-year hiatus from creating courses, a new course called “Machine Learning in Python: Level 1”.

This episode serves as an introduction to machine learning so will primarily appeal to folks who aren’t already expert at ML — that said, I’ve been doing ML for over 15 years and still learned a few critical new pieces of information during filming so this episode could serve as a fun, light-hearted refresher for experts.

In this episode, Kirill and Hadelin introduce ML concepts such as:
• Supervised vs unsupervised learning
• Classification errors
• Logistic regression
• Feature scaling
• The Adjusted R-Squared metric
• The assumptions of linear regression
• The Elbow Method
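
As a taste of one of these concepts, here is a minimal sketch of the Adjusted R-Squared metric using its standard textbook definition, 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features. The function names and the toy data are assumptions for illustration; the episode may present the metric differently.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Penalize R-squared for the number of features used by the model."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features plus one")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Toy example (hypothetical values): five observations, one feature.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.1, 7.2, 8.9, 11.0]

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_samples=len(y_true), n_features=1)
print(round(r2, 4), round(adj_r2, 4))
```

Unlike plain R-squared, which never decreases when a feature is added, the adjusted version only rises if the new feature improves the fit by more than the penalty for the extra parameter.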

VALL-E: Uncannily Realistic Voice Imitation from a 3-Second Clip

Added on January 27, 2023 by Jon Krohn.

Text-to-speech models take in text as an input (e.g., a sentence that you type out) and then output an audio waveform that sounds like a human reading out the sentence you provided as an input. TTS systems like this have been around for decades, but until the past few years the quality of the audio was not compellingly human-like…

Is Data Science Still Sexy?

Added on January 24, 2023 by Jon Krohn.

Had far too much fun filming today's episode with Prof. Tom Davenport, many-time author of bestselling books on analytics and coiner of data science as "sexiest job of the century". A decade on, does he still think so?

Tom:
• Has published over 20 books, such as the bestselling "Competing on Analytics", "The A.I. Advantage", and "Analytics at Work".
• Has penned 300+ articles in publications like the Harvard Business Review and writes regular columns for Forbes and The Wall Street Journal.
• Is President's Distinguished Professor of IT and Management at Babson College.
• Is Visiting Professor at the Saïd Business School, University of Oxford.
• Is Senior Advisor to the A.I. practice for the global professional services giant Deloitte.
• With nearly 300k followers, he’s recognized as a LinkedIn Top Voice.

Today’s episode is well-suited to technical and non-technical listeners alike. Every part of it should be appealing to anyone who’s keen to hear about the leading edge of commercial applications of A.I.

In this episode, Prof. Davenport details:
• The discrete A.I. maturity levels of organizations.
• How organizations become A.I.-fueled.
• Which jobs are susceptible to replacement by A.I.
• Which jobs are ripe for augmenting with A.I.
• What roles other than data scientist are required to deploy effective machine learning models.
• What the future of data science will look like and, having coined data science as “the sexiest job of the 21st century” a decade ago, whether he still thinks it is today.
