How to be both socially impactful and financially successful in your data career

Added on March 28, 2023 by Jon Krohn.

Josh Wills, the real-life data science superhero — decarbonizer of the transport system, full-time modeler of the Covid pandemic, and force of nature at fast-growing tech firms — is my extraordinary guest today.

Josh has done a startling amount in his career:
• Worked to decarbonize transport as a software engineer at WeaveGrid.
• Modeled Covid-19 full-time for the Government of California in early 2020 as the pandemic was first kicking off.
• Was the first Director of Data Engineering at Slack.
• Was Director of Data Science at Cloudera.
• Was Staff Software Engineer at Google.
• Co-authored several editions of O'Reilly Media books on advanced analytics.
• Has given countless thought-provoking (and very funny!) talks at major data science conferences.
• And now describes himself as a “gainfully unemployed data person” as he contributes to open-source software projects and develops his “Data Engineering for Machine Learning” course.

Today’s episode will appeal most to technical listeners that are keen to be outstanding data scientists or software engineers, particularly through engineering scalable ML products.

However, much of the content in the episode will appeal to anyone who’d like to hear from a brilliant, thoughtful, and seasoned professional who goes into depth on:
• The orders-of-magnitude more efficient “contextual bandit” approach to testing models in production.
• How to avoid the “infinite loop of sadness” in data product development.
• The pros and cons of choosing a management-track career path relative to an individual contributor path.
• What it’s like to be called on as a life-saving data-science superhero during a catastrophic global event.

The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.

MIT Study: ChatGPT Dramatically Increases Productivity

Added on March 24, 2023 by Jon Krohn.

With all of this ChatGPT and GPT-4 news, I was wondering whether these generative A.I. tools actually deliver the productivity gains everyone supposes they do. Well, wonder no more…

Astonishing CICERO negotiates and builds trust with humans using natural language

Added on March 21, 2023 by Jon Krohn.

Meta AI's CICERO algorithm, which negotiates and builds trust with humans to perform in the top decile at the game of Diplomacy, is (in my view) the most astounding A.I. feat yet. Hear all about it from Alexander.

As published in the prestigious academic journal Science in November, CICERO is capable of using natural-language conversation to coordinate with humans, develop strategic alliances, and ultimately win in Diplomacy, an extremely complex board game.

Excelling in a game with incomplete information and vastly more possible states of play than games previously conquered by A.I., like chess and Go, would be a wild feat in and of itself, but CICERO’s generative capacity to converse and negotiate in real time with six other human players in order to strategize victoriously is the truly mind-boggling capability.

To detail for you how the game of Diplomacy works, why Meta chose to tackle this game with A.I., and how they developed a model that competes in the top decile of human Diplomacy players without any other players even catching a whiff that CICERO could possibly be a machine, my guest in today's episode is Alexander Holden Miller, a co-author of the CICERO paper.

Alex:
• Has been working in Meta AI’s Fundamental AI Research group, FAIR, for nearly eight years.
• Currently serves as a Senior Research Engineering Manager within FAIR.
• Has supported researchers working in most ML sub-domains but has been especially involved in conversational A.I. research and more recently reinforcement learning and planning.
• Holds a degree in Computer Science from Cornell University.

The Most Popular SuperDataScience Podcast Episodes of 2022

Added on March 17, 2023 by Jon Krohn.

I put together today’s episode to fill you in on the most listened-to episodes of 2022, giving you a data-backed set of outstanding episodes that you might want to go back and check out if you’re hankering for more content. For veteran listeners, this episode could be informative too: It’ll ensure that you didn’t miss any of the most popular episodes from last year that sound interesting to you.

Designing Machine Learning Systems

Added on March 14, 2023 by Jon Krohn.

Mega-bestselling author of the "Designing ML Systems" book, Chip Huyen, joined me to cover her top tips on, well, designing ML systems! ...as well as her burgeoning real-time ML startup. Can you tell we had a ton of fun?

Chip:
• Is Co-Founder of Claypot AI, a platform for real-time machine learning.
• Authored the book “Designing Machine Learning Systems”, which was published by O'Reilly Media and based on the Stanford University course she created and taught on the same topic.
• Also created and taught Stanford's “TensorFlow for Deep Learning” course.
• Previously worked as ML Engineer at data-centric development platform Snorkel AI and as a Senior Deep Learning Engineer at the chip giant NVIDIA.
• Runs an MLOps community on Discord with over 14k members.
• Her helpful posts have earned her over 160k followers on LinkedIn.

Today’s episode will probably appeal most to technical listeners like data scientists and ML engineers, but anyone involved in (or thinking of being involved in) the deployment of ML into real-life systems will learn a ton.

In this episode, Chip details:
• Her top tips for designing production-ready ML applications.
• Why iteration is key to successfully deploying ML models.
• What real-time ML is and the kinds of applications it’s critical for.
• Why Large Language Models like ChatGPT and other GPT series architectures involve limited data science ingenuity but do involve enormous ML engineering challenges.

Five Ways to Use ChatGPT for Data Science

Added on March 10, 2023 by Jon Krohn.

Back in Episode #646, we focused on how anyone can extract commercial value from ChatGPT today, whether ye be a technical data science practitioner or not. Today’s episode is exclusively the technical practitioners’ turn: I’ve got five specific ways that ChatGPT can be used for data science.

Use case #1 is code generation. ChatGPT was designed primarily as a tool for generating natural language, whereas OpenAI’s Codex algorithm was designed explicitly for generating code (you can hear all about it in Episode #584). Nevertheless, the friendly, conversational UI of ChatGPT comes in handy for rapidly generating code, and it can do so in all of the primary software languages for data science, including Python, R, and SQL. ChatGPT’s code is not always going to be perfect, but for quick ideas on how you could be extracting features from your data, implementing an algorithm, or creating a data visualization, ChatGPT is a great tool for getting started.

Use case #2 is translating code between programming languages. Not only can ChatGPT convert your natural-language input into code, it can also translate between programming languages. So if, for example, you are an expert in Python but unfamiliar with an R code snippet you found online that you’d like to understand and implement in Python, you could ask ChatGPT to convert the R code into Python for you. Because ChatGPT has training data from many different programming languages, you can now convert perhaps any unfamiliar code you come across into a familiar target programming language of your choice.

Use case #3 is code troubleshooting. Not only can ChatGPT help you generate code, it can also explain errors that you’re coming across and suggest how to fix them. You can even ask ChatGPT to rewrite your code for you so that it’s bug-free.

Use case #4 is providing library suggestions. In Python or R, there are countless open-source libraries of code available to you. With ChatGPT, you can now quickly identify which library or libraries are best-suited to a particular task you’d like to perform with your code.

Finally, use case #5 is article summarization. A seemingly endless number of fascinating articles on machine learning innovations are published on arXiv each week. Poring over each of the articles that interests you is likely to be impossible, but with ChatGPT you can instantly have articles summarized and key information extracted, making it much easier for you to stay on top of the latest data science developments.

Open-Source Tools for Natural Language Processing

Added on March 7, 2023 by Jon Krohn.

In today's episode, the brilliant Vincent Warmerdam regales us with invaluable ideas and open-source software libraries for developing A.I. (particularly Natural Language Processing) applications. Enjoy!

Vincent:
• Is an ML Engineer at Explosion, the German software company that specializes in developer tools for A.I. and NLP such as spaCy and Prodigy.
• Is renowned for several open-source tools of his own, including Doubtlab.
• Is behind an educational platform called Calmcode that has over 600 short and conspicuously enjoyable video tutorials about software engineering concepts.
• Was Co-Founder and Chair of PyData Amsterdam.
• Has delivered countless amusing and insightful PyData talks.
• Holds a Master’s in Econometrics and Operations Research from Vrije Universiteit Amsterdam (VU Amsterdam).

Today’s episode will appeal primarily to technical listeners as it focuses on ideas and open-source software libraries that are indispensable for data scientists, particularly those developing A.I. or NLP applications.

In this episode, Vincent details:
• The prompt recipes he developed to enable OpenAI GPT architectures to perform tremendously helpful NLP tasks such as data labeling.
• The super-popular open-source libraries he’s developed on his own as well as with Explosion.
• The software tools he uses daily including several invaluable open-source packages made by other folks.
• How linguistics and operations research are extremely useful fields for becoming a better NLP practitioner and ML practitioner, respectively.

SuperDataScience Podcast Audience Growth

Added on March 6, 2023 by Jon Krohn.

Since I started hosting the SuperDataScience Podcast in Q1 of 2021, our audience has quadrupled, with episode downloads (plus YouTube views) now approaching one million per quarter. Thank you for listening!

I'm only a small part of the team required to release the high-quality episodes we do 104 times every year. The world-class people making the machine hum along behind the scenes are:

• Ivana Zibert: Podcast Manager
• Natalie Ziajski: Sales, Marketing, and my personal Operations Manager
• Mario Pombo: Audio & Video Production
• Serg Masís: Research
• Dr. Zara Karschay: Writer
• Sylvia Ogweng: Writer
• Kirill Eremenko: Founder, Co-Owner, Former Host

These people all rock and you rock for your support too! Armed with your invaluable ongoing feedback on episodes, I hope I can continue to learn what resonates most with you and that this growth can keep going. It's a great honor to serve you, our wonderful guests, and our episode sponsors.

Hg Capital's "Digital Forum"

Added on March 6, 2023 by Jon Krohn.

At Hg Capital's "Digital Forum" in London, I delivered a keynote on "Getting Value from A.I." — my slides and the slickly-edited video production on YouTube are available now.

With a focus on B2B SaaS applications, over 45 minutes I covered:
1. What Deep Learning A.I. is and How it Works
2. Tasks that are Replaceable with A.I. vs Tasks that can be Augmented
3. How to Effectively Implement A.I. Research into Production

The audience engagement was terrific and the on-stage Q&A carried on afterward for an energizing 30 additional minutes. It felt like we could have kept on going much longer!

How to Build Data and ML Products Users Love

Added on March 3, 2023 by Jon Krohn.

What makes people latch onto data products and come back for more? In today's episode, Brian T. O'Neill unveils the processes and teams that make data and A.I. products engaging and sticky for users.

Brian:
• Founded and runs Designing for Analytics, a consultancy that specializes in designing analytics and ML products so that they are adopted.
• Hosts the "Experiencing Data" podcast, an entertaining show that covers how to use product-development methodologies and UX design to drive meaningful user and business outcomes with data.

In today's episode, Brian details:
• What data product management is.
• Why so many data projects fail.
• How to develop machine learning-powered products that users love.
• The teams and skill sets required to develop successful data products.

NLP with ChatGPT (and other LLMs)

Added on March 2, 2023 by Jon Krohn.

Over 1400 people registered for yesterday's "NLP with ChatGPT (and other LLMs)" conference that I hosted on the O'Reilly Media platform. Kudos to speakers Sinan, Melanie and Shaan for making it a smashing success 🎉

This screenshot is a taste of what it looked like from inside the broadcasting platform, captained flawlessly by producers Joan Baker and Nurul Ishak, PMP.

The presenters each spent 30 minutes presenting on their topics and then engaged in riveting Q&A with the highly engaged attendees:
• Sinan Ozdemir: The A.I. entrepreneur and author introduced the theory behind Transformer Architectures and LLMs like BERT, T5, and GPT.
• Melanie Subbiah: A first author on the original GPT-3 paper, she led interactive demos of the broad range of capabilities of LLMs like ChatGPT.
• Shaan Khosla: A data scientist on my team at Nebula.io, he detailed practical tips on training, validating, and productionizing LLMs hands-on in Python.

I've heard word that, unusually for a live event in O'Reilly, the footage of this conference will be made available as a video within the platform. Stay tuned for details!

Getting Value From A.I.

Added on March 1, 2023 by Jon Krohn.

My keynote on "Getting Value from A.I." — which I delivered at Hg Capital's "Digital Forum" in London — is now live on YouTube!

With a focus on B2B SaaS applications, over 45 minutes I covered:
1. What Deep Learning A.I. is and How it Works
2. Tasks that are Replaceable with A.I. vs Tasks that can be Augmented
3. How to Effectively Implement A.I. Research into Production

The audience engagement was terrific and the on-stage Q&A carried on afterward for an energizing 20 additional minutes. All of this is captured in the slickly-edited video production.

How to Learn Data Engineering

Added on February 28, 2023 by Jon Krohn.

As data sets continue to grow exponentially, Data Engineering skills become increasingly essential — standalone or as part of Data Scientists' expertise. In today's episode, Andreas Kretz details how to Learn Data Engineering.

Andreas:
• Is the Founder of Learn Data Engineering, a platform through which he’s taught over a thousand students the theory and practice of data engineering.
• Has provided countless more folks with data engineering tips and tricks through his YouTube channel, which has over 10,000 subscribers.
• Worked for ten years at the German industrial giant Bosch, including as a data engineering team lead and data lab team lead.
• Holds a Computer Science degree from the Technical University of Applied Sciences Würzburg-Schweinfurt (THWS).
• With over 100,000 followers on LinkedIn, has twice been recognized as a Top Voice for Data Science and Analytics on the platform.

Today’s episode will appeal primarily to technical listeners, particularly data scientists who are keen to develop ever-more-critical data engineering skills.

In this episode, Andreas details:
• What data engineering is and how it relates to adjacent fields like data science, software engineering, and machine learning engineering.
• Why data engineering skills become increasingly essential to data scientists and data analysts with each passing year.
• What sets Senior Data Engineers apart from junior ones.
• His general process for tackling data engineering problems.
• The must-know data-engineering tools of today as well as the emerging ones you shouldn’t miss.

A.I. Talent and the Red-Hot A.I. Skills

Added on February 24, 2023 by Jon Krohn.

What skills and traits do the best A.I. talent have? And how do you attract the best A.I. talent to your firm? Jaclyn Rice Nelson of Tribe AI, the world's most prestigious ML collective, fills us in in today's episode.

Jaclyn:
• Is Co-Founder/CEO of Tribe AI, a "collective" of ML engineers and data scientists who drop into companies to accelerate their A.I. capabilities.
• Previously worked in senior roles at Google and CapitalG, Alphabet's growth equity fund.

In today's episode, she details:
• What characterizes the very best A.I. talent.
• What skills you should learn today to be tomorrow’s top A.I. talent.
• How to attract the top engineers and data scientists to your firm.
• The specific category of A.I. project that her clients are suddenly demanding tons of help with.

Digital Conference: NLP with ChatGPT (and other Large Language Models)

Added on February 22, 2023 by Jon Krohn.

Large Language Models have revolutionized the field of Natural Language Processing, powering mind-blowing tools such as ChatGPT and DALL-E. On March 1, we'll cover these exciting models from development to deployment.

The half-day online conference is brought to you by Pearson and will take place within the O'Reilly Media platform on March 1st from noon to 3pm ET.

The presenters are at the absolute vanguard on their topics:
• Sinan Ozdemir: The A.I. entrepreneur and author introduces the theory behind Transformer Architectures and LLMs like BERT, GPT, and T5.
• Melanie Subbiah: A first author on the original GPT-3 paper, she'll lead interactive demos of the broad range of LLM capabilities.
• Shaan Khosla: A data scientist on my team at Nebula.io, he'll detail practical tips on training, validating, and productionizing LLMs.

If you don't have access to the O'Reilly online platform through your employer or school, you can use my special code "SDSPOD23" to get a 30-day trial and enjoy the conference for free!

AI ROI: How to get a profitable return on an AI-project investment

Added on February 21, 2023 by Jon Krohn.

Relative to other kinds of software R&D, A.I. projects are typically expensive. Today's sage guest, Keith McCormick, details approaches for ensuring that A.I. projects not only are transparent, but that they are profitable too.

Keith:
• Is Executive Data Scientist in Residence at Pandata LLC, a consulting firm focused on transparent, human-centered A.I.
• Is a predictive analytics instructor at UC Irvine.
• Has created 20 LinkedIn Learning courses on machine learning and A.I. with, in aggregate, hundreds of thousands of students.
• Authored four books, recurringly focused on statistics with SPSS Modeler.

Today’s episode should appeal to anyone who’s eager to get a return on an investment in an A.I. project, no matter whether you have a technical or non-technical background.

In today’s episode, Keith details:
• His straightforward approach to ensuring that A.I. projects are successful.
• How A.I. projects need to be set up and managed in order to get a profitable return on the project.
• The corporate roles that need to be in place in order for a data science team to complete projects that drive value.
• What A.I. transparency is and how it relates to the field of Explainable A.I.
• How data scientists who have advanced software-writing skills could benefit from the use of low-code/no-code tools.

Mike Wimmer: The 14-Year-Old A.I. Entrepreneur

Added on February 17, 2023 by Jon Krohn.

His CS degree (with a perfect GPA!) completed, 14-year-old sensation Mike Wimmer is now turning his attention to socially-impactful A.I. projects (such as conserving the world's coral reefs) and his tech startups.

Meet the (very funny!) teenager in today's episode and hear about:
• How he got started in A.I. at such a young age.
• How he's using A.I. to detect invasive species with remote-operated vehicles.
• His preferred software stack for building machine learning applications.
• His vision for an automated future with tons of people inspired to create socially-impactful solutions.

Efficiently Glean-ing Insights from Vast Data Warehouses

Added on February 14, 2023 by Jon Krohn.

In today's episode, the super-sharp and super-technical founder Carlos Aguilar details how he's built a platform to democratize data analytics and efficiently Glean insights from across vast data warehouses 📊

Carlos:
• Is Founder and CEO of Glean, a New York-based start-up that has raised $7m in venture capital to democratize data analytics and data insights.
• Spent six years as the VP of Data Insights Engineering at Flatiron Health, a cancer data platform that was acquired by Swiss pharma giant Roche.
• Spent five years engineering robots at Amazon.
• Holds a Master’s in mechanical and aerospace engineering from Cornell University.

Today’s episode includes some technical content near the beginning that will appeal most to data science and software engineering practitioners, but we do break down the main takeaways from those technical discussions so that any interested listener can partake. The remainder of the episode will appeal to anyone who’s keen to hear from an extremely intelligent technical founder on how to successfully launch a data-centric start-up.

In this episode, Carlos details:
• The software stack Glean built their platform with to enable their customers to quickly obtain slick data visualizations and insights from their vast data warehouses.
• How to grow an entrepreneurial idea into an investable company.
• The essential characteristics to look for in your founding team.
• How to ensure your early clients are continuously delighted by your evolving software platform.
• How he used genetic algorithms to enable a robotic arm to paint beautiful, creative art onto real-world canvases 🎨

A.I. Speech for the Speechless

Added on February 10, 2023 by Jon Krohn.

Thanks to a new lip-reading A.I., non-verbal medical patients can now "speak" to their clinicians and loved ones…

The Intentional Use of Color in Data Communication

Added on February 7, 2023 by Jon Krohn.

In today's episode, Kate Strachnyi, author of the new book ColorWise, opened my eyes to the vastly underutilized power of the intentional use of color in data visualizations. Now you can harness its power too!

Kate:
• Is a multi-time data science book author.
• Her latest book, ColorWise, was published by O'Reilly Media: It’s a beautiful, comprehensive guide to the effective use of color when communicating data visually.
• Founded the DATAcated Circle, a community of data professionals committed to engaging and learning together.
• Is a megastar on LinkedIn where she has over 170k followers and was twice recognized as a LinkedIn Top Voice for Data Science & Analytics.
• Is big into long-distance running; her longest to date was a 50-mile (!!!) ultramarathon in New York’s Central Park.

Today’s episode should appeal to technical and non-technical folks alike because I suspect that pretty well any listener of this show presents data and could benefit from learning how to do so more effectively with the intentional use of color.

In today’s episode, Kate details:
• Why the intentional use of color matters.
• What thought process you should follow to select a color scheme for a visualization.
• Special considerations for color choice, such as accessibility, cultural understanding, and human psychology.
• How to effectively use multiple visualizations together in a document, presentation or dashboard.
• Her favorite data viz tools.

Want a free digital copy of ColorWise? The first five people to comment on this post that they want one, get one! Thanks to O'Reilly's Suzanne Huston for offering this to our listeners.

If you miss out on one of the five copies, you can use my special code "SDSPOD23" to get a free 30-day trial of the O'Reilly platform and read the book there.
