To cap off 2022, as I did for 2021, I’m covering the five big lessons that I learned over the course of the year.
Simplifying Machine Learning
Today, Mariya Sha — host of the wildly popular "Python Simplified" YouTube channel (140k subscribers!) — taps her breadth of A.I. expertise to provide a fun and fascinating finale to SuperDataScience guest episodes for 2022.
Mariya:
• Is the mind behind the "Python Simplified" YouTube channel that makes advanced concepts (e.g., ML, neural nets) simple to understand.
• Creates videos that cover Python-related topics as diverse as data science, web scraping, automation, deep learning, GUI development, and OOP.
• Is renowned for taking complex concepts such as gradient descent or unsupervised learning and explaining them in a straightforward manner that leverages hands-on, real-life examples.
• Is pursuing a bachelor's in Computer Science (with a specialization in A.I. and Machine Learning) from the University of London.
Today’s episode should appeal to anyone who’s interested in or involved with data science, machine learning, or A.I.
In this episode, Mariya details:
• How the incredible potential of ML in our lifetimes inspired her to shift her focus from web-development languages like JavaScript to Python.
• Why automation and web scraping are critical skills for data scientists.
• How to make learning any apparently complex data science concept straightforward to comprehend.
• Her favorite Python libraries and software tools.
• One rarely-mentioned topic that every data scientist would benefit from.
• The pros and cons of pursuing a 100% remote degree in computer science.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
ChatGPT Holiday Greeting
Welcome to the SuperDataScience Podcast! Last year, for a Five-Minute Friday at this time of year, I provided a holiday greeting. This year, the sensational natural-language generating conversational algorithm ChatGPT scripted the holiday greeting for me.
How to Influence Others with Your Data
If you ever use data to make decisions or to persuade those around you to make data-driven decisions, today’s episode is jam-packed with relevant, practical tips from data presentation guru Ann K. Emery.
Ann:
• Is an internationally-acclaimed speaker who delivers 100+ keynotes, workshops, and webinars each year to enable people to share data-driven insights more effectively.
• Has consulted on data visualization, data reporting, and data presentation with over 200 organizations, including the United Nations, the US Centers for Disease Control, and Harvard University.
• Holds a BA in Psychology and Spanish from the University of Virginia and a Master's in Educational Psychology Evaluation, Assessment, and Testing from George Mason University.
I rarely say that everyone should listen to an episode, but this is one of those rare cases.
In this episode, Ann details:
• What data storytelling is.
• Best practices for data visualization.
• Surprising tricks you can pull off with spreadsheet software.
• How to report on data effectively.
• Her top tips for presenting data in a slideshow.
The Equality Machine
Many recent books and articles spread fear about data collection and A.I. Today's guest, Prof. Orly Lobel, offers the antidote with her book "The Equality Machine" — an optimistic take on the future of data science.
The Perils of Manually Labeling Data for Machine Learning Models
The "gold standard" in Machine Learning is to train models with manually labeled data. Shayan Mohanty details why that encodes bias in our models and provides a "weakly-supervised" solution.
Shayan:
• Is the CEO of Watchful, a Bay Area startup he co-founded to automate the injection of subject-matter-expertise into ML models.
• Is Guest Scientist at Los Alamos, a renowned national security lab.
• Previously worked as a data engineer at Facebook.
• Was co-founder and CEO of a pair of other tech startups.
• Holds a degree in economics from The University of Texas at Austin.
Today’s episode will be of interest to technical data science experts and non-technical folks alike, as it addresses critical issues associated with creating datasets for machine learning models. These are issues we should be aware of regardless of whether we’re more technically or commercially oriented.
In this episode, Shayan details:
• Why bias in general is good.
• Why degenerative bias in particular is bad.
• Arguments against using manual labeling.
• How his company Watchful has devised a better alternative to manual labeling — including its fascinating technical underpinnings such as the Chomsky hierarchy of languages and their high-performance Monte Carlo simulation engine.
Model Error Analysis
I'm extremely fortunate to have force-of-nature Serg Masís as our researcher behind the scenes on SuperDataScience. For today's episode, he's in front of the camera detailing how error analysis can improve your model!
Responsible Decentralized Intelligence
The eminent Prof. Dawn Song joins me on the keynote stage of the Open Data Science Conference (ODSC) West in San Fran for an exceptionally deep, live episode on Responsible Decentralized Intelligence.
Dawn:
• Leads trailblazing research at the intersection of deep learning A.I. and decentralized systems like the blockchain.
• Has been Professor in the Computer Science Division of the University of California, Berkeley for 15 years.
• Is Founder of Oasis Labs, a data privacy startup.
• Co-directs the Berkeley Center on Responsible Decentralized Intelligence.
• Is part of the illustrious Berkeley AI Research (BAIR) Lab.
• Has authored 300+ papers that have been cited over 80,000 times!
• Has won countless major awards including a MacArthur Fellowship ("genius grant").
Today’s episode is a deeply technical one that will appeal primarily to practitioners like data scientists, but it does have take-away points that will allow any interested listener to get up to speed on the massive emerging potential of decentralized intelligence.
In this episode, Prof. Song details:
• What decentralized intelligence is and how it relates machine learning (particularly deep learning) to other emerging technologies like the blockchain, differential privacy, federated learning, and homomorphic encryption.
• What a “Responsible Data Economy” would look like, with specific real-world examples from applications of her research to industry.
• Specific resources that she has developed to allow data scientists and software developers to easily develop and deploy privacy-preserving machine learning applications.
Liquid Neural Networks
Liquid Neural Networks are a new, biology-inspired deep learning approach that could be transformative. I think they're super cool and Adrian Kosowski, PhD introduced them to me for today's Five-Minute Friday episode.
Data Analytics Career Orientation
Considering a Data Analytics career? Today's episode with YouTube icon Luke Barousse (273k subscribers) will be particularly appealing to you, but the terrifically interesting guest makes for an episode that anyone will love.
Luke:
• Is a full-time YouTuber, creating highly educational — but nevertheless hilarious — videos focused on Data Analytics.
• Previously worked as a Lead Data Analyst and Data Engineer at BASF.
• Worked for seven years in the US Navy on nuclear-powered submarines.
• Holds a degree in mechanical engineering, a graduate qualification in nuclear engineering, and an MBA in business analytics.
In this episode, Luke details:
• The must-have skills for entry-level data analyst roles.
• The data analyst skills mistakenly pursued by many folks considering the career.
• How his submariner experience prepared him well for a data career.
• His favorite tools for creating interactive data dashboards.
• His favorite scraping libraries for collecting data from the web.
• The skills to learn now to be prepared for the data careers of the future.
• The benefits of CrossFit beyond just the fitness improvements.
Resilient Machine Learning
Machine learning is often fragile in production. For today's Five-Minute Friday episode, Dr. Dan Shiebler details how we can make ML more resilient.
Software for Efficient Data Science
In today's episode, Dr. Jodie Burchell details a broad range of tools for working efficiently with data, including data cleaning, reproducibility, visualization, and natural language processing.
Jodie:
• Is the Data Science Developer Advocate for JetBrains, the developer-tools company behind PyCharm (one of the most widely-used Python IDEs) and DataLore (their new cloud platform for collaborative data science).
• Previously was Data Scientist or Lead Data Scientist at several tech companies, developing specializations in search, recommender systems, and NLP.
• Co-authored two books on data visualization libraries: "The Hitchhiker's Guide to ggplot2" and "The Hitchhiker's Guide to Plotnine".
• Prior to entering industry, was a postdoctoral fellow in biostatistics at the University of Melbourne.
• Holds a PhD in Psychology from the Australian National University.
Today’s episode is primarily intended for a technical audience as it's packed with practical tips and software for data scientists.
In this episode, Jodie details:
• What a data science developer advocate is and why you might want to consider it as a career option.
• How to work effectively, efficiently, and confidently with real-world data.
• Her favorite Python libraries, such as ones for data viz and NLP.
• How to have reproducible data science workflows.
• The subject she would have majored in if she could go back in time.
The Critical Human Element of Successful A.I. Deployments
For today's episode, I sat down with the prolific data-science instructor, author and practitioner Keith McCormick to discuss how critical user considerations are for developing a successful A.I. application.
AutoML: Automated Machine Learning
AutoML with Erin LeDell — it rhymes! In today's episode, H2O.ai's Chief ML Scientist guides us through what Automated Machine Learning is and why it's an advantageous technique for data scientists to adopt.
Dr. LeDell:
• Has been working at H2O.ai — the cloud A.I. firm that has raised over $250m in venture capital and is renowned for its open-source AutoML library — for eight years.
• Founded Women in Machine Learning & Data Science (WiMLDS), which has 100+ chapters worldwide.
• Co-founded R-Ladies Global, a community for genders currently underrepresented amongst R users.
• Is celebrated for her talks at leading A.I. conferences.
• Previously was Principal Data Scientist at two acquired A.I. startups.
• Holds a Ph.D. from UC Berkeley focused on ML and computational stats.
Today’s episode is relatively technical, so it will primarily appeal to technical listeners, but it also provides context for anyone who’s interested in understanding how key aspects of data science work are becoming increasingly automated.
In this episode, Erin details:
• What AutoML — automated machine learning — is and why it’s an advantageous technique for data scientists to adopt.
• How the open-source H2O AutoML platform works.
• What the “No Free Lunch Theorem” is.
• What Admissible Machine Learning is and how it can reduce the biases present in many data science models.
• The new software tools she’s most excited about.
• How data scientists can prepare for the increasingly automated data science field of the future.
Subword Tokenization with Byte-Pair Encoding
When working with written natural language data as we do with many natural language processing models, a step we typically carry out while preprocessing the data is tokenization. In a nutshell, tokenization is the conversion of a long string of characters into smaller units that we call tokens.
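To make the idea concrete, here is a minimal sketch of how byte-pair encoding can learn subword tokens: start from individual characters and repeatedly merge the most frequent adjacent pair. This is a simplified illustration under stated assumptions, not the episode's code or any particular library's implementation. The function names (`pair_counts`, `merge_pair`, `learn_bpe`, `tokenize`) and the toy corpus are hypothetical, and production tokenizers typically add details omitted here, such as end-of-word markers or byte-level handling.

```python
from collections import Counter


def pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Each word starts as a tuple of single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break
        # Ties go to the pair seen first (a simplification).
        best = max(counts, key=counts.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in order, to a new word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)


merges = learn_bpe("low low low lower lowest", num_merges=2)
print(merges)                     # [('l', 'o'), ('lo', 'w')]
print(tokenize("lowly", merges))  # ['low', 'l', 'y']
```

In this toy run, the frequent sequence "low" becomes a single token, while the rarer remainder of an unseen word like "lowly" falls back to individual characters. That fallback is what lets subword tokenizers handle words they never saw during training.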
Analyzing Blockchain Data and Cryptocurrencies
As real-time, publicly-available ledgers of transactions, blockchains provide exciting new data analytics opportunities. Kimberly Grauer leads us through the tools and approaches for blockchain analytics.
Kim:
• Is Director of Research at Chainalysis Inc., the world’s leading crypto analytics firm.
• Previously worked in an economic research and analysis group for NYC.
• Holds a Master's in Political Theory from the University of Oxford and a Master of Public Administration from the London School of Economics, and completed the General Assembly Data Science bootcamp.
Today’s episode will appeal primarily to folks who are interested in blockchains and cryptocurrencies, particularly those keen to perform data analysis on blockchain data.
In this episode, Kim details:
• The unique real-time economic-data analytics opportunities that blockchains provide.
• Examples of her own research on blockchain data, such as analyses of illegal activity and global crypto adoption.
• The tools and approaches she uses daily to analyze and report on blockchain data.
• Where the evolutions of crypto, blockchains, and data science are going together.
• Why a data science bootcamp could be exactly the right thing for you if you’re looking to break into the field.
Imagen Video: Incredible Text-to-Video Generation
For today’s Five-Minute Friday episode, it’s my pleasure to introduce you to the Imagen Video model, published just a few weeks ago by researchers from Google.
Data Analyst, Data Scientist, and Data Engineer Career Paths
Keen to become a Data Analyst? Get promoted to Sr Data Analyst? Or explore Data Engineer/Scientist options? Shashank, a YouTube expert on these questions (>100k subscribers!) tackles them in today's episode.
Shashank:
• Has an exceptional YouTube channel focused on helping people break into a data analyst career.
• Works as a Senior Data Engineer at digital sports platform Fanatics, Inc.
• Was previously Data Analyst at luxury retailer Nordstrom and other firms.
• Holds a degree in chemistry from Emory University in Atlanta.
Today’s episode will appeal primarily to folks who are interested in becoming a data analyst, or who are interested in transitioning from a data analyst role into a data science or data engineering role.
In this episode, Shashank details:
• How you can land an entry-level data analyst role in just a few weeks, regardless of your educational and professional background.
• The hard and soft skills you need to progress from a junior data analyst to a senior data analyst position.
• What it takes to transition from data analyst to a typically more lucrative role as a data scientist or data engineer.
• His favorite resources for learning the essential skills for data scientists.
• What he looks for when he’s interviewing candidates.
Burnout: Causes and Solutions
What really is Burnout? What causes it? And how can you prevent or treat it? Prof. Christina Maslach — world-leading researcher and author on Burnout — joins me for today's episode to unpack these questions.
Blockchains and Cryptocurrencies: Analytics and Data Applications
Today's episode introduces what Blockchains are, what Crypto is, and Data Science applications of these technologies. Philip Gradwell of globally-renowned Chainalysis Inc. is our brilliant guide.
Philip:
• Is Chief Economist at Chainalysis, the world’s leading crypto analytics firm — their analysis is regularly featured by major news outlets.
• Previously worked as Principal at Vivid Economics, where he helped grow the consulting firm to 40 people, eventually culminating in its acquisition by consulting giant McKinsey & Company.
• Holds a Master’s in Economics from UCL and a PPE degree — that’s Philosophy, Politics, and Economics — from the University of Oxford.
Today’s episode will appeal to anyone looking for an introduction to the blockchain and cryptocurrencies. It’ll hold special appeal for people keen to do data science with these technologies.
In this episode, Philip details:
• Similarities and differences between analyzing cryptocurrencies and the established fiat currencies.
• His crypto data analytics pipeline.
• How he develops data products for a wide range of users, including businesses, banks, governments, and law enforcement.
• How the blockchain facilitates innovative computing and machine learning technologies.
• What he looks for in the data scientists he hires.