Meta's Llama 2 offered state-of-the-art performance for an "open-source" LLM... except on tasks involving code. Now Code Llama is here and it magnificently fills that gap by outperforming all other open-source LLMs on coding benchmarks.
Image, Video and 3D-Model Generation from Natural Language, with Dr. Ajay Jain
Today, brilliant ML researcher Ajay Jain, PhD, explains how a full-length feature film could be created using Stable-Diffusion-style generative A.I. — these models can now output stunning 3D models and compelling video clips.
Ajay:
• Is a Co-Founder of Genmo AI, a platform for using natural language to generate stunning state-of-the-art images, videos and 3D models.
• Prior to Genmo, he worked as a researcher on the Google Brain team in California, in the Uber Advanced Technologies Group in Toronto and on the Applied Machine Learning team at Facebook.
• Holds a degree in Computer Science and Engineering from MIT and did his PhD within the world-class Berkeley A.I. Research (BAIR) lab, where he specialized in deep generative models.
• Has published highly influential papers at all of the most prestigious ML conferences, including NeurIPS, ICML and CVPR.
Today’s episode is on the technical side, so it will likely appeal primarily to hands-on practitioners, but we did our best to explain concepts so that anyone who’d like to understand the state of the art in image, video and 3D-model generation can get up to speed.
In this episode, Ajay details:
• How the Creative General Intelligence he’s developing will allow humans to describe anything in natural language and have it generated.
• How feature-length films could be created today using generative A.I. alone.
• How the Stable Diffusion approach to text-to-image generation differs from the Generative Adversarial Network approach (a minimal diffusion sketch follows this list).
• How neural nets can represent all the aspects of a visual scene so that the scene can be rendered as desired from any perspective.
• Why a self-driving vehicle forecasting pedestrian behavior requires similar modeling capabilities to text-to-video generation.
• What he looks for in the engineers and researchers he hires.
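To make the text-to-image side concrete, below is a minimal sketch of the diffusion workflow using the open-source Hugging Face diffusers library. This is my own illustration — assuming the public Stable Diffusion v1.5 checkpoint and a CUDA GPU — not Genmo's stack.

```python
# pip install diffusers transformers accelerate torch
import torch
from diffusers import StableDiffusionPipeline

# Load a public Stable Diffusion checkpoint (assumes a CUDA-capable GPU)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The pipeline iteratively denoises random latents, guided by the text prompt —
# the key contrast with a GAN's single-shot generator network
image = pipe("a watercolor painting of a llama reading a book").images[0]
image.save("llama.png")
```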
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
LangChain: Create LLM Applications Easily in Python
Today's episode is a fun intro to the powerful, versatile LLM-development framework LangChain. In it, Kris Ograbek talks us through how to use LangChain to chat with previous episodes of SuperDataScience! 😎
Kris:
• Is a content creator who specializes in creating LLM-based projects — with Python libraries like LangChain and the Hugging Face Transformers library — and then using the projects to teach these LLM techniques.
• Previously worked as a software engineer in Germany.
• Holds a Master’s in Electrical and Electronics Engineering from the Wroclaw University of Science and Technology.
In this episode, Kris details:
• The exceptionally popular LangChain framework for developing LLM applications.
• Specifically, he demonstrates LangChain's power by walking us step-by-step through a chatbot he built that interactively answers questions about episodes of the SuperDataScience podcast (a minimal sketch of this retrieval pattern follows the list).
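To give a flavor of the retrieval pattern Kris walks through, here's a minimal sketch using the 2023-era LangChain API. It's my own illustration with placeholder transcript snippets — not Kris's actual code — and it assumes an OPENAI_API_KEY environment variable is set.

```python
# pip install langchain openai faiss-cpu
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA

# Placeholder transcript snippets — in practice, load and chunk real episodes
transcripts = [
    "Episode 693: Jon and his guest discuss open-source LLMs...",
    "Episode 700: Jon reads Alan Watts's 'Dream of Life'...",
]

# Embed the chunks and index them for similarity search
vectorstore = FAISS.from_texts(transcripts, OpenAIEmbeddings())

# Chain a chat model to the retriever: relevant chunks get stuffed into the prompt
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo"),
    retriever=vectorstore.as_retriever(),
)
print(qa.run("What was episode 700 about?"))
```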
Having listened to the podcast for years, Kris flips the script on me at the end of the episode and asks some of the burning questions he has for me — questions that perhaps many other listeners have also wondered about.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Big A.I. R&D Risks Reap Big Societal Rewards, with Meta’s Dr. Laurens van der Maaten
By making big research bets, the prolific Meta Senior Research Director Dr. Laurens van der Maaten has devised or supported countless world-changing machine-learning innovations across healthcare, climate change, privacy and more.
Laurens:
• Is a Senior Research Director at Meta, overseeing swathes of their high-risk, high-reward A.I. projects with application areas as diverse as augmented reality, biological protein synthesis and tackling climate change.
• Developed the "CrypTen" privacy-preserving ML framework.
• Pioneered web-scale weakly supervised training of image-recognition models.
• Along with the iconic Geoff Hinton, created the t-SNE dimensionality reduction technique (this paper alone has been cited over 36,000 times).
• In aggregate, his works have been cited nearly 100,000 times!
• Holds a PhD in machine learning from Tilburg University in the Netherlands.
Today’s episode will probably appeal primarily to hands-on data science practitioners, but there’s plenty of content in it for anyone who’d like to appreciate the state of the art in A.I. across a broad range of socially impactful, super-cool applications.
In this episode, Laurens details:
• How he pioneered learning across billions of weakly labeled images to create a state-of-the-art machine-vision model.
• How A.I. can be applied to the synthesis of new biological proteins with implications for both medicine and agriculture.
• Specific ways A.I. is being used to tackle climate change as well as to simulate wearable materials for enhancing augmented-reality interactivity.
• CrypTen, a library just like PyTorch but where all the computations are encrypted (see the brief sketch after this list).
• The wide range of applications of his ubiquitous dimensionality-reduction approach.
• His vision for the impact of A.I. on society in the coming decades.
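To make that concrete, here's a tiny illustrative snippet of CrypTen's PyTorch-style API — my own sketch, not code from the episode:

```python
# pip install crypten
import torch
import crypten

crypten.init()

# Encrypt plain PyTorch tensors
x = crypten.cryptensor(torch.tensor([1.0, 2.0, 3.0]))
y = crypten.cryptensor(torch.tensor([4.0, 5.0, 6.0]))

# Arithmetic runs under encryption, with the familiar tensor API
z = x * y + x
print(z.get_plain_text())  # decrypt: approximately tensor([ 5., 12., 21.])
```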
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
ChatGPT Code Interpreter: 5 Hacks for Data Scientists
The ChatGPT Code Interpreter is surreal: It creates and executes Python code for whatever task you describe, debugs its own runtime errors, displays charts, does file uploads/downloads, and suggests sensible next steps all along the way.
Whether you write code yourself today or not, you can take advantage of GPT-4's stellar natural-language input/output capabilities to interact with the Code Interpreter. The mind-blowing experience is equivalent to having an expert data analyst, data scientist or software developer with you to instantaneously respond to your questions or requests.
As an example of these jaw-dropping capabilities (and given the data science-focused theme of my show), I use today's episode to demonstrate the ChatGPT Code Interpreter's full automation of data analysis and machine learning. If you watch the episode on YouTube, you can even see the Code Interpreter hands-on in action while I interact with it solely through natural language.
Over the course of today's episode/video, the Code Interpreter:
1. Receives a sample data file that I provide it.
2. Uses natural language to describe all of the variables that are in the file.
3. Performs a four-step Exploratory Data Analysis (EDA), including histograms, scatterplots comparing key variables, and summary statistics (all explained in natural language).
4. Preprocesses all of my variables for machine learning.
5. Selects an appropriate baseline ML model, trains it and quantitatively evaluates its performance.
6. Suggests alternative models and approaches (e.g., grid search) to get even better performance and then automatically carries these out.
7. Optionally provides Python code every step of the way and is delighted to answer any questions I have about the code (a hand-written sketch of what such code might look like follows this list).
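For the curious, here's a hand-written sketch of the kind of Python the Code Interpreter generates for steps like these; the file name, column names and model choice are all hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("sample_data.csv")            # 1. load the uploaded file
print(df.describe(include="all"))              # 2-3. summary statistics for EDA
df.hist(figsize=(10, 8))                       # 3. histograms of each variable

X = pd.get_dummies(df.drop(columns="target"))  # 4. one-hot encode the features
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42)  # 5. baseline model
model.fit(X_train, y_train)
print("Baseline accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 6. grid search over hyperparameters for better performance
grid = GridSearchCV(model, {"n_estimators": [100, 300], "max_depth": [None, 10]})
grid.fit(X_train, y_train)
print("Tuned accuracy:", accuracy_score(y_test, grid.predict(X_test)))
```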
The whole process is a ton of fun and, again, requires no coding ability (the "Code Interpreter" moniker could be misleadingly intimidating to non-coding folks). Even as an experienced data scientist, however, I'd estimate that in many everyday situations the Code Interpreter could decrease my development time by a crazy 90% or more.
The big caveat with all of this is whether you're comfortable sharing your code and data with OpenAI. Don't provide proprietary company code to it without clearing that with your firm first and — if you do use proprietary code with it — turn "Chat history & training" off in your ChatGPT Plus settings. To circumvent the data-privacy issue entirely, you could alternatively run Meta's newly released "Code Llama — Instruct 34B" Large Language Model on your own infrastructure (a minimal sketch follows below). Code Llama won't, however, match the Code Interpreter in many circumstances, and it will require some technical savvy to get up and running.
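If you do go the self-hosted route, a minimal starting point with Hugging Face's transformers library might look like the sketch below. It assumes the Code Llama weights on the Hugging Face Hub and plenty of GPU memory; the lighter 7B and 13B Instruct variants follow the same pattern.

```python
# pip install transformers accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-34b-Instruct-hf"  # 7B/13B variants are lighter
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Llama-2-style instruction wrapper
prompt = "[INST] Write a Python function that checks whether a string is a palindrome. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```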
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Vicuna, Gorilla, Chatbot Arena and Socially Beneficial LLMs, with Prof. Joey Gonzalez
Vicuna, Gorilla and the Chatbot Arena are all critical elements of the new open-source LLM ecosystem — and the extremely knowledgeable and innovative Prof. Joseph Gonzalez is behind all of them. Get the details in today's episode.
Joey:
• Is an Associate Professor of Electrical Engineering and Computer Science at the University of California, Berkeley.
• Co-directs the Berkeley RISE Lab, which studies Real-time, Intelligent, Secure and Explainable systems.
• Co-founded Turi (acquired by Apple for $200m) and more recently Aqueduct.
• His research is integral to major software systems including Apache Spark, Ray (for scaling Python ML), GraphLab (a high-level interface for distributed ML) and Clipper (low-latency ML serving).
• His papers—published in top ML journals—have been cited over 24,000 times.
• Developed Berkeley's upper-division data science class, which he now teaches to over 1,000 students per semester.
Today’s episode will probably appeal primarily to hands-on data science practitioners, but we made an effort to break down technical terms so that anyone who’s interested in staying on top of the latest in open-source Generative A.I. can enjoy the episode.
In it, Prof. Gonzalez details:
• How his headline-grabbing LLM, Vicuna, came to be one of the leading open-source alternatives to ChatGPT.
• How his Chatbot Arena became the leading proving ground for commercial and open-source LLMs alike.
• How his Gorilla project enables open-source LLMs to call APIs, making it an open-source alternative to ChatGPT’s powerful plugin functionality.
• The race for longer LLM context windows.
• How both proprietary and open-source LLMs will thrive alongside each other in the coming years.
• His vision for how A.I. will have a massive, positive societal impact over the coming decades.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Large Language Model Leaderboards and Benchmarks
Llamas, Alpacas, Koalas, Falcons... there is a veritable zoo of LLMs out there! In today's episode, Caterina Constantinescu breaks down the LLM Leaderboards and evaluation benchmarks to help you pick the right LLM for your use case.
Caterina:
• Is a Principal Data Consultant at GlobalLogic, a full-lifecycle software development services provider with over 25,000 employees worldwide.
• Previously, she worked as a data scientist for financial services and marketing firms.
• Is a key player in data science conferences and Meetups in Scotland.
• Holds a PhD from The University of Edinburgh.
In this episode, Caterina details:
• The best leaderboards (e.g., HELM, Chatbot Arena and the Hugging Face Open LLM Leaderboard) for comparing the quality of both open-source and proprietary Large Language Models (LLMs).
• The advantages and issues associated with LLM evaluation benchmarks (e.g., evaluation-dataset contamination is a big issue because the top-performing LLMs are often trained on all the publicly available data they can find... including benchmark-evaluation datasets).
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Feeding the World with ML-Powered Precision Agriculture
Earth's population may peak around 10 billion later this century. To feed everyone while also avoiding climate disaster, A.I. is essential. Today, three leaders from Syngenta detail how ML is transforming agriculture and assuring our future.
Jon’s “Generative A.I. with LLMs” Hands-on Training
Today's episode introduces my two-hour "Generative A.I. with LLMs" training, which is packed with hands-on Python demos in Colab notebooks. It details both open-source (Hugging Face; PyTorch Lightning) and commercial (OpenAI API) LLM options.
How Data Happened: A History, with Columbia Prof. Chris Wiggins
Today, Chris Wiggins — Chief Data Scientist at The New York Times and faculty at Columbia University — provides an enthralling, witty and rich History of Data Science. Chris is an extraordinarily gifted orator; don't miss this episode!
Chris:
• Is an Associate Professor of Applied Math at Columbia University.
• Has been Chief Data Scientist at The New York Times for nearly a decade.
• Co-authored two fascinating, recently published books: "How Data Happened: A History from the Age of Reason to the Age of Algorithms" and "Data Science in Context: Foundations, Challenges, Opportunities".
The vast majority of this episode will be accessible to anyone. There are just a couple of questions near the end that cover content on tools and programming languages that are primarily intended for hands-on practitioners.
In the episode, Chris magnificently details:
• The history of data and statistics, from their infancy centuries ago to the present.
• Why it’s a problem that most data scientists have limited exposure to the humanities.
• How and when Bayesian statistics became controversial.
• What we can do to address the key issues facing data science and ML today.
• His computational biology research at Columbia.
• The tech stack used for data science at the globally revered New York Times.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Llama 2 — It’s Time to Upgrade your Open-Source LLM
If you've been using fine-tuned open-source LLMs (e.g. for generative A.I. functionality or natural-language conversations with your users), it's very likely time you switch your starting model over to Llama 2. Here's why:
Generative A.I. without the Privacy Risks (with Prof. Raluca Ada Popa)
Consumers and enterprises dread that Generative A.I. tools like ChatGPT breach privacy by using conversations as training data, storing PII and potentially surfacing confidential data in responses. Prof. Raluca Ada Popa has all the solutions.
Today's guest, Raluca:
• Is Associate Professor of Computer Science at the University of California, Berkeley.
• Specializes in computer security and applied cryptography.
• Her papers have been cited over 10,000 times.
• Is Co-Founder and President of Opaque Systems, a confidential computing platform that has raised over $31m in venture capital to enable collaborative analytics and A.I., including allowing you to securely interact with Generative A.I.
• Previously co-founded PreVeil, a now-well-established company that provides end-to-end document and message encryption to over 500 clients.
• Holds a PhD in Computer Science from MIT.
Despite being such a deep expert, Raluca does a stellar job of communicating complex concepts simply, so today’s episode should appeal to anyone who wants to dig into the thorny issues around data privacy and security associated with Large Language Models (LLMs) and how to resolve them.
In the episode, Raluca details:
• What confidential computing is and how to do it without sacrificing performance.
• How you can perform inference with an LLM (or even train an LLM!) without anyone — including the LLM developer! — being able to access your data.
• How you can use commercial generative models like OpenAI’s GPT-4 without OpenAI being able to see the sensitive or personally identifiable information you include in your API query.
• The pros and cons of open-source versus closed-source A.I. development.
• How and why you might want to seamlessly run your compute pipelines across multiple cloud providers.
• Why you should consider a career that blends academia and entrepreneurship.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
“The Dream of Life” by Alan Watts
For episode #700 today, I bring you the "Dream of Life" thought experiment originally penned by Alan Watts. You are terrifically powerful (particularly now that you're armed with A.I.!) — are you making good use of your power?
Also, time flies, eh? Another hundred episodes in the bag today! Thanks for listening, providing feedback and otherwise contributing to making SuperDataScience, with over a million downloads per quarter, the most listened-to podcast in the data science industry. We've got some serious awesomeness lined up for the next hundred episodes — I can't wait for the amazing, inspiring, mind-opening conversations.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
The Modern Data Stack, with Harry Glaser
Today, the eloquent Harry Glaser details the Modern Data Stack, including cloud collaboration tools (like Deepnote), running ML from data warehouses (like Snowflake), using dbt Labs for model orchestration, and model-deployment best practices.
Harry:
• Is Co-Founder and CEO of Modelbit, a San Francisco-based startup that has raised $5m in venture capital to make the productionization of machine learning models as fast and as simple as possible.
• Previously, was Co-Founder and CEO of Periscope Data, a code-driven analytics platform that was acquired by Sisense for $130m.
• And, prior to that, was a product manager at Google.
• Holds a degree in Computer Science from the University of Rochester.
Today’s episode is squarely targeted at practicing data scientists but could be of interest to anyone who’d like to enrich their understanding of the modern data stack and how ML models are deployed into production applications.
In the episode, Harry details:
• The major tools available for developing ML models.
• The best practices for model deployment such as version control, CI/CD, load balancing and logging.
• The data warehouse options for running models.
• What model orchestration is.
• How BI tools can be leveraged to collaborate on model prototypes across your organization.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
How Firms Can Actually Adopt A.I., with Rehgan Avon
Rehgan Avon's DataConnect conference is this week and is getting rave reviews. In this SuperDataScience episode with Jon Krohn, the silver-tongued entrepreneur details how organizations can successfully adopt A.I.
The (Short) Path to Artificial General Intelligence, with Dr. Ben Goertzel
Today, the luminary Dr. Ben Goertzel details how we could realize Artificial General Intelligence (AGI) in 3-7 years, why he's optimistic about the Artificial Super Intelligence (ASI) this would trigger, and what post-Singularity society could be like.
Dr. Goertzel:
• Is CEO of SingularityNET, a decentralized open market for A.I. models that aims to bring about AGI and thus the singularity that would transform society beyond all recognition.
• Has been Chairman of The AGI Society for 14 years.
• Has been Chairman of the foundation behind OpenCog — an open-source AGI framework — for 16 years.
• Was previously Chief Scientist at Hanson Robotics Limited, the company behind Sophia, the world’s most recognizable humanoid robot.
• Holds a PhD in mathematics from Temple University and held tenure-track professorships prior to transitioning to industry.
Today’s episode has parts that are relatively technical, but much of the episode will appeal to anyone who wants to understand how AGI — a machine that has all of the cognitive capabilities of a human — could be brought about and the world-changing impact that would have.
In the episode, Ben details:
• The specific approaches that could be integrated with deep learning to realize, in his view, AGI in as few as 3-7 years.
• Why the development of AGI would near-instantly trigger the development of ASI — a machine with intellectual capabilities far beyond humans’.
• Why, despite triggering the singularity — beyond which we cannot make confident predictions about the future — he’s optimistic that AGI will be a positive development for humankind.
• The connections between self-awareness, consciousness and the ASI of the future.
• With admittedly wide error bars, what a society that includes ASI may look like.
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
Brain-Computer Interfaces and Neural Decoding, with Prof. Bob Knight
In today's extraordinary episode, Prof. Bob Knight details how ML-powered brain computer interfaces (BCIs) could allow real-time thought-to-speech synthesis and the reversal of cognitive decline associated with aging.
This is a rare treat as "Dr. Bob" doesn't use social media and has only made two previous podcast appearances: on Ira Flatow's "Science Friday" and a little-known program called "The Joe Rogan Experience".
Dr. Bob:
• Is Professor of Neuroscience and Psychology at the University of California, Berkeley.
• Is Adjunct Professor of Neurology and Neurosurgery at UC San Francisco.
• Over his career, has amassed tens of millions of dollars in research funding, 75 patents, and countless international awards for neuroscience and cognitive computing research.
• His hundreds of papers have together been cited over 70,000 times.
In this episode, Bob details:
• Why the “prefrontal cortex” region of our brains makes us uniquely intelligent relative to all the other species on this planet.
• The invaluable data that can be gathered by putting recording electrodes through our skulls and directly into our brains.
• How "dynamic time-warping" algorithms allow him to decode imagined sounds, even musical melodies, through recording electrodes implanted into the brain.
• How BCIs are life-changing for a broad range of illnesses today.
• The extraordinary ways that advances in hardware and machine learning could revolutionize medical care with BCIs in the coming years.
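For reference, dynamic time warping aligns two sequences that unfold at different speeds by finding the minimum-cost alignment via dynamic programming. Here's a generic, minimal implementation — an illustration of the core algorithm, not the actual decoding pipeline from Dr. Bob's lab:

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic O(len(a) * len(b)) dynamic time warping between two 1-D signals."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])  # local mismatch between samples
            # extend the cheapest of the three possible alignment moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

# Two renditions of the same melody, one stretched in time: DTW distance stays small
print(dtw_distance(np.array([0, 1, 2, 1, 0]), np.array([0, 0, 1, 1, 2, 1, 0])))
```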
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
NLP with Transformers, feat. Hugging Face’s Lewis Tunstall
Lewis Tunstall — brilliant author of the bestseller "NLP with Transformers" and an ML Engineer at Hugging Face — today details how to train and deploy your own LLMs, the race for an open-source ChatGPT, and why RLHF leads to better models.
Dr. Tunstall:
• Is an ML Engineer at Hugging Face, one of the most important companies in data science today because they provide much of the most critical infrastructure for A.I. through open-source projects such as their ubiquitous Transformers library, which has a staggering 100,000 stars on GitHub.
• Is a member of Hugging Face’s prestigious research team, where he is currently focused on bringing us closer to having an open-source equivalent of ChatGPT by building tools that support RLHF (reinforcement learning from human feedback) and large-scale model evaluation.
• Authored “Natural Language Processing with Transformers”, an exceptional bestselling book that was published by O'Reilly last year and covers how to train and deploy Large Language Models (LLMs) using open-source libraries.
• Prior to Hugging Face, was an academic at the University of Bern in Switzerland and held data science roles at several Swiss firms.
• Holds a PhD in theoretical and mathematical physics from the University of Adelaide in Australia.
Today’s episode is definitely on the technical side, so it will likely appeal most to folks like data scientists and ML engineers, but as usual I made an effort to break down the technical concepts Lewis covered so that anyone who’s keen to be aware of the cutting edge in NLP can follow along.
In the episode, Lewis details:
• What transformers are.
• Why transformers have become the default model architecture in NLP in just a few years.
• How to train NLP models when you have little to no labeled data available.
• How to optimize LLMs for speed when deploying them into production.
• How you can optimally leverage the open-source Hugging Face ecosystem, including their Transformers library and their hub for ML models and data (see the quick-start example after this list).
• How RLHF aligns LLMs with the outputs users would like.
• How open-source efforts could soon meet or surpass the capabilities of commercial LLMs like ChatGPT.
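As a taste of how little code the Transformers library demands, here's the canonical quick-start pattern (the default sentiment model is downloaded from the Hugging Face hub on first use):

```python
# pip install transformers
from transformers import pipeline

# Downloads a default sentiment-analysis model from the hub on first use
classifier = pipeline("sentiment-analysis")
print(classifier("This podcast episode was fantastic!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```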
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
CatBoost: Powerful, efficient ML for large tabular datasets
CatBoost is making waves in open-source ML as it's often the top approach for tasks as diverse as classification, regression, ranking, and recommendation. This is especially so when working with tabular data that includes categorical variables.
With this justifiable excitement in mind, today's "Five-Minute Friday" episode of SuperDataScience is dedicated to CatBoost (short for “category” and “boosting”).
CatBoost has been around since 2017, when it was released by Yandex, a tech giant based in Moscow. In a nutshell, CatBoost — like the more established (and regularly Kaggle-leaderboard-topping) XGBoost and LightGBM — is at its heart a decision-tree algorithm that leverages gradient boosting. That explains the “boost” part of CatBoost.
The “cat” (“category”) part comes from CatBoost’s superior handling of categorical features. If you’ve trained models with categorical data before, you’ve likely experienced the tedium of preprocessing and feature engineering categorical variables. CatBoost comes to the rescue here: it handles categorical features automatically, employing techniques such as target encoding and one-hot encoding internally and thereby eliminating the need for extensive preprocessing or manual feature engineering (see the short example after the list below).
In addition to CatBoost’s superior handling of categorical features, the algorithm also makes use of:
• A specialized gradient-based optimization scheme known as Ordered Boosting, which computes each gradient estimate using only the training examples that precede a given example in a random permutation, avoiding the target leakage (and resulting prediction shift) of classical gradient boosting.
• Symmetric (“oblivious”) decision trees, which apply the same split across an entire level of the tree; this constraint enables faster training relative to XGBoost and training times comparable to LightGBM, which is famous for its speed.
• Regularization techniques such as the well-known L2 regularization, which — together with the ordered boosting and symmetric trees already discussed — make CatBoost less likely to overfit to training data than other boosted-tree algorithms.
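Here's what that native categorical handling looks like in practice — a minimal sketch with made-up data, in which raw string columns are passed straight to the model:

```python
# pip install catboost
import pandas as pd
from catboost import CatBoostClassifier

# Made-up tabular data with raw (unencoded) categorical columns
df = pd.DataFrame({
    "city": ["NYC", "LA", "NYC", "Chicago", "LA", "Chicago"],
    "plan": ["free", "pro", "pro", "free", "free", "pro"],
    "age":  [23, 35, 41, 29, 52, 37],
    "churned": [1, 0, 0, 1, 1, 0],
})
X, y = df.drop(columns="churned"), df["churned"]

model = CatBoostClassifier(iterations=100, depth=4, verbose=False)
# cat_features tells CatBoost which columns to encode internally —
# no manual one-hot or target encoding required
model.fit(X, y, cat_features=["city", "plan"])
print(model.predict(pd.DataFrame({"city": ["NYC"], "plan": ["pro"], "age": [30]})))
```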
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.
YOLO-NAS: The State of the Art in Machine Vision, with Harpreet Sahota
Deci's YOLO-NAS architecture provides today's state of the art in Machine Vision, specifically the key task of Object Detection. Harpreet Sahota joins us from Deci today to detail YOLO-NAS as well as where Computer Vision is going next.
Harpreet:
• Leads the deep learning developer community at Deci AI, an Israeli startup that has raised over $55m in venture capital and that recently open-sourced the YOLO-NAS deep learning model architecture.
• Through prolific data science content creation, including The Artists of Data Science podcast and his LinkedIn live streams, has amassed a social-media following in excess of 70,000.
• Previously worked as a lead data scientist and as a biostatistician.
• Holds a master’s in mathematics and statistics from Illinois State University.
Today’s episode will likely appeal most to technical practitioners like data scientists, but we did our best to break down technical concepts so that anyone who’d like to understand the latest in machine vision can follow along.
In the episode, Harpreet details:
• What exactly object detection is.
• How object detection models are evaluated.
• How machine vision models have evolved to excel at object detection, with an emphasis on the modern deep learning approaches.
• How a “neural architecture search” algorithm enabled Deci to develop YOLO-NAS, an optimal object-detection model architecture (a minimal usage sketch follows this list).
• The technical approaches that will enable large architectures like YOLO-NAS to be compute-efficient enough to run on edge devices.
• His “top-down” approach to learning deep learning, including his recommended learning path.
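If you'd like to try YOLO-NAS yourself, Deci open-sourced it via their super-gradients library. A minimal inference sketch (the image path is a placeholder) looks roughly like this:

```python
# pip install super-gradients
from super_gradients.training import models

# Load the large YOLO-NAS variant with weights pretrained on COCO
model = models.get("yolo_nas_l", pretrained_weights="coco")

# Run object detection on a local image
predictions = model.predict("street_scene.jpg")
predictions.show()  # draw the detected bounding boxes
```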
The SuperDataScience podcast is available on all major podcasting platforms, YouTube, and at SuperDataScience.com.