Jon Krohn, Cajoler of Datums

Getting Started

For those taking their first steps in data science, check out DataQuest, a comprehensive one-stop shop for an introduction to the fundamental tools and techniques of the trade. In particular, if you're eyeing an entry-level data scientist role, I harp on endlessly that candidates should show up with a portfolio of work; the folks at DataQuest have developed a series of blog posts on the topic.

 

 

General Data Scientist Tools

As initially outlined in my post on Data Scientist Skills and Salaries, here is a list of key data science tools. With a focus on coding in Python wherever possible, they are:

It's also helpful to develop familiarity with:

Note that these tools generally appear in the open-source Hadoop cluster in the O'Reilly Data Science Salary Survey. Based on demand and relative compensation, it appears that valuable next steps to becoming a unicorn-variety data scientist would be to equip oneself with parallel-processing tools (e.g., Spark, Hive, Pig).

 

 

Deep Learning

First Steps. For people in New York, I founded a Deep Learning Study Group. If you're further afield, you can track our progress via GitHub. Based on my experience with the study group, I have recorded a five-hour video series titled Deep Learning with TensorFlow LiveLessons that will be available in O'Reilly Safari in August 2017. As an intuitive, interactive introduction, the notebooks of code built over the course of the LiveLessons are available for free on GitHub today. Otherwise, get the lay of the land from:

  • the sequence of courses suggested by Greg Brockman, or
  • this (more comprehensive) introductory resource post from Ofir Press, or
  • this (even more comprehensive) guide from YerevaNN Research Lab

Textbooks. Relative to viewing lectures, I prefer reading and working through problems. The stand-out resources for this, in the order they ought to be tackled, are:

Interactive Demos. Top-drawer interactive demos from which you can develop an intuitive sense of neural networks are provided by:

  • Distill, the academic publication for visualising machine learning research
  • Chris Olah
  • the illustrious Andrej Karpathy 
  • fun, concise, browser-based (i.e., JavaScript) self-driving cars
  • ...in addition, I've curated introductory Jupyter notebooks across the popular libraries TFLearn, Keras, Theano, and TensorFlow here

Applications. Scroll down to see my recommendations for high-quality data sources as well as global issues in need of solutions. Problems worth solving with deep learning approaches in particular are curated by OpenAI. In addition, if you're at the stage that you'd like to test a General AI across a range of applications (e.g., games), work with: 

Academic Papers. If you're looking for the latest deep learning research, bookmark: 

The Future. Insights into emerging trends:

 

 

Lay Primers on Software and Artificial Intelligence

 

 

Fun Online Primers for Data Science Techniques

 

 

Excellent Lay Books on Statistics

 

 

Open Data Sources

To train a powerful model, the larger the data set, the better -- if it's well-organised and open, that's ideal. The following repositories are standouts that meet all these criteria: 

For machine learning models that require a lot of labelled data, check out:

Finally, here are extensive pages on importing data from the Web into R, provided by CRAN and MRAN.

 

 

Meetups

 

 

News

 

 

Clarity and Productivity

 

 

Charitable Projects

DataKind is a well-respected platform for finding humanitarian causes to which you can apply your data science skills.

 

 

Problems Worth Solving

 

 

List of Additional Tools

  • LaTeX for creating beautiful documents, including Beamer for slideshows and Pandoc for conversion to countless other formats (e.g., word processor formats for sharing with coworkers)
  • Amazon AWS, especially S3 buckets, EC2, and Redshift
  • I love the Mathematica-based Wolfram Alpha web interface, for funsies and for learning about mathematical concepts
  • Plotly is a free, easy-to-use GUI for collaboratively creating aesthetically-pleasing visualisations
  • if you would like a slick, professional tool for mining data from patents, companies and/or the news, check out Quid, which I used extensively for a political project

Eudaemonia

For a life of flourishing -- a life of beauty, truth, justice, play and love -- choose mathematics