Jon Krohn, Cajoler of Datums

General Data Scientist Tools

As initially outlined in my post on Data Scientist Skills and Salaries, here is a list of key data science tools. With a focus on coding in Python wherever possible, they are:

It's also helpful to develop familiarity with:

Note that these tools generally appear in the open-source Hadoop cluster in the O'Reilly Data Science Salary Survey. Based on demand and relative compensation, it appears that valuable next steps to becoming a unicorn-variety data scientist would be to equip oneself with parallel-processing tools (e.g., Spark, Hive, Pig).



Deep Learning

First Steps. For people in New York, I founded a Deep Learning Study Group. If you're further afield, you can track our progress via GitHub. Based on my experience with the study group, I have recorded a six-hour video series titled Deep Learning with TensorFlow LiveLessons that is available within Safari. As an intuitive, interactive introduction, the notebooks of code built over the course of the LiveLessons are available for free on GitHub. In addition, I offer a part-time in-classroom Deep Learning course at the NYC Data Science Academy.

Otherwise, get a lay of the land from: 

  • the sequence of courses suggested by Greg Brockman, or
  • this (more comprehensive) introductory resource post from Ofir Press, or
  • this (even more comprehensive) guide from YerevaNN Research Lab

Textbooks. Relative to viewing lectures, I prefer reading and working through problems. The stand-out resources for this, in the order I recommend tackling them, are:

Interactive Demos. Top-drawer interactive demos from which you can develop an intuitive sense of neural networks are provided by:

  • Distill, the academic publication for visualising machine learning research
  • Chris Olah
  • the illustrious Andrej Karpathy 
  • fun, concise, browser-based (i.e., JavaScript) self-driving cars
  • in addition, I've curated introductory Jupyter notebooks across the popular libraries TFLearn, Keras, Theano, and TensorFlow here

Applications. Scroll down to see my recommendations for high-quality data sources as well as global issues in need of solutions. Problems worth solving with deep learning approaches in particular are curated by OpenAI. In addition, if you're at the stage that you'd like to test a General AI across a range of applications (e.g., games), work with: 

Academic Papers. If you're looking for the latest deep learning research, bookmark: 

The Past. Histories of Deep Learning: 

The Future. Insights into emerging trends:



Lay Primers on Software and Artificial Intelligence



Fun Online Primers for Data Science Techniques



Excellent Lay Books on Statistics



Open Data Sources

To train a powerful model, the larger the data set, the better -- if it's well-organised and open, that's ideal. The following repositories are standouts that meet all these criteria: 

For machine learning models that require a lot of labelled data, check out:

Finally, here are extensive pages on importing data from the Web into R, provided by CRAN and MRAN.



Clarity and Productivity



Charitable Projects

DataKind is a well-respected platform for finding humanitarian causes to which you can apply your data science skills.



Problems Worth Solving



List of Additional Tools

  • LaTeX for creating beautiful documents, including Beamer for slideshows and Pandoc for conversion to countless other formats (e.g., word processor formats for sharing with coworkers)
  • Amazon Web Services (AWS), especially S3 buckets, EC2, and Redshift
  • I love the Mathematica-based Wolfram Alpha web interface, for funsies and for learning about mathematical concepts
  • Plotly is a free, easy-to-use GUI for collaboratively creating aesthetically-pleasing visualisations
  • if you would like a slick, professional tool for mining data from patents, companies and/or the news, check out Quid, which I used extensively for a political project


For a life of flourishing -- a life of beauty, truth, justice, play and love -- choose mathematics