Interested in Data Science?

This page curates great resources for learning Data Science. I will update this page as I learn more and more about Data Science.

Lectures

Udacity's Intro to Machine Learning. A very good place to start and pretty fun.

Andrew Ng's Machine Learning course. This lecture series is a classic!

Stanford's Computer Vision course (CS 231n). I highly recommend this course for learning about Neural Networks, CNNs, RNNs, GANs, and modern advances in computer vision. The material is incredibly up to date; check out the course webpage for more.

Daniel Soper's Introduction to Databases. This may be very basic for you or a much-needed source of information! Either way, it's an excellent series to check out. Episodes: Episode 1: Introduction to Databases, Episode 2: The Relational Model, Episode 3: SQL, Episode 4: Data Modeling and the ER Model, Episode 5: Database Design, Episode 6: Database Administration, Episode 7: Database Indexes, Episode 8: Big Data, Data Warehouses, and Business Intelligence Systems.

Reading

Neural Networks and Deep Learning by Michael Nielsen. A great introduction to Neural Networks with nice visualizations.

Command Line for Data Science. This is perhaps a niche skill among Data Scientists, but you can impress your co-workers with your Terminal talent! A very simple, efficient, automatable way to retrieve and scrub data.

Deep Learning Book by Ian Goodfellow, Yoshua Bengio and Aaron Courville. I don't think this book is very good, but it's written by some big shots so I marked it down here.

Genetic Programming by John Koza. A unique approach to machine learning. Koza introduces an evolutionary framework in which an algorithm learns and mutates.

Practice

Kaggle. This website is huge and designed for all levels of data scientists. If you are just starting out, check out Kaggle Learn.

Driven Data. Hosts competitions aimed at social good. Examples include predicting disease spread, making education spending more efficient, and predicting extreme poverty.

SQL Zoo. This is the best resource I know for learning SQL.

Berkeley's Data Science 100. Excellent up-to-date course on some of the basics of Data Science. I recommend trying the homework assignments here.

HackerRank. Challenges that span from basic to advanced on topics such as Statistics, Python, SQL, Java, and much more!

LeetCode. Similar to HackerRank, but I'd argue with a stronger emphasis on theory than language. It has great SQL exercises; check out the Database category.

Top Coder. Again similar to LeetCode and HackerRank, but more project focused. There are also opportunities to make money.

Visualizations

Holoviews tutorial. Highly advanced visualization tutorial that introduces interactive graphs, 3D plots, and more. Learn about Python's visualization libraries Holoviews and Bokeh.

Tableau. For some reason this can get a bad reputation for not being as serious as Python. I admit, when I first used Tableau, it reminded me of Scratch. However, Tableau is able to produce stunning visualizations in a fraction of the time it takes with a programming language. Here are some Tableau-specific resources:

Blogs

Colah's blog. Incredibly insightful, well-illustrated essays on Neural Networks from one of Google Brain's employees.

A Visual Introduction to Machine Learning. Some of the best visuals I've ever seen. See Part I and Part II.

Robert Chang's blog. I particularly recommend his Data Engineering posts (Part I, Part II, Part III).

Papers

I'm really sorry for doing this; blogs listing important papers are nothing new.

Stay Determined! Academic papers may appear dense and intimidating at first, and they stay that way. However, you get used to that aspect over time. You got this!


Convolutional Neural Networks (CNNs).

  • 2012 - ImageNet Classification with Deep Convolutional Neural Networks (link). This is the paper introducing AlexNet, a model that handily beat the runner-up in the ILSVRC-2012 competition.
  • Sept 2014 - Going Deeper with Convolutions (link). Introduces the Inception architecture proposed by Google. Each Inception module applies several filter sizes in parallel, and the resulting network won ILSVRC14 with fewer parameters and less computation.
  • Oct 2014 - Rich feature hierarchies for accurate object detection and semantic segmentation (link). Details how to split an image into region proposals, then classify each proposal with a CNN.

Recurrent Neural Networks (RNNs).

  • 1997 - Long Short-Term Memory (link). This paper introduces the LSTM architecture, which has since become the most successful model for RNNs.
  • Jun 2014 - Recurrent Models of Visual Attention (link). This paper explores an RNN that performs better than a CNN baseline on cluttered images. The trick is selective focus; the network looks at small glimpses of the image and then decides where to look next, much like a human eye.
  • May 2015 - DRAW: A Recurrent Neural Network For Image Generation (link). This paper builds on the previous one by using the attention idea to generate new images.

Generative Adversarial Networks (GANs)

  • Jun 2014 - Generative Adversarial Networks (link). Introduces a model capable of generating realistic data by using two networks: (1) a discriminative network learning whether data is generated or real and (2) a generative network trying to produce data that maximizes the mistakes of the discriminative model. It might be interesting to look at Goodfellow's code supplementing the paper (link).
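
For reference, the two-network game described above is usually written as the minimax objective below. Here D is the discriminative network, G is the generative network, p_data is the distribution of real data, and p_z is the noise distribution fed to the generator. Treat this as a summary of the standard formulation rather than a substitute for reading the paper.

```latex
\min_G \max_D \; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_z(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

D is trained to push V up (label real data as real and generated data as fake), while G is trained to push V down (fool D).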

Deep Learning (fun)

  • Mar 2012 - A Survey of Monte Carlo Tree Search Methods (link). Monte Carlo Tree Search (MCTS) is one of the central factors leading to the success of DeepMind in Go. Be aware that the paper may be a bit outdated after 6 years.
  • Oct 2017 - Mastering the game of Go without human knowledge (link). Who would expect a model with less human intervention, a single neural network, and 1/12th the TPUs to beat the current champion AlphaGo Lee 100 to 0?
  • Dec 2017 - Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm (link). The deep learning model for Go is extended to Chess and Shogi and defeats the previous computer world champions.

Other publications

I apologize if any links break over time; if you notice one that has, please email me at 'pstetz@live.com' and I will fix it promptly.

If you start to feel overwhelmed by papers, check out Mendeley to keep yourself organized.

Tools

A / B test Sample Size calculator (link) - Performs most of the mathematical work you'll need for A / B tests.
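
If you are curious what kind of math such a calculator does, here is a minimal sketch. It assumes a two-sided two-proportion z-test with the usual normal approximation; the linked calculator may well use a different formula, and the function name and default values here are my own choices for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate number of visitors needed in EACH group.

    Assumes a two-sided two-proportion z-test (normal approximation,
    unpooled variance). This is a textbook approximation, not the
    formula of any particular online calculator.
    """
    if p_control == p_variant:
        raise ValueError("the two conversion rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    effect = p_variant - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


if __name__ == "__main__":
    # Hypothetical example: 10% baseline conversion, hoping to detect a lift to 12%.
    print(sample_size_per_group(0.10, 0.12))
```

Note how quickly the required sample grows as the difference you want to detect shrinks, since the effect size is squared in the denominator.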

Kaggle Kernels (link) - Python and R notebooks running in the cloud. It isn't industrial-grade, but if you're just starting out, it's a good supplement to your laptop. Kaggle kernels move the computation away from your local device for FREE. The kernel specs are: 4 CPUs, 16GB RAM, 1GB disk space, 60min execution time.

Good luck!

If you have made it this far, you are extremely powerful and should consider using your newfound skills for good. I wish you the best on your journey as Data Science is as difficult as it is rewarding.