Commit 666bcd3f authored by Simon Bowly

Toss out the readme in favour of the wiki.

parent 9d4ad658
# Python for Data Science Resources
# ADS1001 PyData Resources
## Jupyter Notebooks
The easiest way to build pieces of analysis code is by using Jupyter notebooks.
Jupyter can be accessed through [MoVE](https://move.monash.edu/).
### Installing Jupyter Yourself
If you're using Jupyter a lot, you may find it faster to have it installed on your own computer, rather than always using it via MoVE.
The easiest way to get Python, Jupyter, and the full Python data science toolset is to install [Anaconda](https://docs.anaconda.com/anaconda/install/), which includes JupyterLab and gives you an interface similar to the one you see on MoVE.
## Python Basics
Handling data in Python, for the most part, doesn't require extensive knowledge of the classical parts of programming (loops, if statements, functions, classes, etc.), but it helps to be familiar with variables, basic data types (numbers, strings), operations (assignment, mathematical, logical), and data structures (lists, tuples, dictionaries).
You'll cover a lot of this in more depth next semester, but the basics should be enough to get started with the data science libraries in Python.
* Parts 1-5 of [this tutorial](https://gitlab.erc.monash.edu.au/andrease/Python4Maths/tree/master) cover basic Python concepts.
* Some familiarity with [NumPy](https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html) may also be helpful.
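
As a rough sketch of the basics listed above (variables, data types, operations, and data structures), the snippet below uses made-up names and values purely for illustration; none of them come from the course material.

```python
# Variables and basic data types (all values here are invented examples)
course = "ADS1001"        # string
students = 42             # integer
average_mark = 71.5       # floating point number

# Operations: mathematical and logical
total_marks = students * average_mark
is_large_class = students > 30 and average_mark > 50

# Data structures
marks = [65, 80, 72]                       # list: ordered, can be changed
point = (3, 4)                             # tuple: ordered, cannot be changed
capitals = {"Victoria": "Melbourne",       # dictionary: maps keys to values
            "Tasmania": "Hobart"}

marks.append(90)                           # add an item to the end of the list
mean_mark = sum(marks) / len(marks)        # average of the list
victoria_capital = capitals["Victoria"]    # look up a value by its key

print(course, total_marks, is_large_class, mean_mark, victoria_capital)
```

If each of these lines makes sense to you, you probably know enough core Python to start on the pandas material.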
## Pandas Tutorials
[Pandas](https://pandas.pydata.org/) is the core library used for reading, writing, and manipulating datasets in Python.
You should get familiar with the [Series](https://pandas.pydata.org/docs/getting_started/dsintro.html#series) and [DataFrame](https://pandas.pydata.org/docs/getting_started/dsintro.html#dataframe) structures, which are used as row/column containers for data.
Pandas then provides various methods for filtering, grouping, aggregating, and plotting the contents of a DataFrame.
* [pandas tutorials](https://pandas.pydata.org/docs/getting_started/tutorials.html)
* [10 minutes to pandas](https://pandas.pydata.org/docs/getting_started/10min.html): crash course in playing with DataFrames.
* [Pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf): great visual guide to the most common operations you'll need for transforming/grouping/plotting/exploring data sets.
* [pandas cookbook](https://github.com/jvns/pandas-cookbook): a set of Jupyter notebooks to follow through with real data to play with. This is set up using an online service called Binder which allows you to click through the exercises while running the code examples (see the 'how to use this cookbook' section).
* Other useful resources
  * [PyData TV](https://www.youtube.com/user/PyDataTV): this YouTube channel has a lot of useful videos from the PyData workshop series.
    * [.head() to .tail()](https://www.youtube.com/watch?v=7vuO9QXDN50) is a good one to start with.
  * [Stack Exchange](https://stackexchange.com/) (or more specifically [Stack Overflow](https://stackoverflow.com/)): Q&A site for programming related questions. Very useful to ask "how to do *x*" questions about Python/pandas in plain English. You'll usually find your question has been asked before and there are some good answers waiting for you.
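
To give a feel for the Series and DataFrame operations described in this section, here is a minimal sketch of filtering, grouping, and aggregating. The column names and rainfall figures are invented for illustration and are not real data.

```python
import pandas as pd

# A small DataFrame built from a dictionary (made-up numbers)
df = pd.DataFrame({
    "city": ["Melbourne", "Melbourne", "Sydney", "Sydney", "Hobart"],
    "year": [2019, 2020, 2019, 2020, 2020],
    "rainfall_mm": [450.0, 600.0, 900.0, 1100.0, 550.0],
})

# Selecting a single column gives a Series
rainfall = df["rainfall_mm"]

# Filtering: keep only the rows where rainfall exceeds 500 mm
wet_years = df[df["rainfall_mm"] > 500]

# Grouping and aggregating: mean rainfall for each city
mean_by_city = df.groupby("city")["rainfall_mm"].mean()

print(wet_years)
print(mean_by_city)
```

The tutorials listed above cover the same operations in much more depth, including reading data from files and plotting.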
## Model Building Libraries
You'll need to get familiar with other parts of the Python data science stack, such as:
* [matplotlib](https://matplotlib.org) - for producing plots from data. A lot of the matplotlib functionality is now available [directly from pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html), so you may not need to use this library directly very often.
* [statsmodels](https://www.statsmodels.org) - provides statistical functions such as regression modelling for pandas and NumPy data structures.
* [scikit-learn](https://scikit-learn.org) - library for common machine learning tasks such as classification, clustering, regression, dimensionality reduction, etc.
* [scipy](https://scipy.org) - large collection of more general algorithms and tools.
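
As one example of what these libraries look like in use, the sketch below fits a straight line with scikit-learn's `LinearRegression`. The data is synthetic (generated from y = 2x + 1 with no noise) and is assumed here only so the result is easy to check; real data will not fit this cleanly.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y = 2x + 1 exactly (an assumption for illustration)
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])  # one feature, five samples
y = 2.0 * x.ravel() + 1.0

# Fit the model, then read off the estimated slope and intercept
model = LinearRegression()
model.fit(x, y)
slope = model.coef_[0]
intercept = model.intercept_

# Predict the value of y for a new input, x = 5
prediction = model.predict(np.array([[5.0]]))[0]

print(slope, intercept, prediction)
```

Most scikit-learn models follow the same pattern: create the model, call `fit` on your data, then call `predict`.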
## Sharing Code and Collaborating
* [GitHub](https://github.com) can be very useful here. It's worth learning for a couple of reasons: 1) it keeps checkpoints of your work in case you break something and can't find your way back, and 2) multiple people can contribute to a central codebase, review each other's work and suggest changes. Also, it's the industry standard for storing code, so it is a valuable skill.
* **Please note**: by default, when you create a GitHub repository, everything you contribute is public. You can set a repository to private, but this only allows you to give access to up to three other people.
* Making your code public can be a useful showcase of your skills, especially in data science where the outputs are pretty visual. There's an interesting discussion to be had here around copyright and licensing to allow others to re-use your work ... for now please ensure you have permission from everyone involved in the creation of any work before you store it on a public site.
* If you do want to keep work private and still work in a GitHub-ish manner, you can create projects on Monash's [GitLab server](https://gitlab.erc.monash.edu.au/), which allows you to keep things entirely private in group work.
Nothing in the main repo at the moment; see the [Wiki](https://gitlab.erc.monash.edu.au/ads1001/python-data-science-resources/-/wikis/home).