Commit be7708c2 authored by Simon Bowly

Merge branch 'master' into 'master'

Store wiki notes in the repo.

See merge request ads1001/python-data-science-resources!28
parents 852ebae3 3b5222e8
# Python for Data Science Resources
This is a short list of tutorial links for learning Python for data science in ADS1001.
We'll continue to update it during semester as everyone starts to get up to speed.
[See here for some example Jupyter notebooks we're putting together as we go](https://gitlab.erc.monash.edu.au/ads1001/python-data-science-resources/-/blob/master/README.md).
Especially if you're new to Python, you may find it easiest to learn the relevant commands for data analysis by example.
The tools we're using here are built on top of the Python programming language.
For data science projects we use Jupyter notebooks as an environment for running Python code, since they allow you to mix code, documentation and outputs in a readable and repeatable format.
The main data science packages to be aware of are **pandas** and **numpy** for loading and transforming data, **matplotlib** for visualising data, and **statsmodels**, **scikit-learn** and **scipy** for conducting more complex analysis and building models.
This is not meant to be an exhaustive list, but hopefully it will get you started.
You'll find a huge number of publicly available learning resources via a quick Google search, whether step-by-step walkthroughs, video tutorials, or question and answer forums.
Noting down what style of online resources you found most useful (and why) will be a useful exercise for your reflective journals throughout semester.
## Running Jupyter
The easiest way to build and run pieces of analysis code is by using Jupyter notebooks.
Jupyter can be accessed through [MoVE](https://move.monash.edu/).
### Installing Jupyter Yourself
If you're using Jupyter a lot, you'll probably find it faster to have it installed on your own computer, rather than always using it via MoVE.
The easiest way to get Python, Jupyter, and the full Python data science toolset is to install [Anaconda](https://docs.anaconda.com/anaconda/install/), which includes the JupyterLab program and gives you a similar interface to what you see on MoVE.
## Python Basics
Handling data in Python, for the most part, doesn't require extensive knowledge of the classical parts of programming (loops, if statements, functions, classes, etc), but it would help to be familiar with variables, basic data types (numbers, strings), operations (assignment, mathematical, logical), and data structures (lists, tuples, dictionaries).
You'll cover a lot of this in more depth next semester, but the basics should be enough to get started with the data science libraries in Python.
* The [Monash Data Fluency](https://monashdatafluency.github.io/python-workshop-base/) Python modules give a good overview of the Python language, as well as an introduction to the data analysis components.
* Parts 1-5 of [this tutorial](https://gitlab.erc.monash.edu.au/andrease/Python4Maths/tree/master) cover basic Python concepts.
* Some familiarity with [NumPy concepts](https://jakevdp.github.io/PythonDataScienceHandbook/02.02-the-basics-of-numpy-arrays.html) may also be helpful.
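As a rough illustration of the basics listed above (variables, data types, operations, and data structures), here is a minimal sketch. All names and values are made up for the example and are not part of the course material.

```python
# Variables and basic data types (all values are hypothetical)
unit_code = "ADS1001"      # string
n_students = 120           # integer
pass_mark = 50.0           # float

# Operations: mathematical and logical
n_groups = n_students // 4                        # integer division
is_large = n_students > 100 and n_groups >= 20    # combines two comparisons

# Data structures
scores = [72, 48, 91, 65]                  # list: ordered and changeable
location = (-37.91, 145.13)                # tuple: ordered and fixed
student = {"name": "Alex", "score": 72}    # dictionary: key -> value lookup

scores.append(55)                          # lists can grow
average = sum(scores) / len(scores)
top_score = max(scores)
alex_passed = student["score"] >= pass_mark

print(unit_code, n_groups, is_large)
print(average, top_score, alex_passed)
```

If each of these lines makes sense to you, you probably know enough core Python to start on the pandas material below.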
## Pandas Tutorials
[Pandas](https://pandas.pydata.org/) is the core library used for reading, writing, and manipulating datasets in Python.
You should get familiar with the [Series](https://pandas.pydata.org/docs/getting_started/dsintro.html#series) and [DataFrame](https://pandas.pydata.org/docs/getting_started/dsintro.html#dataframe) structures, which are used as row/column containers for data, and the methods used to read data into DataFrames (e.g. the survey CSV file).
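A minimal sketch of these two structures and of reading CSV data is shown here. The column names and values are invented for illustration, and the data is read from an in-memory string so the example runs on its own. With a real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# A Series is a single labelled column of values.
heights = pd.Series([172, 165, 180], name="height_cm")

# Hypothetical survey data standing in for a real CSV file.
csv_text = """student,degree,height_cm,hours_sleep
A,Science,172,7.5
B,Arts,165,6.0
C,Science,180,8.0
"""

# read_csv accepts a file path or any file-like object.
survey = pd.read_csv(io.StringIO(csv_text))

print(survey.head())         # first few rows
print(survey.shape)          # (number of rows, number of columns)
print(survey.dtypes)         # data type of each column
print(survey["height_cm"])   # selecting one column gives a Series
```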
Pandas then provides various methods for filtering, grouping, aggregating, and plotting the contents of a DataFrame.
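For example, filtering, grouping, and aggregating might look like the sketch below. The DataFrame and its column names are made up for illustration and are not the actual survey data.

```python
import pandas as pd

# Hypothetical data; column names are assumptions for this example.
survey = pd.DataFrame({
    "degree": ["Science", "Arts", "Science", "Arts", "Science"],
    "hours_sleep": [7.5, 6.0, 8.0, 7.0, 5.5],
    "commute_min": [30, 45, 10, 60, 25],
})

# Filtering: keep only the rows where a condition holds.
long_sleepers = survey[survey["hours_sleep"] >= 7]

# Grouping and aggregating: mean sleep per degree.
mean_by_degree = survey.groupby("degree")["hours_sleep"].mean()

# Several aggregations at once, each with a named output column.
summary = survey.groupby("degree").agg(
    mean_sleep=("hours_sleep", "mean"),
    max_commute=("commute_min", "max"),
)

print(long_sleepers)
print(mean_by_degree)
print(summary)
```

Calling `.plot()` on a result such as `mean_by_degree` draws a chart, provided matplotlib is installed.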
Here are a few tutorial options to get started:
* [Pandas tutorials](https://pandas.pydata.org/docs/getting_started/tutorials.html)
* [10 minutes to pandas](https://pandas.pydata.org/docs/getting_started/10min.html): crash course in playing with DataFrames.
* [Pandas cheat sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf): great visual guide to the most common operations you'll need for transforming/grouping/plotting/exploring data sets.
* [Pandas cookbook](https://github.com/jvns/pandas-cookbook): a set of Jupyter notebooks to follow through with real data to play with. This is set up using an online service called Binder which allows you to click through the exercises while running the code examples (see the 'how to use this cookbook' section).
* Other useful resources
    * [PyData TV](https://www.youtube.com/user/PyDataTV): this YouTube channel has a lot of useful videos from the PyData workshop series.
        * [.head() to .tail()](https://www.youtube.com/watch?v=7vuO9QXDN50) is a good one to start with.
    * [Stack Exchange](https://stackexchange.com/) (or more specifically [Stack Overflow](https://stackoverflow.com/)): Q&A site for programming related questions. Very useful for asking "how to do *x*" questions about Python/pandas in plain English. You'll usually find your question has been asked before and there are some good answers waiting for you.
## Other Libraries
You'll need to get familiar with other parts of the Python data science stack, such as:
* [Statsmodels](https://www.statsmodels.org/stable/index.html) - provides statistical functions such as regression models for pandas and numpy data structures.
* [Matplotlib](https://matplotlib.org/) - for generating plots from data. A lot of the matplotlib functionality is now available [directly from pandas DataFrames](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html) so you may not need to use this library directly very often.
* [Scikit-learn](https://scikit-learn.org/stable/) - library for common machine learning tasks such as classification, clustering, regression, dimensionality reduction, etc.
* [Scipy](https://www.scipy.org/) - large collection of more general algorithms and tools.
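To give a feel for how these libraries work with pandas data, here is a minimal sketch that fits a straight line with scikit-learn. The data is made up so that it follows y = 2x + 1 exactly, and the column names are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data that follows y = 2x + 1 exactly.
df = pd.DataFrame({"x": np.arange(10, dtype=float)})
df["y"] = 2.0 * df["x"] + 1.0

# Features must be 2-D (hence the double brackets); the target is 1-D.
model = LinearRegression()
model.fit(df[["x"]], df["y"])

slope = model.coef_[0]
intercept = model.intercept_

# Predict y for a new, unseen x value.
new_data = pd.DataFrame({"x": [10.0]})
prediction = model.predict(new_data)[0]

print(slope, intercept, prediction)
```

Statsmodels can fit the same kind of model and also reports statistical summaries such as standard errors and p-values.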
Watch this space: we'll add more here once everyone is up to speed with the fundamentals of Python and pandas.
## Sharing Code and Collaborating
Using Google Drive to share files will probably suit your purposes at least for this semester.
[GitHub](https://www.github.com), although it takes some effort to master, is extremely useful for this purpose.
It's worth learning for a couple of reasons: 1) it keeps checkpoints of your work in case you break something and can't find your way back, and 2) multiple people can contribute to a central codebase, review each other's work and suggest changes.
It's also the industry standard for working on code collaboratively, so it's a valuable skill.
**Please note** that by default, when you create a GitHub repository, everything you contribute is public.
You can set a repository to private, but depending on your GitHub plan this may limit how many other people you can give access to (at the time of writing, up to three).
Ensure you have permission from everyone involved in the creation of any work before you store it on a public site.
If you do want to keep work private and still use the Git workflow, you can create projects on Monash's [GitLab server](https://gitlab.erc.monash.edu.au/), which allows you to keep a private repository for group work.
\ No newline at end of file