Commit e95d62ee authored by Sean Solari

Added quickstart tutorial. Revamp docs.

parent 2a4d8168
...@@ -18,7 +18,7 @@ most common of which are outlined in the FAQ section below. ...@@ -18,7 +18,7 @@ most common of which are outlined in the FAQ section below.
First download the source code from the GitLab repository. First download the source code from the GitLab repository.
```console ```console
user@computer:~$ git clone git@gitlab.erc.monash.edu.au:ssol0002/pam.git user@computer:~$ git clone git@github.com:seansolari/expam.git
``` ```
This can then be installed locally by executing the following command from the This can then be installed locally by executing the following command from the
source code root: source code root:
...@@ -34,6 +34,10 @@ View our online documentation! ...@@ -34,6 +34,10 @@ View our online documentation!
[https://expam.readthedocs.io/en/latest/index.html](https://expam.readthedocs.io/en/latest/index.html) [https://expam.readthedocs.io/en/latest/index.html](https://expam.readthedocs.io/en/latest/index.html)
See the Quick Start Tutorial for a guide to expam's basic usage and download links for pre-built databases.
[Quick Start Tutorial](https://expam.readthedocs.io/en/latest/quickstart.html)
<hr style="border:1px solid #ADD8E6"> </hr> <hr style="border:1px solid #ADD8E6"> </hr>
......
.. Colour profiles for sphinx.
.. role:: red
.. role:: green
.. role:: blue
.. role:: flexcontainer
.. role:: wideflexcontainer
.. role:: colone
.. role:: coltwo
.. role:: colthree
.. role:: colfour
.. role:: colfive
.. role:: colsix
...@@ -10,4 +10,201 @@ p { ...@@ -10,4 +10,201 @@ p {
.redbackground { .redbackground {
background-color: rgba(255, 0, 0, 0.1)!important; background-color: rgba(255, 0, 0, 0.1)!important;
font-weight: bold; font-weight: bold;
} }
\ No newline at end of file
.red {
color: red;
font-weight: bold;
}
.green {
/* color: white;
background-color: rgba(0, 255, 0, 1.0); */
border-left: 1em solid rgba(0, 255, 0, 1.0);
font-weight: 600;
padding: 15px;
}
.blue {
/* color: white;*/
/* background-color: rgba(0, 0, 255, 0.5); */
border-left: 1em solid rgba(0, 0, 255, 0.5);
font-weight: 600;
padding: 15px;
}
.flexcontainer {
width: 400px;
margin: 0 auto;
margin-top: 3em;
margin-bottom: 3em;
display: flex;
flex-direction: row;
flex-wrap: wrap;
align-items: center;
justify-content: center;
}
.wideflexcontainer {
width: 600px;
margin: 0 auto;
margin-top: 1em;
margin-bottom: 1em;
display: flex;
flex-direction: row;
flex-wrap: wrap;
align-items: center;
justify-content: center;
}
.colone {
border-left: 1em solid rgba(2, 110, 129, 1.0);
margin: 15px;
font-weight: 600;
padding: 15px;
flex: 1;
flex-basis: 30%;
}
.colone:hover {
background-color: rgba(2, 110, 129, 0.3);
}
.colone p {
margin: 0 auto;
}
.colone a {
color: black;
padding: 15px;
}
.coltwo {
border-left: 1em solid rgba(0, 171, 189, 1.0);
margin: 15px;
font-weight: 600;
padding: 15px;
flex: 1;
flex-basis: 30%;
}
.coltwo:hover {
background-color: rgba(0, 171, 189, 0.3);
}
.coltwo p {
margin: 0 auto;
}
.coltwo a {
color: black;
padding: 15px;
}
.colthree {
border-left: 1em solid rgba(0, 153, 221, 1.0);
margin: 15px;
font-weight: 600;
padding: 15px;
flex: 1;
flex-basis: 30%;
}
.colthree:hover {
background-color: rgba(0, 153, 221, 0.3);
}
.colthree p {
margin: 0 auto;
}
.colthree a {
color: black;
padding: 15px;
}
.colfour {
border-left: 1em solid rgba(255, 153, 51, 1.0);
margin: 15px;
font-weight: 600;
padding: 15px;
flex: 1;
flex-basis: 30%;
}
.colfour:hover {
background-color: rgba(255, 153, 51, 0.3);
}
.colfour p {
margin: 0 auto;
}
.colfour a {
color: black;
padding: 15px;
}
.colfive {
border-left: 1em solid rgba(161, 199, 224, 1.0);
margin: 15px;
font-weight: 600;
padding: 15px;
flex: 1;
flex-basis: 30%;
}
.colfive:hover {
background-color: rgba(161, 199, 224, 0.3);
}
.colfive p {
margin: 0 auto;
}
.colfive a {
color: black;
padding: 15px;
}
.colsix {
border-left: 1em solid rgba(242, 205, 172, 1.0);
margin: 15px;
font-weight: 600;
padding: 15px;
flex: 1;
flex-basis: 30%;
}
.colsix:hover {
background-color: rgba(242, 205, 172, 0.3);
}
.colsix p {
margin: 0 auto;
}
.colsix a {
color: black;
padding: 15px;
}
...@@ -110,7 +110,7 @@ Add reference sequences to the database. ...@@ -110,7 +110,7 @@ Add reference sequences to the database.
Add sequences to particular sequence group. Add sequences to particular sequence group.
See :doc:`Tutorial 1 <tutorials/quickstart>` for details. See :doc:`Tutorial 1 <tutorials/overview>` for details.
Examples Examples
...@@ -230,7 +230,7 @@ Run metagenomic reads against a succesfully built database. See :doc:`Tutorial 2 ...@@ -230,7 +230,7 @@ Run metagenomic reads against a succesfully built database. See :doc:`Tutorial 2
.. option:: --alpha <float> .. option:: --alpha <float>
Percentage requirement for classification subtrees (see :doc:`Tutorial 1 <tutorials/quickstart>` Percentage requirement for classification subtrees (see :doc:`Tutorial 1 <tutorials/overview>`
and :doc:`Tutorial 2 <tutorials/classify>`). and :doc:`Tutorial 2 <tutorials/classify>`).
.. option:: --itol .. option:: --itol
......
...@@ -15,17 +15,48 @@ Welcome to the **expam** documentation! ...@@ -15,17 +15,48 @@ Welcome to the **expam** documentation!
**expam** is a Python package for phylogenetic analysis of metagenomic data. **expam** is a Python package for phylogenetic analysis of metagenomic data.
Here is the `GitHub page <https://github.com/seansolari/expam>`_. .. include:: .special.rst
.. toctree:: .. container:: flexcontainer
:maxdepth: 2
:caption: Contents: .. container:: colone
:ref:`Installation Instructions`
.. container:: coltwo
:doc:`Quickstart <quickstart>`
.. container:: colthree
:doc:`Tutorials <tutorials/index>`
.. container:: colfour
.. raw:: html
<p>
<a class="reference internal" href="https://github.com/seansolari/expam" target="_blank">GitHub</a>
</p>
.. container:: colfive
.. raw:: html
<p>
<a class="reference internal" href="https://figshare.com/s/3475c3a9aa926a40c722" target="_blank">Database</a>
</p>
.. container:: colsix
.. raw:: html
<p>
<a class="reference internal" href="https://github.com/seansolari/expam/issues" target="_blank">Report Bug</a>
</p>
Documentation<commands>
dependencies
tutorials/index
tree
.. _Installation Instructions:
Installation Installation
------------ ------------
...@@ -50,6 +81,26 @@ Python Package Index (pip) ...@@ -50,6 +81,26 @@ Python Package Index (pip)
$ pip install expam $ pip install expam
From GitHub source
^^^^^^^^^^^^^^^^^^
To install from source, you need a local installation of Python >=3.8, as well as *numpy* and *cython*.
There are some commonly encountered problems when installing on Linux, the most common of which are
outlined in the FAQ section on the `GitHub page <https://github.com/seansolari/expam>`_.
First download the source code from the GitHub repository.
.. code-block:: console
$ git clone git@github.com:seansolari/expam.git
This can then be installed locally by executing the following command from the source code root.
.. code-block:: console
$ python3 setup.py install
Usage Usage
----- -----
...@@ -63,6 +114,10 @@ expam's CLI uses the same structure for all commands and operations: ...@@ -63,6 +114,10 @@ expam's CLI uses the same structure for all commands and operations:
For a comprehensive list of commands and arguments, see :doc:`commands <commands>`. Practical usage For a comprehensive list of commands and arguments, see :doc:`commands <commands>`. Practical usage
of these commands for building and classifying are given in the :doc:`tutorials <tutorials/index>`. of these commands for building and classifying are given in the :doc:`tutorials <tutorials/index>`.
.. image:: expam-figure-v4.2.jpg
:align: center
:alt: expam Pipeline.
**Important** - monitoring memory usage **Important** - monitoring memory usage
--------------------------------------- ---------------------------------------
...@@ -70,6 +125,15 @@ of these commands for building and classifying are given in the :doc:`tutorials ...@@ -70,6 +125,15 @@ of these commands for building and classifying are given in the :doc:`tutorials
Be aware of the built-in tools for monitoring and restricting **expam**'s memory usage, Be aware of the built-in tools for monitoring and restricting **expam**'s memory usage,
outlined :ref:`here <limiting resource usage>`. outlined :ref:`here <limiting resource usage>`.
.. toctree::
:maxdepth: 2
:caption: Contents:
quickstart
Documentation<commands>
dependencies
tutorials/index
tree
Indices and tables Indices and tables
================== ==================
......
Quickstart Tutorial
===================
We will be using a pre-built database to classify some metagenomic reads and obtain both phylogenetic and taxonomic output.
Get the database
----------------
* Download one of the following compressed databases (see :doc:`Tutorial 1 <tutorials/overview>` for instructions on building a database).
.. container:: wideflexcontainer
.. container:: colone
.. raw:: html
<p>
<a class="reference internal" href="https://drive.google.com/file/d/1K1sVA4LGgmGBVg_0GeUppVa_xcfWxxGL/view?usp=sharing" target="_blank">Test Database (110.7 Mb)</a>
</p>
.. container:: coltwo
.. raw:: html
<p>
<a class="reference internal" href="https://figshare.com/s/3475c3a9aa926a40c722" target="_blank">expam RefSeq (122.35 Gb)</a>
</p>
* Unzip the compressed file. For instance, if you downloaded :code:`Test Database` (:code:`test.tar.gz`) into your home directory :code:`~`, run
.. code-block:: console
$ tar -xvzf test.tar.gz
The database will now be located at :code:`~/test`, which is the directory you should pass to any :code:`-db` flag as input to **expam**.
* Download these metagenomic reads (simulated reads from a genome used to build the :code:`test` database).
.. container:: wideflexcontainer
.. container:: colone
.. raw:: html
<p>
<a class="reference internal" href="https://drive.google.com/file/d/1hTOndUelxf1cEEW8EIRYZrdxaHKez9qz/view?usp=sharing" target="_blank">Fasta reads (432 Kb)</a>
</p>
* Unzip these metagenomic reads into a new folder, which we will call :code:`reads`. Assuming you have downloaded and moved the above reads into your home directory, run
.. code-block::
$ mkdir reads
$ mv reads.tar.gz reads
$ cd reads
$ tar -xvzf reads.tar.gz
There should be two files:
* :code:`~/reads/GCF_000005845.2_ASM584v2_genomic.fna.gz_1.fa`,
* :code:`~/reads/GCF_000005845.2_ASM584v2_genomic.fna.gz_2.fa`.
These are paired-end reads in FASTA format.
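As a quick sanity check before classifying, the sketch below groups file names into forward/reverse pairs. It is only an illustration and not how **expam** itself pairs files: the function name :code:`pair_reads` is hypothetical, and the :code:`_1.fa` / :code:`_2.fa` suffix convention is assumed from the two file names listed above.

```python
from pathlib import PurePosixPath


def pair_reads(file_names):
    """Group names ending in _1.fa / _2.fa into (forward, reverse) pairs.

    Hypothetical helper: the suffix convention is an assumption taken
    from the example file names in this tutorial.
    """
    forward = {}
    reverse = {}
    for name in file_names:
        base = PurePosixPath(name).name
        if base.endswith("_1.fa"):
            forward[base[: -len("_1.fa")]] = name
        elif base.endswith("_2.fa"):
            reverse[base[: -len("_2.fa")]] = name
    # Keep only samples for which both mates are present.
    return {s: (forward[s], reverse[s]) for s in sorted(forward) if s in reverse}


files = [
    "reads/GCF_000005845.2_ASM584v2_genomic.fna.gz_1.fa",
    "reads/GCF_000005845.2_ASM584v2_genomic.fna.gz_2.fa",
    "reads/orphan_1.fa",  # no matching _2.fa, so it is dropped
]
pairs = pair_reads(files)
```

If :code:`pairs` has fewer entries than expected, one mate of a pair is probably missing from the folder.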
Phylogenetic classification
---------------------------
* We will now classify these reads using the database you downloaded. We will save the results to an output folder located at :code:`~/my_run/`.
* Run the :code:`expam classify` command as follows (replacing :code:`~/test` with where you decompressed the database from Step 1 if necessary):
.. code-block:: console
$ expam classify -db ~/test -d ~/reads --paired --out ~/my_run
Clearing old log files...
Loading the map and phylogeny.
Preparing shared memory allocations...
Loading database keys...
Loading database values...
* Initialising node pool...
* Checking for polytomies...
Polytomy (degree=3) detected! Resolving...
* Finalising index...
Loading reads from /Users/ssol0002/Documents/Projects/pam/test/data/reads/GCF_000005845.2_ASM584v2_genomic.fna.gz_2.fa, /Users/ssol0002/Documents/Projects/pam/test/data/reads/GCF_000005845.2_ASM584v2_genomic.fna.gz_1.fa...
Could not import ete3 plotting modules! Error raised:
Traceback (most recent call last):
File "/Users/ssol0002/Documents/Projects/pam/src/expam/tree/tree.py", line 622, in draw_tree
import ete3.coretype.tree
ModuleNotFoundError: No module named 'ete3'
Skipping plotting...
Could not import ete3 plotting modules! Error raised:
Traceback (most recent call last):
File "/Users/ssol0002/Documents/Projects/pam/src/expam/tree/tree.py", line 622, in draw_tree
import ete3.coretype.tree
ModuleNotFoundError: No module named 'ete3'
Skipping plotting...
.. note::
**expam** tried to plot the results on a phylogenetic tree but, because the ete3 module was not installed,
it skipped plotting. This message is expected: it tells you that **expam** could not produce a graphical
summary of your results.
* The phylogenetic classifications will be located at :code:`~/my_run/phy`, and will contain four files:
* :code:`~/my_run/GCF_000005845.2_ASM584v2_genomic.gz_1.csv` - sample summary file,
.. code-block::
unclassified 0.000000% 0 0
p1 100.000000% 1000 3 0.000000% 0 0
p2 99.700000% 997 232 0.000000% 0 0
GCF_000005845.2_ASM584v2_genomic 76.500000% 765 765 0.000000% 0 0
* :code:`~/my_run/classified.csv` - classified summary file,
.. code-block::
GCF_000005845.2_ASM584v2_genomic.gz_1
unclassified 0
p1 3
p2 232
GCF_000005845.2_ASM584v2_genomic 765
* :code:`~/my_run/split.csv` - split summary file,
.. code-block::
GCF_000005845.2_ASM584v2_genomic.gz_1
p1 0
p2 0
GCF_000005845.2_ASM584v2_genomic 0
* :code:`~/my_run/raw` - raw read-wise classifications. There will be a single raw read-wise output file, :code:`~/my_run/raw/GCF_000005845.2_ASM584v2_genomic.gz_1.csv`.
.. code-block::
C R4825323246286034638 p2 302 p2:240
C R4280015672552393909 p10 302 p10:240
C R5925738157954038177 p10 302 p1:5 p10:16 p2:198 p10:16 p1:5
C R3237657389899545456 p10 302 p2:85 p10:31 p2:8 p10:31 p2:85
C R6111671585932593081 p10 302 p2:36 p10:37 p2:3 p10:88 p2:3 p10:37 p2:36
C R4574482278193488645 p10 302 p10:29 p2:14 p10:31 p2:2 p10:88 p2:2 p10:31 p2:14 p10:29
C R8975058804953044791 p10 302 p10:21 p2:59 p10:80 p2:59 p10:21
C R6052336354009855322 p10 302 p2:53 p10:31 p2:72 p10:31 p2:53
The sample summary file is a tab-separated document where the first element of each row is a phylogenetic node/clade, and the remaining values contain details of the raw and cumulative classifications and splits at this particular node.
The classified summary file is a tab-separated matrix where each row is a phylogenetic clade, each column is an input sample, and the cell value is the raw counts at this clade. The split summary file is an analogous file that contains the raw split count at any given clade. These two files are formatted such that they will always have the same column and row indices, and in the same order.
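A minimal sketch of loading such a matrix with the Python standard library is shown here. It assumes the first row holds the sample names (possibly after an empty corner cell) and that each following row is a clade name followed by one integer per sample; the function name :code:`read_count_matrix` and the inline example data are illustrative only.

```python
import csv
import io


def read_count_matrix(handle):
    """Read a tab-separated clade-by-sample matrix into {sample: {clade: count}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    # Assumption: the header lists sample names; an empty corner cell is skipped.
    samples = [cell for cell in header if cell]
    counts = {sample: {} for sample in samples}
    for row in reader:
        if not row:
            continue
        clade, values = row[0], row[1:]
        for sample, value in zip(samples, values):
            counts[sample][clade] = int(value)
    return counts


# Example data mirroring the classified summary shown above.
example = (
    "\tGCF_000005845.2_ASM584v2_genomic.gz_1\n"
    "unclassified\t0\n"
    "p1\t3\n"
    "p2\t232\n"
    "GCF_000005845.2_ASM584v2_genomic\t765\n"
)
matrix = read_count_matrix(io.StringIO(example))
```

Because the classified and split files share row and column indices, the same function could be applied to both and the results compared clade by clade.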
The raw read-wise output is a sub-directory containing one output file for each input sample, each holding Kraken-formatted read-wise results.
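The sketch below parses one line of the raw read-wise output into its fields. The interpretation is an assumption based on the Kraken-style layout shown above: a status flag, the read identifier, the assigned node, the read length, and then :code:`node:count` tokens. The names :code:`ReadResult` and :code:`parse_read_line` are hypothetical.

```python
from typing import NamedTuple


class ReadResult(NamedTuple):
    status: str   # e.g. "C" in the example output
    read_id: str
    node: str     # node the read was assigned to
    length: int
    hits: list    # (node, count) tuples, in the order they appear


def parse_read_line(line):
    """Split one whitespace-separated read-wise line into a ReadResult."""
    fields = line.split()
    status, read_id, node, length = fields[:4]
    hits = []
    for token in fields[4:]:
        # Assumption: each trailing token has the form "<node>:<count>".
        name, _, count = token.rpartition(":")
        hits.append((name, int(count)))
    return ReadResult(status, read_id, node, int(length), hits)


result = parse_read_line(
    "C R5925738157954038177 p10 302 p1:5 p10:16 p2:198 p10:16 p1:5"
)
```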
A more comprehensive overview is given in :doc:`this tutorial <tutorials/classify>`.
Convert to taxonomy
-------------------
* First run :code:`expam download_taxonomy` to download the taxonomy for all sequences in the database. This requires an internet connection.
.. code-block:: console
$ expam download_taxonomy -db ~/test
Posting 6 UIDs to NCBI Entrez nuccore.
Received 6 response(s) for ESummary TaxID request!
Posting 6 UIDs to NCBI Entrez taxonomy.
Received 6 response(s) for EFetch Taxon request!
Taxonomic lineages written to ~/test/phylogeny/taxid_lineage.csv!
Taxonomic ranks written to ~/test/phylogeny/taxa_rank.csv!
* We saved our previous classification results to :code:`~/my_run`. This is the directory we pass to :code:`expam to_taxonomy` to convert phylogenetic classifications to taxonomy.
.. code-block:: console
$ expam to_taxonomy -db ~/test --out ~/my_run
* Initialising node pool...
* Checking for polytomies...
Polytomy (degree=3) detected! Resolving...
* Finalising index...
* Taxonomic output files will now be located in :code:`~/my_run/tax/`. These are analogous to the files in the phylogenetic output, with the exception of :code:`classified.csv` and :code:`split.csv`: only the sample summaries and raw read-wise output are converted.
* :code:`~/my_run/tax/GCF_000005845.2_ASM584v2_genomic.gz_1.csv` - taxonomic summary file
.. code-block::
c_perc c_cumul c_count s_perc s_cumul s_count rank scientific name
unclassified 0.0% 0 0 0% 0 0 0 0
1 100.0% 1000 0 0% 0 0 root
131567 100.0% 1000 0 0% 0 0 top cellular organisms
2 100.0% 1000 235 0% 0 0 superkingdom cellular organisms Bacteria
1224 76.5% 765 0 0% 0 0 phylum cellular organisms Bacteria Proteobacteria
* :code:`~/my_run/tax/raw/GCF_000005845.2_ASM584v2_genomic.gz_1.csv` - taxonomic read-wise output. The second column is the read header, the third is the assigned taxid, and the fourth is the length of the read. For paired-end reads this is the combined length of both mates, which is 302 for the reads used here.
.. code-block::
C R4825323246286034638 2 302
C R4280015672552393909 511145 302
C R5925738157954038177 511145 302
C R3237657389899545456 511145 302
C R6111671585932593081 511145 302
C R4574482278193488645 511145 302
C R8975058804953044791 511145 302
C R6052336354009855322 511145 302
C R4978825024774141837 2 302
C R7016203356160788326 511145 302
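To summarise a read-wise file like the one above, the following sketch tallies how many reads were assigned to each taxid. It assumes whitespace-separated columns in the order described (status, read header, taxid, length); :code:`count_taxids` is a hypothetical helper and the rows are copied from the example output.

```python
from collections import Counter


def count_taxids(lines):
    """Count reads per assigned taxid (third column of each line)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        counts[fields[2]] += 1
    return counts


rows = [
    "C R4825323246286034638 2 302",
    "C R4280015672552393909 511145 302",
    "C R5925738157954038177 511145 302",
    "C R4978825024774141837 2 302",
]
taxid_counts = count_taxids(rows)
```

For a real file, the lines could be passed directly from an open file handle instead of a list.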
A comprehensive overview is given in :doc:`this tutorial <tutorials/classify>`.
...@@ -5,7 +5,7 @@ Tutorials ...@@ -5,7 +5,7 @@ Tutorials
:maxdepth: 2 :maxdepth: 2
:caption: Contents: :caption: Contents:
quickstart overview
classify classify
treebuilding treebuilding
graphical graphical
......