Commit 1a259417 authored by Sean Solari

Update to documentation, --itol flag and log scaling of results

parent a79f4124
![expam logo](docs/source/expamlogo.png)
![expam logo](docs/source/expam-logo.png)
## **Install**.
......
......@@ -161,7 +161,7 @@ Classification
Classify
^^^^^^^^
Run metagenomic reads against a succesfully built database. See :doc:`Tutorial 2 <tutorials/classify>` form details.
Run metagenomic reads against a successfully built database. See :doc:`Tutorial 2 <tutorials/classify>` for more details.
.. code-block:: console
......@@ -180,9 +180,9 @@ Run metagenomic reads against a succesfully built database. See :doc:`Tutorial 2
To be supplied when sample files contain paired-end reads.
.. option:: --name <string>
.. option:: -o <str>, --out <str>
Name of results folder.
Path to save classification results and output in.
.. option:: --taxonomy
......@@ -217,6 +217,7 @@ Run metagenomic reads against a succesfully built database. See :doc:`Tutorial 2
.. option:: --group <sample name> <sample name> ...
Space-separated list of sample files to be treated as a single group in phylotree.
Groups are explained in this :ref:`tutorial <groups explanation>`.
.. note::
......@@ -232,13 +233,34 @@ Run metagenomic reads against a succesfully built database. See :doc:`Tutorial 2
Percentage requirement for classification subtrees (see :doc:`Tutorial 1 <tutorials/quickstart>`
and :doc:`Tutorial 2 <tutorials/classify>`).
.. option:: --itol
Rather than use :code:`ete3` for plotting the phylogenetic tree, **expam** will output files that can be
used with iTOL for plotting. See the :ref:`classification tutorial <itol integration>` for details.
.. option:: --log-scale
Compute a log-transform on the counts at each node in the phylogenetic tree before
depiction on the phylotree.
.. note::
For a given sample :math:`S`, with minimum and maximum counts :math:`\underline{c}` and :math:`\overline{c}`
respectively (:math:`\underline{c} > 0` i.e. the smallest non-zero score), the log-transform :math:`f` of some count :math:`x` is defined by
.. math::
f(x) = \frac{ \log\left(x / \underline{c}\right) }{ \log\left(\overline{c} / \underline{c}\right) },
so that :math:`f(x)\in[0,1]`. Then :math:`f(x)` is treated as an opacity score for plotting purposes.
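The transform above can be sketched in a few lines of Python. This is only an illustration of the formula, not **expam**'s actual implementation: the function name :code:`log_scale` is hypothetical, and the handling of zero counts and of samples where all non-zero counts are equal (where the denominator vanishes) are assumptions made for this sketch.

```python
import math


def log_scale(counts):
    """Map raw node counts to opacity scores in [0, 1] (illustrative sketch)."""
    positive = [c for c in counts if c > 0]
    if not positive:
        # Assumption: a sample with no assigned reads gets zero opacity everywhere.
        return [0.0 for _ in counts]

    c_min, c_max = min(positive), max(positive)  # smallest non-zero and largest counts
    if c_min == c_max:
        # Assumption: the formula is undefined here (log(1) = 0 in the denominator),
        # so every non-zero node is simply given full opacity.
        return [1.0 if c > 0 else 0.0 for c in counts]

    denominator = math.log(c_max / c_min)
    # Assumption: zero counts are mapped to 0.0 rather than passed to log().
    return [math.log(c / c_min) / denominator if c > 0 else 0.0 for c in counts]
```

For counts of 1, 10 and 100, the smallest count maps to 0, the largest to 1, and the middle count lands halfway between them, which is the intended effect of log scaling on skewed count distributions.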
Example
"""""""
.. code-block:: console
$ expam run -db DB_NAME -d /path/to/paired/reads --paired --name paired_reads_analysis --taxonomy
$ expam run -db DB_NAME -d /path/to/paired/reads --paired --out ~/paired_reads_analysis --taxonomy
.. _download taxonomy:
......@@ -280,3 +302,111 @@ Translate phylogenetic classification output to NCBI taxonomy.
$ expam to_taxonomy --db DB_NAME
Plotting results on phylotree
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Results are automatically visualised on top of a phylogenetic tree during the :code:`expam run` command,
but this can also be done after classification using the :code:`phylotree` command.
.. code-block:: console
$ expam phylotree -db DB_NAME --out /path/to/classification/output [args...]
.. option:: -o <str>, --out <str>
Path to retrieve classification results for plotting.
.. option:: --phyla
Colour phylotree results by phyla.
.. option:: --keep_zeros
Keep nodes in output where no reads have been assigned.
.. option:: --ignore_names
Don't plot names of reference genomes in output phylotree.
.. option:: --colour_list <hex string> <hex string> ...
List of colours to use when plotting groups in phylotree.
.. option:: --group <sample name> <sample name> ...
Space-separated list of sample files to be treated as a single group in phylotree.
Groups are explained above, and in this :ref:`tutorial <groups explanation>`.
.. option:: --itol
Rather than use :code:`ete3` for plotting the phylogenetic tree, **expam** will output files that can be
used with iTOL for plotting. See the :ref:`classification tutorial <itol integration>` for details.
.. option:: --log-scale
Compute a log-transform on the counts at each node in the phylogenetic tree before
depiction on the phylotree.
.. _limiting resource usage:
Limiting resource usage
-----------------------
**expam** allows you to provide an :code:`expam_limit` context before the :code:`expam` call to limit
how much RAM is used. *Note that this doesn't change any underlying algorithms, it simply
prepares a graceful exit of the program if it exceeds the supplied limit.* See :ref:`examples<limit example>`
for an example usage.
.. option:: -m <int>, --memory <int>
Memory limit in bytes.
.. option:: -x <float>, --x <float>
Proportion of total available memory to limit to, given as a fraction (for example, :code:`0.5` for half).
.. option:: -t <float>, --interval <float>
Interval, in seconds, at which program memory usage is written to the log file.
.. option:: -o <str>, --out <str>
Log file to write to. By default, logs are written to console.
.. _limit example:
Example
^^^^^^^
The following will perform a database build while restricting *expam*'s total
memory usage to half of the machine's available RAM, writing logs
at 1-second intervals to a :code:`build.log` file.
.. code-block:: console
$ expam_limit -x 0.5 -t 1.0 -o build.log expam build ...
.. warning::
It is important that the :code:`expam_limit` command comes before
the :code:`expam` command.
.. note::
The :code:`expam_limit` context works the same for any command. :code:`expam build`
can be replaced with :code:`expam run`, or any other command.
The following is an example of the (tab-separated) log file output:
.. code-block::
2022-03-11 02:25:05,888 ... total used free shared buff/cache available
2022-03-11 02:25:05,903 ... Mem: 944Gi 1.6Gi 427Gi 0.0Ki 515Gi 938Gi
2022-03-11 02:25:06,915 ... Mem: 944Gi 1.6Gi 427Gi 0.0Ki 515Gi 938Gi
2022-03-11 02:25:07,928 ... Mem: 944Gi 2.2Gi 427Gi 38Mi 515Gi 937Gi
2022-03-11 02:25:08,940 ... Mem: 944Gi 2.2Gi 426Gi 195Mi 515Gi 937Gi
2022-03-11 02:25:09,953 ... Mem: 944Gi 2.2Gi 426Gi 353Mi 515Gi 937Gi
2022-03-11 02:25:10,966 ... Mem: 944Gi 2.2Gi 426Gi 516Mi 516Gi 937Gi
2022-03-11 02:25:11,980 ... Mem: 944Gi 2.2Gi 426Gi 682Mi 516Gi 936Gi
2022-03-11 02:25:12,992 ... Mem: 944Gi 2.2Gi 426Gi 848Mi 516Gi 936Gi
2022-03-11 02:25:14,005 ... Mem: 944Gi 2.2Gi 425Gi 1.0Gi 516Gi 936Gi
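If you want to inspect such a log programmatically, the memory rows can be pulled apart with a small helper. The following is a minimal sketch, assuming the six values after :code:`Mem:` always follow the column order of the header row shown above; the function name :code:`parse_mem_line` is hypothetical and not part of **expam**.

```python
# Column order taken from the header row of the example log above.
COLUMNS = ["total", "used", "free", "shared", "buff/cache", "available"]


def parse_mem_line(line):
    """Return a dict of memory columns for a 'Mem:' log row, or None otherwise."""
    tokens = line.split()  # splits on tabs and spaces alike
    if "Mem:" not in tokens:
        return None  # e.g. the header row, or any unrelated log message

    start = tokens.index("Mem:") + 1
    values = tokens[start:start + len(COLUMNS)]
    if len(values) != len(COLUMNS):
        return None  # truncated row

    # Values are kept as strings (e.g. "2.2Gi"); unit conversion is left out.
    return dict(zip(COLUMNS, values))
```

Applied to each line of :code:`build.log`, this yields one dictionary per sampling interval, from which the :code:`used` column can be tracked over time.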
......@@ -3,7 +3,7 @@
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
.. image:: expamlogo.png
.. image:: expam-logo.png
:width: 500
:align: center
:alt: logo
......@@ -15,6 +15,8 @@ Welcome to the **expam** documentation!
**expam** is a Python package for phylogenetic analysis of metagenomic data.
Here is the `GitHub page <https://github.com/seansolari/expam>`_.
.. toctree::
:maxdepth: 2
:caption: Contents:
......@@ -62,6 +64,13 @@ For a comprehensive list of commands and arguments, see :doc:`commands <commands
of these commands for building and classifying are given in the :doc:`tutorials <tutorials/index>`.
**Important** - monitoring memory usage
---------------------------------------
Be aware of the built-in tools for monitoring and restricting **expam**'s memory usage,
outlined :ref:`here <limiting resource usage>`.
Indices and tables
==================
......
expam's Tree module
===================
**expam**'s tree module
=======================
A programmatic API to interact with phylogenetic trees, particularly those used in reference databases.
......
......@@ -78,8 +78,8 @@ What do I do with splits?
The :code:`--cutoff` flag sets a minimum count that any clade/taxa needs to reach before it is included in the classification results.
The :code:`--cpm` flag sets the same cutoff, but in **count per million** as opposed to a flat cutoff number.
When both are supplied, :code:`--cpm` takes precedence, and by default **expam** requires each node to have at least
100 counts per million input reads.
When both are supplied, the highest of either cutoff is taken. *By default*, **expam** *requires each node to have at least
100 counts per million input reads*.
* With both these mechanisms in place, we can be more confident that high split counts in a particular region of the phylogeny are suggestive of novel sequence in the biological sample.
* The algorithm for classifying splits takes a conservative approach - **those who are interested only in a general profile can feel comfortable simply adding classification and split counts together to produce an overall profile.**
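The interaction between :code:`--cutoff` and :code:`--cpm` described above can be illustrated with a short sketch. This is not **expam**'s code: the function name :code:`effective_cutoff` and its parameter names are hypothetical, and the default of no flat cutoff is an assumption; only the "highest of either cutoff" rule and the 100 counts per million default come from the text.

```python
def effective_cutoff(total_reads, cutoff=0, cpm=100.0):
    """Minimum count a node must reach to be reported (illustrative sketch).

    total_reads -- number of input reads in the sample
    cutoff      -- flat minimum count (assumed to default to 0 here)
    cpm         -- minimum in counts per million input reads
    """
    # Convert the counts-per-million requirement into an absolute count.
    cpm_threshold = cpm * total_reads / 1_000_000
    # When both are supplied, the higher of the two thresholds applies.
    return max(cutoff, cpm_threshold)
```

For example, with two million input reads the default 100 counts per million corresponds to 200 reads, which overrides a flat cutoff of 50; with only 100,000 reads the same setting corresponds to 10 reads, so the flat cutoff of 50 applies instead.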
......@@ -93,7 +93,7 @@ Phylogenetic classification results
.. code-block:: console
$ expam run -db my_database -d /path/to/sample_one.fq --name sample_one
$ expam run -db my_database -d /path/to/sample_one.fq --out sample_one
* In :code:`./sample_one`, there will be a :code:`phy` subdirectory containing three files:
......@@ -199,12 +199,12 @@ Taxonomic results
.. code-block:: console
$ expam run -d /path/to/reads --name example --taxonomy
$ expam run -d /path/to/reads --out example --taxonomy
.. code-block:: console
$ expam run -d /path/to/reads --name example_one
$ expam to_taxonomy --name example_one
$ expam run -d /path/to/reads --out example_one
$ expam to_taxonomy --out example_one
* Where before the results directory contained only a :code:`phy` subdirectory, it will now also contain a :code:`tax` folder.
......@@ -214,6 +214,7 @@ Taxonomic sample summaries
* For each sample input file, **expam** will translate a corresponding taxonomic sample summary.
* These are tab-delimited matrices with nine columns:
1. **Taxon ID** - NCBI taxon id.
2. **Percent classified (cumulative)** - total percentage of reads in this sample classified at or below this taxon id.
3. **Total classified (cumulative)** - total number of reads classified at or below this taxon id.
......
......@@ -37,6 +37,8 @@ For instance, :code:`sample_one.fq.tar.gz` :math:`\rightarrow` :code:`sample_one
<hr>
.. _groups explanation:
Groups
^^^^^^
......@@ -110,4 +112,50 @@ Example of colour list
$ expam run ... --colour_list "#FF0000" "#00FF00" "#0000FF"
.. _itol integration:
iTOL integration
----------------
Rather than use :code:`ete3` for visualising classification results, supplying the
:code:`--itol` flag will instead create an :code:`itol` subdirectory within the output
folder containing two files:
* :code:`tree.nwk` - Newick format tree that can be inserted into iTOL.
* :code:`style.txt` - An iTOL formatted text document that contains all the information needed for iTOL to style the tree.
For instance, if we previously ran :code:`expam run --out my_run -d /some/samples` and
now run :code:`expam phylotree --out my_run --itol`, the corresponding files
would be located at
* :code:`my_run/itol_classified/tree.nwk`,
* :code:`my_run/itol_classified/style.txt`,
* :code:`my_run/itol_splits/tree.nwk`,
* :code:`my_run/itol_splits/style.txt`.
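Before uploading to iTOL, it can be handy to collect these paths in a script. The following is a minimal sketch that builds the four locations listed above from a results directory; the helper name :code:`itol_outputs` is hypothetical, and the subdirectory names are taken directly from the example rather than from **expam**'s source.

```python
from pathlib import Path


def itol_outputs(results_dir):
    """List the iTOL tree and style files expected under a results directory."""
    base = Path(results_dir)
    subdirs = ("itol_classified", "itol_splits")  # names as given in the example above
    files = ("tree.nwk", "style.txt")
    return [base / sub / name for sub in subdirs for name in files]


if __name__ == "__main__":
    # Report which of the expected files are present for the example run.
    for path in itol_outputs("my_run"):
        status = "found" if path.exists() else "missing"
        print(f"{path}: {status}")
```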
To use these files,
* Create a new tree in iTOL with :code:`tree.nwk`.
* Open this tree using the iTOL interface.
* Drag and drop :code:`style.txt` into the open tree interface, and iTOL will colour the tree accordingly.
.. note::
By default, iTOL will only colour the leaf labels and clades with the supplied
colours. Using the *Colored ranges* window that appears after dragging the style
sheet onto the tree, you can select the *Cover --> Clade* option for more
effective highlighting of the distributions.
An example is shown below.
.. figure:: includes/gtdb-itol-example.png
:width: 500
:align: center
:alt: iTOL tree
**Figure 2:** Example tree containing three different sample classification results
plotted in red, green and blue shades respectively.
......@@ -99,18 +99,13 @@ Running classifications
* We use the :code:`run` command to classify reads.
* These are paired reads, but for now we'll treat them as separate.
* By default, run results are stored in the :code:`results` database folder,
* here :code:`test/results`.
* This can be redirected using :code:`--out`.
* We can supply a :code:`--name` to label these results.
* We'll call this first run :code:`unpaired`.
* We supply the :code:`-o` or :code:`--out` flag to tell *expam* where to save classification results.
* *expam* automatically creates a :code:`results` subdirectory in the database directory, which is a convenient, though not required, place to keep classification results related to this database.
.. code-block:: console
$ expam run -db test -d /Users/seansolari/Documents/expam/test/data/reads/ --name unpaired_test
$ expam run -db test -d /Users/seansolari/Documents/expam/test/data/reads/ --out test/results/unpaired_test
Clearing old log files...
Results directory created at /Users/seansolari/Documents/Databases/test/results/unpaired_test.
Loading the map and phylogeny.
......@@ -207,7 +202,7 @@ Running paired data
.. code-block:: console
$ expam run -db test -d /Users/seansolari/Documents/expam/test/data/reads/ --name paired_test --paired
$ expam run -db test -d /Users/seansolari/Documents/expam/test/data/reads/ --out test/results/paired_test --paired
Clearing old log files...
Results directory created at /Users/seansolari/Documents/Databases/test/results/paired_test.
Loading the map and phylogeny.
......@@ -250,10 +245,11 @@ Taxonomic results
This saves space by only downloading the data required for your specific reference sequences.
* We will convert the previous :code:`paired_test` run to taxonomic format.
* Specify the path to the classification results folder using :code:`-o` or :code:`--out`.
.. code-block:: console
$ expam to_taxonomy -db test --name paired_test
$ expam to_taxonomy -db test --out test/results/paired_test
* Initialising node pool...
* Checking for polytomies...
......@@ -261,15 +257,15 @@ Taxonomic results
* Finalising index...
Phylogenetic tree written to /Users/seansolari/Documents/Databases/test/results/paired_test/phylotree.pdf!
* The results to convert are specified using the :code:`--name` flag.
* The results to convert are specified using the :code:`-o/--out` flag.
* This must point to the base of the results directory (ie. parent of :code:`phy` output).
* This must point to the base of the results directory (i.e. the parent of the :code:`phy` output directory).
* Taxonomic results can be found in :code:`tax` subdirectory within results folder (that you specified with :code:`--name`).
* Taxonomic results can be found in the :code:`tax` subdirectory within the results folder (that you specified with :code:`--out`).
.. code-block:: console
$ test/results/paired_test/tax/
$ ls test/results/paired_test/tax/
GCF_000005845.2_ASM584v2_genomic.gz_2.csv raw
$ head test/results/paired_test/tax/GCF_000005845.2_ASM584v2_genomic.gz_2.csv
c_perc c_cumul c_count s_perc s_cumul s_count rank scientific name
......
......@@ -5,7 +5,7 @@ from setuptools.extension import Extension
from Cython.Build import cythonize
import numpy as np
EXPAM_VERSION = (0, 0, 8)
EXPAM_VERSION = (0, 0, 9)
SOURCE = os.path.dirname(os.path.abspath(__file__))
......
......@@ -860,7 +860,7 @@ def main():
help="Length of simulated reads.",
metavar="[read length]")
parser.add_argument("-o", "--out", dest="out_url",
help="URL to save sequences.",
help="Where to save classification results.",
metavar="[out URL]")
parser.add_argument("-y", "--pile", dest="pile",
help="Number of genomes to pile at a time (or inf).",
......