Commit d0305f69 authored by Ubuntu

Initial commit of biodemo; starting from bionitio (python)

sudo: true
dist: trusty
language: python
python:
- "3.4"
before_install:
- ./.travis/install-dependencies.sh
script:
- ./functional_tests/biodemo-test.sh -p biodemo -d functional_tests/test_data
- ./.travis/unit-test.sh
#!/bin/sh
# Install Python dependencies
echo 'Python install'
(
pip install -r requirements-dev.txt
pip install .
)
#!/bin/bash
set -e
errors=0
# Run unit tests
python biodemo/biodemo_test.py || {
    echo "'python biodemo/biodemo_test.py' failed"
    let errors+=1
}
# Check program style
pylint -E biodemo/*.py || {
    echo 'pylint -E biodemo/*.py failed'
    let errors+=1
}
[ "$errors" -gt 0 ] && {
    echo "There were $errors errors found"
    exit 1
}
echo "Ok : Python specific tests"
Copyright 29 Nov 2018 Scott_Maxwell
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
[![travis](https://travis-ci.org/USERNAME/biodemo.svg?branch=master)](https://travis-ci.org/USERNAME/biodemo)
# Overview
This program reads one or more input FASTA files. For each file it computes a variety of statistics, and then prints a summary of the statistics as output.
In the examples below, `$` indicates the command line prompt.
# Licence
This program is released as open source software under the terms of the [MIT License](https://raw.githubusercontent.com/USERNAME/biodemo/master/LICENSE).
# Installing
Clone this repository:
```
$ git clone https://github.com/USERNAME/biodemo
```
Move into the repository directory:
```
$ cd biodemo
```
Biodemo can be installed using `pip` in a variety of ways:
1. Inside a virtual environment:
```
$ python3 -m venv biodemo_dev
$ source biodemo_dev/bin/activate
$ pip install -U /path/to/biodemo
```
2. Into the global package database for all users:
```
$ pip install -U /path/to/biodemo
```
3. Into the user package database (for the current user only):
```
$ pip install -U --user /path/to/biodemo
```
# General behaviour
Biodemo accepts zero or more FASTA filenames on the command line. If zero filenames are specified it reads a single FASTA file from the standard input device (stdin). Otherwise it reads each named FASTA file in the order specified on the command line. Biodemo reads each input FASTA file, computes various statistics about the contents of the file, and then displays a tab-delimited summary of the statistics as output. Each input file produces at most one output line of statistics. Each line of output is prefixed by the input filename or by the text "`stdin`" if the standard input device was used.
Biodemo processes each FASTA file one sequence at a time. Therefore the memory usage is proportional to the longest sequence in the file.
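The one-sequence-at-a-time behaviour can be illustrated with a minimal sketch. It assumes a simplified FASTA reader written only for illustration (the hypothetical `iter_sequence_lengths` helper below); biodemo itself relies on Biopython's `SeqIO.parse`, which holds one full record in memory at a time, whereas this sketch keeps only a running length.

```python
from io import StringIO


def iter_sequence_lengths(handle):
    """Yield the length of each FASTA record in turn (hypothetical helper).

    Only the running length of the current record is kept, so memory use
    does not grow with the number of records in the input.
    """
    length = None  # None until the first header line has been seen
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if length is not None:
                yield length
            length = 0
        elif length is not None:
            length += len(line)
    if length is not None:
        yield length


example = StringIO(">header1\nATGC\nAGG\n>header2\nTT\n")
lengths = list(iter_sequence_lengths(example))
print(lengths)
```

Text that appears before any header line is skipped in this sketch, so input without a header yields no sequences.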
An optional command line argument `--minlen` can be supplied. Sequences with length strictly less than the given value will be ignored by biodemo and do not contribute to the computed statistics. By default `--minlen` is set to zero.
These are the statistics computed by biodemo, for all sequences with length greater-than-or-equal-to `--minlen`:
* *NUMSEQ*: the number of sequences in the file satisfying the minimum length requirement.
* *TOTAL*: the total length of all the counted sequences.
* *MIN*: the minimum length of the counted sequences.
* *AVERAGE*: the average length of the counted sequences rounded down to an integer (shown as `AVG` in the output header).
* *MAX*: the maximum length of the counted sequences.
If there are zero sequences counted in a file, the values of MIN, AVERAGE and MAX cannot be computed. In that case biodemo will print a dash (`-`) in the place of the numerical value. Note that when `--minlen` is set to a value greater than zero it is possible that an input FASTA file does not contain any sequences with length greater-than-or-equal-to the specified value. If this situation arises biodemo acts in the same way as if there are no sequences in the file.
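The rules above can be summarised in a short sketch. It is not the program's own code: the `summarise` function is a hypothetical helper that takes a list of sequence lengths and returns the five output fields as strings, using a dash when nothing is counted.

```python
def summarise(lengths, minlen=0):
    """Return (NUMSEQ, TOTAL, MIN, AVERAGE, MAX) as strings (hypothetical helper)."""
    # Sequences strictly shorter than minlen are ignored.
    counted = [n for n in lengths if n >= minlen]
    if not counted:
        # MIN, AVERAGE and MAX cannot be computed for zero sequences.
        return ("0", "0", "-", "-", "-")
    total = sum(counted)
    average = total // len(counted)  # rounded down to an integer
    fields = (len(counted), total, min(counted), average, max(counted))
    return tuple(str(value) for value in fields)


print("\t".join(summarise([7, 2])))
print("\t".join(summarise([7, 2], minlen=3)))
print("\t".join(summarise([7, 2], minlen=8)))
```

With lengths 7 and 2, a `minlen` of 3 leaves one counted sequence, and a `minlen` of 8 leaves none, which produces the dashes.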
## Help message
Biodemo can display usage information on the command line via the `-h` or `--help` argument:
```
$ biodemo -h
usage: biodemo [-h] [--minlen N] [--version] [--log LOG_FILE]
               [FASTA_FILE [FASTA_FILE ...]]

Read one or more FASTA files, compute simple stats for each file

positional arguments:
  FASTA_FILE      Input FASTA files

optional arguments:
  -h, --help      show this help message and exit
  --minlen N      Minimum length sequence to include in stats (default 0)
  --version       show program's version number and exit
  --log LOG_FILE  record program progress in LOG_FILE
```
## Reading FASTA files named on the command line
Biodemo accepts zero or more named FASTA files on the command line. These must be specified following all other command line arguments. If zero files are named, biodemo will read a single FASTA file from the standard input device (stdin).
There are no restrictions on the names of the FASTA files. FASTA filenames often end in `.fa` or `.fasta`, but that is merely a convention and is not enforced by biodemo.
The example below illustrates biodemo applied to a single named FASTA file called `file1.fa`:
```
$ biodemo file1.fa
FILENAME NUMSEQ TOTAL MIN AVG MAX
file1.fa 5264 3801855 31 722 53540
```
The example below illustrates biodemo applied to three named FASTA files called `file1.fa`, `file2.fa` and `file3.fa`:
```
$ biodemo file1.fa file2.fa file3.fa
FILENAME NUMSEQ TOTAL MIN AVG MAX
file1.fa 5264 3801855 31 722 53540
file2.fa 5264 3801855 31 722 53540
file3.fa 5264 3801855 31 722 53540
```
## Reading a single FASTA file from standard input
The example below illustrates biodemo reading a FASTA file from standard input. In this example we have redirected the contents of a file called `file1.fa` into the standard input using the shell redirection operator `<`:
```
$ biodemo < file1.fa
FILENAME NUMSEQ TOTAL MIN AVG MAX
stdin 5264 3801855 31 722 53540
```
Equivalently, you could achieve the same result by piping a FASTA file into biodemo:
```
$ cat file1.fa | biodemo
FILENAME NUMSEQ TOTAL MIN AVG MAX
stdin 5264 3801855 31 722 53540
```
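The choice between named files and standard input can be sketched as follows. This is an illustration under assumptions, not the program's own code: `input_sources` is a hypothetical helper, and a `StringIO` object stands in for the real standard input.

```python
import sys
from io import StringIO


def input_sources(filenames, stdin=sys.stdin):
    """Yield (label, open_handle) pairs for each input (hypothetical helper)."""
    if filenames:
        # Named files are read in the order given on the command line.
        for name in filenames:
            with open(name) as handle:
                yield name, handle
    else:
        # With no filenames, a single input is read from stdin.
        yield "stdin", stdin


fake_stdin = StringIO(">header\nATGC\n")
labels = [label for label, _handle in input_sources([], stdin=fake_stdin)]
print(labels)
```

The label is what prefixes each output line, which is why the examples above show `stdin` in the `FILENAME` column.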
## Filtering sequences by length
Biodemo provides an optional command line argument `--minlen` which causes it to ignore (not count) any sequences in the input FASTA files with length strictly less than the supplied value.
The example below illustrates biodemo applied to a single FASTA file called `file1.fa` with a `--minlen` filter of `1000`:
```
$ biodemo --minlen 1000 file1.fa
FILENAME NUMSEQ TOTAL MIN AVG MAX
file1.fa 4711 2801855 1021 929 53540
```
## Logging
If the `--log LOG_FILE` command line argument is specified, biodemo will write a log file containing information about program progress. The log file includes the command line used to execute the program, and a note indicating which files have been processed so far. Events in the log file are annotated with their date and time of occurrence.
```
$ biodemo --log bt.log file1.fasta file2.fasta
# normal biodemo output appears here
# contents of log file displayed below
```
```
$ cat bt.log
12-04-2016 19:14:47 INFO - program started
12-04-2016 19:14:47 INFO - command line: /usr/local/bin/biodemo --log bt.log file1.fasta file2.fasta
12-04-2016 19:14:47 INFO - Processing FASTA file from file1.fasta
12-04-2016 19:14:47 INFO - Processing FASTA file from file2.fasta
```
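A minimal sketch of this kind of timestamped log, using Python's standard `logging` module, is shown below. The format and date format strings mirror the ones in the program source; the logger name `biodemo_sketch` and the temporary log path are assumptions made for the example. The sketch uses a dedicated logger with a `FileHandler` rather than `logging.basicConfig`, so that it behaves the same even if the root logger is already configured.

```python
import logging
import os
import tempfile

# Hypothetical log location, used only for this example.
log_path = os.path.join(tempfile.mkdtemp(), "bt.log")

logger = logging.getLogger("biodemo_sketch")
logger.setLevel(logging.DEBUG)
handler = logging.FileHandler(log_path, mode="w")
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s - %(message)s",
                      datefmt="%m-%d-%Y %H:%M:%S"))
logger.addHandler(handler)

logger.info("program started")
logger.info("Processing FASTA file from %s", "file1.fasta")

# Flush and release the file before reading it back.
handler.close()
logger.removeHandler(handler)

with open(log_path) as log_file:
    log_lines = log_file.read().splitlines()
print(log_lines)
```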
## Empty files
It is possible that the input FASTA file contains zero sequences, or, when the `--minlen` command line argument is used, that it contains no sequences of length greater-than-or-equal-to the supplied value. In both cases biodemo cannot compute minimum, maximum or average sequence lengths, and prints a dash (`-`) in place of each of those values.
The example below illustrates biodemo applied to a single FASTA file called `empty.fa` which contains zero sequences:
```
$ biodemo empty.fa
FILENAME NUMSEQ TOTAL MIN AVG MAX
empty.fa 0 0 - - -
```
# Exit status values
Biodemo returns the following exit status values:
* 0: The program completed successfully.
* 1: File I/O error. This can occur if at least one of the input FASTA files cannot be opened for reading. This can occur because the file does not exist at the specified path, or biodemo does not have permission to read from the file.
* 2: A command line error occurred. This can happen if the user specifies an incorrect command line argument. In this circumstance biodemo will also print a usage message to the standard error device (stderr).
* 3: Input FASTA file is invalid. This can occur if biodemo can read an input file but the file format is invalid.
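A calling script could interpret these values along the following lines. This is a sketch under assumptions: `describe_exit_status` is a hypothetical helper, and a child Python process that exits with status 3 stands in for a real biodemo invocation so that the example runs without biodemo installed.

```python
import subprocess
import sys

# Meanings of the exit status values listed above.
EXIT_MEANINGS = {
    0: "success",
    1: "file I/O error",
    2: "command line error",
    3: "invalid FASTA input",
}


def describe_exit_status(status):
    """Return a short description of an exit status (hypothetical helper)."""
    return EXIT_MEANINGS.get(status, "unknown exit status {}".format(status))


# Stand-in for a biodemo run; a real call might use ["biodemo", "file1.fa"].
completed = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])
print(describe_exit_status(completed.returncode))
```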
# Testing
## Unit tests
```
$ cd biodemo
$ python -m unittest -v biodemo_test
```
## Test suite
A set of sample test input files is provided in the `functional_tests/test_data` folder.
```
$ biodemo two_sequence.fasta
FILENAME NUMSEQ TOTAL MIN AVG MAX
two_sequence.fasta 2 357 120 178 237
```
# Bug reporting and feature requests
Please submit bug reports and feature requests to the issue tracker on GitHub:
[biodemo issue tracker](https://github.com/USERNAME/biodemo/issues)
'''
Module : Main
Description : The main entry point for the program.
Copyright : (c) Scott_Maxwell, 29 Nov 2018
License : MIT
Maintainer : scott.maxwell1@monash.edu
Portability : POSIX
The program reads one or more input FASTA files. For each file it computes a
variety of statistics, and then prints a summary of the statistics as output.
'''
from argparse import ArgumentParser
from math import floor
import sys
import logging
import pkg_resources
from Bio import SeqIO
EXIT_FILE_IO_ERROR = 1
EXIT_COMMAND_LINE_ERROR = 2
EXIT_FASTA_FILE_ERROR = 3
DEFAULT_MIN_LEN = 0
DEFAULT_VERBOSE = False
HEADER = 'FILENAME\tNUMSEQ\tTOTAL\tMIN\tAVG\tMAX'
PROGRAM_NAME = "biodemo"
try:
    PROGRAM_VERSION = pkg_resources.require(PROGRAM_NAME)[0].version
except pkg_resources.DistributionNotFound:
    PROGRAM_VERSION = "undefined_version"

def exit_with_error(message, exit_status):
    '''Print an error message to stderr, prefixed by the program name and 'ERROR'.
    Then exit program with supplied exit status.

    Arguments:
        message: an error message as a string.
        exit_status: a positive integer representing the exit status of the
            program.
    '''
    logging.error(message)
    print("{} ERROR: {}, exiting".format(PROGRAM_NAME, message), file=sys.stderr)
    sys.exit(exit_status)

def parse_args():
    '''Parse command line arguments.
    Returns Options object with command line argument values as attributes.
    Will exit the program on a command line error.
    '''
    description = 'Read one or more FASTA files, compute simple stats for each file'
    parser = ArgumentParser(description=description)
    parser.add_argument(
        '--minlen',
        metavar='N',
        type=int,
        default=DEFAULT_MIN_LEN,
        help='Minimum length sequence to include in stats (default {})'.format(
            DEFAULT_MIN_LEN))
    parser.add_argument('--version',
                        action='version',
                        version='%(prog)s ' + PROGRAM_VERSION)
    parser.add_argument('--log',
                        metavar='LOG_FILE',
                        type=str,
                        help='record program progress in LOG_FILE')
    parser.add_argument('fasta_files',
                        nargs='*',
                        metavar='FASTA_FILE',
                        type=str,
                        help='Input FASTA files')
    return parser.parse_args()

class FastaStats(object):
    '''Compute various statistics for a FASTA file:

    num_seqs: the number of sequences in the file satisfying the minimum
       length requirement (minlen_threshold).
    num_bases: the total length of all the counted sequences.
    min_len: the minimum length of the counted sequences.
    max_len: the maximum length of the counted sequences.
    average: the average length of the counted sequences rounded down
       to an integer.
    '''
    #pylint: disable=too-many-arguments
    def __init__(self,
                 num_seqs=None,
                 num_bases=None,
                 min_len=None,
                 max_len=None,
                 average=None):
        "Build an empty FastaStats object"
        self.num_seqs = num_seqs
        self.num_bases = num_bases
        self.min_len = min_len
        self.max_len = max_len
        self.average = average

    def __eq__(self, other):
        "Two FastaStats objects are equal iff their attributes are equal"
        if type(other) is type(self):
            return self.__dict__ == other.__dict__
        return False

    def __repr__(self):
        "Generate a printable representation of a FastaStats object"
        return "FastaStats(num_seqs={}, num_bases={}, min_len={}, max_len={}, " \
            "average={})".format(
                self.num_seqs, self.num_bases, self.min_len, self.max_len,
                self.average)

    def from_file(self, fasta_file, minlen_threshold=DEFAULT_MIN_LEN):
        '''Compute a FastaStats object from an input FASTA file.

        Arguments:
           fasta_file: an open file object for the FASTA file
           minlen_threshold: the minimum length sequence to consider in
              computing the statistics. Sequences in the input FASTA file
              which have a length less than this value are ignored and not
              considered in the resulting statistics.
        Result:
           A FastaStats object
        '''
        num_seqs = num_bases = 0
        min_len = max_len = None
        for seq in SeqIO.parse(fasta_file, "fasta"):
            this_len = len(seq)
            if this_len >= minlen_threshold:
                if num_seqs == 0:
                    min_len = max_len = this_len
                else:
                    min_len = min(this_len, min_len)
                    max_len = max(this_len, max_len)
                num_seqs += 1
                num_bases += this_len
        if num_seqs > 0:
            self.average = int(floor(float(num_bases) / num_seqs))
        else:
            self.average = None
        self.num_seqs = num_seqs
        self.num_bases = num_bases
        self.min_len = min_len
        self.max_len = max_len
        return self

    def pretty(self, filename):
        '''Generate a pretty printable representation of a FastaStats object
        suitable for output of the program. The output is a tab-delimited
        string containing the filename of the input FASTA file followed by
        the attributes of the object. If 0 sequences were read from the FASTA
        file then num_seqs and num_bases are output as 0, and min_len, average
        and max_len are output as a dash "-".

        Arguments:
           filename: the name of the input FASTA file
        Result:
           A string suitable for pretty printed output
        '''
        if self.num_seqs > 0:
            num_seqs = str(self.num_seqs)
            num_bases = str(self.num_bases)
            min_len = str(self.min_len)
            average = str(self.average)
            max_len = str(self.max_len)
        else:
            num_seqs = num_bases = "0"
            min_len = average = max_len = "-"
        return "\t".join([filename, num_seqs, num_bases, min_len, average,
                          max_len])

def process_files(options):
    '''Compute and print FastaStats for each input FASTA file specified on the
    command line. If no FASTA files are specified on the command line then
    read from the standard input (stdin).

    Arguments:
       options: the command line options of the program
    Result:
       None
    '''
    if options.fasta_files:
        for fasta_filename in options.fasta_files:
            logging.info("Processing FASTA file from %s", fasta_filename)
            try:
                fasta_file = open(fasta_filename)
            except IOError as exception:
                exit_with_error(str(exception), EXIT_FILE_IO_ERROR)
            else:
                with fasta_file:
                    stats = FastaStats().from_file(fasta_file, options.minlen)
                    print(stats.pretty(fasta_filename))
    else:
        logging.info("Processing FASTA file from stdin")
        stats = FastaStats().from_file(sys.stdin, options.minlen)
        print(stats.pretty("stdin"))

def init_logging(log_filename):
    '''If the log_filename is defined, then
    initialise the logging facility, and write log statement
    indicating the program has started, and also write out the
    command line from sys.argv

    Arguments:
        log_filename: either None, if logging is not required, or the
            string name of the log file to write to
    Result:
        None
    '''
    if log_filename is not None:
        logging.basicConfig(filename=log_filename,
                            level=logging.DEBUG,
                            filemode='w',
                            format='%(asctime)s %(levelname)s - %(message)s',
                            datefmt='%m-%d-%Y %H:%M:%S')
        logging.info('program started')
        logging.info('command line: %s', ' '.join(sys.argv))


def main():
    "Orchestrate the execution of the program"
    options = parse_args()
    init_logging(options.log)
    print(HEADER)
    process_files(options)


# If this script is run from the command line then call the main function.
if __name__ == '__main__':
    main()
'''
Unit tests for biodemo.
Usage: python -m unittest -v biodemo_test
'''
import unittest
from io import StringIO
#pylint: disable=no-name-in-module
from biodemo import FastaStats


class TestFastaStats(unittest.TestCase):
    '''Unit tests for FastaStats'''
    def do_test(self, input_str, minlen, expected):
        "Wrapper function for testing FastaStats"
        result = FastaStats().from_file(StringIO(input_str), minlen)
        self.assertEqual(expected, result)

    def test_zero_byte_input(self):
        "Test input containing zero bytes"
        expected = FastaStats(num_seqs=0,
                              num_bases=0,
                              min_len=None,
                              max_len=None,
                              average=None)
        self.do_test('', 0, expected)

    def test_single_newline_input(self):
        "Test input containing a newline (\\n) character"
        expected = FastaStats(num_seqs=0,
                              num_bases=0,
                              min_len=None,
                              max_len=None,
                              average=None)
        self.do_test('\n', 0, expected)

    def test_single_greater_than_input(self):
        "Test input containing a single greater-than (>) character"
        expected = FastaStats(num_seqs=1,
                              num_bases=0,
                              min_len=0,
                              max_len=0,
                              average=0)
        self.do_test('>', 0, expected)

    def test_one_sequence(self):
        "Test input containing one sequence"
        expected = FastaStats(num_seqs=1,
                              num_bases=5,
                              min_len=5,
                              max_len=5,
                              average=5)
        self.do_test(">header\nATGC\nA", 0, expected)

    def test_two_sequences(self):
        "Test input containing two sequences"
        expected = FastaStats(num_seqs=2,
                              num_bases=9,
                              min_len=2,
                              max_len=7,
                              average=4)
        self.do_test(">header1\nATGC\nAGG\n>header2\nTT\n", 0, expected)

    def test_no_header(self):
        "Test input containing sequence without preceding header"
        expected = FastaStats(num_seqs=0,
                              num_bases=0,
                              min_len=None,
                              max_len=None,
                              average=None)
        self.do_test("no header\n", 0, expected)

    def test_minlen_less_than_all(self):
        "Test input when --minlen is less than or equal to 2 out of 2 sequences"
        expected = FastaStats(num_seqs=2,
                              num_bases=9,
                              min_len=2,
                              max_len=7,
                              average=4)
        self.do_test(">header1\nATGC\nAGG\n>header2\nTT\n", 2, expected)

    def test_minlen_greater_than_one(self):
        "Test input when --minlen is less than 1 out of 2 sequences"
        expected = FastaStats(num_seqs=1,
                              num_bases=7,
                              min_len=7,
                              max_len=7,
                              average=7)
        self.do_test(">header1\nATGC\nAGG\n>header2\nTT\n", 3, expected)

    def test_minlen_greater_than_all(self):
        "Test input when --minlen is greater than 2 out of 2 sequences"
        expected = FastaStats(num_seqs=0,
                              num_bases=0,
                              min_len=None,
                              max_len=None,
                              average=None)
        self.do_test(">header1\nATGC\nAGG\n>header2\nTT\n", 8, expected)


if __name__ == '__main__':
    unittest.main()
#!/usr/bin/env bash
# 1. Parse command line arguments
# 2. cd to the test directory
# 3. run tests
# 4. Print summary of successes and failures, exit with 0 if
# all tests pass, else exit with 1
# Uncomment the line below if you want more debugging information
# about this script.
#set -x
# The name of this test script
this_program_name="biodemo-test.sh"
# The program we want to test (either a full path to an executable, or the name of an executable in $PATH)
test_program=""
# Directory containing the test data files and expected outputs
test_data_dir=""
# Number of failed test cases
num_errors=0
# Total number of tests run
num_tests=0
function show_help {
cat << UsageMessage
${this_program_name}: run integration/regression tests for biodemo
Usage:
${this_program_name} [-h] [-v] -p program -d test_data_dir