QuickStart Guide

Installation

Using conda

ncfp is available through the bioconda channel of Anaconda:

conda install -c bioconda ncfp

From source

At the command-line, use git to clone the current version of the ncfp repository:

git clone git@github.com:widdowquinn/ncfp.git

Change to the newly-created ncfp subdirectory:

cd ncfp

Install the package and program, using the setup.py file:

python setup.py install

(other installation methods can be found on the Installation page)

ncfp Example

To see options available for the ncfp program, use the -h (help) option:

ncfp -h

In the ncfp/tests/test_input/sequences subdirectory there is a file called input_ncbi.fasta. This contains a number of protein sequences in FASTA format. The identifier for each sequence in this file is a valid NCBI sequence identifier.

Using ncfp, to obtain a corresponding nucleotide coding sequence for each protein, issue the following command (substituting your own email address, where indicated):

ncfp tests/test_input/sequences/input_ncbi.fasta \
     example_output \
     my.name@my.domain

You should see progress bars appear for processing of the input protein, sequences, searching those sequences against the remote NCBI databases, then retrieving the corresponding sequence identifiers, GenBank headers and finally the full GenBank records.

On completion, a list of the recovered sequences will be presented, and the directory example_output will be created, with the following contents:

$ tree example_output/
example_output/
├── ncfp_aa.fasta
└── ncfp_nt.fasta

The two files should contain corresponding amino acid and nucleotide sequences:

$ head example_output/*.fasta
==> example_output/ncfp_aa.fasta <==
>XP_004520832.1 kunitz-type serine protease inhibitor homolog dendrotoxin I-like [Ceratitis capitata]
MRTKFVLVFALIVCVLNGLGEAQRPAHCLQPHPQGVGRCDMLISGFFYNSERNECEQWTE
EGCRVQGGHTYDFKEDCVNECIEIN
>XP_017966559.1 PREDICTED: kunitz-type serine protease inhibitor homolog dendrotoxin I-like [Drosophila navojoa]
MKFILLLACLCVYVATLEAQRPPCKGIVPPWLTNCVGGKNEGRGNLRSCARNANSRMWWY
DSRSRSCKKMAYKGCGGNRNRYCTREACRRACRRRN
>XP_017841791.1 PREDICTED: kunitz-type serine protease inhibitor homolog dendrotoxin K-like [Drosophila busckii]
MKVCLILSALVLQYIVFVNAEGCPLRPAEQNCQSSRNVGVSSYSNCILTKRLMWYYNPTI
RDCLPLDFRGCGGNGNRYCSLKDCQQSCKHT
>XP_017046608.1 PREDICTED: kunitz-type serine protease inhibitor homolog dendrotoxin I [Drosophila ficusphila]

==> example_output/ncfp_nt.fasta <==
>XP_004520832.1 coding sequence
ATGAGAACTAAATTTGTTTTGGTATTCGCGCTCATTGTTTGTGTACTCAACGGTTTAGGT
GAAGCGCAAAGACCAGCACATTGCTTACAACCACATCCACAAGGAGTTGGCCGTTGTGAT
ATGCTTATCAGTGGTTTCTTCTATAACTCGGAGCGTAATGAGTGCGAGCAATGGACAGAG
GAGGGCTGCCGTGTGCAGGGTGGGCACACATACGATTTCAAAGAAGATTGTGTAAATGAG
TGCATTGAAATTAATTAA
>XP_017966559.1 coding sequence
ATGAAATTCATTCTGCTCCTCGCTTGTCTCTGCGTCTACGTGGCCACCCTTGAGGCTCAG
CGACCCCCTTGCAAGGGAATAGTGCCTCCATGGTTGACCAATTGTGTTGGAGGCAAGAAC
GAGGGCAGGGGTAACCTTCGCTCGTGCGCCAGGAACGCGAATTCCAGAATGTGGTGGTAT