Basic Use

Given a set of protein sequences in the file <INPUT>.fasta as input, the path to an output directory as <OUTPUT>, and the user’s email address [1] as <EMAIL>, the following command will query the NCBI databases for nucleotide coding sequences corresponding to the input, and write the results to files in <OUTPUT>:

$ ncfp <INPUT>.fasta <OUTPUT> <EMAIL>

The output directory <OUTPUT> will contain at least two files: <OUTPUT>/ncfp_aa.fasta and <OUTPUT>/ncfp_nt.fasta. The <OUTPUT>/ncfp_aa.fasta file will contain sequences for which a corresponding coding sequence could be found, and <OUTPUT>/ncfp_nt.fasta will contain those coding sequences.

The sequences in <OUTPUT>/ncfp_nt.fasta are trimmed and validated by ncfp so that the conceptual translation of each is identical to the corresponding input protein sequence.
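
The validation idea can be illustrated with a short, standalone Python sketch. This is not ncfp's own code: the names translate and matches_protein are hypothetical, and the sketch assumes the standard genetic code and a coding sequence that starts in frame. It translates a nucleotide sequence up to the first stop codon and compares the result with a protein sequence.

```python
# Standard genetic code, laid out in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    cds = cds.upper()
    protein = []
    # Ignore any trailing bases that do not form a complete codon.
    for pos in range(0, len(cds) - len(cds) % 3, 3):
        # Codons containing ambiguity symbols are reported as "X".
        residue = CODON_TABLE.get(cds[pos:pos + 3], "X")
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


def matches_protein(cds: str, protein: str) -> bool:
    """Hypothetical check: does the CDS translate exactly to the protein?"""
    return translate(cds) == protein.upper()
```

Applied to the XP_004520832.1 example in the NCBI header format section, translate() on the first line of the coding sequence gives MRTKFVLVFALIVCVLNGLG, which is the start of the input protein.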

Any protein sequences for which a partner nucleotide coding sequence could not be found will be written to the file <OUTPUT>/skipped.fas.

Input sequence formats

Input protein sequences must be provided in FASTA format, and ncfp expects input sequence headers to take one of two forms: “NCBI” or “UniProt”. By default, ncfp expects sequences to be in NCBI format:

$ ncfp <INPUT>.fasta <OUTPUT> <EMAIL>

For sequence input in UniProt format, one of the -u or --uniprot options must be used, e.g.

$ ncfp -u <INPUT>.fasta <OUTPUT> <EMAIL>
$ ncfp --uniprot <INPUT>.fasta <OUTPUT> <EMAIL>

NCBI header format

In NCBI header format, the sequence identifier is expected to correspond to a valid NCBI protein sequence accession, e.g.

>XP_004520832.1 kunitz-type serine protease inhibitor homolog dendrotoxin I-like [Ceratitis capitata]
MRTKFVLVFALIVCVLNGLGEAQRPAHCLQPHPQGVGRCDMLISGFFYNSERNECEQWTEEGCRVQGGHT
YDFKEDCVNECIEIN

If a coding sequence is identified successfully, the output nucleotide sequence header will have the same accession as a sequence identifier, e.g.

>XP_004520832.1 coding sequence
ATGAGAACTAAATTTGTTTTGGTATTCGCGCTCATTGTTTGTGTACTCAACGGTTTAGGT
GAAGCGCAAAGACCAGCACATTGCTTACAACCACATCCACAAGGAGTTGGCCGTTGTGAT
ATGCTTATCAGTGGTTTCTTCTATAACTCGGAGCGTAATGAGTGCGAGCAATGGACAGAG
GAGGGCTGCCGTGTGCAGGGTGGGCACACATACGATTTCAAAGAAGATTGTGTAAATGAG
TGCATTGAAATTAATTAA
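
The relationship between the input and output headers can be sketched in Python. This is an illustration only and not ncfp's implementation: the function names are hypothetical, and the sketch assumes the accession is the first whitespace-delimited token of the header line.

```python
def ncbi_accession(header: str) -> str:
    """Return the sequence identifier (first token) of a FASTA header line."""
    tokens = header.lstrip(">").split()
    if not tokens:
        raise ValueError("FASTA header has no sequence identifier")
    return tokens[0]


def coding_sequence_header(accession: str) -> str:
    """Build an output header in the style shown in the example above."""
    return f">{accession} coding sequence"


header = (
    ">XP_004520832.1 kunitz-type serine protease inhibitor homolog "
    "dendrotoxin I-like [Ceratitis capitata]"
)
print(coding_sequence_header(ncbi_accession(header)))
```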

UniProt header format

In UniProt header format, the sequence description string is expected to correspond to a UniProt download and contain the GN gene identifier key:value pair, e.g.

>tr|A0A1V9Y7A7|A0A1V9Y7A7_9STRA Lon protease homolog OS=Thraustotheca clavata GN=THRCLA_11583 PE=3 SV=1
MYRASSKVTSAHNDGIWSTVWTSRNQIISGSLDEVVKSWDASSSEDNAILPVVKQFPGHV
LGTLAVTATKDGRKAATSSLDCQVRILNLESGGIEKTIDTGAGESWQLVYSPDDTFIATG
SQQSKINLINLEQEKIVNSIPVDGKFILAVAYSPDGKHLACGTFEGIVAIYDVETGKQVQ
KYQDRAKPVRSISYSPDGSFLLAASDDMHVNIYDVLHSSLVGSVSGHISWILSVACSPDG

If a coding sequence is identified successfully, the output nucleotide sequence header will have the gene accession as its sequence identifier, e.g.

>THRCLA_11583 coding sequence
ATGTACCGCGCCTCGTCCAAAGTAACGTCGGCTCATAATGATGGAATCTGGAGTACTGTC
TGGACAAGCCGCAATCAAATCATAAGTGGATCTTTGGATGAAGTGGTCAAGAGCTGGGAT
GCGAGTAGTTCCGAGGACAATGCGATTTTGCCTGTTGTCAAGCAATTTCCAGGCCACGTT
CTAGGCACACTGGCAGTGACTGCAACGAAAGATGGTCGAAAAGCTGCTACATCGTCTTTA
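
As a rough illustration of how the GN key:value pair can be read from a UniProt header, the following Python sketch uses a regular expression. It is not taken from ncfp: the name uniprot_gene_id is hypothetical, and the sketch assumes the gene identifier contains no whitespace.

```python
import re
from typing import Optional

# Matches "GN=<value>", where <value> runs up to the next whitespace.
GN_PATTERN = re.compile(r"\bGN=(\S+)")


def uniprot_gene_id(header: str) -> Optional[str]:
    """Return the GN value from a UniProt FASTA header, or None if absent."""
    match = GN_PATTERN.search(header)
    return match.group(1) if match else None


header = (
    ">tr|A0A1V9Y7A7|A0A1V9Y7A7_9STRA Lon protease homolog "
    "OS=Thraustotheca clavata GN=THRCLA_11583 PE=3 SV=1"
)
print(uniprot_gene_id(header))
```

A header with no GN field yields None in this sketch, so a caller can decide how to treat such a sequence.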

Stockholm domain format

UniProt and other sources use Stockholm format to indicate that an amino acid sequence represents a portion of a protein (such as a domain). ncfp can recognise this format and trim the coding sequence to correspond only to the specified region of the protein.

Stockholm format domains are indicated by the syntax /<start>-<stop> immediately following the sequence identifier in FASTA format, e.g.

>tr|B7G6L2|B7G6L2_PHATC/43-112 [subseq from] Predicted protein OS=Phaeodactylum tricornutum (strain CCAP 1055/1) GN=PHATRDRAFT_48282 PE=4 SV=1
-----------------------------SLCV-EVAGA-SQD---DGASIFQGDCN-dG
NKHQVFDFipaPG---TdsgFHRIRA--SHSN-KCLGVADGAL--APG-AEVVQ-

To restrict the coding sequence to the region indicated in Stockholm format, pass either the -s or --stockholm option, e.g.

$ ncfp -u -s <INPUT>.fasta <OUTPUT> <EMAIL>
$ ncfp --uniprot --stockholm <INPUT>.fasta <OUTPUT> <EMAIL>

The output nucleotide sequence does not preserve the Stockholm format location information, nor does it preserve sequence gap symbols:

>PHATRDRAFT_48282 coding sequence
TCGCTCTGCGTGGAGGTGGCTGGAGCGAGCCAAGACGACGGGGCCTCCATATTTCAAGGG
GATTGTAATGACGGAAACAAGCATCAAGTCTTCGACTTCATTCCTGCTCCCGGTACAGAC
AGCGGTTTTCATCGAATTCGAGCCTCGCACTCCAACAAGTGCCTTGGCGTGGCTGATGGG
GCTTTAGCACCTGGAGCTGAGGTAGTGCAA
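
The coordinate arithmetic behind this trimming can be sketched in Python. This is a minimal illustration, not ncfp's implementation: all function names are hypothetical, and it assumes the /<start>-<stop> coordinates are 1-based and inclusive, which is consistent with the example above (residues 43-112 are 70 amino acids, and the output is 210 nucleotides).

```python
import re
from typing import Optional, Tuple

# "<seqid>/<start>-<stop>", anchored at the end of the identifier.
REGION_PATTERN = re.compile(r"^(?P<seqid>\S+?)/(?P<start>\d+)-(?P<stop>\d+)$")


def parse_region(identifier: str) -> Optional[Tuple[str, int, int]]:
    """Split 'seqid/start-stop' into (seqid, start, stop); None if no region."""
    match = REGION_PATTERN.match(identifier)
    if match is None:
        return None
    return match["seqid"], int(match["start"]), int(match["stop"])


def nucleotide_slice(start: int, stop: int) -> Tuple[int, int]:
    """Convert 1-based inclusive residue coordinates (an assumption) to a
    0-based, half-open slice on the full coding sequence."""
    return 3 * (start - 1), 3 * stop


def strip_gaps(aligned: str) -> str:
    """Remove alignment gap symbols and normalise case."""
    return "".join(ch for ch in aligned if ch not in "-.").upper()


seqid, start, stop = parse_region("tr|B7G6L2|B7G6L2_PHATC/43-112")
begin, end = nucleotide_slice(start, stop)
print(seqid, begin, end, end - begin)
```

With a full-length coding sequence in a string cds, the region would then be cds[begin:end] under these assumptions.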

[1] The user’s email address is passed to NCBI to enable them to monitor use of their service and provide support.