Basic Use¶
Given a set of protein sequences in the file <INPUT>.fasta as input, the path to an output directory as <OUTPUT>, and the user’s email address [1] as <EMAIL>, the following command will query the NCBI databases for nucleotide coding sequences corresponding to the input, and write the results to files in <OUTPUT>:
ncfp <INPUT>.fasta <OUTPUT> <EMAIL>
The output directory <OUTPUT> will contain at least two files: <OUTPUT>/ncfp_aa.fasta and <OUTPUT>/ncfp_nt.fasta.
The <OUTPUT>/ncfp_aa.fasta file will contain the input protein sequences for which a corresponding coding sequence could be found, and <OUTPUT>/ncfp_nt.fasta will contain those coding sequences.
The sequences in <OUTPUT>/ncfp_nt.fasta are trimmed and validated by ncfp so that their conceptual translation is identical to the corresponding input protein sequence.
Any protein sequences for which a partner nucleotide coding sequence could not be found will be written to the file <OUTPUT>/skipped.fas.
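The validation described above (a coding sequence whose conceptual translation is identical to the input protein) can be illustrated with a minimal sketch. It is not ncfp's own implementation: the function names translate and matches_protein are hypothetical, and the sketch assumes the standard genetic code (NCBI translation table 1) and a coding sequence that may end in a single stop codon.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate a coding sequence, dropping one trailing stop codon if present."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    # Codons containing ambiguous bases are translated as X.
    protein = "".join(
        CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, len(cds), 3)
    )
    return protein[:-1] if protein.endswith("*") else protein


def matches_protein(cds: str, protein: str) -> bool:
    """Return True if the CDS translates exactly to the protein sequence."""
    return translate(cds) == protein.upper()
```

A check of this kind could be run over paired records from the two output files to confirm that each nucleotide sequence encodes its partner protein.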
Input sequence formats¶
Input protein sequences must be provided in FASTA format.
ncfp expects input sequence headers to take one of two forms: “NCBI” or “UniProt”. ncfp will guess at sequence format/origin on the basis of the ID and description fields. By default, sequences will be assumed to be in NCBI format unless:
- the sequence ID conforms to a UniProt UID
- the sequence ID conforms to a UniParc UID (these sequences will be skipped as we should not expect a unique coding sequence)
ncfp <INPUT>.fasta <OUTPUT> <EMAIL>
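The format guess described in the list above can be sketched as follows. This is an illustration only, not ncfp's actual logic: the function name guess_format is hypothetical, the UniProt pattern is the accession format published by UniProt, the UniParc pattern assumes identifiers of the form UPI followed by ten hexadecimal characters, and splitting the ID on | is an assumption about how the accession is located.

```python
import re

# UniProtKB accession pattern as published by UniProt.
UNIPROT_RE = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)
# Assumed UniParc pattern: "UPI" followed by ten hexadecimal characters.
UNIPARC_RE = re.compile(r"UPI[A-F0-9]{10}")


def guess_format(seq_id: str) -> str:
    """Classify a FASTA sequence ID as 'uniparc', 'uniprot' or 'ncbi'."""
    # UniProt downloads use IDs such as tr|<accession>|<entry name>,
    # so each pipe-delimited field is tested separately.
    fields = seq_id.split("|")
    if any(UNIPARC_RE.fullmatch(field) for field in fields):
        return "uniparc"
    if any(UNIPROT_RE.fullmatch(field) for field in fields):
        return "uniprot"
    # Everything else falls back to the default NCBI interpretation.
    return "ncbi"
```

In this sketch a record classified as uniparc would be skipped, mirroring the behaviour described for UniParc IDs.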
NCBI header format¶
In the NCBI header format, the sequence identifier is expected to correspond to a valid NCBI protein sequence accession, e.g.
>XP_004520832.1 kunitz-type serine protease inhibitor homolog dendrotoxin I-like [Ceratitis capitata]
MRTKFVLVFALIVCVLNGLGEAQRPAHCLQPHPQGVGRCDMLISGFFYNSERNECEQWTEEGCRVQGGHT
YDFKEDCVNECIEIN
If a coding sequence is identified successfully, the output nucleotide sequence header will have the same accession as a sequence identifier, e.g.
>XP_004520832.1 coding sequence
ATGAGAACTAAATTTGTTTTGGTATTCGCGCTCATTGTTTGTGTACTCAACGGTTTAGGT
GAAGCGCAAAGACCAGCACATTGCTTACAACCACATCCACAAGGAGTTGGCCGTTGTGAT
ATGCTTATCAGTGGTTTCTTCTATAACTCGGAGCGTAATGAGTGCGAGCAATGGACAGAG
GAGGGCTGCCGTGTGCAGGGTGGGCACACATACGATTTCAAAGAAGATTGTGTAAATGAG
TGCATTGAAATTAATTAA
UniProt header format¶
In the UniProt header format, the sequence description string is expected to correspond to a UniProt download and contain the GN gene identifier key:value pair, e.g.
>tr|A0A1V9Y7A7|A0A1V9Y7A7_9STRA Lon protease homolog OS=Thraustotheca clavata GN=THRCLA_11583 PE=3 SV=1
MYRASSKVTSAHNDGIWSTVWTSRNQIISGSLDEVVKSWDASSSEDNAILPVVKQFPGHV
LGTLAVTATKDGRKAATSSLDCQVRILNLESGGIEKTIDTGAGESWQLVYSPDDTFIATG
SQQSKINLINLEQEKIVNSIPVDGKFILAVAYSPDGKHLACGTFEGIVAIYDVETGKQVQ
KYQDRAKPVRSISYSPDGSFLLAASDDMHVNIYDVLHSSLVGSVSGHISWILSVACSPDG
If a coding sequence is identified successfully, the output nucleotide sequence header should have the gene accession as its sequence identifier, e.g.
>THRCLA_11583 coding sequence
ATGTACCGCGCCTCGTCCAAAGTAACGTCGGCTCATAATGATGGAATCTGGAGTACTGTC
TGGACAAGCCGCAATCAAATCATAAGTGGATCTTTGGATGAAGTGGTCAAGAGCTGGGAT
GCGAGTAGTTCCGAGGACAATGCGATTTTGCCTGTTGTCAAGCAATTTCCAGGCCACGTT
CTAGGCACACTGGCAGTGACTGCAACGAAAGATGGTCGAAAAGCTGCTACATCGTCTTTA
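As a rough illustration of how the GN key:value pair can be read from a UniProt header, the sketch below uses a regular expression. The function name gene_id is hypothetical and the approach is an assumption, not necessarily how ncfp parses the description.

```python
import re
from typing import Optional


def gene_id(description: str) -> Optional[str]:
    """Return the value of the GN= field in a UniProt description, or None."""
    match = re.search(r"\bGN=(\S+)", description)
    return match.group(1) if match else None


header = (
    "tr|A0A1V9Y7A7|A0A1V9Y7A7_9STRA Lon protease homolog "
    "OS=Thraustotheca clavata GN=THRCLA_11583 PE=3 SV=1"
)
print(gene_id(header))
```

A header without a GN field yields None in this sketch, which a caller could treat as a sequence that cannot be matched to a gene.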
Stockholm domain format¶
UniProt and other sources use Stockholm format to indicate that an amino acid sequence represents a portion of a protein (such as a domain). ncfp can recognise this format and trim the coding sequence to correspond only to the specified region of the protein.
Stockholm format domains are indicated by the syntax /<start>-<stop> immediately following the sequence identifier in FASTA format, e.g.
>tr|B7G6L2|B7G6L2_PHATC/43-112 [subseq from] Predicted protein OS=Phaeodactylum tricornutum (strain CCAP 1055/1) GN=PHATRDRAFT_48282 PE=4 SV=1
-----------------------------SLCV-EVAGA-SQD---DGASIFQGDCN-dG
NKHQVFDFipaPG---TdsgFHRIRA--SHSN-KCLGVADGAL--APG-AEVVQ-
To restrict the coding sequence to the region indicated in Stockholm format, pass either the -s or --stockholm option, e.g.
$ ncfp -s <INPUT>.fasta <OUTPUT> <EMAIL>
$ ncfp --stockholm <INPUT>.fasta <OUTPUT> <EMAIL>
The output nucleotide sequence preserves neither the Stockholm format location information nor the sequence gap symbols:
>PHATRDRAFT_48282 coding sequence
TCGCTCTGCGTGGAGGTGGCTGGAGCGAGCCAAGACGACGGGGCCTCCATATTTCAAGGG
GATTGTAATGACGGAAACAAGCATCAAGTCTTCGACTTCATTCCTGCTCCCGGTACAGAC
AGCGGTTTTCATCGAATTCGAGCCTCGCACTCCAACAAGTGCCTTGGCGTGGCTGATGGG
GCTTTAGCACCTGGAGCTGAGGTAGTGCAA
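The relationship between the /<start>-<stop> suffix and the trimmed coding sequence can be sketched as below. The helper names are hypothetical, and the sketch assumes that the coordinates are 1-based and inclusive protein positions; this is consistent with the example above, where residues 43-112 (70 residues) correspond to 210 nucleotides. It is not ncfp's own code.

```python
import re

# Matches "<identifier>/<start>-<stop>" at the end of a sequence ID.
LOCATION_RE = re.compile(r"^(?P<id>\S+?)/(?P<start>\d+)-(?P<stop>\d+)$")


def split_location(seq_id: str):
    """Split a Stockholm-style ID into (identifier, start, stop)."""
    match = LOCATION_RE.match(seq_id)
    if match is None:
        # No location suffix: the whole protein is intended.
        return seq_id, None, None
    return match["id"], int(match["start"]), int(match["stop"])


def cds_slice(start: int, stop: int) -> slice:
    """Convert 1-based inclusive protein coordinates to a nucleotide slice."""
    return slice((start - 1) * 3, stop * 3)


def ungap(sequence: str) -> str:
    """Remove gap symbols and uppercase the residues that remain."""
    return "".join(char for char in sequence if char.isalpha()).upper()
```

Applying cds_slice(43, 112) to a full-length coding sequence would select nucleotides 126 to 335 (0-based), the region encoding the domain.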
Use Protein IDs in the Output¶
By default, ncfp lists the nucleotide ID retrieved from the nucleotide record in the final FASTA file of nucleotide sequences (if the nucleotide ID could not be retrieved, the protein ID is used instead). However, some applications require each protein sequence and its associated CDS to be identifiable by sharing the same ID, for example when backthreading nucleotide sequences onto aligned protein sequences using t-coffee.
To use the protein IDs as the IDs of the nucleotide sequences, pass the --use_protein_ids flag:
ncfp <INPUT>.fasta <OUTPUT> <EMAIL> --use_protein_ids
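Before handing the two output files to a downstream tool that pairs records by ID, it can be useful to confirm that the IDs really do match. The sketch below is one way to do that with the standard library only; the function names fasta_ids and unpaired_ids are hypothetical and are not part of ncfp.

```python
def fasta_ids(fasta_text: str) -> list:
    """Return the sequence IDs (first word of each header) in FASTA text."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.append(fields[0])
    return ids


def unpaired_ids(aa_fasta: str, nt_fasta: str) -> list:
    """Return the IDs present in only one of the two FASTA texts, sorted."""
    aa_ids = set(fasta_ids(aa_fasta))
    nt_ids = set(fasta_ids(nt_fasta))
    return sorted(aa_ids ^ nt_ids)
```

An empty list from unpaired_ids, given the contents of the protein and nucleotide output files, indicates that every protein record has a nucleotide partner with the same ID.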
[1] The user’s email address is passed to NCBI to enable them to monitor use of their service and provide support.