Examples
NCBI format input sequences - no introns
The command below [1] identifies coding sequences from NCBI format [2] input for nine proteins that do not contain introns.
ncfp tests/test_input/sequences/input_ncbi.fasta \
tests/test_output/ncbi dev@null.com -v
UniProt format input sequences - no introns
The command below [1] identifies coding sequences from UniProt
format [3] input for ten proteins that do not contain introns. The
-u or --uniprot argument is required to specify that the input
sequences are in UniProt format; otherwise an error is thrown.
ncfp -u tests/test_input/sequences/input_uniprot.fasta \
tests/test_output/uniprot dev@null.com -v
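As footnote [3] notes, UniProt format input relies on a GN=<accession> field in the FASTA header description. The snippet below is a minimal sketch of how such a field could be pulled out of a header line. The helper name `gene_identifier` and the example header are hypothetical and for illustration only; this is not ncfp's own parsing code.

```python
import re
from typing import Optional

# Hypothetical helper (not part of ncfp): find the GN=<identifier> field
# in the description of a UniProt-style FASTA header.
GN_PATTERN = re.compile(r"\bGN=(\S+)")


def gene_identifier(header: str) -> Optional[str]:
    """Return the value of the GN= field, or None if the header has none."""
    match = GN_PATTERN.search(header)
    return match.group(1) if match else None


# Made-up header, used only to illustrate the layout of the GN= field.
header = ">tr|A0A000|A0A000_EXAMPLE Example protein OS=Example organism GN=exampleGene PE=4 SV=1"
print(gene_identifier(header))
```

A header without a GN= field returns None here, which is one way to flag input that would not meet the requirement in footnote [3].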
UniProt/Stockholm input sequences - no introns
The command below [1] identifies coding sequences from UniProt
format [3] input for 57 amino acid sequences specifying regions
of a protein in Stockholm notation [4]. The -u or --uniprot
argument is required to specify that the input sequences are in UniProt
format, and the -s or --stockholm argument is required to
tell ncfp to parse the region locations.
ncfp -us tests/test_input/sequences/input_uniprot_stockholm.fasta \
tests/test_output/uniprot_stockholm dev@null.com -v
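Footnote [4] describes the Stockholm region definition as a suffix such as /47-134 on the sequence identifier. A minimal sketch of splitting an identifier into its base name and region follows. The function `split_region` and the identifier `EXAMPLE_ID` are hypothetical, and ncfp's actual parser may behave differently (for example, in how it treats coordinates).

```python
import re
from typing import Optional, Tuple

# Matches an identifier ending in /<start>-<end>, e.g. EXAMPLE_ID/47-134.
REGION_PATTERN = re.compile(r"^(?P<name>.+)/(?P<start>\d+)-(?P<end>\d+)$")


def split_region(seq_id: str) -> Tuple[str, Optional[Tuple[int, int]]]:
    """Split a sequence identifier into (name, (start, end)).

    Returns (seq_id, None) when no region suffix is present.
    """
    match = REGION_PATTERN.match(seq_id)
    if match is None:
        return seq_id, None
    return match.group("name"), (int(match.group("start")), int(match.group("end")))


print(split_region("EXAMPLE_ID/47-134"))  # ('EXAMPLE_ID', (47, 134))
```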
Human sequences - isoforms and intron/exon structure
The command below [1] identifies coding sequences from NCBI
format [2] input for four human proteins with intron/exon structure,
including three isoforms of the same protein from the same locus
(GPR137: NP_001164351.1, NP_001164352.1, and XP_005274161.1).
ncfp tests/test_input/sequences/human.fasta \
tests/test_output/human dev@null.com -v
Logging
Verbose output can be written persistently to a logfile using the
-l or --logfile argument and specifying the path to which
the logfile should be written. An example is given in the command below.
ncfp tests/test_input/sequences/human.fasta \
tests/test_output/logging dev@null.com \
-l tests/test_output/logging/human.log
Specifying the cache location
By default a new cache database is created every time that ncfp is
run, in the .ncfp_cache hidden subdirectory. The default cache
database filename is ncfpcache_YYYY-MM-DD-HH-MM-SS.sqlite3,
indicating the time that the command was run. This location and
naming convention can be overridden with the -d/--cachedir and
-c/--cachestem arguments, as in the command below.
ncfp tests/test_input/sequences/human.fasta \
tests/test_output/caches dev@null.com \
-d tests/test_output/caches \
-c ncfp_cache
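To make the default naming convention concrete, the sketch below builds a path that follows the ncfpcache_YYYY-MM-DD-HH-MM-SS.sqlite3 pattern inside .ncfp_cache. The function `default_cache_path` is hypothetical and only mirrors the pattern described above; it is not taken from the ncfp source.

```python
from datetime import datetime
from pathlib import Path


def default_cache_path(now: datetime, cachedir: str = ".ncfp_cache") -> Path:
    """Build a cache path following the documented default naming pattern."""
    stem = now.strftime("ncfpcache_%Y-%m-%d-%H-%M-%S")
    return Path(cachedir) / f"{stem}.sqlite3"


# A fixed timestamp keeps the example reproducible.
print(default_cache_path(datetime(2024, 1, 2, 3, 4, 5)).as_posix())
# .ncfp_cache/ncfpcache_2024-01-02-03-04-05.sqlite3
```

Because the default name changes on every run, a fixed -c/--cachestem value is what makes a cache easy to locate again later.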
Reusing an existing cache
To avoid unnecessary bandwidth use and NCBI queries, an existing cache
database can be used. The location of the cache is specified with the
-d/--cachedir and -c/--cachestem arguments, and the
--keepcache option must be specified. If the specified location
does not contain a cache database, one is created. For example:
ncfp tests/test_input/sequences/human.fasta \
tests/test_output/caches dev@null.com \
-d tests/test_output/caches \
-c ncfp_cache
will create a cache at tests/test_output/caches/ncfp_cache.sqlite3,
and
ncfp tests/test_input/sequences/human.fasta \
tests/test_output/caches dev@null.com \
-d tests/test_output/caches \
-c ncfp_cache \
--filestem cached \
--keepcache
will reuse the cache file without making new queries to NCBI, and
write the output to cached_aa.fasta and cached_nt.fasta [5].
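Before reusing a cache, it can be useful to confirm that the expected SQLite file exists. The sketch below is one way to do that: it builds the path from the cache directory and stem, as in the tests/test_output/caches/ncfp_cache.sqlite3 example above, and lists the tables it finds. The helper `cache_tables` is hypothetical, and nothing is assumed about the cache schema.

```python
import sqlite3
from contextlib import closing
from pathlib import Path
from typing import List


def cache_tables(cachedir: str, cachestem: str) -> List[str]:
    """List table names in <cachedir>/<cachestem>.sqlite3.

    Returns an empty list if the cache file does not exist.
    """
    path = Path(cachedir) / f"{cachestem}.sqlite3"
    if not path.is_file():
        return []
    # closing() ensures the connection is released after the query.
    with closing(sqlite3.connect(str(path))) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [row[0] for row in rows]


print(cache_tables("tests/test_output/caches", "ncfp_cache"))
```

An empty list means there is no cache file (or no tables) at that location, in which case ncfp creates a new cache as described above.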
Footnotes
| [1] | The -v option shows verbose output in STDOUT. |
| [2] | The sequence identifier in the FASTA header is a valid NCBI protein accession. |
| [3] | The sequence description in the FASTA header contains a valid GN=<accession> gene identifier. |
| [4] | The sequence identifier in the FASTA header ends with a Stockholm format region definition, e.g. /47-134. |
| [5] | The --filestem argument changes the filestem of the output nucleotide and amino acid sequence files. |