Examples

NCBI format input sequences - no introns

The command below [1] identifies coding sequences from NCBI format [2] input for nine proteins that do not contain introns.

ncfp tests/test_input/sequences/input_ncbi.fasta \
    tests/test_output/ncbi dev@null.com -v
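
A quick way to confirm how many sequences a FASTA file holds, before or after a run, is to count its header lines. The sketch below is not part of ncfp; `count_records` is a hypothetical helper name, and it assumes every record starts with a `>` header line.

```shell
# Hypothetical helper (not part of ncfp): count FASTA records in a file
# by counting the lines that begin with ">".
count_records() {
    grep -c '^>' "$1"
}

# Usage, with the input file from the command above:
#   count_records tests/test_input/sequences/input_ncbi.fasta
```

Comparing the count for the input file with the count for an output file shows at a glance whether any sequences were skipped.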

UniProt format input sequences - no introns

The command below [1] identifies coding sequences from UniProt format [3] input for ten proteins that do not contain introns. The -u or --uniprot argument is required to specify that the input sequences are in UniProt format; if it is omitted, an error is thrown.

ncfp -u tests/test_input/sequences/input_uniprot.fasta \
    tests/test_output/uniprot dev@null.com -v
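
Because UniProt format input relies on a `GN=<accession>` field in each header [3], it can help to check the headers before running ncfp. The following is a minimal sketch, not ncfp's own parser: `extract_gn` is a hypothetical helper, the example header and its identifiers are made up, and the field is assumed to be preceded by a space and to end at the next space.

```shell
# Hypothetical helper (not ncfp's parser): print the GN=<accession> value
# from each FASTA header line read on stdin. Lines without a GN= field
# produce no output.
extract_gn() {
    sed -n 's/^>.* GN=\([^ ]*\).*$/\1/p'
}

# Illustrative header; the identifiers are invented for this sketch.
header='>tr|X0X0X0|X0X0X0_EXAMPLE Example protein OS=Example organism GN=abc123 PE=4 SV=1'
gn=$(printf '%s\n' "$header" | extract_gn)
echo "$gn"   # prints: abc123
```

Running `grep '^>' input.fasta | extract_gn | wc -l` on a real file and comparing the result with the number of headers would reveal records that lack a gene identifier.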

UniProt/Stockholm input sequences - no introns

The command below [1] identifies coding sequences from UniProt format [3] input for 57 amino acid sequences specifying regions of a protein in Stockholm notation [4]. The -u or --uniprot argument is required to specify that the input sequences are in UniProt format, and the -s or --stockholm argument is required to tell ncfp to parse the region locations.

ncfp -us tests/test_input/sequences/input_uniprot_stockholm.fasta \
    tests/test_output/uniprot_stockholm dev@null.com -v
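
To illustrate the Stockholm region suffix [4], the sketch below splits an identifier such as `X0X0X0_EXAMPLE/47-134` into the identifier, start and end. It only illustrates the notation and is not how ncfp itself parses regions; `parse_region` and the example identifier are hypothetical.

```shell
# Hypothetical helper: split "<identifier>/<start>-<end>" into three parts.
# Prints "identifier start end", or returns 1 if no valid region suffix
# is present.
parse_region() {
    seq_id=$1
    case $seq_id in
        */*) ;;                 # a "/" is present, carry on
        *) return 1 ;;          # no region suffix at all
    esac
    region=${seq_id##*/}        # text after the last "/"
    case $region in
        *[!0-9-]*|-*|*-|*-*-*) return 1 ;;  # not of the form digits-digits
        *-*) ;;
        *) return 1 ;;
    esac
    printf '%s %s %s\n' "${seq_id%/*}" "${region%-*}" "${region#*-}"
}

parse_region 'X0X0X0_EXAMPLE/47-134'   # prints: X0X0X0_EXAMPLE 47 134
```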

Human sequences - isoforms and intron/exon structure

The command below [1] identifies coding sequences from NCBI format [2] input for four human proteins with intron/exon structure, including three isoforms of the same protein from the same locus (GPR137: NP_001164351.1, NP_001164352.1, and XP_005274161.1).

ncfp tests/test_input/sequences/human.fasta \
    tests/test_output/human dev@null.com -v

Logging

Verbose output can be written to a persistent logfile by passing the -l or --logfile argument with the path to which the logfile should be written, as in the command below.

ncfp tests/test_input/sequences/human.fasta \
    tests/test_output/logging dev@null.com \
    -l tests/test_output/logging/human.log

Specifying the cache location

By default, each ncfp run creates a new cache database in the hidden .ncfp_cache subdirectory. The default cache database filename is ncfpcache_YYYY-MM-DD-HH-MM-SS.sqlite3, indicating the time that the command was run. This location and naming convention can be overridden with the -d/--cachedir and -c/--cachestem arguments, as in the command below.

ncfp tests/test_input/sequences/human.fasta \
    tests/test_output/caches dev@null.com \
    -d tests/test_output/caches \
    -c ncfp_cache
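
Since the default filenames embed a timestamp, default caches accumulate with each run. The sketch below finds the most recent one; `latest_cache` is a hypothetical helper, and it assumes the timestamp fields are zero-padded (so that alphabetical order matches chronological order) and that .ncfp_cache is relative to the directory in which ncfp was run.

```shell
# Hypothetical helper: print the newest default-named cache database in a
# directory (default: .ncfp_cache), or nothing if none exists.
latest_cache() {
    dir=${1:-.ncfp_cache}
    latest=
    for f in "$dir"/ncfpcache_*.sqlite3; do
        [ -e "$f" ] || continue   # the glob matched nothing
        latest=$f                 # glob results are sorted, so the last wins
    done
    if [ -n "$latest" ]; then
        printf '%s\n' "$latest"
    fi
}

latest_cache   # prints the newest cache path, if any
```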

Reusing an existing cache

To avoid unnecessary bandwidth use and repeated NCBI queries, an existing cache database can be reused. The location of the cache is specified with the -d/--cachedir and -c/--cachestem arguments, and the --keepcache option must be specified. If the specified location does not contain a cache database, one is created. For example:

ncfp tests/test_input/sequences/human.fasta \
    tests/test_output/caches dev@null.com \
    -d tests/test_output/caches \
    -c ncfp_cache

will create a cache at tests/test_output/caches/ncfp_cache.sqlite3, and

ncfp tests/test_input/sequences/human.fasta \
    tests/test_output/caches dev@null.com \
    -d tests/test_output/caches \
    -c ncfp_cache \
    --filestem cached \
    --keepcache

will reuse the cache file without making new queries to NCBI, and write the output to cached_aa.fasta and cached_nt.fasta [5].
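
Before a run with --keepcache, it can be useful to know whether a cache already exists at the given location. The following sketch assumes the cache path is `<cachedir>/<cachestem>.sqlite3`, as in the example above; `cache_status` is a hypothetical helper and not part of ncfp.

```shell
# Hypothetical helper: report whether a run with --keepcache would find an
# existing cache ("reuse") or start a new one ("create").
# Assumes the cache lives at <cachedir>/<cachestem>.sqlite3.
cache_status() {
    cachedir=$1
    cachestem=$2
    if [ -f "$cachedir/$cachestem.sqlite3" ]; then
        echo "reuse"
    else
        echo "create"
    fi
}

cache_status tests/test_output/caches ncfp_cache
```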

Footnotes

[1]	The `-v` option writes verbose output to `STDOUT`.

[2]	The sequence identifier in the FASTA header is a valid NCBI protein accession.

[3]	The sequence description in the FASTA header contains a valid `GN=<accession>` gene identifier.

[4]	The sequence identifier in the FASTA header ends with a Stockholm format region definition, e.g. `/47-134`.

[5]	The `--filestem` argument changes the filestem of the output nucleotide and amino acid sequence files.