Examples

NCBI format input sequences - no introns

The command below [1] identifies coding sequences from NCBI format [2] input for nine proteins that do not contain introns.

ncfp tests/test_input/sequences/input_ncbi.fasta \
    tests/examples/ncbi dev@null.com -v

NCBI format input sequences - alternative start codon

The command below [1] identifies coding sequences from NCBI format [2] input for two proteins, one of which has an alternative start site.

ncfp --allow_alternative_start_codon \
    tests/test_input/sequences/input_alternative_start.fasta \
    tests/examples/alternative_start dev@null.com -v

UniProt format input sequences - no introns

The command below [1] identifies coding sequences from UniProt format [3] input for ten proteins that do not contain introns. The -u or --uniprot argument is required to specify that the input sequences are in UniProt format; without it, an error is thrown.

ncfp -u tests/test_input/sequences/input_uniprot.fasta \
    tests/examples/uniprot dev@null.com -v

UniProt/Stockholm input sequences - no introns

The command below [1] identifies coding sequences from UniProt format [3] input for 57 amino acid sequences, each specifying a region of a protein in Stockholm notation [4]. The -u or --uniprot argument is required to specify that the input sequences are in UniProt format, and the -s or --stockholm argument is required to tell ncfp to parse the region locations.

ncfp -u -s tests/test_input/sequences/input_uniprot_stockholm.fasta \
    tests/examples/uniprot_stockholm dev@null.com -v
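
To illustrate what a Stockholm region suffix looks like, the following is a minimal shell sketch that splits an identifier of the form ID/START-END into its parts. It is not how ncfp parses headers internally; the function name parse_stockholm_id and the identifier EXAMPLE_ID are hypothetical, and only the /47-134 suffix is taken from footnote [4].

```shell
# Hypothetical helper (not part of ncfp): split "ID/START-END" into
# tab-separated accession, start and end fields.
parse_stockholm_id() {
    local seqid="$1"
    local region="${seqid##*/}"     # text after the last "/", e.g. 47-134
    local accession="${seqid%/*}"   # text before the last "/"
    local start="${region%-*}"      # text before the "-"
    local end="${region#*-}"        # text after the "-"
    printf '%s\t%s\t%s\n' "$accession" "$start" "$end"
}

# EXAMPLE_ID is a made-up identifier used only for illustration.
parse_stockholm_id "EXAMPLE_ID/47-134"
```

The three fields are printed tab-separated, so they can be passed on to tools such as cut or awk.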

Human sequences - isoforms and intron/exon structure

The command below [1] identifies coding sequences from NCBI format [2] input for four human proteins with intron/exon structure, including three isoforms of the same protein from the same locus (GPR137: NP_001164351.1, NP_001164352.1, and XP_005274161.1).

ncfp tests/test_input/sequences/human.fasta \
    tests/examples/human dev@null.com -v

Logging

Verbose output can be written to a persistent logfile with the -l or --logfile argument, followed by the path to which the logfile should be written. An example is given in the command below.

ncfp tests/test_input/sequences/human.fasta \
    tests/examples/logging dev@null.com \
    -l tests/examples/logging/human.log

Specifying the cache location

By default, ncfp creates a new cache database in the hidden .ncfp_cache subdirectory every time it is run. The default cache database filename is ncfpcache_YYYY-MM-DD-HH-MM-SS.sqlite3, indicating the time at which the command was run. This location and naming convention can be overridden with the -d/--cachedir and -c/--cachestem arguments, as in the command below.

ncfp tests/test_input/sequences/human.fasta \
    tests/examples/caches dev@null.com \
    -d tests/examples/caches \
    -c ncfp_cache
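
As a rough illustration of the default naming pattern described above, the shell sketch below builds a filename of the same shape from the current time. The function name default_cache_name is hypothetical, and the sketch only mirrors the documented ncfpcache_YYYY-MM-DD-HH-MM-SS.sqlite3 pattern; it does not call ncfp or reproduce its internal logic.

```shell
# Hypothetical helper (not part of ncfp): print a filename that follows
# the documented default pattern ncfpcache_YYYY-MM-DD-HH-MM-SS.sqlite3.
default_cache_name() {
    printf 'ncfpcache_%s.sqlite3\n' "$(date +%Y-%m-%d-%H-%M-%S)"
}

# Show where such a file would sit under the default cache directory.
echo ".ncfp_cache/$(default_cache_name)"
```

Because the timestamp changes on every run, each default cache gets a distinct name, which is why a fixed stem passed with -c/--cachestem is needed when a cache is meant to be found again later.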

Reusing an existing cache

To avoid unnecessary bandwidth/NCBI queries, an existing cache database can be used. The location of the cache is specified with the -d/--cachedir and -c/--cachestem arguments, and the --keepcache option must be specified. If the specified location does not contain a cache database, one is created. For example:

ncfp tests/test_input/sequences/human.fasta \
    tests/examples/caches1 dev@null.com \
    -d tests/examples/caches \
    -c ncfp_cache

will create a cache at tests/examples/caches/ncfp_cache.sqlite3, and

ncfp tests/test_input/sequences/human.fasta \
    tests/examples/caches2 dev@null.com \
    -d tests/examples/caches \
    -c ncfp_cache \
    --filestem cached \
    --keepcache

will reuse the cache file without making new queries to NCBI, and write the output to cached_aa.fasta and cached_nt.fasta [5].
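
One quick sanity check after a run is to count the FASTA records in the two output files. The sketch below is a hedged example: the helper count_fasta_records is hypothetical, and it assumes the output files are written into the output directory given on the command line (tests/examples/caches2 here). It only reports the counts and does nothing if the files are absent.

```shell
# Hypothetical helper (not part of ncfp): count FASTA records by counting
# header lines, i.e. lines that start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}

# Assumed output locations: output directory from the command above plus
# the filenames given by --filestem cached.
aa=tests/examples/caches2/cached_aa.fasta
nt=tests/examples/caches2/cached_nt.fasta

if [ -f "$aa" ] && [ -f "$nt" ]; then
    echo "amino acid records: $(count_fasta_records "$aa")"
    echo "nucleotide records: $(count_fasta_records "$nt")"
fi
```

If the two counts differ, the verbose output or logfile is the place to look for sequences that could not be matched.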

Footnotes

[1] The -v option shows verbose output in STDOUT.
[2] The sequence identifier in the FASTA header is a valid NCBI protein accession.
[3] The sequence description in the FASTA header contains a valid GN=<accession> gene identifier.
[4] The sequence identifier in the FASTA header ends with a Stockholm format region definition, e.g. /47-134.
[5] The --filestem argument changes the filestem of the output nucleotide and amino acid sequence files.