Frequently Asked Questions

Here you will find answers to common questions about CRESSENT. If you can’t find an answer to your problem here, please open an issue in the GitHub repository.

General Usage

How can I speed up CRESSENT analysis?

If you want to speed up CRESSENT execution, several options are available:

  • Use sequence clustering: Run the cluster module before phylogenetic analysis to remove redundant sequences

  • Optimize thread usage: Use the --threads parameter with appropriate values for your system

  • Disable time-intensive modules: Skip recombination analysis if not needed for your research

  • Use simpler models: Choose faster evolutionary models (e.g., -m GTR+G4 instead of -m MFP) for quick analysis

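Note that these options trade depth or accuracy for speed: clustering discards sequences, skipping recombination analysis omits those results, and a fixed model may fit the data worse than one chosen by model selection.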

What input formats does CRESSENT accept?

CRESSENT accepts the following input formats:

  • FASTA files: Nucleotide and protein sequences (required)

  • Compressed files: .gz, .bz2, and .xz compression supported

  • GFF files: For structural analysis modules like stem loops and iterons

  • CSV files: For database construction and metadata integration

All sequence files must be in valid FASTA format with unique sequence identifiers.
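
A quick pre-flight check for duplicate identifiers can be sketched as follows. This is a minimal illustration using only the Python standard library, not CRESSENT's own validation code. The function names are hypothetical, and it assumes the identifier is the first whitespace-delimited word of each header line.

```python
import bz2
import gzip
import lzma
from collections import Counter
from pathlib import Path

# Map file suffix to the matching stdlib opener; plain text is the fallback.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def read_fasta_ids(path):
    """Return the identifier (first word of each header) for every record."""
    opener = _OPENERS.get(Path(path).suffix, open)
    ids = []
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                ids.append(header.split()[0] if header else "")
    return ids


def duplicate_ids(ids):
    """Return identifiers that occur more than once, sorted."""
    return sorted(name for name, n in Counter(ids).items() if n > 1)
```

If `duplicate_ids(read_fasta_ids("input.fasta.gz"))` returns a non-empty list, rename or remove those records before running an analysis.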

How do I interpret phylogenetic tree support values?

Bootstrap support values in CRESSENT trees indicate how much confidence to place in each branch of the tree:

  • ≥95%: Very strong support, high confidence in the relationship

  • 70-94%: Strong support, generally reliable

  • 50-69%: Moderate support, interpret with caution

  • <50%: Poor support, relationship is uncertain

For critical evolutionary inferences, focus on relationships with ≥70% bootstrap support.
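
The thresholds above can be expressed as a small helper, for example when annotating branch labels exported from a tree file. This is an illustrative sketch. The function name is hypothetical and it is not part of CRESSENT.

```python
def support_category(bootstrap):
    """Map a bootstrap percentage (0-100) to the categories listed above."""
    if not 0 <= bootstrap <= 100:
        raise ValueError("bootstrap support must be between 0 and 100")
    if bootstrap >= 95:
        return "very strong"
    if bootstrap >= 70:
        return "strong"
    if bootstrap >= 50:
        return "moderate"
    return "poor"


# Example: label a few branch support values
labels = [support_category(v) for v in (100, 82, 55, 31)]
```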

Module-Specific Questions

Why is my alignment empty or of poor quality?

Poor alignment quality can result from several issues:

Input sequence problems:

  • Sequences are not homologous

  • Poor sequence quality with many ambiguous nucleotides

  • Sequences of very different lengths

Solutions:

  • Verify that input sequences represent the same gene/protein

  • Remove sequences with >10% ambiguous characters

  • Use cluster to remove highly divergent sequences

  • Adjust the --mafft_ep parameter for better alignment sensitivity
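
The 10% ambiguity filter can be sketched as below for nucleotide input. This is an assumption-laden illustration, not CRESSENT code: it treats anything other than A, C, G, T, or U as ambiguous, ignores gap characters, and takes records as hypothetical (name, sequence) pairs.

```python
UNAMBIGUOUS_NT = set("ACGTU")


def ambiguous_fraction(seq):
    """Fraction of non-gap characters that are not A, C, G, T, or U."""
    residues = [c for c in seq.upper() if c not in "-."]
    if not residues:
        return 1.0  # treat empty or all-gap sequences as unusable
    ambiguous = sum(1 for c in residues if c not in UNAMBIGUOUS_NT)
    return ambiguous / len(residues)


def filter_ambiguous(records, max_fraction=0.10):
    """Keep (name, sequence) pairs with at most max_fraction ambiguous characters."""
    return [(name, seq) for name, seq in records
            if ambiguous_fraction(seq) <= max_fraction]
```

Protein input would need a different alphabet (for example, treating X as the ambiguous residue).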

Why does tree construction fail with “insufficient data” errors?

Tree construction can fail when:

  • Too few informative sites: Sequences are too similar or too short

  • Identical sequences: Duplicates contribute no informative sites; remove them using cluster

  • Poor alignment: Insufficient overlap between sequences after trimming

Solutions:

  • Check alignment quality and length

  • Use longer sequences or more divergent taxa

  • Adjust trimming parameters (--gap_threshold)

  • Try simpler evolutionary models
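
Before rerunning tree construction, it can help to measure how much signal the trimmed alignment holds. The sketch below counts parsimony-informative columns (at least two nucleotides that each occur in at least two sequences) and groups identical sequences. It is an illustration for nucleotide alignments with hypothetical function names, and it is not the check that produces the “insufficient data” error.

```python
from collections import Counter


def informative_sites(alignment):
    """Count parsimony-informative columns in a list of aligned sequences."""
    seqs = [s.upper() for s in alignment]
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    count = 0
    for column in zip(*seqs):
        # Gaps and ambiguity codes are ignored when counting states
        states = Counter(c for c in column if c in "ACGT")
        repeated = [c for c, n in states.items() if n >= 2]
        if len(repeated) >= 2:
            count += 1
    return count


def duplicate_groups(records):
    """Group names of (name, sequence) records whose sequences are identical."""
    groups = {}
    for name, seq in records:
        groups.setdefault(seq.upper(), []).append(name)
    return [names for names in groups.values() if len(names) > 1]
```

A count of zero or only a handful of informative columns suggests that longer sequences or more divergent taxa are needed.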

How can I improve recombination detection sensitivity?

To enhance recombination detection:

  1. Use high-quality alignments: Ensure proper sequence alignment

  2. Include diverse sequences: Adequate evolutionary distance improves detection

  3. Run all methods: Use the --all flag for comprehensive analysis

  4. Verify manually: Inspect alignment around detected breakpoints

  5. Apply statistical filters: Focus on events with p-values <0.05

Why are my sequence logos poorly resolved?

Poor sequence logo quality often indicates:

  • Insufficient sequence conservation: Not enough conserved positions

  • Poor motif definition: Pattern may be too broad or too specific

  • Alignment gaps: Excessive gaps in the motif region

Solutions:

  • Increase sequence number in the analysis

  • Refine motif search patterns

  • Use the --remove-gaps option in the motif module

  • Check alignment quality in the motif region
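
To see why low conservation yields flat logos, consider the standard information-content measure behind sequence logos: a column's height in bits is log2(alphabet size) minus the Shannon entropy of its residue frequencies. The sketch below computes it for one column. It assumes a nucleotide alphabet by default, ignores gaps, and applies no small-sample correction, so CRESSENT's logo output may differ in detail.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content in bits of one alignment column, ignoring gaps."""
    residues = [c for c in column.upper() if c != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    # Shannon entropy of the observed residue frequencies
    entropy = -sum((n / total) * math.log2(n / total)
                   for n in Counter(residues).values())
    return math.log2(alphabet_size) - entropy
```

A fully conserved nucleotide column scores 2 bits, while a column with all four bases in equal proportion scores 0 bits and is invisible in the logo.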

Technical Issues

Why am I getting memory errors?

Memory errors typically occur with:

  • Large datasets (>1000 sequences)

  • Long alignments (>10,000 positions)

  • Database integration with large reference sets

Solutions:

  • Reduce dataset size: Use cluster for sequence reduction

  • Decrease thread count: Lower the --threads value to reduce memory usage

  • Increase system RAM: Consider upgrading hardware for large analyses

  • Process in batches: Split large datasets into smaller chunks
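
Batch processing can be sketched with a generic chunking helper. This is a hypothetical utility, not a CRESSENT feature. It works on any iterable of records, and each batch would then be written to its own file and analyzed independently.

```python
from itertools import islice


def batched(records, size):
    """Yield successive lists of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


# Example: five records in batches of two
batches = list(batched(["s1", "s2", "s3", "s4", "s5"], 2))
```

Keep in mind that relationships between sequences in different batches are not assessed, so batching suits per-sequence steps better than phylogenetic comparison.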

Why does execution appear stuck during tree building?

Tree construction can be slow due to:

  • Large datasets: Many sequences or long alignments

  • Complex models: Model selection (-m MFP) is computationally intensive

  • Unused CPU cores: Single-threaded execution on a multi-core system

Solutions:

  • Use the --threads parameter to utilize multiple cores

  • Choose specific models (-m GTR+G4) instead of model selection

  • Monitor system resources to identify bottlenecks

  • Consider using a high-performance computing cluster

Why do binary compilation errors occur?

Some modules compile required binaries automatically, and compilation can fail because of:

  • System compatibility: Binaries may not work on all systems

  • Missing dependencies: Required compilers or libraries missing

  • Permission issues: Insufficient write permissions

Solutions:

  • Ensure gcc and make are installed

  • Check system compatibility (Linux/Unix preferred)

  • Run with appropriate permissions

  • Install missing system dependencies

Database and Integration

Why does database integration fail?

Database integration issues often result from:

  • Missing database files: The database path is wrong or expected files are absent

  • Incompatible family names: The requested family is misspelled or not available in the database

  • Corrupted downloads: Database files are incomplete or damaged

Solutions:

  • Verify that --db_path points to the correct directory

  • Check available families in database documentation

  • Re-run database construction if files are corrupted

How do I handle large-scale comparative analyses?

For large comparative studies:

  1. Use hierarchical clustering: Reduce dataset complexity first

  2. Employ database integration: Add reference sequences for context

  3. Optimize computational resources: Use high-performance computing

  4. Process by viral family: Analyze related viruses separately

  5. Validate key findings: Focus detailed analysis on important relationships

Results Interpretation

How reliable are the recombination predictions?

The reliability of a predicted recombination event depends on:

  • Multiple method agreement: Events detected by ≥3 methods are most reliable

  • Statistical significance: p-values <0.01 indicate strong evidence

  • Biological plausibility: Results should make biological sense

  • Manual validation: Visual inspection of alignments confirms events
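
The first two criteria can be combined into a simple filter, sketched below. The input structure is hypothetical (a list of dicts with a "p_values" mapping from method name to p-value) and does not reflect CRESSENT's actual output format. Requiring each supporting method to reach p < 0.01 is one possible reading of the criteria above.

```python
def reliable_events(events, min_methods=3, max_p=0.01):
    """Keep events supported by at least min_methods methods at p < max_p."""
    kept = []
    for event in events:
        supporting = [m for m, p in event["p_values"].items() if p < max_p]
        if len(supporting) >= min_methods:
            kept.append({**event, "supporting_methods": sorted(supporting)})
    return kept


# Example with placeholder method names
events = [
    {"id": "e1", "p_values": {"method_a": 0.001, "method_b": 0.005,
                              "method_c": 0.002, "method_d": 0.2}},
    {"id": "e2", "p_values": {"method_a": 0.001, "method_b": 0.03}},
]
kept = reliable_events(events)  # only "e1" passes both criteria
```

Events that pass the filter still need the biological plausibility check and manual inspection described above.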

What do tanglegram comparisons reveal?

Tanglegrams show congruence between phylogenetic trees:

  • Parallel lines: Congruent phylogenies, similar evolutionary history

  • Crossed lines: Incongruent relationships, possible recombination

  • RF scores: Robinson-Foulds distance quantifies tree differences

High congruence suggests vertical inheritance; incongruence may indicate recombination or different evolutionary pressures.
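
For intuition about the RF score, the sketch below computes the unnormalized Robinson-Foulds distance between two trees on the same leaf set: the number of non-trivial bipartitions present in one tree but not the other. Trees are given as nested tuples of leaf names, which is an assumption made for illustration. CRESSENT may report a normalized value or rely on a tree library instead.

```python
def leaves(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the non-trivial splits of the tree, treated as unrooted."""
    all_leaves = leaves(tree)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        if len(below) > 1 and len(other) > 1:
            # Store each split once, as the side that sorts first
            splits.add(min(below, other, key=sorted))
        return below

    visit(tree)
    return splits


def robinson_foulds(tree_a, tree_b):
    """Count splits found in exactly one of the two trees."""
    if leaves(tree_a) != leaves(tree_b):
        raise ValueError("trees must have identical leaf sets")
    return len(bipartitions(tree_a) ^ bipartitions(tree_b))


tree_a = ((("A", "B"), "C"), ("D", "E"))
tree_b = ((("A", "C"), "B"), ("D", "E"))
distance = robinson_foulds(tree_a, tree_b)  # A+B vs. A+C grouping differs
```

A distance of 0 means the two topologies are identical; larger values mean more conflicting groupings.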

How do I validate phylogenetic results?

Validate phylogenetic analyses through:

  1. Bootstrap assessment: Focus on well-supported relationships (≥70%)

  2. Model adequacy: Check model fit statistics in log files

  3. Biological consistency: Ensure results match known biology

  4. Alternative methods: Compare with other phylogenetic approaches

  5. Literature comparison: Verify consistency with published studies

Best Practices

How should I cite CRESSENT in publications?

Please cite CRESSENT as:

[Authors]. CRESSENT: A comprehensive toolkit for CRESS DNA virus analysis. [Journal] [Year]. DOI: [DOI].

Additionally, cite specific tools used within CRESSENT modules (MAFFT, IQ-TREE, etc.) as appropriate for your analysis.

What are common analysis pitfalls to avoid?

Avoid these common mistakes:

  1. Skipping quality control: Always validate input data quality

  2. Ignoring statistical support: Don’t over-interpret poorly supported results

  3. Using inappropriate models: Choose evolutionary models suitable for your data

  4. Mixing sequence types: Don’t combine nucleotide and protein sequences in the same alignment

  5. Inadequate taxon sampling: Include sufficient diversity for robust analysis

How do I optimize CRESSENT for different viral families?

Different viral families may require specific approaches:

Small genomes (e.g., Circoviridae):

  • Use complete genome sequences when possible

  • Consider codon-based alignments for coding sequences

  • Pay attention to overlapping genes

Large genomes (e.g., some ssDNA viruses):

  • Focus on conserved gene regions

  • Use gene-specific databases when available

  • Consider computational resource requirements

Highly divergent families:

  • Use protein sequences instead of nucleotides

  • Employ more sensitive alignment parameters

  • Consider domain-specific analysis

If you have additional questions not covered here, please check the individual module documentation or contact us through the GitHub repository.