Quickstart
If you are running CRESSENT for the first time, follow the steps below to learn how to perform a complete ssDNA virus analysis. In this tutorial, we use sequences from the Naryaviridae family as an example, based on the dataset from Zhang et al, and demonstrate the workflow using our test dataset.
You can download the data from the corresponding folder in the GitHub repository.
Setting up the environment
First, activate the CRESSENT conda environment:
conda activate cressent
Preparing your data
For this tutorial, organize your data directory as follows:
/test/naryaviridae
├── Dai_naryaviridae_genome.fasta # Nucleotide genome sequences
├── Dai_naryaviridae_caps.faa # Capsid protein sequences
├── Dai_naryaviridae_reps.faa # Replication protein sequences
└── Dai_naryaviridae_prot.faa # All protein sequences
Since this example uses only 3 sequences, we will skip preprocessing steps like clustering and contamination detection that are typically used for larger datasets.
Step 1: Genome Analysis
Recombination Detection
Detect recombination events using all available methods:
cressent recombination \
-i data/Dai_naryaviridae_genome.fasta \
-o output/recombination \
-f recomb_results.csv \
--all
Analyzing Recombination Results
import pandas as pd
# Read recombination results
df = pd.read_csv("output/recombination/recomb_results.csv")
# Filter significant events (p-value < 0.05)
significant = df[df['Pvalue'] < 0.05]
# Count methods detecting each recombinant
method_counts = significant.groupby('Recombinant')['Method'].value_counts().unstack(fill_value=0)
method_counts['Total'] = method_counts.sum(axis=1)
print("Recombination events detected by multiple methods:")
print(method_counts[method_counts['Total'] >= 3])
Expected output showing recombination events detected across all genomes:
Recombinant |
3Seq |
Bootscan |
Chimaera |
MaxChi |
RDP |
Total |
|---|---|---|---|---|---|---|
Genome_1 |
0 |
2 |
1 |
1 |
2 |
6 |
Genome_2 |
0 |
3 |
1 |
4 |
9 |
17 |
Genome_3 |
1 |
1 |
0 |
0 |
1 |
3 |
Nucleotide Sequence Alignment
First, align the nucleotide sequences for recombination analysis:
cressent align \
--threads 24 \
--input_fasta data/Dai_naryaviridae_genome.fasta \
-o output/genome_align
Step 2: Capsid Protein Analysis
Sequence Clustering and Alignment
For larger datasets, you would typically cluster sequences first:
# Skip clustering for small datasets, but clean sequence names
cressent cluster \
-i data/Dai_naryaviridae_caps.faa \
-o output/caps/cluster
Database-Integrated Alignment
Align capsid sequences with the Naryaviridae database for phylogenetic context:
cressent align \
--threads 24 \
--input_fasta data/Dai_naryaviridae_caps.faa \
--db_family "Naryaviridae" \ # avoid using more than 1 family
--protein_type caps \
--db_path /path/to/databases \
-o output/caps/align_family
⚠️⚠️ CAUTION !!! ⚠️⚠️
Since CRESSENT was developed and tailored specifically for family-level ssDNA virus analysis, expanding the database to include multiple additional families may substantially increase processing time and computational cost. Therefore, use the full database with caution.
We recommend performing analyses at the family level whenever possible.
Phylogenetic Tree Construction
Build a phylogenetic tree using an appropriate evolutionary model:
cressent build_tree \
-i output/caps/align_family/Dai_naryaviridae_caps_aligned_trimmed_sequences.fasta \
-o output/caps/tree \
-m Q.pfam+F+G4
💡💡 Evolutionary models have already been precomputed by ModelFinder from IQ-TREE2, which can reduce both processing time and computational cost in this module. 💡💡
Tree Visualization
Create publication-ready tree visualizations:
# Basic circular tree
cressent plot_tree \
--tree output/caps/tree/Dai_naryaviridae_caps_aligned_trimmed_sequences.treefile \
-o output/caps/tree \
--metadata_1 output/caps/align_family/metadata.csv \
--metadata_2 output/caps/tree/Dai_naryaviridae_caps_aligned_trimmed_sequences_sanitized_name_table.tsv \
--layout circular \
--offset 0.15 \
--fig_width 20 --fig_height 15 \
--plot_name nary_caps_tree.pdf
# Tree with alignment visualization
cressent plot_tree \
--tree output/caps/tree/Dai_naryaviridae_caps_aligned_trimmed_sequences.treefile \
-o output/caps/tree \
--metadata_1 output/caps/align_family/metadata.csv \
--metadata_2 output/caps/tree/nary_caps_aligned_trimmed_sequences_sanitized_name_table.tsv \
--alignment output/caps/align_family/Dai_naryaviridae_caps_aligned_trimmed_sequences.fasta \
--layout rectangular \
--plot_tips False \
--plot_name nary_caps_tree_with_alignment.pdf
De Novo Motif Discovery
Discover conserved motifs in capsid proteins:
cressent motif_disc \
-i data/Dai_naryaviridae_caps.faa \
-o output/caps/motif_discovery \
-nmotifs 5 -minw 6 -maxw 10 \
--meme_extra "-mod zoops -evt 0.05" \
--scanprosite
Visualize the discovered motifs:
cressent motif_map_viz \
-f output/caps/motif_discovery/scanprosite_results.csv \
-o output/caps/motif_discovery
Step 3: Replication Protein Analysis
Alignment and Tree Construction
# Align replication proteins with database
cressent align \
--threads 24 \
--input_fasta data/Dai_naryaviridae_reps.faa \
--db_family "Naryaviridae" \
--protein_type reps \
--db_path /path/to/databases \
-o output/reps/align_family
# Build phylogenetic tree
cressent build_tree \
-i output/reps/align_family/Dai_naryaviridae_reps_aligned_trimmed_sequences.fasta \
-o output/reps/tree \
-m Q.yeast+G4
Known Motif Analysis
Search for the Walker A motif and split sequences at this position:
cressent motif \
-i output/reps/align_family/Dai_naryaviridae_reps_aligned_trimmed_sequences.fasta \
-o output/reps/motif \
-p ".{5}GK[TS].{4}" \
--remove-gaps \
--split-sequences \
--generate-logo --split-logo --ncol 2 \
--metadata output/reps/align_family/metadata.csv \
--group-label family
Understanding the outputs
Let’s examine the key files generated by our analysis:
Recombination Results
The recomb_results.csv file contains detailed information about detected recombination events:
Method |
Recombinant |
Major_Parent |
Minor_Parent |
Breakpoint_Start |
Breakpoint_End |
Pvalue |
|---|---|---|---|---|---|---|
RDP |
Genome_A |
Genome_B |
Genome_C |
245 |
678 |
0.023 |
Bootscan |
Genome_A |
Genome_B |
Genome_C |
240 |
685 |
0.034 |
Events detected by multiple methods (≥3) are considered highly reliable.
Phylogenetic Trees
The .treefile contains the phylogenetic tree in Newick format:
((Nary_001:0.1234,Nary_002:0.0987):0.0543,(Nary_003:0.2134,Database_seq:0.1876):0.0621);
Metadata Files
The metadata.csv file provides context for each sequence:
protein_id |
protein_description |
family |
scientific_name |
source |
|---|---|---|---|---|
Nary_001 |
Rep protein |
Naryaviridae |
Naryavirus sp. |
input |
DB_seq_1 |
Rep protein |
Naryaviridae |
Reference strain |
database |
Step 4: Advanced Domain Analysis
Domain-Specific Phylogenetics
After motif splitting, analyze individual protein domains:
# Align and build tree for domain 1 (Helicase)
cressent align \
--threads 24 \
--input_fasta output/reps/motif/split_sequences_1.fasta \
-o output/reps/domain1
cressent build_tree \
-i output/reps/domain1/split_sequences_1_aligned_trimmed_sequences.fasta \
-o output/reps/domain1
# Align and build tree for domain 2 (Endonuclease)
cressent align \
--threads 24 \
--input_fasta output/reps/motif/split_sequences_2.fasta \
-o output/reps/domain2
cressent build_tree \
-i output/reps/domain2/split_sequences_2_aligned_trimmed_sequences.fasta \
-o output/reps/domain2
Tanglegram Comparison
Compare phylogenies between different protein domains:
cressent tanglegram \
--tree1 output/reps/domain1/split_sequences_1_aligned_trimmed_sequences.treefile \
--tree2 output/reps/domain2/split_sequences_2_aligned_trimmed_sequences.treefile \
--label1 "Helicase Domain" \
--label2 "Endonuclease Domain" \
--output output/reps/comparison \
--name_tanglegram "domain_comparison.pdf" \
--height 11 --width 30
Sequence Logo Generation
Create sequence logos for conserved functional domains:
# Helicase domain logo
cressent seq_logo \
-i output/reps/domain1/split_sequences_1_aligned_trimmed_sequences.fasta \
-o output/reps/domain1 \
--output_name helicase_logo.pdf \
--method bits --width 15
# Endonuclease domain logo
cressent seq_logo \
-i output/reps/domain2/split_sequences_2_aligned_trimmed_sequences.fasta \
-o output/reps/domain2 \
--output_name endonuclease_logo.pdf \
--method prob --width 15
Quality Assessment
Evaluating Results
Recombination Analysis:
Focus on events detected by ≥3 methods
Consider p-values < 0.05 as significant
Manually inspect alignment around breakpoints
Phylogenetic Analysis:
Bootstrap values ≥70% indicate strong support
Check for reasonable branch lengths
Ensure biological relevance of groupings
Motif Analysis:
Verify motifs match known functional domains
Check conservation across sequences
Validate with literature when possible
Complete Analysis Script
Here’s a complete script that runs the entire analysis:
#!/bin/bash
# CRESSENT Naryaviridae Analysis Pipeline
echo "Starting CRESSENT analysis of Naryaviridae sequences..."
# Create output directories
mkdir -p output/{genome_align,recombination,caps,reps}
# 1. Genome-level analysis
echo "Step 1: Analyzing genome sequences..."
cressent align --threads 24 --input_fasta data/Dai_naryaviridae_genome.fasta -o output/genome_align
cressent recombination -i output/genome_align/nary_genomes_aligned_trimmed_sequences.fasta \
-o output/recombination -f recomb_results.csv --all
# 2. Capsid protein analysis
echo "Step 2: Analyzing capsid proteins..."
cressent align --threads 24 --input_fasta data/Dai_naryaviridae_caps.faa \
--db_family "Naryaviridae" --protein_type caps --db_path databases/ \
-o output/caps/align_family
cressent build_tree -i output/caps/align_family/Dai_naryaviridae_caps_aligned_trimmed_sequences.fasta \
-o output/caps/tree -m Q.pfam+F+G4
cressent plot_tree --tree output/caps/tree/Dai_naryaviridae_caps_aligned_trimmed_sequences.treefile \
-o output/caps/tree --layout circular --plot_name caps_tree.pdf
# 3. Replication protein analysis
echo "Step 3: Analyzing replication proteins..."
cressent align --threads 24 --input_fasta data/Dai_naryaviridae_reps.faa \
--db_family "Naryaviridae" --protein_type reps --db_path databases/ \
-o output/reps/align_family
cressent motif -i output/reps/align_family/Dai_naryaviridae_reps_aligned_trimmed_sequences.fasta \
-o output/reps/motif -p ".{5}GK[TS].{4}" --split-sequences --generate-logo
# 4. Domain comparison
echo "Step 4: Comparing protein domains..."
cressent align --threads 24 --input_fasta output/reps/motif/split_sequences_1.fasta \
-o output/reps/domain1
cressent build_tree -i output/reps/domain1/split_sequences_1_aligned_trimmed_sequences.fasta \
-o output/reps/domain1
cressent align --threads 24 --input_fasta output/reps/motif/split_sequences_2.fasta \
-o output/reps/domain2
cressent build_tree -i output/reps/domain2/split_sequences_2_aligned_trimmed_sequences.fasta \
-o output/reps/domain2
cressent tanglegram --tree1 output/reps/domain1/*.treefile \
--tree2 output/reps/domain2/*.treefile --label1 "Helicase" --label2 "Endonuclease" \
--output output/reps/comparison --name_tanglegram "domain_comparison.pdf"
echo "Analysis complete! Results are in the 'output' directory."
echo "Key files:"
echo "- Recombination: output/recombination/recomb_results.csv"
echo "- Trees: output/*/tree/*.treefile"
echo "- Visualizations: output/*/tree/*.pdf"
Next Steps
After completing this quickstart tutorial:
Explore individual modules for advanced parameter customization
Check the FAQ for common questions and troubleshooting
Try analysis with your own ssDNA virus sequences
This quickstart provides a solid foundation for using CRESSENT effectively. The modular design allows you to adapt this workflow for different viral families and research questions.