I have already installed Conda (the Miniconda distribution) on the cluster, in the following directory:
$ /hpc/group/bio1/diego/miniconda3
Next, we need to add the channels that host most of the packages used in biology. Otherwise, searching for these packages will return no results:
$ conda config --add channels conda-forge
$ conda config --add channels bioconda
We can proceed to install mamba, a tool that will help me resolve compatible dependencies for each of the packages that I am going to install:
$ conda install -c conda-forge mamba
This setup will allow me to create a new environment where I will install Prokka and Antismash. I will call this environment gmining, from genome mining:
$ mamba create -n gmining antismash prokka
After installing these packages, we need to download the databases that Antismash depends on, as specified on its page:
$ conda activate gmining
$ download-antismash-databases
After these steps we are all set for the first step of the annotation.
Prokka is a whole-genome annotation tool designed for prokaryotic genomes. Annotating is the first step when working with a bin or metagenome-assembled genome (MAG).
Prokka produces a set of output files; the next image (borrowed from the Prokka page) shows the information that these files contain.
I will be working in the next directory:
Taking a random bin from the ones that Carlos shared with me (P2039_bin.23.fa), I was able to annotate it using Prokka. There was a problem with the default behaviour of this program: the .gbk file that was generated merged the contig name and the contig length, which causes a parsing error in programs such as Antismash. I found that the --compliant flag makes the locus layout comply with the layout that NCBI expects. The next line is an example with the P2039_bin.23.fa bin:
$ prokka --prefix P2039 --outdir P2039 --kingdom Bacteria --genus Nostoc \
--strain P2039 --usegenus --addgenes --metagenome --compliant --cpus 12 carlos-bins/P2039_bin.23.fa
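The merged name and length can be spotted before handing the file to another program. The following is a minimal sketch, assuming the usual GenBank header layout in which the third field of a LOCUS line is the sequence length; the function name check_locus is hypothetical and is not part of Prokka or Antismash.

```shell
#!/bin/bash
# Print every LOCUS line whose third field is not a plain number.
# In a well-formed header the fields are: LOCUS <name> <length> bp ...
# When the name and the length are fused, the third field becomes "bp".
# Usage: check_locus <file.gbk>
check_locus() {
    awk '$1 == "LOCUS" && $3 !~ /^[0-9]+$/ {
        print FILENAME ": line " FNR ": " $0
    }' "$1"
}
```

Running check_locus P2039/P2039.gbk should print nothing when every header is well formed; any line it prints is a header worth inspecting.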
After Prokka completes the run, we will have a new set of files inside the P2039 directory:
P2039.err P2039.ffn P2039.fsa P2039.gff P2039.sqn P2039.tsv
P2039.faa P2039.fna P2039.gbk P2039.log P2039.tbl P2039.txt
All of them are useful for different analyses, but we will be working with the .gbk files.
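Before moving on, it can be worth confirming that a run left all twelve files behind. This is only a sketch: the list of extensions is copied from the listing above, and check_prokka_outputs is a hypothetical helper name.

```shell
#!/bin/bash
# Report which of the expected Prokka output files are missing.
# The extensions are the ones shown in the directory listing.
# Usage: check_prokka_outputs <outdir> <prefix>
check_prokka_outputs() {
    local outdir=$1 prefix=$2 ext missing=0
    for ext in err faa ffn fna fsa gbk gff log sqn tbl tsv txt; do
        if [ ! -e "$outdir/$prefix.$ext" ]; then
            echo "missing: $outdir/$prefix.$ext"
            missing=$((missing + 1))
        fi
    done
    # Succeed only when nothing is missing
    [ "$missing" -eq 0 ]
}
```

For the example above the call would be check_prokka_outputs P2039 P2039, which prints one line per absent file and returns a non-zero status if any is absent.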
Having completed a successful annotation with Prokka, we can now take the next step and use Antismash to annotate the secondary metabolism of the bins provided. We will run an example with the already annotated bin P2039:
$ antismash P2039/P2039.gbk --output-dir P2039-antis
The process may take a while depending on the resources that you are using. In the end we will have a P2039-antis folder with the following output files:
IFKOFEHL_1.region001.gbk IFKOFEHL_16.region001.gbk IFKOFEHL_6.region002.gbk html
IFKOFEHL_1.region002.gbk IFKOFEHL_3.region001.gbk IFKOFEHL_6.region003.gbk images
IFKOFEHL_10.region001.gbk IFKOFEHL_36.region001.gbk P2039.gbk index.html
IFKOFEHL_10.region002.gbk IFKOFEHL_40.region001.gbk P2039.json js
IFKOFEHL_11.region001.gbk IFKOFEHL_48.region001.gbk P2039.zip regions.js
IFKOFEHL_12.region001.gbk IFKOFEHL_6.region001.gbk css svg
To look at the contents of this folder in detail, let’s download it to our own computer (assuming that you ran Antismash on the cluster) and open the index.html file in the web browser of your choice. The page that you will find is the next one:
The colored rectangles in the upper part represent the biosynthetic gene clusters (BGCs) that Antismash found in the P2039 bin. Each one is named after the location inside the bin where the BGC was found, so the .gbk file that holds the information of the first cluster is IFKOFEHL_1.region001.gbk.
The same applies to the other 13 BGCs. If we click on one of the squares, let’s say the first one, the page will display a new set of marvels:
Here we see the genes that are part of the BGC represented as arrows. We can also access information on the length, sequence and identity of some of these genes.
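The regions can also be counted from the command line without opening the report. The sketch below assumes that each BGC is written as one <contig>.regionNNN.gbk file, as in the folder listing; the function names are hypothetical.

```shell
#!/bin/bash
# Count the region files that Antismash wrote to an output folder.
# Usage: count_regions <antismash-output-dir>
count_regions() {
    find "$1" -maxdepth 1 -name '*.region[0-9]*.gbk' | wc -l | tr -d ' '
}

# Print "<number of regions> <contig>" for every contig with at least one region.
# Usage: regions_per_contig <antismash-output-dir>
regions_per_contig() {
    find "$1" -maxdepth 1 -name '*.region[0-9]*.gbk' -exec basename {} \; |
        sed 's/\.region[0-9]*\.gbk$//' | sort | uniq -c | sed 's/^ *//'
}
```

Calling count_regions P2039-antis gives the total, and regions_per_contig P2039-antis shows which contigs carry more than one BGC.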
We have done this for one of the bins, but we would love to automate the process. Now that we have managed to annotate a bin with both Prokka and Antismash, we can create a script that submits a job to the cluster (Duke University) and annotates all the bins that we want:
#!/bin/bash
#Date of creation: 08/29/2022
#This script will be used to run the annotation of bacterial genomes with Prokka and
#Antismash.
# This little program requires that the user gives the next three parameters:
lgem=$1 #Folder location of the genomes that you want to annotate
pref=$2 #Prefix that you would like for the analysis to have
gus=$3 #If all the genomes that you are going to annotate are from the same genus, put that here
#Here we will declare a variable to hold the current directory:
root=$(pwd) #Gets the path to the current working directory, in which the outputs ought to be created
sign='$'
#General directories
mkdir -p $pref-gAnot
mkdir -p $pref-gAnot/logs/script
# Directories for the Prokka analysis
mkdir -p $pref-gAnot/prokka
# Directories for the Antismash analysis
mkdir -p $pref-gAnot/antismash
cat <<EOF > runAnnotation.sh
#!/bin/bash
#SBATCH --mem-per-cpu=16G #The RAM memory that will be assigned to each CPU
#SBATCH -c 16 #The number of threads to be used in this script
#SBATCH --output=$pref-gAnot/logs/$pref-annot.out #A file with the output information will be generated in the location indicated
#SBATCH --error=$pref-gAnot/logs/$pref-annot.err #If an error occurs, a file will be created in the location indicated
#SBATCH --partition=common
source $(conda info --base)/etc/profile.d/conda.sh
conda activate gmining
ls $lgem | while read line; do gen=${sign}(echo ${sign}line | cut -d'_' -f1);
prokka --prefix ${sign}gen --outdir $pref-gAnot/prokka/${sign}gen --kingdom Bacteria --genus $gus \
--strain ${sign}gen --usegenus --addgenes --metagenome --compliant --cpus 16 \
$lgem${sign}gen*;
antismash --asf --pfam2go $pref-gAnot/prokka/${sign}gen/*.gbk -c 16 \
--output-dir $pref-gAnot/antismash/${sign}gen --output-basename ${sign}gen \
--html-title report-${sign}gen --html-start-compact;
done
EOF
sbatch runAnnotation.sh
mv runAnnotation.sh $pref-gAnot/logs/script
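The script writes ${sign} wherever a dollar sign has to reach runAnnotation.sh untouched, because an unquoted here-document expands variables while the file is being generated. As a minimal sketch of the same mechanism, a backslash in front of the dollar sign has an equivalent effect; make_job is a hypothetical name used only for this illustration and is not the script above.

```shell
#!/bin/bash
# Write a tiny job script to stdout.
# Variables without a backslash ($pref) are expanded now, while the
# generator runs. \$ is written as a literal $ and is only expanded later,
# when the generated script itself runs.
# Usage: make_job <prefix>
make_job() {
    local pref=$1
    cat <<EOF
#!/bin/bash
for f in *.fa; do
    gen=\$(echo "\$f" | cut -d'_' -f1)
    echo "$pref-gAnot/prokka/\$gen"
done
EOF
}
```

Printing the result with make_job nostoc shows the prefix already filled in, while $f and $gen remain as variables for the job to resolve.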
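The script asks for three inputs: the folder where the .fasta files of the bins are located, the prefix that the analysis will carry, and the genus of the genomes (if they all belong to the same genus).
I will run the script with the next line of code inside the working directory /hpc/group/bio1/diego/cyanos-Carlos: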
$ sh annot.sh carlos-bins/ nostoc Nostoc
At the end of the run (which takes about 17 hours), we will have the next set of output folders:
$ ls nostoc-gAnot/antismash/
3 NMS4 P10247 P12502 P12569 P12642 P1574 P2162 P5023 P6636 P8571 P9119 S10 S44 X2
8274 NMS5 P10264 P12521 P12570 P12646 P2037 P2164 P539 P6963 P8575 P9121 S12 S5
8277 NMS7 P1030 P12523 P12573 P12649 P2039 P2170 P607 P6970 P8577 P943 S13 S51
JL23 NMS8 P10324 P12537 P12578 P12650 P2081 P2180 P6447 P8202 P8580 P9639 S24 S52
JL31 NMS9 P10894 P12545 P12584 P14213 P2083 P2213 P6465 P8219 P8690 P9728 S27 S62
JL33 NOS P11060 P12559 P12588 P14264 P2090 P2224 P6524 P8231 P8768 P9820 S31 S66
JL34 P10073 P11388 P12560 P12591 P14269 P2115 P3034 P6600 P8256 P8840 P9822 S32 S67
NMS1 P10160 P11425 P12564 P12639 P14318 P2123 P3068 P6602 P8569 P8857 P9895 S40 S8
NMS2 P10246 P1229 P12567 P12641 P14321 P2152 P330 P6620 P8570 P8926 S1 S43 S9
We have completed the first step, which is annotation.
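With this many folders it is easier to let the shell report which bins ended up without a report. This is a sketch under one assumption: a run that finished correctly leaves an index.html inside its folder, as in the P2039-antis example. The name missing_reports is hypothetical.

```shell
#!/bin/bash
# List the sample folders that contain no index.html report.
# Usage: missing_reports <antismash-output-root>
missing_reports() {
    local dir
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue
        # Print the folder name only when the report is absent
        [ -e "${dir}index.html" ] || basename "$dir"
    done
}
```

Here the call would be missing_reports nostoc-gAnot/antismash, which prints one bin name per line.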
After inspecting the results from the script submitted to the cluster, I noticed that three bins (NMS5, P12639 and P6970) do not have an Antismash output.
I tried to re-annotate them using the next little script:
#!/bin/bash
#SBATCH --mem-per-cpu=16G #The RAM memory that will be assigned to each CPU
#SBATCH -c 16 #The number of threads to be used in this script
#SBATCH --output=nostoc-gAnot/logs/some-nostoc.out #A file with the output information will be generated in the location indicated
#SBATCH --error=nostoc-gAnot/logs/some-nostoc.err #If an error occurs, a file will be created in the location indicated
#SBATCH --partition=common
source /hpc/group/bio1/diego/miniconda3/etc/profile.d/conda.sh
conda activate gmining
for i in NMS5 P12639 P6970; do
antismash --asf --pfam2go nostoc-gAnot/prokka/$i/*.gbk -c 16 --output-dir nostoc-gAnot/antismash/$i --output-basename $i --html-title report-$i --html-start-compact
done
In the end, I obtained the same result. I realized that I had made a mistake: there are two bins with the name P12639. Because of how I wrote the script that submits them to the cluster, Prokka put only one output in the P12639 folder. Using the next script I repeated the process for these two bins:
#!/bin/bash
#SBATCH --mem-per-cpu=16G #The RAM memory that will be assigned to each CPU
#SBATCH -c 16 #The number of threads to be used in this script
#SBATCH --output=nostoc-gAnot/logs/some-nostoc.out #A file with the output information will be generated in the location indicated
#SBATCH --error=nostoc-gAnot/logs/some-nostoc.err #If an error occurs, a file will be created in the location indicated
#SBATCH --partition=common
source /hpc/group/bio1/diego/miniconda3/etc/profile.d/conda.sh
conda activate gmining
ls carlos-bins/P12639* | while read line; do file=$(echo $line | cut -d'/' -f2 | cut -d'.' -f1,2); #Keeps the bin number in the name, e.g. P12639_bin.2
prokka --prefix $file --outdir nostoc-gAnot/prokka/$file --kingdom Bacteria --genus Nostoc \
--strain $file --usegenus --addgenes --metagenome --compliant --cpus 16 \
carlos-bins/$file*;
antismash --asf --pfam2go nostoc-gAnot/prokka/$file/*.gbk -c 16 --output-dir nostoc-gAnot/antismash/$file --output-basename $file --html-title report-$file --html-start-compact;
done
This indeed generated two different outputs. Nevertheless, neither of them produced any output with Antismash.
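The name collision described above can be caught before submitting a job. The sketch below repeats the rule used by the submission script (the sample name is the text before the first underscore) and prints every name shared by more than one file; duplicated_prefixes is a hypothetical helper name.

```shell
#!/bin/bash
# Print every sample name that more than one bin file maps to when the
# name is taken as the text before the first underscore.
# Usage: duplicated_prefixes <folder-with-bins>
duplicated_prefixes() {
    local f
    for f in "$1"/*; do
        [ -e "$f" ] || continue
        basename "$f" | cut -d'_' -f1
    done | sort | uniq -d
}
```

An empty result from duplicated_prefixes carlos-bins would mean that every bin gets its own output folder.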
To check the status of these bins, I used the little program info-pbins.sh to extract some important information about them:
#!/bin/bash
#Header of the summary table
echo -e "Bin\tCompleteness\tContamination\t#Bins" > problematic-bins.txt
for i in NMS5 P12639_bin.2 P12639_bin.3 P6970; do
#Columns 14, 15 and 19 of the metadata table are read for each bin
comp=$(grep "$i" metadata/fullqc_set8.csv | cut -d',' -f14)
cont=$(grep "$i" metadata/fullqc_set8.csv | cut -d',' -f15)
assm=$(grep "$i" metadata/fullqc_set8.csv | cut -d',' -f19)
echo -e "$i\t$comp\t$cont\t$assm" >> problematic-bins.txt
done
$ sh info-pbins.sh
$ cat problematic-bins.txt
Bin Completeness Contamination #Bins
NMS5 54.41 1.19 1170
P12639_bin.2 84.48 32.99 2883
P12639_bin.3 30.52 12.93 922
P6970 91.12 0.96 1463
As can be seen in the results, each of these bins has too many contigs (> 500) to be processed by Antismash, so I will leave them out of future analyses until we can obtain better assemblies.
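The same check can be made directly on the FASTA files, without the metadata table. This is a minimal sketch that assumes the bins carry the .fa extension and that every contig starts with a ">" header line; the default limit of 500 is the figure quoted above, and fragmented_bins is a hypothetical name.

```shell
#!/bin/bash
# Print "<file> <number of contigs>" for every FASTA file whose contig
# count is above the limit (500 unless a second argument is given).
# Usage: fragmented_bins <folder-with-bins> [limit]
fragmented_bins() {
    local f n limit=${2:-500}
    for f in "$1"/*.fa; do
        [ -e "$f" ] || continue
        # Each contig contributes exactly one header line starting with ">"
        n=$(grep -c '^>' "$f")
        [ "$n" -gt "$limit" ] && echo "$f $n"
    done
    return 0
}
```

Running fragmented_bins carlos-bins before the annotation would flag the bins that are likely to need a better assembly.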