I was invited to participate in a project to uncover the presence of a group of bacterial lineajes from the genus Clavibacter in different cultivars. As planned by my Sensei (Professor Nelly Selem-Mojica), we will use both approximation of metagenomics (i.e. Metabarcoding and Shotgun) to try to obtain some answers from the questions that were formulated because of the widely presence of these lineages.
I want to start this project with the exploration of some of the shotgun data. The next two lists of data was given by the team:
Plant | Number of samples | Type of data | Link to the data | Paper |
---|---|---|---|---|
Solanum lycopersicum | 13 | Shotgun | Link | Barajas et al., 2020 |
Zea mays | 5 | Shotgun | Link | Xiong et al., 2021 |
Zea mays | 5 | Shotgun | Link | Akinola et al., 2021 |
Solanum tuberosum | 20 | Shotgun | Link | Shi et al., 2019 |
Solanum tuberosum | 39 | 16S | Link | Shi et al., 2019 |
Capsicum annuum | 12 | Shotgun | Link | Choi et al., 2020 |
Solanum lycopersicum in Capsicum annuum soils | 15 | Shotgun | Link | Newberry et al., 2020 |
Soil of Zea mays and Triticum rotation | 27 | Shotgun | Link | X. Wu et al., 2021 |
Triticum | - | - | No data provided by authors of the paper Quiza et al., 2021 | Quiza et al., 2021 |
Zea mays grown in Triticum soil | 42 | 16S | Link | M. Wu et al., 2021 |
Plant | Number of samples | Type of data | Link to the data |
---|---|---|---|
Medicago sativa | 18 | 16S | Link |
Zea mays | 41 | Shotgun | Link |
Capsicum annuum | 4 | Shotgun | Link |
Titricum | 54 | Shotgun | Link |
I will use the data from Choi-2020 to begin with the analysis.
I have created a folder structure inside clavibacter
main folder as follows:
$ tree
.
├── 16S
└── shotgun
├── capsicum
├── lycopersicum
├── triticum
├── tuberosum
└── zea-mays
Inside the capsicum
folder, I have created the next folder organization:
$ tree -L 1
.
├── choi-2020
├── metadata
├── miscelaneous-capsicum
└── newberry-2020
4 directories, 0 files
I will enter to the choi-2020
folder and create a metadata/
folder and
allocated all the metadata information right there to give structure to out
working directory.
Let’s first downloaded them from the SRA repository in NCBI and move it into the metadata/
folder.
I will download the metadata table and the Accesion table. With the Accseion table I will use SRA-toolkit to download the shotgun reads.
This files are located inside the its portion of the data
folder.
I will use the next command to obtain the reads:
$ cat metadata/SRR_Acc_List.txt | while read line; do fasterq-dump $line -S -p -e 12 -o $line; done
$ mkdir reads/
$ mv *.fastq reads/
$ ls reads/*.fastq | wc -l
24
Now, we have both the forward and reverse reads to begin to work with them.
Next, I want to explore the diversity inside each of these metagenomes. I will use kraken2 to this purpose. I want to use the information
inside the metadata table SraRunTable.txt
to run the kraken commands, and also
to use it to run all the other programs that I will be using along this analysis.
Let’s see the structure of the file:
$ head -n 5 metadata/SraRunTable.txt
Run,Assay Type,AvgSpotLen,Bases,BioProject,BioSample,BioSampleModel,Bytes,Center Name,Collection_date,Consent,DATASTORE filetype,DATASTORE provider,DATASTORE region,Experiment,geo_loc_name_country,geo_loc_name_country_continent,geo_loc_name,HOST,Instrument,Isolation_Source,Lat_Lon,Library Name,LibraryLayout,LibrarySelection,LibrarySource,Organism,Platform,ReleaseDate,replicate,Sample Name,SRA Study
SRR12778013,OTHER,302,20857924148,PRJNA667562,SAMN16378122,Metagenome or environmental,6662690206,CHINESE ACADEMY OF SCIENCES,2018-08-23,public,"fastq,sra","gs,ncbi,s3","gs.US,ncbi.public,s3.us-east-1",SRX9247618,China,Asia,China: Huishui,pepper,Illumina NovaSeq 6000,SU_epidermis_D,not applicable,10,PAIRED,other,METAGENOMIC,plant metagenome,ILLUMINA,2020-11-05T00:00:00Z,biological replicate 1,Sample_106,SRP286471
SRR12778014,OTHER,302,24003640104,PRJNA667562,SAMN16378121,Metagenome or environmental,7547690247,CHINESE ACADEMY OF SCIENCES,2018-08-23,public,"fastq,sra","gs,ncbi,s3","gs.US,ncbi.public,s3.us-east-1",SRX9247617,China,Asia,China: Huishui,pepper,Illumina NovaSeq 6000,SU_epidermis_H,not applicable,9,PAIRED,other,METAGENOMIC,plant metagenome,ILLUMINA,2020-11-05T00:00:00Z,biological replicate 3,Sample_99,SRP286471
SRR12778015,OTHER,302,20960803468,PRJNA667562,SAMN16378120,Metagenome or environmental,6757603248,CHINESE ACADEMY OF SCIENCES,2018-08-23,public,"fastq,sra","gs,ncbi,s3","gs.US,ncbi.public,s3.us-east-1",SRX9247616,China,Asia,China: Huishui,pepper,Illumina NovaSeq 6000,SU_epidermis_H,not applicable,8,PAIRED,other,METAGENOMIC,plant metagenome,ILLUMINA,2020-11-05T00:00:00Z,biological replicate 2,Sample_98,SRP286471
SRR12778016,OTHER,302,22817685198,PRJNA667562,SAMN16378119,Metagenome or environmental,7251222351,CHINESE ACADEMY OF SCIENCES,2018-08-23,public,"fastq,sra","gs,ncbi,s3","gs.US,ncbi.public,s3.us-east-1",SRX9247615,China,Asia,China: Huishui,pepper,Illumina NovaSeq 6000,SU_epidermis_H,not applicable,7,PAIRED,other,METAGENOMIC,plant metagenome,ILLUMINA,2020-11-05T00:00:00Z,biological replicate 1,Sample_97,SRP286471
If I use the next piece of code, I can obtain the first column of all the rows,
which is the information inside the Run
column, the same name that each forward and reverse reads files has.
$ cat metadata/SraRunTable.txt| sed -n '1!p' | while read line; do read=$(echo $line | cut -d',' -f1); echo $read;done
SRR12778013
SRR12778014
SRR12778015
SRR12778016
SRR12778017
SRR12778018
SRR12778019
SRR12778020
SRR12778021
SRR12778022
SRR12778023
SRR12778024
One of the great goals of this project is to create a program that can take the
information that I obtained in this little chapter and autimatically do all the
needed steps to process the data. I will do the first lines of that program here.
This first little program will take the information found inside the
SraRunTable.txt
and will download the reads from all the libraries listed
inside. I will name it down-reads.sh
. This and all the scripts will be locaten
inside the scripts-folder for this repository:
$ cat down-reads.sh
#!/bin/sh
# This is a program that is going to pick a SraRunTable of metadata and
#extract the run label to download, trim and move the libraries information.
# This program requires that you give 1 input data. 1) where this
#SraRunTable is located.
metd=$1 #Location to the SraRunTable.txt
root=$(pwd) #Gets the path to the directory of this file, on which the outputs ought to be created
# Now we will define were the reads are:
runs='reads'
# CREATING NECCESARY FOLDERS
mkdir reads
# DOWNLOADING THE DATA
#Let's use the next piece of code to download the data
cat $metd | sed -n '1!p' | while read line; do read=$(echo $line | cut -d',' -f1); fasterq-dump -S $read -p -e 8 -o $read ; done
mv *.fastq reads/
# The -e flag can be customized. This indicates the number of threads used to do this task.
# MANAGING THE DATA
# We will change the names of the reads files. They have a sufix that makes impossible
#to be read in a loop
ls $runs | while read line ; do new=$(echo $line | sed 's/_/-/g'); mv $runs/$line $runs/$new; done
# Now, we will create a file where the information of the run labes can be located
cat $metd | sed -n '1!p' | while read line; do read=$(echo $line | cut -d',' -f1); echo $read ; done > run-labels.txt
mv run-labels.txt metadata/
I will like to write about some of the steps of this program. According to my
experience, the presence of a _
in the names can cause issues with some programs. This is why the second step of this script changes the names fo all the downloaded
files. The last line is to create a file where we can locate the name of each
library. This information will be located inside the run-labels.txt
file.