bcbio-nextgen in the Nectar Research Cloud
- System Requirements
- Package Configuration
- Ubuntu - APT
- CentOS - YUM
- Common Environment Configuration
- git - configure
- Create the Tool directory
- Access to Ephemeral storage for reference data storage
- Installing bcbio-nextgen
- Getting Started with bcbio-nextgen
Blue Collar Bioinformatics (bcbio) is a set of community-developed tools for variant calling, RNA-seq, and small RNA analysis. The Python toolkit bcbio-nextgen provides best-practice pipelines for fully automated high-throughput sequencing analysis. The goal of bcbio-nextgen is to provide pipelines that are quantifiable, analyzable, scalable, reproducible, community developed, and accessible. It is currently used at institutions around the world, including the Harvard School of Public Health, the Science for Life Laboratory, the Institute of Human Genetics at UCSF, and the IRCCS "Mario Negri" Institute for Pharmacological Research. The bcbio case studies show how it is being applied.
This guide helps with local configuration of the default Nectar images (CentOS7, Ubuntu 15.04) so that the automated bcbio-nextgen scripts can be used. It is not intended to be a user guide for bcbio-nextgen; the bcbio-nextgen project provides excellent documentation and tutorials for learning bcbio.
System Requirements
Installing bcbio-nextgen takes approximately 2.6GB of disk space and fits easily on the root disk of all m1 and most m2 Nectar flavors. The reference genome data, however, is over 38GB once installed and requires over 45GB of space during installation. You will therefore need to set the data directory path to either your ephemeral storage or an attached volume of sufficient size. The m1.xlarge, m1.xxlarge and m2.xlarge flavors all have more than enough ephemeral disk space for the reference data. Ephemeral storage is exactly that, ephemeral, and is ideally used as scratch space. It is suitable when evaluating the bcbio pipelines, testing workflows, or running short-term projects that do not require a large amount of storage.
Persistent volume or object storage is better suited to the reference genome data, as it exists outside of your instances and can be used across multiple instances or projects. More information about ephemeral, volume, and object storage can be found here: https://support.nectar.org.au/support/solutions/articles/6000055382-introduction-to-cloud-storage
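The space requirement described above can be checked before starting an install. The following is a minimal sketch, not part of bcbio-nextgen: the function name has_free_space is hypothetical, and the 45GB threshold and the /mnt path are taken from this guide.

```shell
# Sketch: check that the filesystem holding a directory has enough free space.
# Usage: has_free_space DIR REQUIRED_GB
# Exit status is 0 when the available space is at least REQUIRED_GB.
has_free_space() (
    dir=$1
    required_gb=$2
    # df -Pk prints POSIX-format output in 1 KiB blocks; column 4 is "Available".
    avail_kb=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    required_kb=$((required_gb * 1024 * 1024))
    [ "$avail_kb" -ge "$required_kb" ]
)

# The reference data needs over 45GB while it is being installed.
if has_free_space /mnt 45; then
    echo "/mnt has room for the reference data"
else
    echo "/mnt is too small or unavailable; use a larger volume" >&2
fi
```

The same function can be pointed at the mount point of an attached volume instead of /mnt.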
Package Configuration
The default Nectar images are very lightweight and require additional packages to be installed. This section covers the minimum configuration needed to run the bcbio_nextgen_install.py script on the Ubuntu 15.04 and CentOS7 images.
Ubuntu - APT
The Advanced Packaging Tool (APT) on Ubuntu provides the apt-get command-line tool. The apt-get command downloads the missing packages and all of their dependencies from the software repositories to your instance and installs them on your root disk, making the software available to your default user environment as soon as installation completes. The minimum configuration requires:
$ sudo apt-get update
$ sudo apt-get install git
$ sudo apt-get install ruby
$ sudo apt-get install perlbrew
$ sudo apt-get install gcc
$ sudo apt-get install g++
$ sudo apt-get install gfortran
$ sudo apt-get install unzip
$ sudo apt-get install openjdk-8-jdk
$ sudo apt-get install xorg
CentOS - YUM (DNF coming soon)
CentOS uses YUM (Yellowdog Updater, Modified) for package management. It is a command-line tool used to install, update and remove packages in your default user environment.
$ sudo yum update
$ sudo yum install git
$ sudo yum install ruby
$ sudo yum install gcc
$ sudo yum install gcc-c++
$ sudo yum install gcc-gfortran
$ sudo yum install unzip
$ sudo yum install java-1.8.0-openjdk
$ sudo yum install xorg-x11-xauth
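Once the packages are installed on either image, it can be useful to confirm that the tools are actually on your PATH. The following is a minimal sketch: the function name missing_tools is hypothetical, and the command names checked (git, ruby, gcc, g++, gfortran, unzip, java) are the executables these packages are expected to provide.

```shell
# Sketch: print each named command that cannot be found on PATH.
missing_tools() (
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
)

# Commands expected from the packages listed for both images
# (java is provided by the OpenJDK package).
missing=$(missing_tools git ruby gcc g++ gfortran unzip java)
if [ -n "$missing" ]; then
    # $missing is left unquoted so the names print on one line.
    echo "Still missing:" $missing >&2
else
    echo "All required tools found"
fi
```

Any name it reports points to a package that still needs to be installed with apt-get or yum.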
Common Environment Configuration
With the necessary packages added to your instance, you will need to configure your environment and ensure that the tool and data directories have the correct permissions and ownership.
git - configure
You will need to add your email and user name to your local git configuration. This is a simple procedure as shown.
$ git config --global user.email "jane.doe@best_uni.edu.au"
$ git config --global user.name "Jane Doe"
Create the Tool directory
Create a new directory in the root (/) directory where bcbio_nextgen_install.py will install the source and all executables. Then change the ownership so that your default user has access to the new directory.
Note: You need to use sudo to do this.
Ubuntu:
$ sudo mkdir /nectar
$ sudo chown -R ubuntu /nectar
CentOS:
$ sudo mkdir /nectar
$ sudo chown -R ec2-user /nectar
Access to Ephemeral storage for reference data storage
On the CentOS and Ubuntu images your ephemeral storage is already configured and mounted at /mnt. However, it is still owned by root, so you must change its ownership as well to be able to write to it.
Ubuntu:
$ sudo chown -R ubuntu /mnt
$ mkdir -p /mnt/bcbio-data
CentOS:
$ sudo chown -R ec2-user /mnt
$ mkdir -p /mnt/bcbio-data
Note: Once you have mounted your volume storage, the procedure is the same as for ephemeral storage.
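Before running the installer, you may want to confirm that the ownership changes took effect. The following is a minimal sketch that checks the tool and data directories used in this guide; the function name dir_ready is hypothetical.

```shell
# Sketch: succeed if a directory exists and the current user can write to it.
dir_ready() (
    dir=$1
    [ -d "$dir" ] && [ -w "$dir" ]
)

# /nectar is the tool directory and /mnt/bcbio-data the data directory.
for dir in /nectar /mnt/bcbio-data; do
    if dir_ready "$dir"; then
        echo "$dir: ok"
    else
        echo "$dir: missing or not writable by $(id -un)" >&2
    fi
done
```

If a directory is reported as not writable, repeat the chown step for your image's default user.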
Installing bcbio-nextgen
Installing bcbio is straightforward on the CentOS and Ubuntu images once the previous steps have been completed. The command used on instances running on Nectar is shown here:
$ python bcbio_nextgen_install.py /mnt/bcbio-data --tooldir=/nectar/bcbio \
    --genomes GRCh37 --aligners bwa --aligners bowtie2
The reference data takes a couple of hours to install, so be patient. You can use nohup (no hangup) to let the install keep running in the background after you log out of your instance.
$ nohup python bcbio_nextgen_install.py /mnt/bcbio-data --tooldir=/nectar/bcbio \
    --genomes GRCh37 --aligners bwa --aligners bowtie2 &
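If you would rather keep the output in a named log file and record the process ID, the nohup call can be wrapped in a small helper. This is only a sketch: the function name run_detached and the log file name install.log are assumptions, not part of bcbio-nextgen.

```shell
# Sketch: start a long-running command under nohup, send its output to a
# log file, and print the ID of the background process.
# Usage: run_detached LOGFILE COMMAND [ARGS...]
run_detached() {
    logfile=$1
    shift
    nohup "$@" >"$logfile" 2>&1 &
    echo $!
}

# Hypothetical usage with the install command from this guide:
#   pid=$(run_detached install.log python bcbio_nextgen_install.py /mnt/bcbio-data \
#       --tooldir=/nectar/bcbio --genomes GRCh37 --aligners bwa --aligners bowtie2)
#   tail -f install.log                       # follow progress
#   kill -0 "$pid" && echo "still running"    # check that the process is alive
```

Redirecting to a named file avoids relying on nohup's default nohup.out and makes it easier to keep logs from separate runs apart.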
If all goes well you will see something like this...
Creating manifest of installed packages in /mnt/bcbio/manifest
Third party tools upgrade complete.
Upgrade completed successfully.
Finished: bcbio-nextgen, tools and data installed
Genome data installed in:
Tools installed in:
Ready to use system configuration at:
Edit configuration file as needed to match your machine or cluster
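When the install runs in the background, you can check the log for the final success line instead of watching the terminal. The following is a minimal sketch: the function name install_finished is hypothetical, and nohup.out is the file nohup writes to by default when its output is not redirected.

```shell
# Sketch: succeed if the installer's final success line appears in a log file.
install_finished() {
    grep -q "Finished: bcbio-nextgen, tools and data installed" "$1"
}

# Errors from a missing log file are suppressed; they count as "not finished".
if install_finished nohup.out 2>/dev/null; then
    echo "bcbio-nextgen install completed"
else
    echo "install still running or failed; check the log" >&2
fi
```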
You should now be able to proceed to the next section and begin using bcbio!