hotsub

CLI tool to run batch jobs with the ETL/ExTL framework on AWS or other cloud services.

View the Project on GitHub github.com/otiai10/hotsub

This page describes what ETL is and why ExTL is needed.

What is “ETL”?

“ETL” (Extract, Transform, Load) is a widely used data processing model, not only in bioinformatics, in which the data source, the computing resources, and the output destination are kept separate.
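As a minimal sketch of the model, the three phases can be imitated on a single machine. Local directories stand in for the data source, the computing VM, and the output destination; every path and the `tr` transform are placeholders chosen for illustration, not part of hotsub.

```shell
#!/bin/sh
# Hypothetical stand-ins: three local directories play the roles of
# the data source, the computing VM, and the output destination.
workdir=$(mktemp -d)
mkdir -p "$workdir/source" "$workdir/vm" "$workdir/destination"
printf 'acgt\n' > "$workdir/source/sample.txt"

# Extract: copy the input from the data source onto the computing VM.
cp "$workdir/source/sample.txt" "$workdir/vm/input.txt"

# Transform: run the workload on the VM (here, a trivial upper-casing).
tr 'a-z' 'A-Z' < "$workdir/vm/input.txt" > "$workdir/vm/output.txt"

# Load: copy the result to the output destination.
cp "$workdir/vm/output.txt" "$workdir/destination/result.txt"

cat "$workdir/destination/result.txt"
```

Because the three locations are separate, the VM in the middle can be created only for the duration of the job and discarded afterwards.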

// TODO: REFERENCE HERE

“On-demand” ETL on Cloud

On the cloud, ETL can be an effective way to use computing resources and file storage.

See animation of how “on-demand” ETL works ->

Huge reference files matter

This works well in general, but in bioinformatics, reference files are often too large to be handled as input files that every VM downloads.

“ExTL”: Extended ETL Framework

We therefore propose the Extended ETL Framework, ExTL, to reduce the redundant network traffic and instance time spent downloading such huge reference files.

Just add the --shared flag to the hotsub command to make a resource shared.

$ hotsub run \
  --tasks  ./my-samples.csv \
  --script ./my-workflow.sh \
  --image  otiai10/STAR-alignment \
  --shared REFERENCE=s3://bucket/huge/reference

Advantage of ExTL over ETL

Because “ExTL” creates an additional instance besides the computing VM instances, it pays off when the number of concurrent jobs is large; in that case ExTL has an advantage over ETL.
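A rough sketch of why concurrency matters, assuming that under ExTL the shared reference is fetched once while under ETL every VM fetches its own copy. The 30 GB reference size and the function names are hypothetical, and the cost of the additional instance is not modeled here.

```shell
#!/bin/sh
# Hypothetical figure, for illustration only.
ref_gb=30   # size of the shared reference in GB (assumed)

# ETL: every computing VM downloads its own copy of the reference.
etl_traffic_gb() { echo $(( ref_gb * $1 )); }

# ExTL: assumed to download the reference once, whatever the concurrency.
extl_traffic_gb() { echo "$ref_gb"; }

for jobs in 1 10 100; do
  echo "jobs=$jobs ETL=$(etl_traffic_gb "$jobs")GB ExTL=$(extl_traffic_gb "$jobs")GB"
done
```

With a single job the two models transfer the same amount, so the additional instance is pure overhead; the gap only opens up as the number of concurrent jobs grows.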

Fig. Time for computing by concurrency.

Fig. Estimated prices by concurrency.

See the poster at GCCBOSC 2018 for more details.