CLI tool to run batch jobs with the ETL/ExTL framework on AWS or other cloud services.
This section describes what ETL is and why ExTL is needed.
“ETL” is a data processing model in general use, not only in bio-informatics, in which the data source, the computing resources, and the output destination are kept separate.
// TODO: REFERENCE HERE
When it comes to the cloud, ETL can be an effective way to use computing resources and file storage.
See the animation of how “on-demand” ETL works.
This appears to work fine, but in the context of bio-informatics, reference files are too large to be handled as input files that are downloaded onto every VM.
Here we propose an extended ETL framework, ExTL, which reduces the wasteful network traffic and instance time spent downloading such huge reference files.
You can just add the --shared flag to hotsub to make the resource shared.
$ hotsub run \
--tasks ./my-samples.csv \
--script ./my-workflow.sh \
--image otiai10/STAR-alignment \
+ --shared REFERENCE=s3://bucket/huge/reference
Because “ExTL” creates one additional instance on top of the computing VM instances, it has an advantage over ETL when the number of concurrent jobs is large.
Fig. Time for computing by concurrency.
Fig. Estimated prices by concurrency.
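The advantage at high concurrency can be illustrated with a rough back-of-the-envelope model. It is only a sketch under stated assumptions: in plain ETL every computing VM downloads its own copy of the reference, while in ExTL the reference is fetched once onto the single additional shared instance. The function names and the 30 GB reference size are hypothetical. They are not part of hotsub and are not taken from the measurements behind the figures. The model also ignores the traffic between the shared instance and the computing VMs.

```python
def reference_traffic_gb(jobs: int, reference_gb: float, shared: bool) -> float:
    """Estimate the reference data pulled from object storage, in GB.

    Assumption: without sharing, each job's VM downloads the full reference;
    with sharing, the reference is downloaded once for all jobs.
    """
    if jobs < 0 or reference_gb < 0:
        raise ValueError("jobs and reference_gb must be non-negative")
    if jobs == 0:
        return 0.0
    downloads = 1 if shared else jobs
    return downloads * reference_gb


def instance_count(jobs: int, shared: bool) -> int:
    """Estimate the number of instances: one per job, plus one shared instance for ExTL."""
    if jobs < 0:
        raise ValueError("jobs must be non-negative")
    if jobs == 0:
        return 0
    return jobs + 1 if shared else jobs


if __name__ == "__main__":
    REFERENCE_GB = 30.0  # hypothetical reference size, for illustration only
    for jobs in (1, 10, 100):
        etl = reference_traffic_gb(jobs, REFERENCE_GB, shared=False)
        extl = reference_traffic_gb(jobs, REFERENCE_GB, shared=True)
        print(
            f"jobs={jobs:>3}  "
            f"ETL: {instance_count(jobs, False)} instances, {etl:.0f} GB  "
            f"ExTL: {instance_count(jobs, True)} instances, {extl:.0f} GB"
        )
```

Under these assumptions, a single job gains nothing from ExTL, since it pays for one extra instance and downloads the same amount of data. The download saving grows linearly with the number of concurrent jobs, while the overhead stays at one instance.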
See the poster at GCCBOSC 2018 for more details.