NCBIx-BigFetch version 0.0.1

This module is useful for downloading very large result sets of sequences 
from NCBI given a text query. It uses YAML to create a configuration 
file to maintain project state in case network or server issues interrupts 
execution, in which case it may be easily restarted after the last batch. 

Working scripts are included in the script directory:

	fetch-all.pp
	fetch-missing.pp
	fetch-unavailable.pp

The recommended workflow is:
	1. Copy the scripts and edit them for a specific project. Use a 
	   new number as the project ID. 
	2. Begin downloading by running fetch-all.pp, which will first 
	   submit a query and save the resulting WebEnv key in a project 
	   specific configuration file (using YAML).
	3. The next morning, kill the fetch-all.pp process and run 
	   fetch-missing.pp until it completes.
	4. Restart fetch-all.pp.
	5. If you wish to re-download "not available" sequences, you may 
	   run fetch-unavailable.pp. However, they will be downloaded at 
	   the end of fetch-all.pp if it completes normally.

If your query result set is so large that your WebEnv times out, simply 
start a new project with that last index of the previous project, and 
it will pick up the result set from there (with a new WebEnv). 

Warning: You may lose a (very) few sequences if your download extends 
across multiple projects. However, our testing shows that the results 
generated with the same query within a few days of each other are largely 
in the same order.

Note: This module was used to download 11,550,000 fasta-formatted protein 
sequences over the course of eight days.


INSTALLATION

To install this module, run the following commands:

	perl Makefile.PL
	make
	make test
	make install


DEPENDENCIES

Class::Std
Class::Std::Utils
LWP::Simple
YAML
Time::HiRes
Bio::SeqIO


COPYRIGHT AND LICENSE

Copyright (C) 2009, Roger A Hall

This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.