Commit 5f97d57f authored by Chris Hines

initial architecture and estimated times

parent 8d3d6137
# Design and Estimates
## Architecture
There are two Python modules.

The low level module takes a directory, a bucket/container and a time. It backs up everything with an mtime since the given time to a new tar file in the container, records the results in a database (SQLite), and copies the DB to the container for safekeeping (otherwise the DB is also in the directory to be backed up).

The high level module creates multiple threads, each one dealing with a single directory/container.
It takes an argument of "copy level": 0 for a full copy, 1 for everything since the last level 0, 2 for everything since the last level 1 or 2 (whichever is most recent).
It will use the level to determine the time window (based on the DB in each directory; see the sketch below), perform the copy, and delete old copies that are no longer needed.
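As a rough illustration of the copy-level logic, here is a minimal Python sketch of how the high level module might map a level to a time window. The `last_run_time` helper is hypothetical; it stands in for a query against the per-directory SQLite DB.

```python
def window_start(level, last_run_time):
    """Return the mtime cutoff for this copy level.

    level 0: full copy, no cutoff.
    level 1: everything since the last level 0.
    level 2: everything since the most recent level 1 or 2.
    """
    if level == 0:
        return None                       # full copy: no lower bound on mtime
    if level == 1:
        return last_run_time(levels=(0,))
    return last_run_time(levels=(1, 2))   # whichever is most recent
```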
Current MeRC HPC policy is that we have a restorable copy of the data from a single point in time, sometime in the last week. We will tune the timing of levels 0, 1 and 2, but suggest starting with:

* level 0 once a month
* level 1 once a week
* level 2 once a day

This might evolve to:

* level 0 once per 3 months
* level 1 once a week
* level 2 once a day
Verification should be run once a month, I would guess (although extracting from a level 0 copy might be difficult depending on size).

Obviously you:

* a) cannot delete your level 2 backups until you have a new level 1 copy
* b) cannot delete your level 1 copies until you have a new level 1 copy
* c) cannot delete your level 0 copies until you have a new level 0 copy
## Phases of copy
0. If a database cannot be found, recall it from the object store and load it
1. Determine the time window to copy (this may be since the last full or since the last differential copy)
2. Record the current time in the DB
3. Use find or rsync to find all files with an mtime since the last copy
4. Store the list of files that are being transferred in this copy
5. tar the files and transfer them to a unique object
6. Record the time of completion, the number and volume of files transferred, and the time to transfer.
7. Dump the database to text and copy to object store
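Most of these phases have their own implementation sections below; as one concrete example, phase 7 might look roughly like this, assuming a `sqlite3` connection and a boto3-style bucket object (both assumptions, not settled choices):

```python
def dump_db(conn, bucket, key="backup.sqlite.sql"):
    # Phase 7: dump the SQLite DB to text and copy it to the object store.
    # `conn` is a sqlite3.Connection, `bucket` a boto3 Bucket (placeholder choices).
    text = "\n".join(conn.iterdump())              # SQL statements that recreate the DB
    bucket.put_object(Key=key, Body=text.encode())
```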
## Verification
1. Gather the list of files on the FS that existed at the time of the last copy (i.e. exclude any files that were created since the last copy)
2. Select N files.
3. For each file (a sketch of this per-file check follows the list):
    1. Determine which tar object contains the most recent copy of the file
    2. Download the object and extract the file
    3. Checksum the backed-up file
    4. Ensure the file on disk has not changed since the verify began
    5. Checksum the file on disk
    6. Compare the two checksums
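A minimal sketch of the per-file check (steps 3-5), assuming the relevant tar object has already been downloaded locally; SHA-256 is used here purely as an example hash.

```python
import hashlib
import os
import tarfile

def sha256_of(fileobj, block=1 << 20):
    h = hashlib.sha256()
    for chunk in iter(lambda: fileobj.read(block), b""):
        h.update(chunk)
    return h.hexdigest()

def verify_one(tar_path, member, disk_path, verify_started):
    """Return True/False for match/mismatch, or None if the file on disk
    changed after the verify began (it will be covered by the next copy)."""
    with tarfile.open(tar_path) as tar:
        backed_up = sha256_of(tar.extractfile(member))   # checksum the backed-up copy
    if os.path.getmtime(disk_path) > verify_started:
        return None
    with open(disk_path, "rb") as f:
        return backed_up == sha256_of(f)                 # compare with the file on disk
```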
## Implementation Phases and estimates
### Object creation - 6 hours
#### Steps
1. pip install s3
2. Test queries to MonStore, creation of buckets
3. Create the bucket if needed
4. Create a DB or download the DB from the object store (rough sketch after this list)
5. Create and delete an object
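A hedged sketch of the bucket and DB handling, assuming boto3 as the S3 client (the step above says `pip install s3`, so the exact library may differ); the endpoint, credentials and object names are placeholders.

```python
import boto3
from botocore.exceptions import ClientError

def open_bucket(endpoint, key, secret, bucket_name):
    s3 = boto3.resource("s3", endpoint_url=endpoint,
                        aws_access_key_id=key, aws_secret_access_key=secret)
    try:
        s3.meta.client.head_bucket(Bucket=bucket_name)   # does the bucket already exist?
    except ClientError:
        s3.create_bucket(Bucket=bucket_name)             # create it if needed
    return s3.Bucket(bucket_name)

def fetch_db(bucket, local_path="backup.sqlite"):
    # Pull the DB from the object store; if it is not there yet, the caller creates a fresh one.
    try:
        bucket.download_file("backup.sqlite", local_path)
    except ClientError:
        pass
    return local_path
```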
#### Unit tests
1. Connect to MonStore
2. Create bucket
3. Remove bucket
4. Open DB
5. Download DB
6. Copy DB to object store
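For example, the bucket/object unit tests might look roughly like this with pytest; `backup_low_level`, `open_bucket` and the environment variable names are placeholders, not a settled interface.

```python
import os
import pytest
from backup_low_level import open_bucket   # hypothetical module and helper

ENDPOINT = os.environ.get("MONSTORE_ENDPOINT")     # placeholder configuration
KEY = os.environ.get("MONSTORE_ACCESS_KEY")
SECRET = os.environ.get("MONSTORE_SECRET_KEY")

@pytest.fixture
def bucket():
    b = open_bucket(ENDPOINT, KEY, SECRET, "backup-unit-test-bucket")
    yield b
    b.objects.all().delete()   # empty the bucket so it can be removed
    b.delete()

def test_create_and_delete_object(bucket):
    bucket.put_object(Key="hello.txt", Body=b"hello")
    assert "hello.txt" in [o.key for o in bucket.objects.all()]
```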
### Stats - 2 hours
#### Steps
1. Create a table recording the current time; give it an autoincrement ID for the current run and a column for the level (0, 1 or 2, explained above)
2. Create a table for statistics. Columns should include the ID of the current run, the ID of the previous run (to get the time window), the number of files, the volume of files, and the time to transfer.
3. Create a table for history. Columns should be the ID of the current run, the path, and the object this path is being stored in (the actual object name can be randomly generated as long as this table exists, or sensibly generated, including timestamps, directory paths, copy levels etc.)
4. Determine the algorithm for object naming.
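A possible SQLite schema for the three tables above (column names are illustrative; the object naming algorithm from step 4 is left out):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS run (
    id    INTEGER PRIMARY KEY AUTOINCREMENT,
    start TEXT NOT NULL,       -- time the run began
    level INTEGER NOT NULL     -- 0, 1 or 2 as explained above
);
CREATE TABLE IF NOT EXISTS stats (
    run_id      INTEGER REFERENCES run(id),
    prev_run_id INTEGER REFERENCES run(id),  -- defines the time window
    n_files     INTEGER,
    bytes       INTEGER,
    seconds     REAL
);
CREATE TABLE IF NOT EXISTS history (
    run_id INTEGER REFERENCES run(id),
    path   TEXT NOT NULL,      -- file copied in this run
    object TEXT NOT NULL       -- object (tar) the file was stored in
);
"""

def open_db(path="backup.sqlite"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```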
#### Unit tests
Errr. None?
### Backup - 20 hours
#### Steps
1. Pick a tool to find files modified since a given timestamp (it should also report file sizes if possible) - 2 hours
2. Experiment with the python API to read files, tar them and pipe straight to M3 (rough sketch below) - 6 hours
3. Split the tar at a given size, say 10GB as a default? - 6 hours
4. See if we can checksum the tar as it's created as well (should be possible) - 2 hours
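A rough sketch of steps 1 and 2, using GNU `find -newermt` to list changed files and Python's `tarfile` module to build the archive; splitting at ~10GB and streaming straight to the object store are not shown.

```python
import subprocess
import tarfile

def changed_since(directory, timestamp):
    # GNU find: -newermt takes a date/time string and matches files modified since then.
    out = subprocess.run(
        ["find", directory, "-type", "f", "-newermt", timestamp, "-print0"],
        check=True, capture_output=True)
    return [p for p in out.stdout.decode().split("\0") if p]

def tar_files(paths, tar_path):
    # Write the changed files to a gzip-compressed tar on local disk.
    with tarfile.open(tar_path, "w:gz") as tar:
        for p in paths:
            tar.add(p)
    return tar_path
```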
#### Unit tests
1. Create a file, run find, verify it's found
2. Create a file, set the mtime, verify it's excluded
3. Use subprocess to create a tar from two files, checksum the tar, use a python pipeline to get the checksum, compare the checksums
4. Upload two files to the object store as a tar, download and checksum the tar.

Remember to deal with the various HTTP codes for unavailable, error, etc. Probably retry the upload for most of these; not sure what all the codes will be (see the retry sketch below).
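A hedged retry sketch for the upload, again assuming boto3; the set of retryable status codes is a guess and would need to be checked against MonStore's actual behaviour.

```python
import time
from botocore.exceptions import ClientError

RETRYABLE = {500, 502, 503, 504}   # assumption: retry only on server-side errors

def upload_with_retry(bucket, key, path, attempts=5):
    for attempt in range(attempts):
        try:
            bucket.upload_file(path, key)
            return
        except ClientError as exc:
            code = exc.response.get("ResponseMetadata", {}).get("HTTPStatusCode")
            if code not in RETRYABLE or attempt == attempts - 1:
                raise
            time.sleep(2 ** attempt)   # simple exponential backoff
```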
### Functional tests - 6 hours
1. Use the above module to do a "level 0 copy" of a directory, download the object, untar and verify (timestamp provided manually)
2. Create a new file and do a "level 1 copy"
3. Create another new file and do a "level 1 copy"
### High level Interface - 12 hours
1. Create a map from directory name to bucket name
2. Create a queue for directories/buckets
3. Create a thread pool to
    1. pull an item off the queue
    2. determine the time window
    3. execute the lower level python module with the specified time window
    4. copy the DB for the directory to the object store
    5. delete unneeded incremental copies
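A minimal sketch of the queue/thread-pool structure; `backup_one` stands in for the lower level module's entry point and is passed in as a callable, since that interface is not defined yet.

```python
import queue
import threading

def run_backups(dir_to_bucket, level, backup_one, workers=4):
    q = queue.Queue()
    for directory, bucket in dir_to_bucket.items():
        q.put((directory, bucket))

    def worker():
        while True:
            try:
                directory, bucket = q.get_nowait()
            except queue.Empty:
                return                            # queue drained, thread exits
            backup_one(directory, bucket, level)  # lower level module does the copy
            q.task_done()

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```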
### Verification - 12 hours
Performed in the lower level module (i.e. performed for each directory)
1. Determine the time of the last copy of any level.
2. Find all files created before the last copy of any level
3. Randomly select N files from this list
4. For each of those N files:
    1. Determine which copy is most recent and which object holds that copy
    2. Download the copy and extract the file
    3. Generate the checksum
    4. Compare to the checksum on disk
    5. If the checksums do not match, record the mismatch and continue
5. Report the number of non-matching checksums and throw an exception if any were found (see the sketch below).
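A sketch of the sampling and reporting loop, reusing the kind of per-file check sketched under "Verification" above; `check_file(path)` is assumed to return True/False for match/mismatch, or None if the file changed on disk during the verify.

```python
import random

def verify_directory(candidate_files, n, check_file):
    # candidate_files: files on the FS that existed before the last copy of any level.
    sample = random.sample(candidate_files, min(n, len(candidate_files)))
    mismatches = [p for p in sample if check_file(p) is False]
    if mismatches:
        raise RuntimeError(
            f"{len(mismatches)} of {len(sample)} sampled files failed checksum verification")
```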