Commit 4009fec4 authored by Kerri Wait

Update architecture.md with mostly formatting changes

parent 5f97d57f
## Architecture
Two Python modules:
The **low level** module takes a `directory`, a `bucket/container` and a `time`.
It will:
- Back up everything with an mtime since the given time to a new tar file in
the container,
- Record the results in a DB (sqlite db), and
- Copy the DB to the container for safe keeping (otherwise the DB is also in the
directory to be backed up); a minimal sketch follows this list
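A minimal sketch of that flow, with a stub `upload` standing in for whichever
object-store client gets picked (none of these names are final):

```python
import sqlite3
import tarfile
import time
from pathlib import Path

def upload(path: Path, container: str) -> None:
    """Stub: replace with the chosen object-store client."""
    raise NotImplementedError

def low_level_copy(directory: str, container: str, since: float) -> None:
    """Back up files under `directory` with an mtime after `since` (epoch secs)."""
    root = Path(directory)
    changed = [p for p in root.rglob("*")
               if p.is_file() and p.stat().st_mtime > since]

    # New tar file holding everything modified since `since`.
    tar_path = root / f"backup-{int(time.time())}.tar"
    with tarfile.open(tar_path, "w") as tar:
        for p in changed:
            tar.add(p)
    upload(tar_path, container)

    # Record the results; the real schema is laid out in the
    # implementation section below. This is a simplified stand-in.
    db_path = root / "backup.db"
    with sqlite3.connect(db_path) as db:
        db.execute("CREATE TABLE IF NOT EXISTS history (path TEXT, object TEXT)")
        db.executemany("INSERT INTO history VALUES (?, ?)",
                       [(str(p), tar_path.name) for p in changed])
    upload(db_path, container)  # the DB also lives in the backed-up tree
```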
The **high level** module creates multiple threads, each one dealing with a
single directory/container. It takes a `copy level` argument: 0 for a full
copy, 1 for everything since the last level 0, 2 for everything since the last
level 1 or 2 (whichever is most recent).
It will:
- Use the level to determine the time window (based on the DB in each
directory; see the sketch after this list)
- Perform the copy
- Delete old copies that are no longer needed
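A sketch of the window selection and the thread fan-out; it assumes the `runs`
table described in the implementation section below, and `low_level_copy` is
the sketch above:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def window_start(db: sqlite3.Connection, level: int) -> float:
    """Start of the copy window: level 0 copies everything, level 1 copies
    since the last level 0, level 2 since the most recent level 1 or 2."""
    if level == 0:
        return 0.0
    prior = (0,) if level == 1 else (1, 2)
    marks = ",".join("?" * len(prior))
    row = db.execute(
        f"SELECT MAX(start_time) FROM runs WHERE level IN ({marks})", prior
    ).fetchone()
    return row[0] or 0.0  # fall back to a full copy if no prior run exists

def copy_one(directory: str, container: str, level: int) -> None:
    with sqlite3.connect(f"{directory}/backup.db") as db:
        since = window_start(db, level)
    low_level_copy(directory, container, since)

def high_level(pairs, level):
    """pairs: (directory, container) tuples, one worker thread each."""
    with ThreadPoolExecutor(max_workers=len(pairs)) as pool:
        for fut in [pool.submit(copy_one, d, c, level) for d, c in pairs]:
            fut.result()  # surface per-directory failures
```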
Current MeRC HPC policy is that we have data from a single point in time,
sometime in the last week, that we can restore from. We will tune the timing
of levels 0, 1, and 2, but suggest starting with:
- 0 once a month
- 1 once a week
- 2 once a day
This might evolve to:
- 0 once per 3 months
- 1 once a week
- 2 once a day
Verification should be run once a month, I would guess (although extracting
from a level 0 copy might be difficult depending on size).
Obviously you:
1. Cannot delete your level 2 backups until you have a new level 1 copy
2. Cannot delete your level 1 copies until you have a new level 1 copy
3. Cannot delete your level 0 copies until you have a new level 0 copy
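Put together: a copy becomes deletable once a newer copy exists at the level
that supersedes it. As a sketch, assuming run records carrying `level` and
`start_time`:

```python
def deletable(run: dict, all_runs: list[dict]) -> bool:
    """True once a newer copy exists at the superseding level: a newer
    level 1 covers old level 1 and level 2 copies; a newer level 0
    covers old level 0 copies."""
    covering = 1 if run["level"] == 2 else run["level"]
    return any(r["level"] == covering and r["start_time"] > run["start_time"]
               for r in all_runs)
```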
---
## Phases of copy
0. If a database cannot be found, recall it from the object store and load it
1. Determine the time window to copy (this may be since the last full or since
the last differential copy)
2. Record the current time in the DB
3. Use find or rsync to find all files with an mtime since the last copy
4. Store the list of files that are being transferred in this copy
5. Tar the files and transfer them to a unique object
6. Record the time of completion, number of files and volume of files
transferred, and time to transfer.
7. Dump the database to text and copy it to the object store (see the
bookkeeping sketch after this list)
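Phases 2, 4, 6 and 7 are mostly sqlite bookkeeping. A sketch against the
schema laid out in the implementation section below (all names illustrative);
the find/tar/transfer phases are sketched separately:

```python
import sqlite3
import time
from pathlib import Path

def record_run(db: sqlite3.Connection, level: int) -> int:
    # Phase 2: record the current time for this run.
    cur = db.execute("INSERT INTO runs (start_time, level) VALUES (?, ?)",
                     (time.time(), level))
    db.commit()
    return cur.lastrowid

def record_results(db, run_id, prev_run_id, files, object_name,
                   volume, duration):
    # Phase 4: store the list of files transferred in this copy.
    db.executemany("INSERT INTO history (run_id, path, object) VALUES (?, ?, ?)",
                   [(run_id, str(p), object_name) for p in files])
    # Phase 6: completion time, file count, volume and transfer time.
    db.execute("INSERT INTO statistics VALUES (?, ?, ?, ?, ?)",
               (run_id, prev_run_id, len(files), volume, duration))
    db.commit()

def dump_db(db: sqlite3.Connection, out_path: Path) -> None:
    # Phase 7: dump the database to text, ready to copy to the object store.
    out_path.write_text("\n".join(db.iterdump()))
```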
---
## Verification
1. Gather the list of files on the FS that existed at the time of the last copy
(i.e. exclude any files that were created since the last copy)
2. Select N files.
3. For each file:
1. Determine which tar object contains the most recent copy of the file
4. Ensure the file on disk has not changed since the verify began
5. Checksum the file on disk (a sketch follows this list)
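A sketch of the loop; the per-file retrieval sub-steps are elided in this
page, so fetching a file back out of its tar is left as a stub, and table
names follow the schema sketched below:

```python
import hashlib
import random
import sqlite3
from pathlib import Path

def fetch_from_tar(object_name: str, path: str) -> Path:
    """Stub: download the tar object and extract `path` to a scratch file."""
    raise NotImplementedError

def sha256(path) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_sample(db: sqlite3.Connection, directory: str, n: int) -> None:
    last_copy = db.execute("SELECT MAX(start_time) FROM runs").fetchone()[0]
    # Step 1: only files that already existed at the time of the last copy.
    candidates = [p for p in Path(directory).rglob("*")
                  if p.is_file() and p.stat().st_mtime <= last_copy]
    for path in random.sample(candidates, min(n, len(candidates))):  # step 2
        # Which tar object holds the most recent copy of this file.
        (obj,) = db.execute(
            "SELECT object FROM history WHERE path = ? "
            "ORDER BY run_id DESC LIMIT 1", (str(path),)).fetchone()
        mtime_before = path.stat().st_mtime
        restored = fetch_from_tar(obj, str(path))
        # The file must not have changed mid-verify; then compare checksums.
        assert path.stat().st_mtime == mtime_before, f"{path} changed mid-verify"
        assert sha256(restored) == sha256(path), f"checksum mismatch for {path}"
```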
---
## Implementation Phases and estimates
### Object creation - 6 hours
#### Steps
1. Create a table containing the current time; give it an autoincrement ID for
the current run, and record the level (0, 1, or 2, explained above)
2. Create a table for statistics. Columns should include the ID of the current
run, the ID of the previous run (to get the time window), the number of files,
the volume of files, and the time to transfer.
3. Create a table for history. Columns should be the ID of the current run, the
path, and the object this path is being stored in (the actual object name can be
randomly generated as long as this table exists, or sensibly generated including
time stamps, directory paths, copy levels, etc.); a schema sketch follows this
list
4. Determine the algorithm for object naming.
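One possible shape for those three tables, as a sketch (column names are
illustrative, and step 4's naming algorithm is left open):

```python
import sqlite3

def create_schema(db: sqlite3.Connection) -> None:
    db.executescript("""
        CREATE TABLE IF NOT EXISTS runs (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,
            start_time REAL NOT NULL,      -- current time when the run began
            level      INTEGER NOT NULL    -- 0, 1 or 2, explained above
        );
        CREATE TABLE IF NOT EXISTS statistics (
            run_id      INTEGER REFERENCES runs(id),
            prev_run_id INTEGER REFERENCES runs(id),  -- gives the time window
            num_files   INTEGER,
            volume      INTEGER,           -- bytes transferred
            duration    REAL               -- seconds to transfer
        );
        CREATE TABLE IF NOT EXISTS history (
            run_id  INTEGER REFERENCES runs(id),
            path    TEXT NOT NULL,
            object  TEXT NOT NULL          -- object this path is stored in
        );
    """)
```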
#### Unit tests
Errr. None?
#### Steps
1. Pick a tool to find files since a given timestamp (should also report file
sizes if possible) - 2 hours
2. Experiment with the Python API to read files, tar, and pipe straight to M3 - 6 hours
3. Split the tar at a given size. Say 10GB as a default? - 6 hours
4. See if we can checksum the tar as it's created as well (should be possible;
sketched below) - 2 hours
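For step 4, one approach is a file-like wrapper that checksums bytes as the
tar streams through it; step 3's split would roll over to a new tar once the
size counter passes the limit. A sketch, with the upload sink left abstract:

```python
import hashlib
import io
import tarfile

class ChecksummingWriter(io.RawIOBase):
    """File-like sink that sha256-hashes everything written through it."""
    def __init__(self, sink):
        self.sink = sink
        self.sha = hashlib.sha256()
        self.size = 0

    def writable(self):
        return True

    def write(self, b):
        self.sha.update(b)
        self.size += len(b)
        return self.sink.write(b)

def tar_with_checksum(paths, sink):
    """Stream `paths` into a tar written to `sink`, hashing as we go."""
    out = ChecksummingWriter(sink)
    with tarfile.open(fileobj=out, mode="w|") as tar:  # "|" = streaming mode
        for p in paths:
            tar.add(p)
    return out.sha.hexdigest(), out.size
```

Splitting at, say, 10GB then means watching `size` between files and rolling
over to a new tar object; the checksum comes for free when the tar closes.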
#### Unit tests
1. Create a file, run find, verify it's found
2. Create a file, set the mtime, verify it's excluded (sketched below)
3. Use subprocess to create a tar from two files, checksum the tar, use the
Python pipeline to get the checksum, and compare checksums
4. Upload two files to the object store as a tar, then download and checksum the tar.
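Test 2 might look like the following pytest sketch, using `os.utime` to age
the file and `pathlib` standing in for whatever find tool gets picked:

```python
import os
import time

def test_old_file_excluded(tmp_path):
    f = tmp_path / "old.txt"
    f.write_text("x")
    an_hour_ago = time.time() - 3600
    os.utime(f, (an_hour_ago, an_hour_ago))  # push atime/mtime into the past
    cutoff = time.time() - 60
    found = [p for p in tmp_path.rglob("*") if p.stat().st_mtime > cutoff]
    assert f not in found
```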
Remember to deal with the various HTTP codes for unavailable, error, etc.
Probably retry the upload for most of these. Not sure what all the codes will be.
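Since the codes aren't known yet, a retry wrapper can start from the usual
transient set and be tuned once the object store's behaviour is observed; the
set below is a guess:

```python
import time

TRANSIENT = {429, 500, 502, 503, 504}  # a guess; adjust once observed

def upload_with_retry(do_upload, attempts=5, backoff=2.0):
    """do_upload: callable performing one upload, returning an HTTP status."""
    for attempt in range(attempts):
        status = do_upload()
        if status < 400:
            return status
        if status not in TRANSIENT:
            raise RuntimeError(f"permanent upload failure: HTTP {status}")
        time.sleep(backoff ** attempt)  # 1s, 2s, 4s, ...
    raise RuntimeError("upload failed after retries")
```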
### Functional tests - 6 hours
1. Use the above module to do a "level 0 copy" of a directory, download the
object, untar, and verify (time stamp provided manually)
2. Create a new file and do a "level 1 copy"
3. Create another new file and do a second "level 1 copy"