Data storage is a critical part of computing. Fedora includes some powerful facilities for managing your data storage. These tools enable you to build high-availability, fault-tolerant storage systems that can be adjusted and tuned while in use, and also enable you to build backup tools that permit automated, self-consistent backups.
Fedora uses the Linux Logical Volume Management (LVM) system by default for disk storage. LVM combines one or more disk partitions, called Physical Volumes (PVs), into a pool of storage called a Volume Group (VG). From this volume group, one or more Logical Volumes (LVs) are allocated. Each LV is used as a block storage device to contain a filesystem or a swapspace.
Here's where the fun begins: LVs can be resized, created, or deleted on the fly, and disks can be added and deleted while the system is in use!
When changing a storage configuration, it is possible to make a mistake and lose data. Take your time, ensure that you are confident of what each step will do before performing it, and make sure you back up your data before performing any LVM operations.
Fedora Core permits you to manage logical volumes graphically or from the command line.
In the examples given here, the volume-group and logical-volume names recommended in Chapter 1 have been used: the volume group is main, and the logical volumes are named root, home, and swap.
If you used the Fedora default names, the main volume group will be named VolGroup00, and the logical volumes will be named LogVol00, LogVol01, and so forth.
Although you can increase or decrease the size of any logical volume at any time, an ext3 filesystem within a logical volume can be reduced in size only when it is not in use (unmounted). If the filesystem is the root filesystem, it is in use whenever the system is running; therefore, the only way to shrink the root filesystem is to use another disk as a temporary root filesystem, which is usually done by running the system from the installation CD in rescue mode (see Lab 10.6, "Using Rescue Mode on an Installation Disc "). There is also a limit to how large a filesystem can grow while in use; growing the filesystem past that point must be done when the filesystem is unmounted.
Start the LVM administration tool by selecting System→Administration→Logical Volume Management. After you enter the root password, the three-panel display shown in Figure 6-1 will appear.
Figure 6-1. Logical Volume Management window
The left pane displays a list of the elements managed by LVM, the middle pane displays the current element in visual form, and the right pane displays a description of the current element.
The element list in the left pane is a collapsing outline. To view the elements within a particular category, click on the small arrow to the left of the category name to rotate it to a downward-pointing position; the elements within that category will be listed immediately below it. For example, to see the logical volumes within the main volume group (VolGroup00 if you used the default Fedora configuration), click on the arrow beside "main Logical View" (or "VolGroup00 Logical View"), and a list of logical volumes will appear beneath that line.
The initial display shows the physical (red) and logical (blue) views of the last volume group listed. If you click on a logical volume in the Logical View, the corresponding areas in the physical view are highlighted, as shown in Figure 6-2 .
Figure 6-2. Viewing the location of LV data within PVs
To increase the size of a logical volume and the filesystem contained in it, select that LV in the lefthand pane, and then click Edit Properties. A properties dialog like the one in Figure 6-3 will appear.
Figure 6-3. LVM properties dialog
Change the unit control from Extents to Gigabytes or Megabytes so that the LV size is displayed in meaningful units; then click on the horizontal slider and drag it to the desired size (or type the size into the "LV size" field or click "Use Remaining").
Click OK. The LV will be resized, then the filesystem will be resized, and then the LVM information will be reloaded to update the display. On most systems, this will take just a few seconds.
If the resize fails with the message "No space left on device," you may have attempted to resize the filesystem past the maximum that can be done while the filesystem is mounted (in use). You can attempt to unmount the filesystem by deselecting the checkbox labeled Mount and then retry the operation (this will always fail for the root filesystem and will usually fail for filesystems containing /var and /home, in which case you may need to use single-user mode).
Shrinking a logical volume using the graphical tool is done exactly the same way as growing it: select the LV you wish to resize, click Edit Properties, enter the new size, and click OK.
The catch is that logical volumes containing ext3 filesystems can be reduced in size only when they are unmounted, so you will be asked if the filesystem may be unmounted during the resize operation. Click Yes.
Whenever the system is booted normally, the root ( / ) and /var filesystems will be in use, so you will not be able to unmount them, and therefore the resize will fail. You'll need to use a special procedure (detailed shortly) to shrink those filesystems.
The /home filesystem is a different story; if you log in as root instead of using a normal user account, the /home filesystem will not be in use, and you can successfully shrink /home . If any non- root users have logged in since the system was booted, they may have left processes running, such as the esound daemon (esd). These can be terminated with the fuser command:
# fuser -k /home/*
/home/chris: 13464c
The output shows that the directory /home/chris was in use as the current directory (c) of process 13464. That process is killed, as specified by the -k option. Once this has been done, you can resize the /home filesystem.
You can create a new logical volume at any time, as long as there is some free space in the volume group you wish to use.
Select the volume group's Logical View element in the lefthand panel, then click Create New Logical Volume at the bottom of the center panel. The dialog shown in Figure 6-4 will appear.
Figure 6-4. Create New Logical Volume dialog
Enter an LV name consisting of letters, digits, and underscores. Change the LV size unit from Extents to Gigabytes (or Megabytes) and enter the desired LV size directly or by using the slider (click the "Use remaining" button to use all of the free space in the VG).
To create a filesystem in this LV, change the Filesystem type control (near the bottom of the dialog) from None to ext3, and select the checkboxes for Mount and "Mount when rebooted." In the "Mount point" field, type the name of the directory where you wish the new filesystem to appear.
For example, to create a 10 GB logical volume for music and video files, you could enter an LV name of multimedia, set the size to 10 GB, and create an ext3 filesystem with a mount point of /media.
Click OK. The LV and filesystem will be created and mounted, and you can start using the filesystem immediately.
LVM has the ability to create a snapshot of an LV. The snapshot appears to be an exact copy of the LV as it stood when the snapshot was created, but this is an illusion: the snapshot actually stores only the data that has changed (in either volume) since the snapshot was created. You can change the data in the origin LV without affecting the snapshot, and change the data in the snapshot without affecting the origin LV.
Snapshots enable you to make a self-consistent backup of a filesystem to media such as tape. If you don't use snapshots and you back up an active filesystem containing a database to tape, the database tables would get copied at different times; if the database contained e-commerce data, perhaps the customer table would get copied before the order table. If an order was received from a new customer while the backup was in progress, it is possible that the order table on the tape will include the order but the customer table may not include the new customer. This could lead to severe problems when trying to use the data at a later time. On the other hand, if you take a snapshot and then back that up, the various files will all be in the same state on tape.
In addition, snapshots are useful for self-administered document recovery: if you take a snapshot of your users' files each night and make that snapshot available to them, they can recover from their own mistakes if they mess up a spreadsheet or delete an important document. For example, if you take a snapshot of /home and make it available as /yesterday/home , the deleted document /home/jamie/budget.ods can be recovered as /yesterday/home/jamie/budget.ods .
Snapshots are also used to test software or procedures without affecting live data. For example, if you take a snapshot of the logical volume containing the /home filesystem, and then unmount the original filesystem and mount the snapshot in its place, you can experiment with procedures that change the contents of home directories. To undo the results of your experiments, simply unmount the snapshot, remount the original directory, and then destroy the snapshot.
To create a snapshot of a LV using the graphical tool, select the LV in the left pane, and then click on the Create Snapshot button at the bottom of the middle pane. You will see the dialog box shown in Figure 6-5 .
Figure 6-5. Creating a snapshot
This dialog looks a lot like the dialog used to create a logical volume (Figure 6-4), and it should, because a snapshot is a special type of LV. Enter a name for the snapshot; I recommend the name of the origin LV, with -snap added to the end. For example, a snapshot of the multimedia LV would be called multimedia-snap.
Next, set the size of the snapshot. The snapshot will appear to be the same size as the origin LV; the size setting here is used to reserve disk space to track the differences between the origin LV and the snapshot. Therefore, if you have a 100 GB LV and the data in that LV changes slowly, a 1 GB snapshot might be reasonable; but if the data in that LV changes rapidly, you will need a much larger snapshot size.
Select the Mount and "Mount when rebooted" checkboxes, and then enter the "Mount point" that you wish to use (such as /backup/media ).
You can view the amount of storage used by the snapshot by selecting the snapshot LV in the left pane, then looking at the snapshot usage in the right pane. The usage is reported as a percentage of the total snapshot size and increases as data is changed in the origin or snapshot volumes. If it approaches 100 percent, you can increase the size of the snapshot LV in the same way that you would resize a regular LV.
To permanently remove a logical volume, select it in the left pane, and then click the Remove Logical Volume button at the bottom of the middle pane. A dialog box will appear, asking you to confirm your choice; when you click Yes, the logical volume will be gone forever.
You can add a partition to a volume group at any time.
The first step is to make the partition a physical volume. Select the disk partition you wish to use under Uninitialized Entities in the left pane, and then click the Initialize Entity button at the bottom of the center pane. A dialog box will warn you of possible data loss; double-check the partition information, and then click Yes if you are certain that you will not lose any critical data.
Be extremely careful with this option because it will delete all of the data on an entire disk partition. If you select the wrong partition on a dual-boot system, you could wipe out all of the data used by the other operating system (such as Windows).
If the Initialize Entity button is deactivated (grayed-out and unclickable), look in the right pane for the reason that the partition is "Not initializable." The most common reason given is Foreign boot partition , which means that the partition is marked as bootable in the drive's partition table. To correct this, use fdisk on the disk containing the partition; for example, run fdisk on the disk /dev/sdb to edit the settings for the partition /dev/sdb1 :
# fdisk /dev/sdb
fdisk accepts single-letter commands. Enter p to print the partition table:
Command (m for help): p
Disk /dev/sdb: 8 MB, 8192000 bytes
4 heads, 16 sectors/track, 250 cylinders
Units = cylinders of 64 * 512 = 32768 bytes
Device Boot Start End Blocks Id System
/dev/sdb1 * 1 250 7987+ 1 FAT12
There is only one partition on this particular disk, and it is bootable (note the * in the Boot column). Use the a (activate) command to toggle the boot flag:
Command (m for help): a
Partition number (1-4):
1
Then use w to write the partition table to disk and exit:
Command (m for help): w
The partition table has been altered!
Calling ioctl() to re-read partition table.
Syncing disks.
You can now rerun the graphical LVM administration tool and initialize the partition for use with LVM. This gives you a new physical volume that you can work with.
The next step is to add the new physical volume to the volume group. You'll see the newly initialized partition under Unallocated Volumes in the left pane. Click on it, and then click on the button labeled "Add Volume to existing Volume Group." A menu of volume groups will appear; select the one to add it to, and then click Add.
Once you've added a PV, you can use the extra space to create new logical volumes or grow an existing volume.
To take a physical volume (partition) out of a volume group, select the PV in the left pane, and then click "Remove Volume from Volume Group." You will be prompted for confirmation (including any move of data to another device), and the PV will be removed (as long as the remaining PVs in the VG have enough free space to hold the data currently stored on this PV; otherwise, removing the PV would destroy data).
Logical volumes are almost always used to contain filesystems (the other common use is to hold swapspace). In essence, an LV serves as a container for a filesystem. This has several ramifications:
The LV must be created before the filesystem can be created.
The filesystem must be removed before the LV is destroyed.
When growing an LV and filesystem, the LV must be grown first.
When shrinking an LV and filesystem, the filesystem must be reduced first. (A compact sketch of both orderings follows this list.)
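Putting the last two rules together, here is a minimal sketch of both orderings, using a hypothetical LV named /dev/main/example (the full procedures, including the fsck step and the final resize2fs to make the filesystem exactly fill the LV, are shown later in this lab):
# lvextend /dev/main/example --size 2G
# resize2fs /dev/main/example
# umount /dev/main/example
# resize2fs /dev/main/example 900M
# lvreduce /dev/main/example --size 950M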
Fedora's LVM2 system provides the lvm command for administration. Typing lvm by itself starts a specialized shell:
# lvm
lvm>
At the lvm> prompt, you can enter any of the subcommands shown in Table 6-1.
Table 6-1. LVM subcommands
LVM subcommand | Description |
---|---|
vgs | Displays details about volume groups (compact) |
pvs | Displays details about physical volumes (compact) |
lvs | Displays details about logical volumes (compact) |
vgdisplay | Displays details about volume groups (verbose) |
pvdisplay | Displays details about physical volumes (verbose) |
lvdisplay | Displays details about logical volumes (verbose) |
vgcreate | Creates a volume group |
vgremove | Removes a volume group |
pvcreate | Prepares a block device (such as a disk partition) for inclusion in a volume group by adding a disk label to the start of the block device |
pvremove | Wipes out the disk label created by pvcreate |
vgextend | Adds a physical volume to a volume group |
vgreduce | Removes a physical volume from a volume group |
pvmove | Migrates data from one physical volume to another |
lvcreate | Creates a logical volume or snapshot LV |
lvextend | Grows a logical volume |
lvreduce | Shrinks a logical volume |
lvresize | Grows or shrinks a logical volume |
vgscan | Scans block devices for volume groups (necessary when using a rescue-mode boot) |
You can also enter any of these subcommands as the first argument on the lvm command line:
# lvm lvs
LV VG Attr LSize Origin Snap% Move Log Copy%
home main -wi-ao 1.00G
multimedia main -wi-ao 512.00M
root main -wi-ao 9.77G
swap main -wi-ao 1.00G
Symbolic links have been set up from /usr/sbin/<subcommand> to /usr/sbin/lvm, so you can just type the name of the subcommand at the regular bash shell prompt:
# ls -l /usr/sbin/lvs
lrwxrwxrwx 1 root root 3 Mar 20 14:49 /usr/sbin/lvs -> lvm
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy%
home main -wi-ao 1.00G
multimedia main -wi-ao 512.00M
root main -wi-ao 9.77G
swap main -wi-ao 1.00G
The symbolic links are not available when you are in rescue mode (see Lab 10.6, "Using Rescue Mode on an Installation Disc"), so it's important to remember that you can also use these subcommands as arguments to the lvm command (for example, when in rescue mode, type lvm lvdisplay instead of lvdisplay).
Logical volumes can be accessed using any of three different device nodes:
In the /dev/mapper directory, the entry named by the pattern vg-lv. For example, if the volume group main had a logical volume named home, it could be accessed using the name /dev/mapper/main-home.
There is a separate directory in /dev for each volume group, and an entry for each logical volume within that directory. Our sample volume could be accessed as /dev/main/home . These names are slightly shorter to type than the ones in /dev/mapper , and are actually symbolic links to the longer names.
Using /dev/dm-<number> , where <number> is a number sequentially assigned when volume groups are initially scanned at boot time (or when the LV is created, if it was created after the last boot). If a volume is the second one found during the vgscan , it can be accessed as /dev/dm-1 (the first one found is numbered 0 ). These names are a bit harder to use, since the VG and LV are not identified; to find the corresponding entry in /dev/mapper , compare the minor device numbers. You cannot use these names in rescue mode.
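For example, to confirm which logical volume a /dev/dm- node refers to, list it alongside a /dev/mapper entry and compare the minor device numbers (the dm number used here is only an illustration; yours will differ):
# ls -l /dev/dm-3 /dev/mapper/main-multimedia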
In addition to these device node names, some LVM commands allow the volume group and logical volume names to be written as vg/lv; for example, main/multimedia refers to the LV multimedia within the VG main.
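For instance, these two commands display the same logical volume:
# lvdisplay /dev/main/multimedia
# lvdisplay main/multimedia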
To discover the VGs present on your system, use the vgs command:
# vgs
VG #PV #LV #SN Attr VSize VFree
main 2 4 0 wz--n- 20.04G 7.78G
This shows the volume group name, the number of physical volumes, logical volumes, and snapshots; attributes (see the manpage for lvm for details); the volume group size; and the amount of space that is not assigned to a logical volume.
vgdisplay shows the same information as vgs but in a more verbose form:
# vgdisplay
--- Volume group ---
VG Name main
System ID
Format lvm2
Metadata Areas 2
Metadata Sequence No 51
VG Access read/write
VG Status resizable
MAX LV 0
Cur LV 4
Open LV 4
Max PV 0
Cur PV 2
Act PV 2
VG Size 20.04 GB
PE Size 4.00 MB
Total PE 5131
Alloc PE / Size 3140 / 12.27 GB
Free PE / Size 1991 / 7.78 GB
VG UUID 13X0pY-5Vnq-3KlU-7Qlu-sHUc-wrup-zsHipP
The VG UUID at the bottom is a unique ID number placed in the disk label of each PV to identify that it is part of this volume group.
If you have more than one VG present and only want to see information about a specific one, you can specify a volume group name as an argument to vgdisplay or vgs.
To list the PVs present, use pvs or pvdisplay :
# pvs
PV VG Fmt Attr PSize PFree
/dev/hdc3 main lvm2 a- 20.04G 7.77G
/dev/sdb1 main lvm2 a- 4.00M 4.00M
# pvdisplay
--- Physical volume ---
PV Name /dev/hdc3
VG Name main
PV Size 20.04 GB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 5130
Free PE 1990
Allocated PE 3140
PV UUID RL2wrh-WMgl-pyaR-bHt4-6dCv-23Fd-kX1gvT
--- Physical volume ---
PV Name /dev/sdb1
VG Name main
PV Size 4.00 MB / not usable 0
Allocatable yes
PE Size (KByte) 4096
Total PE 1
Free PE 1
Allocated PE 0
PV UUID HvryBh-kGrM-c10y-yw1v-u8W3-r2LN-5LrLrJ
In this case, there are two PVs present: /dev/hdc3 (an IDE hard disk partition) and /dev/sdb1 (a USB disk I was playing with). Both are part of the VG main . The display shows the attributes (see man lvm ), size, and amount of unallocated space.
In a similar way, you can see logical volume information with lvs or lvdisplay :
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy%
home main -wi-ao 1.00G
multimedia main owi-ao 512.00M
multimedia-snap main swi-a- 128.00M multimedia 0.02
root main -wi-ao 9.77G
swap main -wi-ao 1.00G
# lvdisplay
--- Logical volume ---
LV Name /dev/main/root
VG Name main
LV UUID LaQgYA-jiBr-G02i-y64m-90fT-viBp-TuZ9sC
LV Write Access read/write
LV Status available
# open 1
LV Size 9.77 GB
Current LE 2500
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:0
...(Lines snipped)...
--- Logical volume ---
LV Name /dev/main/multimedia
VG Name main
LV UUID f7zJvh-H21e-fSn7-llq3-Ryu1-p1FQ-PTAoNC
LV Write Access read/write
LV snapshot status source of
/dev/main/multimedia-snap [active]
LV Status available
# open 1
LV Size 512.00 MB
Current LE 128
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:3
--- Logical volume ---
LV Name /dev/main/multimedia-snap
VG Name main
LV UUID 7U5wVQ-qIWU-7bcz-J4vT-zAPh-xGVN-CDNfjx
LV Write Access read/write
LV snapshot status active destination for /dev/main/multimedia
LV Status available
# open 0
LV Size 512.00 MB
Current LE 128
COW-table size 128.00 MB
COW-table LE 32
Allocated to snapshot 0.02%
Snapshot chunk size 8.00 KB
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:6
This display shows the volume group, attributes (again, see man lvm ), and logical volume size. Additional information is shown for snapshot volumes and LVs that are being copied or moved between PVs. The Block device shown in the lvdisplay output is the major and minor device number.
To increase the size of a logical volume, use the lvextend command:
# lvextend /dev/main/multimedia --size 1G
Extending logical volume multimedia to 1.00 GB
Logical volume multimedia successfully resized
Specify the LV device as the first argument, and use the --size option to specify the new size for the volume. Use a numeric size with one of the size suffixes from Table 6-2 as the value for the --size option.
Table 6-2. Size suffixes used by LVM
Suffix | Name | Size | Approximation |
---|---|---|---|
k, K | Kibibyte (kilobyte) | 2^10 = 1,024 bytes | Thousand bytes |
m, M | Mebibyte (megabyte) | 2^20 = 1,048,576 bytes | Million bytes |
g, G | Gibibyte (gigabyte) | 2^30 = 1,073,741,824 bytes | Billion bytes |
t, T | Tebibyte (terabyte) | 2^40 = 1,099,511,627,776 bytes | Trillion bytes |
Once you have resized the LV, resize the filesystem contained inside:
# resize2fs /dev/main/multimedia
resize2fs 1.39 (29-May-2006)
Resizing the filesystem on /dev/main/multimedia to 1048576 (1k) blocks.
The filesystem on /dev/main/multimedia is now 1048576 blocks long.
Note that you do not need to specify the filesystem size; the entire LV size will be used.
If resize2fs fails with the message "No space left on device," the new size is too large for the existing block group descriptor table; growing the filesystem past that point must be done while it is unmounted.
Before reducing the size of a logical volume, you must first reduce the size of the filesystem inside the LV. This must be done when the filesystem is unmounted:
# umount /dev/main/multimedia
Next, run a filesystem check to verify the integrity of the filesystem. This is required in order to prevent data loss that may occur if there is data near the end of the filesystem (this is the area that will be freed up by shrinking) and that data is not properly accounted for in the filesystem tables:
# fsck -f /dev/main/multimedia
e2fsck 1.38 (30-Jun-2005)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/main/multimedia: 11/117248 files (9.1% non-contiguous), 8043/262144 blocks
Now use resize2fs to reduce the size of the filesystem:
# resize2fs /dev/main/multimedia 740M
resize2fs 1.38 (30-Jun-2005)
Resizing the filesystem on /dev/main/multimedia to 189440 (4k) blocks.
The filesystem on /dev/main/multimedia is now 189440 blocks long.
Note that resize2fs expects the size to be the second argument (there is no --size option as there is with the LVM commands).
The LVM commands accept sizes containing decimals (such as 1.2G), but resize2fs does not; use the next smaller unit to eliminate the decimal point (1200M).
Both the filesystem commands and the LVM commands round off sizes to the closest multiple of their internal allocation units. This means that resize2fs and lvreduce may interpret a size such as 750M slightly differently. In order to avoid the potential disaster of resizing the LV to be smaller than the filesystem, always resize the filesystem so that it is slightly smaller than the planned LV size, resize the LV, and then grow the filesystem to exactly fill the LV. In this case, I'm resizing the filesystem to 740 MB and will resize the LV to 750 MB.
Now that the filesystem has been resized, you can shrink the logical volume:
# lvreduce /dev/main/multimedia --size 750M
Rounding up size to full physical extent 752.00 MB
WARNING: Reducing active logical volume to 752.00 MB
THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce multimedia? [y/n]: y
Reducing logical volume multimedia to 752.00 MB
Logical volume multimedia successfully resized
Finally, grow the filesystem to completely fill the logical volume:
# resize2fs /dev/main/multimedia
resize2fs 1.38 (30-Jun-2005)
Resizing the filesystem on /dev/main/multimedia to 192512 (4k) blocks.
The filesystem on /dev/main/multimedia is now 192512 blocks long.
The lvcreate command will create a new volume:
# lvcreate main --name survey --size 5G
Logical volume "survey" created
Next, add a filesystem:
# mkfs -t ext3 -L survey -E resize=20G /dev/main/survey
mke2fs 1.38 (30-Jun-2005)
Filesystem label=survey
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
655360 inodes, 1310720 blocks
65536 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=8388608
40 block groups
32768 blocks per group, 32768 fragments per group
16384 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
The -t ext3 option specifies the filesystem type, -L survey specifies an optional filesystem volume label (to identify the contents), and -E resize=20G (also optional) configures a block group descriptor table large enough that the filesystem can be grown up to 20 GB while mounted. In this case, 20 GB is four times the initial size of the filesystem; use whatever upper limit seems reasonable for your application (the table takes roughly 4 KB of space for each gigabyte of maximum filesystem size, so the overhead is minimal).
You can now mount the filesystem and use it. Here I'll use /usr/lib/survey as the mount point:
# mkdir /usr/lib/survey
# mount /dev/main/survey /usr/lib/survey
To configure the Fedora system to mount this filesystem every time it is booted, add an entry to the file /etc/fstab :
/dev/main/root / ext3 defaults 1 1
LABEL=/boot /boot ext3 defaults 1 2
devpts /dev/pts devpts gid=5,mode=620 0 0
tmpfs /dev/shm tmpfs defaults 0 0
proc /proc proc defaults 0 0
sysfs /sys sysfs defaults 0 0
/dev/main/swap swap swap defaults 0 0
/dev/main/home /home ext3 defaults 1 2
/dev/main/multimedia /tmp/media ext3 defaults 1 2
/dev/main/survey /usr/lib/survey ext3 defaults 1 2
The new line (highlighted in bold) contains the filesystem block device, the mount point, the filesystem type, any mount options ( defaults specifies the default options, which include mounting the filesystem at boot time), whether the filesystem should be backed up ( 1 meaning yes ), and the fsck sequence number ( 2 is for filesystems that should be checked but that are not the root filesystem).
The lvcreate command is also used to create snapshot volumes:
# lvcreate -s /dev/main/survey --name survey-snap --size 500M
Logical volume "survey-snap" created
The -s option indicates that this is a snapshot LV. Specify the origin LV as the first positional argument, and use the --name and --size options as you would for a regular lvcreate command. However, the value given for the --size option must be the amount of space allocated for tracking the differences between the origin LV and the snapshot LV.
Once the snapshot has been created, it can be mounted and used:
# mkdir /usr/lib/survey-snap
# mount /dev/main/survey-snap /usr/lib/survey-snap
To have the snapshot automatically mounted when the system is booted, edit the file /etc/fstab in the same way that you would for a regular filesystem.
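For example, a line like this (using the mount point from the previous step and the same fields as the earlier fstab entries) will mount the snapshot at boot time:
/dev/main/survey-snap /usr/lib/survey-snap ext3 defaults 1 2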
To see how much of a snapshot's storage is in use, use lvs or lvdisplay :
# lvs
LV VG Attr LSize Origin Snap% Move Log Copy%
home main -wi-ao 1.00G
multimedia main -wi-a- 752.00M
root main -wi-ao 9.77G
survey main owi-ao 5.00G
survey-snap main swi-ao 500.00M survey 8.27
swap main -wi-ao 1.00G
# lvdisplay /dev/main/survey-snap
--- Logical volume ---
LV Name /dev/main/survey-snap
VG Name main
LV UUID IbG5RS-Tcle-kzrV-Ga9b-Jsgx-3MY6-iEXBGG
LV Write Access read/write
LV snapshot status active destination for /dev/main/survey
LV Status available
# open 1
LV Size 5.00 GB
Current LE 1280
COW-table size 500.00 MB
COW-table LE 125
Allocated to snapshot 8.27%
Snapshot chunk size 8.00 KB
Segments 1
Allocation inherit
Read ahead sectors 0
Block device 253:7
In this case, 8.27% of the snapshot storage has been used, or about 41 MB. If this approaches 100%, you can grow the snapshot LV using lvextend in the same way that a regular LV is grown.
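For example, to give the snapshot more room to track changes (the new size here is arbitrary):
# lvextend /dev/main/survey-snap --size 1G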
To remove a logical volume, unmount it, and then use lvremove :
# umount /usr/lib/survey-snap
# lvremove /dev/main/survey-snap
Do you really want to remove active logical volume "survey-snap"? [y/n]: y
Logical volume "survey-snap" successfully removed
Removing an LV is irreversible, so be sure that you're not deleting any important data.
To set up a partition for use as a physical volume, use the pvcreate command to write the LVM disk label, making the partition into a physical volume:
# pvcreate /dev/sde1
Physical volume "/dev/sde1" successfully created
If the disk is not partitioned, you can use fdisk or (more easily) parted to create a partition before running pvcreate.
These commands create a single partition that fills the entire disk /dev/sde:
# parted /dev/sde mklabel msdos
# parted -- /dev/sde mkpart primary ext2 1 -1
In this case, the partition will be /dev/sde1.
You can then add that PV to an existing volume group:
# vgextend main /dev/sde1
Volume group "main" successfully extended
The vgreduce command is used to reduce the size of a volume group by removing a physical volume. It will fail if any space on the PV is in use:
# vgreduce main /dev/sdb1
Physical volume "/dev/sdb1" still in use
In this case, an attempt to remove /dev/sdb1 from the volume group main failed. To move the data off a PV (assuming that there is sufficient space available on other PVs in the volume group), use the pvmove command:
# pvmove /dev/sdb1
/dev/sdb1: Moved: 100.0%
Depending on the amount of data to be moved, this operation can take quite a while to run. When it is complete, you can remove the physical volume:
# vgreduce main /dev/sdb1
Removed "/dev/sdb1" from volume group "test"
You can then use that partition for other purposes. If you want to erase the LVM disk label, use the pvremove command:
# pvremove /dev/sde1
Labels on physical volume "/dev/sde1" successfully wiped
Some filesystems, such as those containing /var or /etc , may be in use anytime the system is booted normally. This prevents the use of resize2fs to shrink ext2 and ext3 filesystems or to grow them large enough to exceed the block group descriptor table.
To use resize2fs on these filesystems, you must use runlevel s, which is single-user mode. Boot your system, and press the spacebar when the GRUB boot screen appears. Press the A key to append text to the boot line; then type s and press Enter. After a few seconds, a root shell prompt will appear (sh-3.1#).
At this shell prompt you can unmount the filesystem, then use fsck, resize2fs, and lvreduce (or lvextend). For example, to reduce the size of /home to about 950 MB:
sh-3.1# umount /home
sh-3.1# fsck -f /dev/main/home
e2fsck 1.38 (30-Jun-2005)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/main/home: 121/256000 files (2.5% non-contiguous), 12704/262144 blocks
sh-3.1# resize2fs /dev/main/home 900M
resize2fs 1.38 (30-Jun-2005)
Resizing the filesystem on /dev/main/home to 230400 (4k) blocks.
The filesystem on /dev/main/home is now 229376 blocks long.
sh-3.1# lvreduce /dev/main/home --size 950M
Rounding up size to full physical extent 952.00 MB
WARNING: Reducing active logical volume to 952.00 MB
THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce home? [y/n]: y
Reducing logical volume home to 952.00 MB
Logical volume home successfully resized
sh-3.1# resize2fs /dev/main/home
resize2fs 1.38 (30-Jun-2005)
Resizing the filesystem on /dev/main/home to 243712 (4k) blocks.
The filesystem on /dev/main/home is now 243712 blocks long.
The warning message displayed by lvreduce is accurate: if you set the logical volume size smaller than the filesystem size, you will lose data! Be extremely careful when resizing volumes; it's a good idea to back up your data first.
If your system has the default Volume Group and Logical Volume names, substitute the correct name (such as /dev/VolGroup00/LogVol00) for /dev/main/home. The problem is that it's hard to keep the logical volume names straight, which is why I recommend using more meaningful names.
Note that, as before, the filesystem was resized to be slightly smaller than the desired size, then expanded to fill the LV after the LV was resized.
When you're done, type reboot or press Ctrl-Alt-Delete to restart the system.
To reduce or substantially grow the root filesystem, you'll have to boot from a device other than your normal disk. The most convenient way to do this is to boot from the Fedora Core installation media; when the boot screen appears ( Figure 1-1 ), type linux rescue and press Enter.
After prompting you for the language ( Figure 1-5 ) and keyboard type ( Figure 1-6 ) the same way it does for a network installation (use the arrow keys and Enter to select the correct value for each), the system will ask if you wish to start the network interfaces, as shown in Figure 6-6 . Select No by pressing Tab and then Enter.
Figure 6-6. Rescue mode network interface dialog
The next screen, shown in Figure 6-7 , enables you to select filesystem mounting; select Skip by pressing Tab twice and then pressing Enter.
Figure 6-7. Rescue mode filesystem mounting dialog
You will then be presented with a shell prompt ( sh-3.1# ). The LVM device nodes will not be present until you scan for them and activate them:
sh-3.1# lvm vgscan
Reading all physical volumes. This may take a while...
Found volume group "main" using metadata type lvm2
sh-3.1# lvm vgchange -ay
3 logical volume(s) in volume group "main" now active
The LVM device nodes will be created in /dev/mapper/<vg-lv> and /dev/<vg>/<lv>. The /dev/dm-<N> nodes are not created.
You can now resize the root filesystem and logical volume:
sh-3.1# fsck -f /dev/main/root
WARNING: couldn't open /etc/fstab: No such file or directory
e2fsck 1.38 (30-Jun-2005)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/main/root: 134009/1532576 files (0.5% non-contiguous), 793321/1531904 blocks
sh-3.1# resize2fs /dev/main/root 5600M
resize2fs 1.38 (30-Jun-2005)
Resizing the filesystem on /dev/main/root to 1433600 (4k) blocks.
The filesystem on /dev/main/root is now 1433600 blocks long.
sh-3.1# lvreduce /dev/main/root --size 5650M
Rounding up size to full physical extent 5.53 GB
WARNING: Reducing active logical volume to 5.53 GB
THIS MAY DESTROY YOUR DATA (filesystem etc.)
Do you really want to reduce root? [y/n]: y
Reducing logical volume root to 5.53 GB
Logical volume root successfully resized
sh-3.1# resize2fs /dev/main/root
resize2fs 1.38 (30-Jun-2005)
Resizing the filesystem on /dev/main/root to 1449984 (4k) blocks.
The filesystem on /dev/main/root is now 1449984 blocks long.
Type exit or press Ctrl-D to exit from the rescue-mode shell. The system will then reboot; don't forget to remove the installation media.
LVM works by dividing storage space into same-sized pieces called extents. The extents that make up physical storage are called physical extents (PEs); the extents that make up logical volumes are called logical extents (LEs).
Obviously, each LE exists as a PE somewhere in the LVM system. A kernel facility called the device mapper converts between LE and PE extent numbers. When the physical extents are changed, as the result of a pvmove, for example, the logical extent numbers remain the same, providing continuity for the filesystem.
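If you are curious about this mapping, the dmsetup command will print the device-mapper table for a volume; the table name follows the same vg-lv pattern used in /dev/mapper (main-home here is just the example volume used earlier):
# dmsetup table main-home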
Extents tend to be fairly large; they can be anywhere from 8 KB to 16 GB in size, but are typically in the 1 to 128 MB range (32 MB is the default extent size used during installation). Larger extent sizes reduce LVM overhead because the extent tables are smaller and need to be consulted less often. However, LVs and PVs must be a multiple of the extent size, so a large size limits granularity. The extent size can be configured when the VG is created, either during installation or by using the --physicalextentsize argument to vgcreate.
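For example, a new volume group could be created with 32 MB extents like this (the VG name and partition are placeholders):
# vgcreate --physicalextentsize 32M archive /dev/sdf2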
A large, efficient extent size is usually too big for effective copy-on-write operation during snapshots, so a smaller chunk size is used for copy-on-write management. This can be configured using the --chunksize option to lvcreate .
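For example, a snapshot could be created with 64 KB chunks rather than the default (the snapshot name and sizes are placeholders):
# lvcreate -s /dev/main/home --name home-snap --size 1G --chunksize 64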
It is possible to take multiple snapshots of a filesystem. For example, you could have snapshots of /home for each day in the preceding week, making it even easier for your users to restore their own files in the case of accidental deletion or damage. However, when you have multiple snapshots in place, a single write can trigger a lot of copy-on-write activity, so don't go overboard, or your write performance could really suffer.
The LVM system has striping capability, which spreads data over multiple PVs. Data can be read from multiple PVs simultaneously, increasing throughput in some cases.
To enable striping, use the -i (stripe-count) and -I (stripe-size) arguments to the lvcreate command:
# lvcreate main -i 3 -I 8 --name mysql --size 20G
The stripe count must be equal to or less than the number of PVs in the VG, and the stripe size (which is in kilobytes) must be a power of 2 between 4 and 512.
You can also select striping in the LV Properties area of the Create New Logical Volume dialog ( Figure 6-4 ).
To protect data integrity, recent versions of LVM provide a mirroring capability, which stores two copies of each physical extent on two different disks. However, this is noted as a technology preview capability in Fedora Core 6, meaning that it's at a beta-test stage.
An alternative approach that is stable, proven, and provides a wider range of configuration options is to layer LVM on top of the md RAID system (discussed in Lab 6.2, "Managing RAID ").
LVM can be layered on top of the Linux md RAID driver, which combines the flexibility of LVM with striping, mirroring, and advanced error-correction capabilities. See Lab 6.2, "Managing RAID ," for details on how this is configured.
Although you can use a raw disk as a PV, it's not recommended. The graphical administration tools don't support it, and the amount of space lost to a partition table is minimal (about 1 KB).
If you suspect that a disk drive is failing, and you want to save the data that is on that drive, you can add a replacement PV to your volume group, migrate the data off the failing (or slow or undersized) disk onto the new PV, and then remove the original disk from the volume group.
To migrate data off a specific PV, use the pvmove command:
# pvmove /dev/hda3
LVM is all about flexibility, but for absolute maximum flexibility, divide your disk into multiple partitions and then add each partition to your volume group as a separate PV.
For example, if you have a 100 GB disk drive, you can divide the disk into five 20 GB partitions and use those as physical volumes in one volume group.
The advantage to this approach is that you can free up one or two of those PVs for use with another operating system at a later date. You can also easily switch to a RAID array by adding one (or more) disks, as long as 20 percent of your VG is free, with the following steps:
1. Migrate data off one of the PVs.
2. Remove that PV from the VG.
3. Remake that PV as a RAID device.
4. Add the new RAID PV back into the VG.
5. Repeat the process for the remaining PVs.
You can use this same process to change RAID levels (for example, switching from RAID-1 (mirroring) to RAID-5 (rotating ECC) when going from two disks to three or more disks).
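In command terms, one pass of the disk-replacement cycle above looks roughly like this; all device names are placeholders, and the sketch assumes a RAID 1 pair built from the freed partition and a same-sized partition on the new disk:
# pvmove /dev/sda5
# vgreduce main /dev/sda5
# mdadm --create -n 2 -l raid1 /dev/md1 /dev/sda5 /dev/sdb5
# pvcreate /dev/md1
# vgextend main /dev/md1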
The manpages for lvm , vgcreate , vgremove , vgextend , vgreduce , vgdisplay , vgs , vgscan , vgchange , pvcreate , pvremove, pvmove , pvdisplay , pvs , lvcreate , lvremove , lvextend , lvreduce , lvresize , lvdisplay , lvs
The LVM2 Resource page: http://sourceware.org/lvm2/
A Red Hat article on LVM: http://www.redhat.com/magazine/009jul05/departments/red_hat_speaks/
Redundant Arrays of Inexpensive Disks (RAID) is a technology for boosting storage performance and reducing the risk of data loss due to disk error. It works by storing data on multiple disk drives and is well supported by Fedora. It's a good idea to configure RAID on any system used for serious work.
RAID can be managed by the kernel, by the kernel working with the motherboard BIOS, or by a separate computer on an add-in card. RAID managed by the BIOS is called dmraid ; while supported by Fedora Core, it does not provide any significant benefits over RAID managed solely by the kernel on most systems, since all the work is still performed by the main CPU.
Using dmraid can thwart data-recovery efforts if the motherboard fails and another motherboard of the same model (or a model with a compatible BIOS dmraid implementation) is not available.
Add-in cards that contain their own CPU and battery-backed RAM can reduce the load of RAID processing on the main CPU. However, on a modern system, RAID processing takes at most 3 percent of the CPU time, so the expense of a separate, dedicated RAID processor is wasted on all but the highest-end servers. So-called RAID cards without a CPU simply provide additional disk controllers, which are useful because each disk in a RAID array should ideally have its own disk-controller channel.
There are six "levels" of RAID that are supported by the kernel in Fedora Core, as outlined in Table 6-3.
Table 6-3. RAID levels supported by Fedora Core
RAID Level | Description | Protection against drive failure | Write performance | Read performance | Number of drives | Capacity |
---|---|---|---|---|---|---|
Linear | Linear/Append. Devices are concatenated together to make one large storage area (deprecated; use LVM instead). | No. | Normal. | Normal | 2 | Sum of all drives |
0 | Striped. The first block of data is written to the first block on the first drive, the second block of data is written to the first block on the second drive, and so forth. | No. | Normal to normal multiplied by the number of drives, depending on application. | Multiplied by the number of drives | 2 or more | Sum of all drives |
1 | Mirroring. All data is written to two (or more) drives. | Yes. As long as one drive is working, your data is safe. | Normal. | Multiplied by the number of drives | 2 or more | Equal to one drive |
4 | Dedicated parity. Data is striped across all drives except that the last drive gets parity data for each block in that "stripe." | Yes. One drive can fail (but any more than that will cause data loss). | Reduced: two reads and one write for each write operation. The parity drive is a bottleneck. | Multiplied by the number of drives minus one | 3 or more | Sum of all drives except one |
5 | Distributed parity. Like level 4, except that the drive used for parity is rotated from stripe to stripe, eliminating the bottleneck on the parity drive. | Yes. One drive can fail. | Like level 4, except with no parity bottleneck. | Multiplied by the number of drives minus one | 3 or more | Sum of all drives except one |
6 | Distributed error-correcting code. Like level 5, but with redundant information on two drives. | Yes. Two drives can fail. | Same as level 5. | Multiplied by the number of drives minus two | 4 or more | Sum of all drives except two |
For many desktop configurations, RAID level 1 (RAID 1) is appropriate because it can be set up with only two drives. For servers, RAID 5 or 6 is commonly used.
Although Table 6-3 specifies the number of drives required by each RAID level, the Linux RAID system is usually used with disk partitions, so a partition from each of several disks can form one RAID array, and another set of partitions from those same drives can form another RAID array.
RAID arrays should ideally be set up during installation, but it is possible to create them after the fact. The mdadm command is used for all RAID administration operations; no graphical RAID administration tools are included in Fedora.
The fastest way to see the current RAID configuration and status is to display the contents of /proc/mdstat:
$ cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hdc1[1] hda1[0]
102144 blocks [2/2] [UU]
md1 : active raid1 hdc2[1] hda3[0]
1048576 blocks [2/2] [UU]
md2 : active raid1 hdc3[1]
77023232 blocks [2/1] [_U]
This display indicates that only the raid1 ( mirroring) personality is active, managing three device nodes:
md0
This is a two-partition mirror, incorporating /dev/hda1 (device 0) and /dev/hdc1 (device 1). The total size is 102,144 blocks (about 100 MB). Both devices are active.
md1
This is another two-partition mirror, incorporating /dev/hda3 as device 0 and /dev/hdc2 as device 1. It's 1,048,576 blocks long (1 GB), and both devices are active.
md2
This is yet another two-partition mirror, but only one partition ( /dev/hdc3 ) is present. The size is about 75 GB.
The designations md0 , md1 , and md2 refer to multidevice nodes that can be accessed as /dev/md0 , /dev/md1 , and /dev/md2 .
You can get more detailed information about RAID devices using the mdadm command with the -D (detail) option. Let's look at md0 and md2 :
# mdadm -D /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Mon Aug 9 02:16:43 2004
Raid Level : raid1
Array Size : 102144 (99.75 MiB 104.60 MB)
Device Size : 102144 (99.75 MiB 104.60 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Tue Mar 28 04:04:22 2006
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : dd2aabd5:fb2ab384:cba9912c:df0b0f4b
Events : 0.3275
Number Major Minor RaidDevice State
0 3 1 0 active sync /dev/hda1
1 22 1 1 active sync /dev/hdc1
# mdadm -D /dev/md2
/dev/md2:
Version : 00.90.03
Creation Time : Mon Aug 9 02:16:19 2004
Raid Level : raid1
Array Size : 77023232 (73.46 GiB 78.87 GB)
Device Size : 77023232 (73.46 GiB 78.87 GB)
Raid Devices : 2
Total Devices : 1
Preferred Minor : 2
Persistence : Superblock is persistent
Update Time : Tue Mar 28 15:36:04 2006
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 0
Spare Devices : 0
UUID : 31c6dbdc:414eee2d:50c4c773:2edc66f6
Events : 0.19023894
Number Major Minor RaidDevice State
0 0 0 - removed
1 22 3 1 active sync /dev/hdc3
Note that md2 is marked as degraded because one of the devices is missing.
To create a RAID array, you will need two block devices, usually two partitions on different disk drives.
If you want to experiment with RAID, you can use two USB flash drives; in these next examples, I'm using some 64 MB flash drives that I have lying around. If your USB drives are auto-mounted when you insert them, unmount them before using them for RAID, either by right-clicking on them on the desktop and selecting Unmount Volume or by using the umount command.
The mdadm option --create is used to create a RAID array:
# mdadm --create -n 2 -l raid1 /dev/md0 /dev/sdb1 /dev/sdc1
mdadm: array /dev/md0 started.
There are a lot of arguments used here:
--create
Tells mdadm to create a new disk array.
-n 2
The number of block devices in the array.
-l raid1
The RAID level.
/dev/md0
The name of the md device.
/dev/sdb1 /dev/sdc1
The two devices to use for this array.
/proc/mdstat shows the configuration of /dev/md0 :
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0]
63872 blocks [2/2] [UU]
unused devices: <none>
If you have three or more devices, you can use RAID 5, and if you have four or more, you can use RAID 6. This example creates a RAID 5 array:
# mdadm --create -n 3 -l raid5 /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdf1
mdadm: largest drive (/dev/sdb1) exceed size (62464K) by more than 1%
Continue creating array? y
mdadm: array /dev/md0 started.
Note that RAID expects all of the devices to be the same size. If they are not, the array will use only the amount of storage equal to the smallest partition on each of the devices; for example, if given partitions that are 50 GB, 47.5 GB, and 52 GB in size, the RAID system will use 47.5 GB in each of the three partitions, wasting 5 GB of disk space. If the variation between devices is more than 1 percent, as in this case, mdadm will prompt you to confirm that you're aware of the difference (and therefore the wasted storage space).
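If you are not sure how closely your partitions match, you can compare their sizes (in 1 KB blocks) before creating the array; /proc/partitions is one quick way to check:
# cat /proc/partitions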
Once the RAID array has been created, make a filesystem on it, as you would with any other block device:
# mkfs -t ext3 /dev/md0
mke2fs 1.38 (30-Jun-2005)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
16000 inodes, 63872 blocks
3193 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=65536000
8 block groups
8192 blocks per group, 8192 fragments per group
2000 inodes per group
Superblock backups stored on blocks:
8193, 24577, 40961, 57345
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 28 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
Then mount it and use it:
# mkdir /mnt/raid
# mount /dev/md0 /mnt/raid
Alternately, you can use it as a PV under LVM. In this example, a new VG test is created, containing the LV mysql :
# pvcreate /dev/md0
Physical volume "/dev/md0" successfully created
# vgcreate test /dev/md0
Volume group "test" successfully created
# lvcreate test --name mysql --size 60M
Logical volume "mysql" created
# mkfs -t ext3 /dev/test/mysql
mke2fs 1.38 (30-Jun-2005)
...(Lines skipped)...
This filesystem will be automatically checked every 36 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
# mkdir /mnt/mysql
# mount /dev/test/mysql /mnt/mysql
You can simulate the failure of a RAID array element using mdadm :
# mdadm --fail /dev/md0 /dev/sdc1
mdadm: set /dev/sdc1 faulty in /dev/md0
The "failed" drive is marked with the symbol (F) in /proc/ mdstat :
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[2](F) sdb1[0]
63872 blocks [2/1] [U_]
unused devices: <none>
To place the "failed" element back into the array, remove it and add it again:
# mdadm --remove /dev/md0 /dev/sdc1
mdadm: hot removed /dev/sdc1
# mdadm --add /dev/md0 /dev/sdc1
mdadm: re-added /dev/sdc1
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0]
63872 blocks [2/1] [U_]
[>....................] recovery = 0.0% (928/63872) finish=3.1min speed=309K/sec
unused devices: <none>
If the drive had really failed (instead of being subject to a simulated failure), you would replace the drive after removing it from the array and before adding the new one.
Do not hot-plug disk drives (i.e., physically remove or add them with the power turned on) unless the drive, disk controller, and connectors are all designed for this operation. If in doubt, shut down the system, switch the drives while the system is turned off, and then turn the power back on.
If you check /proc/mdstat a short while after re-adding the drive to the array, you can see that the RAID system automatically rebuilds the array by copying data from the good drive(s) to the new drive:
# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdc1[1] sdb1[0]
63872 blocks [2/1] [U_]
[=============>.......] recovery = 65.0% (42496/63872)
finish=0.8min speed=401K/sec
unused devices: <none>
The mdadm command shows similar information in a more verbose form:
# mdadm -D /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Thu Mar 30 01:01:00 2006
Raid Level : raid1
Array Size : 63872 (62.39 MiB 65.40 MB)
Device Size : 63872 (62.39 MiB 65.40 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Thu Mar 30 01:48:39 2006
State : clean, degraded, recovering
Active Devices : 1
Working Devices : 2
Failed Devices : 0
Spare Devices : 1
Rebuild Status : 65% complete
UUID : b7572e60:4389f5dd:ce231ede:458a4f79
Events : 0.34
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 spare rebuilding /dev/sdc1
A RAID array can be stopped anytime it is not in use, which is useful if you have built an array incorporating removable or external drives that you want to disconnect. If you're using the RAID device as an LVM physical volume, you'll need to deactivate the volume group so the device is no longer considered to be in use:
# vgchange test -an
0 logical volume(s) in volume group "test" now active
The -an argument here means activated: no . (Alternately, you can remove the PV from the VG using vgreduce .)
To stop the array, use the --stop option to mdadm :
# mdadm --stop /dev/md0
The two steps above will automatically be performed when the system is shut down.
To restart the array, use the --assemble option:
# mdadm --assemble /dev/md0 /dev/sdb1 /dev/sdc1
mdadm: /dev/md0 has been started with 2 drives.
To configure the automatic assembly of this array at boot time, obtain the array's UUID (unique ID number) from the output of mdadm -D :
# mdadm -D /dev/md0
/dev/md0:
Version : 00.90.03
Creation Time : Thu Mar 30 02:09:14 2006
Raid Level : raid1
Array Size : 63872 (62.39 MiB 65.40 MB)
Device Size : 63872 (62.39 MiB 65.40 MB)
Raid Devices : 2
Total Devices : 2
Preferred Minor : 0
Persistence : Superblock is persistent
Update Time : Thu Mar 30 02:19:00 2006
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
UUID : 5fccf106:d00cda80:daea5427:1edb9616
Events : 0.18
Number Major Minor RaidDevice State
0 8 17 0 active sync /dev/sdb1
1 8 33 1 active sync /dev/sdc1
Then create the file /etc/mdadm.conf if it doesn't exist, or add an ARRAY line to it if it does:
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 uuid=5fccf106:d00cda80:daea5427:1edb9616
In this file, the DEVICE line identifies the devices to be scanned (all partitions of all storage devices in this case), and the ARRAY lines identify each RAID array that is expected to be present. This ensures that the RAID arrays identified by scanning the partitions will always be assigned the same md device numbers, which is useful if more than one RAID array exists in the system. In the mdadm.conf files created during installation by Anaconda, the ARRAY lines contain optional level= and num-devices= entries (see the next section).
If the device is a PV, you can now reactivate the VG:
# vgchange test -a y
1 logical volume(s) in volume group "test" now active
The mdmonitor service uses the monitor mode of mdadm to monitor and report on RAID drive status.
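The service is normally enabled by default; if you want to confirm that it is set to start at boot and is currently running, the standard service tools will do (shown here as a quick check, not a required step):
# chkconfig mdmonitor on
# service mdmonitor status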
The method used to report drive failures is configured in the file /etc/mdadm.conf. To send email to a specific email address, add or edit the MAILADDR line:
# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR raid-alert
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=dd2aabd5:fb2ab384:cba9912c:df0b0f4b
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=2b0846b0:d1a540d7:d722dd48:c5d203e4
ARRAY /dev/md2 level=raid1 num-devices=2 uuid=31c6dbdc:414eee2d:50c4c773:2edc66f6
When mdadm.conf is configured by Anaconda, the email address is set to root . It is a good idea to set this to an email alias, such as raid-alert , and configure the alias in the /etc/ aliases file to send mail to whatever destinations are appropriate:
raid-alert: chris, 4165559999@msg.telus.com
In this case, email will be sent to the local mailbox chris , as well as to a cell phone.
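After editing /etc/aliases, rebuild the alias database so that the new alias takes effect:
# newaliases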
When an event occurs, such as a drive failure, mdadm sends an email message like this:
From root@bluesky.fedorabook.com Thu Mar 30 09:43:54 2006
Date: Thu, 30 Mar 2006 09:43:54 -0500
From: mdadm monitoring <root@bluesky.fedorabook.com>
To: chris@bluesky.fedorabook.com
Subject: Fail event on /dev/md0:bluesky.fedorabook.com
This is an automatically generated mail message from mdadm
running on bluesky.fedorabook.com
A Fail event had been detected on md device /dev/md0.
It could be related to component device /dev/sdc1.
Faithfully yours, etc.
I like the "Faithfully yours" bit at the end!
If you'd prefer that mdadm run a custom program when an event is detected, perhaps to set off an alarm or other notification, add a PROGRAM line to /etc/mdadm.conf :
# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR raid-alert
PROGRAM /usr/local/sbin/mdadm-event-handler
ARRAY /dev/md0 level=raid1 num-devices=2 uuid=dd2aabd5:fb2ab384:cba9912c:df0b0f4b
ARRAY /dev/md1 level=raid1 num-devices=2 uuid=2b0846b0:d1a540d7:d722dd48:c5d203e4
ARRAY /dev/md2 level=raid1 num-devices=2 uuid=31c6dbdc:414eee2d:50c4c773:2edc66f6
Only one program name can be given. When an event is detected, that program will be run with three arguments: the event, the RAID device, and (optionally) the RAID element. If you wanted a verbal announcement to be made, for example, you could use a script like this:
#!/bin/bash
#
# mdadm-event-handler :: announce RAID events verbally
#
# Set up the phrasing for the optional element name
if [ "$3" ]
then
E=", element $3"
fi
# Separate words (RebuildStarted -> Rebuild Started)
T=$(echo $1|sed "s/\([A-Z]\)/ \1/g")
# Make the voice announcement and then repeat it
echo "Attention! RAID event: $T on $2 $E"|festival --tts
sleep 2
echo "Repeat: $T on $2 $E"|festival --tts
When a drive fails, this script will announce something like "Attention! RAID event: Fail on /dev/md0, element /dev/sdc1" using the Festival speech synthesizer. It will also announce the start and completion of array rebuilds and other important milestones (make sure you keep the volume turned up).
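To check the whole notification chain without failing a drive, you can ask mdadm's monitor mode to generate a test alert for each array; a single one-off run such as the following (the mdmonitor service normally handles routine monitoring) should produce both the email message and a call to the PROGRAM handler:
# mdadm --monitor --scan --oneshot --test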
When a system with RAID 1 or higher experiences a disk failure, the data on the failed drive will be recalculated from the remaining drives. However, data access will be slower than usual, and if any other drives fail, the array will not be able to recover. Therefore, it's important to replace a failed disk drive as soon as possible.
When a server is heavily used or is in an inaccessible location, such as an Internet colocation facility, it makes sense to equip it with a hot spare. The hot spare is installed but unused until another drive fails, at which point the RAID system automatically uses it to replace the failed drive.
To create a hot spare when a RAID array is initially created, use the -x argument to indicate the number of spare devices:
# mdadm --create -l raid1 -n 2 -x 1 /dev/md0 /dev/sdb1 /dev/sdc1 /dev/sdf1
mdadm: array /dev/md0 started.
$ cat /proc/mdstat
Personalities : [raid1] [raid5] [raid4]
md0 : active raid1 sdf1[2](S) sdc1[1] sdb1[0]
62464 blocks [2/2] [UU]
unused devices: <none>
Notice that /dev/sdf1 is marked with the symbol (S) indicating that it is the hot spare.
If an active element in the array fails, the hot spare will take over automatically:
$ cat /proc/mdstat
Personalities : [raid1] [raid5] [raid4]
md0 : active raid1 sdf1[2] sdc1[3](F) sdb1[0]
62464 blocks [2/1] [U_]
[=>...................] recovery = 6.4% (4224/62464) finish=1.5min speed=603K/sec
unused devices: <none>
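You don't have to wait for a real failure to see this happen; marking an element as failed is a convenient way to test the hot spare and your notification setup:
# mdadm /dev/md0 --fail /dev/sdc1
The spare will begin rebuilding immediately, exactly as shown above.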
When you remove, replace, and re-add the failed drive, it will become the hot spare:
# mdadm --remove /dev/md0 /dev/sdc1
mdadm: hot removed /dev/sdc1
...(Physically replace the failed drive)...
# mdadm --add /dev/md0 /dev/sdc1
mdadm: re-added /dev/sdc1
# cat /proc/mdstat
Personalities : [raid1] [raid5] [raid4]
md0 : active raid1 sdc1[2](S) sdf1[1] sdb1[0]
62464 blocks [2/2] [UU]
unused devices: <none>
Likewise, to add a hot spare to an existing array, simply add an extra drive:
# mdadm --add /dev/md0 /dev/sdh1
mdadm: added /dev/sdh1
Since hot spares are not used until another drive fails, it's a good idea to spin them down (stop the motors) to prolong their life. This command will program all of your drives to stop spinning after 15 minutes of inactivity (on most systems, only the hot spares will ever be idle for that length of time):
# hdparm -S 180 /dev/[sh]d[a-z]
Add this command to the end of the file /etc/rc.d/rc.local to ensure that it is executed every time the system is booted:
#!/bin/sh
#
# This script will be executed *after* all the other init scripts.
# You can put your own initialization stuff in here if you don't
# want to do the full Sys V style init stuff.
touch /var/lock/subsys/local
hdparm -S 180 /dev/[sh]d[a-z]
Self-Monitoring, Analysis, and Reporting Technology (SMART) is built into most modern disk drives. It provides access to drive diagnostic and error information and failure prediction.
Fedora provides smartd for SMART disk monitoring. The configuration file /etc/smartd.conf is configured by the Anaconda installer to monitor each drive present in the system and to report only imminent (within 24 hours) drive failure to the root email address:
/dev/hda -H -m root
/dev/hdb -H -m root
/dev/hdc -H -m root
(I've left out the many comment lines that are in this file.)
It is a good idea to change the email address to the same alias used for your RAID error reports:
/dev/hda -H -m raid-alert
/dev/hdb -H -m raid-alert
/dev/hdc -H -m raid-alert
If you add additional drives to the system, be sure to add additional entries to this file.
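The smartmontools package that supplies smartd also includes the smartctl utility, which lets you query a drive on demand; the exact output depends on the drive:
# smartctl -H /dev/hda
# smartctl -a /dev/hda
The -H option reports the drive's overall health assessment, while -a prints the full set of SMART attributes and error logs.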
Fedora's RAID levels 4 and 5 use parity information to provide redundancy. Parity is calculated using the exclusive-OR function, as shown in Table 6-4.
Table 6-4. Parity calculation for two drives
Bit from drive A | Bit from drive B | Parity bit on drive C |
---|---|---|
0 | 0 | 0 |
0 | 1 | 1 |
1 | 0 | 1 |
1 | 1 | 0 |
Notice that the total number of 1 bits in each row is an even number. You can determine the contents of any column based on the values in the other two columns ( A = B XOR C and B = A XOR C ); in this way, the RAID system can determine the content of any one failed drive. This approach will work with any number of drives.
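You can verify this recovery property with a quick shell-arithmetic experiment, treating each value as a byte stored on one of the drives:
$ A=197; B=58
$ P=$(( A ^ B ))       # parity byte stored on drive C
$ echo $(( P ^ B ))    # drive A's byte, reconstructed from drives B and C
197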
Parity calculations are performed using the CPU's vector instructions (MMX/3DNow/SSE/AltiVec) whenever possible. Even an old 400 MHz Celeron processor can calculate RAID 5 parity at a rate in excess of 2 GB per second.
RAID 6 uses a similar but more advanced error-correcting code (ECC) that takes two bits of data for each row. This code permits recovery from the failure of any two drives, but the calculations run about one-third slower than the parity calculations. In a high-performance context, it may be better to use RAID 5 with a hot spare instead of RAID 6; the protection will be almost as good and the performance will be slightly higher.
During the early stages of the boot process, no RAID driver is available. However, in a RAID 1 (mirroring) array, each element contains a full and complete copy of the data in the array and can be used as though it were a simple volume. Therefore, only RAID 1 can be used for the /boot filesystem.
The GRUB boot record should be written to each drive that contains the /boot filesystem (see Lab 10.5, "Configuring the GRUB Bootloader")
RAID can combine drives of different types into an array. This can be very useful at times; for example, you can use a USB hard disk to replace a failed SATA drive in a pinch.
Daily disk or tape backups can be up to 24 hours out of date, which can hamper recovery when your main server is subject to a catastrophic disaster such as fire, circuit-frying power-supply-unit failure, or theft. Up-to-the-minute data backup for rapid disaster recovery requires the use of a remote storage mirror.
iSCSI (SCSI over TCP/IP) is a storage area network (SAN) technology that is an economical alternative to Fibre Channel and other traditional SAN technologies. Since it is based on TCP/IP, it is easy to route over long distances, making it ideal for remote mirroring.
Fedora Core includes an iSCSI initiator , the software necessary to remotely access a drive using the iSCSI protocol. The package name is iscsi-initiator-utils . Obviously, you'll need a remote iSCSI drive in order to do remote mirroring, and you'll need to know the portal IP address or hostname on the remote drive.
Create the file /etc/initiatorname.iscsi , containing one line:
InitiatorName=iqn.2006-04.com.fedorabook:bluesky
This configures an iSCSI Qualified Name (IQN) that is globally unique. The IQN consists of the letters iqn, a period, the year and month in which your domain was registered (2006-04), a period, your domain name with the elements reversed, a colon, and a string that you make up (which must be unique within your domain).
Once the initiator name has been set up, start the iscsi service:
# service iscsi start
You may see some error messages the first time you start the iscsi daemon; these can be safely ignored.
Next, use the iscsiadm command to discover the volumes (targets) available on the remote system:
# iscsiadm -m discovery -tst -p 172.16.97.2
[f68ace] 172.16.97.2:3260,1 iqn.2006-04.com.fedorabook:remote1-volume1
The options indicate discovery mode, the sendtargets (st) discovery type, and the portal address or hostname. The result that is printed shows the IQN of the remote target, preceded by a node record ID at the start of the line (f68ace). The discovered target information is stored in a database for future reference, and the node record ID is the key to accessing this information.
If the remote drive requires a user ID and password for connection, edit /etc/iscsid.conf.
To connect to the remote system, use iscsiadm to log in:
# iscsiadm -m node --record f68ace --login
The details of the connection are recorded in /var/log/messages :
Mar 30 22:05:18 blacktop kernel: scsi1 : iSCSI Initiator over TCP/IP, v.0.3
Mar 30 22:05:19 blacktop kernel: Vendor: IET Model: VIRTUAL-DISK Rev: 0
Mar 30 22:05:19 blacktop kernel: Type: Direct-Access ANSI SCSI revision: 04
Mar 30 22:05:19 blacktop kernel: SCSI device sda: 262144 512-byte hdwr sectors (134 MB)
Mar 30 22:05:19 blacktop kernel: sda: Write Protect is off
Mar 30 22:05:19 blacktop kernel: SCSI device sda: drive cache: write back
Mar 30 22:05:19 blacktop kernel: SCSI device sda: 262144 512-byte hdwr sectors (134 MB)
Mar 30 22:05:19 blacktop kernel: sda: Write Protect is off
Mar 30 22:05:19 blacktop kernel: SCSI device sda: drive cache: write back
Mar 30 22:05:19 blacktop kernel: sda: sda1
Mar 30 22:05:19 blacktop kernel: sd 14:0:0:0: Attached scsi disk sda
Mar 30 22:05:19 blacktop kernel: sd 14:0:0:0: Attached scsi generic sg0 type 0
Mar 30 22:05:19 blacktop iscsid: picking unique OUI for the same target node name iqn.2006-04.com.fedorabook:remote1-volume1
Mar 30 22:05:20 blacktop iscsid: connection1:0 is operational now
This shows that the new device is accessible as /dev/sda and has one partition (/dev/sda1). On a system that already has SCSI or SATA disks, the new device receives the next free name instead; in the examples that follow, it appears as /dev/sdi, with the partition /dev/sdi1.
You can now create a local LV that is the same size as the remote drive:
# lvcreate main --name database --size 128M
Logical volume "database" created
And then you can make a RAID mirror incorporating the local LV and the remote drive:
# mdadm --create -l raid1 -n 2 /dev/md0 /dev/main/database /dev/sdi1
mdadm: array /dev/md0 started.
Next, you can create a filesystem on the RAID array and mount it:
# mkfs -t ext3 /dev/md0
mke2fs 1.38 (30-Jun-2005)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
32768 inodes, 130944 blocks
6547 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=67371008
16 block groups
8192 blocks per group, 8192 fragments per group
2048 inodes per group
Superblock backups stored on blocks:
8193, 24577, 40961, 57345, 73729
Writing inode tables: done
Creating journal (4096 blocks): done
Writing superblocks and filesystem accounting information: done
This filesystem will be automatically checked every 27 mounts or
180 days, whichever comes first. Use tune2fs -c or -i to override.
# mkdir /mnt/database
# mount /dev/md0 /mnt/database
Any data you write to /mnt/database will be written to both the local volume and the remote drive.
Do not use iSCSI directly over the Internet: route iSCSI traffic through a private TCP/IP network or a virtual private network (VPN) to maintain the privacy of your stored data.
To shut down the remote mirror, reverse the steps:
# umount /mnt/database
# mdadm --stop /dev/md0
# iscsiadm -m node --record f68ace --logout
A connection will be made to the remote node whenever the iSCSI daemon starts. To prevent this, edit the file /etc/iscsid.conf :
#
# Open-iSCSI default configuration.
# Could be located at /etc/iscsid.conf or ~/.iscsid.conf
#
node.active_cnx = 1
node.startup = automatic
#node.session.auth.username = dima
#node.session.auth.password = aloha
node.session.timeo.replacement_timeout = 0
node.session.err_timeo.abort_timeout = 10
node.session.err_timeo.reset_timeout = 30
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.session.iscsi.DefaultTime2Wait = 0
node.session.iscsi.DefaultTime2Retain = 0
node.session.iscsi.MaxConnections = 0
node.cnx[0].iscsi.HeaderDigest = None
node.cnx[0].iscsi.DataDigest = None
node.cnx[0].iscsi.MaxRecvDataSegmentLength = 65536
#discovery.sendtargets.auth.authmethod = CHAP
#discovery.sendtargets.auth.username = dima
#discovery.sendtargets.auth.password = aloha
Change the node.startup line to read:
node.startup = manual
Once the remote mirror has been configured, you can create a simple script file with the setup commands:
#!/bin/bash
iscsiadm -m node --record f68ace --login
mdadm --assemble /dev/md0 /dev/main/database /dev/sdi1
mount /dev/md0 /mnt/database
And another script file with the shutdown commands:
#!/bin/bash
umount /mnt/database
mdadm --stop /dev/md0
iscsiadm -m node --record f68ace --logout
Save these scripts into /usr/local/sbin and enable read and execute permission for both of them:
# chmod u+rx /usr/local/sbin/remote-mirror-start
# chmod u+rx /usr/local/sbin/remote-mirror-stop
You can also install these as init scripts (see Lab 4.6, "Managing and Configuring Services," and Lab 4.12, "Writing Simple Scripts").
Sharing a single hot spare among several RAID arrays can be done through /etc/mdadm.conf. In each ARRAY line, add a spare-group option:
# mdadm.conf written out by anaconda
DEVICE partitions
MAILADDR root
ARRAY /dev/md0 spare-group=red uuid=5fccf106:d00cda80:daea5427:1edb9616
ARRAY /dev/md1 spare-group=red uuid=aaf3d1e1:6f7231b4:22ca60f9:00c07dfe
The name of the spare-group does not matter as long as all of the arrays sharing the hot spare have the same value; here I've used red . Ensure that at least one of the arrays has a hot spare and that the size of the hot spare is not smaller than the largest element that it could replace; for example, if each device making up md0 was 10 GB in size, and each element making up md1 was 5 GB in size, the hot spare would have to be at least 10 GB in size, even if it was initially a member of md1 .
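Spare migration between arrays in the same spare-group is carried out by mdadm running in monitor mode, so the mdmonitor service described earlier in this lab must be running for the shared spare to move automatically. To make sure it is active now and at every boot:
# service mdmonitor start
# chkconfig mdmonitor on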
Array rebuilds will usually be performed at a rate of 1,000 to 20,000 KB per second per drive, scheduled in such a way that the impact on application storage performance is minimized. Adjusting the rebuild rate lets you adjust the trade-off between application performance and rebuild duration.
The settings are accessible through two pseudofiles in /proc/sys/dev/raid , named speed_limit_max and speed_limit_min . To view the current values, simply display the contents:
$ cat /proc/sys/dev/raid/speed_limit*
200000
1000
To change a setting, place a new number in the appropriate pseudo-file:
# echo 40000 >/proc/sys/dev/raid/speed_limit_max
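A value written with echo lasts only until the next reboot. To make the setting persistent, one approach is to add a line to /etc/sysctl.conf, which is applied at boot and can be reloaded immediately with sysctl -p:
dev.raid.speed_limit_max = 40000
# sysctl -p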
Sometimes, a drive manufacturer just makes a bad batch of disks, and this has happened more than once. For example, a few years ago, one drive maker used defective plastic to encapsulate the chips on the drive electronics; drives with the defective plastic failed at around the same point in their life cycles, so that several elements of RAID arrays built using these drives would fail within a period of days or even hours. Since most RAID levels provide protection against a single drive failure but not against multiple drive failures, data was lost.
For greatest safety, it's a good idea to buy disks of similar capacity from different drive manufacturers (or at least different models or batches) when building a RAID array, in order to reduce the likelihood of near-simultaneous drive failure.
The manpages for md , mdadm , mdadm.conf , hdparm , smartd , smartd.conf , mkfs , mke2fs , and dmraid
The manpages for iscsid and iscsiadm
The Linux-iSCSI project: http://linux-iscsi.sourceforge.net
The Enterprise iSCSI Target project: http://iscsitarget.sourceforge.net/
Hard disks are mechanical devices. They are guaranteed to wear out, fail, and lose your data. The only unknown is when they will fail.
Data backup is performed to guard against drive failure. But it's also done to guard against data loss due to theft, fire, accidental deletion, bad editing, software defects, and unnoticed data corruption.
Before making backups, you must decide:
What data needs to be backed up
How often the data needs to be backed up
How quickly you need to restore the data
How far back in time you need to be able to restore
Based on this information, you can develop a backup strategy, including a backup technology, schedule, and rotation.
Any data that you want to preserve must be backed up; usually, this does not include the operating system or applications, because you can reinstall those.
Table 6-5 lists some common system roles and the directories that should be considered for backup.
Table 6-5. Directories used for critical data storage in various common system roles
System role | Standard directories | Notes |
---|---|---|
Database server (e.g., MySQL) | /var/lib/mysql | Stop the database server or use snapshots to ensure consistency between tables. |
Web server | /var/www, /etc/httpd, /home/*/public_html | Also include any data directories used by web applications. |
DNS nameserver | /var/named, /etc/named.conf | This information usually changes slowly. |
Desktop system, or any system accessed by individual users | /home | Exclude cache directories such as /home/*/.mozilla/firefox/*/Cache. |
Samba server | All directories served by Samba | |
CUPS print server | /etc/cups | Configuration information only; usually changes slowly. |
All systems | /etc | Configuration information for most software and hardware installed on the system. |
Generally, backup frequency should be decided based on how often (and when) the data changes, and how many changes you are willing to lose.
For example, printer configuration data may be changed only a few times a year, and losing the latest change won't cost much in terms of the work required to re-create that change. Word processing documents may be changed daily, and you may want to ensure that you don't lose more than one day's work (or even a half-day's work); on the other hand, orders on a busy web site may be received every few seconds, and you may decide that you can't live with the loss of more than a few minutes' worth of data.
How long can you live without your data? The answer probably depends on regulatory and operational issues.
Some types of information, such as information about cross-border shipments, must be reported to government agencies on a daily basis, and delays are penalized by fines of thousands of dollars per day. This puts a tremendous amount of pressure on the data-recovery process. On the other hand, personal music and photo collections may not need to be restored until weeks or months after the data loss.
Some types of data loss or corruption may not be realized until weeks, months, or years after they have occurred, while others will be immediately obvious. In some cases, when data changes quickly, it may be necessary to be able to restore data to the state it was in on a specific date, while in other cases it's sufficient to be able to restore data to the state that it was in at the end of a particular month.
Files may be selected for backup on an incremental basis, in which only files that have been changed since the last backup are selected, or a full backup may be performed.
Incremental backups often require significantly less storage space than full backups when dealing with large sets of individual files such as word processing documents because the number of documents that are changed each day is usually fairly small. On the other hand, a small SQL update query may cause all of the files in a database to be modified, nullifying the benefits of incremental backup in that context.
An incremental backup scheme usually involves making full backups periodically and then making incremental backups until the scheduled time of the next full backup. Restoring from an incremental backup therefore requires you to restore a full backup, then restore all of the incremental backups from that point forward. Thus, the time required for a restore operation may be much longer than for a system that uses only full backups. Also, if one of the backups is unusable due to media corruption or damage, you will not be able to reliably perform a full recovery.
Given the choice between full and incremental backups, I recommend using full backups whenever practical.
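If you do decide to use incremental backups, GNU tar can track changes between runs with its --listed-incremental (-g) option. This is only a sketch, and the snapshot-file path is an arbitrary choice:
# tar -czf /dev/st0 -g /var/lib/backup/etc-home.snar /system-* /etc /home
The first run (when the snapshot file does not yet exist) produces a full backup; subsequent runs using the same snapshot file save only files changed since the previous run, and a restore must replay the full backup followed by each incremental in order.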
Cost, capacity, and speed usually drive the selection of backup media. There are many options available:
DVD±R/RW
DVD is an attractive medium. Fedora includes software to produce compressed optical discs that are automatically decompressed by the kernel when they are read. The compression ratio will depend on the type of data being backed up; text files may compress by 75-90 percent, while data that is already in a compressed format (such as OpenOffice.org documents) may not compress at all. You can reasonably expect 50 percent compression for a typical mix of user files, and 75 percent for databases containing text data; that means a single-sided DVD±R, which costs only a few cents and which has a nominal capacity of 4.7 GB (usable capacity of slightly over 4.3 GB), will hold 8+ GB of regular user files or 16+ GB of database files. DVD is also a fast, random-access medium.
CD-R/RW
Similar to DVD, with a lower storage capacity and wider deployment. Because higher-capacity DVDs are similarly priced (actually, cheaper in some jurisdictions, such as Canada, due to music levies on CDs), DVDs are preferred except when backing up a device such as a laptop that has only a CD-RW drive.
Tape
Tape is by far the most economical choice for high-volume data backup (>10 GB uncompressed), but it still doesn't come cheap. Tape drives can cost more than the disk drives being backed up, and each backup tape can cost 25-50 percent of the price of the corresponding disk storage. Tapes are also fairly slow during search and restore operations due to their sequential nature.
Disk
Hard disks can be used for data backup. USB drives are particularly convenient for this purpose, but removable drive trays can also be used with ATA or SATA drives. Hard drives are fast, but expensive and fragile.
Remote storage
Copying an archive of data to a remote system periodically.
Remote mirror
Making an immediate copy of all data written to the local disk drive provides the ultimate backup, but this approach is complicated and does not by itself guard against data corruption or accidental file deletion. For one approach to remote mirroring, see ". . . mirroring to a remote drive as part of a disaster-recovery plan?" in the "What About . . ." section in Lab 6.2, "Managing RAID."
I'm going to focus on DVD and tape storage options in this lab.
When using DVDs, you have the option of selecting DVD±R media, which can only be written once. This provides an inexpensive, compact, and permanent archive through time; assuming one disc per day, a year's worth of discs will take only about 4 L of space and cost less than $100.
For tape and DVD±RW media, you'll need to decide on your media rotation strategy. This is a compromise between the number of tapes/discs and how far back in time you wish to restore.
A simple rotation scheme involves buying a set amount of media and rotating through it. For example, 20 discs or tapes used only on weekdays will enable you to restore files to the state they were in during any weekday in the preceding four weeks.
A multilevel scheme permits you to go back farther in time. A simple three-level scheme (known as Grandfather/Father/Son ) is shown in Table 6-6 .
Table 6-6. Grandfather/Father/Son backup scheme with 20 discs/tapes
Level | Media used | Discs or tapes required |
---|---|---|
A (Son) | Monday-Thursday | 4 |
B (Father) | Three out of every four Fridays | 3 |
C (Grandfather) | Fridays not covered by level B | 13 |
This scheme uses the same 20 discs or tapes, but permits you to restore to:
Any weekday in the preceding week
The end of any week in the preceding four weeks
The end of any four-week period in the preceding year
Note that level A media will be more frequently used than level B or C media and will therefore need to be replaced more often.
You must also decide where and how you will store your media. Unless the media is stored offsite, a disaster such as fire or theft could result in the loss of both the original storage drives and the backup media, but storing media offsite will slow the restoration process.
There are many ways of labeling backups, but one of the easiest is to create a file named system-<hostname> in the root directory immediately before producing the backup, and include that as the first file in the backup volume:
# touch /system-$(hostname)
# ls -l /system-*
-rw-r--r-- 1 root root 0 Jul 1 01:34 /system-bluesky.fedorabook.com
This will identify the originating system name as well as the date and time of the backup (from the file timestamp).
To back up data to DVD, use the growisofs command:
# growisofs -Z /dev/dvd -RJ -graft-points /etc=/etc /home=/home /system-*
This will back up the /etc and /home directories to /dev/dvd (the default DVD recorder). -Z indicates that this is the first session on the disc, and -RJ enables long-filename handling compatible with Unix/Linux (Rock Ridge) and Windows (Joliet) systems. The -graft-points option permits the backed-up directories to be stored in specific directories on the disc: /etc=/etc and /home=/home specify the directories to be backed up, ensuring that each directory is placed in a directory with the same name on the disc. The final argument, /system-*, places the system label file in the root directory of the DVD.
This command will work with DVD-R, DVD+R, DVD-RW, and DVD+RW media.
To create a compressed DVD, use the mkzftree command to create a compressed copy of the origin directories:
# mkdir /tmp/zftree
# mkzftree /home /tmp/zftree/home
# mkzftree /etc /tmp/zftree/etc
You will need sufficient disk space to hold the compressed image before it is written to the optical disc.
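To confirm that the filesystem holding /tmp has enough room before running mkzftree, compare the size of the source directories with the free space; because the copy is compressed, the real requirement will usually be somewhat less than this upper bound:
$ du -sh /etc /home
$ df -h /tmp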
Then use the -z option to growisofs :
# growisofs -Z /dev/dvd -RJz /tmp/zftree /system-*
Putting this all together into a script, and mailing the results to the email alias backup-alert , we get this:
#!/bin/bash
#
# backup-dvd :: backup selected directories to a compressed DVD
#
# List of the directories to be backed up
DIRLIST="/etc /home"
# Create timestamp file
(
rm -f /system-*
touch /system-$(hostname)
# Make directory for compressed backup tree
rm -rf /tmp/zftree 2>/dev/null
mkdir /tmp/zftree
RESULT=0
for DIR in $DIRLIST
do
mkzftree $DIR /tmp/zftree${DIR}
RESULT=$(( $? + $RESULT ))
done
if [ "$RESULT" -eq 0 ]
then
# Burn the DVD
growisofs -Z /dev/dvd -RJz /tmp/zftree /system-*
# Eject the disc
eject
else
echo "Skipping burn: file compression failed."
fi
# Delete the zftree
rm -rf /tmp/zftree 2>/dev/null
) 2>&1|mail -s "Backup Log $(hostname)" backup-alert
Edit the DIRLIST line so that it contains a list of the directories to be backed up, separated by spaces.
Save this file as /usr/local/bin/backup-dvd and then make it executable:
# chmod u+rx /usr/local/bin/backup-dvd
And be sure to create an email alias for the backup-alert user in the file /etc/aliases :
backup-alert: chris frank
To produce a backup, execute this script:
# backup-dvd
But it's a better idea to configure the system to run this script automatically every night (see Lab 6.4, "Scheduling Tasks ").
To back up directories to tape, use the tape archiver ( tar ):
# tar -cf /dev/st0 /system-* /etc /home
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
In this command, /dev/st0 is the first tape drive, and /etc and /home are the directories being backed up.
To perform a compressed backup, add the z (for gzip compression) or j (for bzip2 compression) option:
# tar -czf /dev/st0 /system-* /etc /home
tar: Removing leading `/' from member names
tar: Removing leading `/' from hard link targets
Here is a script that will perform a tape backup:
#!/bin/bash
#
# backup-tape :: backup selected directories to a compressed tape
#
# List of the directories to be backed up
DIRLIST="
/etc /home "
# Create timestamp file
(
rm -f /system-*
touch /system-$(hostname)
# Produce the tape
tar -czf /dev/st0 /system-* $DIRLIST
# Eject the tape if possible
mt -f /dev/st0 eject
) 2>&1|mail -s "Backup Log $(hostname)" backup-alert
Save this script as /usr/local/bin/backup-tape .
Like the backup-dvd script, this script will send an email report to the email alias backup-alert . To include a list of files in the email report, add the -v option to the tar command:
tar -czvf /dev/st0 /system-* $DIRLIST
To produce a backup tape, run the script from the command line:
# backup-tape
It's best to run this script automatically every night (see Lab 6.4, "Scheduling Tasks ").
When restoring from tape, it's a good idea to restore to a location other than the original file location to ensure that critical data is not accidentally overwritten. These commands will perform a full restore of a tape to the directory /tmp/restore :
# mkdir /tmp/restore
# cd /tmp/restore
# tar xvzf /dev/st0
To restore only certain files, specify the filenames as arguments to tar :
# tar xvzf /dev/st0 home/chris/
If the file specified is a directory, all of the files and subdirectories in that directory will be restored.
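Before copying restored files back into place, you may want to see exactly what differs between the restored copy and the live files; diff can list just the names of files that differ:
# diff -rq /tmp/restore/home/chris /home/chris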
Restoring from disc is easy: just copy the files that you want to the location that you want. You can do this graphically, or you can restore all of the files on the disc:
# mkdir /tmp/restore
# cd /tmp/restore
# cp -r /media/CDROM/* .
To verify that a tape backup is readable, use tar's t option to view a table of contents of the tape:
# tar tvzf /dev/st0
-rw-r--r-- root/root 0 2006-07-01 01:34:24 system-bluesky.fedorabook.com
drwxr-xr-x root/root 0 2005-09-23 15:01:38 etc/gconf/
drwxr-xr-x root/root 0 2005-03-02 11:59:15 etc/gconf/gconf.xml.mandatory/
drwxr-xr-x root/root 0 2005-08-29 00:53:34 etc/gconf/1/
-rw-r--r-- root/root 840 2005-03-02 11:59:11 etc/gconf/1/path
drwxr-xr-x root/root 0 2006-03-20 01:33:22 etc/gconf/schemas/
...(Lines skipped)...
Since the label file /system-* is the first file on the tape, you can view the originating machine as well as the date and time of the backup by just viewing the first line of the table of contents:
# tar tvzf /dev/st0|head -1
-rw-r--r-- root/root 0 2006-07-01 01:34:24 system-bluesky.fedorabook.com
To verify that all of the files on an optical disc are readable, use find to read each file on the mounted disc:
# find /media/cdrecorder -exec cp {} /dev/null \;
Only errors will be reported.
The growisofs command is part of the package dvd+rw-tools , which was originally intended for use with DVD+RW media. Since the original design, it has grown to include support for all DVD media formats. It operates as a frontend to the mkisofs command, which produces a filesystem in the ISO 9660 format that is the standard for optical media, and then writes the mkisofs output to the disc burner.
ISO 9660 is unfortunately limited to eight-character filenames with a three-character extension. The Rock Ridge (RR) extension adds support for long filenames, user and group ownership, and permission mode under Linux; Joliet extensions add similar support for the Windows operating systems. Using the -JR option to growisofs causes the created disk to be compatible with both Rock Ridge and Joliet.
mkzftree makes a recursive copy of a directory structure, compressing any files that would benefit from compression during the copy process. The resulting directory structure can be passed to mkisofs with the -z option, which will cause mkisofs to create additional Rock Ridge records with information about the data compression used. These records in turn enable the kernel's filesystem layer to decompress the files on the fly when reading them from disc.
When backing up to tape, tar converts a directory structure to a continuous stream of bytes. A short header contains the pathname, ownership, permissions modes, size, and timestamps for a file, followed by the data for that file; this is repeated for each file in the archive.
The z option to tar causes it to start gzip and process all data through it. As an alternative, the j option will process the archive stream through bzip2 , which may offer better compression in some circumstances.
You can simply place the appropriate lvcreate and mount commands at the start of your backup script, and umount and lvremove commands at the end of the script.
Here is a slightly fancier version of the DVD backup script, which accepts a list of vg / lv pairs and creates a compressed DVD backup. Set the LVLIST and SNAPSIZE variables to whatever values you wish to use:
#!/bin/bash
#
# backup-dvd :: backup selected directories to a compressed DVD
#
# List of the vg/lv to be backed up
LVLIST="main/home main/var"
# Amount of space to use for snapshots
SNAPSIZE="1G"
# Create timestamp file
(
rm -f /system-*
touch /system-$(hostname)
# Make directory for compressed backup tree
rm -rf /tmp/zftree
mkdir /tmp/zftree
RESULT=0
for VGLV in $LVLIST
do
echo "========= Processing $VGLV..."
# Get information about the vg/lv
VG=$(echo $VGLV|cut -f1 -d/)
LV=$(echo $VGLV|cut -f2 -d/)
SNAPNAME="${LV}-snap"
OLDMOUNT=$(grep "^/dev/${VGLV}" /etc/fstab|tr "\t" " "|tr -s " "|cut -f2 -d" ")
NEWMOUNT="/mnt/snap${OLDMOUNT}"
# Create a snapshot
lvcreate -s $VGLV --name $SNAPNAME --size $SNAPSIZE
RESULT=$(( $? + $RESULT ))
# Mount the snapshot
mkdir -p $NEWMOUNT
mount -o ro /dev/${VG}/${SNAPNAME} ${NEWMOUNT}
RESULT=$(( $? + $RESULT ))
# Place it in the zftree
mkdir -p /tmp/zftree$(dirname $OLDMOUNT)
mkzftree ${NEWMOUNT} /tmp/zftree${OLDMOUNT}
RESULT=$(( $? + $RESULT ))
# Unmount the snapshot
umount $NEWMOUNT
# Release the snapshot
lvremove -f ${VG}/${SNAPNAME}
done
if [ "$RESULT" -eq 0 ]
then
# Burn the DVD
growisofs -Z /dev/dvd -RJz /tmp/zftree /system-*
# Eject the disc
eject
else
echo "Skipping burn: snapshot or file compression failed."
fi
# Delete the zftree
rm -rf /tmp/zftree 2>/dev/null
) 2>&1|mail -s "Backup Log $(hostname)" backup-alert
Each LV to be backed up must have a mount point identified in /etc/fstab.
The device node /dev/st0 is the default (first) tape drive on the system, configured to rewind after each use. /dev/nst0 is the same device but without the automatic rewind.
In order to position the tape, Fedora provides the mt command, described in Table 6-7 .
Table 6-7. mt tape control commands
mt command | Description |
---|---|
mt rewind | Rewinds the tape |
mt fsf | Forward-skips a file |
mt fsf count | Forward-skips count files |
mt bsf | Backward-skips a file |
mt bsf count | Backward-skips count files |
mt status | Displays the drive status |
mt offline or mt eject | Rewinds and ejects the tape (if possible) |
The mt command uses /dev/tape as its default device; create this as a symbolic link to /dev/nst0 if it does not already exist:
# ln -s /dev/nst0 /dev/tape
You can now create a multibackup tape:
# mt rewind
# tar cvzf /dev/tape /home
# tar cvzf /dev/tape /etc
# mt rewind
To read a specific backup on a multibackup tape, rewind to the beginning (just to be sure you're at the start), and then skip any files (backups) necessary to reach the archive you want. These commands will access the table of contents for the second archive, for example:
# mt rewind
# mt fsf
# tar tvzf /dev/tape
etc/
etc/smrsh/
etc/smrsh/mailman
etc/group-
etc/gnopernicus-1.0/
etc/gnopernicus-1.0/translation_tables/
...(Lines snipped)...
Fedora Core includes amanda, a powerful client/server tape backup system designed for backing up multiple machines over a network. See the amanda manpages for details.
The manpages for st , mt , tar , growisofs , mkisofs , and amanda
CD and DVD Archiving: Quick Reference Guide for Care and Handling (NIST): http://www.itl.nist.gov/div895/carefordisc/disccare.html
Magnetic Tape Storage and Handling: A Guide for Libraries and Archives (NML): http://www.imation.com/america/pdfs/AP_NMLdoc_magtape_S_H.pdf
Fedora Core can schedule tasks to be run at specific times. This is useful for making backups, indexing data, clearing out temporary files, and automating downloads, and it's easy to set up.
To schedule a task, use crontab with the -e option to edit your list of scheduled tasks:
$ crontab -e
The vi editor will start up, and any existing scheduled tasks will appear (if you don't have any scheduled tasks, the document will be blank). Edit the file using standard vi editing commands.
Each scheduled task occupies a separate line in this file. Each line consists of five time fields, followed by the command to be executed. In order, the fields are:
minute
The number of minutes past the hour, 0-59
hour
The hour of the day, 0-23
day
The day of the month, 1-31
month
The number of the month, 1-12
day of the week
The day of the week, 0-6 (Sunday to Saturday) or 1-7 (Monday to Sunday), or the name written out
A time field may contain an asterisk, which matches any value.
Here is an example:
30 * * * * /home/chris/bin/task1
The script or program /home/chris/bin/task1 will be executed at 30 minutes past the hour, every hour of every day of every month. Here are some other examples:
15 1 * * * /home/chris/bin/task2
0 22 * * 1 /home/chris/bin/task3
30 0 1 * * /home/chris/bin/task4
0 11 11 11 * /home/chris/bin/task5
task2 will be executed at 1:15 a.m. every day. task3 will be executed at 10:00 p.m. every Monday. task4 will be run at 12:30 a.m. on the first of every month. task5 will be run at 11:00 a.m. each Remembrance Day (Veteran's Day).
You can use a range (low-high), a list of values (1,2,3), or */increment to specify every increment units. Here are some more examples to illustrate:
0,15,30,45 9-16 * * * /home/chris/bin/task6
*/2 * * * * /home/chris/bin/task7
0 7 1-7 * 3 /home/chris/bin/task8
task6 will be run every 15 minutes (at 0, 15, 30, and 45 minutes past the hour) from 9:00 a.m. to 4:45 p.m. every day. task7 will be executed every two minutes. task8 will be executed at 7:00 a.m. on the first Wednesday of each month (the only Wednesday between the first and seventh of the month).
By default, any output (to stdout or stderr ) produced by a scheduled command will be emailed to you. You can change the email destination by including a line that sets the MAILTO environment variable:
MAILTO=cronman@gmail.com
30 * * * * /home/chris/bin/task1
15 1 * * * /home/chris/bin/task2
0 22 * * 1 /home/chris/bin/task3
In fact, you can also set any standard environment variables; the two most useful are SHELL , which overrides the default shell ( bash ), and PATH , which overrides the default path ( /bin:/usr/bin ). Here's an example:
PATH=/bin:/usr/bin:/usr/local/bin:/sbin:/usr/sbin
SHELL=/bin/zsh
MAILTO=""
30 * * * * adjust-network
Fedora also provides a system for running scripts on an hourly, daily, weekly, and monthly basis, simply by placing the script into a designated directory. These scripts run as root . Table 6-8 shows the time of execution for each directory.
Table 6-8. Scheduled task directories
Directory | Frequency | Time of execution | Task examples |
---|---|---|---|
/etc/cron.hourly | Hourly | :01 past each hour | Send/receive netnews |
/etc/cron.daily | Daily | 4:02 a.m. every day | Analyze web logs, rotate logs, delete old temporary files, monitor cryptographic certificate expiry, update installed software |
/etc/cron.weekly | Weekly | 4:22 a.m. every Sunday | Clean up old yum packages, index manpages |
/etc/cron.monthly | Monthly | 4:42 a.m. on the first day of every month | (None defined) |
Many Fedora packages install files into these directories to schedule tasks; for example, the webalizer package installs /etc/cron.daily/00webalizer to set up automatic web log analysis.
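You can use the same mechanism for your own jobs. For example, to run the backup-dvd script from Lab 6.3 every night, you could save a short wrapper (the filename here is only an example) as /etc/cron.daily/backup-dvd:
#!/bin/sh
# Nightly compressed-DVD backup (see Lab 6.3)
/usr/local/bin/backup-dvd
Then make it executable:
# chmod u+rx /etc/cron.daily/backup-dvd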
If a task is not performed because the system is off at the scheduled time, the task is performed at the next boot or the next regularly scheduled time, whichever comes first (except for hourly tasks, which just run at the next scheduled time). Therefore, the regularly scheduled maintenance tasks will still be executed even on a system that is turned on only from (say) 8:00 a.m. to 5:00 p.m. on weekdays.
The cron server daemon executes tasks at preset times. The crontab files created with the crontab command are stored as text files in /var/spool/cron, one file per user.
There is also a system-wide crontab file in /etc/crontab and additional crontab files, installed by various software packages, in /etc/cron.d . These crontab files are different from the ones in /var/spool/cron because they contain one additional field between the time values and the command: the name of the user account that will be used to execute the command.
This is the default /etc/crontab file installed with Fedora Core:
SHELL=/bin/bash
PATH=/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
HOME=/
# run-parts
01 * * * * root run-parts /etc/cron.hourly
02 4 * * * root run-parts /etc/cron.daily
22 4 * * 0 root run-parts /etc/cron.weekly
42 4 1 * * root run-parts /etc/cron.monthly
The entries in this file execute the scripts in the directories listed in Table 6-8 . Note that the sixth field is root , meaning that these scripts are executed with root permission.
The files in /etc/cron.d may also be executed by the anacron service during system startup ( anacron takes care of running jobs that were skipped because your computer was not running at the scheduled time). The files /var/spool/anacron/cron.daily , /var/spool/anacron/cron.monthly , and /var/spool/anacron/cron.weekly contain timestamps in the form YYYYMMDD recording when each level of task was last run.
The default /etc/anacrontab looks like this:
# /etc/anacrontab: configuration file for anacron
# See anacron(8) and anacrontab(5) for details.
SHELL=/bin/sh
PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin
MAILTO=root
1 65 cron.daily run-parts /etc/cron.daily
7 70 cron.weekly run-parts /etc/cron.weekly
30 75 cron.monthly run-parts /etc/cron.monthly
The three entries at the end of this file have four fields, specifying the minimum number of days that must have elapsed since a command was last run before it is run again, the number of minutes after anacron is started that the command should be executed, the anacron label (corresponding to the timestamp filename in /var/spool/anacron), and the command to be executed. If the specified number of days has elapsed (for example, if the weekly tasks have not been executed in more than a week), the anacron service starts the appropriate tasks after the specified delay (so, in this example, weekly tasks would be executed approximately 70 minutes after system boot).
In many parts of the world, daylight saving time, or summer time, shifts the local time by one hour through the spring and summer months. In most jurisdictions in North America, the local time jumps from 2:00 a.m. to 3:00 a.m. during the spring time change and from 3:00 a.m. to 2:00 a.m. during the autumn time change. The spring time change has been held on the first Sunday in April, but that will change (experimentally) to the second Sunday in March in 2007. The fall time change has been held on the last Sunday in October, which will change to the first Sunday in November in 2007. If the changes do not result in significant energy savings, governments may revert to the traditional dates.
This means that there is no 2:30 a.m. local time on the day of the spring time change, and that 1:30 a.m. local time happens twice on the day of the fall time change.
crond was written to take this issue into account. Jobs scheduled to run between 2:00 and 3:00 a.m. during the spring time change will execute as soon as the time change occurs, and jobs scheduled to run between 1:00 and 2:00 a.m. during the autumn time change will be executed only once.
The environment variable EDITOR can be used to specify a different editor, such as emacs , joe , or mcedit . You can set this variable temporarily by assigning a value on the same command line as the crontab command:
$ EDITOR=joe crontab -e
It may be useful to edit your ~/.bash_profile and add this line to permanently specify a different editor:
export EDITOR=mcedit
When executed without any arguments, the crontab command will read the crontab configuration from the standard input. You can use this feature to load the configuration from a file:
$ crontab < /tmp/newcrontab
To see the current crontab configuration, use the -l option:
$ crontab -l
# Backup ~chris/oreilly/ to bluesky:~chris/backup/ as a tar archive
30 0,12 * * * /usr/local/bin/bluesky-backup-oreilly
# Update the local rawhide repository
0 5 * * * /usr/local/bin/rawhide-rsync
Putting these features together, you can create a simple script to edit a crontab configuration:
#!/bin/bash
# addtmpclean :: add a crontab entry to clean ~/tmp daily
(crontab -l ; echo "30 4 * * * rm -f ~/tmp/*")|crontab
The manpages for cron , crontab(1), crontab(5), anacron, and anacrontab