Tutorial 1) Part 5a. Text Files
To record input and output, Unix relies on a platform-independent file format that can be shared with users of different systems across the globe. The underlying standard, adopted in 1968, is the American Standard Code for Information Interchange (ASCII); we will simply call files encoded this way text files. Unlike word processing or PowerPoint files, which require the proper version of proprietary software to interpret their binary encoding, text files contain alphanumeric data and punctuation characters but no formatting information (margins, fonts, custom views, etc.). Because ASCII is the simplest format for textual data, text files are easy to share and fast to process.
The ASCII character set covers what you find on a typical keyboard: upper- and lower-case letters (A-Z, a-z), digits 0-9, and punctuation marks, plus a set of non-printing control characters such as tab and newline. In all, ASCII defines 128 characters, which fit in 7 bits (2^7 = 128). In a text file each character is stored as 1 byte of data, where 1 byte = 8 binary bits. Contemporary digital computers (as opposed to mechanical, analog or quantum computers) encode data in binary digits (bits): each bit takes one of 2 values, denoted 0 or 1. Binary is the simplest way to encode information and perform arithmetic, though it takes more digits to represent a number in base-2 than in base-10 (decimal). One byte is a string of eight 0s and 1s, so 1 byte can take on 2^8 = 256 possible values; the 128 values beyond ASCII are used by extended character sets. When you open a text file in Unix, the terminal decodes the binary code into a human-readable format: plain text.
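The byte arithmetic above can be checked from the command line. The following is a minimal sketch, assuming a POSIX-style shell such as sh or bash (the arithmetic syntax differs in csh-family shells) and the standard od utility; the variable names are only illustrative.

```shell
# Send the letter "A" to od and print its stored value as an unsigned decimal.
# -An suppresses the offset column; -tu1 prints one byte at a time.
code=$(printf 'A' | od -An -tu1 | tr -d '[:space:]')
echo "A is stored as byte value $code"

# Eight binary digits give 2^8 distinct byte values.
byte_values=$((1 << 8))
echo "one byte can take $byte_values values"
```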
You'll note that many of the ASCII characters are rarely used, so a text file is not a compact way to store data. Programs such as zip and gzip compress repetitive textual data by finding where sequences of characters are repeated, analogous to compressing a bitmap image. A zipped file must be unzipped before the data can be viewed.
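To see how well repetitive text compresses, here is a small sketch. It assumes a POSIX-style shell with gzip and awk installed, and it works in a throwaway directory; the file name rep.txt is hypothetical.

```shell
tmp=$(mktemp -d)                       # scratch directory, safe to delete afterwards

# Write 1000 identical lines (10 letters plus a newline each = 11000 bytes).
awk 'BEGIN { for (i = 0; i < 1000; i++) print "AAAAAAAAAA" }' > "$tmp/rep.txt"
orig_size=$(wc -c < "$tmp/rep.txt" | tr -d ' ')

# -c writes the compressed data to standard output and leaves rep.txt in place.
gzip -c "$tmp/rep.txt" > "$tmp/rep.txt.gz"
gz_size=$(wc -c < "$tmp/rep.txt.gz" | tr -d ' ')
echo "original: $orig_size bytes, gzipped: $gz_size bytes"

# The compressed file is not readable text; decompress it to view the data.
first_line=$(gzip -dc "$tmp/rep.txt.gz" | sed -n '1p')
echo "$first_line"
```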
Part 5b. Creating and Managing Files
Let's create a text file consisting of one line of text.
cd ~/day1
For the following lengthy command, select the entire line by clicking 3 times over it, copy the highlighted line by hitting ^c or Command-C (Mac), click on your terminal window, and then paste the line using shift-insert (Cygwin) or Command-V:
echo 1234567890abcdefghijklmnopqrstuvwxyz > alpha.dat
Note how the paste even enters the command for you. This is because the end of a text line contains a hidden character called the newline character, denoted by the escape sequence \n. Files from DOS or Windows end each line with a carriage return followed by a newline (\r\n); if you transfer such a file from an old PC to a Mac or Unix system, the extra carriage returns may show up as ^M.
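The ^M that some programs display is the carriage-return character (\r). One possible way to strip it is with the standard tr utility, sketched below for a POSIX-style shell; the file names dos.txt and unix.txt are made up for the illustration.

```shell
tmp=$(mktemp -d)

# Build a small file with PC-style line endings (\r\n) to experiment on.
printf 'line one\r\nline two\r\n' > "$tmp/dos.txt"

# tr -d deletes every carriage return, leaving plain Unix newlines.
tr -d '\r' < "$tmp/dos.txt" > "$tmp/unix.txt"

dos_size=$(wc -c < "$tmp/dos.txt" | tr -d ' ')
unix_size=$(wc -c < "$tmp/unix.txt" | tr -d ' ')
echo "with carriage returns: $dos_size bytes, without: $unix_size bytes"
```

The cleaned file is two bytes shorter, one for each line.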
Verify that the file contains text:
cat alpha.dat
ls -l
-rw-r--r-- 1 rhills 43873 37 Jan 9 13:53 alpha.dat
Field 5 tells us the file contains 37 bytes of data: the 36 ASCII characters we typed plus the hidden newline.
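The 37-byte count can be reproduced without creating a file by piping the text into wc -c, which counts bytes. This is a sketch for a POSIX-style shell; printf '%s' is used as a way to emit the same text without the trailing newline that echo adds.

```shell
text=1234567890abcdefghijklmnopqrstuvwxyz

# echo appends a newline, so the count includes one extra byte.
with_newline=$(echo "$text" | wc -c | tr -d ' ')

# printf '%s' writes only the characters themselves.
without_newline=$(printf '%s' "$text" | wc -c | tr -d ' ')

echo "with newline: $with_newline, without: $without_newline"
```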
Characters 2-10 of the ls -l output are the file access permissions. Permissions tell us who has access to the file on your computer, which may be shared with other users. r means the file is readable (view access), w means the file is writeable (edit access), and x means the file is executable: a shell script or binary program that can be run. On a shared computer there are 3 permission levels: the file's owner, the group associated with the file (rarely used), and every other user on the computer (other). -rwxr-xr-x is a common file permission on servers; it means that the file owner can read, write and execute (rwx) while group members and other users can only read or execute (r-x). If you are hosting a public webpage, all users need to be able to search inside each directory and read the files in it. Directories are marked by a d in the first position, and have an x at every 3rd position if they are searchable at all three levels.
r = read, w = write, x = execute
-     rwx    rw-    r--     1 newuser staff
type  owner  group  others
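The character positions described above can be pulled apart with the cut utility. The sketch below assumes a POSIX-style shell and uses a hard-coded permission string rather than real ls output; the variable names are illustrative.

```shell
perms='-rwxr-xr-x'

# cut -c selects character positions: 1 is the type, then three triplets.
type_char=$(printf '%s\n' "$perms" | cut -c1)
owner=$(printf '%s\n' "$perms" | cut -c2-4)
group=$(printf '%s\n' "$perms" | cut -c5-7)
other=$(printf '%s\n' "$perms" | cut -c8-10)

echo "type=$type_char owner=$owner group=$group other=$other"
```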
ls -l .. list contents of parent directory (..)
drwxr-xr-x 1 rhills admin 102 Jan 9 13:53 day1
ls -lFG .. flags can help folders stand out from regular files
drwxr-xr-x 3 rhills 43873 102 Jan 9 13:53 day1/
-rwxr-xr-x 3 rhills 43873 1137 Jan 9 13:53 some.txt
Access permissions are why you have to be logged in as Administrator in order to change operating system files:
drwx------ 2 richard staff 2048 Jan 2 1997 private
drwxrwx--- 2 root admin 2048 Jan 2 1997 admin
-rw-rw---- 2 root admin 12040 Aug 20 1996 admin/userinfo
drwxr-xr-x 3 richard staff 2048 May 13 09:27 public
For protected health information (PHI: anything related to patient care, billing, or personally identifiable information), HIPAA regulations require that reasonable safeguards be in place to protect electronic data. On a shared Unix system you can restrict who may read a file using the change file modes command (chmod):
echo Name address birthdate SSN > phi.dat
ls -l
-rw-r--r-- 1 rhills 43873 27 Jan 9 14:29 phi.dat
chmod go-r phi.dat read access (r) is removed (-) from group (g) and other (o) users
ls -l
-rw------- 1 rhills 43873 27 Jan 9 14:29 phi.dat
You can even write protect this file from yourself:
chmod u-w phi.dat write access (w) is removed (-) from the user (u), i.e. the file's owner
ls -l
-r-------- 1 rhills 43873 27 Jan 9 14:29 phi.dat
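The same sequence, including how to give yourself write access back with u+w, can be sketched as follows. This assumes a POSIX-style shell and works on a scratch copy; the helper function mode is hypothetical and simply returns the first 10 characters of the ls -l line.

```shell
tmp=$(mktemp -d)
f="$tmp/phi.dat"
echo "Name address birthdate SSN" > "$f"

# Start from -rw-r--r-- regardless of the account's default settings.
chmod u=rw,go=r "$f"

# Hypothetical helper: print only the type and permission characters.
mode() { ls -l "$1" | cut -c1-10; }

start=$(mode "$f")
chmod go-r "$f"; private=$(mode "$f")    # group and others lose read access
chmod u-w "$f";  locked=$(mode "$f")     # owner loses write access
chmod u+w "$f";  restored=$(mode "$f")   # owner gets write access back

echo "$start -> $private -> $locked -> $restored"
```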
Part 6a. Escape Sequences: Spaces and special characters
Commands in Unix shells and programming languages are case-sensitive. One exception is web domain names: capitalization makes no difference in the domain part of an http URL or an email address, although the rest of a URL may be case-sensitive. Some operating systems let you have two files in a directory whose names differ only in capitalization, but this is not recommended.
Unix will also recognize filenames that contain spaces, but these are cumbersome to handle on the command line because the shell treats a space as the separator between command arguments. It is therefore recommended to use an underscore (_) in a filename rather than a space.
echo see > m e
l
-rw-r--r-- 1 rhills 6 Jan 9 14:44 m
cat m Why didn't the echo command create a file named ' m e ' ?
There are times in programming when you need to print a special character ($, space, etc.) literally, even though it normally has special meaning to the shell interpreter. The backslash (\) operator (below the delete key) escapes the next character so that the shell does not interpret it.
echo $novar
novar: Undefined variable.
echo \$novar
echo spot > z\ \ z each space character requires a backslash
ls -lrt
-rw-r--r-- 1 rhills 5 Jan 9 14:48 z z
Single quotes are easier to type and accomplish the same thing:
echo run > 'y y' (y-space-space-y)
l -rt
echo 3;4
echo '3;4' Inside single quotes every character is literal, so no backslash-escaping is needed.
echo " shell PID: $$" Double quotes protect spaces and most symbols, but variables such as $$ are still expanded.
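The three quoting styles can be compared side by side. The sketch below is written for a POSIX-style shell such as sh or bash (variable assignment is different in csh-family shells), and the variable name is hypothetical.

```shell
name=world                    # hypothetical variable for the demonstration

single='hello $name'          # single quotes: nothing inside is interpreted
double="hello $name"          # double quotes: spaces are kept, $name is expanded
escaped=hello\ \$name         # backslashes: each special character escaped by hand

echo "$single"
echo "$double"
echo "$escaped"
```

Single quotes and backslashes both keep the dollar sign literal, while double quotes let the shell substitute the variable's value.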
Since using a backslash is cumbersome, see if your shell will "autocomplete" filenames for you... type cat z and hit the tab key--you should see the command line change to cat z\ \ z. Hit the enter key and it will print the file. If you type cat j and hit the tab key twice, you should see a list since you have more than one file beginning with the letter j. Supply more characters: ob5.i and hit tab, it should complete the command. Autocomplete is a nice feature that ensures you are supplying the right filename. It also expands directory names when searching folders.
QUESTION 7: What does the shell do when you supply a space or other punctuation mark in a Unix file name without the backslash?
Part 6b. Move and Copy Files
Organizing files in Unix involves the move (mv) and copy (cp) commands. Each command expects two space-separated arguments: the source file and the name (or path if using a different directory) of the destination. Move and copy work with either relative or absolute pathnames.
cd ~/day1
echo start > source
mv source dest moving a file or directory is the equivalent of dragging it to a new location (cut-and-paste)
l
mv alpha.dat .. source (alpha.dat) and destination (..) are relative paths
ls -rt .. two dots represents the parent directory
mv ~/alpha.dat ~/day1/abc.dat absolute paths; file is also renamed
l
cp abc.dat abc.copy copying a file duplicates it (copies have a later timestamp, while moved files do not)
l
An important point in Unix is that move and copy will overwrite any existing files if you choose a destination that already exists:
echo whoopsy > daisy
mv daisy abc.copy
cat abc.copy
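One way to guard against an accidental overwrite is to test whether the destination exists before moving. This is a minimal sketch for a POSIX-style shell, run in a scratch directory with made-up file contents.

```shell
tmp=$(mktemp -d)
echo original > "$tmp/abc.copy"
echo whoopsy > "$tmp/daisy"

# [ -e FILE ] is true when FILE already exists.
if [ -e "$tmp/abc.copy" ]; then
    echo "abc.copy already exists; leaving both files alone"
else
    mv "$tmp/daisy" "$tmp/abc.copy"
fi

kept=$(cat "$tmp/abc.copy")
echo "$kept"
```

When typing commands by hand, the -i flag (mv -i or cp -i) has a similar effect: the command asks for confirmation before replacing an existing file.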
Both move and copy accept multiple source files when the last argument is an existing directory; every source is then placed inside that directory:
mkdir work
mv a* work the asterisk wildcard moves all files beginning with "a" into the work folder
ls work
The copy command requires a flag to operate on a directory:
cp -r work work.backup the recursive flag (-r) is a common command option: it goes inside all subdirectories
cp -a work work.archive the archive flag (-a) is best for backing up a directory: it copies recursively and preserves the attributes of each file, including the original timestamps and permissions.
l
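The timestamp difference between the two flags can be checked with a backdated file. The sketch below assumes a POSIX-style shell and a cp that supports -a (as GNU and macOS versions do); the file names old.txt and ref are hypothetical.

```shell
tmp=$(mktemp -d)
mkdir "$tmp/work"
echo data > "$tmp/work/old.txt"

# Backdate the file to 1 January 2000 (touch -t takes CCYYMMDDhhmm).
touch -t 200001010000 "$tmp/work/old.txt"

cp -r "$tmp/work" "$tmp/work.backup"     # copies get the current time
cp -a "$tmp/work" "$tmp/work.archive"    # copies keep the original time

# Reference file stamped 2001: any copy newer than it was re-stamped by cp.
touch -t 200101010000 "$tmp/ref"
restamped=$(find "$tmp/work.backup" "$tmp/work.archive" -type f -newer "$tmp/ref")
echo "$restamped"
```

Only the file under work.backup is listed, because the -a copy still carries the year-2000 timestamp.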
Unix has separate commands to delete files and directories:
touch garbage the touch command creates an empty file
ls -rt
rm garbage
l
mkdir trash
rm trash the remove file command will not work on directories
rm: trash: is a directory
ls -rt
rmdir trash the remove directory command will only work if the directory is empty
l
What if we had a bunch of files?
echo aa > 1aa
echo bb > 1b
echo cc > 1c
echo dd > 1d
We can reference some or all of them at once using wildcards such as *, ?, or []:
cat 1* prints contents of files beginning with '1'
cat 1? prints contents of files named "1?" where ? is any single character
cat 1[bc] only returns matches for one of the characters contained in brackets
Special note: DO NOT type rm *. This would erase all files in your directory!
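A cautious habit is to preview what a wildcard matches with echo before handing it to a destructive command. The sketch below assumes a POSIX-style shell and recreates the four files in a scratch directory; nothing is actually removed.

```shell
tmp=$(mktemp -d)
for f in 1aa 1b 1c 1d; do echo "$f" > "$tmp/$f"; done

# The shell expands each pattern into matching filenames before echo runs.
star=$(cd "$tmp" && echo 1*)         # every name beginning with 1
single=$(cd "$tmp" && echo 1?)       # 1 followed by exactly one character
bracket=$(cd "$tmp" && echo 1[bc])   # 1 followed by b or c

# Prefixing a command with echo prints it instead of running it.
would_remove=$(cd "$tmp" && echo rm 1*)

echo "$star"
echo "$single"
echo "$bracket"
echo "$would_remove"
```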
Part 6c. File Searching
The grep command (its name comes from "global regular expression print") is similar to the Find command in Word. A regular expression is a string of characters that may also contain special pattern characters such as period (.) and asterisk (*); note that these do not behave exactly like the shell's filename wildcards. If your search string needs a space, you must escape the space with a backslash or enclose the entire string in quotes ("). Single quotes (') can also be used; they differ from double quotes in that the shell leaves everything inside them untouched, which changes whether the shell or grep gets to interpret the special characters.
Create a file by highlighting the following 5 lines, copying and pasting them into your terminal, and hitting the return key.
echo Unix command summary > sum
echo "" >> sum
echo So many unix commands... >> sum
echo take practice. >> sum
cat sum
grep pattern files the grep command expects two arguments: a character string followed by the file to search
grep command sum
Unix command summary grep returns all lines in the file containing the string "command"
So many unix commands...
grep -i unix sum the -i flag ignores upper/lower case
Unix command summary
So many unix commands...
grep "y u" sum quotes are needed to include a space in the string
So many unix commands...
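A few further grep variations on the same file are sketched below, assuming a POSIX-style shell. The flags -c (count matching lines) and -n (show line numbers) are standard grep options not otherwise used in this tutorial; the file is rebuilt in a scratch directory.

```shell
tmp=$(mktemp -d)
{
    echo "Unix command summary"
    echo ""
    echo "So many unix commands..."
    echo "take practice."
} > "$tmp/sum"

exact=$(grep -c unix "$tmp/sum")        # case-sensitive count of matching lines
nocase=$(grep -ci unix "$tmp/sum")      # same count, ignoring upper/lower case
numbered=$(grep -n "y u" "$tmp/sum")    # matching line prefixed by its line number

# In a regular expression, * means "zero or more of the preceding character",
# so com*and matches "command" (the m is repeated twice).
star=$(grep -c 'com*and' "$tmp/sum")

echo "$exact $nocase $star"
echo "$numbered"
```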
If you need to search for a file with a given string in the filename, use the find command with the -name flag to search through an entire directory's contents:
find directory flag string
find ~ -name "*.??t" returns files in your home directory (and its subdirectories) ending with .out, .txt, .dat, etc.
Asterisk directs the interpreter to match a string of characters of any length, while each question mark (?) matches any single character.
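To try the pattern without searching your whole home directory, the following sketch runs find against a scratch directory of empty files. It assumes a POSIX-style shell, and the file names are invented for the test.

```shell
tmp=$(mktemp -d)
touch "$tmp/a.out" "$tmp/b.txt" "$tmp/c.dat" "$tmp/d.log" "$tmp/e.text"

# The quotes keep the shell from expanding the pattern before find sees it.
matches=$(find "$tmp" -name "*.??t" | sort)
count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')

echo "$matches"
echo "$count files matched"
```

d.log is skipped because it does not end in t, and e.text is skipped because its extension has four characters rather than three.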