Tutorial 1) Part 5a. Text Files

For recording input and output, Unix relies on a platform-independent file format that can be shared with users of different systems across the globe. The standard, adopted in 1968, is called the American Standard Code for Information Interchange (ASCII); we will simply call files encoded this way text files. Unlike word processing or PowerPoint files, which require the proper version of proprietary software to interpret their binary encoding, text files contain alphanumeric data and punctuation characters but no formatting information (margins, fonts, custom views, etc.). Because plain ASCII text is the simplest format for textual data, it is easy to share and fast to process.

The ASCII character set covers what you find on a typical keyboard: upper- and lower-case letters (A-Z, a-z), digits 0-9, and punctuation marks, plus non-printing control characters such as tab and newline. Standard ASCII defines 128 characters, which fit in 7 bits; each character is stored in 1 byte of data, where 1 byte = 8 binary bits. Contemporary digital computers (as opposed to mechanical, analog or quantum computers) encode data in binary digits (bits), and each bit takes one of 2 values, denoted 0 or 1. Binary is the simplest way to encode information and perform simple arithmetic, though it takes more digits to represent a number in base-2 than in base-10 (decimal). One byte is a string of eight 0s and 1s, so 1 byte can take on 2^8 = 256 possible values; the 128 values beyond standard ASCII are used by extended character sets. When you open a text file in Unix, the terminal quickly decodes the binary code into a human-readable format: plain text.
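As a quick check of the byte arithmetic above, the following sketch prints the numeric value stored for one character and the number of values a single byte can hold. It assumes a POSIX-style shell (sh or bash) and the standard od utility; the variable names are only illustrative.

```shell
# Show the decimal byte value that encodes the letter A (65 in ASCII).
# od -An omits the address column; -tu1 prints each byte as an unsigned decimal.
code=$(printf 'A' | od -An -tu1 | tr -d ' \n')
echo "A is stored as byte value $code"

# One byte is 8 bits, so it can hold 2^8 distinct values.
values=$((1 << 8))
echo "One byte can hold $values values"
```

Replacing A with another character shows its code in the same way.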

You'll note that many ASCII characters are rarely used, so plain text is not a highly compressed data format. Programs such as zip, gzip and similar archive tools compress repetitive textual data by finding where sequences of characters are repeated, analogous to compressing a bitmap image. A zipped file must be unzipped before the data can be viewed.
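To get a feel for how well repetitive text compresses, here is a minimal sketch using gzip. It assumes sh or bash with gzip and mktemp installed; the file name repeat.txt and the sample sentence are made up for the illustration.

```shell
# Work in a scratch directory so no existing files are touched.
cd "$(mktemp -d)" || exit 1

# Build a highly repetitive text file: the same 20-byte line 1000 times.
i=0
while [ "$i" -lt 1000 ]; do
    echo "the quick brown fox"
    i=$((i + 1))
done > repeat.txt

# Compress a copy with gzip; -c writes to standard output and keeps the original.
gzip -c repeat.txt > repeat.txt.gz

# Compare the two sizes in bytes.
orig_size=$(wc -c < repeat.txt | tr -d ' ')
gz_size=$(wc -c < repeat.txt.gz | tr -d ' ')
echo "original: $orig_size bytes, compressed: $gz_size bytes"

# The data must be decompressed (-d) before it can be read as text again.
gzip -dc repeat.txt.gz | head -n 1
```

The exact compressed size depends on the gzip version, but it should be far smaller than the original for input this repetitive.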

Part 5b. Creating and Managing Files

Let's create a text file consisting of one line of text.

cd ~/day1

For the following lengthy command, select the entire line by triple-clicking it, copy the highlighted line with ^c or Command-C (Mac), click on your terminal window, and then paste the line using shift-insert (Cygwin) or Command-V:

echo 1234567890abcdefghijklmnopqrstuvwxyz > alpha.dat

Note how the terminal even runs the command for you. This is because the end of a text line contains a hidden character called the newline character, denoted by the escape sequence \n, and pasting it has the same effect as hitting the return key. If you transfer a text file from a Windows PC to a Unix or Mac system, the extra carriage return character that Windows places before each newline may show up as ^M.
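The ^M symbol is how many Unix tools display a carriage return character (\r). As a sketch of one common way to remove it, the lines below simulate a Windows-style line with printf and strip the carriage return with tr. This assumes sh or bash; the sample text abc is arbitrary.

```shell
# echo appends a newline; od -c makes the hidden character visible as \n.
echo abc | od -c

# A Windows-style line ends in carriage return + newline (\r\n): 5 bytes for "abc".
dos_bytes=$(printf 'abc\r\n' | wc -c | tr -d ' ')

# tr -d '\r' deletes every carriage return, leaving a Unix-style line: 4 bytes.
unix_bytes=$(printf 'abc\r\n' | tr -d '\r' | wc -c | tr -d ' ')

echo "with carriage return: $dos_bytes bytes, after tr -d: $unix_bytes bytes"
```

To clean a whole file, the same filter could be applied as tr -d '\r' < input > output, where input and output are placeholder names.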

Verify that the file contains text:

cat alpha.dat

ls -l

-rw-r--r-- 1 rhills 43873 37 Jan 9 13:53 alpha.dat

Field 5 tells us the file contains 37 bytes of data: 36 visible ASCII characters plus the newline character at the end of the line.
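One way to confirm the size reported by ls -l is to count the bytes directly with wc -c. The sketch below recreates alpha.dat in a scratch directory (an assumption made so that it does not disturb ~/day1) and counts the bytes with and without the trailing newline.

```shell
cd "$(mktemp -d)" || exit 1
echo 1234567890abcdefghijklmnopqrstuvwxyz > alpha.dat

# wc -c counts every byte in the file, including the hidden newline.
total=$(wc -c < alpha.dat | tr -d ' ')

# Deleting the newline first leaves only the visible characters.
visible=$(tr -d '\n' < alpha.dat | wc -c | tr -d ' ')

echo "total bytes: $total, visible characters: $visible"
# prints: total bytes: 37, visible characters: 36
```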

Characters 2-10 of the ls -l output are the file access permissions. Permissions tell us who has access to a file on your computer, which may be shared with other users. r means the file is readable (view access), w means the file is writeable (edit access), and x means the file is executable, as for shell scripts or binary programs that can be run. On a shared computer, there are 3 permission levels: each file has an owner, the file is also assigned to a group of users (rarely used), and the last level refers to any other user on the computer (other). -rwxr-xr-x is a common file permission on servers: it means that the file owner can read, write and execute (rwx) while all other users can only read or execute (r-x). If you are hosting a public webpage, all users need to be able to search inside each directory and read the files in it. Directories are marked by a d in the first position, and have an x in every 3rd permission position if searchable.

r = read, w = write, x = execute

-     rwx    rw-    r--     1 newuser staff
type  owner  group  others
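To see these fields on a file of your own, the following sketch sets the common -rwxr-xr-x permission and prints only the first 10 characters of the listing. It assumes sh or bash; script.sh and pub are hypothetical names, and the -d flag (list the directory itself rather than its contents) is not otherwise used in this tutorial.

```shell
cd "$(mktemp -d)" || exit 1
touch script.sh

# Symbolic modes: u = owner (user), g = group, o = others.
chmod u=rwx,go=rx script.sh

# Keep the first 10 characters: the type flag plus three permission triplets.
perms=$(ls -l script.sh | cut -c1-10)
echo "$perms"      # prints: -rwxr-xr-x

# A directory with the same permissions shows a d in the first position.
mkdir pub
chmod u=rwx,go=rx pub
dperms=$(ls -ld pub | cut -c1-10)
echo "$dperms"     # prints: drwxr-xr-x
```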


ls -l .. list contents of parent directory (..)

drwxr-xr-x 1 rhills admin 102 Jan 9 13:53 day1

ls -lFG .. flags can help folders stand out from regular files

drwxr-xr-x 3 rhills 43873 102 Jan 9 13:53 day1/

-rwxr-xr-x 3 rhills 43873 1137 Jan 9 13:53 some.txt


Access permissions are why you have to be logged in as Administrator in order to change operating system files:

drwx------ 2 richard staff 2048 Jan 2 1997 private

drwxrwx--- 2 root admin 2048 Jan 2 1997 admin

-rw-rw---- 2 root admin 12040 Aug 20 1996 admin/userinfo

drwxr-xr-x 3 richard staff 2048 May 13 09:27 public


For protected health information (PHI can be anything related to patient care, billing, or personally identifiable information), HIPAA regulations require that reasonable safeguards be in place to protect electronic data. You can restrict who is able to read a file from your Unix terminal using the change file modes command (chmod):

echo Name address birthdate SSN > phi.dat

ls -l

-rw-r--r-- 1 rhills 43873 27 Jan 9 14:29 phi.dat

chmod go-r phi.dat read access (r) is removed (-) from group (g) and other (o) users

ls -l

-rw------- 1 rhills 43873 27 Jan 9 14:29 phi.dat

You can even write protect this file from yourself:

chmod u-w phi.dat write access (w) is removed (-) from the user (u), i.e. the file owner

ls -l

-r-------- 1 rhills 43873 27 Jan 9 14:29 phi.dat
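The same sequence can be replayed in a scratch directory to watch the permission string change at each step. This is a sketch assuming sh or bash; the first chmod only forces a known starting state, and the final chmod u+w (not shown in the steps above) is the standard way to give write access back to the owner.

```shell
cd "$(mktemp -d)" || exit 1
echo Name address birthdate SSN > phi.dat
chmod u=rw,go=r phi.dat                  # known starting state: -rw-r--r--

chmod go-r phi.dat                       # hide the file from group and others
step1=$(ls -l phi.dat | cut -c1-10)

chmod u-w phi.dat                        # write protect it from yourself
step2=$(ls -l phi.dat | cut -c1-10)

chmod u+w phi.dat                        # restore your own write access
step3=$(ls -l phi.dat | cut -c1-10)

echo "$step1 -> $step2 -> $step3"
# prints: -rw------- -> -r-------- -> -rw-------
```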


Part 6a. Escape Sequences: Spaces and special characters

Commands in Unix shells and programming languages are case-sensitive. One exception is web domain names (capitalization makes no difference in the domain part of a URL or email address). Some operating systems let you have two files with the same name in a directory if they differ in capitalization, but this is not recommended.

Unix will also recognize filenames that contain spaces, but these are cumbersome to handle on the command line because the shell treats a space as the separator between command arguments. It is therefore recommended to use an underscore (_) in a filename rather than a space.

echo see > m e

l

-rw-r--r-- 1 rhills 6 Jan 9 14:44 m

cat m Why didn't the echo command create a file named ' m e ' ?

There are times in programming when you will need to print a special character ($, space, etc.) literally, even though it normally has special meaning to the shell interpreter. The backslash (\) operator (the key below delete) escapes the next character so that it is not interpreted by the shell.


echo $novar

novar: Undefined variable.

echo \$novar

echo spot > z\ \ z each space character requires a backslash

ls -lrt

-rw-r--r-- 1 rhills 5 Jan 9 14:48 z  z

Single quotes are easier to type and accomplish the same thing:

echo run > 'y  y' (y-space-space-y)

l -rt

echo 3;4

echo '3;4' Backslash-escaping is not used with single quotes.

echo " shell PID: $$" Double quotes preserve spaces and most punctuation, but the shell still expands variables such as $$.
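The three quoting styles can be compared side by side. The sketch below assumes sh or bash (the error message shown earlier suggests the tutorial itself uses a csh-style shell, which may differ in details such as how variables are assigned); the variable pet is hypothetical.

```shell
cd "$(mktemp -d)" || exit 1

# Three ways to write the same filename containing two spaces.
echo spot > z\ \ z       # backslash before each space
echo spot > 'z  z'       # single quotes
echo spot > "z  z"       # double quotes

# All three commands wrote to one and the same file.
count=$(ls | wc -l | tr -d ' ')
echo "files created: $count"

# Single quotes keep $ and ; literal; double quotes still expand variables.
pet=dog
single=$(echo '$pet;4')
double=$(echo "$pet;4")
echo "$single"           # prints: $pet;4
echo "$double"           # prints: dog;4
```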

Since using a backslash is cumbersome, see if your shell will "autocomplete" filenames for you: type cat z and hit the tab key, and you should see the command line change to cat z\ \ z. Hit the enter key and it will print the file. If you type cat j and hit the tab key twice, you should see a list, since you have more than one file beginning with the letter j. Supply more characters (ob5.i) and hit tab again, and the shell should complete the filename. Autocomplete is a nice feature that ensures you are supplying the right filename. It also expands directory names when searching folders.

QUESTION 7: What does the shell do when you supply a space or other punctuation mark in a Unix file name without the backslash?

Part 6b. Move and Copy Files

Organizing files in Unix involves the move (mv) and copy (cp) commands. Each command expects two space-separated arguments: the source file and the name (or path if using a different directory) of the destination. Move and copy work with either relative or absolute pathnames.

cd ~/day1

echo start > source

mv source dest moving a file or directory is the equivalent of dragging it to a new location (cut-and-paste)

l

mv alpha.dat .. source (alpha.dat) and destination (..) are relative paths

ls -rt .. two dots represents the parent directory

mv ~/alpha.dat ~/day1/abc.dat absolute paths; file is also renamed

l

cp abc.dat abc.copy copying a file duplicates it (the copy gets a new timestamp, while a moved file keeps its original one)

l

An important point in Unix is that move and copy will overwrite any existing files if you choose a destination that already exists:

echo whoopsy > daisy

mv daisy abc.copy

cat abc.copy
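One way to guard against an accidental overwrite is to test whether the destination exists before moving. The following is only a sketch, assuming sh or bash and using the file names from the example above in a scratch directory; for interactive use, mv -i and cp -i ask for confirmation before overwriting.

```shell
cd "$(mktemp -d)" || exit 1
echo keep > abc.copy
echo whoopsy > daisy

# Only move daisy if nothing already exists at the destination.
if [ -e abc.copy ]; then
    echo "abc.copy already exists; not overwriting"
else
    mv daisy abc.copy
fi

# The original contents survive because the move was skipped.
kept=$(cat abc.copy)
echo "$kept"             # prints: keep
```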

Both move and copy accept multiple source files when the last argument is an existing directory; all of the sources are then placed inside that directory:

mkdir work

mv a* work the asterisk wildcard moves all files beginning with "a" into the work folder

ls work

The copy command requires a flag to operate on a directory:

cp -r work work.backup the recursive flag (-r) is a common command option: it goes inside all subdirectories

cp -a work work.archive the archive flag is best for backing up a directory: it copies recursively and preserves the attributes of each file, such as its original timestamp and permissions.

l
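After a recursive copy it can be reassuring to check that the backup matches the original. A minimal sketch, assuming sh or bash and the standard diff utility (not covered in this tutorial); the subdirectory and file names are invented for the example.

```shell
cd "$(mktemp -d)" || exit 1
mkdir -p work/sub
echo one > work/abc.dat
echo two > work/sub/deep.dat

# -r copies the directory together with everything inside it.
cp -r work work.backup

# diff -r compares two directory trees; it exits 0 when they match.
if diff -r work work.backup > /dev/null; then
    result="identical"
else
    result="different"
fi
echo "backup is $result"
```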

Unix has separate commands to delete files and directories:

touch garbage the touch command creates an empty file

ls -rt

rm garbage

l

mkdir trash

rm trash the remove file command will not work on directories

rm: trash: is a directory

ls -rt

rmdir trash the remove directory command will only work if the directory is empty

l
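The "empty directory" rule for rmdir can be seen in a short sketch. It assumes sh or bash and runs in a scratch directory; the error message from the first rmdir is discarded so that only the outcome is reported.

```shell
cd "$(mktemp -d)" || exit 1
mkdir trash
touch trash/garbage

# rmdir refuses to remove a directory that still contains a file.
if rmdir trash 2>/dev/null; then
    first_try="removed"
else
    first_try="refused"
fi
echo "first attempt: $first_try"

# Empty the directory first, then rmdir succeeds.
rm trash/garbage
rmdir trash && echo "trash removed"
```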

What if we had a bunch of files?

echo aa > 1aa

echo bb > 1b

echo cc > 1c

echo dd > 1d

We can reference some or all of them at once using wildcards such as *, ?, or []:

cat 1* prints contents of files beginning with '1'

cat 1? prints contents of files named "1?" where ? is any single character

cat 1[bc] only returns matches for one of the characters contained in brackets

Special note: DO NOT type rm *. This would erase all files in your directory!
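A cautious habit, sketched below, is to preview what a wildcard matches with echo before passing the same pattern to a command such as rm. The shell expands the pattern before the command runs, so echo prints exactly the names the other command would receive. This assumes sh or bash and recreates the four files in a scratch directory.

```shell
cd "$(mktemp -d)" || exit 1
echo aa > 1aa
echo bb > 1b
echo cc > 1c
echo dd > 1d

all=$(echo 1*)           # every name beginning with 1
one=$(echo 1?)           # 1 followed by exactly one character
picked=$(echo 1[bc])     # 1 followed by b or c

echo "1*    matches: $all"       # 1aa 1b 1c 1d
echo "1?    matches: $one"       # 1b 1c 1d
echo "1[bc] matches: $picked"    # 1b 1c
```

Note that 1aa is matched by 1* but not by 1?, because ? stands for exactly one character.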

Part 6c. File Searching

The grep command (short for "global regular expression print") is similar to the Find command in Word. A regular expression is a string of characters that may also contain special pattern characters such as the asterisk (*). If your search string contains a space, you must escape the space with a backslash or enclose the entire string in double quotes ("). Single quotes (') can also be used in some cases, but they have a subtle effect on the order in which multiple programs interpret the special characters.


Create a file by highlighting the following 5 lines, copying and pasting them into your terminal, and then hitting the return key.

echo Unix command summary > sum

echo "" >> sum

echo So many unix commands... >> sum

echo take practice. >> sum

cat sum


grep pattern files the grep command expects two arguments: a character string followed by the file to search

grep command sum

Unix command summary grep returns all lines in the file containing the string "command"

So many unix commands...

grep -i unix sum the -i flag ignores upper/lower case

Unix command summary

So many unix commands...

grep "y u" sum quotes are needed to include a space in the string

So many unix commands...
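Two further grep flags are often handy alongside -i: -c prints only the number of matching lines, and -n prefixes each match with its line number. The sketch below rebuilds the sum file in a scratch directory and assumes sh or bash; both flags are standard, though they are not used elsewhere in this tutorial.

```shell
cd "$(mktemp -d)" || exit 1
echo Unix command summary > sum
echo "" >> sum
echo So many unix commands... >> sum
echo take practice. >> sum

# -c counts matching lines instead of printing them.
exact=$(grep -c unix sum)        # case-sensitive: only line 3 matches
anycase=$(grep -ic unix sum)     # ignoring case: lines 1 and 3 match
echo "unix: $exact line, any case: $anycase lines"

# -n shows where in the file each match was found.
located=$(grep -n "y u" sum)
echo "$located"                  # prints: 3:So many unix commands...
```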

If you need to search for a file with a given string in the filename, use the find command with the -name flag to search through an entire directory's contents:

find directory flag string

find ~ -name "*.??t" returns files under your home directory ending with .out, .txt, .dat, etc.

Asterisk directs the interpreter to match a string of characters of any length, while each question mark (?) matches any single character.
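To try the pattern without searching your whole home directory, the sketch below builds a small scratch tree and searches it instead. It assumes sh or bash; all file names are invented, and the quotes around the pattern keep the shell from expanding it before find sees it.

```shell
cd "$(mktemp -d)" || exit 1
mkdir sub
touch a.out b.txt sub/c.dat d.log e.text

# Match names ending in a dot, any two characters, then t.
matches=$(find . -name "*.??t" | sort)
echo "$matches"

# d.log does not end in t, and e.text has three characters before the final t.
count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
echo "matched $count files"
```

Because find descends into subdirectories, sub/c.dat is reported along with the two matches in the top-level directory.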

Proceed to: Part 7