compsci2
Fri 03/30/07 cont.
if the key is maxint, then you're done.
if we have a stable sort for transactions, i can do all of them at the cost of 1.
this is pretty important -- businesses out there use these.
-------
Wed 04/04/07
sequential files -- you either read the next record or you quit
but before we leave sequential -
buffering and blocking
how do you change these?
if it's small enough, you can read the whole thing, change it, and write it out
but if your file doesn't fit in memory,
old master -> update -> new master
that works as long as you have these transactions -- sort them -- and then run them through.
generally, the masterfile is huge, the number of transactions is comparatively small.
balanced line -- we can't get rid of all the intermediate steps -- only one pass through the file instead of one pass for each change.
case
old master
10
12
20
new master
transaction
fixed length records
fixed key position
so we can use system sort
change to record 12 and then we'll make another change
merge sort has two cleanup loops
current buffer
if we still have the record when we're done with the transactions, you write it.
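something like the following, as a sketch -- the Record/Trans layouts and file names are made up, and a real balanced line keeps the current record buffered until its key changes; this just shows the merge skeleton.

#include <fstream>
#include <climits>
#include <cstring>
using namespace std;

struct Record { int key; char data[60]; };
struct Trans  { int key; char op; char data[60]; };   // op: 'A'dd, 'C'hange, 'D'elete

bool readRec(ifstream& f, Record& r) { return bool(f.read((char*)&r, sizeof r)); }
bool readTr (ifstream& f, Trans&  t) { return bool(f.read((char*)&t, sizeof t)); }

int main() {
    ifstream oldM("old.master", ios::binary), tr("trans", ios::binary);
    ofstream newM("new.master", ios::binary);
    Record m; Trans t;
    bool haveM = readRec(oldM, m), haveT = readTr(tr, t);
    while (haveM || haveT) {
        int mKey = haveM ? m.key : INT_MAX;   // maxint sentinel: that side is done
        int tKey = haveT ? t.key : INT_MAX;
        if (mKey < tKey) {                    // no transaction for this record
            newM.write((char*)&m, sizeof m);
            haveM = readRec(oldM, m);
        } else if (tKey < mKey) {             // transaction with no matching record
            if (t.op == 'A') {                // an add: write the brand-new record
                Record r{t.key};
                memcpy(r.data, t.data, sizeof r.data);
                newM.write((char*)&r, sizeof r);
            }
            haveT = readTr(tr, t);
        } else {                              // keys match
            if (t.op == 'C') memcpy(m.data, t.data, sizeof m.data);
            if (t.op == 'D') haveM = readRec(oldM, m);   // drop it: never written
            haveT = readTr(tr, t);
        }
    }
}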
more modern process chart
an interactive program makes a transaction that gets written on the log file -- you keep appending to the log file so that it looks like you're updating the file ... but you're really just queuing it. then the data in there is in chronological order; we do a stable sort of the transaction list and then we can run update.
a new type of file --
name addr id
0 smith
1 jones
2 brown
3 wong
4
5
6
7
8
never have to worry about sorting the old master.
ok back to our new file up there.
key = id
you can read and write any record directly with direct files, but you have to know which record you want -- relative record number.
fstream df;
df.open("Data", ios::in | ios::out | ios::binary);
data.name = ?
data.addr = ?
data.id = ?
df.write((char *) &data, sizeof(data));
so we build it just like a sequential file.
df.seekg(rec * sizeof(data));
df.read((char *) &data, sizeof(data));
cout << data.name;
the g stands for get -- means that you're reading it.
in our current version, you can seek to a record and then read the next one after it, etc -- but don't count on that.
taking the relative record number times the size of the data
once the file's out there, you can also rewrite data.
so if this file is out there and i want to rewrite the record, i can.
df.seekp(rec * sizeof(data));
df.write((char *) &data, sizeof(data));
p, for "put", implies you're going to write.
the issues --
illegal to do this if rec is less than zero or greater than the number of records.
in our example, rec has to be 0 to 9 for a write.
we can seekp to 1 past the end of the file.
reading backwards
for(rec = 9; rec >= 0; rec--)
read the rec.
we can go any way we want.
so now instead of 2n i/o's to change my salary, it takes 2 i/o's.
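putting the pieces together -- a runnable sketch, with the Rec layout and the file name "Data" made up for the example:

#include <fstream>
#include <iostream>
using namespace std;

struct Rec { char name[20]; char addr[30]; int id; };

int main() {
    fstream df("Data", ios::in | ios::out | ios::binary | ios::trunc);
    Rec data{};
    for (int rec = 0; rec < 10; rec++) {      // build it just like a sequential file
        data.id = rec;
        df.write((char*)&data, sizeof(data));
    }
    for (int rec = 9; rec >= 0; rec--) {      // read backwards: legal, just slow
        df.seekg(rec * sizeof(data));         // g = "get" (read position)
        df.read((char*)&data, sizeof(data));
        cout << data.id << '\n';
    }
    data.id = 99;                             // rewrite record 5 in place
    df.seekp(5 * sizeof(data));               // p = "put" (write position)
    df.write((char*)&data, sizeof(data));
}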
sequential: batch :: direct: index.
'cuz we have no clue where the record is otherwise.
index can be binary search tree or whatever.
we keep the index in memory and search it to find the record.
we essentially have a pointer to the record in our index file.
so why not always use a direct file?
say we have records of size 219
we end up reading the entire sector and the next one (because of two buffers)
if you want to read record 2, it has to rebuffer the whole thing .. lose the cpu, etc.
so sequential is logical i/o's
direct is physical i/o
but the big deal with direct is that it's not a function of n.
we'll be direct for the rest of the semester.
when you write a record, you take over the whole next sector unless the os is smart enough to put the next record there.
and then the next file written causes fragmentation when we want to write our next file.
direct file, no reason to sort unless there's an index.
sequential -- can't update if it's not sorted.
now, if you never update it, then don't sort it.
can't delete a record with direct -- i can mark it as deleted ..
i can put it on an avail list.
linked list in a file.
we solved a lot of these issues already .. we're just looking at it differently.
-------
Wed 04/11/07
rec size 2029
15,000 records
took 1800 time units (sequential)
read as a direct file
4000 time units
forward 400 time units.
so how come it's so much slower to read it backwards?
we pay for physical reads .. buffering is not on our side.
we could make this worse by reading in random order, where buffering is completely helpless.
you can read any record you want, but you're gonna pay for it.
why is writing slower? - it doesn't know how many bytes you want to write -- so you may run out of room in one place and go looking for more.
on an ibm system, it asks how big you want the file so that it can allocate a nice chunk for you.
what's in memory is "free" compared to how many i/o's you're going to do.
direct file
0
1
2
3
4
5
6
.
.
.
eof
df.seekg(rec * sizeof(data));
df.read((char*) &data, sizeof(data));
seekp -- to write it.
you can write any record you like.
seekg -- 0 to less than n
seekp -- from zero to n - you can go one over and it'll append to your file.
how do you know which record you want? index!
pos = FindKey(val)
if(pos == -1) not found
otherwise we know where it is
df.seekg(ptr[pos] * sizeof(data));
df.read((char*) &data, sizeof(data));
how come we need the index? - we're probably not lucky enough that we can actually go directly to the data file -- so we need some structure on the front end.
given a direct file, make changes to the data.
need to be able to add, delete, and make changes to it.
so everything we did in sequential land, we need to do in direct land.
search
lets do a change first.
transactions -- adds, changes, and deletes.
for a change
pos = FindKey(key);
if(pos == -1) "not found"
// now that i found it
df.seekg(ptr[pos] * sizeof(data)); // we have fixed records - otherwise, things get really complicated.
df.read((char*) &data, sizeof(data));
data.field = new stuff
df.seekp(ptr[pos] * sizeof(data));
df.write((char*) &data, sizeof(data));
if you were changing everything in the record, then you wouldn't need the read.
Add (key, bunch of vals)
---
// it might be you've got available space (from deletes)
addr = from AVAIL
if(addr == -1) then we set it to the end .. addr = max_record; (not necessarily n).
so i'm going to reuse a record number or set it to the end.
// set up data with all the new stuff, i.e., address, phone number, etc.
df.seekp(addr * sizeof(data));
df.write((char *) &data, sizeof(data));
// then we have one more record
n++;
AddKey(key, addr); // we need the key and the pointer to where the data is.
1 i/o per add.
now lets do a delete.
what we could do is just call delkey
delkey(key);
if it's not there to find, then it's not there.
but we should also add it to an avail list
so let's do this:
pos = FindKey(key);
if(pos == -1) "not found to delete"
add ptr[pos] to the AVAIL list
now let's call DelKey(key)
then n--; // one less record
so always first question -- direct or sequential.
how do we represent avail
linked list?
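pulling change/add/delete together as a sketch -- std::map stands in for the Pkey index (FindKey/AddKey/DelKey), and AVAIL is an in-memory vector here; the linked-list-in-the-file version of avail comes later.

#include <fstream>
#include <map>
#include <vector>
#include <cstring>
using namespace std;

struct Rec { int key; char stuff[60]; };

map<int,int> index_;     // key -> record number (stands in for the Pkey array)
vector<int> avail;       // AVAIL: record numbers freed by deletes
int maxRec = 0;          // next brand-new record number at the end of the file

void Change(fstream& df, int key, const char* stuff) {
    auto it = index_.find(key);
    if (it == index_.end()) return;              // "not found"
    Rec data;
    df.seekg(it->second * (long)sizeof(Rec));    // i/o #1: read the record
    df.read((char*)&data, sizeof(data));
    strncpy(data.stuff, stuff, sizeof(data.stuff) - 1);
    df.seekp(it->second * (long)sizeof(Rec));
    df.write((char*)&data, sizeof(data));        // i/o #2: rewrite in place
}

void Add(fstream& df, int key, const char* stuff) {
    int addr;
    if (!avail.empty()) { addr = avail.back(); avail.pop_back(); }  // reuse a hole
    else addr = maxRec++;                        // or append ("not necessarily n")
    Rec data{key};
    strncpy(data.stuff, stuff, sizeof(data.stuff) - 1);
    df.seekp(addr * (long)sizeof(Rec));
    df.write((char*)&data, sizeof(data));        // 1 i/o per add
    index_[key] = addr;                          // AddKey
}

void Delete(fstream& df, int key) {
    auto it = index_.find(key);
    if (it == index_.end()) return;              // "not found to delete"
    avail.push_back(it->second);                 // record stays on the file as junk
    index_.erase(it);                            // DelKey
}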
-------
Fri 04/13/07
direct files
------------
if you have a direct file, then you need an index
seekg read
seekp write
adds, changes, deletes.
can't delete a record, but we can put it on an avail list.
if you don't have any space avail, you can add to the bottom, but that causes file fragmentation
df.open("name", ios::out|ios::binary);
loop
df.write(,);
AddKey(
n++)
forever
df.close();
.. so if you have a bunch of adds to do, treat it like a sequeential file because then buffering is on your side.
Direct Files & linked lists.
top - [A |---> etc.
we'll use the array-based-ish representation
add a link field to the end of each record. a 1 stored in top would mean top points to record 1.
printlist(top)
while top ≠ -1
df.seekg(top * sizeof(data));
df.read((char*) &data, sizeof(data));
cout << data.key << data.stuff;
top = data.link; // follow the link
end
end
we had a lot of problems with linked lists in memory ..
Node * x = new Node;
if the heap is empty, then that's not going to work.
in our array representation, we could also run out of memory.
but with disk space, you have a low likelihood of running out -- virtually unlimited disk space.
AddToFront(top, data)
// get a free record on the file
addr = n; // assume we get the new record at the end of the file
data.link = top;
top = addr;
df.seekp(addr * sizeof(data));
df.write((char*) &data, sizeof(data));
"we need to know exactly how this is going to work"
add to the middle of the list.
we'll do an add after p. in memory it's:
data(addr) = stuff;
link(addr) = link(p);
link(p) = addr;
think through our normal algorithm and then convert it to the file:
df.seekg(p * sizeof(data1));
df.read((char*) &data1, sizeof(data1)); // need to read the p record .. put it in a different data variable -- don't clobber the new data
data.link = data1.link; // link(addr) = link(p)
data1.link = addr; // link(p) = addr
df.seekp(p * sizeof(data1));
df.write((char*) &data1, sizeof(data1));
df.seekp(addr * sizeof(data));
df.write((char*) &data, sizeof(data));
so three i/o's
write the normal alg and then worry about adapting it.
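the same add-after as runnable C++ -- three i/o's, as counted above; the Rec layout is an assumption.

#include <fstream>
using namespace std;

struct Rec { int key; char stuff[56]; int link; };  // link = relative record number, -1 = end

// insert 'data' (filled in, except link) after record number p;
// 'addr' is the free record where the new node will live (from AVAIL or end of file)
void AddAfter(fstream& df, int p, int addr, Rec data) {
    Rec pred;                                     // separate variable -- don't clobber 'data'
    df.seekg(p * (long)sizeof(Rec));
    df.read((char*)&pred, sizeof(pred));          // i/o 1: read the predecessor
    data.link = pred.link;                        // link(addr) = link(p)
    pred.link = addr;                             // link(p) = addr
    df.seekp(p * (long)sizeof(Rec));
    df.write((char*)&pred, sizeof(pred));         // i/o 2: rewrite the predecessor
    df.seekp(addr * (long)sizeof(Rec));
    df.write((char*)&data, sizeof(data));         // i/o 3: write the new node
}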
now what about a tree
you've got an extra pointer, yes .. but that's just 2-4 more bytes
now we have two pointers - now we can do trees.
take a look at left of root
so if you're on a file, read the root and then set p = data.left.
so all the structures we did before, you can do on files.
direct file.
and index
primary key index.
file is indexed on the key.
Adds - add wherever there's room and add it to the index.
change - 2 i/o's
delete - how do you delete a record?
delete - remove it from the index and put it on the available space list
Avail is a linked list.
avail points to 7.
delete p
link(p) = avail
avail = p
// need to write the link .. but we don't need to read -- the node is junk anyway
df.seekp(p * sizeof(data)); df.write((char*) &data, sizeof(data));
avail = p;
avail points to three in our example, and then that goes to 7 - and then we're done.
now we want to take from the list.
//get a free record
// what we used to do in memory:
// free = avail;
// avail = link(avail);
// as long as avail ≠ -1
if(AVAIL == -1) .. we're empty, add to the end of the file.
// else
free = AVAIL;
df.seekg(AVAIL * sizeof(data));
df.read((char*) &data, sizeof(data));
AVAIL = data.link;
so you may not need avail if you don't do much deleting.
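and here's avail threaded through the file itself, as a sketch -- Rec layout assumed, -1 for the empty list:

#include <fstream>
using namespace std;

struct Rec { int key; char stuff[56]; int link; };
int avail = -1;      // head of the free list
int maxRec = 0;      // next record number past the end of the file

// delete p: push record p onto the free list -- one write, no read needed,
// because the node's contents are junk anyway
void FreeRec(fstream& df, int p) {
    Rec data{};                        // only the link field matters now
    data.link = avail;                 // link(p) = avail
    df.seekp(p * (long)sizeof(Rec));
    df.write((char*)&data, sizeof(data));
    avail = p;                         // avail = p
}

// get a free record: pop from the free list, or extend the file
int GetRec(fstream& df) {
    if (avail == -1) return maxRec++;  // empty: add to the end of the file
    int free_ = avail;
    Rec data;
    df.seekg(avail * (long)sizeof(Rec));
    df.read((char*)&data, sizeof(data));
    avail = data.link;                 // avail = link(avail)
    return free_;
}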
all kinds of choices for setting up data files.
class Idx {
public:
int key;
int ptr;
};
Idx Pkey[9000];
save this index!!!
index.open("---", ios::out | ios::binary);
index.write((char*) Pkey, sizeof(Pkey));
index.close();
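reloading it at startup is symmetric -- a sketch, with "Pkey.idx" standing in for whatever file name you use:

#include <fstream>
using namespace std;

class Idx {
public:
    int key;
    int ptr;
};
Idx Pkey[9000];

// read the whole saved index back in one gulp
void LoadIndex() {
    fstream index("Pkey.idx", ios::in | ios::binary);
    index.read((char*) Pkey, sizeof(Pkey));
    index.close();
}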
we could steal the first index entry and put avail there
or we could put it in the first data record ..
but don't put it at the bottom -- it'll change.
if you need more room in a record, just get another record - and point to it.
-------
Mon 04/16/07
senior projects - 4/24 - presentations 1-9pm.
sequential -- sorted -- update merged. old/new master.
direct - no benefit to batching transactions
2 i/o's for a change, 1 or 2 for an add .. depends on avail
and possibly 1 for a delete -- depending on avail
Backup -
say we backup once a month .. do you write a backup program? -- in sequential land, you get a free backup.
so what if, on the 27th, things go wrong .. well, you can go back -- you get a free backup
what if something goes wrong -- we're running on master version 27. well, if something goes wrong, i can back up to yesterday's master, rerun my transactions, and i get my new file back.
but having all those complete files takes up a lot of disk space -- so you can store one backup and store all the transactions.
take your backup .. and stable sort all the transactions .. and you can get back to any date.
time stamp the transactions .. sort on the major key (primary key) and minor key (time stamp) .. and you don't have to worry about stable sort.
so save the beginning of the month,
log file of transactions, timestamp. rerun when needed.
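a sketch of that major/minor sort -- the Trans field names here are made up:

#include <algorithm>
#include <vector>
using namespace std;

struct Trans { int key; long timestamp; char op; };

// major key = primary key, minor key = timestamp, so an ordinary
// (unstable) sort still leaves each key's transactions in chronological order
void SortLog(vector<Trans>& log) {
    sort(log.begin(), log.end(), [](const Trans& a, const Trans& b) {
        if (a.key != b.key) return a.key < b.key;
        return a.timestamp < b.timestamp;
    });
}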
direct files ---
save the beginning of the month
keep another log file of transactions
sort it -- don't really need to .. it's fast anyway..
don't forget to save the index, too!
let's timestamp this log file, too - so we know how far to rerun the transactions -- tell me when .. i'll get you there.
so how do we save the master for a direct file - you could just copy it, but then you save all the deleted records. so what if we go through the index and put the records out in order of the index - then you could rebuild the index easily .. and the proportionality constant can take advantage of buffering and blocking 'cuz you're going in order - O(n).
so backup -- sequential its free
direct land -- at the beginning of the month we need to save it, but i don't need the index file -- we can save in index order.
bank - terrible idea to do sequential -- we need to know the new balance Now
payroll - sequential is beautiful because you hit every record anyway, and then you've got blocking and buffering on your side.
direct file - fixed length records, but you can link to another record and get another chunk.
you pay dearly for direct files, but you can get any record you want with 1 i/o.
indices
you can only put as many records in there as the max size of your index.
if it's a static index, i'm limited
can treat it as a sequential search, binary search, or hash.
BST (AVL)
sequential - slow
binary search -- gotta sort it, of course.
BST - avl .. good .. but use an array representation or it's really slow to read it in.
hashing. - can store either max .. or unlimited with chaining.
fastest -- address calculation -- i don't need an index!
but we're not always lucky enough to get our keys dense.
big yeah-but -- you have to build a blank file first, because you have to have a record 481 to get a record 482.
we want this fast performance out of hashing
what if we take hashing with chaining and put it in the file? then we don't need an index again.
we would have to stratify our data file into primary and overflow areas.
b-tree?!?!?
-------
Wed 04/18/07
direct file -ing
static index.
sequential, binary search -- that's a static index.
bst, avl -- static, because it would be inefficient to use dynamic memory -- we'd have to rebuild each time.
dynamic??
hashing? - well, that's also static, because you have the same argument as bst.
but we could hash and chain and then put the chains on the file - that's dynamic
addr calc
suppose 8192 is an id number
then we need to stick it at record 8192 .. so you have to preformat the file to get down there.
if(df.rdstate() != 0) "error found"
df.clear(); // clears the error
hitting the eof is an error condition - so you can't hit it and then expect to keep using the file in the same program without clearing it, because all subsequent i/o fails.
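a sketch of the check-and-clear, using only rdstate() and clear():

#include <fstream>
using namespace std;

void ReadAll(fstream& df) {
    char buf[64];
    while (df.read(buf, sizeof(buf))) { /* process buf */ }
    if (df.rdstate() != 0)    // "error found" -- here it's just eof
        df.clear();           // clear it, or every later i/o on df fails
    df.seekg(0);              // now the stream is usable again
}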
tree
m levels - 2^m nodes.
five-way tree ..
m levels, 5^m
how fast is this -- log base 5 of n.
256-way tree -- n = 256^m .. three levels already covers 256^3, about 16.7 million records.
handout.
block search - big deal
you can have a tremendously big file and find anything in this unordered file in two i/o's
you can read any one of these blocks
two-dimensional array.
key and a pointer field.
padding - so that we can add more records.
then add jane to the bottom of our data file.
read root
read level 2
read level 3
rewrite level 3
write to the direct file.
new rule: if you can do it in memory, it's free.
you can see why there's padding in there.
add joan.
- june gets moved down to make room .. cost -- 5 i/o's
add jenny
- doesn't fit -- gotta kick june out
and then we put june in the block below, but then we have to move kristen out of the level higher, because she's no longer the top of that list.
but once we update all this, we maintain our efficient method.
this is dynamic -- if the level fills up, go to another level.
padding
1) pad terminal level blocks
2) all keys are in the terminal level. - if i'm looking for dea, i have to read through to the terminal level, because all my keys are there.
we need padding because we don't want to have to change the parent .. and that can propagate up to the root
but the search efficiency is good - log base block size.
any time you change, you've gotta rewrite the whole block.
once you pay the price for updating this, its an extremely fast searching structure.
let's fix this overflowing-into-the-neighbors problem
class block {
string key[200];
int ptr[200];
};
once you get down to the bottom level, you treat the int as a pointer to the data on the file .. otherwise, you treat it as a pointer into the index file.
blocks ordered.
m-way tree
others call it a block search.
we'll look at an adaptation called a b-tree
a b-tree is one of these -- and it looks the same. the difference is -- when i get down to one of these bottom levels..
we have a full block and we want to add george to it
we're going to split the full block into two blocks -- so make another block -- every block is the same .. take half the values and leave them, and half go into our new block -- that keeps padding in there -- which is good..
the top two entries are george and gracie -- so those are going to stay
jenny and joan .. somebody's gotta point to the new block -- well, the parent does -- the original parent .. the value that goes in there is the former middle one -- janet
and the whole thing didn't even grow a level.
if the root is full, then the whole thing grows a level -- and it not only grows .. it shrinks.
we will look at the adjacent block -- and level it off through the parent -- and that's not a big deal.
friday --
how do you start this thing -- with a single block
keep adding until full -- and then split.
-------
Fri 04/20/07
claybrook tree
not every key is at the terminal level -- instead, if we find the key, it has a direct pointer to the data
in the block search, we had only one pointer field .. now we have a data file pointer and a block pointer
every block is at least half full except the root
also has an odd number of values -- because of how they split.
how this gets going
significantly different than the block search -- we're not going to overflow into our neighbor's area.
add 43
get a new block - 1/2 of the values are going to come down
we get a brand new root
middle value gets moved up
half stay ; half go in my new block
every block is at least half full
how fast
log base (block size/2) of n.
split - half go to the new block, half stay in the old, and the middle gets promoted to the parent.
blocks reside on an index file as direct records.
algorithms
Add
all adds are at the terminal level.
IF block size is ok, DONE.
how do we get back to the parent -- we could put a parent pointer in the empty first entry.
...
otherwise, split:
1. top 1/2 stay
2. bottom 1/2 go to a new block
3. middle one goes to the parent.
4. top entry of new block gets the blk pointer field of the one we moved up.
5. block ptr of the entry moved up gets the new block.
is this costly? -- well, not too bad -- if you don't split, you're fine
if you do, you build a new block and split ..
then you go up to the parent
but we're maintaining our log base blocksize/2
no other structure that we've looked at is dynamic like this.
but only one of these blocks needs to be in memory.
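a schematic sketch of the split, following the numbered rules above -- M, the block layout, and the helpers are all assumptions, and rule 4's block-pointer hand-off is left out to keep the terminal-level case short:

#include <fstream>
using namespace std;

const int M = 5;                        // entries per block (odd, per the notes)

struct Entry { int key; int dataPtr; int blkPtr; };
struct Block { int count; Entry e[M]; };

void ReadBlock(fstream& ix, int b, Block& blk) {
    ix.seekg(b * (long)sizeof(Block));
    ix.read((char*)&blk, sizeof(blk));
}
void WriteBlock(fstream& ix, int b, Block& blk) {
    ix.seekp(b * (long)sizeof(Block));
    ix.write((char*)&blk, sizeof(blk));
}
int blockCount = 1;                     // block 0 is the root; grows as we split
int NewBlock() { return blockCount++; }

// split the full block 'b'; returns the entry the caller inserts into the parent
Entry Split(fstream& ix, int b) {
    Block oldB, newB{};
    ReadBlock(ix, b, oldB);             // block is full: count == M
    int mid = M / 2;
    Entry up = oldB.e[mid];             // 3. the middle one goes to the parent
    int nb = NewBlock();
    for (int i = mid + 1; i < M; i++)   // 2. bottom half goes to the new block
        newB.e[i - (mid + 1)] = oldB.e[i];
    newB.count = mid;                   // M odd, so both halves end up size mid
    oldB.count = mid;                   // 1. top half stays where it is
    up.blkPtr = nb;                     // 5. promoted entry points at the new block
    WriteBlock(ix, b, oldB);
    WriteBlock(ix, nb, newB);
    return up;
}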
delete
what if we delete 33 -- the whole contents of the root here.
then we can start combining blocks again.
going to shrink if it can.
a terminal delete is a piece of cake
if i delete from one of these that's only half full -- violates the rule .. so we then have to go to the neighbor and level it off.
maintains as many blocks as you need.
hybrid b* tree and b+ tree.
delete's a little more complicated.
-------
Mon 04/23/07
to review --
every block in the b-tree is the same.
key, datapointer, block pointer
the data file pointers point directly to the data.
top block pointer -- everyone less than,
middle
blocks are always at least half full (except for the root) .. and it's totally dynamic.
if you find something in the b-tree .. calling b-tree search with the root.
BtreeSearch(top, value)
p = top;
while p ≠ NULL
read block p into index
pos = bsearch(index, value)
// it'll return either where it is or where it should be.
if index[pos].key == value, then return(pos)
p = index[pos].blk_ptr
end
return pos .. the pointer to where it should be.
....
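the same search as runnable C++ -- same assumed layout as the split sketch; a linear scan stands in for the binary search, and the "everyone less than" top block pointer is simplified away:

#include <fstream>
using namespace std;

const int M = 5;
struct Entry { int key; int dataPtr; int blkPtr; };   // blkPtr -1 = no child
struct Block { int count; Entry e[M]; };

// returns where value is, or the entry whose block pointer covers it
int BsearchBlock(const Block& blk, int value) {
    int pos = 0;
    for (int i = 1; i < blk.count; i++)
        if (blk.e[i].key <= value) pos = i;
    return pos;
}

// returns the data file pointer for value, or -1 if it's not in the tree
int BtreeSearch(fstream& ix, int top, int value) {
    Block blk;
    for (int p = top; p != -1; ) {
        ix.seekg(p * (long)sizeof(Block));
        ix.read((char*)&blk, sizeof(blk));       // 1 i/o per level
        int pos = BsearchBlock(blk, value);
        if (blk.e[pos].key == value)
            return blk.e[pos].dataPtr;           // found: direct pointer to the data
        p = blk.e[pos].blkPtr;                   // descend to the covering block
    }
    return -1;                                   // hit a null block pointer: not found
}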
take a look at a delete
Delete - terminal level
Search
If size of terminal block is OK, then we delete and we're done.
if size is "small" -- less than 1/2
- move value from adjacent block (thru parent) until both sizes are the same.
if i've got two adjacent blocks that are small ..
ie, a block less than half and a block that is 1/2 -- then we can combine the two.
what if I want to delete 288.
follow the block pointer field and take the top entry. .. and then delete 312.
lets delete 238
follow to terminal level .. 246.
grab 246 and then delete it from the end.
the intermediate stuff isn't changed.
Delete non-terminal
find the immediate successor -- follow the block pointer field all the way to the terminal level.
replace the entry at the non-terminal level with the terminal value. now delete the value on the terminal level.
we have three columns in our blocks, but on the terminal level, all those block pointers are null.
B+ tree - has only two columns and a bit - bit says what the pointer is pointing to -- data or a block.
advantage -- you can fit more blocks on disk -- can handle a bigger n with less space.
but now you have to go all the way to the terminal level .. and you're duplicating all the intermediate values at the terminal level .. every time i search, i go down to the terminal level.
B+ tree -- a b-tree with just two columns .. but it still does the split thing.
..
now, the b* tree??
why is it half full? - when we split, we get half full again.
if we want 2/3 full, how do we split when we're full?
if we have two full blocks, they can split into 3 blocks that are 2/3 full.
if we're 2/3 full, we've got fewer levels, so searching is faster.
B* tree -- 2/3 full.
adding will be a little more costly, but search is better.
what happens on a delete .. of a terminal value?
if you delete from a block and it's still 2/3 full, you're happy
otherwise, you level off with the neighbor.
we need three blocks -- 1 small, two at 2/3 -- so they can combine into two.
what about 7/8's full?
-------
Wed 04/25/07
Index
from lab -- we can put as many records as we want .. but we get longer chains.
can we get rid of the index?
what if we just go to what it hashes to -- that place in the file?
but we can't be writing our links just anywhere, or things will hash to it .. so make the top part the primary area .. and the lower part our overflow area.
and then you can write overflows anywhere you want once you're down in the overflow area.
we're going to add to the front of the list .. but use the primary record as a header node .. well, the top of the list.
this is called a hash file.
our other model was called a hash index, where we chain on the data file.
assuming the same size of hash table / primary area .. finding stuff is the same speed.
but there's more prep for the 2nd model, because we have to preformat the primary area.
not a dynamic index in our first example, but a dynamic structure -- you can hold as much as you want.
so 3 different dynamic representations of a file -- b-tree, hashing with chaining, and the hash file.
now with a hash file, if you don't have a lot of records, you have a lot of wasted space.
262144 bytes of memory available.
an int takes 4 bytes, so we have room for 65536 .. "64k"
so my index size is 65536
we have n records -- how fast is that?
n/index size - average number of i/os .. average chain length.
two-level b-tree
262144/12 - 3 columns of ints.
.. 21845
21844 entries we get -- because the first one isn't used.
and they each point to a block of 21844.
21844*21844 + 21844 ≈ 477 million - records for a two-level index.
2 i/o's for the indexes and 1 for the data file. 3.
that amount of data would take 7000 i/o's for our hash index.
but hashing is faster to add - you're adding to the front.
but b-tree is not too bad.
but if you're only half full in the b-tree, then i have 10922 * 10922 + 10922 .. assuming the root is half full
so then we only store about 119 million
so then, on average, 1800 i/o's for the hash table compared to 3 i/o's for the b-tree.
mac and windows use a b-tree.
best job -- software engineer -- they used a bunch of categories for best.
log based 21000
b-tree's important for really big files.
as soon as you say sequential, you're going to be dealing with proportional to n.
lets speed the hash index up
say we have an index of 1000 .. logically
segment that index file .. use a segment size of 100
so take the first 100 and put it in one segment.
then take the next 100 and put those in a segment. segment it.
start early; draw pictures.
all we want is one segment in memory .. but logically it's 1000 .. 10 blocks of 100 = 1000.
probe = key % 1000;
values from 0 - 999 .. i know which block it is, because the block is going to be probe/segment size.
just 1 more i/o (to get the right block in) .. but we've increased our logical index size!
and we can go down from 7000 to 2 i/o's.
we could do progressive overflow within the block -- now we don't have to follow the chains.
well, if that fills up, then put a little pointer field to a new block ..
friday - segmented hash file.
-------
Fri 04/27/07
hash table that goes to table size
probe = hash(key)
we're logically hashing to that, but physically we're hashing to blocks
.. we'll make them size 100
use exactly the same hash function - numbers from 0 to 999
and then we know which block it's in .. in our example, blocks of 100
block = probe/blocksize;
integer division
adding two more statements will allow us to segment
once we know the block, we go to the relative record number .. blk
now i reference index[pos] .. and we'll do linear probing on that.
we have collisions to worry about, but if a bunch of things hash to 210, they just fill up circularly in the block -- if we run out of block .. then we add a pointer to another block.
and if that one fills up, then we do the same thing.
linear probing inside the block.
Findkey(key)
probe = hash(key)
blk = probe / blksize;
pos = probe % blksize;
index.seekg(blk * sizeof(indexblock));
index.read((char*) &indexblock, sizeof(indexblock));
// linear probing
loop
for i = 1 to blksize
case: index[pos].key == 0: not found .. because we hit an empty space
case: index[pos].key == key: we found it
otherwise, pos = (pos + 1) % blksize;
end
if no overflow -- index[max].ptr == -1 -- then we didn't find it.
otherwise, blk = index[max].ptr
and read that block in
pos = 0
forever
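the same Findkey as a runnable sketch -- the sizes and the placement of the overflow pointer are assumptions:

#include <fstream>
using namespace std;

const int TBLSIZE = 1000;     // logical index size
const int BLKSIZE = 100;      // slots per segment (block)

struct Slot { int key; int ptr; };                    // key 0 = never used
struct IdxBlock { Slot s[BLKSIZE]; int overflow; };   // overflow block #, -1 = none

// returns the data file pointer for 'key', or -1 if not found
int FindKey(fstream& index, int key) {
    int probe = key % TBLSIZE;                // same hash as an unsegmented table
    int blk = probe / BLKSIZE;                // which segment (relative record #)
    int pos = probe % BLKSIZE;                // where in the segment
    IdxBlock b;
    for (;;) {
        index.seekg(blk * (long)sizeof(IdxBlock));
        index.read((char*)&b, sizeof(b));     // 1 i/o per block visited
        for (int i = 0; i < BLKSIZE; i++) {   // linear probing, circular in the block
            if (b.s[pos].key == 0)   return -1;           // empty slot: not found
            if (b.s[pos].key == key) return b.s[pos].ptr; // found it
            pos = (pos + 1) % BLKSIZE;
        }
        if (b.overflow == -1) return -1;      // block full, no overflow chain
        blk = b.overflow;                     // bring in the overflow block
        pos = 0;
    }
}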
deletekey
findkey()
if it finds it
mark it for reuse .. like -2
rewrite the block
AddKey()
addkey is really findkey .. and quit when you get a zero or a reuse marker
if the block is full, then bring in the next block and start looking to see if there's any room there .. and if there's no room, then we actually create a new overflow block and put our key at position zero.
link it up to the appropriate block.
how fast is this?
search -
best: 1 for index, 1 for file
worst: n/blksize
average: (n/tablesize)/2 .. if you find it early
speed it up -- make the table size bigger.
the worst case is worse than the b-tree.
now, the way we wrote this, we cannot shrink, but you can measure performance and, if it gets really bad, you can take the keys in overflow and re-add them.
you'd want to start with the overflow blocks at the end of the list first.
direct file index .. primary key index.
if we wanted to search on something else, make another index.
the new index is just another line in addkey
sequential file is ordered by primary key -- it has to be .. you're stuck
could build another index to look up home town ..
take the city index
chicago is in there only once
de pere
green bay
make the city point to the first person that lives in chicago, and then have that record point to the next person that lives in chicago.
city index .. just another index.
for non-unique keys we have a unique value index .. our city index has a pointer field to the first record with that value, and then the link field in the file goes to the next one.
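a sketch of the city index -- names made up, std::map standing in for the unique value table:

#include <fstream>
#include <iostream>
#include <map>
#include <string>
using namespace std;

struct Rec { char name[20]; char city[20]; int cityLink; };  // -1 ends the chain

map<string,int> cityIndex;   // city -> record number of the first resident

void PrintCity(fstream& df, const string& city) {
    auto it = cityIndex.find(city);
    int p = (it == cityIndex.end()) ? -1 : it->second;
    Rec data;
    while (p != -1) {
        df.seekg(p * (long)sizeof(Rec));
        df.read((char*)&data, sizeof(data));
        cout << data.name << '\n';
        p = data.cityLink;                 // next person in the same city
    }
}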
-------
Mon 04/30/07
3872 vander wegen
direct file.
pkey -- primary key index .. logically, just points to where the data is.
then you can have another index.
then we found out you could have non-unique indices -- you take the non-unique one .. do it by model ..
we make a unique value table, which is really just another index.
every model 91, for example, goes over to the file, and they get linked up on the model link field.
kind of one structure does it all.
with blocking -- all of the blocks fit in one index file. .. you could store all your root pointers and whatnot in the first record of your index file.
the last handout
field definition table.
union - can overlay definitions
so finduniquekey("id",11915,0,' ');
you use one or the other value -- we use into, float, or string. 0 and space are dummy values.
findkey returns a position - it does what it used to do.
finduniquekey.. does all
the data file record is at my current index block[pos].data;
.. unless pos is -1..
if it's not unique - i get back the start of a linked list .. same routine whether it's unique or not unique.
...
so we can dynamically generate indices .. and then blow them away when we don't need them.
this is the beginning of a database.
final is wed. at 2:15
-------
Wed 05/02/07
exam really on files
1) OS properties and evolution
multiprogramming and a single processor
running a job or task -- you either get time slices or you get interrupted.
interrupt driven every time you ask for some i/o
journey of a byte - what happens when you say read and write.
request gets queued etc
we worried a lot about that and what goes on when you ask for something
idea of an i/o channel as a separate computer .. moves the disk and reads from it.
disk formats -- sectors -- can only sense the beginning of a sector -- blocking -- one big physical record
buffering -- stay one buffer ahead -- you anticipate what the next read is going to be
we worried about text files vs binary
system sort -- merge .. kind of a great idea, but if you're going to use that, you have implications of what your file looks like
you need to tell it the key -- so fixed location for the key .. and probably fixed length records as well -- but text files have eol marks, so you know how long a line is.
from here, we got into direct files -- random access -- seekg or seekp in C++.
but now we're doing physical i/os
logical vs physical i/o
should be able to create a direct file -- can do that as a sequential file --
relative record number.
should be able to read/write -- you can actually write on top of any record that you like.
start building an index
logical index -- an array of int pointers.
we had static structures -- the limit was that you can't put more data records than the size of your index
we looked at dynamic
linked processing .. so adding a link field, we could do linked list processing.
we had the issue of an avail list
a lot of what we did in memory we can do in files -- now limited only by disk
how do you save an index to a file --
we want to be able to increase the efficiency of the programs. - if you're going to go to the disk, go to the disk.
we worried about updating -- made a big deal about this -- how do you update a file, how do you change it
adds, changes, deletes
for both sequential and direct files
process charts are different
sequential
old master -> new master .. you had transactions that had to be sorted .. had a data capture program
and you got a free backup, because tomorrow the new master becomes the old master.
direct file -- one file .. master . the version.
and you also had a file called an index file -- so that you know how to interpret the file.
no batching needed with a direct file -- its the same number of i/os whether you group them or not.
so huge difference in the up-to-dateness of the data file in these two pictures.
backup -
backup in sequential land is kinda free - you get a free backup -- if something goes wrong , you can always fix the transactions and run it again.
in direct land, it became more complex, because once you made a change, the change was there.
for sequential -- save one old master .. if something goes wrong, you can rerun all the transactions and then you can regenerate any version that you want.
secondary indices .. that applies to a direct file .. just have another index.
what happens if the values in there are not unique --
then we have a unique value table, and we ran that as a plain old unique key -- the non-unique part of that, we linked on the file .. so we followed the links
secondary indices that were either unique or non-unique.
more time in these dynamic indices.
block search -- m-way tree - trouble -- what happens if a block fills up
we got into B tree and B+ and B* tree
dynamically expand and shrink.
hash index - you can have a very very large one by segmenting the hash table -- we took this index and broke it up into chunks and wrote those out as blocks.
hash file - where you don't have an index at all .. overflow area.
last thing we did - field definition table.
field name, pointer to where it starts.
generic processing -- whether unique or not unique.
big deal: i/o counts.
one of the questions will be!!
Findkey for b-tree
deletekey from hash file
probably not so much coding ..
show a sample of a three level b-tree structure
or maybe given something like that , what would it look like after these three values are deleted -- how do the algorithms work
final wed 2:15
-------
Fri 05/04/07
wed - i/o counts -- it nails you.
sequential files -- not as bad as we said -- blocking and buffering ..
i/o -
df.read
sequential file -
straight through the file
2000 logical i/os .. how many physical? depends on how many we get in a sector.
O(n) - but a sweet constant of proportionality
direct files -- rotational delay, seek delay .. you pay a lot for direct files
so each logical i/o on a direct file is physical -- but if you don't do very many of those, it's pretty quick.
GONNA be around monday ..
b-tree - make each block a multiple of your sector size
.. and make blocks really big.
no old stuff
b
b+ - two columns instead of 3
b* - 2/3 full