
Variable Format

This page gives the details about how Commodore BASIC stores user variables.  A user variable is a piece of the computer's memory (RAM) that a programmer may refer to by name.  Besides user variables (described here), BASIC also has reserved variables, which are conceptually the same (although their internal format is quite different from the one described here), and secret variables, which are various bytes of memory maintained by BASIC (or the KERNAL) to which the programmer has only indirect access (i.e., they have no name).  All versions of CBM BASIC use the same format for user variables (simply "variables" from now on), except for strings on some versions (described below).
 
Some of the details given here are not useful to the casual BASIC programmer.  However, those details will be important if you want to interface machine-language programs with BASIC (typically with USR or SYS).
 
  What's in a name?  
A very important thing is the variable's name.  In CBM BASIC, a variable name always begins with an un-shifted letter of the alphabet (on most machines, this will be a capital letter by default).  In detail, its character code must be between 65 (ASCII "A" or PETSCII "a") and 90 (ASCII "Z" or PETSCII "z").  A variable name may optionally have more characters, each of which may be an un-shifted letter of the alphabet or a digit "0" to "9" (character codes 48 to 57).  However, there are three important points to be made:
  1. A variable name must not be, or have embedded in it, any BASIC keyword.  For example TO and ATOM would be illegal names (because TO is a keyword).
  2. If a variable has only a one-character name, it conceptually has ASCII null (code zero) as a second character (the null is not really stored in the program).
  3. If a variable has more than two characters, they will be ignored.  So XY and XY2 would refer to the same variable ("XY" in this example).
There is an exception to rule 1: a variable name may have a reserved variable name embedded within it.  For example, LOST would be a valid name for a user variable, although ST is a reserved variable name (that is, ST is a BASIC keyword).  The reason for this (useful) exception is that reserved variables are not stored as tokens (the way functions, commands, and statements are stored in a program).
 
There are two broad classes of variables: scalars and arrays.  Scalar variables are most common; so common, in fact, that the term "variable" (without qualification) refers to a scalar variable.  These hold a single piece of information.  An array variable (or simply called an array), in contrast, holds multiple values.  First we'll consider scalar variables because they are simpler, and they are used as the foundation for arrays.
 
A scalar variable may be one of four types, indicated by the syntax of a BASIC line.  This is usually a suffix (or lack thereof) but in one case a prefix.  BASIC encodes the type (in RAM storage) by setting the high bit (bit 7) in none, one, or both bytes which store the name in variable memory. Here is a listing in table format:
Type                    Prefix   Suffix   High bits     Example Name   Example Bytes (hexadecimal)
floating-point number   -        -        none          A              41 00
integer number          -        %        both chars    A%             C1 80
string                  -        $        second char   A$             41 80
user function           FN       -        first char    FN A           C1 00
 
It is important to note that the byte encoding (setting of high bits) only applies to variable memory (a part called the variable descriptor); it does not apply to bytes that store a BASIC program line (this is just normal text, except for the keyword FN).
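
As a rough illustration of the table above, here is a minimal Python sketch (not part of BASIC, and not how the interpreter is actually written) that builds the two name bytes as they would appear in variable memory.  The function name encode_name and the type labels "float", "integer", "string" and "function" are my own inventions for this example.  The name is passed without its prefix or suffix.

```python
# Which high bits to set in (first byte, second byte) for each type.
# The type labels are illustrative names chosen for this sketch.
HIGH_BITS = {
    "float":    (0x00, 0x00),  # no high bits set
    "integer":  (0x80, 0x80),  # both characters
    "string":   (0x00, 0x80),  # second character only
    "function": (0x80, 0x00),  # first character only
}


def encode_name(name, var_type):
    """Return the two name bytes as stored in variable memory."""
    significant = name[:2]  # characters past the second are ignored
    first = ord(significant[0])
    # A one-character name conceptually has a null second character.
    second = ord(significant[1]) if len(significant) > 1 else 0
    mask1, mask2 = HIGH_BITS[var_type]
    return bytes([first | mask1, second | mask2])


for var_type in ("float", "integer", "string", "function"):
    print(var_type, encode_name("A", var_type).hex(" ").upper())
```

Because only two characters are kept, encode_name("XY2", "float") and encode_name("XY", "float") give the same bytes, which is rule 3 from the naming section.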
 
The other, arguably more important, part of a variable is its value.  A variable's type (listed above) determines the values it may hold, and how the value will be stored in variable memory.  All scalar variables use 5 bytes (conceptually) to store the value.  Thus, when including the two bytes which store the name, a variable uses 7 bytes of variable memory.  This is not very efficient for strings or integers (as we shall see), but a constant size makes variable name searches faster (and size normally isn't a problem with scalar variables).
 
A floating-point type uses all 5 bytes allocated in variable storage.  The first byte holds the number's binary exponent (-127 to 127) in excess 128 format (i.e., the value 128 is added to the real binary exponent... or for you machine-language programmers, simply flip the high bit).  Note a value of zero will have the first/exponent byte set to zero (this is logically equivalent to an exponent of -128).  The next four bytes hold the number's mantissa, stored high-byte first.  However, the first of these discards the always-set high bit (except for zero) and replaces it with a sign bit (known as "packed" format).  So the high bit will be 0 if the number is positive or 1 if the number is negative.  If the number is zero (neither positive nor negative), all bytes of the mantissa are irrelevant.  The mantissa represents a floating-point value from 0.5 (inclusive) to 1.0 (exclusive).  Some examples:
Floating-point value   Mantissa       Binary Exponent   Encoded Exponent   Encoded Mantissa
                                      (times 2^value)   (hexadecimal)      (hexadecimal)
zero                   whatever       -infinity         00                 xx xx xx xx (don't care)
0.5                    0.5            0                 80                 00 00 00 00
-0.5                   -0.5           0                 80                 80 00 00 00
1                      0.5            1                 81                 00 00 00 00
-1                     -0.5           1                 81                 80 00 00 00
1.5                    0.75           1                 81                 40 00 00 00
-1.5                   -0.75          1                 81                 C0 00 00 00
2                      0.5            2                 82                 00 00 00 00
-2                     -0.5           2                 82                 80 00 00 00
15                     0.9375         4                 84                 70 00 00 00
-15                    -0.9375        4                 84                 F0 00 00 00
255                    0.99609375     8                 88                 7F 00 00 00
-255                   -0.99609375    8                 88                 FF 00 00 00
0.2 (1/5)              0.8            -2                7E                 4C CC CC CD
-0.2                   -0.8           -2                7E                 CC CC CC CD
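
To check rows of the table by hand, the decoding rules above can be written as a few lines of Python.  This is only a sketch of the format as described on this page (the function name decode_float is made up for the example); it is not the ROM routine.

```python
def decode_float(raw):
    """Decode a 5-byte packed CBM BASIC floating-point value."""
    if len(raw) != 5:
        raise ValueError("expected exactly 5 bytes")
    if raw[0] == 0:
        return 0.0  # exponent byte of zero means zero; mantissa is ignored
    exponent = raw[0] - 128             # undo the excess-128 encoding
    negative = bool(raw[1] & 0x80)      # sign bit sits where the implied 1 was
    # Restore the always-set high bit, then read the mantissa high-byte first.
    mantissa_bits = int.from_bytes(bytes([raw[1] | 0x80]) + bytes(raw[2:5]), "big")
    mantissa = mantissa_bits / 2**32    # 0.5 <= mantissa < 1.0
    value = mantissa * 2.0**exponent
    return -value if negative else value


print(decode_float(bytes.fromhex("81 40 00 00 00")))  # the 1.5 row
print(decode_float(bytes.fromhex("7E 4C CC CC CD")))  # the 0.2 row
```

The 0.2 row will not decode to exactly 0.2, because 0.2 has no exact binary representation and the 32-bit mantissa is rounded.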
 
An integer (scalar) type only uses the first two bytes of the allocated five-byte storage. It is stored as a signed number.  Thus the valid range is -32768 to +32767.  Contrary to normal 6502 convention (but the same as floating-point), the data is stored high-byte first.  Some examples:
Integer value   Encoded Value (hexadecimal)
zero            00 00 00 00 00
1               00 01 00 00 00
-1              FF FF 00 00 00
2               00 02 00 00 00
-2              FF FE 00 00 00
15              00 0F 00 00 00
-15             FF F1 00 00 00
255             00 FF 00 00 00
-255            FF 01 00 00 00
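
The integer format is simple enough that Python's built-in byte conversions cover it.  The sketch below (the function names are my own) encodes a value into the five-byte slot and decodes it again, assuming the three unused bytes are zero as in the table.

```python
def encode_integer(value):
    """Return the 5-byte scalar slot for a BASIC integer (high byte first)."""
    # to_bytes raises OverflowError outside -32768..32767.
    return value.to_bytes(2, "big", signed=True) + bytes(3)


def decode_integer(raw):
    """Read a signed 16-bit integer from the first two bytes of the slot."""
    return int.from_bytes(raw[:2], "big", signed=True)


print(encode_integer(-15).hex(" ").upper())
print(decode_integer(bytes.fromhex("FF 01 00 00 00")))
```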
 
On most CBM computers, the first part of a string variable (known as a string descriptor) uses 3 bytes of the allocated five-byte storage.  The first gives the length of the string.  If not zero, the next two bytes form a pointer to the actual string data (in standard 6502 low-byte-first format).  On the CBM-II series only, a fourth byte is used which indicates which RAM Bank holds the string data.  Some examples (the data address will vary wildly in practice):
String value   String Descriptor   String Data Address
               (hexadecimal)       (hexadecimal)
""             00 xx xx 00 00      none
"HELLO"        05 F9 FE 00 00      FEF9 (most machines)
"HELLO"        05 F8 FB 04 00      FBF8, Bank 4 (CBM-II with 256K or more RAM)
"HELLO"        05 F8 FB 02 00      FBF8, Bank 2 (CBM-II with 128K RAM)
"HELLO"        05 F8 FB 01 00      FBF8, Bank 1 (CBM-II with 64K RAM)
 
The string data is allocated dynamically and may move when BASIC performs garbage collection (see FRE).  It will consist of (at least) the number of bytes specified by the length (first byte of the descriptor, see above).  In BASIC versions 3.5 and greater, another two or three bytes will follow the data in RAM.  I call these bytes a back-pointer.  For actual string variables, the back-pointer will hold the address of the string descriptor (i.e., it points back to the descriptor).  For non-variable strings (temporary strings), the high-byte of the back-pointer will be 255.  This back-pointer obviously uses additional memory, and maintaining it makes BASIC very slightly slower.  The reason for its existence is that it makes garbage collection much, much faster!  For example, on a C64 (BASIC 2) which lacks any back-pointer, garbage collection can take 30 seconds or more, and the computer will appear "dead" during this time.  The same program on a Plus/4 (BASIC 3.5) or C128 (BASIC 7) will finish garbage collection in 1 or 2 seconds.  Note on computers in the CBM-II series, the back-pointer is three bytes instead of two; the extra (final) byte holds the bank number where the descriptor may be found.
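
Going back to the string descriptor itself (the length byte, the low-byte-first pointer, and the bank byte on the CBM-II), here is a small Python sketch of how one might pick it apart.  The function name and the cbm2 flag are inventions for this example, and the back-pointer is not handled here.

```python
def decode_string_descriptor(raw, cbm2=False):
    """Split a string descriptor into length, data address and bank."""
    length = raw[0]
    if length == 0:
        # Empty string: the pointer bytes are meaningless.
        return {"length": 0, "address": None, "bank": None}
    address = raw[1] | (raw[2] << 8)   # standard 6502 low-byte-first pointer
    bank = raw[3] if cbm2 else None    # fourth byte is only used on CBM-II
    return {"length": length, "address": address, "bank": bank}


print(decode_string_descriptor(bytes.fromhex("05 F9 FE 00 00")))
print(decode_string_descriptor(bytes.fromhex("05 F8 FB 04 00"), cbm2=True))
```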
 
A user function is stored very similarly to a string in BASIC's variable memory.  The first two bytes are a pointer to the definition in program memory of the user's function (i.e., the first non-space character following "=" in the function definition, see DEF).  The next two bytes point to the data of the independent variable.  The last byte is unused, although it will typically hold the first character of the definition (a bug/inefficiency in the BASIC interpreter).  Note when you create a function using DEF, a second variable may be created.  If the independent variable in the definition (the name used in parentheses) does not yet exist, it will be created.
 
Here is an example program you can run on any version of CBM BASIC:
NEW

READY.
10 DEF FN Y(X) = 2*X+1
RUN

READY.
 
Now if you examine variable memory (the location varies by machine), you will see two variables: the function Y, and the floating-point variable X.  Below is a memory dump from the C128, but other machines will be similar (the pointers will be different, but point to the same thing):
M 10400 1040F
>10400 D9 00 10 1C 09 04 32 58 00 00 00 00 00 00 ....
       ^^ ^^ function Y     ^^ ^^ floating X     (^ points to name)
M 10400 1040F
>10400 D9 00 10 1C 09 04 32 58 00 00 00 00 00 00 ....
  variable pointer ^^ ^^ -> -> -> ^^ ^^ ^^ ^^ ^^ (^ points to data)
M 10400 1040F
>10400 D9 00 10 1C 09 04 32 58 00 00 00 00 00 00 ....
             ^^ ^^ definition pointer
M 01C10 01C1F
>01C10 32 AC 58 AA 31 00 00 00 ....
       2  *  X  +  1
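
If you would rather let a program do the reading, the same seven bytes from the dump can be picked apart like this.  It is only a sketch in Python under the layout described above (name, definition pointer, variable pointer, unused byte); parse_function_entry is a hypothetical helper, not anything found in ROM.

```python
# The first seven bytes of the C128 dump shown above (function Y).
entry = bytes.fromhex("D9 00 10 1C 09 04 32")


def parse_function_entry(raw):
    """Split a 7-byte user-function entry into its fields."""
    name = chr(raw[0] & 0x7F)              # strip the type bit from char 1
    if raw[1] & 0x7F:
        name += chr(raw[1] & 0x7F)         # second character, if any
    return {
        "name": name,
        "definition": raw[2] | (raw[3] << 8),     # low-byte-first pointer
        "argument_data": raw[4] | (raw[5] << 8),  # data of the independent variable
        "unused": raw[6],
    }


fields = parse_function_entry(entry)
print(fields["name"], hex(fields["definition"]), hex(fields["argument_data"]))
```

The definition pointer comes out as $1C10 (where the text "2*X+1" lives) and the variable pointer as $0409 (the first data byte of X), matching the annotations in the dump.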
 
  Arrays  
The second class of variables, arrays, lets you read/write multiple values with a single name and one or more index values.  There are several ways you can organize your multiple values.  The easiest, and very common form, is a simple "list", also known as a linear or 1-D array.  Common uses of this type include a list of names (like days of the week) or a mathematical vector.  This just needs an integer value to follow the array name.  The next most common type is a "table" or 2-D array.  This is useful to hold a mathematical matrix, or almost any table of data.  Think of a spreadsheet as a good analogy; if that doesn't mean anything to you, think of a calendar, organized into rows (weeks) and columns (days of the week).  A 2-D array needs two integers to follow the array name.  You can extend this idea even further (3-D, 4-D, 5-D, etc. arrays), but they are extremely rare.  In theory BASIC limits you to 255 dimensions, but in reality you can't enter a line long enough to specify such a large array.
 
Any scalar type, except user function, can be used as an array.  Note that the type of a variable is conceptually a part of a variable's name.  Thus, you can have several (all different) variables called "N":
  • N (a scalar floating-point)
  • N% (a scalar integer)
  • N$ (a string)
  • FN N (a user-function)
  • N(0) [a 1-D floating-point array]
  • N%(0) [a 1-D integer array]
  • N$(0) [a 1-D string array]
Arrays are stored in a different part of memory from scalar variables, which is why you can have two variables called N, for example, if one is scalar and the other is an array.  The type (float/integer/string) is encoded in the high bits of the variable's name for arrays just like for scalars (see above).  The difference is what follows the name in memory (in brief, a descriptor and then the data).
 
The first thing following the name is an array descriptor, the size of which varies in length based on the number of dimensions.  The descriptor will be 3 + 2*D bytes in size (where D is the number of dimensions).  So a linear (1-d) array will have a 5-byte descriptor, a table (2-d) array will have a 7-byte descriptor, etc.  The first part of the descriptor is a two-byte size of the entire array: the two bytes of the name/type, plus the size of the descriptor, plus the size of all data bytes.  This size is stored in standard low-byte, high-byte format.  Following that is a single byte which tells the number of dimensions.  Next is a series of two-byte sizes; these tell the number of elements in each dimension of the array.  It is important to note these two-byte sizes are stored in non-standard high-byte-first format!  Also, they will be listed in the reverse order from the one used in a DIM statement. 
 
Following the array descriptor are the actual data bytes of the array elements.  Each element only occupies the actual number of bytes it needs (no wasted space like scalars, see above).  All elements of the first-listed dimension are stored first, and for multidimensional (2-d or more) arrays, additional sets of the first-listed dimension will appear.  (This is the opposite of the way some languages, like C, store array data.)  The number of bytes used for each element is:
Array Type              Bytes per Element   Notes
floating-point number   5
integer number          2
string                  3                   most machines (string descriptor only, not string data)
string                  4                   CBM-II series (string descriptor only, not string data)
 
 
So let's do an example.  DIM N%(10, 20) will create a 2-d "table" of integers.  This array actually has 11 x 21 = 231 elements because BASIC always allocates space for a "zeroth" element (in every dimension).  You should think of this as 11 columns of 21 rows, because that is how BASIC will store the data.  However, unless you are actually peeking at the internal storage of the array, you could equally well think of this as an array of 11 rows and 21 columns.  Because this is an integer-type array, each element needs only two bytes, and there are 231 elements, so the size needed for the array data is 462 bytes (2 bytes/integer * 231 integers).  The size of the array descriptor is 3 + 2*D = 7 bytes.  The total size of the array is 2 (name/type) + 7 (array descriptor) + 462 (array data) = 471 bytes.
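
The same arithmetic works for any array, so here it is as a short Python sketch.  The function name array_size and the type labels are made up for this example, and the string size of 3 assumes a machine other than the CBM-II (which would use 4).

```python
# Bytes per element, from the table above ("string" would be 4 on CBM-II).
ELEMENT_SIZE = {"float": 5, "integer": 2, "string": 3}


def array_size(element_type, *dims):
    """Total bytes used by an array, given the sizes from its DIM statement."""
    descriptor = 3 + 2 * len(dims)     # size word, dimension count, one word per dimension
    count = 1
    for d in dims:
        count *= d + 1                 # +1 for the "zeroth" element
    return 2 + descriptor + count * ELEMENT_SIZE[element_type]


print(array_size("integer", 10, 20))   # the DIM N%(10,20) example
```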
 
Below is a "listing" showing a memory dump after defining a few values in our example array.  The location in memory is based on the C128.  On other machines the data would be the same; it would just be located at a different address:
CLR: DIM N%(10,20)

READY.
N%(0,0) = 7 : REM first column, first row

READY.
N%(1,0) = 9 : REM second column, first row

READY.
N%(0,1) = 55: REM first column, second row

READY.
MONITOR

MONITOR
    PC  SR AC XR YR SP
; FB000 00 00 00 00 F8
M 10400 1040F
>10400 CE 80 D7 01 02 00 15 00 0B 00 07 00 09 00 00 00 ....
       name  size  Ds 20+1  10+1 N(0,0) N(1,0)
             471      DIM(10,20)

As you can see, the size is just as calculated above (471 bytes).  There are 2 dimensions in the array: 21 rows and 11 columns, which are listed in RAM in the reverse order of the DIM(10,20) statement (and +1 because of the zeroth element) as described previously.  The first few items of the first row are visible.  In particular, we can see the values assigned to the first two columns of that row: 7 and 9 (two bytes each because they're integer type).
 
But where is value 55?  It is located at the start of the next row, which is 11 elements or 22 bytes after the first data element.  Looking above we see the very first element is at address $409 (in Bank 1), which is address 1033 in decimal.  So 22 bytes later is 1055, or $41F in hexadecimal... let's take a peek!
M 1041F 10428
>1041F 00 37 00 00 00 00 00 00 ....
      N(0,1) N(1,1)

As you can see, the element at N(0,1) is correctly set to 55 ($37 hexadecimal).  The next value, N(1,1), was never set by us, so it has the BASIC default value of zero.
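
The address arithmetic used above generalizes to any number of dimensions: the first-listed index moves fastest through memory.  Here is a minimal Python sketch of that rule; element_address is a hypothetical helper, and the data start address of 1033 is simply the one from this C128 example.

```python
def element_address(data_start, element_size, dims, indices):
    """Address of one array element; dims are the sizes given to DIM."""
    offset = 0
    stride = 1
    for limit, index in zip(dims, indices):
        if not 0 <= index <= limit:
            raise IndexError("index out of range")
        offset += index * stride
        stride *= limit + 1            # each dimension holds limit+1 elements
    return data_start + offset * element_size


# N%(0,1) in the DIM N%(10,20) example: 11 elements (22 bytes) past the start.
print(element_address(1033, 2, (10, 20), (0, 1)))
```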
 
I guess this is as good a place as any to point out that the POINTER function will return the address of the first byte in a variable's actual data.  As just mentioned, the very first element, N(0,0), is at location 1033 in this example and the first element in the next row is at location 1055.  So let's see if POINTER agrees!
X
READY.
PRINT POINTER(N%(0,0))
 1033

READY.
PRINT POINTER(N%(0,1))
 1055

READY.
 
Although POINTER is really handy for interfacing BASIC programs with ML programs, it is only available in version 7.0.  In other versions, the general method is to have the ML program call the "find variable" routine in ROM, which varies by machine.
 

© H2Obsession, 2014, 2015