File Parsing
http://www.aidanf.net/software/elie_an_adaptive_information_extraction_system
Google Page Rank:
http://www.ams.org/featurecolumn/archive/pagerank.html
http://www.rose-hulman.edu/~bryan/googleFinalVersionFixed.pdf
Suffix Array:
http://sary.sourceforge.net/docs/suffix-array.html
File I/O
http://www.xs4all.nl/~waterlan/dosdir.txt DOS/UNIX file routines
http://www.angelfire.com/country/aldev0/cpphowto/cpp_BinaryFileIO.html
http://www.codersource.net/cpp_file_io_binary.html
http://www.codeguru.com/forum/showthread.php?t=269648
http://en.allexperts.com/q/C-1040/writing-struct-variable-size.htm
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2709412/
http://code.google.com/apis/protocolbuffers/docs/overview.html Google Protocol Buffer
http://www.cv.nrao.edu/fits/traffic/scidataformats/faq.html
File formats HDF5, netCDF http://en.wikipedia.org/wiki/Hierarchical_Data_Format
http://www.hdfgroup.org/projects/biohdf/ BioHDF
The C standard only says that char is no bigger than short, which is no bigger than int, which is no bigger than long. This implies that it is possible for all these types to be the same size. Further, the size of char is _not_ defined to be 8 bits; it is defined to be large enough to hold any member of the implementation's character set. This means that on some exotic implementation all integer types could be the same size, and that size could be something strange such as 24 bits.
The sizes of the built-in types in C++ are not fixed. An int could be 8, 16, 32, or 64 bits in size. On the most common 32-bit platforms char is 8 bits, short is 16 bits, and int and long are both 32 bits.
Endianness: http://www.codeproject.com/KB/cpp/endianness.aspx
The point to be aware of is that writing binary data on a system using one endian type and reading on a system using another renders the data junk. The same applies when transferring binary data between such disparate systems on a network. The solution is to ensure the byte ordering of multi-byte values that are intended to be shared between systems is defined and each implementation ensures it converts to and from this format as required. For some systems this will be a simple do nothing operation, for others it will mean reversing bytes when reading or writing them.
The Intel(tm) family of processors stores the least significant byte first--the little-endian style.
Big-endian means the most significant byte takes the lowest address ('comes first'); little-endian is the other way round. This applies to multi-byte values; arrays of chars need no such treatment.
In big-endian form, because the high-order byte comes first, you can always test whether a number is positive or negative by looking at the byte at offset zero.
#define BIG_ENDIAN 0
#define LITTLE_ENDIAN 1

int TestByteOrder() {
    short int word = 0x0001;
    char *byte = (char *) &word; // view the short as raw bytes
    return byte[0] ? LITTLE_ENDIAN : BIG_ENDIAN;
}
#include <algorithm> // required for std::swap

#define ByteSwap5(x) ByteSwap((unsigned char *) &x, sizeof(x))

void ByteSwap(unsigned char *b, int n) {
    int i = 0; // "register" dropped; it is deprecated and ignored by modern compilers
    int j = n - 1;
    while (i < j) {
        std::swap(b[i], b[j]);
        i++, j--;
    }
}
The following function swaps the 4 bytes of an integer (32-bit platform) and can be used to
convert from the little-endian representation to big-endian and vice versa.
unsigned int swap32(unsigned int value) {
    return ((value & 0xFF000000) >> 24) |
           ((value & 0x00FF0000) >>  8) |
           ((value & 0x0000FF00) <<  8) |
           ((value & 0x000000FF) << 24);
}
A 16-bit version of a byte swap function:
unsigned short ByteSwap16(unsigned short n16) {
    return (n16 >> 8) | (n16 << 8);
}
ifstream::read reads data into a character buffer--it expects char * as its first argument. When we want to read a long, we can't just pass the address of a long to it--the compiler doesn't know how to convert a long * to a char *. This is one of those cases where we have to force the compiler to trust us. We want to split the long into its constituent bytes (ignoring, for now, the big-endian/little-endian problem). A reasonably clean way to do it is to use reinterpret_cast. We are essentially telling the compiler to "reinterpret" the chunk of memory occupied by the long as a series of chars. We can tell how many chars a long contains by applying the sizeof operator to it.
#include <fstream>
using namespace std;

int main()
{
    int x = 5;
    ofstream archive("coord.dat", ios::binary);
    archive.write(reinterpret_cast<const char *>(&x), sizeof(x));
    archive.close();
    // read it back
    int y = 0;
    ifstream in("coord.dat", ios::binary);
    in.read(reinterpret_cast<char *>(&y), sizeof(y));
}
class MP3_clip{
private:
std::time_t date;
std::string name;
int bitrate;
bool stereo;
public:
void serialize();
void deserialize();
};
void MP3_clip::serialize(){
int size=name.size();// store name's length
//empty file if it already exists before writing new data
ofstream arc("mp3.dat", ios::binary|ios::trunc);
arc.write(reinterpret_cast<char *>(&date),sizeof(date));
arc.write(reinterpret_cast<char *>(&size),sizeof(size));
arc.write(name.c_str(), size+1); // write final '\0' too
arc.write(reinterpret_cast<char *>(&bitrate), sizeof(bitrate));
arc.write(reinterpret_cast<char *>(&stereo), sizeof(stereo));
}
The implementation of deserialize() is a bit trickier, since we need to allocate a temporary buffer for the string:
void MP3_clip::deserialize(){
ifstream arc("mp3.dat", ios::binary);
int len=0;
char *p=0;
arc.read(reinterpret_cast <char *> (&date), sizeof(date));
arc.read(reinterpret_cast<char *> (&len), sizeof(len));
p=new char [len+1]; // allocate temp buffer for name
arc.read(p, len+1); // copy name to temp, including '\0'
name=p; // copy temp to data member
delete[] p;
arc.read(reinterpret_cast<char *> (&bitrate), sizeof(bitrate));
arc.read(reinterpret_cast<char *> (&stereo), sizeof(stereo));
}
The <fstream> library defines the following open modes and file attributes:
ios::app // append
ios::ate // open and seek to file's end
ios::binary // binary mode I/O (as opposed to text mode)
ios::in // open for read
ios::out // open for write
ios::trunc // truncate file to 0 length
seekp() // set position of put pointer (ostream)
tellp() // get position of put pointer (ostream)
seekg() // set position of get pointer (istream)
tellg() // get position of get pointer (istream)
ofstream fout("parts.txt");
fout.seekp(10); // advance 10 bytes from offset 0
cout<<"new position: "<<fout.tellp(); // display 10
You can use the following constants for repositioning a file's pointer:
ios::beg // position at file's beginning
ios::cur //current position, for example: ios::cur+5
ios::end // position at file's end
Reading and Writing Data
The fstream classes overload the << and >> operators for all the built-in datatypes as well as std::string and std::complex. The following example shows how to use these operators. First, we open a file, write two fields to it, rewind it and read the previously written fields:
fstream logfile("log.dat", ios::in | ios::out | ios::trunc);
time_t login; string user;
logfile << time(0) << ' ' << "danny" << '\n'; // write a new record
logfile.seekg(0, ios::beg); // rewind the get pointer (seekp moves only the put pointer)
logfile >> login >> user; // read the previously written values
This is a good place to explain the various types of casts. You use
const_cast--to remove the const attribute
static_cast--to convert related types
reinterpret_cast--to convert unrelated types
static_cast--the inverse of implicit conversion. Whenever type T can be implicitly converted to type U (in other words, a T is-a U), you can use static_cast to perform the conversion the other way. For instance, a char can be implicitly converted to an int:
char c = '\n';
int i = c; // implicit conversion
Therefore, when you need to convert an int into a char, use static_cast:
int i = 0x0d;
char c = static_cast<char> (i);
Or, if you have two classes, Base and Derived: public Base, you can implicitly convert pointer to Derived to a pointer to Base (Derived is-a Base). Therefore, you can use static_cast to go the other way:
Base * bp = new Derived; // implicit conversion
Derived * dp = static_cast<Derived *> (bp);
http://it.toolbox.com/blogs/programming-life/lets-serialize-6554
http://code.google.com/apis/protocolbuffers/docs/cpptutorial.html
http://xparam.sourceforge.net/guide/user.html
http://www.gamedev.net/community/forums/topic.asp?topic_id=467545
http://www.parashift.com/c++-faq-lite/serialization.html
If you really wish to save data as 10 bits then you will have to devise ways to pack and unpack the individual values into a set of bytes. An obvious point to note is that four 10-bit words pack into five 8-bit bytes. If you do this, then design and implement a class to handle this data type.
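One way such a packer might look (the function names and the MSB-first layout are my own choices, not any standard format): four values in 0..1023 are concatenated into 40 bits and emitted as five bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pack four 10-bit values into five bytes, most significant bits first.
std::vector<uint8_t> pack10(const uint16_t v[4]) {
    uint64_t bits = 0;
    for (int i = 0; i < 4; ++i) {
        assert(v[i] < 1024);        // each value must fit in 10 bits
        bits = (bits << 10) | v[i]; // append 10 bits
    }
    std::vector<uint8_t> out(5);
    for (int i = 4; i >= 0; --i) {  // split the 40 bits into 5 bytes
        out[i] = bits & 0xFF;
        bits >>= 8;
    }
    return out;
}

// Inverse: rebuild the four 10-bit values from five bytes.
void unpack10(const std::vector<uint8_t>& in, uint16_t v[4]) {
    uint64_t bits = 0;
    for (int i = 0; i < 5; ++i)
        bits = (bits << 8) | in[i];
    for (int i = 3; i >= 0; --i) {
        v[i] = bits & 0x3FF;        // take the low 10 bits
        bits >>= 10;
    }
}
```

Packing then unpacking returns the original four values, which makes the pair easy to unit-test.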
Tag-driven file format; position-oriented format
http://codingplayground.blogspot.com/2009/03/memory-mapped-files-in-boost-and-c.html
http://en.wikipedia.org/wiki/Memory-mapped_file
strerror(errno) in the code below can produce
"Value too large for defined data type"
if fopen() is used on big files, so use fopen64().
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main()
{
    FILE *pFile = fopen64("unexist.ent", "r");
    if (pFile == NULL)
        printf("Error opening file unexist.ent: %s\n", strerror(errno));
    return 0;
}
http://en.wikibooks.org/wiki/Optimizing_C%2B%2B/General_optimization_techniques/Input/Output
File "memory_file.hpp":
#ifndef MEMORY_FILE_HPP
#define MEMORY_FILE_HPP
/*
Read-only memory-mapped file wrapper.
It handles only files that can be wholly loaded
into the address space of the process.
The constructor opens the file, the destructor closes it.
The "data" function returns a pointer to the beginning of the file,
if the file has been successfully opened, otherwise it returns 0.
The "length" function returns the length of the file in bytes,
if the file has been successfully opened, otherwise it returns 0.
*/
class InputMemoryFile {
public:
    InputMemoryFile(const char *pathname);
    ~InputMemoryFile();
    const void* data() const { return data_; }
    unsigned long length() const { return length_; }
private:
    void* data_;
    unsigned long length_;
#if defined(__unix__)
    int file_handle_;
#elif defined(_WIN32)
    typedef void * HANDLE;
    HANDLE file_handle_;
    HANDLE file_mapping_handle_;
#else
    #error Only Posix or Windows systems can use memory-mapped files.
#endif
};
#endif
File "memory_file.cpp":
#include "memory_file.hpp"
#if defined(__unix__)
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h> // needed for fstat()
#elif defined(_WIN32)
#include <windows.h>
#endif
InputMemoryFile::InputMemoryFile(const char *pathname):
    data_(0),
    length_(0),
#if defined(__unix__)
    file_handle_(-1)
{
    file_handle_ = open(pathname, O_RDONLY);
    if (file_handle_ == -1) return;
    struct stat sbuf;
    if (fstat(file_handle_, &sbuf) == -1) return;
    data_ = mmap(0, sbuf.st_size, PROT_READ, MAP_SHARED, file_handle_, 0);
    if (data_ == MAP_FAILED) data_ = 0;
    else length_ = sbuf.st_size;
#elif defined(_WIN32)
    file_handle_(INVALID_HANDLE_VALUE),
    file_mapping_handle_(INVALID_HANDLE_VALUE)
{
    file_handle_ = ::CreateFile(pathname, GENERIC_READ,
        FILE_SHARE_READ, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if (file_handle_ == INVALID_HANDLE_VALUE) return;
    file_mapping_handle_ = ::CreateFileMapping(
        file_handle_, 0, PAGE_READONLY, 0, 0, 0);
    if (file_mapping_handle_ == INVALID_HANDLE_VALUE) return;
    data_ = ::MapViewOfFile(file_mapping_handle_, FILE_MAP_READ, 0, 0, 0);
    if (data_) length_ = ::GetFileSize(file_handle_, 0);
#endif
}

InputMemoryFile::~InputMemoryFile() {
#if defined(__unix__)
    munmap(data_, length_);
    close(file_handle_);
#elif defined(_WIN32)
    ::UnmapViewOfFile(data_);
    ::CloseHandle(file_mapping_handle_);
    ::CloseHandle(file_handle_);
#endif
}
File "memory_file_test.cpp":
#include "memory_file.hpp"
#include <algorithm> // for std::copy
#include <iostream>
#include <iterator>

int main() {
    // Write the contents of the source file to the console.
    InputMemoryFile imf("memory_file_test.cpp");
    if (imf.data())
        std::copy((const char*)imf.data(),
                  (const char*)imf.data() + imf.length(),
                  std::ostream_iterator<char>(std::cout));
    else
        std::cerr << "Can't open the file";
}
http://www.meetingcpp.com/index.php/br/items/word-counting-in-c11-lessons-learned.html
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <cctype>
using namespace std;
void countStuff(istream& in, int& chars, int& words, int& lines) {
char cur = '\0';
char last = '\0';
chars = words = lines = 0;
while (in.get(cur)) {
if (cur == '\n' || (cur == '\f' && last == '\r'))
lines++;
else
chars++;
if (!std::isalnum(cur) && // This is the end of a
std::isalnum(last)) // word
words++;
last = cur;
}
if (chars > 0) { // Adjust word and line
if (std::isalnum(last)) // counts for special
words++; // case
lines++;
}
}
int main(int argc, char** argv) {
if (argc < 2)
return(EXIT_FAILURE);
ifstream in(argv[1]);
if (!in)
exit(EXIT_FAILURE);
int c, w, l;
countStuff(in, c, w, l);
cout << "chars: " << c << '\n';
cout << "words: " << w << '\n';
cout << "lines: " << l << '\n';
}
#include <iostream>
#include <fstream>
#include <map>
#include <string>
typedef std::map<std::string, int> StrIntMap;
void countWords(std::istream& in, StrIntMap& words)
{
std::string s;
while (in >> s)
{ ++words[s]; }
}
int main(int argc, char** argv) {
if (argc < 2)
return(EXIT_FAILURE);
std::ifstream in(argv[1]);
if (!in)
exit(EXIT_FAILURE);
StrIntMap w;
countWords(in, w);
for (StrIntMap::iterator p = w.begin( ); p != w.end( ); ++p)
{
std::cout << p->first << " occurred " << p->second << " times.\n";
}
}
#include <string>
#include <vector>
#include <functional>
#include <iostream>
using namespace std;
void split(const string& s, char c, vector<string>& v)
{
    string::size_type i = 0;
    string::size_type j = s.find(c);
    while (j != string::npos) {
        v.push_back(s.substr(i, j - i));
        i = ++j;
        j = s.find(c, j);
        if (j == string::npos)
            v.push_back(s.substr(i, s.length()));
    }
}
int main( ) {
vector<string> v;
string s = "Account Name|Address 1|Address 2|City";
split(s, '|', v);
for (int i = 0; i < v.size( ); ++i) {
cout << v[i] << '\n';
}
}
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
using namespace std;
void split(const string& s, char c, vector<string>& v) {
    // Use string::size_type and npos; stuffing find()'s result into an
    // int and testing j >= 0 relies on npos narrowing to -1.
    string::size_type i = 0;
    string::size_type j = s.find(c);
    while (j != string::npos) {
        v.push_back(s.substr(i, j - i));
        i = ++j;
        j = s.find(c, j);
        if (j == string::npos)
            v.push_back(s.substr(i, s.length()));
    }
}
void loadCSV(istream& in, vector<vector<string>*>& data) {
    vector<string>* p = NULL;
    string tmp;
    // Test getline() directly; looping on !in.eof() processes
    // the last line twice when the file ends without a newline.
    while (getline(in, tmp, '\n')) {
        p = new vector<string>(); // Grab the next line
        split(tmp, ',', *p);      // Use split() defined above
        data.push_back(p);
        cout << tmp << '\n';
    }
}
int main(int argc, char** argv) {
if (argc < 2)
return(EXIT_FAILURE);
ifstream in(argv[1]);
if (!in)
return(EXIT_FAILURE);
vector<vector<string>*> data;
loadCSV(in, data);
// Go do something useful with the data...
for (vector<vector<string>*>::iterator p = data.begin( ); p != data.end( ); ++p) {
delete *p; // Be sure to de-
} // reference p!
}
#include <iostream>
#include <map>
#include <string>
using namespace std;
int main() {
map<string, int> freq; // map of words and their frequencies
string word; // input buffer for words.
while (cin >> word) {
freq[word]++;
}
//--- Write the count and the word.
map<string, int>::const_iterator iter;
for (iter=freq.begin(); iter != freq.end(); ++iter) {
cout << iter->second << " " << iter->first << endl;
}
return 0;
}
#include <iostream>
#include <fstream>
#include <map>
#include <set>
#include <string>
using namespace std;
int main() {
set<string> ignore; // Words to ignore.
map<string, int> freq; // Map of words and their frequencies
string word; // Used to hold input word.
//-- Read file of words to ignore.
ifstream ignoreFile("ignore.txt");
while (ignoreFile >> word) {
ignore.insert(word);
}
//-- Read words/tokens to count from input stream.
while (cin >> word) {
if (ignore.find(word) == ignore.end()) {
freq[word]++; // Count this. It's not in ignore set.
}
}
//-- Write count/word. Iterator returns key/value pair.
map<string, int>::const_iterator iter;
for (iter=freq.begin(); iter != freq.end(); ++iter) {
cout << iter->second << " " << iter->first << endl;
}
return 0;
}
1)
std::ifstream in("some.file");
std::istreambuf_iterator<char> beg(in), end;
std::string str(beg, end);
2)
std::ifstream in("some.file");
std::ostringstream tmp;
tmp << in.rdbuf();
std::string str = tmp.str();
#include <iostream>
#include <algorithm>
using namespace std;
#define SIZE 5
int main()
{
int array[SIZE] = {20, 11, 13, 6, -9};
sort(array, array+SIZE);
for (int i=0; i<SIZE; i++) {
cout << array[i] << " ";
}
return 0;
}
Given: a delimiter-separated file (comma, space, tab, etc.)
The first column is unique (row #) (we can always generate a unique first column with
awk '{print NR, $0 }' file
Python
To do:
- read data from given column #, or column name
- build word frequency dictionary (for given column #)
- sort file by given column
- return row for given row #
- return all rows where the column value in given column # equals the provided value
import csv
import pprint
import cProfile
def csv2dict(file, delim):
reader = csv.reader(open(file, 'rb'), delimiter=delim)
col_names = reader.next()
#create a dictionary such that {column_name: index position}
col_names = dict([(col.lower(),col_names.index(col))
for col in col_names])
return [dict([(col, row[col_names[col]]) for col in col_names])
for row in reader]
#-------------------------------------
file="input.txt"
# naive way, without the csv class
for line in open(file):
title, year, director = line.split(",")
print year, title
# use the csv class
reader = csv.reader(open(file))
for title, year, director in reader:
print year, title
#put in dictionary
csv2dict(file,',')
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(csv2dict(file,','))
#profile
cProfile.run("csv2dict(file, ',')")
#--------------
f = open(file, 'rt')
try:
reader = csv.DictReader(f)
for row in reader:
print row
finally:
f.close()