File Parsing
http://www.aidanf.net/software/elie_an_adaptive_information_extraction_system
Google Page Rank:
http://www.ams.org/featurecolumn/archive/pagerank.html
http://www.rose-hulman.edu/~bryan/googleFinalVersionFixed.pdf
Suffix Array:
http://sary.sourceforge.net/docs/suffix-array.html
File I/O
http://www.xs4all.nl/~waterlan/dosdir.txt DOS/UNIX file routines
http://www.angelfire.com/country/aldev0/cpphowto/cpp_BinaryFileIO.html
http://www.codersource.net/cpp_file_io_binary.html
http://www.codeguru.com/forum/showthread.php?t=269648
http://en.allexperts.com/q/C-1040/writing-struct-variable-size.htm
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2709412/
http://code.google.com/apis/protocolbuffers/docs/overview.html Google Protocol Buffer
http://www.cv.nrao.edu/fits/traffic/scidataformats/faq.html
File formats HDF5, netCDF http://en.wikipedia.org/wiki/Hierarchical_Data_Format
http://www.hdfgroup.org/projects/biohdf/ BioHDF
The C standard only says that char is no bigger than short, which is no bigger than int, which is no bigger than long. This implies that it is possible for all these types to be the same size. Further, the size of char is _not_ defined to be 8 bits; it is defined to be large enough to hold any member of the implementation's character set. This means that on some exotic implementation all integer types could be the same size, and that size could be something strange such as 24 bits.
The sizes of the built-in types in C++ are not fixed. An int could be 8, 16, 32, or 64 bits in size. On the most common 32-bit platforms char is 8 bits, short is 16 bits, and int and long are both 32 bits.
Endianness: http://www.codeproject.com/KB/cpp/endianness.aspx
The point to be aware of is that writing binary data on a system using one endian type and reading on a system using another renders the data junk. The same applies when transferring binary data between such disparate systems on a network. The solution is to ensure the byte ordering of multi-byte values that are intended to be shared between systems is defined and each implementation ensures it converts to and from this format as required. For some systems this will be a simple do nothing operation, for others it will mean reversing bytes when reading or writing them.
The Intel(tm) family of processors stores the least significant byte first--the little-endian style.
Big-endian means the most significant byte takes the lowest address ('comes first'); little-endian is the other way round. This applies to multi-byte values; arrays of chars need no such treatment.
In big-endian form, because the high-order byte comes first, you can always test whether a number is positive or negative by looking at the byte at offset zero.
#define BIG_ENDIAN 0
#define LITTLE_ENDIAN 1

int TestByteOrder() {
    short int word = 0x0001;
    char *byte = (char *) &word; // view the short as raw bytes
    return byte[0] ? LITTLE_ENDIAN : BIG_ENDIAN;
}
#include <algorithm> // required for std::swap

#define ByteSwap5(x) ByteSwap((unsigned char *) &x, sizeof(x))

void ByteSwap(unsigned char *b, int n) {
    int i = 0; // "register" dropped; it is deprecated and ignored by modern compilers
    int j = n - 1;
    while (i < j) {
        std::swap(b[i], b[j]);
        i++, j--;
    }
}
The following function swaps the 4 bytes of an integer (32-bit platform) and can be used to
convert from the little-endian representation to big-endian and vice versa.
unsigned int swap32(unsigned int value) {
    return ((value & 0xFF000000) >> 24) |
           ((value & 0x00FF0000) >>  8) |
           ((value & 0x0000FF00) <<  8) |
           ((value & 0x000000FF) << 24);
}
A 16-bit version of a byte swap function:
unsigned short ByteSwap16(unsigned short n16) {
    return (n16 >> 8) | (n16 << 8);
}
ifstream::read reads data into a character buffer--it expects char * as its first argument. When we want to read a long, we can't just pass the address of a long to it--the compiler doesn't know how to convert a long * to a char *. This is one of those cases where we have to force the compiler to trust us. We want to split the long into its constituent bytes (ignoring, for now, the big-endian/little-endian problem). A reasonably clean way to do it is to use reinterpret_cast. We are essentially telling the compiler to "reinterpret" the chunk of memory occupied by the long as a series of chars. We can tell how many chars a long contains by applying the sizeof operator to it.
#include <fstream>
using namespace std;

int main()
{
    int x = 5;
    ofstream archive("coord.dat", ios::binary);
    archive.write(reinterpret_cast<const char *>(&x), sizeof(x));
    archive.close();
    // read it back
    int y = 0;
    ifstream in("coord.dat", ios::binary);
    in.read(reinterpret_cast<char *>(&y), sizeof(y));
}
class MP3_clip{
private:
std::time_t date;
std::string name;
int bitrate;
bool stereo;
public:
void serialize();
void deserialize();
};
void MP3_clip::serialize(){
int size=name.size();// store name's length
//empty file if it already exists before writing new data
ofstream arc("mp3.dat", ios::binary|ios::trunc);
arc.write(reinterpret_cast<char *>(&date),sizeof(date));
arc.write(reinterpret_cast<char *>(&size),sizeof(size));
arc.write(name.c_str(), size+1); // write final '\0' too
arc.write(reinterpret_cast<char *>(&bitrate), sizeof(bitrate));
arc.write(reinterpret_cast<char *>(&stereo), sizeof(stereo));
}
The implementation of deserialize() is a bit trickier, since we need to allocate a temporary buffer for the string:
void MP3_clip::deserialize(){
ifstream arc("mp3.dat", ios::binary);
int len=0;
char *p=0;
arc.read(reinterpret_cast <char *> (&date), sizeof(date));
arc.read(reinterpret_cast<char *> (&len), sizeof(len));
p=new char [len+1]; // allocate temp buffer for name
arc.read(p, len+1); // copy name to temp, including '\0'
name=p; // copy temp to data member
delete[] p;
arc.read(reinterpret_cast<char *> (&bitrate), sizeof(bitrate));
arc.read(reinterpret_cast<char *> (&stereo), sizeof(stereo));
}
The <fstream> library defines the following open modes and file attributes:
ios::app // append
ios::ate // open and seek to file's end
ios::binary // binary mode I/O (as opposed to text mode)
ios::in // open for read
ios::out // open for write
ios::trunc // truncate file to 0 length
seekp() // set position of put pointer (ostream)
tellp() // get position of put pointer (ostream)
seekg() // set position of get pointer (istream)
tellg() // get position of get pointer (istream)
ofstream fout("parts.txt");
fout.seekp(10); // advance 10 bytes from offset 0
cout<<"new position: "<<fout.tellp(); // display 10
You can use the following constants for repositioning a file's pointer:
ios::beg // position at file's beginning
ios::cur //current position, for example: ios::cur+5
ios::end // position at file's end
Reading and Writing Data
The fstream classes overload the << and >> operators for all the built-in datatypes as well as std::string and std::complex. The following example shows how to use these operators. First, we open a file, write two fields to it, rewind it and read the previously written fields:
fstream logfile("log.dat", ios::in | ios::out | ios::trunc);
time_t login; string user;
logfile << time(0) << ' ' << "danny" << '\n'; // write a new record
logfile.seekg(0, ios::beg); // rewind the get pointer (seekp moves only the put pointer)
logfile >> login >> user; // read the previously written values
This is a good place to explain the various types of casts. You use
const_cast--to remove the const attribute
static_cast--to convert related types
reinterpret_cast--to convert unrelated types
static_cast--the inverse of implicit conversion. Whenever type T can be implicitly converted to type U (in other words, a T is-a U), you can use static_cast to perform the conversion the other way. For instance, a char can be implicitly converted to an int:
char c = '\n';
int i = c; // implicit conversion
Therefore, when you need to convert an int into a char, use static_cast:
int i = 0x0d;
char c = static_cast<char> (i);
Or, if you have two classes, Base and Derived: public Base, you can implicitly convert pointer to Derived to a pointer to Base (Derived is-a Base). Therefore, you can use static_cast to go the other way:
Base * bp = new Derived; // implicit conversion
Derived * dp = static_cast<Derived *> (bp);
http://it.toolbox.com/blogs/programming-life/lets-serialize-6554
http://code.google.com/apis/protocolbuffers/docs/cpptutorial.html
http://xparam.sourceforge.net/guide/user.html
http://www.gamedev.net/community/forums/topic.asp?topic_id=467545
http://www.parashift.com/c++-faq-lite/serialization.html
If you really wish to save data as 10 bits then you will have to devise ways to pack and unpack the individual values into a set of bytes. An obvious point to note is that four 10-bit words pack into five 8-bit bytes. If you do this, then design and implement a class to handle this data type.
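One way such a packer might look (the function names and the MSB-first layout are my own choices, not any standard format): four values in 0..1023 are concatenated into 40 bits and emitted as five bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pack four 10-bit values into five bytes, most significant bits first.
std::vector<uint8_t> pack10(const uint16_t v[4]) {
    uint64_t bits = 0;
    for (int i = 0; i < 4; ++i) {
        assert(v[i] < 1024);        // each value must fit in 10 bits
        bits = (bits << 10) | v[i]; // append 10 bits
    }
    std::vector<uint8_t> out(5);
    for (int i = 4; i >= 0; --i) {  // split the 40 bits into 5 bytes
        out[i] = bits & 0xFF;
        bits >>= 8;
    }
    return out;
}

// Inverse: rebuild the four 10-bit values from five bytes.
void unpack10(const std::vector<uint8_t>& in, uint16_t v[4]) {
    uint64_t bits = 0;
    for (int i = 0; i < 5; ++i)
        bits = (bits << 8) | in[i];
    for (int i = 3; i >= 0; --i) {
        v[i] = bits & 0x3FF;        // take the low 10 bits
        bits >>= 10;
    }
}
```

Packing then unpacking returns the original four values, which makes the pair easy to unit-test.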
Tag-driven file format; position-oriented format
http://codingplayground.blogspot.com/2009/03/memory-mapped-files-in-boost-and-c.html
http://en.wikipedia.org/wiki/Memory-mapped_file
strerror(errno) in the code below can produce
"Value too large for defined data type"
if fopen() is used on big files, so use fopen64().
#include <stdio.h>
#include <string.h>
#include <errno.h>

int main()
{
    FILE *pFile = fopen64("unexist.ent", "r");
    if (pFile == NULL)
        printf("Error opening file unexist.ent: %s\n", strerror(errno));
    return 0;
}
http://en.wikibooks.org/wiki/Optimizing_C%2B%2B/General_optimization_techniques/Input/Output
File "memory_file.hpp":
#ifndef MEMORY_FILE_HPP
#define MEMORY_FILE_HPP
/*
Read-only memory-mapped file wrapper.
It handles only files that can be wholly loaded
into the address space of the process.
The constructor opens the file, the destructor closes it.
The "data" function returns a pointer to the beginning of the file,
if the file has been successfully opened, otherwise it returns 0.
The "length" function returns the length of the file in bytes,
if the file has been successfully opened, otherwise it returns 0.
*/
class InputMemoryFile {
public:
    InputMemoryFile(const char *pathname);
    ~InputMemoryFile();
    const void* data() const { return data_; }
    unsigned long length() const { return length_; }
private:
    void* data_;
    unsigned long length_;
#if defined(__unix__)
    int file_handle_;
#elif defined(_WIN32)
    typedef void * HANDLE;
    HANDLE file_handle_;
    HANDLE file_mapping_handle_;
#else
    #error Only Posix or Windows systems can use memory-mapped files.
#endif
};
#endif
File "memory_file.cpp":
#include "memory_file.hpp"
#if defined(__unix__)
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h> // needed for fstat()
#elif defined(_WIN32)
#include <windows.h>
#endif
InputMemoryFile::InputMemoryFile(const char *pathname):
    data_(0),
    length_(0),
#if defined(__unix__)
    file_handle_(-1)
{
    file_handle_ = open(pathname, O_RDONLY);
    if (file_handle_ == -1) return;
    struct stat sbuf;
    if (fstat(file_handle_, &sbuf) == -1) return;
    data_ = mmap(0, sbuf.st_size, PROT_READ, MAP_SHARED, file_handle_, 0);
    if (data_ == MAP_FAILED) data_ = 0;
    else length_ = sbuf.st_size;
#elif defined(_WIN32)
    file_handle_(INVALID_HANDLE_VALUE),
    file_mapping_handle_(INVALID_HANDLE_VALUE)
{
    file_handle_ = ::CreateFile(pathname, GENERIC_READ,
        FILE_SHARE_READ, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if (file_handle_ == INVALID_HANDLE_VALUE) return;
    file_mapping_handle_ = ::CreateFileMapping(
        file_handle_, 0, PAGE_READONLY, 0, 0, 0);
    if (file_mapping_handle_ == INVALID_HANDLE_VALUE) return;
    data_ = ::MapViewOfFile(file_mapping_handle_, FILE_MAP_READ, 0, 0, 0);
    if (data_) length_ = ::GetFileSize(file_handle_, 0);
#endif
}

InputMemoryFile::~InputMemoryFile() {
#if defined(__unix__)
    munmap(data_, length_);
    close(file_handle_);
#elif defined(_WIN32)
    ::UnmapViewOfFile(data_);
    ::CloseHandle(file_mapping_handle_);
    ::CloseHandle(file_handle_);
#endif
}
File "memory_file_test.cpp":
#include "memory_file.hpp"
#include <algorithm> // for std::copy
#include <iostream>
#include <iterator>

int main() {
    // Write the contents of the source file to the console.
    InputMemoryFile imf("memory_file_test.cpp");
    if (imf.data())
        std::copy((const char*)imf.data(),
                  (const char*)imf.data() + imf.length(),
                  std::ostream_iterator<char>(std::cout));
    else
        std::cerr << "Can't open the file";
}
http://www.meetingcpp.com/index.php/br/items/word-counting-in-c11-lessons-learned.html
#include <iostream>
#include <fstream>
#include <cstdlib>
#include <cctype>
using namespace std;
void countStuff(istream& in, int& chars, int& words, int& lines) {
char cur = '\0';
char last = '\0';
chars = words = lines = 0;
while (in.get(cur)) {
if (cur == '\n' || (cur == '\f' && last == '\r'))
lines++;
else
chars++;
if (!std::isalnum(cur) && // This is the end of a
std::isalnum(last)) // word
words++;
last = cur;
}
if (chars > 0) { // Adjust word and line
if (std::isalnum(last)) // counts for special
words++; // case
lines++;
}
}
int main(int argc, char** argv) {
if (argc < 2)
return(EXIT_FAILURE);
ifstream in(argv[1]);
if (!in)
exit(EXIT_FAILURE);
int c, w, l;
countStuff(in, c, w, l);
cout << "chars: " << c << '\n';
cout << "words: " << w << '\n';
cout << "lines: " << l << '\n';
}
#include <iostream>
#include <fstream>
#include <map>
#include <string>
typedef std::map<std::string, int> StrIntMap;
void countWords(std::istream& in, StrIntMap& words)
{
std::string s;
while (in >> s)
{ ++words[s]; }
}
int main(int argc, char** argv) {
if (argc < 2)
return(EXIT_FAILURE);
std::ifstream in(argv[1]);
if (!in)
exit(EXIT_FAILURE);
StrIntMap w;
countWords(in, w);
for (StrIntMap::iterator p = w.begin( ); p != w.end( ); ++p)
{
std::cout << p->first << " occurred " << p->second << " times.\n";
}
}
#include <string>
#include <vector>
#include <functional>
#include <iostream>
using namespace std;
void split(const string& s, char c, vector<string>& v)
{
    string::size_type i = 0;
    string::size_type j = s.find(c);
    while (j != string::npos) {
        v.push_back(s.substr(i, j - i));
        i = ++j;
        j = s.find(c, j);
        if (j == string::npos)
            v.push_back(s.substr(i, s.length()));
    }
}
int main( ) {
vector<string> v;
string s = "Account Name|Address 1|Address 2|City";
split(s, '|', v);
for (int i = 0; i < v.size( ); ++i) {
cout << v[i] << '\n';
}
}
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
using namespace std;
void split(const string& s, char c, vector<string>& v) {
    // Use string::size_type and npos; stuffing find()'s result into an
    // int and testing j >= 0 relies on npos narrowing to -1.
    string::size_type i = 0;
    string::size_type j = s.find(c);
    while (j != string::npos) {
        v.push_back(s.substr(i, j - i));
        i = ++j;
        j = s.find(c, j);
        if (j == string::npos)
            v.push_back(s.substr(i, s.length()));
    }
}
void loadCSV(istream& in, vector<vector<string>*>& data) {
    vector<string>* p = NULL;
    string tmp;
    // Test getline() directly; looping on !in.eof() processes
    // the last line twice when the file ends without a newline.
    while (getline(in, tmp, '\n')) {
        p = new vector<string>(); // Grab the next line
        split(tmp, ',', *p);      // Use split() defined above
        data.push_back(p);
        cout << tmp << '\n';
    }
}
int main(int argc, char** argv) {
if (argc < 2)
return(EXIT_FAILURE);
ifstream in(argv[1]);
if (!in)
return(EXIT_FAILURE);
vector<vector<string>*> data;
loadCSV(in, data);
// Go do something useful with the data...
for (vector<vector<string>*>::iterator p = data.begin( ); p != data.end( ); ++p) {
delete *p; // Be sure to de-
} // reference p!
}
#include <iostream>
#include <map>
#include <string>
using namespace std;
int main() {
map<string, int> freq; // map of words and their frequencies
string word; // input buffer for words.
while (cin >> word) {
freq[word]++;
}
//--- Write the count and the word.
map<string, int>::const_iterator iter;
for (iter=freq.begin(); iter != freq.end(); ++iter) {
cout << iter->second << " " << iter->first << endl;
}
return 0;
}
#include <iostream>
#include <fstream>
#include <map>
#include <set>
#include <string>
using namespace std;
int main() {
set<string> ignore; // Words to ignore.
map<string, int> freq; // Map of words and their frequencies
string word; // Used to hold input word.
//-- Read file of words to ignore.
ifstream ignoreFile("ignore.txt");
while (ignoreFile >> word) {
ignore.insert(word);
}
//-- Read words/tokens to count from input stream.
while (cin >> word) {
if (ignore.find(word) == ignore.end()) {
freq[word]++; // Count this. It's not in ignore set.
}
}
//-- Write count/word. Iterator returns key/value pair.
map<string, int>::const_iterator iter;
for (iter=freq.begin(); iter != freq.end(); ++iter) {
cout << iter->second << " " << iter->first << endl;
}
return 0;
}
1)
std::ifstream in("some.file");
std::istreambuf_iterator<char> beg(in), end;
std::string str(beg, end);
2)
std::ifstream in("some.file");
std::ostringstream tmp;
tmp << in.rdbuf();
std::string str = tmp.str();
#include <iostream>
#include <algorithm>
using namespace std;
#define SIZE 5
int main()
{
int array[SIZE] = {20, 11, 13, 6, -9};
sort(array, array+SIZE);
for (int i=0; i<SIZE; i++) {
cout << array[i] << " ";
}
return 0;
}
Given: a delimiter-separated file (comma, space, tab, etc.)
The first column is unique (row #) (we can always generate a unique first column with
awk '{print NR, $0 }' file
Python
To do:
- read data from given column #, or column name
- build word frequency dictionary (for given column #)
- sort file by given column
- return row for given row #
- return all rows where the column value in given column # equals the provided value
import csv
import pprint
import cProfile
def csv2dict(file, delim):
reader = csv.reader(open(file, 'rb'), delimiter=delim)
col_names = reader.next()
#create a dictionary such that {column_name: index position}
col_names = dict([(col.lower(),col_names.index(col))
for col in col_names])
return [dict([(col, row[col_names[col]]) for col in col_names])
for row in reader]
#-------------------------------------
file="input.txt"
# naive way, without the csv class
for line in open(file):
title, year, director = line.split(",")
print year, title
# use the csv class
reader = csv.reader(open(file))
for title, year, director in reader:
print year, title
#put in dictionary
csv2dict(file,',')
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(csv2dict(file,','))
#profile
cProfile.run("csv2dict(file, ',')")
#--------------
f = open(file, 'rt')
try:
reader = csv.DictReader(f)
for row in reader:
print row
finally:
f.close()