So, given a spare moment I wrote carved out a little chunk of code--probably not the absolute smallest in number of lines, but compact none the less. I'm sure a version in Perl would end up being far more compact, but then it would be written in Perl.
The code ends up looking simple and neat, which is one of things I love about using the STL.
See for yourself. The encoding calculation is below, with what is really just a scan for the frequency of occurrence of an ascii character.
vector ct_coll(256); for (int i = 0; i < input.length(); ++i) { ct_coll[input[i]]._c = input[i]; ++ct_coll[input[i]]; } //now sort and build tree multiset freq_coll; copy(ct_coll.begin(),ct_coll.end(),inserter(freq_coll,freq_coll.begin())); while (freq_coll.size() > 1) { freq_coll.insert(Data(new Data(*freq_coll.begin()),new Data(*++freq_coll.begin()))); freq_coll.erase(freq_coll.begin());freq_coll.erase(freq_coll.begin()); } //assign value Data d = *freq_coll.begin(); assign_code(d);
And that's pretty much it. The chunk above does the following:
* count the frequency of occurrence in a vector container
* sorts by occurrence
* builds the binary tree from the bottom up
* recurses from the top down and builds the huffman code
Finito!
These two lines deal with the tree data structure:
freq_coll.insert(Data(new Data(*freq_coll.begin()),new Data(*++freq_coll.begin()))); freq_coll.erase(freq_coll.begin());freq_coll.erase(freq_coll.begin());
What is happending here is a new Data object is made up of the first two elements of the freq_coll collection. This collection is sorted from least frequent to most frequent occurrance. Therefore, the two least frequent elements are combined to create a new binary node, which is then inserted back into the freq_coll collection. When the new Data object is inserted back into the tree this object's key is the sum of the two Data elements frequency value.
These two nodes are then removed since they will still remain at the top of the freq_coll tree (since the new Data object is the sum of the two frequencies of occurrence).
I also wanted to remove the assignment of the character from the frequency loop (first line in the loop below), since this is only needed to be done once, and ideally it should be done (or hidden) when initialized, but I couldn't come up with a way to define this in compact form.
for (int i = 0; i < input.length(); ++i) { ct_coll[input[i]]._c = input[i]; ++ct_coll[input[i]]; }Boost didn't appear to have anything either--sure I could initialize in another loop, or explicitly set each 255 characters. But that doesn't reduce the overall line count. The full source is below:
#include <stdlib.h> #include <stdio.h> #include <string> #include <algorithm> #include <vector> #include <set> #include <iostream> using namespace std; class Data { public: Data() : _c(0),_f(0),_code(0),_r(NULL),_l(NULL) {} Data(char c) : _c(c),_f(0),_code(0),_r(NULL),_l(NULL) {} Data(char c,int f) : _c(c),_f(f),_code(0),_r(NULL),_l(NULL) {} Data(Data *d1,Data *d2) : _c(0),_f((d1->_f)+(d2->_f)),_code(0),_r(d1),_l(d2) {} Data& operator++() {_f++;} friend ostream& operator<<(ostream& output, const Data& p); char _c; int _f; string _readable_code; int _code; Data *_r; Data *_l; }; ostream& operator<<(ostream& output, const Data& d) { char buf[256]; sprintf(buf,"Data: '%c' %d [%s],%X\n",d._c,d._f,d._readable_code.c_str(),d._code); output << buf; return output; // for multiple << operators. } class freq_cmp { public: bool operator()(const Data i1, const Data i2) const { //reverse sort, less likely to most likely return (i1._f < i2._f); } }; void assign_code(Data &d) { if (d._r) { d._r->_readable_code = d._readable_code + string("1"); d._r->_code = (d._code) << 1; d._r->_code |= 1; assign_code(*d._r); } if (d._l) { d._l->_readable_code = d._readable_code + string("0"); d._l->_code = (d._code) << 1; assign_code(*d._l); } } void print_data(Data &d) { if (d._c != 0) { cout << d; } if (d._r) { print_data(*d._r); } if (d._l) { print_data(*d._l); } } void usage() { cout << "huff -f" << endl; cout << " -f [file] configuration file" << endl; cout << " -h this help" << endl; cout << endl; } int main(int argc, char* argv[]) { string input; int ch; char *file = NULL; while ((ch = getopt(argc, argv, "f:h")) != -1) { switch (ch) { case 'f': file = optarg; break; case 'h': usage(); exit(0); } } if (file == NULL) { exit(1); } FILE *fp = fopen(file,"r"); if (fp) { char str[1025]; while (fgets(str, 1024, fp)) { input += string(str); } fclose(fp); } if (input.empty()) { exit(1); } //Encoding starts here vector<Data> ct_coll(256); for (int i = 0; i < input.length(); ++i) { ct_coll[input[i]]._c = input[i]; ++ct_coll[input[i]]; } //now sort and build tree multiset<Data,freq_cmp> freq_coll; copy(ct_coll.begin(),ct_coll.end(),inserter(freq_coll,freq_coll.begin())); while (freq_coll.size() > 1) { freq_coll.insert(Data(new Data(*freq_coll.begin()),new Data(*++freq_coll.begin()))); freq_coll.erase(freq_coll.begin());freq_coll.erase(freq_coll.begin()); } //assign value Data d = *freq_coll.begin(); assign_code(d); //Encoding done //TODO: now need to convert stream... //and print... print_data(d); }
No comments:
Post a Comment