Monday, July 25, 2011

SRILM prunes n-grams when n>=3 by default

Recently I used the ngram-count tool from SRILM to extract the n-grams of a corpus.

However, I found that when n>=3, the tool discards low-frequency n-grams by default when it builds the language model.

To get all of the n-grams, use the tool's -write option, which writes the raw n-gram counts to a file. It is the better choice if you only care about the n-grams themselves, not their probabilities.
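
A minimal sketch of the -write usage, assuming SRILM is installed and ngram-count is on the PATH. The file names and the toy corpus are made up for illustration, and the call is guarded so the script does nothing harmful when SRILM is missing.

```shell
# Hypothetical working directory and file names, for illustration only.
workdir=$(mktemp -d)
corpus="$workdir/corpus.txt"
counts="$workdir/counts.txt"

# A tiny toy corpus: one sentence per line, tokens separated by spaces.
printf 'a b c d\na b c e\n' > "$corpus"

# Run only if SRILM's ngram-count is available.
if command -v ngram-count >/dev/null 2>&1; then
    # -order 3 counts n-grams up to trigrams;
    # -write saves the raw counts, and no language model is estimated.
    ngram-count -text "$corpus" -order 3 -write "$counts"
    cat "$counts"
fi
```

If the language model itself should keep rare trigrams, the minimum-count options (for example -gt3min 1) are, as far as I know, the relevant setting. Check the ngram-count man page for your SRILM version before relying on that.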

Tuesday, July 5, 2011

Does sed support lookahead or lookbehind on Linux?

After investigating for a while, I found that sed does not support lookahead or lookbehind assertions.

According to http://sed.sourceforge.net/sedfaq6.html, a modified version of sed named ssed supports them in its Perl mode.
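
For simple positive cases, plain sed can often get by without assertions: capture the surrounding context in a group and put it back in the replacement. This is a common workaround rather than something taken from the FAQ, and it does not cover negative assertions. The sample strings below are made up.

```shell
# Emulate a positive lookahead: replace "foo" only when followed by "bar".
# The trailing "bar" is captured and restored with \1.
ahead=$(printf 'foobar foobaz\n' | sed 's/foo\(bar\)/X\1/g')
echo "$ahead"    # prints: Xbar foobaz

# Emulate a positive lookbehind: replace "bar" only when preceded by "foo".
# The leading "foo" is captured and restored with \1.
behind=$(printf 'foobar bazbar\n' | sed 's/\(foo\)bar/\1X/g')
echo "$behind"   # prints: fooX bazbar

# With ssed, the Perl-mode equivalent would presumably look like
#   ssed -R 's/foo(?=bar)/X/g'
# (not run here; treat the exact invocation as an assumption).
```

One difference from a real assertion: the captured context is consumed by the match, so overlapping occurrences on the same line may need a second pass.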