Starting with this page:
http://ultimania.org/sen/hiki.cgi
Sen is the Java port of MeCab: https://sen.dev.java.net/
First, grab the library and unpack it:
wget https://sen.dev.java.net/files/documents/1373/31864/sen-1.2.2.1.zip
unzip sen-1.2.2.1.zip
cd sen-1.2.2.1
Then install the JDK, export JAVA_HOME so that it points to the JRE directory, and build with:
ant
Compiling the demo fails at first because the Apache Commons Logging libraries are missing:
sen/sen-1.2.2.1/src/java$ javac ProcessorDemo.java
./net/java/sen/StringTagger.java:42: package org.apache.commons.logging does not exist
import org.apache.commons.logging.Log;
They can be downloaded from commons.apache.org/logging:
commons-logging-1.1.1-src.tar.gz
tar xvfz commons-logging-1.1.1-src.tar.gz
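At this point the demo compiles once the classpath points to both the Apache Commons Logging sources and the sen jar file. Note that classpath entries are separated by a colon on Linux; a semicolon would end the shell command early.
slioch@build:~/sen_test/sen-1.2.2.1# javac demo/ProcessorDemo.java -cp ./commons-logging-1.1.1-src/src/java:./lib/sen.jar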
Now for some runtime fun. Create a script file (run) that holds the command. The two arguments to the demo are the Japanese text file to feed to the processor and its character encoding (utf-8 in this case). The "-Dsen.home=." flag sets the sen.home system property, which tells Sen where to find the Japanese dictionaries.
java -Dsen.home=. -Xmx300m -cp lib/sen.jar:lib/commons-logging.jar:./build/classes/ ProcessorDemo japanese-file.txt utf-8
Which results in the following:
slioch@build:~/sen_test/sen-1.2.2.1# ./run random-jap.txt utf-8
Dec 13, 2010 2:21:58 PM net.java.sen.Dictionary INFO: token file = ./dic/token.sen
java.lang.IllegalArgumentException: Tokenizer Class: net.java.sen.ja.JapaneseTokenizer is invalid.
at net.java.sen.StringTagger.init(StringTagger.java:158)
at net.java.sen.StringTagger.
at net.java.sen.StringTagger.getInstance(StringTagger.java:133)
at net.java.sen.StreamTagger.
at ProcessorDemo.main(ProcessorDemo.java:72)
We're missing the Japanese token file, which should be at ./dic/token.sen. The dictionary hasn't been built yet (for some reason the build at the root of the project doesn't compile it automatically), so build it:
cd dic && ant
Once the dictionary is built, rerunning the demo from the project root tokenizes the input:
Dec 14, 2010 9:02:39 PM net.java.sen.Tokenizer loadConnectCost
INFO: time to load connect cost file = 183[ms]
記号-空白 0 1 581
従 未知語 1 2 31577
( 記号-括弧開 2 3 33354
じ 助動詞 3 4 43354
ゅ 未知語 4 5 81354
) 記号-括弧閉 5 6 82759
四 名詞-数 6 7 84986
位 名詞-接尾-助数詞 7 8 86706
下 名詞-接尾-一般 8 9 89574
( 記号-括弧開 9 10 90926
い 動詞-自立 10 11 94046
の 助詞-連体化 11 12 96216
げ 名詞-接尾-一般 12 13 101578
) 記号-括弧閉 13 14 102937
......
One morpheme per line: surface form, part of speech, start offset, end offset, and cost.
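If you want to post-process this output, each line can be split back into its fields. The following is only a sketch: it assumes the fields are tab-separated, as in the demo's print statement, and the class and field names (MorphemeLines, Morpheme) are made up for illustration and are not part of Sen.

```java
import java.util.ArrayList;
import java.util.List;

public class MorphemeLines {
    /** One row of the demo output. Illustrative holder type, not a Sen class. */
    static final class Morpheme {
        final String surface;
        final String pos;
        final int start;
        final int end;
        final long cost;

        Morpheme(String surface, String pos, int start, int end, long cost) {
            this.surface = surface;
            this.pos = pos;
            this.start = start;
            this.end = end;
            this.cost = cost;
        }
    }

    /** Parses one tab-separated line: surface, part of speech, start, end, cost. */
    static Morpheme parse(String line) {
        // Limit -1 keeps empty trailing fields instead of dropping them.
        String[] f = line.split("\t", -1);
        if (f.length < 5) {
            throw new IllegalArgumentException(
                "expected at least 5 tab-separated fields: " + line);
        }
        return new Morpheme(f[0], f[1],
            Integer.parseInt(f[2]), Integer.parseInt(f[3]), Long.parseLong(f[4]));
    }

    public static void main(String[] args) {
        // Two rows copied from the output above, with tabs restored.
        String[] sample = {
            "四\t名詞-数\t6\t7\t84986",
            "位\t名詞-接尾-助数詞\t7\t8\t86706"
        };
        List<Morpheme> morphemes = new ArrayList<>();
        for (String line : sample) {
            morphemes.add(parse(line));
        }
        for (Morpheme m : morphemes) {
            System.out.println(m.surface + " [" + m.pos + "] chars " + m.start + "-" + m.end);
        }
    }
}
```

Splitting on the tab character rather than on any whitespace matters here, because the surface form of a token can itself be a space, as in the first row of the output.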
Thanks for your explanation here.
I too am trying to do this and am having some trouble creating token.sen
The line:
cd dic && ant
didn't do it for me.
Can you give me a few more details about how to build this file? I have ant installed correctly.
Hi Paul,
When you cd into the dic directory, you should see:
>ls
build.xml
da.sen
ipa2mecab.pl
matrix.sen
compound.pl
dic.csv
ipadic-2.6.0
posInfo.sen
connect.csv
dictionary.properties
ipadic-2.6.0.tar.gz
token.sen
And see something like this when you build:
ant
Buildfile: build.xml
prepare-proxy:
prepare-archive:
prepare-dics0:
prepare-dics:
download:
melt:
prepare:
dics0:
create:
[java] [INFO] MkSenDic - (1/7): reading connection matrix ...
[java] [INFO] MkSenDic - connection file = connect.csv
[java] [INFO] MkSenDic - charset = EUC_JP
[java] [INFO] MkSenDic - (2/7): building type dictionary ...
[java] [INFO] MkSenDic - (3/7): writing conection matrix (5 x 1281 x 701 = 4489905) ...
[java] [INFO] MkSenDic - (4/7): reading morpheme information ...
[java] [INFO] MkSenDic - load dic: dic.csv
[java] [INFO] MkSenDic - 50000...
[java] [INFO] MkSenDic - 100000...
[java] [INFO] MkSenDic - 150000...
[java] [INFO] MkSenDic - 200000...
[java] [INFO] MkSenDic - 250000...
[java] [INFO] MkSenDic - 300000...
[java] [INFO] MkSenDic - 350000...
[java] [INFO] MkSenDic - (5/7): sorting lex...
[java] [INFO] MkSenDic - (6/7): writing token...
[java] [INFO] MkSenDic - key size = 378227
[java] [INFO] MkSenDic - (7/7): building Double-Array (size = 325254) ...
[java] [INFO] DoubleArrayTrie - save time = 0.661[s]
[java] [INFO] MkSenDic - total time = 69[ms]
BUILD SUCCESSFUL
Total time: 1 minute 11 seconds
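If you're unsure whether the build actually finished, a small check like the one below can confirm that the files are there before you run the demo. It's only a sketch: the class name DictionaryCheck is made up, and it assumes that the four .sen files in the listing above are the ones the build is supposed to produce.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class DictionaryCheck {
    // Names taken from the directory listing above. That these four files are
    // exactly what Sen needs at runtime is an assumption.
    static final String[] EXPECTED = { "da.sen", "matrix.sen", "posInfo.sen", "token.sen" };

    /** Returns the expected files that are missing or empty under dicDir. */
    static List<String> missing(Path dicDir) {
        List<String> result = new ArrayList<>();
        for (String name : EXPECTED) {
            Path p = dicDir.resolve(name);
            try {
                if (!Files.isRegularFile(p) || Files.size(p) == 0) {
                    result.add(name);
                }
            } catch (IOException e) {
                // An unreadable file is treated the same as a missing one.
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Same property the demo is started with (-Dsen.home=.).
        String senHome = System.getProperty("sen.home", ".");
        List<String> missing = missing(Paths.get(senHome, "dic"));
        if (missing.isEmpty()) {
            System.out.println("Dictionary files present under " + senHome + "/dic");
        } else {
            System.err.println("Missing or empty: " + missing + " (try: cd dic && ant)");
            System.exit(1);
        }
    }
}
```

Run it from the project root with the same -Dsen.home=. flag as the demo so that both look in the same place.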
Wow, thanks for the prompt feedback!
I was timing out on the tar download and I had the perl pointing to the wrong place in the build.xml.
It works now. What a great utility!
While I've got you on line, how do I pipe in say, a text file and pipe out the output to another file?
Hmmm, I suppose I'll be using it in a Java program and there are a few examples there...
I'll have a look at those.
Thanks again for the clear explanation.
You just need to edit the file "ProcessorDemo.java". Inside the tagger.hasNext() loop you'll see where each token is printed.
while (tagger.hasNext()) {
    Token token = tagger.next();
    System.out.println(token.getSurface() + "\t" + token.getPos()
            + "\t" + token.start() + "\t" + token.end() + "\t"
            + token.getCost() + "\t" + token.getAddInfo());
}
You can change what this print statement emits and redirect the program's output to a file, or change the code so that it writes to a file directly.
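One way to write to a file without touching the loop itself is to point System.out at the file before the loop runs. This is a minimal sketch using only the standard library; the class name RedirectOutput, the helper redirectStdout, and the default file name tokens.txt are made up for illustration, and the single println stands in for the tagger loop.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;

public class RedirectOutput {
    /** Sends everything printed via System.out to the given file, encoded as UTF-8. */
    static PrintStream redirectStdout(String path) throws IOException {
        PrintStream out = new PrintStream(new FileOutputStream(path), true, "UTF-8");
        System.setOut(out);
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default output name; pass a path as the first argument to override.
        String outputPath = args.length > 0 ? args[0] : "tokens.txt";
        PrintStream original = System.out;
        try (PrintStream out = redirectStdout(outputPath)) {
            // In ProcessorDemo, the tagger.hasNext() loop would run here unchanged.
            System.out.println("surface\tpos\tstart\tend\tcost");
        } finally {
            // Restore the console so later messages are visible again.
            System.setOut(original);
        }
        System.out.println("Wrote " + outputPath);
    }
}
```

Setting the encoding explicitly keeps the Japanese output readable regardless of the platform default. If you would rather not change the code at all, shell redirection on the run script (for example ./run input.txt utf-8 > output.txt, with file names of your choosing) does the same job.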