Software Engineering Crunch and more...: Lost in (Japanese) translation

Tuesday, December 14, 2010

Lost in (Japanese) translation

A while back I needed to support automated conversion of English text into Japanese script in some Java code. Fortunately, there was a project available that supports this kind of processing. While not particularly difficult, it did require some configuration voodoo and the usual troubleshooting to finally get the library up and running. The steps presented here are pretty simple, but the initial setup took a bit longer (primarily because most of the docs and hints for working with this library were in Japanese), so hopefully this will save somebody some time in the future. The library used was Sen, a Java port of the MeCab library, which splits Japanese text into morphemes and tags each one with its part of speech.

Starting with this page:

http://ultimania.org/sen/hiki.cgi

The Java port of MeCab is hosted here: https://sen.dev.java.net/
First, grab the library:

wget https://sen.dev.java.net/files/documents/1373/31864/sen-1.2.2.1.zip

unzip sen-1.2.2.1.zip

cd sen-1.2.2.1

Then install the JDK, export JAVA_HOME to the jre directory, and build with:

ant

Compiling the demo by hand fails at first because the Apache Commons Logging library is missing:

sen/sen-1.2.2.1/src/java$ javac ProcessorDemo.java

./net/java/sen/StringTagger.java:42: package org.apache.commons.logging does not exist

import org.apache.commons.logging.Log;

These are downloaded from commons.apache.org/logging:

commons-logging-1.1.1-src.tar.gz

tar xvfz commons-logging-1.1.1-src.tar.gz

At this point the demo compiles once the classpath points to the Apache logging sources and the sen jar file.

slioch@build:~/sen_test/sen-1.2.2.1# javac demo/ProcessorDemo.java -cp ./commons-logging-1.1.1-src/src/java:./lib/sen.jar
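
If it is unclear whether both entries actually made it onto the classpath, a small check along these lines can help. This is only a sketch: the class name ClasspathCheck is made up for this post, and the two class names it probes are the ones that appear in the compiler error above.

```java
public class ClasspathCheck {

    /** Returns true if the named class can be located on the current classpath. */
    static boolean isLoadable(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the compiler error shown earlier in the post.
        String[] required = {
            "org.apache.commons.logging.Log",
            "net.java.sen.StringTagger"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isLoadable(name) ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same -cp value you pass to javac. Any line reporting MISSING points at the classpath entry that needs fixing.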

Now time for some runtime fun. Create a script file (run) to hold our command. The two arguments to the demo are the input text file for the processor (Japanese text here) and its character encoding (utf-8 in this case). The "-Dsen.home=." flag sets the sen.home system property, which tells the library where to find the Japanese dictionaries.

java -Dsen.home=. -Xmx300m -cp lib/sen.jar:lib/commons-logging.jar:./build/classes/ ProcessorDemo japanese-file.txt utf-8

Which results in the following:

slioch@build:~/sen_test/sen-1.2.2.1# ./run random-jap.txt utf-8

Dec 13, 2010 2:21:58 PM net.java.sen.Dictionary
INFO: token file = ./dic/token.sen
java.lang.IllegalArgumentException: Tokenizer Class: net.java.sen.ja.JapaneseTokenizer is invalid.
at net.java.sen.StringTagger.init(StringTagger.java:158)
at net.java.sen.StringTagger.<init>(StringTagger.java:95)
at net.java.sen.StringTagger.getInstance(StringTagger.java:133)
at net.java.sen.StreamTagger.<init>(StreamTagger.java:92)
at ProcessorDemo.main(ProcessorDemo.java:72)

We're missing the Japanese token file, which should be at ./dic/token.sen. This is because the dictionary hasn't been built yet (for some reason the build at the root of the project doesn't compile it automatically):

cd dic && ant

This will build the missing token.sen dictionary file needed. Now we are off and running again.
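
To catch a missing dictionary before the tagger throws, a preflight check like the following could be run first. It is a sketch under assumptions: the class name DictionaryCheck is made up, the four file names come from a listing of a built dic directory, and treating all four as required at runtime is a guess rather than something the Sen docs confirm.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class DictionaryCheck {

    // Compiled dictionary files seen in a built dic directory.
    // Assumption: all four are needed at runtime.
    static final String[] EXPECTED = {"token.sen", "da.sen", "matrix.sen", "posInfo.sen"};

    /** Returns the expected dictionary files that are absent from senHome/dic. */
    static List<String> missingFiles(File senHome) {
        File dicDir = new File(senHome, "dic");
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!new File(dicDir, name).isFile()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Same property the run script sets with -Dsen.home=.
        String home = System.getProperty("sen.home", ".");
        List<String> missing = missingFiles(new File(home));
        if (missing.isEmpty()) {
            System.out.println("All dictionary files present under " + home + "/dic");
        } else {
            System.out.println("Missing under " + home + "/dic: " + missing
                    + " (try: cd dic && ant)");
        }
    }
}
```

Once the check reports nothing missing, rerunning ./run prints the analysis: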

Dec 14, 2010 9:02:39 PM net.java.sen.Tokenizer loadConnectCost
INFO: time to load connect cost file = 183[ms]
　    記号-空白    0    1    581
従    未知語      1    2    31577
（    記号-括弧開 2    3    33354
じ    助動詞      3    4    43354
ゅ    未知語      4    5    81354
）    記号-括弧閉 5    6    82759
四    名詞-数     6    7    84986
位    名詞-接尾-助数詞    7    8    86706
下    名詞-接尾-一般    8    9    89574
（    記号-括弧開    9    10    90926
い    動詞-自立      10    11    94046
の    助詞-連体化    11    12    96216
げ    名詞-接尾-一般 12    13    101578
）    記号-括弧閉    13    14    102937
......

One morpheme per line: surface form, part of speech, start offset, end offset, and cost.
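
If another program needs to consume this output, each line can be split back into its fields. The sketch below assumes the columns are tab-separated, as in the demo's print statement, and ignores anything after the fifth column. The class name MorphemeLine is made up for this post and is not part of Sen.

```java
public class MorphemeLine {
    final String surface; // the text of the morpheme
    final String pos;     // part-of-speech tag
    final int start;      // start offset in the input
    final int end;        // end offset in the input
    final int cost;       // cost reported by the tagger

    MorphemeLine(String surface, String pos, int start, int end, int cost) {
        this.surface = surface;
        this.pos = pos;
        this.start = start;
        this.end = end;
        this.cost = cost;
    }

    /** Parses one tab-separated output line; extra trailing fields are ignored. */
    static MorphemeLine parse(String line) {
        String[] f = line.split("\t", -1);
        if (f.length < 5) {
            throw new IllegalArgumentException(
                    "expected at least 5 tab-separated fields: " + line);
        }
        return new MorphemeLine(f[0], f[1],
                Integer.parseInt(f[2].trim()),
                Integer.parseInt(f[3].trim()),
                Integer.parseInt(f[4].trim()));
    }

    public static void main(String[] args) {
        // Sample row taken from the output above, rebuilt with tab separators.
        MorphemeLine m = parse("四\t名詞-数\t6\t7\t84986");
        System.out.println(m.surface + " [" + m.pos + "] " + m.start + ".." + m.end
                + " cost=" + m.cost);
    }
}
```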

5 comments:

Paul, September 15, 2011 at 6:33 PM
Thanks for your explanation here.
I too am trying to do this and am having some trouble creating token.sen
The line:
cd dic && ant
didn't do it for me.
Can you give me a few details more about how to build this file. I have ant installed correctly.

Michael Larson, September 15, 2011 at 6:46 PM
Hi Paul,

When you cd into the dic directory--you should see:

>ls
build.xml
da.sen
ipa2mecab.pl
matrix.sen
compound.pl
dic.csv
ipadic-2.6.0
posInfo.sen
connect.csv
dictionary.properties
ipadic-2.6.0.tar.gz
token.sen

And see something like this when you build:

ant
Buildfile: build.xml

prepare-proxy:

prepare-archive:

prepare-dics0:

prepare-dics:

download:

melt:

prepare:

dics0:

create:
[java] [INFO] MkSenDic - (1/7): reading connection matrix ...
[java] [INFO] MkSenDic - connection file = connect.csv
[java] [INFO] MkSenDic - charset = EUC_JP
[java] [INFO] MkSenDic - (2/7): building type dictionary ...
[java] [INFO] MkSenDic - (3/7): writing conection matrix (5 x 1281 x 701 = 4489905) ...
[java] [INFO] MkSenDic - (4/7): reading morpheme information ...
[java] [INFO] MkSenDic - load dic: dic.csv
[java] [INFO] MkSenDic - 50000...
[java] [INFO] MkSenDic - 100000...
[java] [INFO] MkSenDic - 150000...
[java] [INFO] MkSenDic - 200000...
[java] [INFO] MkSenDic - 250000...
[java] [INFO] MkSenDic - 300000...
[java] [INFO] MkSenDic - 350000...
[java] [INFO] MkSenDic - (5/7): sorting lex...
[java] [INFO] MkSenDic - (6/7): writing token...
[java] [INFO] MkSenDic - key size = 378227
[java] [INFO] MkSenDic - (7/7): building Double-Array (size = 325254) ...
[java] [INFO] DoubleArrayTrie - save time = 0.661[s]
[java] [INFO] MkSenDic - total time = 69[ms]

BUILD SUCCESSFUL
Total time: 1 minute 11 seconds

Paul, September 15, 2011 at 7:35 PM
Wow, thanks for the prompt feedback!
I was timing out on the tar download and I had the perl pointing to the wrong place in the build.xml
It works now. What a great utility!
While I've got you on line, how do I pipe in say, a text file and pipe out the output to another file?

Paul, September 15, 2011 at 7:44 PM
Hmmm, I suppose I'll be using it in a Java program and there are a few examples there...
I'll have a look at those.

Thanks again for the clear explanation.

Michael Larson, September 15, 2011 at 8:35 PM
You just need to edit the file "ProcessorDemo.java". Inside the tagger.hasNext() loop you'll see where each token is printed.

while (tagger.hasNext()) {
    Token token = tagger.next();
    System.out.println(token.getSurface() + "\t" + token.getPos()
            + "\t" + token.start() + "\t" + token.end() + "\t"
            + token.getCost() + "\t" + token.getAddInfo());
}

You can leave the print statement as it is and redirect the program's output to a file, or change the print to write to a file directly.
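
For the second option, here is a rough sketch of writing rows to a UTF-8 file instead of System.out. TokenFileWriter, formatRow and writeRows are hypothetical names, not part of Sen. In ProcessorDemo the values would come from token.getSurface(), token.getPos(), token.start(), token.end() and token.getCost() inside the loop above.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class TokenFileWriter {

    /** Builds one tab-separated row in the same column order the demo prints. */
    static String formatRow(String surface, String pos, int start, int end, int cost) {
        return surface + "\t" + pos + "\t" + start + "\t" + end + "\t" + cost;
    }

    /** Writes the rows to the given file as UTF-8, one row per line. */
    static void writeRows(Path out, List<String> rows) throws IOException {
        // An explicit charset keeps the Japanese text intact whatever the platform default is.
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (String row : rows) {
                w.write(row);
                w.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        List<String> rows = new ArrayList<>();
        // Stand-in for the tagger loop: values copied from the sample output in the post.
        rows.add(formatRow("四", "名詞-数", 6, 7, 84986));
        rows.add(formatRow("位", "名詞-接尾-助数詞", 7, 8, 86706));
        Path out = Paths.get(args.length > 0 ? args[0] : "tokens.tsv");
        writeRows(out, rows);
        System.out.println("Wrote " + rows.size() + " rows to " + out);
    }
}
```

Writing through a writer with an explicit charset avoids depending on the shell's locale, which plain output redirection would inherit.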