Tuesday, December 14, 2010

Lost in (Japanese) translation

A while back I needed to support automated conversion of English text into Japanese script from Java code. Fortunately, there was a project available that supports just such a conversion. While not particularly difficult, getting the library up and running did require some configuration voodoo and the usual troubleshooting. The steps presented here are pretty simple, but the initial setup took a bit longer than expected (primarily because most of the docs and hints for working with this library were in Japanese), so hopefully this will save somebody some time in the future. The library used was Sen, a Java library that is a port of the MeCab C++ libraries to support translation.

Starting with this page:

http://ultimania.org/sen/hiki.cgi



Here is the Java port of MeCab: https://sen.dev.java.net/
First grab the library at:

wget https://sen.dev.java.net/files/documents/1373/31864/sen-1.2.2.1.zip
unzip sen-1.2.2.1.zip
cd sen-1.2.2.1

Then install the JDK, export JAVA_HOME pointing to the JRE directory, and build with:

ant


Attempting to compile the demo fails because the Apache Commons Logging libraries are missing:

sen/sen-1.2.2.1/src/java$ javac ProcessorDemo.java

./net/java/sen/StringTagger.java:42: package org.apache.commons.logging does not exist
import org.apache.commons.logging.Log;


These are downloaded from commons.apache.org/logging:

commons-logging-1.1.1-src.tar.gz
tar xvfz commons-logging-1.1.1-src.tar.gz


At this point the build will work once the classpath points to the Apache Commons Logging sources and the Sen jar file.

slioch@build:~/sen_test/sen-1.2.2.1# javac demo/ProcessorDemo.java -cp ./commons-logging-1.1.1-src/src/java:./lib/sen.jar
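
The classpath separator is platform-specific: ":" on Unix-like systems and ";" on Windows. When the classpath is assembled from Java (for example in a small build helper), the separator can be taken from the standard library instead of being hard-coded. This is only a sketch: ClasspathBuilder is a hypothetical helper name, and the entries are the ones from the command above.

```java
import java.io.File;

// Hypothetical helper, not part of Sen or Ant: joins classpath entries
// with the separator of the platform the JVM is running on.
final class ClasspathBuilder {

    static String join(String... entries) {
        // File.pathSeparator is ":" on Unix-like systems and ";" on Windows.
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        // Entries taken from the javac command in this post.
        System.out.println(join("./commons-logging-1.1.1-src/src/java", "./lib/sen.jar"));
    }
}
```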


Now time for some runtime fun. Create a script (run) to hold our command. The two arguments to the demo are the input text file for the processor and its character encoding (utf-8 in this case). The "-Dsen.home=." flag sets the sen.home system property, which tells the library where to find the Japanese dictionaries.

java -Dsen.home=. -Xmx300m -cp lib/sen.jar:lib/commons-logging.jar:./build/classes/ ProcessorDemo japanese-file.txt utf-8


Which results in the following:

slioch@build:~/sen_test/sen-1.2.2.1# ./run random-jap.txt utf-8
Dec 13, 2010 2:21:58 PM net.java.sen.Dictionary <init>
INFO: token file = ./dic/token.sen
java.lang.IllegalArgumentException: Tokenizer Class: net.java.sen.ja.JapaneseTokenizer is invalid.
at net.java.sen.StringTagger.init(StringTagger.java:158)
at net.java.sen.StringTagger.<init>(StringTagger.java:95)
at net.java.sen.StringTagger.getInstance(StringTagger.java:133)
at net.java.sen.StreamTagger.<init>(StreamTagger.java:92)
at ProcessorDemo.main(ProcessorDemo.java:72)


We're missing the Japanese token file, which should be at ./dic/token.sen. This is because the dictionary hasn't been built yet (for some reason the build at the root of the project doesn't compile it automatically):

cd dic && ant 


This will build the missing token.sen dictionary file needed. Now we are off and running again. 
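
Because the failure above surfaces only as a generic "Tokenizer Class ... is invalid" error, a quick check that the compiled dictionary is in place can save some head-scratching. The following is a minimal sketch, assuming the dictionary lives under the dic directory of sen.home as the log line suggests. DictionaryCheck is a hypothetical helper rather than anything Sen provides, and the list of file names is an assumption based on the log message and the contents of a built dic directory.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical preflight check; not part of the Sen library.
final class DictionaryCheck {

    // Assumed file names: token.sen comes from the log message, the others
    // from the listing of a built dic directory.
    static final String[] EXPECTED = {"token.sen", "da.sen", "matrix.sen", "posInfo.sen"};

    // Returns the names of expected dictionary files missing under <senHome>/dic.
    static List<String> missingFiles(File senHome) {
        List<String> missing = new ArrayList<String>();
        File dic = new File(senHome, "dic");
        for (String name : EXPECTED) {
            if (!new File(dic, name).isFile()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Reads the same property the run script sets with -Dsen.home=.
        File home = new File(System.getProperty("sen.home", "."));
        List<String> missing = missingFiles(home);
        if (missing.isEmpty()) {
            System.out.println("dictionary looks complete");
        } else {
            System.out.println("missing under " + new File(home, "dic") + ": "
                    + missing + " (try: cd dic && ant)");
        }
    }
}
```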


Dec 14, 2010 9:02:39 PM net.java.sen.Tokenizer loadConnectCost
INFO: time to load connect cost file = 183[ms]
     記号-空白    0    1    581   
従    未知語      1    2    31577   
(    記号-括弧開  2    3    33354   
じ    助動詞      3    4    43354   
ゅ    未知語      4    5    81354   
)    記号-括弧閉  5    6    82759   
四    名詞-数     6    7    84986   
位    名詞-接尾-助数詞    7    8    86706   
下    名詞-接尾-一般    8    9    89574   
(    記号-括弧開    9    10    90926   
い    動詞-自立      10    11    94046   
の    助詞-連体化    11    12    96216   
げ    名詞-接尾-一般  12    13    101578   
)    記号-括弧閉    13    14    102937   
......




The output has one morpheme per line: surface form, part of speech, start and end offsets, and cost.
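
For downstream processing, each output line can be split back into its fields. This is a minimal sketch, assuming the tab-separated layout produced by the demo's print statement (surface, part of speech, start, end, cost, with anything after that ignored). MorphemeLine and its parse method are hypothetical names and not part of the Sen library.

```java
// Hypothetical helper, not part of the Sen library: one parsed line of demo output.
final class MorphemeLine {
    final String surface;      // morpheme text as it appeared in the input
    final String partOfSpeech; // part-of-speech tag printed by the demo
    final int start;           // start offset printed by the demo
    final int end;             // end offset printed by the demo
    final int cost;            // cost printed by the demo

    MorphemeLine(String surface, String partOfSpeech, int start, int end, int cost) {
        this.surface = surface;
        this.partOfSpeech = partOfSpeech;
        this.start = start;
        this.end = end;
        this.cost = cost;
    }

    // Assumes fields are separated by single tabs; extra trailing fields are ignored.
    static MorphemeLine parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length < 5) {
            throw new IllegalArgumentException(
                    "expected at least 5 tab-separated fields: " + line);
        }
        return new MorphemeLine(fields[0], fields[1],
                Integer.parseInt(fields[2].trim()),
                Integer.parseInt(fields[3].trim()),
                Integer.parseInt(fields[4].trim()));
    }

    public static void main(String[] args) {
        // Sample line modeled on the output above: U+56DB tagged as a noun (number).
        // Unicode escapes keep this source file ASCII-only.
        MorphemeLine m = parse("\u56db\t\u540d\u8a5e-\u6570\t6\t7\t84986\t");
        System.out.println(m.surface + " [" + m.partOfSpeech + "] "
                + m.start + "-" + m.end + " cost=" + m.cost);
    }
}
```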

5 comments:

  1. Thanks for your explanation here.
    I too am trying to do this and am having some trouble creating token.sen.
    The line:
    cd dic && ant
    didn't do it for me.
    Could you give me a few more details about how to build this file? I have ant installed correctly.

  2. Hi Paul,

    When you cd into the dic directory, you should see:

    >ls
    build.xml
    da.sen
    ipa2mecab.pl
    matrix.sen
    compound.pl
    dic.csv
    ipadic-2.6.0
    posInfo.sen
    connect.csv
    dictionary.properties
    ipadic-2.6.0.tar.gz
    token.sen



    And see something like this when you build:

    ant
    Buildfile: build.xml

    prepare-proxy:

    prepare-archive:

    prepare-dics0:

    prepare-dics:

    download:

    melt:

    prepare:

    dics0:

    create:
    [java] [INFO] MkSenDic - (1/7): reading connection matrix ...
    [java] [INFO] MkSenDic - connection file = connect.csv
    [java] [INFO] MkSenDic - charset = EUC_JP
    [java] [INFO] MkSenDic - (2/7): building type dictionary ...
    [java] [INFO] MkSenDic - (3/7): writing conection matrix (5 x 1281 x 701 = 4489905) ...
    [java] [INFO] MkSenDic - (4/7): reading morpheme information ...
    [java] [INFO] MkSenDic - load dic: dic.csv
    [java] [INFO] MkSenDic - 50000...
    [java] [INFO] MkSenDic - 100000...
    [java] [INFO] MkSenDic - 150000...
    [java] [INFO] MkSenDic - 200000...
    [java] [INFO] MkSenDic - 250000...
    [java] [INFO] MkSenDic - 300000...
    [java] [INFO] MkSenDic - 350000...
    [java] [INFO] MkSenDic - (5/7): sorting lex...
    [java] [INFO] MkSenDic - (6/7): writing token...
    [java] [INFO] MkSenDic - key size = 378227
    [java] [INFO] MkSenDic - (7/7): building Double-Array (size = 325254) ...
    [java] [INFO] DoubleArrayTrie - save time = 0.661[s]
    [java] [INFO] MkSenDic - total time = 69[ms]

    BUILD SUCCESSFUL
    Total time: 1 minute 11 seconds

  3. Wow, thanks for the prompt feedback!
    I was timing out on the tar download, and I had perl pointing to the wrong place in the build.xml.
    It works now. What a great utility!
    While I've got you online, how do I pipe in, say, a text file and pipe the output to another file?

  4. Hmmm, I suppose I'll be using it in a Java program and there are a few examples there...
    I'll have a look at those.

    Thanks again for the clear explanation.

  5. You just need to edit the file ProcessorDemo.java. Inside the tagger.hasNext() loop you'll see where the results are printed:

    while (tagger.hasNext()) {
        Token token = tagger.next();
        System.out.println(token.getSurface() + "\t" + token.getPos()
                + "\t" + token.start() + "\t" + token.end() + "\t"
                + token.getCost() + "\t" + token.getAddInfo());
    }


    You can leave the print statement as it is and redirect the program's output to a file, or change the print statement to write to a file directly.

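
For writing the results straight to a file instead of the console, here is a minimal sketch using only the standard library. TokenFileWriter and writeRows are hypothetical names, and each row stands in for the values the demo prints for one token (in the real loop they would come from the token getters). Writing through an explicit UTF-8 writer avoids depending on the platform's default encoding, which matters for Japanese text.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Sen library.
final class TokenFileWriter {

    // Writes one tab-separated line per row, encoded as UTF-8.
    static void writeRows(Path out, List<String[]> rows) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            for (String[] row : rows) {
                writer.write(String.join("\t", row));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data modeled on the output in the post; Unicode escapes keep
        // this source file ASCII-only.
        List<String[]> rows = new ArrayList<String[]>();
        rows.add(new String[] {"\u56db", "\u540d\u8a5e-\u6570", "6", "7", "84986"});
        // tokens.tsv is an arbitrary output file name chosen for this example.
        writeRows(Paths.get("tokens.tsv"), rows);
    }
}
```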