Sunday, October 27, 2013

Simple Chinese Word Segmentation Lib for Flash AS3 - SCWS Ported to Flash Using CrossBridge

Unlike English, Chinese text has no spaces between words (http://en.wikipedia.org/wiki/Text_segmentation). This makes processing the language on a computer considerably harder, because the text has to be split into words first.

SCWS is a simple Chinese word segmentation library written in C. I have ported it to Flash using CrossBridge, the latest open source version of FlasCC. You can use the prebuilt swc library "libscws.swc" in your Flash/AS3 projects.

The SCWS lib depends on an extra ".xdb" dictionary file and an ".ini" rule file, both of which can be downloaded from http://www.xunsearch.com/scws/download.php. However, CrossBridge's file system is not as simple as the one in the old Alchemy (see this post, and simplified code), so I use the URLLoaderVFS class by twistedjoe from http://forums.adobe.com/thread/1147910, which doesn't require any genfs processing of the files.

The original C source files are almost unmodified. The only change is in the file "lock.c", where I commented out the following line to get past gcc's complaint:

//#warning no proper flock supported

To use the swc library, you must set the compiler option "enable strict mode" to false. Otherwise, the AS3 compiler reports the error "Error: Call to a possibly undefined method addEventListener through a reference with static type CrossBridge.libscws.vfs:URLLoaderVFS".

There are two main functions in the AS3 library: "initialize_SCWS_AS3()" and "scws_send_text_AS3()".
To use "libscws.swc", first load the dictionary file and the rule file and supply them to the C module. This follows the usual CrossBridge/FlasCC routine: call a URLLoaderVFS's "loadManifest" function to load the manifest file, which lists the files' names and paths. (See the demo's source code for details; https://github.com/twistedjoe/flascc-URLLoaderVFS has more information on the manifest file.) Once the dictionary file and the rule file have loaded, call "initialize_SCWS_AS3()" to initialize the library. You can then call "scws_send_text_AS3(input:String):String" with the text to be processed; it returns the processed text with the words separated by spaces.
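Since the result comes back as one space-delimited string, you will usually want to split it into individual words before doing anything else with it. Below is a minimal sketch of that step, written in TypeScript because it is close to AS3. The helper name "splitSegmented" and the sample string are hypothetical: the helper is not part of libscws.swc, and the sample was written by hand, not produced by SCWS.

```typescript
// Hypothetical helper, NOT part of libscws.swc: turns the space-delimited
// string returned by the segmenter into an array of words.
function splitSegmented(segmented: string): string[] {
  return segmented
    .split(" ")
    // Drop empty entries caused by leading, trailing, or doubled spaces.
    .filter((token) => token.length > 0);
}

// Hand-written sample standing in for a value returned by
// scws_send_text_AS3(); it is an assumption, not real SCWS output.
const sample: string = "我 是 中国 人 ";
const tokens: string[] = splitSegmented(sample);
console.log(tokens.join("|"));
```

Note that if the input text already contained spaces, they cannot be told apart from the delimiters after splitting.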

Here is the demo (type your text at the bottom and press the Return key to send it to the console):



Full source code of the demo and the lib:
https://flaswf.googlecode.com/svn/trunk/LibSCWS

Links:
http://www.xunsearch.com/scws/
http://nlp.stanford.edu/software/segmenter.shtml
http://ictclas.org/index.html
http://technology.chtsai.org/mmseg/
http://www.coreseek.cn/opensource/
https://github.com/fxsjy/jieba
