In this post, I want to explain how to create a text analytics application in BlueMix using UIMA, and share sample code to show how to get started.
First, some background if you’re unfamiliar with the jargon.
What is UIMA?
UIMA (Unstructured Information Management Architecture) is an Apache framework for building analytics applications for unstructured information, and is also an OASIS standard for content analytics.
It’s perhaps better known for providing the architecture for the question answering system IBM Watson.
What is BlueMix?
BlueMix is IBM’s Platform-as-a-Service offering, built on Cloud Foundry. It’s in open beta at the moment, so you can sign up and have a play.
I’ve never used BlueMix before, or Cloud Foundry for that matter, so this was a chance for me to write my first app for it.
A UIMA “Hello World” for BlueMix
I’ve written a small sample to show how UIMA and BlueMix can work together. It provides a REST API that you can submit text to, and get back a JSON response with some attributes found in the text (long words, capitalised words, and strings that look like email addresses).
The “analytics” that the app is doing is trivial at best, but this is just a Hello World. For now my aim isn’t to produce a useful analytics solution, but to walk through the configuration needed to define a UIMA analytics pipeline, wrap it in a REST API using Wink, and deploy it as a BlueMix application.
When I get a chance, I’ll write a follow-up post on making something more useful.
You can try out the sample on BlueMix, as it’s deployed to bluemix.net.
The source is on GitHub at github.com/dalelane/bluemixuima.
In the rest of this post, I’ll walk through some of the implementation details.
Runtimes and services
Creating an application in BlueMix is already well documented so I won’t reiterate those steps, other than to say that as Apache UIMA is a Java SDK and framework, I use the Liberty for Java runtime.
I’m not using any of the services in this simple sample.
The app is bundled up in a war file, which is what we deploy. This is specified in manifest.yml.
I’m deploying from Eclipse, too, using the Cloud Foundry plugins for Eclipse.
The type system is defined in an XML descriptor file and specifies the different annotations that can be created by this pipeline, and the attributes that they have.
Running JCasGen in Eclipse on that descriptor generates Java classes representing those types.
The pipeline is also defined in XML descriptors: one overall aggregate descriptor, which imports three primitive descriptors, one for each of the three annotators in my sample pipeline: one to find email addresses, one to find capitalised words and one to find long words.
Note that the imports in the aggregate descriptor need to be relative so that they keep working once you deploy to BlueMix.
These XML descriptor files are all added to the war file by a fileset include in build.xml.
Each of the primitive descriptor files (the XML files with names starting “primitiveAeDescriptor”) specifies the fully qualified class name of the Java implementation of its annotator.
There are three annotators in this sample.
Each uses a regular expression to find things to annotate in the text. This isn’t a recommendation for how annotators should be written; it just makes for a simple, stateless demonstration without any additional dependencies.
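To give a feel for the regex approach, here is a minimal sketch in plain Java. It is not the code from the repository: the real annotators are UIMA classes that add annotations to the CAS, whereas this version uses a hypothetical `TextSpan` class so that it runs without UIMA on the classpath. The three patterns, and the choice of ten letters as the threshold for a “long” word, are my own assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a UIMA annotation: a typed span of the input text.
final class TextSpan {
    final String type;
    final int begin;
    final int end;
    final String coveredText;

    TextSpan(String type, int begin, int end, String coveredText) {
        this.type = type;
        this.begin = begin;
        this.end = end;
        this.coveredText = coveredText;
    }

    @Override
    public String toString() {
        return type + "[" + begin + "," + end + "] " + coveredText;
    }
}

class RegexAnnotatorSketch {
    // Illustrative patterns only; the expressions in the real sample may differ.
    static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    static final Pattern CAPITALISED = Pattern.compile("\\b[A-Z][a-z]+\\b");
    static final Pattern LONG_WORD = Pattern.compile("\\b[A-Za-z]{10,}\\b");

    // One stateless pass: record a span for every match of the pattern.
    static List<TextSpan> annotate(String type, Pattern pattern, String text) {
        List<TextSpan> spans = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            spans.add(new TextSpan(type, m.start(), m.end(), m.group()));
        }
        return spans;
    }

    // Run the three passes in a fixed order over the same text,
    // loosely mirroring what the aggregate does with its three primitives.
    static List<TextSpan> annotateAll(String text) {
        List<TextSpan> all = new ArrayList<>();
        all.addAll(annotate("email", EMAIL, text));
        all.addAll(annotate("capitalised", CAPITALISED, text));
        all.addAll(annotate("long", LONG_WORD, text));
        return all;
    }

    public static void main(String[] args) {
        String text = "Contact Dale at dale@example.com about unstructured text.";
        for (TextSpan span : annotateAll(text)) {
            System.out.println(span);
        }
    }
}
```

Because each pass holds no state between calls, the same patterns can safely be shared across concurrent requests, which is part of what makes regex a convenient choice for a demo.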
The UIMA pipeline is defined in a single Java class.
It creates a CAS pool to make it easier to handle multiple concurrent requests, and to avoid the overhead of creating a CAS for every request.
Once the CAS has passed through the pipeline, the annotations are immediately copied out of the CAS into a POJO, so that the CAS can be returned to the pool.
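The borrow, process, copy out, release pattern described above can be sketched with nothing but the standard library. This is an illustration of the idea rather than the sample’s actual class: `Workspace` stands in for a CAS, `AnalysisResult` for the POJO, and a `BlockingQueue` for the CAS pool (the real code uses UIMA’s own pool). All of these names are hypothetical, and the “pipeline” here is just a trivial long-word check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical stand-in for a CAS: a reusable, mutable analysis container.
class Workspace {
    String text;
    final List<String> findings = new ArrayList<>();

    void reset() {
        text = null;
        findings.clear();
    }
}

// Hypothetical result POJO that outlives the pooled workspace.
final class AnalysisResult {
    final List<String> findings;

    AnalysisResult(List<String> findings) {
        // Defensive copy, so the result is detached from the workspace.
        this.findings = Collections.unmodifiableList(new ArrayList<>(findings));
    }
}

class PooledPipelineSketch {
    private final BlockingQueue<Workspace> pool;

    PooledPipelineSketch(int size) {
        pool = new ArrayBlockingQueue<>(size);
        for (int i = 0; i < size; i++) {
            pool.add(new Workspace()); // pay the creation cost once, up front
        }
    }

    AnalysisResult analyse(String text) {
        Workspace ws;
        try {
            ws = pool.take(); // blocks while every workspace is in use
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for a workspace", e);
        }
        try {
            ws.text = text;
            for (String token : text.split("\\s+")) {
                if (token.length() >= 10) {
                    ws.findings.add(token); // trivial stand-in for the real pipeline
                }
            }
            // Copy out before the finally block resets and releases the workspace.
            return new AnalysisResult(ws.findings);
        } finally {
            ws.reset();
            pool.add(ws); // always hand the workspace back, even on failure
        }
    }

    int available() {
        return pool.size();
    }
}
```

The pool size caps how many requests are processed at once; further requests simply wait for a workspace to be returned. Copying the results out straight away keeps each workspace checked out for as short a time as possible.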
The war file deployed to BlueMix contains a web.xml which specifies the servlet that implements the REST API.
The list of API endpoints is a list of classes that Wink uses. There is only one API endpoint, so only one class is listed.
Everything is defined using annotations, and Wink handles turning the response into a JSON payload.
I think that’s pretty much it.
It’s live at uimahelloworld.ng.bluemix.net.
Like I said, it’s very simple. The Java itself isn’t particularly complex. My reason for sharing it was to provide a boilerplate config for defining a UIMA analytics pipeline, wrapping it in a REST API, and deploying it to BlueMix.
Once you’ve got that working, you can do text analytics in BlueMix as complex as whatever you can dream up for your annotators.
When I get time, I’ll write a follow-up post sharing what that could look like.