Watson News Companion

newscompanion screenshotWe recently ran a hackathon at work: people within IBM were invited to try building a mobile app aimed at consumers using Watson services. It was a fun chance to try out some new ideas, as well as to build something using our APIs – dogfooding is always a good thing.

I worked on a hack with David which we submitted on Wednesday. This is what we came up with, and how we built it.

The idea

A mobile app that will help users to digest the news by explaining references in stories and providing greater context.

Background

It’s difficult to find the time nowadays to properly read and understand what’s going on in the world. We rarely have the time to sit and read through a newspaper. Instead, we might quickly read news stories online from our smartphones and tablets. But that often makes it difficult to understand the broader context that a story is in. There might be references in the story to people, places, organisations or events that are unfamiliar.

Watson could help. It could be an assistant as you read the news, explaining unfamiliar references and the broader context.

Features

Our Watson News Companion demo is a mobile news reader app that:

  • anticipates questions and suggests areas where it can help improve understanding
  • provides answers to questions without needing the users to lose their place in the story
  • allow the user to dig deeper with their own follow-up questions


A video walkthrough of the hack

Implementation

The hack was built as a mobile web app using the MEAN stack: using express as the framework on a Node.js platform, storing some information in a MongoDB and building the UI with AngularJS.

It uses RSS feeds from news websites to fetch content, which are shown in a simple newsreader app built using Ratchet.

The contents of the story is run through the Watson Relationship Extraction API to pick out the people, places, organisations, and other entities mentioned in the story.

The API output includes co-references, to identify the multiple mentions of the same entity. These are combined and reviewed, and together with the type of the entity are used to generate likely questions about the entity.

These questions are sent to the Watson Question and Answer API. For questions which are returned with answers with a high confidence, annotations are added inline to the news story. Pressing the annotation brings up a sliding panel at the bottom of the screen with the answer to the question. The links and footer annotations are built using bigfoot.

Every screen in the app also includes an “Ask Watson” button which lets the user enters any free text question to let them dig deeper into what they’ve read.

Could you build this?

This was a proof-of-concept built in a hurry, so we’re not calling this a finished app. But everything we used is freely available – both to people inside IBM and the public.

We developed on an instance of Bluemix (our Cloud Foundry-based development platform) available internally within IBM. You can sign up for free to a public instance of Bluemix at bluemix.net.

The technologies used to build the hack are all freely available : Node.js, MongoDB, AngularJS, Ratchet, bigfoot, jQuery.

A beta version of the Watson Relationship Extraction API is freely available for apps hosted on Bluemix.

A beta version of the Watson Question and Answer API is freely available for apps hosted on Bluemix. But this is a demo instance of Watson that has only read a small number of general healthcare documents. That’s not a useful corpus for our hack, so to record our demo we stood up our own instance – using an untrained instance of Watson which we gave a small subset of Wikipedia to read. We used the Question and Answer API on this instance instead of the Bluemix one. For people outside IBM to do this, they need to sign up to join the Watson Ecosystem. This is also free, but there are criteria for who is eligible at this point, and an application process to go through.


Tackling Cancer with Machine Learning

For a recent Hack Day at work I spent some time working with one of my colleagues, Adrian Lee, on a little side project to see if we could detect cancer cells in a biopsy image.  We've only spent a couple of days on this so far but already the results are looking very promising with each of us working on a distinctly different part of the overall idea.

We held an open day in our department at work last month and I gave a lightening talk on the subject which you can see on YouTube:


There were a whole load of other talks given on the day that can be seen in the summary blog post over on the ETS (Emerging Technology Services) site.



Talking about IBM Watson (again)

As I mentioned in May, I was lucky to be able to go to Thinking Digital this year and talk about what we’re doing with Watson.

I’ve just noticed that they’ve made a video of my talk available. I haven’t dared watch it (does anyone like watching videos of themselves?), but I figured I should share it anyway!


Monkigras 2014: Sharing craft

After Monkigras 2013, I was really looking forward to Monkigras 2014. The great talks about developer culture and creating usable software, the amazing buzz and friendliness of the event, the wonderful lack of choice over which talks to go to (there’s just one track!!), and (of course) the catering:

coffeecheese

The talks at Monkigras 2014

The talks were pretty much all great so I’m just going to mention the talks that were particularly relevant to me.

Rafe Colburn from Etsy talked about how to motivate developers to fix bugs (IBMers, read ‘defects’) when there’s a big backlog of bugs to fix. They’d tried many strategies, including bug rotation, but none worked. The answer, they found, was to ask their support team to help prioritise the bugs based on the problems that users actually cared about. That way, the developers fixing the bugs weren’t overwhelmed by the sheer numbers to choose from. Also, when they’d done a fix, the developers could feel that they’d made a difference to the user experience of the software.

rafe
Rafe Colburn from Etsy

While I’m not responsible for motivating developers to fix bugs, my job does involve persuading developers to write articles or sample code for WASdev.net. So I figure I could learn a few tricks.

A couple of talks that were directly applicable to me were Steve Pousty‘s talk on how to be a developer evangelist and Dawn Foster‘s on taking lessons on community from science fiction. The latter was a quick look through various science fiction themes and novels applied to developer communities, which was a neat idea though I wished I’d read more of the novels she cited. I was particularly interested in Steve’s talk because I’d seen him speak last year about how his PhD in Ecology had helped him understand communities as ecosystems in which there are sometimes surprising dependencies. This year, he ran through a checklist of attributes to look for when hiring a developer evangelist. Although I’m not strictly a developer evangelist, there’s enough overlap with my role to make me pay attention and check myself against each one.

dawn
Dawn Foster from Puppet Labs

One of the risks of TED Talk-style talks is that if you don’t quite match up to the ‘right answers’ espoused by the speakers, you could come away from the event feeling inadequate. The friendly atmosphere of Monkigras, and the fact that some speakers directly contradicted each other, meant that this was unlikely to happen.

It was still refreshing, however, to listen to Theo Schlossnagle basically telling people to do what they find works in their context. Companies are different and different things work for different companies. Similarly, developers are people and people learn in different ways so developers learn in different ways. He focused on how to tell stories about your own failures to help people learn and to save them from having to make the same mistakes.

Again, this was refreshing to hear because speakers often tell you how you should do something and how it worked for them. They skim over the things that went wrong and end up convincing you that if only you immediately start doing things their way, you’ll have instant success. Or that inadequacy just kicks in like when you read certain people’s Facebook statuses. Theo’s point was that it’s far more useful from a learning perspective to hear about the things that went wrong for them. Not in a morbid, defeatist way (that way lies only self-pity and fear) but as a story in which things go wrong but are righted by the end. I liked that.

theo
Theo Schlossnagle from Circonus

Ana Nelson (geek conference buddy and friend) also talked about storytelling. Her point was more about telling the right story well so that people believe it rather than believing lies, which are often much more intuitive and fun to believe. She impressively wove together an argument built on various fields of research including Psychology, Philosophy, and Statistics. In a nutshell, the kind of simplistic headlines newspapers often publish are much more intuitive and attractive because they fit in with our existing beliefs more easily than the usually more complicated story behind the headlines.

ana
Ana Nelson from Brick Alloy

The Gentle Author spoke just before lunch about his daily blog in which he documents stories from local people. I was lucky enough to win one of his signed books, which is beautiful and engrossing. Here it is with my swagbag:

After his popular talk last year, Phil Gilbert of IBM returned to give an update on how things are going with Design@IBM. Theo’s point about context of a company being important is so relevant when trying to change the culture of such a large company. He introduced a new card game that you can use to help teach people what it’s like to be a designer working within the constraints of a real software project. I heard a fair amount of interest from non-IBMers who were keen for a copy of the cards to be made available outside IBM.

wildducksgame
Phil Gilbert’s Wild Ducks card game

On the UX theme, I loved Leisa Reichelt‘s talk about introducing user research to the development teams at GDS. While all areas of UX can struggle to get taken seriously, user research (eg interviewing participants and usability testing) is often overlooked because it doesn’t produce visual designs or code. Leisa’s talk was wonderfully practical in how she related her experiences at GDS of proving the worth of user research to the extent that the number of user researchers has greatly increased.

And lastly I must mention Project Andiamo, which was born at Monkigras 2013 after watching a talk about laser scanning and 3D printing old railway trains. The project aims to produce medical orthotics, like splints and braces, by laser scanning the patient’s body and then 3D printing the part. This not only makes the whole process much quicker and more comfortable, it is at a fraction of the cost of the way that orthotics are currently made.

projectandiamo
Samiya Parvez & Naveed Parvez of Project Andiamo

If you can help in any way, take a look at their website and get in touch with them. Samiya and Naveed’s talk was an amazing example of how a well-constructed story can get a powerful message across to its listeners:

After Monkigras 2014, I’m now really looking forward to Monkigras 2015.


 

The post Monkigras 2014: Sharing craft appeared first on LauraCowen.co.uk.

At ThingMonk 2013

I attended ThingMonk 2013 conference partly because IBM’s doing a load of work around the Internet of Things (IoT). I figured it would be useful to find out what’s happening in the world of IoT at the moment. Also, I knew that, as a *Monk production, the food would be amazing.

What is the Internet of Things?

If you’re reading this, you’re familiar with using devices to access information, communicate, buy things, and so on over the Internet. The Internet of Things, at a superficial level, is just taking the humans out of the process. So, for example, if your washing machine were connected to the Internet, it could automatically book a service engineer if it detects a fault.

I say ‘at a superficial level’ because there are obviously still issues relevant to humans in an automated process. It matters that the automatically-scheduled appointment is convenient for the householder. And it matters that the householder trusts that the machine really is faulty when it says it is and that it’s not the manufacturer just calling out a service engineer to make money.

This is how James Governor of RedMonk, who conceived and hosted ThingMonk 2013, explains IoT:

What is ThingMonk 2013?

ThingMonk 2013 was a fun two-day conference in London. On Monday was a hackday with spontaneous lightning talks and on Tuesday were the scheduled talks and the evening party. I wasn’t able to attend Monday’s hackday so you’ll have to read someone else’s write-up about that (you could try Josie Messa’s, for instance).

The talks

I bought my Arduino getting started kit (which I used for my Christmas lights energy project in 2010) from Tinker London so I was pleased to finally meet Tinker’s former-CEO, Alexandra Dechamps-Sonsino, at ThingMonk 2013. I’ve known her on Twitter for about 4 years but we’d never met in person. Alex is also founder of the Good Night Lamp, which I blogged about earlier this year. She talked at ThingMonk about “the past, present and future of the Internet of Things” from her position of being part of it.

Alexandra Deschamps-Sonsino, @iotwatch
Alexandra Deschamps-Sonsino, @iotwatch

I think it was probably Nick O’Leary who first introduced me to the Arduino, many moons ago over cups of tea at work. He spoke at ThingMonk about wiring the Internet of Things. This included a demo of his latest project, NodeRED, which he and IBM have recently open sourced on GitHub.

Nick O'Leary talks about wiring the Internet of Things
Nick O’Leary talks about wiring the Internet of Things

Sadly I missed the previous day when it seems Nick and colleagues, Dave C-J and Andy S-C, won over many of the hackday attendees to the view that IBM’s MQTT and NodeRED are the coolest things known to developerkind right now. So many people mentioned one or both of them throughout the day. One developer told me he didn’t know why he’d not tried MQTT 4 years ago. He also seemed interested in playing with NodeRED, just as soon as the shock that IBM produces cool things for developers had worn off.

Ian Skerrett from Eclipse talked about the role of Open Source in the Internet of Things. Eclipse has recently started the Paho project, which focuses on open source implementations of the standards and protocols used in IoT. The project includes IBM’s Really Small Message Broker and Roger Light’s Mosquitto.

Ian Skerrett from Eclipse
Ian Skerrett from Eclipse

Andy Piper talked about the role of signals in the IoT.

IMG_1546

There were a couple of talks about people’s experiences of startups producing physical objects compared with producing software. Tom Taylor talked about setting up Newspaper Club, which is a site where you can put together and get printed your own newspaper run. His presentation included this slide:

IMG_1534Matt Webb talked about producing Little Printer, which is an internet-connected device that subscribes to various sources and prints them for you on a strip of paper like a shop receipt.

IMG_1550Patrick Bergel made the very good point in his talk that a lot of IoT projects, at the moment, are aimed at ‘non-problems’. While fun and useful for learning what we can do with IoT technologies, they don’t really address the needs of real people (ie people who aren’t “hackers, hipsters, or weirdos”). For instance, there are increasing numbers of older people who could benefit from things that address problems social isolation, dementia, blindness, and physical and cognitive impairments. His point was underscored throughout the day by examples of fun-but-not-entirely-useful-as-is projects, such as flying a drone with fruit. That’s not to say such projects are a waste of time in themselves but that we should get moving on projects that address real problems too.

IMG_1539The talk which chimed the most with me, though, was Claire Rowland‘s on the important user experience UX issues around IoT. She spoke about the importance of understanding how users (householders) make sense of automated things in their homes.

IMG_1587

The book

I bought a copy of Adrian McEwan‘s Designing The Internet of Things book from Alex’s pop-up shop, (Works)shop. Adrian’s a regular at OggCamp and kindly agreed to sign my copy of his book for me.

Adrian McEwan and the glamorous life of literary reknown.
Adrian McEwan and the glamorous life of literary reknown.

The food

The food was, as expected, amazing. I’ve never had bacon and scrambled egg butties that melt in the mouth before. The steak and Guinness casserole for lunch was beyond words. The evening party was sustained with sushi and tasty curry.

Thanks, James!

The post At ThingMonk 2013 appeared first on LauraCowen.co.uk.

Why am I still at IBM?

Ten years ago.

6 August 2003.

I was a recent University graduate, arriving at IBM’s R&D site in Hursley for the first time. I remember arriving in Reception.


Reception – the view that greeted me when I arrived

Ten years.

It was a Wednesday.

I’m still at the same company. I’m still at the same site. I still do the same drive to work, more or less.

For a *decade*.

How did that happen?

It was never The Plan. The Plan (as cynical as it sounds in hindsight) was that I’d stay for two or three years. I figured that would be long enough to get experience, and then I’d leave to work at a small nimble start-up which was where all the “cool” work was.

The Plan never happened. A few years passed, and then another few… I kept saying that I’d leave “later” and before I knew it a ten year milestone has kind of snuck up on me.

I think I’m more surprised than anyone. I’ve never been at any place this long. I was at Uni for five years. The longest I was at any school was four years.

It’s a serious commitment, and one I never realised that I had made. I’ve not even been married for as long as I’ve been with IBM.

So why? Why am I still here?

I live here.

It’s been varied

I’ve spent ten years working for the same company, but I’ve had several jobs in this time.

I’ve been a software developer. I’ve been a test engineer. I’ve been a service engineer, fixing problems with customer systems. I’ve worked as a consultant, advising clients about technology through presentations and running workshops. I’ve done services work building prototypes and first-of-a-kind pilot systems for clients.

I’ve written code to run on tiny in-car embedded systems and apps that ran on mobile phones. I’ve worked as a System z Mainframe developer. I’ve written front-end UI code, and I’ve written heavy-duty server jobs that took hours to run (even when they weren’t supposed to).


IBM Hursley

It’s still challenging

I’ve worked on middleware technology, getting some of the biggest computer systems in the world to communicate with each other, reliably, securely and at scale. I’ve used analytics to get insight from massive amounts of data. I’ve worked on large-scale fingerprint and voiceprint systems. I’ve used natural language processing to build systems that attempt to interpret unstructured text. I’ve used machine learning to create systems that can be trained to perform work.

I’m still learning new stuff and still regularly have to figure out how to do stuff that I have no idea how to at the start.


Some views of the grounds around the office

I get to do more than just a “day job”

I do random stuff outside the day job. I’ve helped organise week long schools events to teach kids about science and technology. I’ve mentored teams of University students on summer-long residential innovation projects. I’ve prepared and delivered training courses to school kids, school teachers and charity leaders. I’ve written an academic paper and presented it at a peer-reviewed research conference. And lots more.

I’m a developer, but that doesn’t mean I’ve spent ten years churning out code 40 hours a week. There’s always something new and different.


Hursley House – where I normally work when I have customers visiting

I work on stuff that matters

Tim O’Reilly has been talking for years about the importance of working on stuff that matters.

“Work on something that matters to you more than money”

If you’ve not heard any of his talks around this, I’d recommend having a look. There are lots of examples of his slides, talks, blog posts and interviews around.

I can’t do his message justice here, but I just want to say that he describes a big part of how I feel very well. I want to work on stuff that I can be proud of. Not just technically proud of, although that’s important too. But the pride of doing something that will make a difference.

Working for a massive company gives me chances to do that. I’ve worked on projects for governments, and police forces, and Universities. I’ve done work that I can be proud of.

For the last couple of years, I’ve been working on Watson. It’s a very cool collection of technologies, and watching the demo of it competing on a US game show has a geeky thrill that doesn’t get old. But that’s not the most exciting bit. Watson could be a turning point. This could change how we do computing. If you look at what we’re trying to do with Watson in medicine, we’re trying to transform how we deliver healthcare. This stuff matters. It’s exciting to be a part of.


These are the views that surround the site

I like the lifestyle

Hursley is a campus-style site. It’s miles from the nearest town, and surrounded by fields and farms. It’s quiet and has loads of green open space.

My commute is a ten minute drive through a village and fields.

I don’t have to wear a suit, and I don’t stand out coming to work in a hoodie and combat trousers. Flexitime has been the norm for most of my ten years, and I am free to plan a work day that suits me. When I need to be out of the office by 3pm to get the kids from school, I can.

My kids are at a school half-way between home and the office, so I can do the school run on the way to work. As the school is only five minutes from work, I often nip out to see them do something in an assembly, or have lunch with them.

This is a nice aspect of the school – that parents are welcome to join their kids for lunch, and have a school dinner with them and their friends in the school canteen. But still… it’s pretty cool, and if I didn’t work just up the road from them, I wouldn’t be able to do it.

Once a month, I bring them to work in the morning before school starts for a cooked breakfast in the Clubhouse with the rest of my team.

All of this and a lot more tiny aspects like it add up to a lifestyle that I like.

More train tickets
Some of my train tickets from the last few years

I get to see the world

I enjoy travelling. I love seeing new places.

But I’d hate a job where I lived out of a suitcase and never saw the kids.

I’ve managed to find a nice balance. I travel, but usually on short trips and not too often.

In 2006, I worked at IBM’s La Gaude site near Nice. In 2007, Singapore, Malaysia, Philippines and Paris. In 2008, I worked in Copenhagen, Paris and Hamburg. In 2009, I worked in Munich many times, and Rotterdam. In 2010, Stockholm. In 2011, Tel Aviv and Haifa in Israel, Austin in Texas, Paris and Berlin. Last year, I worked in Zurich and Littleton, Massachusetts.

This year, I’ve been in Rio de Janeiro and Littleton again, and it looks like I’ll be in Lisbon in December.

Plus working around the UK. It’s less glamorous, but it’s still interesting to go to new places. I’ve worked in loads of places, like Edinburgh, York, Swansea, Malvern, Warwick, Portsmouth, Cheshire, Northampton, Guildford… I occasionally have to work in London, although I tend to moan about it. And I spent a few months working in Farnborough. I think I moaned about that, too. :-)

Travelling is a great opportunity. I couldn’t afford to have been to all the places that IBM has sent me if I had to pay for it myself.

The officemy deskCarnage
My office today (left), compared with some of the other desks I’ve had around Hursley

The pay is amazing!

Hahahahaha… no.

See above.


Grace at my desk at a family fun day at work in 2008

Will I be here for another ten years?

I’m trying to explain why I’m happy and enjoying what I do. I’m not saying I couldn’t get exactly the same or better somewhere else. Because I don’t know. Other than a year I spent as an intern at Motorola I’ve never worked anywhere else. For all I know, the grass might be greener somewhere.

Will I still be here in another ten years? I dunno… I do worry if that’s unambitious. I wonder if I should try somewhere else. I wonder if only ever working for one company is giving me an institutionalised and insular view of the world.

I keep getting emails from LinkedIn about all the people I know who have new jobs. There are a bunch of people I used to work with at IBM who have not only left to work at other companies, but have since left those companies and gone on to something even newer. While I’m still here.

Am I destined to be one of those IBMers who works at Hursley forever? That’s a scary thought.

For now, I’m enjoying what I do, so that’s good enough for me.

Happy 10th anniversary to me.


Speech to Text

Apologies to the tl;dr brigade, this is going to be a long one... 

For a number of years I've been quietly working away with IBM research on our speech to text programme. That is, working with a set of algorithms that ultimately produce a system capable of listening to human speech and transcribing it into text. The concept is simple, train a system for speech to text - speech goes in, text comes out. However, the process and algorithms to do this are extremely complicated from just about every way you look at it – computationally, mathematically, operationally, evaluationally, time and cost. This is a completely separate topic and area of research from the similar sounding text to speech systems that take text (such as this blog) and read it aloud in a computerised voice.

Whenever I talk to people about it they always appear fascinated and want to know more. The same questions often come up. I'm going to address some of these here in a generic way and leaving out those that I'm unable to talk about here. I should also point out that I'm by no means a speech expert or linguist but have developed enough of an understanding to be dangerous in the subject matter and that (I hope) allows me to explain things in a way that others not familiar with the field are able to understand. I'm deliberately not linking out to the various research topics that come into play during this post as the list would become lengthy very quickly and this isn't a formal paper after all, Internet searches are your friend if you want to know more.

I didn't know IBM did that?
OK so not strictly a question but the answer is yes, we do. We happen to be pretty good at it as well. However, we typically use a company called Nuance as our preferred partner.

People have often heard of IBM's former product in this area called Via Voice for their desktop PCs which was available until the early 2000's. This sort of technology allowed a single user to speak to their computer for various different purposes and required the user to spend some time training the software before it would understand their particular voice. Today's speech software has progressed beyond this to systems that don't require any training by the user before they use it. Current systems are trained in advance in order to attempt to understand any voice.

What's required?
Assuming you have the appropriate software and the hardware required to run it on then you need three more things to build a speech to text system: audio, transcripts and a phonetic dictionary of pronunciations. This sounds quite simple but when you dig under the covers a little you realise it's much more complicated (not to mention expensive) and the devil is very much in the detail.

On the audial side you'll need a set of speech recordings. If you want to evaluate your system after it has been trained then a small sample of these should be kept to one side and not used during the training process. This set of audio used for evaluation is usually termed the held out set. It's considered cheating if you later evaluate the system using audio that was included in the training process – since the system has already “heard” this audio before it would have a higher chance of accurately reproducing it later. The creation of the held out set leads to two sets of audio files, the held out set and the majority of the audio that remains which is called the training set.

The audio can be in any format your training software is compatible with but wave files are commonly used. The quality of the audio both in terms of the digital quality (e.g. sample rate) as well as the quality of the speaker(s) and the equipment used for the recordings will have a direct bearing on the resulting accuracy of the system being trained. Simply put, the better quality you can make the input, the more accurate the output will be. This leads to another bunch of questions such as but not limited to “What quality is optimal?”, “What should I get the speakers to say?”, “How should I capture the recordings?” - all of which are research topics in their own right and for which there is no one-size-fits-all answer.

Capturing the audio is one half of the battle. The next piece in the puzzle is obtaining well transcribed textual copies of that audio. The transcripts should consist of a set of text representing what was said in the audio as well as some sort of indication of when during the audio a speaker starts speaking and when they stop. This is usually done on a sentence by sentence basis, or for each utterance as they are known. These transcripts may have a certain amount of subjectivity associated with them in terms of where the sentence boundaries are and potentially exactly what was said if the audio wasn't clear or slang terms were used. They can be formatted in a variety of different ways and there are various standard formats for this purpose from an XML DTD through to CSV.

If it has not already become clear, creating the transcription files can be quite a skilled and time consuming job. A typical industry expectation is that it takes approximately 10 man-hours for a skilled transcriber to produce 1 hour of well formatted audio transcription. This time plus the cost of collecting the audio in the first place is one of the factors making speech to text a long, hard and expensive process. This is particularly the case when put into context that most current commercial speech systems are trained on at least 2000+ hours of audio with the minimum recommended amount being somewhere in the region of 500+ hours.

Finally, a phonetic dictionary must either be obtained or produced that contains at least one pronunciation variant for each word said across the entire corpus of audio input. Even for a minimal system this will run into tens of thousands of words. There are of course, already phonetic dictionaries available such as the Oxford English Dictionary that contains a pronunciation for each word it contains. However, this would only be appropriate for one regional accent or dialect without variation. Hence, producing the dictionary can also be a long and skilled manual task.

What does the software do?
The simple answer is that it takes audio and transcript files and passes them through a set of really rather complicated mathematical algorithms to produce a model that is particular to the input received. This is the training process. Once system has been trained the model it generates can be used to take speech input and produce text output. This is the decoding process. The training process requires lots of data and is computationally expensive but the model it produces is very small and computationally much less expensive to run. Today's models are typically able to perform real-time (or faster) speech to text conversion on a single core of a modern CPU. It is the model and software surrounding the model that is the piece exposed to users of the system.

Various different steps are used during the training process to iterate through the different modelling techniques across the entire set of training audio provided to the trainer. When the process first starts the software knows nothing of the audio, there are no clever boot strapping techniques used to kick-start the system in a certain direction or pre-load it in any way. This allows the software to be entirely generic and work for all sorts of different languages and quality of material. Starting in this way is known as a flat start or context independent training. The software simply chops up the audio into regular segments to start with and then performs several iterations where these boundaries are shifted slightly to match the boundaries of the speech in the audio more closely.

The next phase is context dependent training. This phase starts to make the model a little more specific and tailored to the input being given to the trainer. The pronunciation dictionary is used to refine the model to produce an initial system that could be used to decode speech into text in its own right at this early stage. Typically, context dependent training, while an iterative process in itself, can also be run multiple times in order to hone the model still further.

Another optimisation that can be made to the model after context dependent training is to apply vocal tract length normalisation. This works on the theory that the audibility of human speech correlates to the pitch of the voice, and the pitch of the voice correlates to the vocal tract length of the speaker. Put simply, it's a theory that says men have low voices and women have high voices and if we normalise the wave form for all voices in the training material to have the same pitch (i.e. same vocal tract length) then audibility improves. To do this an estimation of the vocal tract length must first be made for each speaker in the training data such that a normalisation factor can be applied to that material and the model updated to reflect the change.

The model can be thought of as a tree although it's actually a large multi-dimensional matrix. By reducing the number of dimensions in the matrix and applying various other mathematical operations to reduce the search space the model can be further improved upon both in terms of accuracy, speed and size. This is generally done after vocal tract length normalisation has taken place.

Another tweak that can be made to improve the model is to apply what we call discriminative training. For this step the theory goes along the lines that all of the training material is decoded using the current best model produced from the previous step. This produces a set of text files. These text files can be compared with those produced by the human transcribers and given to the system as training material. The comparison can be used to inform where the model can be improved and these improvements applied to the model. It's a step that can probably be best summarised by learning from its mistakes, clever!

Finally, once the model has been completed it can be used with a decoder that knows how to understand that model to produce text given an audio input. In reality, the decoders tend to operate on two different models. The audio model for which the process of creation has just been roughly explained; and a language model. The language model is simply a description of how language is used in the specific context of the training material. It would, for example, attempt to provide insight into which words typically follow which other words via the use of what natural language processing experts call n-grams. Obtaining information to produce the language model is much easier and does not necessarily have to come entirely from the transcripts used during the training process. Any text data that is considered representative of the speech being decoded could be useful. For example, in an application targeted at decoding BBC News readers then articles from the BBC news web site would likely prove a useful addition to the language model.

How accurate is it?
This is probably the most common question about these systems and one of the most complex to answer. As with most things in the world of high technology it's not simple, so the answer is the infamous “it depends”. The short answer is that in ideal circumstances the software can perform at near human levels of accuracy which equates to in excess of 90% accuracy levels. Pretty good you'd think. It has been shown that human performance is somewhere in excess of 90% and is almost never 100% accuracy. The test for this is quite simple, you get two (or more) people to independently transcribe some speech and compare the results from each speaker, almost always there will be a disagreement about some part of the speech (if there's enough speech that is).

It's not often that ideal circumstances are present or can even realistically be achieved. Ideal would be transcribing a speaker with a similar voice and accent to those which have been trained into the model and they would speak at the right speed (not too fast and not too slowly) and they would use a directional microphone that didn't do any fancy noise cancellation, etc. What people are generally interested in is the real-world situation, something along the lines of “if I speak to my phone, will it understand me?”. This sort of real-world environment often includes background noise and a very wide variety of speakers potentially speaking into a non-optimal recording device. Even this can be a complicated answer for the purposes of accuracy. We're talking about free, conversational style, speech in this blog post and there's a huge different in recognising any and all words versus recognising a small set of command and control words for if you wanted your phone to perform a specific action. In conclusion then, we can only really speak about the art of the possible and what has been achieved before. If you want to know about accuracy for your particular situation and your particular voice on your particular device then you'd have to test it!

What words can it understand? What about slang?
The range of understanding of a speech to text system is dependent on the training material. At present, the state of the art systems are based on dictionaries of words and don't generally attempt to recognise new words for which an entry in the dictionary has not been found (although these types of systems are available separately and could be combined into a speech to text solution if necessary). So the number and range of words understood by a speech to text system is currently (and I'm generalising here) a function of the number and range of words used in the training material. It doesn't really matter what these words are, whether they're conversational and slang terms or proper dictionary terms, so long as the system was trained on those then it should be able to recognise them again during a decode.

Updates and Maintenance
For the more discerning reader, you'll have realised by now a fundamental flaw in the plan laid out thus far. Language changes over time, people use new words and the meaning of words changes within the language we use. Text-speak is one of the new kids on the block in this area. It would be extremely cumbersome to need to train an entire new model each time you wished to update your previous one in order to include some set of new language capability. The models produced are able to be modified and updated with these changes without the need to go back to a full standing start and training from scratch all over again. It's possible to take your existing model built from the set of data you had available at a particular point in time and use this to bootstrap the creation of a new model which will be enhanced with the new materials that you've gathered since training the first model. Of course, you'll want to test and compare both models to check that you have in fact enhanced performance as you were expecting. This type of maintenance and update to the model will be required to any and all of these types of systems as they're currently designed as the structure and usage of our languages evolve.

Conclusion
OK, so not necessarily a blog post that was ever designed to draw a conclusion but I wanted to wrap up by saying that this is an area of technology that is still very much in active research and development, and has been so for at least 40-50 years or more! There's a really interesting statistic I've seen in the field that says if you ask a range of people involved in this topic the answer to the question “when will speech to text become a reality” then the answer generally comes out at “in ten years time”. This question has been asked consistently over time and the answer has remained the same. It seems then, that either this is a really hard nut to crack or that our expectations of such a system move on over time. Either way, it seems there will always be something new just around the corner to advance us to the next stage of speech technologies.

Going Back to University



A couple of weeks ago I had the enormous pleasure of returning to Exeter University where I studied for my degree more years ago than seems possible.  Getting involved with the uni again has been something I've long since wanted to do in an attempt to give back something to the institution to which I owe so much having been there to get good qualifications and not least met my wife there too!  I think early on in a career it's not necessarily something I would have been particularly useful for since I was closer to the university than my working life in age, mentality and a bunch of other factors I'm sure.  However, getting a bit older makes me feel readier to provide something tangibly useful in terms of giving something back both to the university and to the current students.  I hope that having been there recently with work it's a relationship I can start to build up.

I should probably steer clear of saying exactly why we were there but there was a small team from work some of which I knew well such as @madieq and @andysc and one or two I hadn't come across before.  Our job was to work with some academic staff for a couple of days and so it was a bit of a departure from my normal work with corporate customers.  It's fantastic to see the university from the other side of the fence (i.e. not being a student) and hearing about some of the things going on there and seeing a university every bit as vibrant and ambitious as the one I left in 2000. Of course, there was the obligatory wining and dining in the evening which just went to make the experience all the more pleasurable.

I really hope to be able to talk a lot more about things we're doing with the university in the future.  Until then, I'm looking forward to going back a little more often and potentially imparting some words (of wisdom?) to some students too.

Monkigras 2013: Scaling craft

The work of William Morris, my GCSE history teacher said, was a bit of a moral dilemma. Morris was a British designer born during the Industrial Revolution. British (and then world) industry was moving rapidly towards mass production by replacing traditional, cottage-industry production processes with the more efficient, and therefore profitable, machines. One thing that suffered under this move to mass production was the focus on function and quantity over decoration and quality. Morris reacted against this by designing and producing decorations like wallpaper and textiles using the traditional craft techniques of skilled craftspeople. My history teacher’s point was that although Morris, a passionate socialist, was able to create high quality goods by using smaller-scale production methods, only wealthy people could afford to buy his designs; which was hardly equality in action. On the other hand, the skills of craftspeople were being retained, quality goods were being produced, and the craftspeople were getting paid for that quality of their work.

My pretty, handcrafted latte
My pretty, handcrafted latte

Monkigras 2013, in London last week, took on this theme of ‘scaling craft’ in the context of beer, coffee, and software. All parts of this trinity of software development can benefit hugely from a focus on quality over quantity. Before I went to Monkigras, I wasn’t really sure what to expect from a tech event advertised as having a lot of beer. It did have a lot of beer (and coffee) available but if you didn’t want it you could avoid it (several people I talked to said they didn’t usually drink beer). And no one seemed to get ridiculously drunk. And there were a lot of very cool talks.

The beer was also a fun analogy to apply to software development. Despite pubs in the UK closing hand over fist at the moment, microbreweries are on the rise. Microbrewing is about producing beer in small quantities on a commercial basis so that quality can be maintained whilst still viable as a business. One of the things we learnt from a brewer at Monkigras is that the taste of water varies according to where it comes from. Water is a major component of beer so if the taste of your water supply changes, the taste of your beer changes. To maintain the quality of the beer you brew, you must work within the natural resources available to you and not over-expand. Similarly, quality comes from skilled and knowledgeable people who need to be paid for their skill. If you take on cheaper staff and train them less so that you can make more profit, you will end up with a poorer quality product. You get the idea.

Handcrafting a wooden spoon.
Handcrafting a wooden spoon.

This principle applies to all areas of craft, whether it’s producing quality coffee, a quality wooden spoon, quality conference food, or organising a quality conference, you have to focus on quality and ensure that if you scale what you do so that it’s more readily available to more people, you don’t sacrifice quality at the same time. And, importantly, that you know when to stop. Bigger doesn’t necessarily mean better.

Software is misleadingly easy to produce. Unlike making physical objects, there is very little initial cost to producing software; you can make copies and then distribute them to customers over the Internet at very little cost. Initially, at least, it’s all in the skill of the craftspeople and their ability to identify their target users and market. If they can’t make what people will buy, they will go out of business very quickly. As software development companies get larger, the people who make the software inside the company become further removed from the selling of their software to their customers. So they become more focused on what they are close to, the technology but not who will use it.

Phil Gilbert on IBM Design Thinking
Phil Gilbert on IBM Design Thinking

Phil Gilbert, IBM’s new General Manager of Design, comes from a 30-year career in startups, most recently Lombardi, where design was core to their culture. IBM has a portfolio of 3000 software products so, when Lombardi was acquired by IBM, Phil set about simplifying the IBM Business Process Management portfolio of products, reducing 21 different products to just four and kicking off a cultural change to bring design and thinking about users to the centre of product development. Whilst praising IBM’s history of design and a recent server product design award, he also acknowledged at Monkigras: “We are rethinking everything at IBM. Our portfolio is a mess today and we need to get better”. Changing a culture like IBM’s isn’t easy but I’ve seen and experienced a big difference already. Phil’s challenge is to scale the high-quality user-focused design values of a startup to a century-old global corporation.

One of the things that struck me most at Monkigras, and appealed to me most as a social scientist, was the focus on the human side. Despite it being a developer conference, I remember seeing only one slide that contained code. The overriding theme was about people and culture, not technology; how to maintain quality by maintaining a culture that respects its craftspeople and how to retain both even if the organisation gets bigger, even if that naturally limits how much the organisation can grow. Personal analogy was also a big thing…

Laser-scanned model of the engine
Laser-scanned model of the engine

Cyndi Mitchell from Logspace talked about her family’s hog farm and working within the available resources. Shanley Kane from Basho used Dante’s spheres to describe best product management practices. Steve Citron-Pousty from RedHat use his background as an ecologist to manage communities and ‘developer ecosystems’ (don’t just call it an ecosystem; treat it like one). Diane Mueller from ActiveState talked about her 20%-time project to build a crowdsourced database of totem poles and the challenges of understanding what gets people to want to contribute to such projects. Elco Jacobs talked about his BrewPi project: automatically managing the temperature of his homebrewing fridge using a RaspberryPi based controller, and how he has open-sourced to build a community to kick start it as a potential small business. Rafe Colburn from Etsy more directly makes the link between craft and software engineering in his slides.

3D printer making a spoon
3D printer making a spoon

I don’t know much about William Morris so I don’t know which presentations he would have enjoyed or disagreed with. Morris was a preservationist and started the Society for the Protection of Ancient Buildings to ensure that old buildings get repaired and not restored to an arbitrary point in the past. So maybe he would have found laser-scanning and 3D printing interesting. Chris Thorpe is a model train geek and likes to hand-make his own models of real-life objects. He too is interested in alternatives to mass manufacturing and has started to look at how to make model kits. He uses a laser to scan the objects and a 3D printer to prototype the models. He can then send the model to a commercial company who can make it into kits for him to sell. He has recently used his laser-scanning technique to scan a rediscovered old Welsh railway engine to preserve it, virtually at least, in the state in which it was found.

I had a great time with lots of cool and fun people. Well done to @monkchips for scaling a conference to just the right level of intimacy and buzz. The last thing I saw before I left was the craftsman making a wooden spoon pitted in competition against the 3D printer making a plastic spoon.

You can find many of the slide presentations and more about the conference Lanyrd.

The post Monkigras 2013: Scaling craft appeared first on LauraCowen.co.uk.

developerWorks Days Zurich 2012

This week I had a day out of the office to go to Zurich to talk at this years IBM developerWorks Days. I had 2 sessions back to back in the mobile stream, the first an introduction to Android Development and the second on MQTT.

The slots were only 35mins long (well 45mins, but we had to leave 5 mins on each end to let people move round) so there was a limit to how much detail I could go into. With this in mind I decided the best way to give people a introduction to Android Development in that amount of time was to quickly walk through writing reasonably simple application. The application had to be at least somewhat practical, but also very simple so after a little bit of thinking about I settled on an app to download the latest image from the web comic XKCD. There are a number apps on Google Play that already do this (and a lot better) but it does show a little Activity GUI design. I got through about 95% of the app live on stage and only had to copy & paste the details for the onPostExecute method to clear the progress dialog and update the image in the last minute to get it to the point I could run it in the emulator.

Here are the slides for this session

And here is the Eclipse project for the Application I created live on stage:
http://www.hardill.me.uk/XKCD-demo-android-app.zip

The MQTT pitch was a little easier to set up, there is loads of great content on MQTT.org to use as a source and of course I remembered to include the section on the MQTT enabled mouse traps and twittering ferries from Andy Stanford-Clark.

Here are the slides for the MQTT session:

For the Demo I used the Javascript d3 topic tree viewer I blogged about last week and my Raspberry Pi running a Mosquitto broker and a little script to publish the core temperature, load and uptime values. The broker was also bridged to my home broker to show the feed from my weather centre and some other sensors.