Soccer Aloud Part I

February 12th 2015

I used to be content to read the football news. I had match reports, game analysis, transfer rumours, and general gossip to entertain me and there were a few websites and writers that I loved to read.

But I kept hitting situations where I wanted the info, but I didn't want to read. If I was partway through an article when I had to do the washing up, I'd have to wait to finish or risk drowning my phone. I'd be travelling to work and I couldn't use my hands to scroll and tap links, so I couldn't read. My eyes would get tired at the end of the day and I'd want to relax, but I still wanted to know what my pals at FourFourTwo had to say. I definitely couldn't read while I was working out.

Around the time of World Cup 2014, I got motivated to find a better way to get access to my favourite writing in a format that would work with my other activities. I'd heard that speech was the latest thing. Siri was taking off. Maybe there was something I could do with this amazing technology that we all have in our phones.

So I started trying the features built into the phone's operating system for reading web pages aloud. I thought it would just work.

But it didn't.

It was a little hard to get the phone to speak a web page, but once I'd worked out how to do it, I was confused by the results. The voice quality was good, but I was having a really hard time understanding the speech. I lost the meaning of what was being read so easily. And it was fatiguing. Even when I could understand what was being said, I felt like I was getting punched in the ears. WTF!?

OK, I thought, I'll try some of the apps out there that claim to be able to read to you.

Still not good.

They had problems recognizing which parts of the web page made up the article I wanted to hear. Or they'd mispronounce things. Or they didn't understand punctuation. Or they didn't have an offline mode where I could queue up things I wanted to hear. And none of them understood what they were reading. They didn't know the difference between a score and a formation. They didn't know that "Man Utd" and "Manchester United" are the same. They sure didn't get that English speakers need to hear French, German, Italian, Portuguese, Spanish and African names read properly. Even the best of them left me tired after listening to the synthesized speech for any length of time. It just wasn't working.
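To make the "Man Utd" and score-versus-formation problems concrete, here is a minimal Python sketch of the kind of clean-up a reader app could do before handing text to a speech synthesizer. It is an illustration only, not code from any real app: the alias table, the function names, and the rules are all assumptions made up for this example. The rules are that a formation is three or more numbers adding up to the ten outfield players, and that a score is two short numbers with 0 spoken as "nil".

```python
import re

# Hypothetical lookup table; a real app would need many more entries.
TEAM_ALIASES = {
    "Man Utd": "Manchester United",
    "Man City": "Manchester City",
}

# Two or more numbers joined by hyphens, e.g. "2-0" or "4-4-2".
NUMBER_GROUP = re.compile(r"\b\d+(?:-\d+)+\b")


def expand_aliases(text):
    """Replace each known abbreviation with the full spoken name."""
    for alias, full_name in TEAM_ALIASES.items():
        text = re.sub(r"\b" + re.escape(alias) + r"\b", full_name, text)
    return text


def speak_number_group(match):
    """Choose a spoken form for a hyphenated group of numbers."""
    parts = match.group(0).split("-")
    # Assumption: a formation has three or more parts that add up to
    # the ten outfield players, and is read digit by digit.
    if len(parts) >= 3 and sum(int(p) for p in parts) == 10:
        return " ".join(parts)
    # Assumption: a score is exactly two short numbers, with 0 read as "nil".
    if len(parts) == 2 and all(len(p) <= 2 for p in parts):
        return " ".join("nil" if int(p) == 0 else p for p in parts)
    # Anything else (dates, seasons like "2014-15") is left untouched.
    return match.group(0)


def prepare_for_speech(text):
    """Expand team aliases, then rewrite scores and formations."""
    return NUMBER_GROUP.sub(speak_number_group, expand_aliases(text))


print(prepare_for_speech("Man Utd won 2-0 playing 4-4-2"))
# prints: Manchester United won 2 nil playing 4 4 2
```

Even this toy version shows why the problem is a numbers game: every rule has exceptions, and each new article tends to turn up a pattern the rules haven't seen yet.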

I'm a programmer, and while I didn't want to write code just to listen to a website, I'd recently worked on an app that did some speech synthesis. I knew the basic ingredients to get text converted to speech. I knew that speech quality had really progressed. I knew that speech synthesizers no longer had to sound like Stephen Hawking. I knew that synthesized speech didn't have to be tiring. And I knew that it had to be possible to fix the problems I was seeing.

But how many problems were we talking about?

I did a quick estimate using an article that I knew was giving the phone a hard time. It had about 5 errors per paragraph, or about 1 error every 10 words! No wonder I couldn't understand it. At first I was depressed. Many smart people had been working on speech synthesis for a long time, and we were still at the stage where 10% of the speech was unintelligible or confusing. How was I going to make a dent in that?

I took a break. Hit the speed bag for a bit.

But the problem niggled at me. I'd heard good synthesized speech. Not every article had the problems to the same extent. I had some ideas about characteristics of human speech that I knew weren't being exploited by the other apps. What if I used that insight and combined it with fixes for all the errors in that article? Would that do it or would it still be exhausting to listen to? Was speech synthesis flawed or was it just a numbers game? Could a war of attrition against the errors produce something that would be enjoyable to listen to?

I formulated a theory: reduce the errors and listener fatigue will go down at the same time as understanding goes up.

So I tested it.

I created an app that could speak an article aloud. I added specific fixes for every oddity in that one problematic article. And I sprinkled in that one weird trick to mimic human speech that I had come up with. And then I listened to that web page. It was amazing. Like night and day. I listened once and then again and again. I could understand! I didn't feel like my ears were being brutalized! Theory confirmed.

And then I listened again. To the same article. And it was annoying. Seriously! What the hell? I'd just fixed all the errors. What now?

It turns out that when you fix the obvious errors, less obvious "errors" become apparent. If you think you might have to listen to something for an hour at a time, you get past listening only for out-and-out defects and you start thinking about fluency. So I made another pass on the article fixing things up. Correcting things that were already "correct". Then I made another pass. Then another. Each time the speech became easier to understand and more pleasant to listen to. It took five passes through that article before I was satisfied.

But I was satisfied. I'd proved to myself that I could take some football news and write an app that could read it aloud. And I'd found a focus for all future work: reduce listener fatigue by increasing the fluency of the speech. It wasn't enough to provide isolated pronunciations for single words; I had to listen to the rhythm and phrasing to keep it fluent.

Getting fluent speech for that one article was a lot of work, but I could see the possibilities. I accepted the challenge in front of me and I started developing an app. That app became Soccer Aloud.

Making the app read the next article fluently was no easier than it had been for the first. It seemed like none of the fixes for the first article applied to the second. Same for the third article and the fourth and the fifth. But I was determined to push on. Each new article came out sounding great, and that was enough. At some point, I don't know when, perhaps 50 articles in, I saw a couple of fixes from previous articles that actually applied to the current one. Perhaps things were going to get easier?