A couple people have asked me over the last few days what I’ve been doing all summer, and it’s taken me way too many words and I’ve probably not said anything comprehensible at all. Here’s a hopefully concise version of it.
So there are these Machine Translation (MT) systems out there. There are a lot of different kinds, depending on how far up the pyramid they go. A graphic would help here:
Okay. See, you start with an English sentence, and you can translate by just going word-for-word (like Babelfish: “the red house” -> “el rojo casa”).
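If you want to see how little is going on at the bottom of the pyramid, here's a minimal sketch in Python. The three-entry dictionary and the function name are made up for this example; a real system would have a much bigger dictionary, but the idea is the same.

```python
# Toy English -> Spanish dictionary, invented for this example.
WORD_DICT = {"the": "el", "red": "rojo", "house": "casa"}


def translate_word_for_word(sentence):
    """Swap each word on its own, keeping the English word order."""
    words = sentence.lower().split()
    # Words that are not in the dictionary pass through untranslated.
    return " ".join(WORD_DICT.get(word, word) for word in words)


print(translate_word_for_word("the red house"))  # el rojo casa
```

Nothing in there knows that Spanish adjectives usually go after the noun, or that "casa" is feminine, which is exactly why the output is wrong.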
Or you can do some analysis on morphology, meaning you deconstruct each word and then reassemble it in the other language (like “I walked” -> (I + walk-past-tense) -> “yo caminé”).
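Here's a rough sketch of the deconstruct-and-reassemble idea, again in Python. Everything in it is an assumption for illustration: the `analyze` function only knows the regular "-ed" past tense, and the tiny Spanish tables only cover first-person singular forms of one verb.

```python
# Toy lookup tables, invented for this example (first-person singular only).
ES_WORDS = {"i": "yo"}
ES_VERBS = {("walk", "past"): "caminé", ("walk", "present"): "camino"}


def analyze(word):
    """Split an English verb into (stem, tense). Handles only regular '-ed'."""
    if word.endswith("ed"):
        return word[:-2], "past"
    return word, "present"


def translate(sentence):
    out = []
    for word in sentence.lower().split():
        if word in ES_WORDS:
            out.append(ES_WORDS[word])
        else:
            # Deconstruct the word, then reassemble the matching Spanish form.
            out.append(ES_VERBS.get(analyze(word), word))
    return " ".join(out)


print(translate("I walked"))  # yo caminé
```

The nice part is that the dictionary only needs one entry per stem and tense, not one entry per surface form.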
Or you can go further to analyze the syntax of the sentence, like “I walked to the store” -> pronoun (I) + verb (walk-past-tense) + prepositional phrase (to the store) -> “yo caminé a la tienda.” This can be useful with languages like German, where word order sometimes flips around a bit.
Or you can analyze the sentence for its meaning, and this is where things get a little hairy. It’d be something like “I walked to the store” -> (first-person pronoun) + (moving on two feet) + (to the store) -> “yo caminé a la tienda.”
And then if you’re a real baller, you can translate everything into Interlingua, The One Universal Language (which doesn’t exist, but bear with me) and then reconstruct in the target language. This hasn’t worked very well, so far.
Anyway, the approach we’re using is still mostly word-for-word. It’s not as dumb as Babelfish, because it includes triples of words (“trigrams”) instead of just words, so it’ll know “the red house” -> “la casa roja”, not “el rojo casa.” So instead of a big word-to-word dictionary, we need a super-big dictionary of three-word phrases. You can’t just go to Webster and look that up… you have to build this yourself.
So we need a lot of parallel phrases in English and, say, German. Where can we find that? Wikipedia. See on the lower right, where it says “this page in other languages”? Great! So if you look for something thrilling like, say, Subprime mortgage financial crisis, you can easily click on the “Français” link and see that, in French, it’s “Crise des subprimes.”
Now, whenever our translation system wants to translate an article containing “subprime mortgage financial crisis” into French, instead of looking for each word, it just goes, bam, crise des subprimes.
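To make the phrase-dictionary idea concrete, here's a minimal Python sketch. It is not our actual system: the table is invented, and the greedy "try the longest phrase first" lookup is just the simplest thing I could write down that shows why a three-word entry beats three one-word entries.

```python
# Toy phrase table, invented for this example. Keys are tuples of words.
PHRASE_TABLE = {
    ("the", "red", "house"): "la casa roja",
    ("the",): "el",
    ("red",): "rojo",
    ("house",): "casa",
}


def translate(sentence, max_len=3):
    """Greedy lookup: at each position, use the longest phrase in the table."""
    words = sentence.lower().split()
    out = []
    i = 0
    while i < len(words):
        for n in range(min(max_len, len(words) - i), 0, -1):
            chunk = tuple(words[i:i + n])
            if chunk in PHRASE_TABLE:
                out.append(PHRASE_TABLE[chunk])
                i += n
                break
        else:
            # No entry at all for this word: pass it through untranslated.
            out.append(words[i])
            i += 1
    return " ".join(out)


print(translate("the red house"))  # la casa roja
```

If you call it with `max_len=1`, you're back to plain word-for-word lookup and you get "el rojo casa" again.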
What have I done? Mostly extracted those page titles, using databases and other cool things, and found that there’s not too much interesting that we can do. Maybe we can use other bits of Wikipedia to find more translated phrases. Maybe we can use knowledge that we learn about these words to help us out further. (Like, if you know “Stevie Wonder” is a guy’s name, then we don’t have to translate it at all, so we won’t accidentally go with “maravilla del Stevie.”)
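For the curious, pulling out title pairs looks roughly like the Python sketch below. Treat it as an assumption-heavy toy: the table and column names are modeled loosely on MediaWiki's `page` and `langlinks` tables, the data is two hand-typed rows in an in-memory SQLite database, and `title_pairs` is a name I made up. It's not the code I actually ran.

```python
import sqlite3

# In-memory toy database; schema loosely modeled on MediaWiki (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (page_id INTEGER PRIMARY KEY, page_title TEXT);
CREATE TABLE langlinks (ll_from INTEGER, ll_lang TEXT, ll_title TEXT);
INSERT INTO page VALUES (1, 'Subprime mortgage financial crisis');
INSERT INTO page VALUES (2, 'Stevie Wonder');
INSERT INTO langlinks VALUES (1, 'fr', 'Crise des subprimes');
INSERT INTO langlinks VALUES (2, 'fr', 'Stevie Wonder');
""")


def title_pairs(conn, lang):
    """Return (English title, foreign title) pairs for one target language."""
    query = """
        SELECT p.page_title, l.ll_title
        FROM page AS p
        JOIN langlinks AS l ON l.ll_from = p.page_id
        WHERE l.ll_lang = ?
    """
    return conn.execute(query, (lang,)).fetchall()


for english, french in title_pairs(conn, "fr"):
    print(english, "->", french)
```

Each pair that comes back is one more entry for the phrase dictionary, and pairs where both sides are identical (like the name here) are a hint that the phrase shouldn't be translated at all.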
Okay, what do I actually do? Mostly dick around with databases and other stupid things. Bang out some code. Read some papers, and then get confused. If you’ve never read academic papers, like Academic Papers, you’re in for a treat. Pick up whatever journal is preeminent in your field, or whatever conference is The Big One, and start reading. If you can follow it for more than like two pages, congratulations! You’re a grad student!
Seriously, though, I’m trying. Eventually this whole field will make sense. I’m figuring I’ll solve machine translation (and every other language-related problem) within the next three years.
I’m starting next year by pushing all this Wikipedia nonsense to the back burner. We’re going to try to develop a representation of language by going all the way up to the semantic level. Yeah. We’re teaching computers what sentences mean. That’s right. And we’re going to test it by using recipes. So by Spring 2008, I will not only have a badass thesis of like tens of pages, but I’ll also have this great recipe software I’ve always dreamed of.
Yeah yeah yeah yeah! Languages are it! The wave of the future! I’ve got new comics coming up, so check them out.