Cyril Goutte

Not so news

The man who did (not) see the bear

posted ‎‎25 Sep 2009 11:33‎‎ by Cyril Goutte

This morning, as I was leaving the bike path that runs behind the CRTL, I found myself almost face to face with a police car driving on the bike path.  The driver called out to me through the window: "Hey! You haven't seen a bear, by any chance?"

Uh, no, sorry, no bear this morning.

The problem with PowerPoint...

posted ‎‎20 Aug 2009 18:34‎‎ by Cyril Goutte

The problem with PowerPoint is that PowerPoint presentations are awfully boring.  Mega yawn.  Wake-me-up-when-it's-over boring.

I think I may have seen one PowerPoint presentation that was "really inspiring and enthusiastic."  I suspect I never made one that was either.

Most of the other inspiring presentations I can think of were projected PDFs.


Karsh festival in Ottawa

posted ‎‎22 Jun 2009 06:47‎‎ by Cyril Goutte   [ updated ‎‎15 Jul 2009 19:33‎‎ ]

Ottawa is celebrating Yousuf Karsh by holding a Karsh festival from 12 June to 13 September 2009.

Karsh is a wonderful portrait photographer (one of his many claims to fame is the photograph of an angry Churchill, for which he presumably stole the guy's cigar to get an added menacing scowl).

The festival takes place in many locations.  For example, the Museum of Science and Technology, in addition to photographs, is displaying large format cameras.   Bring kids to show them what cameras looked like before they became cell phone features!

Translation quality matrix

posted ‎‎22 May 2009 12:55‎‎ by Cyril Goutte

I'm really slow but I just recently found this matrix of Machine Translation performance between 11 European languages (110 different systems).

Not surprisingly, the highest "quality" (at least as measured by BLEU) is Portuguese-to-Spanish.

More surprising: the difference between e.g. Italian->Spanish and Spanish->Italian.  Or the TER score of anything going to Greek (maybe a tokenization bug here?).

Anyway it's also a great entry point to various mono- and multi-lingual resources.

SMART workshop

posted ‎‎4 May 2009 07:44‎‎ by Cyril Goutte

The EU project known as SMART (Statistical Multilingual Analysis for Retrieval and Translation) will organise a scientific workshop next week in Barcelona in connection with the EAMT 2009 conference.

It should be a good way to learn more about the fancy new stuff that has been developed within and around that project.

I will give a talk on "Improving Statistical Machine Translation by learning the translation direction" in the afternoon.

Blogue

posted ‎‎30 Apr 2009 08:49‎‎ by Cyril Goutte

The "blogue" of a science journalist who is not afraid to go a little against the grain of prevailing conformism: Valérie Borde.

One just wonders: why "en colère" (angry)?  In France, when people are angry, they go on strike (or at least hold a small demonstration).  Here, Ms. Borde writes extra articles.  These Quebecers are crazy! (*)

I'm off, I'm going to get some Tamiflu.


(*): with my apologies to Obelix.


Reviewing, judging and corruption

posted ‎‎30 Apr 2009 07:35‎‎ by Cyril Goutte   [ updated ‎‎22 May 2009 12:54‎‎ ]

A recent article in The Independent (UK), "The murky music prize", makes the small concerns of scientists (including Machine Learning researchers) about reviewing seem pretty light in comparison.

Of course, it's a lot easier to fake a scientific paper than a live musical performance (insert Milli Vanilli joke here).

The Unreasonable Effectiveness of Data

posted ‎‎7 Apr 2009 07:36‎‎ by Cyril Goutte

In a recent paper published in IEEE Intelligent Systems, Alon Halevy, Peter Norvig and Fernando Pereira talk about The Unreasonable Effectiveness of Data in many natural language applications.

Of course, at first glance, it's very much (smart) Google folks preaching their own gospel, so to speak.  It's also very timely, as the effectiveness of data has been having an impact for some time:
  • Computational Linguistics has gone from symbolic to statistical methods over the 1990s, and with that switch came an extreme reliance on large amounts of data.
  • There is a widespread (imho correct but poorly supported) belief in Machine Translation that doubling the amount of data produces essentially a constant gain in BLEU score.
  • A recent lecturer at our institute was making the point that waiting for the web to grow was arguably a more efficient way to increase system performance than working on better models [but where would the fun be in that, right?]
Of course, being interested in statistical learning, I'm not about to argue against the appropriateness of learning from data.
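
To make the "constant gain per doubling" belief concrete, here is a minimal sketch, assuming the relationship is exactly log-linear: BLEU(N) = BLEU(N0) + g * log2(N / N0).  The function name `predicted_bleu` and every number below are invented for illustration; they are not measurements from any system.

```python
import math


def predicted_bleu(n_pairs, base_pairs, base_bleu, gain_per_doubling):
    """BLEU predicted under the (assumed) log-linear rule of thumb."""
    # Number of times the baseline corpus has been doubled to reach n_pairs.
    doublings = math.log2(n_pairs / base_pairs)
    return base_bleu + gain_per_doubling * doublings


# Hypothetical baseline: 1M sentence pairs, BLEU 25, +1 BLEU per doubling.
BASE_PAIRS = 1_000_000
BASE_BLEU = 25.0
GAIN = 1.0

for factor in (1, 2, 4, 8):
    n = BASE_PAIRS * factor
    bleu = predicted_bleu(n, BASE_PAIRS, BASE_BLEU, GAIN)
    print(f"{n:>10,d} sentence pairs: predicted BLEU {bleu:.1f}")
```

Under this assumption each additional BLEU point costs twice as much data as the previous one, which is one way to see why waiting for the web to grow only goes so far.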

However, I find the over-reliance on data and its natural growth on the web slightly unconvincing, especially when it is used as an excuse for not working on the actual models.

Machine Translation seems particularly interesting because it is regularly evaluated in public bakeoffs or internal project evaluations.  There is a clear trend that increasing data does regularly increase performance.  Note, however, that before the current state-of-the-art phrase-based MT models, there was a set of word-based models collectively referred to as "IBM models".  It is very unlikely that, even with massive amounts of additional data, IBM models would outperform current phrase-based approaches.  (In fact some IBM models are used as preprocessing when learning phrase-based models.)

So: the availability of growing amounts of data, however unreasonably effective it may be, should not, imho, replace sound work on the modelling part.  It should, on the other hand, help us think of what features are important in a model.  Scaling is of crucial interest, as shown by the recent work on fast classifier training.  In that context, I believe it is also important to think about non-parametric models, i.e. models for which the effective model complexity grows with the amount of training data.
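
As a rough illustration of what "complexity grows with the data" means, here is a toy sketch on one-dimensional inputs.  Both classes are hypothetical and written only for this example: a class-mean classifier stands in for a parametric model, and a 1-nearest-neighbour classifier stands in for a non-parametric one.

```python
class MeanClassifier:
    """Parametric stand-in: keeps one mean per class, whatever the data size."""

    def fit(self, xs, ys):
        self.means = {}
        for label in set(ys):
            points = [x for x, y in zip(xs, ys) if y == label]
            self.means[label] = sum(points) / len(points)
        return self

    def predict(self, x):
        # Pick the class whose mean is closest to x.
        return min(self.means, key=lambda label: abs(x - self.means[label]))

    def size(self):
        return len(self.means)


class NearestNeighbour:
    """Non-parametric stand-in: keeps every training example."""

    def fit(self, xs, ys):
        self.data = list(zip(xs, ys))
        return self

    def predict(self, x):
        # Return the label of the single closest stored example.
        return min(self.data, key=lambda pair: abs(x - pair[0]))[1]

    def size(self):
        return len(self.data)


for n in (10, 100, 1000):
    xs = list(range(n))
    ys = [int(x >= n / 2) for x in xs]  # two classes split at the midpoint
    parametric = MeanClassifier().fit(xs, ys)
    nonparametric = NearestNeighbour().fit(xs, ys)
    print(n, "examples:", parametric.size(), "stored values vs", nonparametric.size())
```

The first model stores two numbers no matter how much data it sees, while what the second one stores grows linearly with the training set, which is also why scaling becomes a concern for such models.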

I am not a number (but maybe I should be?)

posted ‎‎29 Mar 2009 06:42‎‎ by Cyril Goutte   [ updated ‎‎29 Mar 2009 07:28‎‎ ]

An interesting article on the Science website on assigning scientists a unique identification number.

Even in my case, with a fairly distinctive name, it turns out that publication search is not that straightforward:
  1. There are a number of other "C. Goutte", including at least one Caroline with fairly well-cited papers in the field of medicine.
  2. On one of my well-cited papers I'm credited only as "C. Goutte", making it difficult for search engines like Scholar to tie it to my full name.
Having a unique ID would make search a lot simpler... assuming the ID is truly unique (not the case yet), and that search engines can somehow extract it reliably!

The world in pictures

posted ‎‎25 Mar 2009 12:29‎‎ by Cyril Goutte   [ updated ‎‎30 Apr 2009 08:43‎‎ ]

Interesting series of pictures in the Christian Science Monitor: a tour of the world's most polluted cities.

Reminiscent of the exhibition "Imaging a shattering earth", presented last year at the CMCP and originally developed at the Oakland University Art Gallery. (Great website with pictures, lots of related material and links to podcasts of a tour of the CMCP exhibition, and therefore available in French too.)

[The date on this post is all wrong: it was posted on 30 April 2009]
