ImCor

As Kobus Barnard and I were (to our knowledge) the first to use images to disambiguate words, it isn't completely surprising that when we went looking for corpora that linked images to disambiguated text, we were unable to find any. That being the case, I set out to build ImCor, which links the Corel image database with the SemCor corpus. You can download the original and expanded ImCor datasets below.

The details on the first version of ImCor were published here, in the eponymous section. Briefly, I wrote a program called CorpusBuilder, soon to be available here (once it is rendered GUI hack-free), which displayed images and text to a user, asked whether an image was appropriate for the text, and then asked the user to select the text that specifically described the image. It then compiled the results into an XML document containing each image and all the "captions" that described it, fully disambiguated from SemCor.
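To give a feel for how such a file might be consumed, here is a minimal Python sketch that reads an image-to-captions XML document and returns, for each image, its captions as lists of (token, sense) pairs. The element and attribute names (corpus, image, file, caption, word, sense), the image path, and the sense keys in the sample are all assumptions made up for illustration; the actual ImCor schema may well differ, so adjust the names to match the downloaded files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document. The tag names, attribute names, image path,
# and sense keys are illustrative assumptions, not the real ImCor schema.
SAMPLE = """
<corpus>
  <image file="corel/100001.jpg">
    <caption>
      <word sense="tiger%1:05:00::">tiger</word>
      <word>in</word>
      <word sense="grass%1:20:00::">grass</word>
    </caption>
  </image>
</corpus>
"""


def load_captions(xml_text):
    """Map each image file to its captions.

    Each caption is a list of (token, sense) pairs; sense is None for
    tokens that carry no sense annotation.
    """
    root = ET.fromstring(xml_text.strip())
    result = {}
    for image in root.iter("image"):
        captions = []
        for caption in image.iter("caption"):
            tokens = [(w.text, w.get("sense")) for w in caption.iter("word")]
            captions.append(tokens)
        result[image.get("file")] = captions
    return result


if __name__ == "__main__":
    for image_file, captions in load_captions(SAMPLE).items():
        for caption in captions:
            print(image_file, caption)
```

Keeping the sense label next to each token means a disambiguation algorithm's output can be scored against it directly, one image at a time.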

We have subsequently built a test corpus in the same format from the Senseval3 English All-Words task data and the Corel images, which, again, will soon be available here. The corpus was built in the same way and is useful for comparing the performance of an image-based disambiguation algorithm against various state-of-the-art text-based algorithms. I hope it will serve other purposes as well, and I look forward to hearing from people who have found it useful.