Background readings:
K. Barnard and D. Forsyth, "Learning the Semantics of Words and Pictures," International Conference on Computer Vision, vol. 2, pp. 408-415, 2001, http://dx.doi.org/10.1109/ICCV.2001.937654
D. Blei and M. Jordan, "Modeling Annotated Data," SIGIR '03: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003, http://dx.doi.org/10.1145/860435.860460 [Yangqing]
T. Berg and D. Forsyth, "Animals on the Web", CVPR 2006, http://dx.doi.org/10.1109/CVPR.2006.57
Contemporary readings:
Chong Wang, D. Blei, Fei-Fei Li, "Simultaneous image classification and annotation," CVPR 2009, http://doi.ieeecomputersociety.org/10.1109/CVPRW.2009.5206800 (Jerry Zhang)
K. Saenko and T. Darrell, "Filtering Abstract Senses From Image Search Results," NIPS 2009, http://books.nips.cc/papers/files/nips22/NIPS2009_1143.pdf
A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier and D. Forsyth, "Every Picture Tells a Story: Generating Sentences from Images," ECCV 2010, http://dx.doi.org/10.1007/978-3-642-15561-1_2 (Ning)
Abhinav Gupta, Praveen Srinivasan, Jianbo Shi and Larry S. Davis, "Understanding Videos, Constructing Plots: Learning a Visually Grounded Storyline Model from Annotated Videos," CVPR 2009, http://www.cs.cmu.edu/~abhinavg/papers/cvpr_2009.pdf
Abhinav Gupta and Larry S. Davis, "Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers," ECCV 2008, http://www.cs.cmu.edu/~abhinavg/papers/eccv_2008.pdf (Slides)