Guidelines for project ideas
Your project should be in the general area of multimodal communication, whether it is parsing, analyzing, searching, or visualizing. We are particularly interested in proposals that make a contribution to integrative cross-modal feature detection tasks. These are tasks that exploit two or even three different modalities, such as text and audio or audio and video, to achieve higher-level semantic interpretations or greater accuracy. You could work on one or more of these modalities.
We invite you to develop your own proposals in this broad and exciting field. We are interested in all aspects of human multimodal communication, such as the relation between verbal constructions and facial expressions, gestures, and auditory expressions. While we list some concrete proposals below, we are also opening up the space to hear your ideas: we want to listen to what you imagine may be possible.
You could focus on a very specific type of gesture, or facial expression, or sound pattern, or linguistic construction, train a classifier using machine learning, and use that classifier to identify the population of this feature in a large dataset. Our aim is to annotate our entire dataset of more than 400,000 news programs, so your application should include methods of locating as well as characterizing the feature or behavior you are targeting. Contact us for access to existing lists of features and sample clips. We will work with you to generate the training set you need, but note that you may need to factor in some time for this in your project proposal.
We are also open to proposals that focus on a particular topic in the news, examining a range of communicative strategies utilized within that particular topic. See for instance the idea "Multimodal rhetoric of climate change" below.
Red Hen provides a multi-leveled set of tools, and we are also interested in proposals that develop new search applications; see for instance the idea "Development of a Query Interface for Parsed Data" below. See also the ideas page of our partner vitrivr at the University of Basel; we are happy to provide joint mentoring and shared datasets.
Finally, we welcome visualization proposals; see for instance our Viz2016 project, visualizing some dimensions of the US Presidential elections. See also the ideas page of our partner, the Experimental Media Research Group (EMRG) at St. Lucas School of Arts in Antwerp; we are happy to provide joint mentoring and shared datasets.
When you plan your proposal, bear in mind that your project should result in a module that is installed on our high-performance computing cluster, fully tested, with clear instructions, and ready to be deployed to process a massive dataset. The architecture of your project should be designed so that it is clear and understandable for coders who come after you, and fully documented, so that you and others can continue to make incremental improvements. Your module should be accompanied by a Python application programming interface (API) that specifies the input and output, to facilitate the development of a unified multimodal processing pipeline for extracting information from text, audio, and video. We prefer projects that use C/C++ and Python and run on Linux. For several of the ideas listed, it's useful to have prior experience with deep learning tools.
Your project should be scaled to the appropriate level of ambition, so that at the end of the summer you have a working product. Be realistic and honest with yourself about what you think you will be able to accomplish in the course of the summer. Provide a detailed list of the steps you believe are needed, the tools you propose to use, and a weekly schedule of milestones. Choose a task you care about, in an area where you want to grow. The most important thing is that you are passionate about what you are going to work on with us. We look forward to welcoming you to the Red Hen team!
1. Emotion detection and characterization
Develop and deploy emotion-detection tools in language, voice qualities, gestures, and/or facial expressions to achieve a more complex, nuanced, and integrated characterization of emotions. It will be useful to focus on a subset of emotions; the system should be constructed so that it can be extended.
The components may include natural language processing tools, audio frequency analysis, and/or deep learning techniques. The API should be a Python script specifying audio/video and text input conditions and an output in JSON Lines annotations.
It may be an effective first step to focus on a single modality and use that to identify instances of the emotion you are targeting in a large dataset. Think strategically about how your project could help users locate and characterize instances of emotional expressions in very large datasets of television news, YouTube videos, or films.
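As a rough illustration of what JSON Lines annotations could look like, here is a minimal sketch. The field names (start, end, modality, label, confidence) and the EmotionAnnotation class are hypothetical and are not a Red Hen specification; the only fixed point is the JSON Lines convention of one JSON object per line.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class EmotionAnnotation:
    """One detected emotion span. All field names are hypothetical."""
    start: float       # seconds from the start of the recording
    end: float         # seconds from the start of the recording
    modality: str      # "text", "audio", or "video"
    label: str         # e.g. "anger"
    confidence: float  # classifier score in [0, 1]


def to_json_lines(annotations):
    """Serialize annotations as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(a), sort_keys=True) for a in annotations)


def from_json_lines(text):
    """Parse JSON Lines text back into EmotionAnnotation objects."""
    return [EmotionAnnotation(**json.loads(line))
            for line in text.splitlines() if line.strip()]


# Two made-up annotations, for illustration only.
annotations = [
    EmotionAnnotation(12.0, 15.5, "audio", "anger", 0.82),
    EmotionAnnotation(14.0, 16.0, "video", "anger", 0.67),
]
jsonl = to_json_lines(annotations)
print(jsonl)
```

Keeping one record per line means that annotations from different modalities can be appended to the same file and filtered later without parsing the whole file at once.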
2. Constructions for epistemic stance
Multimodal registers of communication often function to strengthen the credibility or cast doubt on what is being said, or to qualify what is being said as fantastic, wildly improbable, or merely slightly implausible. A proposal will focus on specific examples of evidentiary qualifications, but built in such a way that it can be extended.
The components may include linguistic elements, tone of voice, eye direction and head direction (such as a side eye), and gestures. The aim is to show how new meanings emerge from the combination of features. For some examples, see this video with comments.
As in the case of emotion, it may be an effective first step to focus on a single modality and use that to identify instances of the epistemic stance you are targeting in a large dataset. Think strategically about how your project could help users locate and characterize instances of epistemic constructions in very large datasets of television news, YouTube videos, or films.
3. Multitask multimodal architecture for audio, video, and text
Speakers routinely utilize modulations in prosody -- in the intonation, emphasis, speed, and rhythm of speech -- to convey meaning, in ways that are typically not represented in the choice of words. Most speech tasks require multimodal features, drawing on both text and audio as they co-occur and on the semantic context. It therefore makes sense to model both audio and text in the same semantic space. At a later stage, we will also consider strategies for incorporating visual information into the multimodal embeddings.
Initially, the network will be similar to the Word2Vec architecture, taking into account attention from both audio (whether prosody or MFCC features) and text. The end result will be a multimodal feature space that can be used for various behavior analysis applications.
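To make the idea of a shared semantic space concrete, here is a minimal sketch that maps a text vector and an audio vector into the same space and compares them with cosine similarity. It is not the proposed architecture: the projection matrices are random and untrained, there is no attention mechanism, and the dimensions and input vectors are invented for illustration.

```python
import math
import random


def make_projection(in_dim, out_dim, rng):
    """Random (untrained) matrix with out_dim rows and in_dim columns."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.gauss(0.0, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, vector):
    """Multiply a matrix by a vector, mapping it into the shared space."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


rng = random.Random(0)
TEXT_DIM, AUDIO_DIM, SHARED_DIM = 8, 13, 4  # hypothetical sizes

text_to_shared = make_projection(TEXT_DIM, SHARED_DIM, rng)
audio_to_shared = make_projection(AUDIO_DIM, SHARED_DIM, rng)

# Placeholder inputs standing in for a word embedding and an MFCC frame.
word_vec = [rng.random() for _ in range(TEXT_DIM)]
audio_vec = [rng.random() for _ in range(AUDIO_DIM)]

shared_text = project(text_to_shared, word_vec)
shared_audio = project(audio_to_shared, audio_vec)
similarity = cosine(shared_text, shared_audio)
```

In a trained model the two projections would be learned jointly, so that words and the audio segments they co-occur with end up close together in the shared space.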
The project direction will be formulated as we progress. It will require familiarity with handling text/audio/visual data and designing neural architectures.
4. Multi-speaker speech-to-text
Current speech-to-text tools are excellent when trained to a specific speaker, but poor with untrained voices. Our audio pipeline, built by GSoC2015 students, provides efficient methods of forced alignment, speaker diarization, and speaker recognition. We welcome proposals for projects that extend these capabilities into a system for semi-supervised or automated speaker training for multi-speaker speech-to-text.
It would be possible, for instance, to develop a system that uses shows with good transcripts and named speakers to automate the training of classifiers for speaker recognition; this could be used to improve the speech-to-text. Existing Red Hen tools for speaker diarization and automated visual recognition of faces could be combined to improve results in shows with no transcripts. Another interesting cross-modal option is to combine computer vision lip reading with audio signal processing.
See also the ideas page of our partner CCExtractor on speech-to-text; we often do joint mentoring and provide access to shared datasets.
5. Multimodal rhetoric in television news
We invite proposals for studying how specific topics are presented in the news, across different networks and over time, using a combination of natural language processing, audio signal analysis, and computer vision.
As an example of a cross-modal learning project focused on a specific topic, consider the multimodal rhetoric of climate change. A recent study found that the coverage of climate change denial in the media had the effect of nullifying accurate information about climate change simply by sowing doubt, but that this effect can be countered by a kind of vaccine or inoculation:
- A general inoculation, consisting of a warning that “some politically-motivated groups use misleading tactics to try and convince the public that there is a lot of disagreement among scientists”.
- A detailed inoculation that picks apart the Oregon petition specifically, for example by highlighting that some of the signatories are fraudulent, such as Charles Darwin and members of the Spice Girls, and that less than 1% of signatories have backgrounds in climate science.
The study shows that when this information is provided alongside the scientific claim, it has a big effect, preventing most of the nullification, though interestingly not all of it -- the misinformation is still present. The study provides detailed quantitative measures of how much attitudes shift in response to the inoculation. An attractive project would build on this study and attempt to quantify the rhetorical effect of the climate change coverage of the various networks, resulting in predictions of public opinion that can be tested using polls and surveys.
Think of and develop ways in which interesting questions in the news can be quantified using computational and statistical tools of multimodal datamining. We will provide expert mentors in this cutting-edge research area.
6. Opening the digital silo: Multimodal television show segmentation
Libraries and research institutions in the humanities and social sciences often have large collections of legacy video tape recordings that they have digitized, but cannot usefully access -- this is known as the "digital silo" problem. Red Hen is working with several university libraries on this problem, and several of this year's ideas contribute to the solution. A basic task we need to solve is television program segmentation. The UCLA Library, for instance, is digitizing its back catalog of several hundred thousand hours of news recordings from the Watergate Hearings in 1973 to 2006. These digitized files have eight hours of programming that must be segmented at their natural boundaries.
We welcome proposals for a segmentation pipeline. An optimal approach is to use a combination of text, audio, and visual cues to detect the show and episode boundaries. Your project should assemble multiple cues associated with these boundaries, from recurring phrases, theme music, and opening visual sequences, and then develop robust statistical methods to locate the most probable spot where one show ends and another begins. We are open to your suggestions for how to solve this challenge.
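One simple way to combine cues is a weighted average of per-window scores, picking the window with the highest combined score. The sketch below assumes each cue detector already emits a score in [0, 1] for each time window; the cue names, scores, and equal weights are made up, and a weighted average is only a stand-in for the more robust statistical methods a proposal would need.

```python
def most_probable_boundary(cue_scores, weights):
    """Return (window_index, combined_score) for the likeliest boundary.

    cue_scores maps a cue name to a list of per-window scores;
    weights maps the same cue names to their relative importance.
    """
    lengths = {len(scores) for scores in cue_scores.values()}
    if len(lengths) != 1:
        raise ValueError("all cues must cover the same number of windows")
    n_windows = lengths.pop()
    combined = [
        sum(weights[name] * scores[t] for name, scores in cue_scores.items())
        for t in range(n_windows)
    ]
    best = max(range(n_windows), key=combined.__getitem__)
    return best, combined[best]


# Hypothetical scores for six consecutive windows.
cue_scores = {
    "recurring_phrase": [0.0, 0.1, 0.2, 0.9, 0.1, 0.0],
    "theme_music":      [0.1, 0.0, 0.6, 0.8, 0.2, 0.1],
    "opening_visuals":  [0.0, 0.2, 0.1, 0.7, 0.3, 0.0],
}
weights = {name: 1.0 / len(cue_scores) for name in cue_scores}

best_window, best_score = most_probable_boundary(cue_scores, weights)
print(best_window, round(best_score, 3))
```

With these numbers the fourth window (index 3) wins because all three cues agree there, whereas the theme-music cue alone would also have rated the third window fairly highly.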
A sub-task is to generate useful metadata for indexing the segmented shows. This includes the show name and the broadcast time, and also data such as the name of the anchor and guests. The proposal may include some manual annotation of boundaries, but the project can also use our born-digital collection for example boundary conditions. Each show typically has a recurring intro that can be used to locate the boundary and also identify the show.
When you think about how to develop this project, consider how you can build a system where a relatively naive user would be able to submit a collection of annotated boundary conditions, either through a step-by-step feature extraction method or through convolutional neural networks, to teach or train your system to handle a new dataset. The completed project should achieve robust results in a fully automated pipeline.
7. Optimizing deep learning pipelines
As teams of developers build deep learning pipelines, a common result is a tangle of modules that repeat certain basic tasks. For instance, the pipeline may contain multiple Caffe models or other neural networks, each performing a specific task from scratch. This is inefficient and wastes computational resources.
To help teams improve the efficiency of their deep learning pipelines, Red Hen welcomes proposals for optimizing strategies. We envisage the development of easily understandable procedures for establishing an information architecture where a single neural network evaluation is run for each unit of analysis, such as an image, creating a single feature vector. This vector should contain all of the different features that are needed by the various target applications in the pipeline, such as face recognition, object recognition, location recognition, and so on.
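The core idea, one evaluation per unit of analysis with every task reading from the same feature vector, can be sketched with a cache. Everything here is a stand-in: shared_features fakes the network forward pass with a toy deterministic vector, and the three head functions are hypothetical placeholders for real task-specific classifiers.

```python
from functools import lru_cache

forward_passes = {"count": 0}  # counts how often the "network" is run


@lru_cache(maxsize=None)
def shared_features(image_id):
    """Stand-in for one expensive network evaluation per image."""
    forward_passes["count"] += 1
    # Toy deterministic vector; a real pipeline would run the network here.
    return tuple(float(ord(ch) % 7) for ch in image_id)


# Hypothetical task heads that all consume the same feature vector.
def face_head(features):
    return sum(features) / len(features)


def object_head(features):
    return max(features)


def location_head(features):
    return min(features)


def analyze(image_id):
    """Run every task on one image using a single feature extraction."""
    features = shared_features(image_id)
    return {
        "face": face_head(features),
        "object": object_head(features),
        "location": location_head(features),
    }


first = analyze("frame_0001")
second = analyze("frame_0001")  # served from the cache, no new forward pass
print(forward_passes["count"])
```

The same pattern extends to a distributed setting if the cache is replaced by a shared store keyed on the unit of analysis.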
We look forward to engaging with you on how to handle pipelines that include video analysis, or a mix of audio and text processing, or distributed architectures where feature vectors could be shared between locations.
For this task, you should already be intimately familiar with the operation of convolutional neural networks and Caffe models for computer vision tasks. We will give you access to one of our high-performance computing pipelines.
8. Development of a Query Interface for Parsed Data
The task is to create a new and improved version of a graphical user interface for graph-based search on dependency-annotated data. The new version should have all functionality provided by the prototype plus a set of new features. The back end is already in place.
- add nodes to the query graph
- offer choice of dependency relation, PoS/word class based on the configuration in the database (the database is already there)
- allow for use of a hierarchy of dependencies (if supported by the grammatical model)
- allow for word/lemma search
- allow one node to be a "collo-item" (i.e. collocate or collexeme in a collostructional analysis)
- color nodes based on a finite list of colors
- paginate results
- export xls of collo-items
- create a JSON object that represents the query to pass it on to the back-end
- allow for removal of nodes
- allow for query graphs that are not trees
- allow for specification of the order of the elements
- pagination of search results should be possible even if several browser windows or tabs are open.
- configurable export to csv for use with R
- compatibility with all major Web Browsers (IE, Firefox, Chrome, Safari) [currently, IE is not supported]
- parse of example sentence can be used as the basis of a query ("query by example")
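As an illustration of the JSON object that represents a query, here is a minimal sketch. The schema (the keys nodes, edges, ordered, and collo_item) is hypothetical and does not describe the existing back end; reading "one node" as "at most one collo-item" is also an assumption. Storing edges as a flat list rather than nesting children under parents is what allows query graphs that are not trees.

```python
import json


def build_query(nodes, edges, ordered=False):
    """Return a JSON string describing a dependency query graph.

    The schema used here is hypothetical, for illustration only.
    """
    ids = [node["id"] for node in nodes]
    if len(ids) != len(set(ids)):
        raise ValueError("node ids must be unique")
    for edge in edges:
        if edge["head"] not in ids or edge["dependent"] not in ids:
            raise ValueError("edge refers to an unknown node: %r" % (edge,))
    if sum(1 for node in nodes if node.get("collo_item")) > 1:
        raise ValueError("at most one node may be the collo-item")
    query = {"nodes": nodes, "edges": edges, "ordered": ordered}
    return json.dumps(query, sort_keys=True)


# A two-node query: a verb lemma with any noun as its direct object.
nodes = [
    {"id": "n1", "lemma": "give", "pos": "VERB"},
    {"id": "n2", "pos": "NOUN", "collo_item": True},
]
edges = [{"head": "n1", "dependent": "n2", "relation": "dobj"}]

query_json = build_query(nodes, edges, ordered=True)
print(query_json)
```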
1. Go to http://www.treebank.info and play around with the interface (user: gsoc2015, password: redhen) [taz is a German corpus, the other two are English]
3. Think about the HTML representation. We would like it to be HTML5/CSS3, but for the moment we are not sure whether we can meet the requirements without major work on <canvas> or whether we can have sensible widgets without having to dig into the <canvas> tag.
4. Contact Peter Uhrig to discuss details or ask for clarification on any point.
Medium to hard.
9. Modeling interpretive frames
Red Hen specializes in communicative learning, analyzed in terms of communicative effects and communicative intent. Much of datamining is instead focused on establishing basic categorical facts, in a positivist tradition. Consider for instance the ICLR 2016 presentation "Order-Embeddings of Images and Language": it treats the relation of images and words in terms of basic category membership. Red Hen asks questions of a different order: what is the communicative intent, and what are the communicative effects, of the image of Donald Trump, or of the woman walking her dog in the park? After all, these artifacts are not "found objects"; they are deliberately crafted objects with a communicative intent and a communicative effect. Positivist datamining treats the source and use of the image as irrelevant, and completely ignores communicative intents.
The relation between categorical facts and communicative intents and effects is interestingly complex. On the one hand, communicative intents and effects utilize and build on categorical facts, in the basic act of turning a series of events into a news story. On the other hand, communicative intent can disregard the facts. Consider people of opposed political viewpoints regarding a series of events, or even the same person in different discourse situations regarding that series of facts. These opposed people, or even the same person in different situations, can call up interpretive frames that highlight or emphasize, discount or ignore what look like facts from different interpretive frames. We see this in the current moment in discussions of presidential politics and policies. Different camps often seem certain of what they are asserting about facts, especially about the interpretation of facts, or the dismissal of facts as irrelevant, or the highlighting of facts as crucial, and there is consistency across these assertions for people using the same interpretive frames. Consider, for example, the photograph of Democrat Charles Schumer with Vladimir Putin. To one side, the fact shows that the investigation of Jeff Sessions's meeting with the Russian ambassador is merely a politically-motivated witch-hunt, while the other side is appalled that anyone would dream of imagining that these two things are comparable or even that the picture of Schumer is a "fact" of the sort that the other side imagines. The news often presents the world in a way that benefits some people more than others, and it typically does this by providing supporting evidence. What counts as evidence? That depends upon the interpretive frames used.
Some have argued that the media over time have built up a representation of the world -- what we call "interpretive frames" -- that is increasingly out of touch with the facts. An interesting large-scale task would be to replicate this process computationally -- that is to say, to create a model of what the world looks like according to what is shown in the media. Early studies along these lines include the work of George Gerbner, whose Cultivation Theory work asserts that the more television you watch, the more dangerous you think your neighborhood or your city is. So we could do large-scale studies of topics like crime, immigration, standards of beauty, climate change, consumer culture, and so on to map the modern mind, as it is generated by the media. If we could do this along a few dimensions, it would be sufficient to demonstrate the point. It would be best to investigate contrasting sets of interpretive frames in the media. Interpretive frames are used by everyone.
We can think of this as programming to generate interpretive frames -- conceptual structures that predispose us to look at the world in a certain way and expect certain outcomes. Frames are Bayesian constructs, consisting of aggregated information that generates a pattern of expectation. This pattern is probabilistic, but constrained within a range by expectations; see for instance the concept of the Overton Window. Bayesian reasoning is key to understanding how communicative intent is realized and generates the expected communicative effects. It's important to realize that communicative effects are inferential, and based on expectations. Recall, for instance, the story of "Silver Blaze", where Sherlock Holmes infers who the thief is based on the absence of something predicted: the dog that didn't bark. This is communicative learning. A categorical datamining approach would see nothing; a communicative effects approach would model how inferences are made from violations of expectations.
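The "Silver Blaze" inference can be written as a small Bayesian update, which may help when thinking about how to model violated expectations. The hypotheses and all probabilities below are invented for illustration; the only substantive assumption is that a dog is much more likely to stay silent for someone it knows than for a stranger.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' rule: P(h | evidence) is proportional to P(h) * P(evidence | h)."""
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(unnormalized.values())
    return {h: value / evidence for h, value in unnormalized.items()}


# Who took the horse? Equal prior belief in both hypotheses (illustrative).
priors = {"stranger": 0.5, "familiar": 0.5}

# Probability that the dog stays silent under each hypothesis (illustrative).
silence_likelihood = {"stranger": 0.1, "familiar": 0.9}

beliefs = posterior(priors, silence_likelihood)
print(beliefs)
```

With these numbers, observing silence moves the belief in "familiar" from 0.5 to 0.9: the absence of an expected bark is itself evidence, which is what a purely categorical approach would miss.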
We welcome proposals that make progress on detecting, characterizing, and modeling interpretive frames. Such proposals may make foundational contributions, or clearly defined incremental contributions to existing techniques. You need to convey a clear and concrete understanding of interpretive frames and communicative learning in your proposal; if these concepts are not clear to you, do not attempt this task.