Starting with the retrieval, the large number of potential UGC sources and their heterogeneous nature forced us to select the most promising and accessible ones. The candidates included the micro-blogging service Twitter, the photo-sharing sites Flickr, Panoramio, Picasa, ImageShack and PhotoBucket, the video-sharing site YouTube, and the social network Facebook. We did not consider sites with a national focus, since our regional focus is on South-West Europe (i.e. France, Italy, Spain and Portugal).
For the prototype, we decided on Twitter and Flickr: both have a well-documented API that allows detailed queries without prohibitive rate limits, both have a large potential content base on forest fires that requires automatic processing, and together they represent textual and visual UGC. The other sites either did not have a sufficiently developed or maintained API (e.g. ImageShack), would provide a redundant type of UGC (images), had a comparatively small content base (e.g. YouTube), or placed more privacy restrictions on shared content (e.g. Facebook). Security restrictions in the IT infrastructure at our disposal and the wish to have full control over the retrieval process led us to implement the retrieval as local Java scripts that are scheduled to run at regular intervals and query the Flickr Search API and the Twitter Streaming API. Concerning the storage of the retrieved UGC, the large volume expected from tests (about 5 GB per day) ruled out the local DBMSs we experimented with (PostgreSQL, MySQL and MongoDB), while the lack of spatial analysis capabilities ruled out the cloud storage services we tested (Google Cloud Storage/BigQuery). Since our local IT infrastructure offered a supported Oracle DBMS, we implemented the storage with it, with the retrieval scripts formatting and validating the UGC before loading it into Oracle.
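As an illustration of the retrieval step, the sketch below queries the Flickr Search API for recently uploaded, fire-related photos. It is only a minimal Python sketch: the API key, keyword list and storage step are placeholders, and the production retrieval runs as scheduled Java scripts, with an analogous scheduled consumer for the Twitter Streaming API.

```python
# Illustrative retrieval sketch: poll the Flickr Search API for recent
# fire-related photos. API key, keywords and the storage step are placeholders;
# the production scripts are Java programs scheduled on our infrastructure.
import time
import requests

FLICKR_ENDPOINT = "https://api.flickr.com/services/rest/"
API_KEY = "YOUR_FLICKR_API_KEY"               # placeholder
KEYWORDS = "forest fire incendio incendie"    # illustrative search terms

def fetch_recent_photos(since_hours=1):
    """Return photos uploaded in the last `since_hours` matching the keywords."""
    params = {
        "method": "flickr.photos.search",
        "api_key": API_KEY,
        "text": KEYWORDS,
        "min_upload_date": int(time.time()) - since_hours * 3600,
        "extras": "geo,description,date_upload",
        "format": "json",
        "nojsoncallback": 1,
    }
    response = requests.get(FLICKR_ENDPOINT, params=params, timeout=30)
    response.raise_for_status()
    return response.json()["photos"]["photo"]

for photo in fetch_recent_photos():
    # In production each item is validated, formatted and inserted into Oracle.
    print(photo["id"], photo.get("latitude"), photo.get("longitude"))
```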
As argued in the previous section, the enrichment of the retrieved UGC with additional information serves to assess both its credibility and its relevance, and comprises several sub-tasks. The ethical implications of actively retrieving and analysing additional information about the source (user) led us to forgo this sub-task for the moment.
The difficulties that existing document classification and natural language processing systems have with short, unstructured text such as Tweets and Flickr image descriptions, coupled with the relatively well-focused application case (forest fires), led us to adopt a different approach: detecting keywords in a piece of UGC and comparing this keyword occurrence with the knowledge gained from manually annotating a sample of high-potential Tweets for topicality, i.e. analysing the keyword occurrences of those about forest fires and of those that were not. Further advantages of this approach are its simplicity, speed and reliability: string matching to find keywords is quick because of the short message length and the limited set of keywords to look for. Misspellings are not an immediate concern, because the retrieval itself is keyword-based: misspelled keywords were most likely never retrieved in the first place. Changes to the topicality scoring are also easily and quickly adopted as long as they rely on the original set of keywords. The search for keywords and the subsequent assignment of a topicality score are implemented as a scheduled PL/SQL job to take advantage of its speed.
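The following Python sketch illustrates the kind of keyword matching and topicality scoring described above. The keyword list, weights and threshold are hypothetical (in practice they are derived from the manually annotated sample), and the production implementation is a PL/SQL job running inside Oracle.

```python
# Illustrative keyword-based topicality scoring; keywords, weights and the
# threshold are hypothetical, the production version is a scheduled PL/SQL job.

# Example multilingual fire-related keywords with illustrative weights
KEYWORD_WEIGHTS = {
    "forest fire": 1.0, "wildfire": 1.0, "incendio": 0.9,
    "incendie": 0.9, "fogo": 0.8, "smoke": 0.4, "fumo": 0.4,
}

def topicality_score(text):
    """Sum the weights of all keywords found in the (lower-cased) message."""
    text = text.lower()
    return sum(w for kw, w in KEYWORD_WEIGHTS.items() if kw in text)

def is_on_topic(text, threshold=1.0):
    # Threshold would be calibrated against the annotated Tweet sample.
    return topicality_score(text) >= threshold

print(is_on_topic("Huge wildfire near Valencia, lots of smoke"))  # True
```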
For determining any geographic context, finding the location is a prerequisite. For various reasons, the share of geo-referenced UGC is still low. Even for geo-referenced UGC, the coordinates are known to be an unreliable source, since the geo-referencing depends on the hardware specifications of the device used, on the software used to report the UGC, on the option settings chosen by the user, and on any geo-coding done by the social media platform. Further, the coordinates do not have to represent the location that the UGC is about. For these reasons, extracting place names from the content and geo-coding them is indispensable. For the geo-referencing process we tried several applications. One major challenge we encountered was the multilingualism of our data, which comes from non-English-speaking countries, while geo-referencing and Named Entity Recognition software has mainly been developed for English text. One application that can deal with several languages is Yahoo! Placemaker. A major limitation, besides the limited number of queries per hour, is that the language of the input text must be identified beforehand (experiments using the wrong language led to wrong results). This could be solved in two different ways, both unsatisfactory from our point of view. The first possible approach is to use the Tweet metadata about the language, but our manual annotation of data found that many users living in non-English-speaking countries never change the default setting, which is English (EN). The second approach is to add a computational layer that automatically detects the language (a possible parser was the Google Translate application), but due to the unstructured nature of Tweets (grammar is often ignored by users because of the limited message length) this approach produced unacceptable results and was discarded. Further, with the exception of Yahoo! Placemaker, the tested applications are not capable of extracting place names from free text, but need already well-formatted place names as input, and they usually cannot deal with multiple place names per UGC item. For these reasons, we implemented a simple search for place names that string-matches the words of the UGC content against the Geographical Information System of the Commission (GISCO) database of place names for our area of interest (Spain, Portugal, France, Italy) at the most detailed resolution, which is LAU2 (formerly NUTS 5), again implemented as a scheduled PL/SQL job.
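A minimal Python sketch of this place-name matching is shown below. The toy gazetteer, its LAU2 codes and the accent-stripping normalisation are assumptions for illustration only; the production version is a scheduled PL/SQL job matching against the GISCO place-name tables in Oracle.

```python
# Sketch of place-name matching against a GISCO-derived LAU2 gazetteer.
# Gazetteer contents, codes and normalisation rules are illustrative
# assumptions; the production version runs as a PL/SQL job in Oracle.
import re
import unicodedata

def normalise(token):
    """Lower-case and strip accents so e.g. 'Malaga' matches 'Málaga'."""
    token = unicodedata.normalize("NFKD", token)
    return "".join(c for c in token if not unicodedata.combining(c)).lower()

def find_places(text, gazetteer):
    """Return LAU2 codes whose place name occurs in the message.

    `gazetteer` maps a normalised place name to its LAU2 code."""
    tokens = [normalise(t) for t in re.findall(r"\w+", text)]
    norm_text = " " + " ".join(tokens) + " "
    return [code for name, code in gazetteer.items() if f" {name} " in norm_text]

# Toy gazetteer with hypothetical LAU2 codes
gazetteer = {"valencia": "ES_46250", "marseille": "FR_13055"}
print(find_places("Incendio forestal cerca de Valencia #fuego", gazetteer))
```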
Finally, and central to our approach, is giving the UGC a geographic context by looking up characteristics of the identified locations. In principle, these could be any characteristics found in SDIs or other spatial databases. In the case of forest fires, of primary interest are the distances to known hot spots or forest fires, the population density, and the predominant vegetation type. Since our geo-coding is based on the GISCO dataset, we used it as the base to aggregate raster datasets on population density (DGUR, Degree of Urbanisation) and land cover (CORINE Land Cover 2006) through zonal spatial analysis. While this aggregation poses some problems, at the LAU2 level of the hierarchy used we noticed few inaccuracies (REF). The distance to hot spots was implemented using the latest data from the European Forest Fire Information System (EFFIS), which is downloaded at regular intervals, uploaded to Oracle and used in a spatial query.
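As an illustration of the hot-spot enrichment, the sketch below computes the distance from a geo-coded UGC item to its nearest hot spot using the haversine formula. The hot-spot coordinates are made up, and in production this is an Oracle spatial query against the EFFIS data loaded into the database.

```python
# Sketch of the "distance to nearest hot spot" enrichment with the haversine
# formula; hot-spot coordinates are illustrative, and in production this step
# is an Oracle spatial query against hot spots downloaded from EFFIS.
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two WGS84 points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def distance_to_nearest_hotspot(lat, lon, hotspots):
    """Minimum distance (km) from a geo-coded UGC item to any hot spot."""
    return min(haversine_km(lat, lon, hlat, hlon) for hlat, hlon in hotspots)

hotspots = [(39.47, -0.38), (43.30, 5.37)]  # illustrative hot-spot locations
print(round(distance_to_nearest_hotspot(39.5, -0.4, hotspots), 1))
```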
The next phase, clustering the UGC/VGI, emulates the search for social confirmation (or rebuttal) and is crucial for detecting events. We had to rely on external software, since Oracle does not provide sufficient support for spatio-temporal clustering. After testing various software (CrimeStat III, R packages, ArcGIS, QGIS), we settled on SaTScan, since it is well documented in the literature, is in active use, and offers the widest variety of scan methods, including space-time scan statistics and the Bernoulli and discrete Poisson models. At regular intervals, data is exported from Oracle and fed into SaTScan via a command-line call with various parameters. The outputs are parsed and uploaded into Oracle via a Python script.
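A simplified Python sketch of this driver step is given below. The executable name, file paths and output layout are assumptions based on a typical SaTScan batch setup with a prepared parameter (.prm) file, not the exact production script, which additionally handles the export from and upload to Oracle.

```python
# Sketch of the SaTScan driver step: call the batch executable with a prepared
# parameter file, then parse the cluster output for upload to Oracle.
# Executable name, file names and output layout are assumptions.
import subprocess

PRM_FILE = "forest_fires.prm"            # parameter file prepared beforehand
CLUSTER_FILE = "forest_fires.col.txt"    # cluster output file named in the .prm

def run_satscan():
    # Assumes the SaTScan batch executable accepts the parameter file as argument.
    subprocess.run(["SaTScanBatch", PRM_FILE], check=True)

def parse_clusters(path=CLUSTER_FILE):
    """Parse detected clusters, assuming a delimited file with a header row."""
    with open(path) as fh:
        header = fh.readline().split()
        return [dict(zip(header, line.split())) for line in fh if line.strip()]

if __name__ == "__main__":
    run_satscan()
    for cluster in parse_clusters():
        print(cluster)  # in production: INSERT into an Oracle results table
```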
The detection of events in a near-real-time stream of information is a challenging task that needs further investigation. For the moment, we consider a detected cluster a likely event, which can then be investigated further by a human domain expert.
For the dissemination of results, various avenues are open, including broadcasting via social media, SMS, and web maps. For the latter, we submit the highly likely candidates and clusters to a web application developed by EFFIS, which is currently not available outside the intranet.