HCIR Challenge

The 3rd HCIR Challenge focused on the problem of people and expertise finding. We are grateful to Mendeley for providing this year's corpus: a database of over a million researcher profiles with associated metadata including published papers, academic status, disciplines, awards, and more, taken from Mendeley's network of 1.6M+ researchers and 180M+ academic documents. Participants built systems to enable efficient discovery of experts or expertise for applications such as collaborative research, team building, and competitive analysis. Note: participants agreed to use Mendeley's data only for the purposes of the Challenge and to destroy the data once the Challenge was complete. Questions about working with Mendeley outside the context of the HCIR Challenge can be directed to William Gunn <william.gunn@mendeley.com>.

Corpus Overview

The corpus consisted of seven files: profiles, publications, contacts, public_group_members, academic_status, disciplines, and public_groups.

  • profiles consists of about 1M rows, each containing six tab-separated columns:
    • id (integer) - used as join key in other tables
    • firstname (text)
    • lastname (text)
    • research_interests (text) - comma-separated list
    • main_discipline_id (integer) - corresponds to key in the disciplines file
    • biographical_info (text)
  • publications maps about 145k publication ids to JSON blobs, where:
    • id - SHA1 file hash of document at Mendeley
    • authors - list of authors as pairs of forename and surname.
    • Note that authors do not necessarily have profiles, and that matching them to profiles, where possible, is an entity resolution problem. The first number on each line of the publications file is the profile ID of one of the authors, so associating a record in the profiles file with that author's publications is straightforward. Every publication in the publications file has at least one author with a profile on Mendeley, but not every author in the publication list has a profile. (See the loading sketch after this list.)
    • title (text)
    • year (integer)
    • published_in (text)
    • stats includes:
      • readers (integer) - total number of readers
      • academic_status - distribution of readers by academic status code (join via academic_status file)
      • discipline - distribution of readers by discipline code (join via disciplines file)
      • country - distribution of readers by country
  • contacts represents the social network as a collection of about 250k profile id pairs (join via profiles file). Order should be ignored, as the relationships are symmetric.
  • public_group_members represents group memberships as a collection of about 140k profile_id - group_id pairs (join via profiles and public_groups files)
  • academic_status maps integer keys (1 to 15) to academic status values (e.g., Post Doc).
  • disciplines maps integer keys (1 to 25) to academic discipline values (e.g., Physics).
  • public_groups maps group keys (about 35k groups) to group names (e.g., Upstate New York Archaeology)
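
To make these joins concrete, here is a minimal loading sketch in Python. The file names, delimiters, and in particular the assumption that each publications line is a profile id followed by a tab and the JSON blob are inferred from the description above, not confirmed against the actual dump.

```python
import csv
import json
from collections import defaultdict

def load_profiles(path="profiles"):
    """profiles: ~1M rows of six tab-separated columns."""
    profiles = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.reader(f, delimiter="\t"):
            pid, first, last, interests, discipline_id, bio = row[:6]
            profiles[int(pid)] = {
                "firstname": first,
                "lastname": last,
                "research_interests": [s.strip() for s in interests.split(",") if s.strip()],
                "main_discipline_id": int(discipline_id) if discipline_id else None,
                "biographical_info": bio,
            }
    return profiles

def load_lookup(path):
    """academic_status, disciplines, public_groups: key -> name (keys assumed integer)."""
    with open(path, newline="", encoding="utf-8") as f:
        return {int(key): name for key, name in csv.reader(f, delimiter="\t")}

def load_publications(path="publications"):
    """Assumed layout: one publication per line, the owning profile id,
    a tab, then the JSON blob described above."""
    by_profile = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            profile_id, blob = line.rstrip("\n").split("\t", 1)
            by_profile[int(profile_id)].append(json.loads(blob))
    return by_profile

def load_contacts(path="contacts"):
    """contacts: ~250k profile-id pairs; stored both ways since the
    relationship is symmetric."""
    neighbours = defaultdict(set)
    with open(path, newline="", encoding="utf-8") as f:
        for a, b in csv.reader(f, delimiter="\t"):
            neighbours[int(a)].add(int(b))
            neighbours[int(b)].add(int(a))
    return neighbours
```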

The list of discipline categories, subcategories, and their IDs could also be retrieved from the Mendeley API.

Tasks

The Challenge tasks reflected the problem of finding people who have particular subject-matter expertise. A task might also carry constraints: for example, the desired expert might be expected to have a particular academic status or to have no specified conflict of interest. We expected the incomplete information in the corpus to be both a challenge and an opportunity -- as with many exploratory search tasks, recall is at least as important as precision. And, as with all HCIR tasks, we expected participants' systems to robustly support human interaction.

Here are the example tasks we published before participants froze their systems:

  1. Hiring: Given a job description, produce a set of suitable candidates for the position. An example of a job description: http://www.linkedin.com/jobs?viewJob=&jobId=3004979.
  2. Assembling a Conference Program: Given a conference's past history, produce a set of suitable candidates for keynotes, program committee members, etc. An example conference could be HCIR 2013, where past conferences are described at http://hcir.info/.
  3. Finding People to deliver Patent Research or Expert Testimony: Given a patent, produce a set of suitable candidates who could deliver relevant research or expert testimony for use in a trial. These people can be further segmented, e.g., students and other practitioners might be good at the research, while more senior experts might be more credible in high-stakes litigation. An example task would be to find people for http://www.articleonepartners.com/study/index/1658-system-and-method-for-providing-consumer-rewards.

For all of the tasks there was a dual goal of obtaining a set of candidates (ideally organized or ranked) and producing a repeatable and extensible search strategy.
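
As one concrete, deliberately simplistic illustration of such a strategy, the sketch below scores profiles by keyword overlap with research interests and publication titles, ranks by readership, and optionally drops direct contacts of a given profile as a crude conflict-of-interest filter. It builds on the hypothetical loaders sketched earlier; the function name, weights, and the example profile id are illustrative and not part of any submitted system.

```python
def find_experts(query_terms, profiles, pubs_by_profile, contacts,
                 exclude_contacts_of=None, top_k=20):
    """Toy repeatable search strategy: keyword match on interests and titles,
    ranked by (match score, total readers). Illustrative only."""
    terms = [t.lower() for t in query_terms]
    excluded = set(contacts.get(exclude_contacts_of, ())) if exclude_contacts_of else set()
    scored = []
    for pid, prof in profiles.items():
        if pid == exclude_contacts_of or pid in excluded:
            continue  # crude conflict-of-interest filter via the contacts graph
        interests = " ".join(prof["research_interests"]).lower()
        pubs = pubs_by_profile.get(pid, [])
        titles = " ".join(p.get("title") or "" for p in pubs).lower()
        # Interests are weighted above titles; a real entry would use proper retrieval.
        score = 2 * sum(t in interests for t in terms) + sum(t in titles for t in terms)
        if score == 0:
            continue
        readers = sum((p.get("stats") or {}).get("readers", 0) for p in pubs)
        scored.append((score, readers, pid))
    scored.sort(reverse=True)
    return [(pid, score, readers) for score, readers, pid in scored[:top_k]]

# Hypothetical use for a search-flavoured hiring task, excluding contacts of profile 42:
# shortlist = find_experts(["information retrieval", "user study", "evaluation"],
#                          profiles, pubs_by_profile, contacts, exclude_contacts_of=42)
```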

After participants froze their systems, we updated the example tasks as follows:

  1. Hiring: http://www.linkedin.com/jobs?viewJob=&jobId=3674928.
  2. Assembling a Conference Program: http://strataconf.com/strata2013/.
  3. Finding People to deliver Patent Research or Expert Testimony: http://www.articleonepartners.com/study/index/1687-pay-per-use-distributed-computing-resources.

Note: participants were allowed to make some use of external sources, as long as those sources were not a major part of what made their system successful -- especially if they were proprietary. The goal was to exhibit a system, not to show off access to proprietary data. Participants were certainly allowed to rely on additional data from Mendeley (e.g., abstracts or group ids as readers of articles), and could select experts from both the profiles file and the publication authors.
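
Because publication authors appear only as forename/surname pairs, selecting experts from the authors list involves the entity-resolution step noted in the corpus description. A naive exact-name match against the profiles table might look like the sketch below; it assumes each author has been normalized to a (forename, surname) pair, and a real system would also have to handle initials, diacritics, and name variants.

```python
from collections import defaultdict

def index_profiles_by_name(profiles):
    """Index profile ids by lowercased (firstname, lastname)."""
    by_name = defaultdict(list)
    for pid, prof in profiles.items():
        key = (prof["firstname"].strip().lower(), prof["lastname"].strip().lower())
        by_name[key].append(pid)
    return by_name

def resolve_author(forename, surname, by_name):
    """Return candidate profile ids for a publication author; an empty list
    means no exact match (many authors have no Mendeley profile)."""
    return by_name.get((forename.strip().lower(), surname.strip().lower()), [])
```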

Participants

There were five Challenge entries.

The winner, selected by a vote of conference attendees, was "Exposing and exploring academic expertise with Virtu" (Luanne Freund, Kristof Kessler, Michael Huggett, Edie Rasmussen).