Information Security, Amazon

Data Science

Submission Deadline: May 14, 2012
Extended Submission Deadline: May 21, 2012


Full details including data set description can be found in this PDF.

[UPDATE] See the Data Section for a validation data set.


By the submission deadline, you are required to submit three artifacts:

  1. An executable that takes an input file with the same format as the validation set and outputs a file with the classification results. The output file should not contain any header. Each line represents a classified label, in the same order as the input data: “1” means “access” and “0” means “no access”.
  2. A short manual. The manual should have three parts: 1) the running environment, e.g. Windows/Linux, and required packages such as MATLAB, R, etc.; 2) a tutorial on how to run the program to produce the output; 3) metrics on the validation set: classification error (required), confusion matrix (required), and ROC (optional).
  3. A paper with a detailed description of your approach. The paper should conform to the requirements of the conference and follow the same style as the rest of the workshop papers. It should be 2–4 pages, and no more than 6 pages.
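As a reference for the required output format (no header, one 0/1 label per line, in the same order as the input rows), a minimal sketch follows; the function name and file name are placeholders, not part of the contest specification.

```python
def write_predictions(labels, path):
    """Write one classification label per line: '1' = access, '0' = no access.

    The output file has no header, and the line order matches the
    row order of the input data.
    """
    with open(path, "w") as f:
        for label in labels:
            f.write(f"{int(label)}\n")

# Example: three classified rows, in input order.
write_predictions([1, 0, 1], "predictions.txt")
```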

After submission, your executable will be run against a hold-out data set that is similar to the validation set. The classification error on the hold-out data set is used to score each submission.

The submission with the minimum classification error will win the contest.
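A minimal sketch of the two required validation metrics (classification error and confusion matrix), assuming labels are plain lists of 0/1 integers; the function names here are illustrative, not mandated by the contest.

```python
def classification_error(y_true, y_pred):
    """Fraction of rows whose predicted label disagrees with the true label."""
    wrong = sum(1 for t, p in zip(y_true, y_pred) if t != p)
    return wrong / len(y_true)

def confusion_matrix(y_true, y_pred):
    """Return counts as {(true_label, predicted_label): count} for labels 0/1."""
    counts = {(t, p): 0 for t in (0, 1) for p in (0, 1)}
    for t, p in zip(y_true, y_pred):
        counts[(t, p)] += 1
    return counts

# Toy check on four rows (two disagreements -> error 0.5).
y_true = [1, 0, 1, 1]
y_pred = [1, 1, 1, 0]
print(classification_error(y_true, y_pred))  # -> 0.5
```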


Validation Data Set:
Your validation process will provide a ‘YES’/’NO’ [action] for a [resource] given a [mgr_id, role_rollup_1, role_rollup_2, role_deptname, role_title, role_family_desc, role_family, role_code] tuple. This data set has the [action] column filled in so you can validate your model.

Data Set #1:
contains the set of attributes associated with each user; each column corresponds to a single attribute (the rows have a time dimension).
Data Set #2:
contains the access transaction history (the rows have a time dimension).
Data Set #3:
contains the user access snapshot at the beginning and the end of the transaction history (the rows have a time dimension; either 2011-11-01 or 2010-11-01).


The objective of this competition is to build a model, learned from historical data, that determines an employee's access needs such that manual access transactions (grants and revokes) are minimized as the employee's attributes change over time. This is a clustering/collaborative filtering exercise. The model will take an employee attribute record and a resource code, and will return true if the employee should be given access to this resource and false if the employee should not.

The problem can be formulated as follows:

At time T, create a snapshot of STATUS(EMPLOYEE_ID, RESOURCE_ID), which is either 1 (access) or 0 (non-access). Build a system F, which models STATUS ~ {EMPLOYEE ATTRIBUTES, RESOURCE ATTRIBUTES}.

Therefore, at time T, for each employee we have an access profile PROFILE(EMPLOYEE_ID, T).
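One way to read this formulation: F is any binary classifier over employee and resource attributes, and PROFILE(EMPLOYEE_ID, T) is simply the set of resources for which F predicts 1. A toy sketch follows; the attribute names and the rule inside F are invented for illustration (the real attribute schema comes from the data sets above).

```python
def F(employee_attrs, resource_id):
    """Stand-in for the learned model: STATUS ~ {employee, resource} attributes.

    Placeholder rule (not a real model): grant access if the resource is
    listed under the employee's hypothetical 'dept_resources' attribute.
    """
    return 1 if resource_id in employee_attrs.get("dept_resources", ()) else 0

def profile(employee_attrs, all_resources):
    """PROFILE(EMPLOYEE_ID, T): the resources predicted as STATUS = 1."""
    return {r for r in all_resources if F(employee_attrs, r) == 1}

emp = {"dept_resources": {"R1", "R3"}}          # hypothetical attribute record
print(sorted(profile(emp, ["R1", "R2", "R3"])))  # -> ['R1', 'R3']
```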

The measure of success is to minimize the cost of add/remove actions for all employees over a given time period.

  • add action: a manual add_access during the test period incurs a penalty if the EMPLOYEE_ID–RESOURCE_ID pair, i.e. the RESOURCE_ID, is not in PROFILE(EMPLOYEE_ID)
  • remove action: a manual remove_access during the test period incurs a penalty if the EMPLOYEE_ID–RESOURCE_ID pair, i.e. the RESOURCE_ID, is in PROFILE(EMPLOYEE_ID)
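The two penalty rules above can be sketched as a scoring function over the test period's manual transactions. The transaction-log format (tuples of employee, resource, action) and the unit cost per penalized action are assumptions made for this illustration, not part of the official scoring code.

```python
def penalty_cost(transactions, profiles):
    """Count penalized manual actions over the test period.

    transactions: iterable of (employee_id, resource_id, action) tuples,
        where action is "add_access" or "remove_access" (assumed format).
    profiles: dict mapping employee_id -> set of resource_ids in
        PROFILE(EMPLOYEE_ID).
    """
    cost = 0
    for emp, res, action in transactions:
        in_profile = res in profiles.get(emp, set())
        if action == "add_access" and not in_profile:
            cost += 1  # access was needed but missing from the profile
        elif action == "remove_access" and in_profile:
            cost += 1  # access was in the profile but had to be revoked
    return cost

profiles = {"E1": {"R1"}}
log = [("E1", "R1", "add_access"),     # no penalty: R1 is in PROFILE(E1)
       ("E1", "R2", "add_access"),     # penalty: R2 missing from the profile
       ("E1", "R1", "remove_access")]  # penalty: R1 is in the profile
print(penalty_cost(log, profiles))  # -> 2
```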

  • How do I sign up?
    • No formal sign up is required. Just download the data and make a submission according to the
      guidelines provided in the documentation.

  • Why does the data show that some employees report to a different manager at the same time?
    • You can ignore the employees who appear to report to different managers at the same time. Some employees serve multiple purposes for the company and are the source of these data points. This is real industry data; it doesn't always behave the way you think it should.