Mini Challenges 2019

The goal of this course is to learn data science while having fun. The M2 class builds mini-challenges and the L2 class solves them.
 
 Challenges (M2 students)  Abstract  Task  L2 groups
 Areal [raw data][preprocessed]

Recognize landscapes from satellite images.
This challenge looks at things from above! From aerial images, you will figure out whether the scene is a beach, chaparral, cloud, desert, forest, island, lake, meadow, mountain, river, sea_ice, snowberg, or wetland. Classifying terrains is important to control urban development, favor economic growth, and protect the environment. [SLIDES]
The problem is a multi-class classification problem. You must assign each sample to one of 13 classes.
There are two possible challenges: one from raw data and one from preprocessed data. In its raw data version, the challenge is to classify images represented as 128*128 pixel maps. In its preprocessed data version, the challenge is to classify vectors of 4096 high-level abstract features extracted with a pre-trained CNN.
 ORBITER, SATELLITE, SPUTNIK
 HADACA


 Health data challenge. The dataset is a set of patients who have been diagnosed at different stages of cancer. Your task is to improve the classification results regarding the stages of those patients. [SLIDES]


 This is a multi-class classification problem. You must classify cancer stages among a specific population into one of 10 categories. The data is a matrix of (number of patients) rows * (number of features per patient) columns. Features correspond to methylation information related to the medical condition of each patient.   CANCER, CURE, HEALTH, DOCTOR, FACTOR
 L2RPN
https://codalab.lri.fr/competitions/398
 Learning to run a power network. The goal of this challenge is to control electricity transportation in power grids, while keeping people and equipment safe. This is the "gamification" of a serious problem: operating the grid is becoming increasingly complex because of the advent of less predictable renewable energies, the globalization of energy markets, growth in consumption and concurrent limitations on new line construction. [SLIDES]  This is a reinforcement learning (RL) problem. You will have access to a simulator of a small-scale grid. The RL agents you design should learn a policy that keeps the power grid secure. The possible actions include switching a line's status (in service or out of service) or changing the line interconnections.  ELECTRICITY, GRID
Persodata [raw][preprocessed]
https://codalab.lri.fr/competitions/401

Detect fake paintings.
The goal of the challenge is to detect fake paintings. We present you with real paintings and paintings generated by a computer program. Can you tell them apart? [SLIDES]
 The problem is a binary classification problem. Each sample (image) is characterized by 200 features. You must predict whether the images are fake or real.

 PICASSO, KAHLO, VINCI, MONET
 Survival
https://codalab.lri.fr/competitions/383
 Influence of nutrition on life expectancy. Evaluate how nutrition affects longevity using data from NHANES (US National Health and Nutrition Examination Survey). [SLIDES]  The problem is a regression problem (prediction of time of death) with censored data (some people leave the study or are still alive at the end of the study). The evaluation metric is the concordance index.  GHOSTS, SURVIVERS
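To make the concordance index mentioned for the Survival challenge more concrete, here is a minimal sketch in plain Python. It assumes the common (Harrell) definition: a pair of people is comparable only when the one with the shorter follow-up time actually died, and a comparable pair counts as concordant when the model predicts a shorter survival time for that person. The function name, the argument names and the tie handling (ties count one half) are illustrative assumptions; the scoring program of the challenge may differ in its details.

```python
def concordance_index(times, predicted, observed):
    """Fraction of comparable pairs ordered correctly by the predictions.

    times     -- follow-up time of each person (death or censoring time)
    predicted -- predicted survival time (larger means longer survival)
    observed  -- 1 if the death was observed, 0 if the person is censored
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable only if person i died strictly before
            # the last time person j was known to be alive.
            if observed[i] and times[i] < times[j]:
                comparable += 1
                if predicted[i] < predicted[j]:
                    concordant += 1.0
                elif predicted[i] == predicted[j]:
                    concordant += 0.5  # assumption: ties count one half
    return concordant / comparable


# Toy data (made up for illustration): the third person is censored.
times = [2, 4, 6, 8]
observed = [1, 1, 0, 1]
print(concordance_index(times, [1, 3, 5, 7], observed))  # perfect ordering
print(concordance_index(times, [5, 5, 5, 5], observed))  # constant prediction
```

A perfect ordering scores 1, a constant prediction scores 0.5, and a fully reversed ordering scores 0.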





Acknowledgements: These challenges are hosted by CodaLab. We received a grant from the FCS Paris-Saclay and sponsorship from Microsoft Azure for Research and Google Research.

Auto-sklearn performances

Challenge  Score (validation set)
AREAL
HADACA
L2RPN  NA
PERSODATA
SURVIVAL

Sample competition (Python 3 version)  Abstract  Task
 Iris
 This is the well-known Iris dataset from Fisher's classic paper (Fisher, 1936). The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

The problem is a multi-class classification problem. Each sample (an Iris) is characterized by its sepal and petal width and length (4 features). You must predict the Iris categories: setosa, virginica, or versicolor. 


Mini Challenges 2018

Challenges (M2 students)  Abstract  Task  Videos (L2 student solutions) 
 VISION
https://codalab.lri.fr/competitions/108
Autonomous vehicles will become a common means of transportation very soon. However, obstacles remain to be overcome, in particular obstacle avoidance. This requires powerful computer vision algorithms. In this challenge you will contribute to solving the problem of recognizing animals and vehicles.  To illustrate this problem, we propose to study the CIFAR-10 image dataset, which groups entities that can interact with the vehicle environment, like animals (cat, horse, dog, ...) and vehicles (bike, car, truck, ...). We preprocessed the images so that you get to solve a multi-class classification problem from pre-computed features.

Your score is the balanced accuracy or BAC. It is the average of the per-class accuracies, that is, the average over classes of the fraction of samples of that class that are correctly classified. Make predictions that are vectors [0 0 ... 1 ... 0 0] with a 1 at the ith position if you want to predict that your sample belongs to class i.
CAMERA
REGARD
IMAGE
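As an illustration of the balanced accuracy used to score the VISION challenge, the following sketch computes the average per-class recall from lists of class indices and builds the one-hot prediction vectors described in the task. The helper names `balanced_accuracy` and `one_hot` and the toy labels are assumptions made for this example; the actual scoring program may apply additional normalization.

```python
def balanced_accuracy(y_true, y_pred):
    """Average, over classes, of the fraction of samples of that class
    that were predicted correctly (per-class recall)."""
    recalls = []
    for c in sorted(set(y_true)):
        indices = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in indices if y_pred[i] == c)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


def one_hot(label, n_classes):
    """Vector [0 0 ... 1 ... 0 0] with a 1 at position `label`."""
    vector = [0] * n_classes
    vector[label] = 1
    return vector


# Toy labels (made up): class 0 is frequent, class 2 is rare.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2]
print(balanced_accuracy(y_true, y_pred))  # (3/4 + 1/2 + 1/1) / 3 = 0.75
print(one_hot(2, 4))                      # [0, 0, 1, 0]
```

Because every class contributes equally, a model that always predicts the most frequent class gets a low balanced accuracy even when its plain accuracy looks good.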
 Over-prescription of opioid medicines presents a new public health problem because many people have become addicted. This challenge asks you to help predict which doctors tend to over-prescribe such medicines.
 This is a binary classification task. The target represents, for each medical prescription, whether an opioid has been prescribed or not. The features represent, amongst others, the specialty of the doctor who made the prescription and the names of the non-opioid drugs present in this prescription.

Your score is the Gini or "normalized AUC": 2 AUC - 1. AUC stands for area under the ROC curve. Make numerical predictions for test samples that are larger for the positive class and smaller for the negative class (discriminant values). Random guesses give a score close to 0 while perfect predictions give a score of 1.
SANTE
MEDECINE
SECOURS
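The Gini score (2 AUC - 1) used for the binary classification challenges can be sketched as follows. The AUC is computed here with the pairwise definition: the fraction of (positive, negative) pairs in which the positive sample receives the larger discriminant value, with ties counted as one half. The function names and the toy scores are assumptions for illustration, not the official scoring code.

```python
def auc(y_true, scores):
    """Area under the ROC curve, computed by comparing every positive
    sample with every negative sample (ties count one half)."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def gini(y_true, scores):
    """Normalized AUC: 0 for random guesses, 1 for perfect ranking."""
    return 2 * auc(y_true, scores) - 1


# Toy labels and discriminant values (made up for illustration).
y_true = [0, 0, 1, 1]
print(gini(y_true, [0.1, 0.2, 0.8, 0.9]))    # perfect ranking
print(gini(y_true, [0.1, 0.4, 0.35, 0.8]))   # one misordered pair out of four
print(gini(y_true, [0.5, 0.5, 0.5, 0.5]))    # uninformative constant scores
```

Only the ordering of the predictions matters, so any monotonic rescaling of the discriminant values leaves the score unchanged.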
  FRIEND
https://codalab.lri.fr/competitions/112
 Predicting the price at which a house will sell helps people sell their property at a fair price. This dataset contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015.   This is a regression problem. The dataset contains 19 house features plus the price and the id columns, along with 21613 observations.

Your score is the R-square = 1 - MSE / var(Y). It is 0 for the baseline method that predicts the average target value. It is 1 for perfect guesses. It can be negative if your predictions are worse than the average target value!
SOLIDARITE
AMITIE
FRATERNITE
EGALITE
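The R-square score defined for the regression challenges (1 - MSE / var(Y)) can be checked on a few toy values with the short sketch below. The function name `r_square` and the numbers are assumptions chosen for illustration; the variance is taken as the population variance of the targets, so that predicting the mean gives exactly 0.

```python
def r_square(y_true, y_pred):
    """R-square = 1 - MSE / var(Y), with var(Y) the population variance."""
    n = len(y_true)
    mean = sum(y_true) / n
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    var = sum((t - mean) ** 2 for t in y_true) / n
    return 1 - mse / var


# Toy targets (made up for illustration).
y = [1.0, 2.0, 3.0, 4.0]
print(r_square(y, [1.0, 2.0, 3.0, 4.0]))  # perfect guesses: 1.0
print(r_square(y, [2.5, 2.5, 2.5, 2.5]))  # always the average: 0.0
print(r_square(y, [4.0, 3.0, 2.0, 1.0]))  # worse than the average: -3.0
```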
  ECOLO
https://codalab.lri.fr/competitions/100
 Pollution, or the introduction of different forms of waste materials in our environment, has negative effects on the ecosystem we rely on. With modernization and development in our lives, pollution has reached its peak, giving rise to global warming and human illness.  This is a regression problem. The goal of this challenge is to predict the NOx levels in the air in Northern Taiwan, which are an indicator of pollution. The dataset was initially provided by the Environmental Protection Administration, Executive Yuan, R.O.C.

Your score is the R-square = 1 - MSE / var(Y). It is 0 for the baseline method that predicts the average target value. It is 1 for perfect guesses. It can be negative if your predictions are worse than the average target value!
VERDURE
NATURE
  CREDIT
 This challenge deals with a fundamental task in the financial industry: credit scoring. In plain English, it means deciding whether or not to grant credit to someone, depending on her/his financial history.  This is a binary classification problem. The data set contains 150000 instances split into 2 classes, where the class indicates whether the client experiences serious financial difficulties within two years.

Your score is the Gini or "normalized AUC": 2 AUC - 1. AUC stands for area under the ROC curve. Make numerical predictions for test samples that are larger for the positive class and smaller for the negative class (discriminant values). Random guesses give a score close to 0 while perfect predictions give a score of 1.
CROISSANCE
HONETETE
AUDACE

Acknowledgements: These challenges were generated with ChaLab and are hosted by CodaLab. We received a grant from the FCS Paris-Saclay and sponsorship from Microsoft Azure for Research.
Auto-sklearn performances

Challenge  Score (validation set)
VISION  0.8153
BIOMED  0.7098
FRIEND  0.8507
ECOLO  0.8546
CREDIT  0.4499

Sample competition  Abstract  Task
 Iris
 This is the well-known Iris dataset from Fisher's classic paper (Fisher, 1936). The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

The problem is a multi-class classification problem. Each sample (an Iris) is characterized by its sepal and petal width and length (4 features). You must predict the Iris categories: setosa, virginica, or versicolor. 


Mini Challenges 2017

 Challenges (M2 students)  Abstract  Task  Solution (Video L2 students)
 Blue


 Activity of molecules against HIV

The problem is to relate molecular structure to activity in order to screen new compounds before actually testing them with High Throughput Screening (HTS) in vitro experiments. HTS is a method for massive scientific experimentation used in drug discovery, linking the fields of biology and chemistry. This method remains a very costly process despite many recent technological advances in the field of biotechnology. This is why applying machine learning methods would be of great benefit to the pharmaceutical industry, as it would reduce the number of compounds that need to be tested.
 The objective is to predict which compounds are active against HIV infection. The dataset has two classes: active or inactive (binary classification). The variables represent properties of the molecule inferred from its structure.
Note: this project is running on the LRI server. In case of problems, a previous version on the main Codalab instance is available.
 Marine

 Cobalt

 Cyan
 Lothlorien
This challenge addresses the issue of controlling access to resources (websites, drug purchases, violent movies, etc.) based on the age of a person. Indeed, a lot of violent content is accessible on the internet and 45 % of children under 12 are not monitored by parental control. To this end, we rely on a real-time image of the person to estimate their age category. Facial aging effects are mainly correlated with bone movement and growth, skin wrinkles and reduction of muscle strength. Since human observation lacks accuracy, we want to find an automatic algorithm to make this distinction.
 A computer vision challenge is proposed for undergraduate students, in which the challenger must predict the class of a person (adult or minor) based on a picture of his/her face.
Note: the main Codalab instance of this challenge has been tested.
Note: this project is running on the LRI server. In case of problems, a previous version on the main Codalab instance is available.
 Cerulean

Turquoise
 Green
 Ecocity
Help SimCity's mayor fight pollution and traffic jams by optimizing the city's bike rental system!
SimCity's mayor has invested a lot of money to fight pollution and reduce traffic jams. Her first action was the purchase of a bike rental system. To improve the system, she wishes to predict the number of bikes rented at each station at any moment of the day using weather data. 
 The challenge is to use weather data (temperature, humidity, cloud cover) to predict the number of bikes rented at a given station on a given day. To make the challenge more interesting, predictions are asked for either the morning or the afternoon.
Note: this project is running on the LRI server. In case of problems, a previous version on the main Codalab instance is available.
 Grass-Pistachio
 Yellow
 Movie recommendation

Currently, there is more and more music to listen to, and there are more and more movies to watch and things to buy on the Internet. Therefore, developing systems that help users find items they may like is crucial. Recommending items is different from "classical" machine learning, where you only have to predict a class given several features. Recommendation implies using predictions to recommend suitable items (in this case movies) to the right people. In addition, these preferences can sometimes evolve over time.
 In this challenge, you will work on the famous MovieLens dataset. The goal of this challenge is to predict, for a given user and a given film, the score that the user is most likely to award.
Note: There is also an LRI version. Warning: the two versions used different scores. They should now both use a_metric = 1 - MAE/MAD.
 Gold

 Lemon

 Vanilla
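The score a_metric = 1 - MAE/MAD mentioned in the note for the movie recommendation challenge can be sketched as follows. This example assumes that MAE is the mean absolute error of the predictions and that MAD is the mean absolute deviation of the true ratings from their mean, so that always predicting the average rating scores 0. The toy ratings are made up, and the scoring program of the challenge may differ.

```python
def a_metric(y_true, y_pred):
    """1 - MAE/MAD, assuming MAD is the mean absolute deviation
    of the true values from their own mean."""
    n = len(y_true)
    mean = sum(y_true) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mad = sum(abs(t - mean) for t in y_true) / n
    return 1 - mae / mad


# Toy movie ratings (made up for illustration).
ratings = [1.0, 2.0, 4.0, 5.0]
print(a_metric(ratings, [1.0, 2.0, 4.0, 5.0]))  # perfect predictions: 1.0
print(a_metric(ratings, [3.0, 3.0, 3.0, 3.0]))  # always the mean rating: 0.0
print(a_metric(ratings, [2.0, 2.0, 4.0, 4.0]))  # in between
```

Like the R-square, this score compares the model with the trivial predictor that always outputs the average, but it uses absolute rather than squared errors.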
 Orange
 Pick The Sneak Peek
In 2000, 60,234 titles (movies and TV shows combined) were released, according to IMDb. In 2010, 165,830 titles were released, and in 2016, 190,275. The number of releases keeps growing, and the databases aggregating this data need more information to keep up.
 This is a text processing challenge.
The idea is to facilitate the genre labeling of movies from their summaries and thus to help with the categorization of the movie database.
Note: this project is running on the LRI server. In case of problems, a previous version on the main Codalab instance is available.
 Salmon

Tangerine


 Red

 The Godfather returns!
After last year’s purge accomplished by Batman, the Godfather has returned and he's looking for new talent, the best criminals in SF, so that crime organizations can prosper again and return to their golden age. To verify the recruits' abilities, records of their previous crimes in the San Francisco Bay Area are being investigated, background checks are being conducted on the candidates' résumés, and software is being developed to highlight criminals' potential.
 The goal is to design software to predict, for each criminal record, the category of crime. If the candidate's crime falls into the category that the Godfather needs, he will be recruited!
Note: No LRI implementation so far.
 Magenta

 Cherry

 Coral




Acknowledgements: These challenges were generated with ChaLab and are hosted by CodaLab. We received a grant from the FCS Paris-Saclay and sponsorship from Microsoft Azure for Research.
Auto-sklearn performances

Challenge  Score
Blue  0.4020
Cyan  0.5863
Green  0.5118
Yellow  0.6747
Orange  0.3509
Red  0.4850


Sample competition  Abstract  Task
 Iris
 This is the well-known Iris dataset from Fisher's classic paper (Fisher, 1936). The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.
You may download the Codalab bundle of this challenge, which serves as competition template (uploadable to Codalab). This example can also be created with the ChaLab wizard.
The problem is a multi-class classification problem. Each sample (an Iris) is characterized by its sepal and petal width and length (4 features). You must predict the Iris categories: setosa, virginica, or versicolor. 


Mini Challenges 2016
 Challenges (M2 students)  Abstract  Task  Solution (L2 students)

https://competitions-test.codalab.org/competitions/1104?secret_key=d2ee4b0c-c41a-4071-a9ed-fc93fc0c054e
Less time in hospital
Diabetes will be the seventh most common cause of death in 2030 according to the World Health Organization. In 2014, global prevalence of diabetes was estimated to be more than 9% among adults aged 18+ years. While most hospitals have the necessary medical equipment to treat this disease, some do not. The task is a binary classification problem. Using the training set, it consists of predicting the length of stay for a patient given their diagnosis and medications. The label has two categories: a stay shorter than 7 days or a stay of 7 days or more.  Video Microbes 1
 Video Microbes 2
Restaurants  We propose a challenge in restaurant recommendation: predict the rating that a particular user would give to any restaurant. We have very detailed information about the restaurants, such as geographical information, number of stars, reviews, etc., and for each person a list of some restaurants they visited and their personal ratings. The participants will work on two main tasks:
Task 1: Select the most prevalent features in the three datasets.
Task 2: Improve the prediction results using other methods and enrich the training dataset with data from Yelp.
 Video Fin Gourmets
 Eye robot

Robots play a growing role in society every day, and soon they may be walking in the streets among us. There are a lot of problems that need to be solved before that, and one of them is adaptation. An AI needs to adapt its vision of the world: when it sees an entity for the first time, it should be able to tell whether it is a domestic animal, a predator, a vehicle or maybe another robot. That is where transfer learning comes in: extracting general features from specific examples of a group makes it possible to classify unknown entities efficiently. The idea of the challenge is to learn how to separate distinct classes of images. More precisely, we consider different superclasses, like "aquatic animals", each containing several classes, like "dolphin", and the goal is to tell these superclasses apart.  Video EyeRobot 1
 Video EyeRobot 2
 
Crimes in Gotham City
Batman is fighting on the front line to free Gotham City from crime. Now he and his team want to build a system to work more efficiently. They have crime data from recent years for Gotham City, collected from the GCPD and Batman’s own database. The data includes the location, the time and some other information about each crime. Some crimes have been solved, others have not. The main goal of this project is to help Batman develop this system, in other words, to classify crimes. You can treat it as a binary classification problem and predict whether a crime can be solved or not. You can also first use logistic regression to estimate how likely a crime is to be solved. Batman can then use this system to prioritize crimes.  Video Batman 1
 Video Batman 2
 Video Batman 3
 Textasie
In this project you will tackle the problem of opinion mining in movie reviews with a basic set of techniques used in text classification. Many sentiment-analysis methods for the classification of reviews use training and test data based on star ratings provided by reviewers. However, when reading reviews, it appears that the reviewers' ratings do not always give an accurate measure of the sentiment of the review. The objective of the challenge is to determine the polarity of an opinion from raw text. Since it is a challenge for beginners, you will only focus on classifying opinions as positive or negative. You can go further and detail sentiments like happiness, sadness or satisfaction, but this will not be our goal in this contest.   Video Textasie