Research Projects

PROBLEM KIT

For every research problem/topic we provide a set of components. We hope the problems can become "shared tasks".

  1. The problem/topic/task description (to answer the question WHAT). What the problem is, including the input (its definition and examples) and the output (its definition and examples), described both informally and formally, together with its limitations.

  2. Background and the importance of the problem/topic (to answer the question WHY). This can be the basis for formulating the academic contribution and the contribution/benefit for society.

  3. Performance criteria (evaluation/experiment metrics). If more than one metric is suitable, state which one is the most important. This is part of the answer to the question WHAT.

  4. Baseline solutions (for comparison). A) Existing solutions (primarily the state of the art). B) Upper bound (the "ideal" solutions): for text, usually solutions produced by humans (domain experts). This can be seen as the level of disagreement with the "gold standard". C) Lower bound: trivial solutions (e.g. random).

  5. Dataset (especially the test dataset), together with a description of how the dataset was developed, and the dataset itself if it is publicly accessible.
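
Components 3 and 4 can be illustrated with a small sketch. It assumes a labelling task scored by accuracy; the labels, the toy data and the function names are hypothetical and are not taken from any project listed here. The lower bound comes from trivial solutions (random and majority), and the upper bound is approximated by how often a second human annotator agrees with the gold standard.

```python
from collections import Counter
import random


def accuracy(predicted, gold):
    """Fraction of items where the prediction equals the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def majority_baseline(gold):
    """Trivial solution: always predict the most frequent gold label."""
    label, _ = Counter(gold).most_common(1)[0]
    return [label] * len(gold)


def random_baseline(gold, seed=0):
    """Trivial solution: pick a label uniformly at random for each item."""
    rng = random.Random(seed)
    labels = sorted(set(gold))
    return [rng.choice(labels) for _ in gold]


# Hypothetical toy data: gold labels and a second human annotator.
gold = ["PER", "PER", "ORG", "PER", "ORG", "PER"]
annotator_b = ["PER", "PER", "ORG", "ORG", "ORG", "PER"]

lower_bound = accuracy(majority_baseline(gold), gold)  # trivial solution
upper_bound = accuracy(annotator_b, gold)              # human agreement
```

A proposed method is then expected to score between these two bounds, and ideally above the existing solutions.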

Optional

  1. Objectives (goals, the targets to be achieved/produced): that is, what the outcomes/deliverables are. These can be an algorithm/method (in an academic article) and/or a dataset (at least a test dataset, usually an improvement on available test datasets) and/or a tool (at least code), with quantitative performance measurements for the tool and the dataset.

  2. Recommendation for baselines (comparisons to be used in experiments).

  3. Related studies, references, and state-of-the-art solutions to the problem.

Solution: a proposed solution (method), described in a technical report.

With sufficiently complete and clear information about these topics (in the form of points/components), we hope that better solutions to these problems can be sought together and continuously. In this way the problem statements keep being refined to become more realistic, and the results keep improving.

This problem set is expected, among other uses, to serve as a source of topics for undergraduate final projects (or possibly as part of a postgraduate thesis). The problem set and its solutions are expected to be improved continuously.

We also intend to list the solutions (literature) for each problem, especially those produced by the HLT-lab team.


CURRENT PROJECTS

Annotations for Al Qur'an and its Translations

Manually annotate the text of the Qur'an (Arabic language, Arabic script) and its translations (especially Indonesian and English). The annotation covers entities (currently focusing on personal and community entities) and pronouns, and accommodates several versions (tafseer/interpretation).

The project covers a file format (so that the annotations can be processed by computer code), tools for manual annotation, and documentation.

Current priority (2019-2021): development of manual annotation tools for entities (more detail..) and for pronouns (more detail..).

Building Encyclopedia, Thesaurus of Qur'anic Terms

Content of each entry

  • Thesaurus (synonyms) and related terms. More detail..

  • Dictionary (definition), list of verses which contain the term.

  • Main contents are taken from tafseers (commentaries) of Al Qur'an verses that contain or explain the term. More detail..

Related studies include multi-document summarization and semantic text similarity.

Semantic Text Similarity (STS)

For words/terms, (short) sentences, longer text..

For domains: general, Islamic, ..

Current projects (2019-2021):

  • Similarity of wording (text similarity), especially for Al Qur'an. More detail..

  • Alignment

Related: plagiarism detection, STS, ..
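
As a point of reference for similarity of wording, a very simple lexical measure is the Jaccard overlap between the word sets of two texts. The sketch below is only a trivial baseline under that assumption, not the method used in the project; the function names and the example sentences are hypothetical.

```python
def tokens(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts (0.0 to 1.0)."""
    set_a, set_b = set(tokens(a)), set(tokens(b))
    if not set_a and not set_b:
        return 1.0  # two empty texts are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical example: two wordings that differ in one word.
score = jaccard("in the name of God", "in the name of Allah")
```

Here the two texts share 4 of 6 distinct words, so the score is about 0.67. A measure like this ignores word order and meaning, which is exactly the gap that semantic text similarity aims to close.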


Building Personal Name Index from Islamic Corpus

Names found in Al Qur'an, hadith, the history of the Prophet Muhammad SAW, Islamic history, etc.

Related studies include, among others, name disambiguation and entity linking.

Current priority (2019-2021): the hadith collection. More detail..

Automatic Building of Indonesian WordNet

Some things that need to be done:

  • The data are stored in a repository in the standard WordNet format. The data can be accessed via the web and an API.

  • In the synset building process that uses extraction from a thesaurus and a dictionary, the test data (gold standard) is a sample: a set of words from the Indonesian thesaurus or KBBI.

  • A web-based application for search by users.

  • The WordNet building application should be runnable on the entire thesaurus and dictionary data whose format has been prepared (not only on the test data). A data preparation process is therefore needed to bring the data into the input format of the synset building application.

  • Development of a manual editor.

The main priority at present is building synonym sets. More detail..
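
One simple way to picture synonym set extraction from a thesaurus is to link two words only when each lists the other as a synonym, and then take the connected groups as candidate synsets. The sketch below assumes this heuristic; it is not the project's actual method, and the input format and the toy entries are hypothetical. It ignores word senses, so a real system would need a further step to split groups that mix several meanings.

```python
def candidate_synsets(thesaurus):
    """Group words into candidate synsets using mutual synonym links.

    `thesaurus` maps each headword to the list of synonyms given for it.
    """
    # Build an undirected graph that keeps only mutual links.
    graph = {word: set() for word in thesaurus}
    for word, synonyms in thesaurus.items():
        for syn in synonyms:
            if word in thesaurus.get(syn, ()):
                graph[word].add(syn)
                graph[syn].add(word)

    # Collect connected components with a depth-first traversal.
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            word = stack.pop()
            if word in component:
                continue
            component.add(word)
            stack.extend(graph[word] - component)
        seen |= component
        result.append(sorted(component))
    return result


# Hypothetical toy thesaurus entries (Indonesian words).
toy_thesaurus = {
    "besar": ["agung", "raya"],
    "agung": ["besar", "mulia"],
    "raya": ["besar"],
    "mulia": ["agung"],
    "kecil": ["mungil"],
    "mungil": ["kecil"],
}
synsets = candidate_synsets(toy_thesaurus)
```

The candidate groups can then be compared against the gold standard sample to measure how well the extraction works.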

Patterns of Nahwu (Syntax) and Sharaf (Morphology) for Al Qur'an

Objectives of this project: to find patterns of Arabic grammar to support tafseer (interpretation/commentary) of Al Qur'an, and to help non-native Arabic speakers (especially Indonesians) understand the Arabic of Al Qur'an better.

Current priority (2019-2021): morphology for fi'il (verb) (more detail..) and tokenization. More detail..

Phonetic String Matching for Al Qur'an

Most Muslims, including those in Indonesia, are not accustomed to writing the Arabic alphabet, so a search system that accepts queries in Latin script is needed. More detail..
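
The idea of searching with Latin-script queries can be sketched as follows: both the query and a Latin transliteration of each verse are reduced to a rough phonetic key, so that spelling variants of the same sound match. This is a minimal sketch under stated assumptions, not the project's algorithm. The replacement table, the function names and the informal transliterations are all hypothetical, and a real system would derive the keys from the Arabic text itself.

```python
import re

# Hypothetical equivalence table: Latin spellings that are often used
# interchangeably are mapped to one code. Digraphs come before single letters.
REPLACEMENTS = [
    ("sy", "s"), ("sh", "s"), ("ts", "s"),
    ("kh", "h"), ("gh", "g"),
    ("dz", "z"), ("dh", "d"), ("th", "t"),
    ("q", "k"), ("v", "f"),
]


def phonetic_key(text):
    """Reduce a Latin-script string to a rough phonetic key."""
    key = re.sub(r"[^a-z]", "", text.lower())  # keep letters only
    for old, new in REPLACEMENTS:
        key = key.replace(old, new)
    return re.sub(r"(.)\1+", r"\1", key)       # collapse doubled letters


def search(query, verses):
    """Return the ids of verses whose phonetic key contains the query key."""
    query_key = phonetic_key(query)
    return [vid for vid, text in verses.items()
            if query_key in phonetic_key(text)]


# Informal transliterations, used only for illustration.
verses = {
    "1:1": "bismillahir rahmanir rahim",
    "112:1": "qul huwallahu ahad",
}
hits = search("kul huwalahu", verses)
```

With this table, the query "kul huwalahu" and the text "qul huwallahu ahad" share the key prefix "kulhuwalahu", so verse 112:1 is returned. Because spaces are removed before matching, letters from adjacent words can accidentally form a digraph; a fuller design would have to handle word boundaries and vowel variation.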