Tool Instructions

CAMeL Tools

CAMeL Tools is a suite of tools for Arabic developed at NYU Abu Dhabi. The main page is here. Below you will find a suggested sequence of steps for installing and using it.

Reference: Obeid, Ossama, et al. "CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing." Proceedings of The 12th Language Resources and Evaluation Conference. 2020.

Here is what I did to install it on macOS. The first step is to install pip if you do not yet have it. I did this in my home directory. (My machine name is Brandenburg-Owen and my login is rambow; your prompt will look similar unless you have changed it.)

rambow@Brandenburg-Owen ~ % curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1882k  100 1882k    0     0  7212k      0 --:--:-- --:--:-- --:--:-- 7212

If you do not seem to have curl, follow the instructions here.

pip lets you install Python modules easily. Once you have run the downloaded script with Python (python3 get-pip.py) so that pip is installed, you can do this:

rambow@Brandenburg-Owen ~ % Library/Python/3.8/bin/pip install camel-tools

Now you need to download the databases:

rambow@Brandenburg-Owen ~ % camel_data light

If you want all tools (this includes sentiment analysis etc., which is not needed if you just want to do morphological analysis), say this instead:

rambow@Brandenburg-Owen ~ % camel_data full

And now you are ready to go. Here is the basic command for morphological analysis:

rambow@Brandenburg-Owen ~ % camel_morphology analyze

This creates an interactive environment. You type in the word or words you want analyzed and then you get the results (tons of them). The input must be in Arabic script.

أُعطي منزل والدي لابن أخي.

#WORD: أُعطي

diac:أَعْطَى lex:أَعْطَى_1 caphi:2_a_3_t._aa gloss:give;provide+he;it_<verb> bw:أَعْطَى/PV+(null)/PVSUFF_SUBJ:3MS pos:verb catib6:VRB ud:VERB root:ع.ط.# pattern:أَ1ْ2َى prc3:0 prc2:0 prc1:0 prc0:0 per:3 asp:p vox:a mod:i form_gen:m gen:m form_num:s num:s stt:na cas:na enc0:0 rat:n source:spvar stem:أَعْطَى stemcat:PV_0 stemgloss:give;provide d1seg:أَعْطَى d2seg:أَعْطَى d3seg:أَعْطَى atbseg:أَعْطَى d1tok:أَعْطَى d2tok:أَعْطَى d3tok:أَعْطَى atbtok:أَعْطَى bwtok:أَعْطَى pos_logprob:-1.023208 lex_logprob:-3.615221 pos_lex_logprob:-3.615221

(...)

Another, and perhaps easier, way to use the analyzer is non-interactively. The Unix command echo simply outputs what is in its string argument, so it is useful for creating inputs to other commands.

rambow@Brandenburg-Owen ~ % echo "ابن اختي" | camel_morphology analyze | more

#WORD: ابن

diac:أُبْنَ lex:آب-u_1 caphi:2_u_b_n_a gloss:return+they_[fem.pl.]_<verb> bw:أُب/PV+نَ/PVSUFF_SUBJ:3FP pos:verb catib6:VRB ud:VERB root:#.#.ب pattern:أُ3ْنَ prc3:0 prc2:0 prc1:0 prc0:0 per:3 asp:p vox:a mod:i form_gen:f gen:f form_num:p num:p stt:na cas:na enc0:0 rat:n source:spvar stem:أُب stemcat:PV_C stemgloss:return d1seg:أُبْنَ d2seg:أُبْنَ d3seg:أُبْنَ atbseg:أُبْنَ d1tok:أُبْنَ d2tok:أُبْنَ d3tok:أُبْنَ atbtok:أُبْنَ bwtok:أُب_+نَ pos_logprob:-1.023208 lex_logprob:-5.22446 pos_lex_logprob:-5.22446

(...)

We can use another CAMeL Tools command to skip the Arabic alphabet and do everything in Buckwalter transliteration. This makes it a lot easier to work with the analyzer:

rambow@Brandenburg-Owen ~ % echo "kutub" | camel_transliterate -s bw2ar | camel_morphology analyze | camel_transliterate -s ar2bw | more

#WORD: kutub

diac:kutub lex:kitAb_1 caphi:k_u_t_u_b gloss:books bw:kutub/NOUN pos:noun catib6:NOM ud:NOUN root:k.t.b pattern:1u2u3 prc3:0 prc2:0 prc1:0 prc0:0 per:na asp:na vox:na mod:na form_gen:m gen:m form_num:s num:p stt:i cas:u enc0:0 rat:i source:lex stem:kutub stemcat:N stemgloss:books d1seg:kutub d2seg:kutub d3seg:kutub atbseg:kutub d1tok:kutub d2tok:kutub d3tok:kutub atbtok:kutub bwtok:kutub pos_logprob:-0.4344233 lex_logprob:-3.511249 pos_lex_logprob:-3.511249

(...)
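
To make the two transliteration steps in the pipeline above less mysterious, here is a minimal Python sketch of what a Buckwalter-to-Arabic mapping looks like. It is not the implementation behind camel_transliterate, which handles more cases; the names BW2AR, AR2BW, bw2ar and ar2bw are made up for this illustration, and the table leaves out the two rare symbols ` (dagger alif) and { (alif wasla).

```python
# Buckwalter symbols in the same order as the Arabic Unicode code points
# they stand for: U+0621..U+063A (hamza to ghayn) and U+0640..U+0652
# (tatweel, faa to yaa, then the diacritics).
BW_SYMBOLS = "'|>&<}AbptvjHxd*rzs$SDTZEg" + "_fqklmnhwYyFNKaui~o"
AR_CHARS = [chr(c) for c in range(0x0621, 0x063B)] + \
           [chr(c) for c in range(0x0640, 0x0653)]

# Illustrative tables only; ` and { are deliberately not covered.
BW2AR = dict(zip(BW_SYMBOLS, AR_CHARS))
AR2BW = {ar: bw for bw, ar in BW2AR.items()}


def bw2ar(text):
    """Replace each Buckwalter symbol by its Arabic letter or diacritic."""
    # Characters that are not in the table (spaces, digits) pass through.
    return "".join(BW2AR.get(ch, ch) for ch in text)


def ar2bw(text):
    """Replace each Arabic letter or diacritic by its Buckwalter symbol."""
    return "".join(AR2BW.get(ch, ch) for ch in text)


arabic = bw2ar("kutub")
print(arabic)          # the word in Arabic script
print(ar2bw(arabic))   # back to Buckwalter
```

Because the mapping is one symbol to one character, converting to Arabic and back returns the original string, which is why the pipeline can translate the analyzer's output back into Buckwalter without loss.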

The features used in the output are documented here.
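
Since each analysis is a long line of feature:value pairs, it can help to read the output into a program and pick out only the features you care about. The following Python sketch does that. It assumes the layout seen in the examples above (a #WORD: header followed by analysis lines whose fields are separated by spaces and contain no spaces themselves); the function names parse_analysis and parse_output are invented for this illustration, and the sample is a shortened copy of the kutub line.

```python
def parse_analysis(line):
    """Turn one analysis line into a {feature: value} dictionary."""
    features = {}
    for field in line.split():
        # Split at the first colon only: values such as the bw tag
        # (for example ...PVSUFF_SUBJ:3MS) can contain colons themselves.
        name, sep, value = field.partition(":")
        if sep:
            features[name] = value
    return features


def parse_output(text):
    """Group the analysis lines under the #WORD: header preceding them."""
    results = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#WORD:"):
            current = line[len("#WORD:"):].strip()
            results.setdefault(current, [])
        elif current is not None:
            results[current].append(parse_analysis(line))
    return results


# Shortened sample in the format shown above.
sample = """#WORD: kutub

diac:kutub lex:kitAb_1 gloss:books bw:kutub/NOUN pos:noun root:k.t.b num:p
"""

analyses = parse_output(sample)
for features in analyses["kutub"]:
    print(features["lex"], features["pos"], features["gloss"])
```

To use it on real output, you could redirect the analyzer's output to a file and pass the file's contents to parse_output.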

Tregex

Tregex is a version of egrep for trees in various treebank formats. It is from the Stanford NLP group; the start page is here. (A variant, Tsurgeon, allows you to change trees.) I downloaded the version under the first link on that page, which is the general Java installation. There is also a version for Mac, but it is older and there is no need to get it. Once you have put the download somewhere (I put it in /Users/rambow/bin/java/, for example), you just double-click on the file stanford-tregex-4.2.0.jar and you get a GUI.

I needed to download Java for my Mac. I did that here.

You can then choose files from the Files menu. The Arabic Treebank Part 1, v4.1 (as a tar file) is here. The guidelines are here. Download the whole directory. For the Tregex tool, however, you will want to load the .tree files, which are in LDC2010T13_atb1_v4_1/data/penntree/without-vowel (you can also load the version with diacritics instead). That is, do not load the integrated version.

Now you can do searches in the GUI. The commands are explained in README-tregex.txt which is in the same directory as the jar file. Here are some examples:

Find NPs: NP
Find idaafa constructions: @NP < /NOUN\+CASE/ < @NP

A < B means that A immediately dominates B. The @ sign allows for variants of NP, like NP-SUBJ. The /NOUN\+CASE/ is a regular expression: we don't actually care which case the noun is in, but we want to exclude proper nouns. Because we are in a regular expression, we need to escape the plus sign, which otherwise has a specific meaning (one or more copies of the preceding character).
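
The effect of escaping the plus sign can be checked outside Tregex with any regular expression engine. The Python sketch below uses the standard re module and searches anywhere in the label. The three labels are illustrative assumptions modeled on the tag style discussed above, not tags copied from the treebank files.

```python
import re

# Illustrative labels (assumed for this example): a common noun with case,
# a proper noun with case, and a made-up label without a plus sign.
labels = ["NOUN+CASE_DEF_GEN", "NOUN_PROP+CASE_DEF_GEN", "NOUNCASE"]

escaped = re.compile(r"NOUN\+CASE")    # a literal plus sign
unescaped = re.compile(r"NOUN+CASE")   # "NOU", one or more "N", then "CASE"

matched_escaped = [label for label in labels if escaped.search(label)]
matched_unescaped = [label for label in labels if unescaped.search(label)]

print("escaped:  ", matched_escaped)
print("unescaped:", matched_unescaped)
```

With the escape, only the common noun label is found and the proper noun label is skipped, which is the behavior the idaafa pattern relies on. Without the escape, the pattern no longer matches any label that contains an actual plus sign.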
