Base Model Fine-tuning
The BERT-style models (CodeBERT and GraphCodeBERT) follow the original fine-tuning procedure.
Train
lang=java
lr=5e-5
batch_size=32
beam_size=10
source_length=256
target_length=64
data_dir=../dataset
output_dir=model/$lang
train_file=$data_dir/$lang/train.jsonl
dev_file=$data_dir/$lang/valid.jsonl
epochs=40
pretrained_model=microsoft/graphcodebert-base  # alternatives: microsoft/codebert-base, roberta-base
python run.py \
  --do_train --do_eval \
  --model_type roberta \
  --model_name_or_path $pretrained_model \
  --train_filename $train_file \
  --dev_filename $dev_file \
  --output_dir $output_dir \
  --max_source_length $source_length \
  --max_target_length $target_length \
  --beam_size $beam_size \
  --train_batch_size $batch_size \
  --eval_batch_size $batch_size \
  --learning_rate $lr \
  --num_train_epochs $epochs
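The training and validation files are jsonl (one JSON object per line). The exact field names depend on how the dataset was built; assuming a CodeSearchNet-style layout, a quick sanity check such as the following hypothetical snippet shows what each record contains:

# Minimal sketch: inspect the first record of the training file.
# The path and field names are assumptions; adjust them to your dataset layout.
import json

with open("../dataset/java/train.jsonl") as f:
    first = json.loads(next(f))
print(sorted(first.keys()))                 # e.g. fields such as "code" and "docstring" are expected
print(first.get("docstring", "")[:80])      # preview of the target summary, if that field exists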
Test
lang=java
beam_size=10
batch_size=128
source_length=256
target_length=64
output_dir=model/$lang
data_dir=../dataset
dev_file=$data_dir/$lang/valid.jsonl
test_file=$data_dir/$lang/test.jsonl
test_model=$output_dir/checkpoint-best-bleu/pytorch_model.bin
python run.py \
  --do_test \
  --model_type roberta \
  --model_name_or_path microsoft/graphcodebert-base \
  --load_model_path $test_model \
  --dev_filename $dev_file \
  --test_filename $test_file \
  --output_dir $output_dir \
  --max_source_length $source_length \
  --max_target_length $target_length \
  --beam_size $beam_size \
  --eval_batch_size $batch_size &
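Testing typically writes generated summaries and references into $output_dir, and the original evaluation scripts report smoothed BLEU-4. If you want to recompute a comparable score yourself, here is a minimal NLTK sketch; the file names are assumptions (point them at the output/gold files actually produced), and it is not identical to the repository's own BLEU script.

# Sketch: sentence-level smoothed BLEU-4 over hypothesis/reference files,
# assuming one summary per line. File names below are hypothetical.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method4
with open("model/java/test.output") as hyp_f, open("model/java/test.gold") as ref_f:
    scores = [
        sentence_bleu([ref.split()], hyp.split(), smoothing_function=smooth)
        for hyp, ref in zip(hyp_f, ref_f)
    ]
print(f"Smoothed BLEU-4: {100 * sum(scores) / len(scores):.2f}")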
The CodeT5 model also follows its original fine-tuning procedure:
python run_exp.py --model_tag codet5_base --task summarize --sub_task java
Data preprocessing follows the previous related work; a sketch of these filtering rules is given after the list.
Split comments and code with the Python package javalang.
Remove examples whose code cannot be parsed into an abstract syntax tree.
Remove examples whose code or document has fewer than 2 or more than 512 tokens.
Remove examples whose documents contain special tokens.
Keep only the first paragraph of the documentation (remove the remaining paragraphs).
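The snippet below is a minimal sketch of these filtering rules, not the exact preprocessing script. It assumes CodeSearchNet-style jsonl records with "code" (a Java method) and "docstring" fields, wraps each method in a dummy class so that javalang can parse it, and uses a simple heuristic for "special tokens".

# Sketch of the filtering rules above (illustrative only).
import json
import javalang

def keep_example(example, min_tokens=2, max_tokens=512):
    code, doc = example["code"], example["docstring"]
    try:
        # javalang parses compilation units, so wrap the method in a dummy class
        # to check that the code can be turned into an AST.
        javalang.parse.parse("class _Dummy { " + code + " }")
        code_tokens = [tok.value for tok in javalang.tokenizer.tokenize(code)]
    except Exception:
        return False  # code cannot be parsed or tokenized
    doc_tokens = doc.split()
    # Token-count filter: keep examples with 2..512 code and document tokens.
    for tokens in (code_tokens, doc_tokens):
        if not (min_tokens <= len(tokens) <= max_tokens):
            return False
    # Drop documents containing special tokens (heuristic: URLs or HTML tags).
    if "http" in doc or "<" in doc:
        return False
    return True

with open("../dataset/java/train_raw.jsonl") as f:  # hypothetical raw input file
    kept = [ex for ex in map(json.loads, f) if keep_example(ex)]
print(f"kept {len(kept)} examples")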