Francesco Fontanella's personal website

The Avila data set

Download: link (the dataset is also available on UCI)

Data set description

The Avila data set has been extracted from 800 images of the “Avila Bible”, a giant Latin copy of the whole Bible produced during the XII century between Italy and Spain.

The palaeographic analysis of the manuscript has individuated the presence of 12 copyists. The pages written by each copyist are not equally numerous.

Each pattern contains 10 features and corresponds to a group of 4 consecutive rows.

The prediction task consists of associating each pattern to one of the 12 copyists (labeled as A, B, C, D, E, F, G, H, I, W, X, Y).

Data have been normalized by using the Z-normalization method and divided into two data sets: a training set containing 10430 samples and a test set containing 10437 samples.

Class distribution (training set)

A: 4286

B: 5

C: 103

D: 352

E: 1095

F: 1961

G: 446

H: 519

I: 831

W: 44

X: 522

Y: 266

Note that all the images of the Avila Bible are publicly available here

Attribute description

ID Name

F1 intercolumnar distance

F2 upper margin

F3 lower margin

F4 exploitation

F5 row number

F6 modular ratio

F7 interlinear spacing

F8 weight

F9 peak number

F10 modular ratio/ interlinear spacing

Class: A, B, C, D, E, F, G, H, I, W, X, Y

Citations

If you want to refer to the Avila data set in a publication, please cite the following paper:

C. De Stefano, F. Fontanella, M. Maniaci and A. Scotto di Freca, "A Method for Scribe Distinction in Medieval Manuscripts Using Page Layout Features", Lecture Notes in Computer Science, G. Maino and G. Foresti (eds.), Springer-Verlag, vol. 6978, pp. 393-402.

INFO & HELP

If you need any further information or help, do not hesitate to contact me, by sending an email to: fontanella at unicas.it