The corpus is now on GitHub.
Our sonnet corpus was crawled from sources at http://www.cervantesvirtual.com/.
We removed HTML from the sources.
The sonnets and authors are distributed across periods as shown in the table below. For many authors, exact dates of birth and death are not available in the corpus; only the centuries are given.
Period names ending in .5 correspond to authors whose lives spanned two centuries. For example, period 15.5 covers authors born in the 15th century who died in the 16th century.
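The period convention can be decoded programmatically. The following is a minimal Python sketch, not part of the corpus tooling: the function name `period_to_centuries` is hypothetical, and it assumes that period labels are strings such as "15" or "15.5" and that a whole-number period means the author was born and died in the same century.

```python
from typing import Tuple


def period_to_centuries(period: str) -> Tuple[int, int]:
    """Return (birth_century, death_century) for a period label.

    Hypothetical helper: assumes labels look like "15" or "15.5".
    """
    # Split on the decimal point instead of using floats,
    # so the label is interpreted exactly as written.
    whole, _, fraction = period.strip().partition(".")
    birth_century = int(whole)
    if fraction in ("", "0"):
        # Assumption: a whole-number period stays within one century.
        return birth_century, birth_century
    if fraction == "5":
        # A ".5" period: born in one century, died in the next.
        return birth_century, birth_century + 1
    raise ValueError(f"Unexpected period label: {period!r}")
```

For instance, `period_to_centuries("15.5")` yields the 15th century for birth and the 16th century for death, matching the example above.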
The corpus now also contains sonnets from the 18th century. See the GitHub repo for up-to-date information.