Coding projects

These are some of the projects in which I have tried to connect linguistics with natural language processing and data science. They are listed from newest to oldest. For more information on each project, and for code samples, see my GitHub.

The following is an index of the projects listed on this page:

LingPeer

LingBuzz data analysis

ling_abstract_classifier

lingbuzz_scraper

WALSpy

LingPeer

LingPeer is a web app that suggests reviewers for papers in theoretical linguistics based on data from LingBuzz. It is meant to streamline the work of editors in the field. It relies on a (non-traditional) ensemble of two multinomial naive Bayes models.
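
As a rough illustration of the idea, the sketch below trains two small multinomial naive Bayes models and averages their probabilities to rank candidate reviewers. Everything in it is an assumption made for the example: the class `TinyMultinomialNB`, the function `suggest_reviewers`, the split into an abstract model and a keyword model, the averaging rule, and the toy data. It is not the code or the combination scheme used by LingPeer.

```python
import math
from collections import Counter, defaultdict


class TinyMultinomialNB:
    """Hypothetical minimal multinomial naive Bayes over whitespace tokens."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict_proba(self, text):
        # Tokens never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        total_docs = sum(self.class_counts.values())
        log_scores = {}
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.class_counts[c] / total_docs)
            for t in tokens:
                # Laplace (add-one) smoothing
                score += math.log(
                    (self.word_counts[c][t] + 1) / (total_words + len(self.vocab))
                )
            log_scores[c] = score
        # Convert log scores to probabilities (log-sum-exp trick)
        top = max(log_scores.values())
        exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp_scores.values())
        return {c: v / norm for c, v in exp_scores.items()}


def suggest_reviewers(abstract_model, keyword_model, abstract, keywords, top_n=2):
    """Average the two models' probabilities and return the best candidates."""
    p_abs = abstract_model.predict_proba(abstract)
    p_kw = keyword_model.predict_proba(keywords)
    combined = {name: (p_abs[name] + p_kw[name]) / 2 for name in p_abs}
    return sorted(combined, key=combined.get, reverse=True)[:top_n]


# Toy training data, invented for illustration; not LingBuzz data.
abstracts = [
    "vowel harmony and tone in optimality theory",
    "wh movement and island constraints in minimalist syntax",
    "scope ambiguity of quantifiers in formal semantics",
]
keywords = [
    "phonology tone harmony",
    "syntax movement islands",
    "semantics quantifiers scope",
]
authors = ["Author A", "Author B", "Author C"]

abstract_model = TinyMultinomialNB().fit(abstracts, authors)
keyword_model = TinyMultinomialNB().fit(keywords, authors)

ranking = suggest_reviewers(
    abstract_model, keyword_model,
    "island effects in wh movement", "syntax islands",
)
print(ranking)
```

Averaging probabilities is only one way to combine two models; a weighted average or a product of probabilities would be equally plausible choices for a sketch like this.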

LingBuzz data analysis

This is an analysis of data sourced from LingBuzz (August 2023). The analysis was carried out in a Jupyter Notebook. It covers the following topics: most downloaded manuscripts, authors with the most manuscripts in the repository, number of downloads per author, most frequent keywords, trends in subdisciplines over time, and collaborations and co-authorship networks.
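
To give a sense of what a few of these counts involve, here is a minimal sketch using only the Python standard library. The record layout (`authors`, `keywords`) and the data are invented for the example and do not come from LingBuzz or from the notebook.

```python
from collections import Counter
from itertools import combinations

# Toy records, invented for illustration; not actual LingBuzz data.
records = [
    {"authors": ["Author A", "Author B"], "keywords": ["syntax", "islands"]},
    {"authors": ["Author A"], "keywords": ["syntax", "semantics"]},
    {"authors": ["Author B", "Author C", "Author A"], "keywords": ["phonology"]},
]

# Most frequent keywords
keyword_counts = Counter(kw for rec in records for kw in rec["keywords"])

# Number of manuscripts per author
manuscripts_per_author = Counter(a for rec in records for a in rec["authors"])

# Co-authorship edges: one unordered pair per pair of co-authors on a manuscript.
# Sorting the names first makes (A, B) and (B, A) count as the same edge.
coauthor_edges = Counter(
    pair
    for rec in records
    for pair in combinations(sorted(rec["authors"]), 2)
)

print(keyword_counts.most_common(3))
print(manuscripts_per_author.most_common(3))
print(coauthor_edges.most_common(3))
```

The edge counts can serve as weights when the pairs are loaded into a graph library to draw the co-authorship network.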

Notebook

ling_abstract_classifier

The ling_abstract_classifier script takes an abstract in theoretical linguistics and classifies it into one or more of the core linguistic subdisciplines (phonology, morphology, syntax, semantics). This project is primarily a proof of concept that tests whether the LingBuzz database can be used to build tools for linguists.
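
The "one or more" part can be pictured as a thresholding step on per-subdiscipline scores. The sketch below assumes that some model has already produced such scores; the function `select_labels`, the threshold of 0.3, and the example scores are all hypothetical and are not taken from the script itself.

```python
SUBDISCIPLINES = ("phonology", "morphology", "syntax", "semantics")


def select_labels(scores, threshold=0.3):
    """Return every subdiscipline scoring at least `threshold`, best first.

    Falls back to the single best label when none reaches the threshold.
    """
    ranked = sorted(
        SUBDISCIPLINES, key=lambda label: scores.get(label, 0.0), reverse=True
    )
    chosen = [label for label in ranked if scores.get(label, 0.0) >= threshold]
    return chosen or ranked[:1]


# Invented scores for an abstract at the syntax-semantics interface.
example_scores = {
    "phonology": 0.05,
    "morphology": 0.10,
    "syntax": 0.45,
    "semantics": 0.40,
}
labels = select_labels(example_scores)
print(labels)
```

The fallback guarantees that every abstract receives at least one label, which matches the "one or more" behavior described above.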

GitHub repository

lingbuzz_scraper

This script scrapes data from LingBuzz and saves the extracted information to a CSV file. The data can also be retrieved as a pandas DataFrame.
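
The saving step might look roughly like the sketch below, which writes a list of records to a CSV file and reads it back with the standard library. The scraping itself is left out, and the field names, the helper functions `save_rows` and `load_rows`, and the placeholder rows are assumptions made for the example rather than the script's actual output format.

```python
import csv
import tempfile
from pathlib import Path

# Placeholder rows invented for illustration; the field names are assumptions.
rows = [
    {"id": "000001", "title": "A toy manuscript",
     "authors": "Author A; Author B", "keywords": "syntax, islands"},
    {"id": "000002", "title": "Another toy manuscript",
     "authors": "Author C", "keywords": "phonology"},
]


def save_rows(rows, path):
    """Write a list of dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def load_rows(path):
    """Read the CSV file back as a list of dictionaries (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Write to a temporary directory so the example leaves the working directory alone.
csv_path = Path(tempfile.mkdtemp()) / "lingbuzz_sample.csv"
save_rows(rows, csv_path)
loaded = load_rows(csv_path)
print(loaded[0]["title"])
```

With pandas installed, `pandas.DataFrame(rows)` builds the equivalent DataFrame from the same list of dictionaries, and `pandas.read_csv(csv_path)` loads the saved file.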

GitHub repository

WALSpy

WALSpy is a collection of Python functions for exploring the World Atlas of Language Structures (WALS). It lets you retrieve information from the database as pandas objects. It also includes some "experimental" functionality for plotting heatmaps, finding correlations between linguistic features, and predicting unreported properties of a language.
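
One simple way to think about the last two items is a cross-tabulation of two features, followed by a majority guess for a language whose value is missing. The sketch below uses plain dictionaries instead of pandas objects; the functions `contingency` and `predict_missing`, the feature names, and the toy languages are all invented for the example and are neither WALSpy's API nor actual WALS data.

```python
from collections import Counter, defaultdict

# Toy table invented for illustration; not actual WALS values.
languages = {
    "Lang1": {"word_order": "SOV", "adposition": "postpositions"},
    "Lang2": {"word_order": "SOV", "adposition": "postpositions"},
    "Lang3": {"word_order": "SVO", "adposition": "prepositions"},
    "Lang4": {"word_order": "SVO", "adposition": "prepositions"},
    "Lang5": {"word_order": "SOV", "adposition": "prepositions"},
    "Lang6": {"word_order": "SOV", "adposition": None},  # unreported value
}


def contingency(languages, feature_a, feature_b):
    """Cross-tabulate two features, skipping languages with a missing value."""
    table = defaultdict(Counter)
    for values in languages.values():
        a, b = values.get(feature_a), values.get(feature_b)
        if a is not None and b is not None:
            table[a][b] += 1
    return table


def predict_missing(languages, name, known_feature, target_feature):
    """Guess the target value as the most common one among languages
    that share the known feature value; return None if there is no evidence."""
    table = contingency(languages, known_feature, target_feature)
    known_value = languages[name][known_feature]
    if not table[known_value]:
        return None
    return table[known_value].most_common(1)[0][0]


table = contingency(languages, "word_order", "adposition")
guess = predict_missing(languages, "Lang6", "word_order", "adposition")
print(dict(table["SOV"]), guess)
```

A majority guess from a single feature is deliberately crude; it only shows the shape of the problem, not a method to rely on.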

GitHub repository
