This is one of the most useful kinds of work in corpus linguistics! Imagine a corpus you created being used by many people, as the BNC and COCA are, among many others!
Indonesian Brown Corpus
Why not create an Indonesian Brown Corpus? We can derive the design directly from the Brown Corpus, and we can also learn from other corpora whose architectures are derived from it, such as LOB and BE2006. This webpage concisely describes the Brown Corpus. More details can be found in the Brown Corpus Manual. It is a good place to start! If you are ready for more, here is a paper on BE2006; that corpus was created following the methodology used in designing the Brown Corpus.
It is also useful to create a corpus from literary works. I myself am really interested in constructing a corpus of children's literature. The language in such texts is quite 'easy', so the corpus could be used by learners of English as a second, foreign or additional language.
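To make the Brown design concrete, here is a minimal sketch of its sampling frame: the 15 text categories of the original Brown Corpus, each with its number of samples of roughly 2,000 words (500 samples, about one million words in total). The function name `sampling_plan` and the `scale` parameter are my own illustrative choices, not part of the Brown documentation, and an Indonesian version would need its own decisions about which sources fill each category.

```python
# Brown Corpus design: category -> number of ~2,000-word samples.
BROWN_DESIGN = {
    "A. Press: Reportage": 44,
    "B. Press: Editorial": 27,
    "C. Press: Reviews": 17,
    "D. Religion": 17,
    "E. Skills and Hobbies": 36,
    "F. Popular Lore": 48,
    "G. Belles Lettres, Biography, Memoirs": 75,
    "H. Miscellaneous": 30,
    "J. Learned": 80,
    "K. General Fiction": 29,
    "L. Mystery and Detective Fiction": 24,
    "M. Science Fiction": 6,
    "N. Adventure and Western Fiction": 29,
    "P. Romance and Love Story": 29,
    "R. Humor": 9,
}
WORDS_PER_SAMPLE = 2000


def sampling_plan(design, words_per_sample=WORDS_PER_SAMPLE, scale=1.0):
    """Return {category: (n_samples, target_words)}.

    `scale` (an assumption of this sketch) shrinks or grows the design,
    e.g. scale=0.1 for a small pilot corpus; every category keeps at
    least one sample.
    """
    plan = {}
    for category, n_samples in design.items():
        n_scaled = max(1, round(n_samples * scale))
        plan[category] = (n_scaled, n_scaled * words_per_sample)
    return plan


if __name__ == "__main__":
    for category, (n, words) in sampling_plan(BROWN_DESIGN).items():
        print(f"{category:40s} {n:3d} samples  {words:7d} words")
```

A table like this can serve as a checklist while collecting texts: each category is 'full' once its target number of samples has been reached.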
You are probably familiar with Google Translate, Siri/Alexa, and Grammarly. Perhaps you are one of their users!
These applications use corpus data in the background. The system 'learns' from corpus data. Once the 'knowledge' is acquired, the system uses it to carry out what it was designed to do, such as automatic translation, conversation, or spelling and grammar checking.
A number of Indonesian corpora are available online, and you can help improve their quality (grammar, semantics, corpus architecture, annotation scheme, etc.).
Fix tagging errors
I used this corpus to create the Indonesian parameter file for TreeTagger. With this parameter file, the program can automatically analyse a text written in Indonesian and assign part-of-speech tags (e.g. noun, verb, adjective) to the words in the text. I believe that the quality of the tagging can still be improved by fixing some errors in the corpus!
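One possible way to look for candidate tagging errors is to list words that almost always get one tag but occasionally get another. The sketch below assumes tagged text with one token per line and the word and its tag separated by a tab (the layout TreeTagger prints when run with its `-token` option). The tag labels (`PR`, `VB`, `NN`), the sample lines, and the thresholds are illustrative assumptions, not the actual tagset or data of the corpus.

```python
from collections import Counter, defaultdict

# Hypothetical sample in "word<TAB>tag" layout; tags are illustrative only.
SAMPLE = [
    "Saya\tPR", "makan\tVB", "nasi\tNN",
    "Dia\tPR", "makan\tVB", "ikan\tNN",
    "Kami\tPR", "makan\tVB",
    "Mereka\tPR", "makan\tVB",
    "Waktu\tNN", "makan\tNN",
]


def read_tagged(lines):
    """Yield (word, tag) pairs, skipping blank or malformed lines."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 2 and parts[0]:
            yield parts[0], parts[1]


def suspicious_tags(pairs, min_count=3, max_share=0.25):
    """List (word, tag, tag_count, word_count) for rare tags of frequent words.

    A tag is reported when the word occurs at least `min_count` times and
    the tag accounts for at most `max_share` of those occurrences.
    """
    counts = defaultdict(Counter)
    for word, tag in pairs:
        counts[word.lower()][tag] += 1

    report = []
    for word, tags in counts.items():
        total = sum(tags.values())
        if total < min_count:
            continue
        for tag, n in tags.items():
            if n / total <= max_share:
                report.append((word, tag, n, total))
    return sorted(report)


if __name__ == "__main__":
    for row in suspicious_tags(read_tagged(SAMPLE)):
        print(row)
```

In the sample, `makan` is tagged `VB` four times and `NN` once, so the single `NN` is listed for manual checking. Whether it is really an error is a linguistic decision; the script only narrows down where to look.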
Add semantic annotations
I am now adding semantic information to the corpus and to a dictionary. The aim is to create a system that can automatically assign semantic tags to Indonesian texts!
The semantic tags follow the USAS tagset, developed at Lancaster University. I am happy to welcome you to this project!
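As a rough illustration of dictionary-based semantic tagging, here is a minimal sketch that looks each word up in a small lexicon and falls back to the USAS 'unmatched' tag `Z99` when the word is not listed. The tiny lexicon and the function name `semantic_tag` are invented for this example; they are not the project's actual dictionary or system, and a real tagger would also need to handle multi-word expressions and ambiguity.

```python
# Toy lexicon (an assumption of this sketch): word -> list of USAS tags.
# F1 = Food, L2 = Living creatures, M1 = Moving, coming and going,
# Z5 = Grammatical bin.
TOY_LEXICON = {
    "nasi": ["F1"],
    "makan": ["F1"],
    "kucing": ["L2"],
    "pergi": ["M1"],
    "dan": ["Z5"],
}

UNMATCHED = "Z99"  # USAS tag for words not found in the lexicon


def semantic_tag(tokens, lexicon):
    """Return a list of (token, [tags]) using a case-insensitive lookup."""
    return [(tok, lexicon.get(tok.lower(), [UNMATCHED])) for tok in tokens]


if __name__ == "__main__":
    sentence = "Kucing makan nasi dan tidur".split()
    for token, tags in semantic_tag(sentence, TOY_LEXICON):
        print(f"{token}\t{' '.join(tags)}")
```

Every word that comes back as `Z99` is a candidate for a new dictionary entry, which is exactly where linguistic contributions help.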
There are indeed some great corpus query programs that you can install on your computer, such as AntConc, LancsBox, and WordSmith, among many others. However, some people may still prefer web-based programs such as Sketch Engine, CQPweb, or English-Corpora. These three programs have proven to be reliable when working with big corpora, say above 100 million words.
Of these three, CQPweb is free. Here are a few areas of CQPweb work that I can think of and that might interest you as well.
install CQPweb on a server
This will be of interest to an institution that wants to store big corpora on its own server and share them.
requirement: Linux, PHP, MySQL
do CQPweb admin tasks
index corpus files and their metadata
requirement: knowledge of a programming language, particularly PHP, is a benefit
debug CQPweb, improve existing functionality, or add new functionality
find bugs in CQPweb and report to the developer
improve existing functionality (faster processing, XML visualisation, etc.)
add new functionality that is currently unavailable
requirement: Linux, PHP, MySQL, JavaScript
beautify CQPweb interface
The CQPweb interface can still be improved to look more 'attractive' and be more user-friendly.
requirement: HTML, CSS, JavaScript
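For the admin task of indexing corpus files and their metadata, it helps to know the 'vertical' input format used by the Corpus Workbench (CWB), the engine behind CQPweb: one token per line, with annotations such as the POS tag in tab-separated columns, and XML-style tags on their own lines marking texts and sentences. The sketch below writes that layout and a tab-separated metadata row keyed by text ID. The function names, the sample text ID, and the metadata fields are hypothetical, and special characters in tokens are not handled; please check the CQPweb documentation for the exact requirements of your installation.

```python
def to_vertical(text_id, sentences):
    """Render one text as CWB-style vertical text.

    `sentences` is a list of sentences, each a list of (word, pos) pairs.
    """
    lines = [f'<text id="{text_id}">']
    for sentence in sentences:
        lines.append("<s>")
        for word, pos in sentence:
            lines.append(f"{word}\t{pos}")  # one token per line
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"


def metadata_row(text_id, *fields):
    """Render one tab-separated metadata row; the text ID comes first."""
    return "\t".join([text_id, *fields])


if __name__ == "__main__":
    # Hypothetical text ID, tags, and metadata values.
    print(to_vertical("t001", [[("Saya", "PRON"), ("makan", "VERB")]]), end="")
    print(metadata_row("t001", "fiction", "2020"))
```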
Perhaps you have some books with bombastic titles such as '300 most common words in English' or 'a list of 500 must-know English words', among many others. Well, you know what, there are actually a number of wordlists that were built using rigorous research frameworks! Two examples are Michael West's General Service List (GSL) and Averil Coxhead's Academic Word List (AWL). If interested, you can also read about the New General Service List project here; its lists serve a number of purposes (business English, TOEIC, spoken English, etc.).
While a lot of people, even non-native speakers of English, study these wordlists, unfortunately not many have been created for Indonesian. One that I know of is the high-frequency Indonesian Wordlist created by my colleague, Dr. Deny Kwary, who passed away a few years ago.
You can actually take part in creating Indonesian wordlists. Let me know if you are interested in this!
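To give a flavour of what building a wordlist involves, here is a minimal sketch that counts how often each word occurs (frequency) and in how many texts it occurs (range), then keeps the words that pass both thresholds. Research-based lists use much larger corpora and further criteria; the three sample sentences, the thresholds, and the function names here are illustrative assumptions only.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into words, keeping hyphenated forms like 'anak-anak'."""
    return re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())


def build_wordlist(texts, min_freq=2, min_range=2):
    """Return (word, frequency, range) rows, most frequent first.

    frequency = total occurrences across all texts
    range     = number of texts in which the word occurs at least once
    """
    freq = Counter()
    text_range = Counter()
    for text in texts:
        tokens = tokenize(text)
        freq.update(tokens)
        text_range.update(set(tokens))  # count each word once per text

    rows = [
        (word, freq[word], text_range[word])
        for word in freq
        if freq[word] >= min_freq and text_range[word] >= min_range
    ]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


if __name__ == "__main__":
    texts = [
        "Saya makan nasi.",
        "Anak-anak makan nasi dan ikan.",
        "Saya suka ikan.",
    ]
    for word, f, r in build_wordlist(texts):
        print(f"{word}\t{f}\t{r}")
```

Using range alongside frequency keeps out words that are frequent only because one text repeats them many times.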
Some of you may think that automatic text analysis systems like POS taggers or morphological analysers are always built by programmers or computer scientists. No. If you want, you can also create an annotation system, even if you have little or no knowledge of programming. Of course, it is helpful to have some basic computing knowledge, but if you don't, please don't get discouraged! If you know how to use Notepad and Microsoft Excel, then you are ready to go! Your linguistic knowledge is the main requirement. Let us create tagging systems for languages in Indonesia, such as Javanese, Sundanese, Balinese, etc.!
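To show how far linguistic knowledge in a spreadsheet can go, here is a minimal sketch of a tagger driven entirely by two small tables: a word list with tags and a list of affix rules. Both could be typed in Excel and saved as CSV. The table contents, the tag labels, and the function names are invented for illustration (using Indonesian words), and the affix rules are deliberately crude; they are not a description of any existing tagger.

```python
import csv
import io

# Two tables as they might look after "Save as CSV" from a spreadsheet.
LEXICON_CSV = """word,tag
saya,PRON
buku,NOUN
dan,CONJ
"""

RULES_CSV = """type,affix,tag
prefix,mem,VERB
prefix,ber,VERB
"""


def load_lexicon(csv_text):
    """Read word -> tag from CSV text with 'word' and 'tag' columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["word"].lower(): row["tag"] for row in reader}


def load_rules(csv_text):
    """Read (type, affix, tag) rules from CSV text, in table order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["type"], row["affix"], row["tag"]) for row in reader]


def tag_word(word, lexicon, rules, default="UNK"):
    """Lexicon lookup first, then the first matching affix rule, else `default`."""
    w = word.lower()
    if w in lexicon:
        return lexicon[w]
    for kind, affix, tag in rules:
        long_enough = len(w) > len(affix) + 2  # avoid matching very short words
        if kind == "prefix" and w.startswith(affix) and long_enough:
            return tag
        if kind == "suffix" and w.endswith(affix) and long_enough:
            return tag
    return default


if __name__ == "__main__":
    lexicon = load_lexicon(LEXICON_CSV)
    rules = load_rules(RULES_CSV)
    for token in "Saya membaca buku dan berjalan cepat".split():
        print(f"{token}\t{tag_word(token, lexicon, rules)}")
```

The linguist's work is in the two tables: adding words, refining rules, and listing exceptions. The script itself stays the same as the tables grow.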