

Marbella Technology


Finding gold in the sea of unstructured financial data

A well-developed financial system is critical for reallocating capital to enhance the competitiveness, growth, and innovation of an economy, and an efficient financial information environment is essential to its success. With the passage of the Sarbanes-Oxley Act of 2002 and the 2009 amendment to Regulation S-K, public companies in the US are mandated by Congress and the SEC to disclose more and more information about their financial performance and corporate governance. While the financial information available to the market is increasing exponentially and provides unprecedented opportunities to improve market efficiency, about 80% of it is narrative, unstructured data, which requires natural language processing (NLP) technology to extract information valuable to both practitioners and academics. This project aims to explore this direction in a meaningful way.

In this project, we propose a web application that detects the structure of companies’ corporate governance guidelines, extracts their contents at the section level, analyzes these contents, and outputs machine-readable data. We believe this proposal fits the purpose described above for three reasons. First, corporate governance guidelines represent a valuable and solid source of comprehensive corporate governance information disclosed by public firms. In the aftermath of several high-profile corporate scandals such as Enron and Tyco, corporate governance received considerable attention, which led Congress to pass the Sarbanes-Oxley Act of 2002. In 2004, the NYSE mandated that listed companies adopt and disclose on their websites corporate governance guidelines addressing the key areas of corporate governance, so that investors could better understand each firm’s governance practices and considerations. Second, corporate governance guidelines are largely unstructured and require advanced natural language processing technologies. As the NYSE put it, “no single set of guidelines would be appropriate for every company, but certain key areas of universal importance” should be addressed. Thus, different firms are likely to disclose their governance guidelines with similar but not identical structures. While this variation presents a hurdle for corporate governance research, it also provides a significant opportunity for cross-disciplinary research cooperation. Third, several recent governance reforms and academic studies have called for more direct investigation of firms’ governance practice disclosures, which this project can significantly facilitate.
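The detect, extract, and output steps described above can be illustrated with a minimal sketch. It assumes section headings appear as numbered or Roman-numeral lines (for example "1. Board Composition"), which will not hold for every firm's guidelines. The heading pattern, the function name `split_sections`, and the sample text are all hypothetical and are not part of the project's actual implementation.

```python
import json
import re

# Hypothetical heading pattern: a numbered or Roman-numeral line such as
# "1. Board Composition" or "IV. Director Compensation". Real guidelines
# vary widely, so this pattern is an assumption, not a general solution.
HEADING_RE = re.compile(r"^\s*(?:\d+|[IVXLC]+)\.\s+(?P<title>[A-Z][^\n]{2,80})$")


def split_sections(text):
    """Split a guidelines document into a list of {title, body} records."""
    sections = []
    current = {"title": "Preamble", "body": []}
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            # Close the previous section before starting a new one.
            sections.append(current)
            current = {"title": match.group("title").strip(), "body": []}
        elif line.strip():
            current["body"].append(line.strip())
    sections.append(current)
    # Join body lines and drop sections that ended up with no content.
    return [
        {"title": s["title"], "body": " ".join(s["body"])}
        for s in sections
        if s["body"]
    ]


# Made-up sample text, for illustration only.
SAMPLE = """Corporate Governance Guidelines
1. Board Composition
A majority of directors shall be independent.
2. Director Compensation
The committee reviews director pay annually.
"""

if __name__ == "__main__":
    # Emit the sections as machine-readable JSON.
    print(json.dumps(split_sections(SAMPLE), indent=2))
```

A rule-based splitter like this is only a starting point. Because firms structure their guidelines differently, a statistical or learned approach to heading detection would likely be needed in practice.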


Students will use Python and NLTK to apply a variety of statistical natural language processing algorithms (named entity recognition, topic modeling, opinion mining, etc.) to the large body of unstructured data described above. Proficiency in at least one programming language and a basic understanding of algorithms are expected; a background in machine learning and artificial intelligence is a plus.
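As a rough illustration of the kind of statistical processing involved, the sketch below scores terms by TF-IDF across a few made-up guideline snippets, so that terms shared by every firm score zero and firm-specific terms stand out. It is a simpler baseline than the algorithms listed above, and it uses only the Python standard library so that it runs without NLTK data downloads. In the project itself, NLTK's tokenizers and stop-word lists would presumably replace the regular-expression tokenizer and the tiny stop list used here. The firm names, snippets, and function names are hypothetical.

```python
import math
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z]+")
# Tiny hypothetical stop list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "shall", "be", "is", "to", "by", "are"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in TOKEN_RE.findall(text.lower()) if t not in STOPWORDS]


def tfidf(docs):
    """Return {doc_name: {term: score}} with score = tf * log(N / df)."""
    tokens = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokens.values():
        df.update(set(toks))
    scores = {}
    for name, toks in tokens.items():
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        scores[name] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return scores


def top_terms(term_scores, k=3):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(term_scores, key=lambda t: (-term_scores[t], t))[:k]


# Made-up snippets standing in for sections of three firms' guidelines.
DOCS = {
    "firm_a": "The board shall consist of a majority of independent directors.",
    "firm_b": "The board reviews director compensation and stock ownership.",
    "firm_c": "The board evaluates the chief executive officer annually.",
}

if __name__ == "__main__":
    for name, term_scores in tfidf(DOCS).items():
        print(name, top_terms(term_scores))
```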



The Team:






Project description

We are applying statistical machine learning and natural language processing (including state-of-the-art deep learning technology) to discover hidden structure and valuable information in a large collection of corporate governance documents. It will be a great opportunity to collaborate with business school researchers and NLP experts on campus, and to gain the practical experience that industry values most. Basic programming skills are required, familiarity with Java, C++, or Python is desired, and a background in machine learning or natural language processing is a bonus. For more information, please contact Jason Dou at jasondou@marbellahk.com.