

Marbella Technology


Finding gold in the sea of unstructured financial data

A well-developed financial system is critical for reallocating capital to enhance the competitiveness, growth, and innovation of an economy, and an efficient financial information environment is essential to its success. Since the passage of the Sarbanes-Oxley Act of 2002 and the 2009 amendment to Regulation S-K, public companies in the US have been required by Congress and the SEC to disclose more and more information about their financial performance and corporate governance. While the financial information available to the market is increasing exponentially and provides unprecedented opportunities to improve market efficiency, about 80% of it is narrative, unstructured data, which requires natural language processing (NLP) technology to extract valuable information for both practitioners and academia. This project aims to explore this direction.

In this project, we propose a web application that detects the structure of companies’ corporate governance guidelines, extracts their contents at the section level, analyzes these contents, and outputs machine-readable data. We believe this proposal fits the purpose described above for three reasons. First, corporate governance guidelines are a valuable and reliable source of comprehensive corporate governance information disclosed by public firms. In the aftermath of several high-profile corporate scandals such as Enron and Tyco, corporate governance received considerable attention, which led Congress to pass the Sarbanes-Oxley Act of 2002. In 2004, the NYSE mandated that listed companies adopt and disclose on their websites corporate governance guidelines that address the key areas of corporate governance, so that investors can better understand each firm’s governance practices and considerations. Second, corporate governance guidelines are largely unstructured, and processing them requires advanced natural language processing technologies. As the NYSE put it, “no single set of guidelines would be appropriate for every company, but certain key areas of universal importance” should be addressed. Thus, different firms are likely to disclose their governance guidelines with similar but not identical structures. While this presents a hurdle for corporate governance research, it also provides a significant opportunity for cross-disciplinary research cooperation. Third, several recent governance reforms and academic studies have called for more direct investigation of firms’ governance practice disclosures, which this project can significantly facilitate.
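To make the section-level extraction step concrete, here is a minimal sketch in Python. It is not the project's actual implementation. It assumes that section headings appear on their own line and start with an Arabic or Roman numeral (for example "1. Board Composition"), which real guidelines will often violate. The names `HEADING_RE`, `split_sections`, and `SAMPLE` are hypothetical and chosen only for illustration.

```python
import json
import re

# Assumption: a heading is a line such as "1. Board Composition" or
# "IV) Director Independence". Real documents will need richer rules.
HEADING_RE = re.compile(r"^\s*(?:\d+|[IVXLC]+)[.)]\s+(?P<title>[A-Z][^\n]*?)\s*$")


def split_sections(text):
    """Split a guidelines document into a list of {"title", "body"} dicts."""
    sections = []
    # Text before the first heading is collected under a "Preamble" section.
    current = {"title": "Preamble", "lines": []}
    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            sections.append(current)
            current = {"title": match.group("title"), "lines": []}
        else:
            current["lines"].append(line.strip())
    sections.append(current)
    # Join body lines and drop the preamble if it has no non-blank lines.
    return [
        {"title": s["title"], "body": " ".join(l for l in s["lines"] if l)}
        for s in sections
        if s["title"] != "Preamble" or any(s["lines"])
    ]


# Invented sample text, used only to exercise the function.
SAMPLE = """Corporate Governance Guidelines
1. Board Composition
A majority of directors shall be independent.
2. Director Compensation
The committee reviews compensation annually.
"""

if __name__ == "__main__":
    # Machine-readable output: one JSON object per detected section.
    print(json.dumps(split_sections(SAMPLE), indent=2))
```

A rule-based splitter like this is only a starting point. Because firms structure their guidelines differently, the heading pattern would likely need to be replaced or supplemented by learned models.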


Students will use Python and NLTK to apply a variety of statistical natural language processing algorithms (named entity recognition, topic modeling, opinion mining, etc.) to the large body of unstructured data described above. Proficiency in at least one programming language and a basic understanding of algorithms are expected; a background in machine learning and artificial intelligence is a plus.
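As a rough illustration of the statistical flavor of this work, the sketch below matches a section from one firm's guidelines to the most similar section in another firm's, using bag-of-words cosine similarity. This is a deliberately simple stand-in, not one of the algorithms listed above, and it uses only the Python standard library rather than NLTK. The stopword list, the function names, and the sample text are all assumptions made for the example.

```python
import math
import re
from collections import Counter

# Assumption: a tiny hand-picked stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "of", "and", "a", "to", "in", "shall", "be", "is", "for"}


def bag_of_words(text):
    """Lowercase the text, keep alphabetic tokens, and count non-stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_match(query, candidates):
    """Return the title in `candidates` (title -> body) most similar to `query`."""
    q = bag_of_words(query)
    return max(candidates, key=lambda title: cosine(q, bag_of_words(candidates[title])))


# Invented sample sections from a hypothetical second firm.
FIRM_B = {
    "Independence of Directors": "A majority of the board must consist of independent directors.",
    "Pay of Directors": "Director compensation is reviewed each year by the committee.",
}

if __name__ == "__main__":
    print(best_match("A majority of directors shall be independent.", FIRM_B))
```

Raw term counts ignore word order and inflection (here "director" and "directors" are treated as unrelated), which is one reason stemming, TF-IDF weighting, or topic models are usually preferred for this kind of alignment.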



The Team:


Jason Dou is a Chinese computer scientist dedicated to tackling important social, business, and technical challenges on our planet. Currently he is exploring the idea of "Business + Artificial Intelligence" through several promising projects involving collaboration across departments, campuses, and oceans. He has a strong interest in cultivating the next generation of leading scholars, star engineers, business practitioners, and policy makers. His undergraduate colleagues have been placed in top PhD programs and leading industry research labs in China and the U.S. His non-stealth-mode projects include defending democracy with Markov chains, natural language processing and design for e-rulemaking, topic modeling to understand Chinese politics, a robot lawyer, optimization for healthcare, etc. Please go to jasondou.org for more information.

Jason pursues power, influence, and friendship in and outside academia. Part of his 2017 New Year's resolution is to become the youngest editor-in-chief of the Journal of Heuristics. Recently he has become a bit anxious since "the clock is ticking" on a shot at the Forbes 30 Under 30. He is not too much into the "publication game", though his name does appear in respected journals such as the Proceedings of the National Academy of Sciences. He founded his first company, headquartered at Trump Tower, 40 Wall Street, before coming to Miami, where he enjoys the sunshine and the beach.







Project description

We are applying statistical machine learning and natural language processing (including state-of-the-art deep learning technology) to discover hidden structure and valuable information in a large collection of corporate governance documents. It will be a great opportunity to collaborate with business school researchers and NLP experts on campus, and to gain the practical experience that industry values most. Basic programming skills are required, familiarity with Java/C++/Python is desired, and a background in machine learning or natural language processing is a bonus. For more information, please contact Jason Dou at jasondou@marbellahk.com.