NEW: DocEng'2021 Competition on Extractive Text Summarization

The CNN-corpus is a test corpus for single-document extractive summarization of news articles. The current version contains 3,000 texts in English and 1,117 in Spanish, each with both an abstractive and an extractive summary. The corpus supports quantitative and qualitative assessment of extractive summarization strategies.



List of publications of the UFRPE/UFPE/Colorado State University summarization group:

  • Vale, R., Lins, R. D., & Ferreira, R. (2020, September). An Assessment of Sentence Simplification Methods in Extractive Text Summarization. In Proceedings of the ACM Symposium on Document Engineering 2020 (pp. 1-9).

  • Lins, R. D., Oliveira, H., Cabral, L., Batista, J., Tenorio, B., Ferreira, R., ... & Simske, S. J. (2019, September). The CNN-Corpus: A large textual corpus for single-document extractive summarization. In Proceedings of the ACM Symposium on Document Engineering 2019 (p. 16). ACM.

  • Lins, R. D., Oliveira, H., Cabral, L., Batista, J., Tenorio, B., Salcedo, D. A., ... & Simske, S. J. (2019, September). The CNN-Corpus in Spanish: a Large Corpus for Extractive Text Summarization in the Spanish Language. In Proceedings of the ACM Symposium on Document Engineering 2019 (p. 38). ACM.

  • Lins, R. D., Mello, R. F., & Simske, S. (2019, September). DocEng'19 Competition on Extractive Text Summarization. In Proceedings of the ACM Symposium on Document Engineering 2019 (pp. 1-2).

  • Oliveira, H., Ferreira, R., Lima, R., Lins, R. D., Freitas, F., Riss, M., & Simske, S. J. (2016). Assessing shallow sentence scoring techniques and combinations for single and multi-document summarization. Expert Systems with Applications, 65, 68-86.

  • Cabral, L., Lima, R., Lins, R., Neto, M., Ferreira, R., Simske, S., & Riss, M. (2015, October). Automatic summarization of news articles in mobile devices. In 2015 Fourteenth Mexican International Conference on Artificial Intelligence (MICAI) (pp. 8-13). IEEE.

  • Ferreira, R., Lins, R. D., Cabral, L., Freitas, F., Simske, S. J., & Riss, M. (2015, September). Automatic Document Classification using Summarization Strategies. In Proceedings of the 2015 ACM Symposium on Document Engineering (pp. 69-72). ACM.

  • Batista, J., Ferreira, R., Tomaz, H., Ferreira, R., Dueire Lins, R., Simske, S., ... & Riss, M. (2015, September). A quantitative and qualitative assessment of automatic text summarization systems. In Proceedings of the 2015 ACM Symposium on Document Engineering (pp. 65-68). ACM.

  • Silva, G., Ferreira, R., Lins, R. D., Cabral, L., Oliveira, H., Simske, S. J., & Riss, M. (2015, September). Automatic text document summarization based on machine learning. In Proceedings of the 2015 ACM Symposium on Document Engineering (pp. 191-194). ACM.

  • Ferreira, R., de Souza Cabral, L., Freitas, F., Lins, R. D., de França Silva, G., Simske, S. J., & Favaro, L. (2014). A multi-document summarization system based on statistics and linguistic treatment. Expert Systems with Applications, 41(13), 5780-5787.

  • Ferreira, R., de Souza Cabral, L., Lins, R. D., e Silva, G. P., Freitas, F., Cavalcanti, G. D., ... & Favaro, L. (2013). Assessing sentence scoring techniques for extractive text summarization. Expert Systems with Applications, 40(14), 5755-5764.

  • Sodré, L. C., & de Oliveira, H. T. A. Evaluating Regression Algorithms for Automatic Text Summarization in Brazilian Portuguese.

  • de Brito Gomes, L. B., & de Oliveira, H. T. A. A Multi-document Summarization System for News Articles in Portuguese using Integer Linear Programming.

  • Antunes, J., Lins, R. D., Lima, R., Oliveira, H., Riss, M., & Simske, S. J. (2018). Automatic cohesive summarization with pronominal anaphora resolution. Computer Speech & Language, 52, 141-164.

  • de Oliveira, H. T. A., Lins, R. D., Lima, R., Freitas, F., & Simske, S. J. (2018, October). A Concept-Based ILP Approach for Multi-document Summarization Exploring Centrality and Position. In 2018 7th Brazilian Conference on Intelligent Systems (BRACIS) (pp. 37-42). IEEE.

  • Garcia, R., Lima, R., Espinasse, B., & Oliveira, H. (2018, April). Towards coherent single-document summarization: an integer linear programming-based approach. In Proceedings of the 33rd Annual ACM Symposium on Applied Computing (pp. 712-719). ACM.

  • Oliveira, H., Lins, R. D., Lima, R., & Freitas, F. (2017, November). A regression-based approach using integer linear programming for single-document summarization. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI) (pp. 270-277). IEEE.

  • Oliveira, H., Lima, R., Lins, R. D., Freitas, F., Riss, M., & Simske, S. J. (2016, September). Assessing concept weighting in integer linear programming based single-document summarization. In Proceedings of the 2016 ACM Symposium on Document Engineering (pp. 205-208). ACM.

  • Oliveira, H., Lima, R., Lins, R. D., Freitas, F., Riss, M., & Simske, S. J. (2016, October). A concept-based integer linear programming approach for single-document summarization. In 2016 5th Brazilian Conference on Intelligent Systems (BRACIS) (pp. 403-408). IEEE.

Please fill out the following form to access roughly two thirds of the CNN Summarization Corpus (2,000 English and 734 Spanish documents).