Hybrid Tokenization Strategy for Turkish Abstractive Text Summarization
Citation
N. Z. Kayalı, S. İlhan Omurca. (2024). Hybrid Tokenization Strategy for Turkish Abstractive Text Summarization. 2024 8th International Artificial Intelligence and Data Processing Symposium (IDAP) 11, pp. 1-10.
Abstract
Text summarization is a significant topic in natural language processing. Tokenization matters here because it underpins how text is recognized and processed. This paper investigates how well different tokenization approaches perform when summarizing Turkish texts, and how combining them affects summarization performance. Whitespace, ULM, BPE and WordPiece tokenization methods are combined in different ways with the pre-trained BERTurk, mT5 and mBART models on the MLSUM dataset. We evaluate each tokenization method individually, as well as all possible combinations, based on the generated summaries and ROUGE scores. Our results show that combining certain tokenization strategies into a hybrid method significantly improves the accuracy and consistency of the summaries. The study offers practical guidance on optimizing models for Turkish text summarization and emphasizes the importance of selecting suitable tokenization strategies.
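As a rough illustration of the evaluation setup described in the abstract, the sketch below enumerates candidate combinations of the four tokenization methods and scores a summary with a simplified ROUGE-1 F1. It makes several assumptions: the abstract does not say how methods are merged into a hybrid, so the combinations are treated here as plain non-empty subsets of method names; the function names `tokenizer_combinations` and `rouge1_f1` are hypothetical; and the score uses lowercase whitespace-split unigrams only, which is not necessarily the ROUGE variant or implementation used in the paper.

```python
from collections import Counter
from itertools import combinations

# Tokenization methods named in the abstract.
TOKENIZERS = ["Whitespace", "ULM", "BPE", "WordPiece"]


def tokenizer_combinations(names):
    """Return every non-empty subset of the method names, smallest first.

    Assumption: a "combination" is just a subset; how the paper actually
    merges tokenizers into a hybrid is not stated in the abstract.
    """
    return [
        combo
        for size in range(1, len(names) + 1)
        for combo in combinations(names, size)
    ]


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1 over lowercase, whitespace-split unigrams."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each unigram counts at most as often as it
    # appears in both the reference and the candidate.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for combo in tokenizer_combinations(TOKENIZERS):
        print(" + ".join(combo))
    print(rouge1_f1("a b c d", "a b"))
```

In a fuller experiment, each combination would presumably drive a separate fine-tuning and evaluation run of a summarization model, with ROUGE computed by an established package rather than this hand-rolled function.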