A Turkish Dataset and BERTurk-Contrastive Model for Semantic Textual Similarity
Subject Areas: Natural Language Processing
Somaiyeh Dehghan 1*, Mehmet Fatih Amasyali 2
1 - Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey
2 - Department of Computer Engineering, Yildiz Technical University, Istanbul, Turkey
Keywords: Semantic Textual Similarity, Contrastive Learning, Deep Learning, BERT, BERTurk, Turkish Language
Abstract:
Semantic Textual Similarity (STS) is an important NLP task that measures the degree of semantic equivalence between two texts, even when the texts use different words. While STS has been studied extensively in English, it has received limited attention in Turkish. This study introduces BERTurk-contrastive, a novel BERT-based model that uses contrastive learning to improve STS in Turkish. The model aims to learn sentence representations by pulling similar sentences closer together in the embedding space while pushing dissimilar ones farther apart. To support this task, we release SICK-tr, a new Turkish STS dataset created by translating the English SICK dataset. We evaluate our model on STSb-tr and SICK-tr, achieving a significant improvement of 5.92 points over previous models. These results establish BERTurk-contrastive as a robust solution for STS in Turkish and provide a new benchmark for future research.
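The contrastive objective described in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact loss, so the code below assumes a common InfoNCE-style formulation over cosine similarities. The function names `cosine` and `contrastive_loss`, the temperature of 0.05, and the two-dimensional toy vectors standing in for sentence embeddings are all hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.05):
    """InfoNCE-style loss (assumed form): low when the anchor is close to
    the positive and far from every negative in the embedding space."""
    pos = cosine(anchor, positive) / temperature
    logits = [pos] + [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - pos


# Toy embeddings (assumed values): the anchor points along the first axis.
anchor = [1.0, 0.0]
near = [0.9, 0.1]    # nearly parallel to the anchor
far = [0.0, 1.0]     # orthogonal to the anchor
opposite = [-1.0, 0.0]

# Similar pair treated as the positive: the loss is close to zero.
good = contrastive_loss(anchor, near, [far, opposite])
# Dissimilar pair treated as the positive: the loss is large.
bad = contrastive_loss(anchor, far, [near, opposite])
print(good < bad)
```

Minimizing a loss of this shape during fine-tuning would move the positive's embedding toward the anchor and the negatives' embeddings away from it, which is the behavior the abstract describes.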