The Classification of Documents in Malay and Indonesian Using the Naive Bayesian Method Uses Words and Phrases as a Training Set

Marvin Chandra Wijaya

doi:10.13164/mendel.2020.2.023

Maranatha Christian University

Keywords: Malay, Indonesian, Language, Naive Bayesian, Classification

Abstract

Malay and Indonesian are two closely related languages that share much in word meanings and grammar. Because they are so similar, classifying them automatically is a challenge. The Naive Bayesian method is widely used for classification today, but it must be implemented in a particular way to achieve higher accuracy. This study uses a training set made up of both words and phrases, rather than words alone. With this approach, the classification accuracy for the two languages increases.
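The idea of training on words and phrases can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not specify how phrases are extracted or how probabilities are smoothed, so this sketch assumes that a phrase is a pair of adjacent words and that add-one (Laplace) smoothing is used. The function names (`features`, `train`, `classify`) and the four toy sentences in `TRAIN` are hypothetical and serve only to show the mechanics. They do not reproduce the study's data or results.

```python
import math
from collections import Counter


def features(text, use_phrases=True):
    """Return word tokens, plus adjacent two-word phrases if requested.

    Treating a phrase as a pair of adjacent words is an assumption of
    this sketch, not something stated in the abstract.
    """
    words = text.lower().split()
    feats = list(words)
    if use_phrases:
        feats += [f"{a} {b}" for a, b in zip(words, words[1:])]
    return feats


def train(samples, use_phrases=True):
    """Count features per label from (text, label) pairs."""
    counts = {}
    docs = Counter()
    for text, label in samples:
        docs[label] += 1
        counts.setdefault(label, Counter()).update(features(text, use_phrases))
    vocab = {f for c in counts.values() for f in c}
    return {"counts": counts, "docs": docs, "vocab": vocab,
            "use_phrases": use_phrases}


def classify(model, text):
    """Return the label with the highest Naive Bayes log-score."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, c in model["counts"].items():
        # Log prior: share of training documents carrying this label.
        score = math.log(model["docs"][label] / total_docs)
        n = sum(c.values())
        for f in features(text, model["use_phrases"]):
            if f not in model["vocab"]:
                continue  # features never seen in training are skipped
            # Add-one smoothing (an assumption of this sketch).
            score += math.log((c[f] + 1) / (n + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set, invented for illustration only.
TRAIN = [
    ("saya pergi ke hospital dengan kereta", "malay"),
    ("dia boleh datang ke pejabat", "malay"),
    ("saya pergi ke rumah sakit dengan mobil", "indonesian"),
    ("dia bisa datang ke kantor", "indonesian"),
]

if __name__ == "__main__":
    model = train(TRAIN)
    print(classify(model, "saya pergi ke rumah sakit"))
```

Passing `use_phrases=False` to `train` gives the word-only variant, so the two training-set designs can be compared on the same data.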
