Machine Learning for Syntactic and Morphological Analysis of Text in the Kazakh Language

EasyChair Preprint 8101
14 pages • Date: May 28, 2022

Abstract

The article describes the possibility of analyzing texts in the Kazakh language using machine learning. Machine learning is used in the recognition of printed and handwritten text, speech, and images. To address the problem of determining the meaning of words, syntactic and morphological analysis of the text are used; the two are interconnected and allow the text to be divided into tokens and word forms to be derived. The task is complicated by the large number of alternatives that arise during parsing, caused both by the ambiguity of the input data (the same word form can be obtained from different base forms) and by the ambiguity of the parsing rules themselves. The aim of the work is to broaden the range of text-related tasks and applications, in particular improving translation from Kazakh into other languages, including sign language.

Stemming is the process of finding the stem of a word for a given source word. The stem of a word does not always match the root. Lemmatization is the process of reducing a word (word form) to its lemma.

Let us give some explanatory definitions. A lemma is the canonical form of a word. For example, for scientific and technical terminology in the Kazakh language, the lemma is: for nouns, the nominative case, singular; for adjectives, the unchanged form, since adjectives act as attributes and take no endings, while adjectives used as nouns inflect in the same way as nouns; for verbs, the base form of the verb. A word form is a word presented in a specific grammatical form. A lexeme is a word as an abstract unit of natural language. The various paradigmatic forms (word forms) of one word are combined into one lexeme.

Keyphrases: artificial intelligence, tokenization, lemmatization, machine learning
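
The distinction the abstract draws between tokenization, stemming, and lemmatization can be illustrated with a minimal Python sketch. It is not the method of the paper: the suffix list, the tiny lexicon, and the function names (tokenize, stem, lemmatize) are hypothetical and cover only a few forms of the noun "кітап" (book). It shows a stem produced by naive suffix stripping that differs from the lemma, because the final consonant alternates in the possessive form "кітабы".

```python
import re

# Hypothetical, deliberately tiny resources: a few Kazakh plural, locative and
# possessive endings, and a lexicon mapping word forms to their lemma.
SUFFIXES = ["лар", "лер", "дар", "дер", "тар", "тер", "да", "де", "та", "те", "ы", "і"]
LEXICON = {"кітаптар": "кітап", "кітаптарда": "кітап", "кітабы": "кітап"}


def tokenize(text):
    """Split text into lowercase word tokens (\\w matches Cyrillic letters)."""
    return re.findall(r"\w+", text.lower())


def stem(word, min_stem=3):
    """Repeatedly strip the longest matching suffix, keeping at least min_stem letters."""
    ordered = sorted(SUFFIXES, key=len, reverse=True)
    changed = True
    while changed:
        changed = False
        for suffix in ordered:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


def lemmatize(word):
    """Look the word form up in the lexicon; unknown forms are returned unchanged."""
    return LEXICON.get(word, word)


for token in tokenize("Кітаптарда, кітабы."):
    print(token, stem(token), lemmatize(token))
```

In this sketch, both word forms map to the same lemma and so belong to one lexeme, while the stemmer returns "кітаб" for "кітабы". A dictionary lookup is used here only for brevity; resolving the ambiguity the abstract mentions would need a far richer model.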