Organized Panel Session
This paper highlights issues in the analysis of South Asian languages using digital resources. The emphasis is on teaching digital linguistic techniques to students in South Asia area studies, drawing on empirical data from a digital humanities course of my own design. In this course, students with no prior background in computer programming learn the fundamentals of creating and maintaining a digital language toolkit for a language of their choice. The contents of each toolkit depend on the requirements of the student’s research language, but generally include basic functions for reading Unicode text files, interacting with a database, and performing grammatical analysis of text strings. These toolkits give students a starting point for more complex projects, and can be adapted for use in online dictionaries, automated translation engines, and other applications of interest to linguists.
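As an illustration only, the three kinds of toolkit function named above might look something like the following minimal sketch. It is not the course's actual code: the function names, the `lexicon` table layout, and the romanized suffix list are all hypothetical, and it assumes UTF-8 input and Python's built-in `sqlite3` module as the database.

```python
import sqlite3
import unicodedata
from pathlib import Path


def read_text(path):
    """Read a UTF-8 text file and normalize it to Unicode NFC."""
    raw = Path(path).read_text(encoding="utf-8")
    return unicodedata.normalize("NFC", raw)


def open_lexicon(db_path=":memory:"):
    """Open (or create) a small lexicon database. The schema is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lexicon (form TEXT PRIMARY KEY, gloss TEXT)"
    )
    return conn


def add_entry(conn, form, gloss):
    """Store a word form and its gloss, normalizing the form first."""
    form = unicodedata.normalize("NFC", form)
    conn.execute(
        "INSERT OR REPLACE INTO lexicon (form, gloss) VALUES (?, ?)", (form, gloss)
    )
    conn.commit()


def lookup(conn, form):
    """Return the gloss for a word form, or None if it is not in the lexicon."""
    form = unicodedata.normalize("NFC", form)
    row = conn.execute(
        "SELECT gloss FROM lexicon WHERE form = ?", (form,)
    ).fetchone()
    return row[0] if row else None


def strip_suffix(word, suffixes):
    """Split off the longest matching suffix; return (stem, suffix).

    Returns (word, "") when no suffix matches or the word is only a suffix.
    """
    for suffix in sorted(suffixes, key=len, reverse=True):
        if suffix and word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], suffix
    return word, ""
```

Normalizing on both storage and lookup means that two encodings of the same written word, such as a precomposed Devanagari nukta letter and its base-plus-nukta sequence, retrieve the same entry. The suffix stripper is only a placeholder for grammatical analysis; a real toolkit would need rules specific to the research language.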
South Asian languages present challenges at every step of the digitization process. Optical character recognition of the scripts used for these languages is often unreliable. Transliteration schemes intended to preserve texts can introduce new ambiguities. Text written without spaces between words is difficult to parse into sentences. Nonstandard spellings, genre-specific vocabulary, idiosyncratic grammar, and other factors can complicate otherwise simple projects. However, many of these problems can be resolved or avoided by using the right tools. I suggest a few solutions and provide some sample code in Python.
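To make the word-spacing problem concrete, one common baseline is greedy longest-match segmentation against a word list. The sketch below is an assumption for illustration, not necessarily one of the solutions proposed in the paper: the function name `segment` and the small romanized lexicon are hypothetical, and the approach ignores sandhi and other sound changes at word boundaries.

```python
def segment(text, lexicon):
    """Split unspaced text into words by greedy longest match.

    At each position the longest lexicon entry that matches is taken.
    Characters that start no known word are emitted as one-character tokens.
    Note: this works on code points, so in Indic scripts an unknown syllable
    may be split between a consonant and its combining marks.
    """
    max_len = max((len(word) for word in lexicon), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i : i + length]
            if candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
        else:
            # No lexicon entry starts here; keep the character as-is.
            tokens.append(text[i])
            i += 1
    return tokens


# Hypothetical romanized lexicon for illustration only.
words = {"deva", "datta", "devadatta", "gacchati"}
print(segment("devadattagacchati", words))
```

Because the match is greedy, the longer entry `devadatta` is preferred over `deva` followed by `datta`. That preference is a design choice rather than a linguistic fact, and ambiguous strings generally call for a lexicon with frequency information or a statistical model.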