Digital Technologies Expo
Increasingly large corpora of premodern writing present many opportunities for computer-assisted approaches to exploring large bodies of textual material. Far from replacing traditional methods of close reading and interpretation, these techniques often fulfil their greatest potential only when used in focused combination with domain knowledge about the texts themselves. This presentation offers an overview and live demonstration of a toolset designed to make a core set of text mining functions accessible to a much wider audience of scholars working with Chinese texts than was previously possible. Requiring only minimal technical expertise to use, the toolset can be easily applied to written materials in almost any language; here, textual examples will primarily be taken from the Chinese Text Project, an online digital library of over 30,000 premodern Chinese texts. By extracting textual data from this digital library in real time, the platform greatly reduces the effort needed to produce and reproduce textual analyses, while simultaneously demonstrating the utility of cyberinfrastructure and application programming interfaces in practice.
The techniques demonstrated range from simple collation of term frequencies and collocations, through identification of user-defined patterns of word usage and detection of text reuse, to investigation of authorial style using principal component analysis. Related visualizations for summarizing large amounts of data will also be presented, including interactive charts, network graphs, heat maps, and other visual summaries, each intended not only to summarize results in aggregate, but also to provide efficient ways of exploring and navigating computationally identifiable features of the texts.
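As a rough illustration of the two simplest techniques named above, the following Python sketch counts character n-gram frequencies and finds n-grams shared between two passages as a crude indicator of text reuse. It is not the toolset's own implementation: the function names, the punctuation list, and the choice of character n-grams as the unit of comparison are all assumptions made for this example, and the two sample strings are short illustrative snippets rather than data retrieved from the Chinese Text Project.

```python
from collections import Counter

# Assumed punctuation set for this sketch; a real tool would likely use a fuller list.
PUNCTUATION = set(",。;:?!、「」『』 \n")


def char_ngrams(text, n):
    """Yield overlapping character n-grams that contain no punctuation or whitespace."""
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        if not any(ch in PUNCTUATION for ch in gram):
            yield gram


def ngram_frequencies(text, n=1):
    """Count how often each character n-gram occurs in the text."""
    return Counter(char_ngrams(text, n))


def shared_ngrams(text_a, text_b, n=4):
    """Return the n-grams occurring in both texts (a very simple text reuse signal)."""
    return set(char_ngrams(text_a, n)) & set(char_ngrams(text_b, n))


# Illustrative sample passages (hypothetical inputs for this sketch).
passage_a = "學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?"
passage_b = "子曰:學而時習之,不亦說乎?"

if __name__ == "__main__":
    print(ngram_frequencies(passage_a, 2).most_common(3))
    print(sorted(shared_ngrams(passage_a, passage_b, 4)))
```

Working at the level of characters rather than words sidesteps the need for word segmentation, which is one plausible reason such an approach suits premodern Chinese; whether the toolset itself works this way is not stated here.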
Relevant Links
https://digitalsinology.org/text-tools/
https://digitalsinology.org/text-tools-regex/
https://dsturgeon.net/texttools/
https://ctext.org/
Donald Sturgeon
Harvard University