- Panagis Yannis
Abstract
This chapter examines automated text analysis (ATA), a term covering the methodologies that can be applied to analyse text with computer software. ATA is a computer-assisted approach suited to cases where manual analysis would be prohibitively labour-intensive because of the volume of texts involved. ATA methods have gained popularity with the current interest in big data, given the volume of textual content made easily accessible by the digitization of human activity. Key to ATA is the notion of a corpus, which is a collection of texts. Before any analysis can start, the relevant documents must be collected and the corpora to be used must be constructed; which texts to include is dictated by the research question. After text collection, some processing steps are required before the analysis starts, for example tokenization and part-of-speech tagging. Tokenization is the process of splitting a text into its constituent words, also called tokens, whereas part-of-speech tagging assigns each word a label indicating its part of speech.
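The two preprocessing steps named above can be illustrated with a minimal sketch. It is not the chapter's own method: the regular-expression tokenizer is one simple choice among many, and the tagger is a plain dictionary lookup standing in for a real statistical tagger. The names `TOY_LEXICON`, `tokenize`, and `tag`, the example sentence, and the label set (`DET`, `NOUN`, `VERB`, `PUNCT`, `X`) are all assumptions made for illustration.

```python
import re

# Hypothetical toy lexicon mapping lowercase words to part-of-speech labels.
# A real tagger would be trained on annotated text rather than hand-written.
TOY_LEXICON = {
    "the": "DET",
    "court": "NOUN",
    "rejected": "VERB",
    "appeal": "NOUN",
}


def tokenize(text):
    """Split text into word tokens and standalone punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def tag(tokens):
    """Pair each token with a label; words missing from the lexicon get 'X'."""
    tagged = []
    for token in tokens:
        if re.fullmatch(r"[^\w\s]", token):
            label = "PUNCT"
        else:
            label = TOY_LEXICON.get(token.lower(), "X")
        tagged.append((token, label))
    return tagged


# A corpus is a collection of texts; here it holds a single illustrative one.
corpus = ["The court rejected the appeal."]

for document in corpus:
    tokens = tokenize(document)
    print(tokens)
    print(tag(tokens))
```

In practice the lookup table would be replaced by a trained tagger from an established library, since a word's part of speech depends on its context and not only on its form.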