Chapter 4 What is Content Analysis?

In the previous chapter, we went through a typical scenario of automated content analysis together, using a whole-game approach.

In this chapter, we are going to go back to the theoretical foundation of automated content analysis. Unless the adjective before a noun functions as a negation, one should expect that an object modified by an adjective still belongs to the category of that object. For example, a flying cat is still a cat. A liberal, popular, male dictator is still a dictator. However, fake news is not news, because ‘fake’ functions as a negation.

Based on this logic, I made a proclamation in the previous chapter that automated content analysis is still content analysis, similar to how quantitative content analysis is still content analysis. This seems to be a reasonable expectation, but one cannot always assume that the long compound nouns created by researchers have this simple, logical property. For example, thematic meta-analysis is not meta-analysis, even though ‘thematic’ does not function as a negation.5

So, what is automated content analysis all about? It is better for us to go back to the basic form: What is content analysis?

4.0.1 Definitions

In the realm of content analysis, there are two bibles, so to speak: “Content Analysis: An Introduction to Its Methodology” by Klaus Krippendorff and “The Content Analysis Guidebook” by Kimberly A. Neuendorf. Besides the fact that both books are written by communication researchers, they also give very similar definitions of content analysis.

Krippendorff defines content analysis as follows:

Content analysis is a research technique for making replicable and valid inferences from text (or other meaningful matter) to the context of their use.*

Neuendorf defines content analysis as follows:

Content analysis is a summarizing, quantitative analysis of messages that relies on the scientific method (including attention to objectivity-intersubjectivity, a priori design, reliability, validity, generalizability, replicability, and hypothesis testing) and is not limited as to the types of variables that may be measured or the context in which the messages are created or presented.*

Both definitions emphasize the scientific aspect of the technique. Neuendorf offers a six-part definition of content analysis as a scientific method, which explains each of the six elements inside the parentheses of her definition; all six elements define how content analysis is a scientific method. In this discussion, I focus only on four important elements: a priori design, hypothesis testing, reliability, and validity. These four elements are important for our discussion of automated content analysis and can draw distinctions between automated content analysis and other forms of text analysis, e.g. text mining. They are discussed in the following two subsections: principle and measurement.

4.0.1.1 The principle of (automated) content analysis

4.0.1.1.1 An a priori Design

A content analysis should have an a priori (i.e. “before the fact”) design. Therefore, the research questions, hypotheses and operationalization of variables must be available before the text data are collected. A valid content analysis should leave no room for researchers to play with the data and then choose the variables or even the hypotheses they want to study; this makes any content analysis confirmatory, not exploratory. Thus, any exploratory work should not be labeled as an (automated) content analysis. This draws a major distinction between (automated) content analysis and other text analytic approaches such as text mining, “big data” or data science.

Data science, as defined in Wickham and Grolemund (2016), is a circular process of “transform” (coding / data manipulation), “visualize” (data visualization) and “model” (data modeling). This repeated, circular procedure (for instance, it is possible to go from data modeling back to coding) is not compatible with the strict a priori requirement of (automated) content analysis. (Automated) content analysis must be a linear process. In Neuendorf’s book, she derives a flowchart of content analysis.6 In that flowchart, there are no split paths and the linear process consists of nine steps: theory and rationale; conceptualization decisions; operationalization measures; creation of coding schemes; sampling; training and initial reliability; coding; final reliability; and tabulation and reporting. If (automated) content analysis is content analysis, it should follow the same linear process. There is only a slight difference in the assessment of reliability (see the next subsection).

4.0.1.1.2 Hypothesis testing

A content analysis as a scientific method should be “hypothetico-deductive”. Therefore, an (automated) content analysis must have one or more theory-driven hypotheses to be tested deductively. These hypotheses describe an anticipated prediction. Research questions can also be derived a priori when the existing theory is not sufficient to support an anticipated prediction. Because of this, an (automated) content analysis must have an “a priori design” (the previous point). An exploratory analysis, such as text mining or data science approaches that find patterns in text data, is only useful for hypothesis generation.

4.0.1.2 The measurement in (automated) content analysis

As with all other scientific approaches, all measurements in content analysis must have an adequate level of reliability and validity. The difference between the two concepts can be explained by the oft-cited target metaphor. In this metaphor, the target is like the one used in archery (or, if you are from northern or western Europe, biathlon). There is a bullseye (the center of the target) which represents the actual reality. The distance between a dot (a measurement) and the bullseye represents how close the measurement is to the actual reality.

Reliability is about the consistency of a measurement. A measurement with high reliability should give a very consistent value for the same physical reality under different measurement situations. Examples of these different measurement situations are measurements made by different raters (inter-rater reliability), at different time points (test-retest reliability) or even at different stages of the measurement (split-form reliability, internal consistency). In the target metaphor, a measurement with high reliability is represented by dots falling within the same small area.

Validity is about how close the measurement is to the actual reality. It is usually assessed by subcategories such as construct validity (is the operationalization of the construct to be measured theoretically justified?), content validity (does the content of the measurement cover all aspects of the measured construct?) and criterion validity (does the measurement correlate with other measurements that are representative of the construct?).

Let’s take an example that all scholars love (and hate). If one wants to measure the academic success of a researcher, should one use measurements based on the number of citations of this researcher’s papers (e.g. the H-index)? First, this measurement can vary greatly depending on the citation database used (Google Scholar, Scopus, Web of Science). Thus, this measurement does not have good reliability. Regarding criterion validity, Richard Feynman has an H-index of 40. Albert Einstein is a bit better, at 63. With such H-indices, they probably could not even get a postdoc position nowadays.
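To make the reliability problem concrete, here is a minimal R sketch of how the H-index is computed: the largest number h such that h of the researcher’s papers have at least h citations each. The citation counts and the two “databases” below are made up purely for illustration.

```r
# Hypothetical citation counts for the same researcher's ten papers,
# as reported by two made-up citation databases
citations_db1 <- c(120, 85, 40, 33, 21, 10, 9, 4, 2, 0)
citations_db2 <- c( 95, 60, 35, 20, 12,  8, 5, 3, 1, 0)

# H-index: the largest h such that h papers have at least h citations each
h_index <- function(citations) {
  sum(sort(citations, decreasing = TRUE) >= seq_along(citations))
}

h_index(citations_db1)
#> [1] 7
h_index(citations_db2)
#> [1] 6
```

The same researcher gets two different scores simply because the two databases index the literature differently, which is exactly the kind of inconsistency the concept of reliability captures.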

In general, reliability and validity work antagonistically: a measurement with very high validity usually has lower reliability, and vice versa. Using the above example of measuring the academic success of a researcher, a measurement with greater validity would be to have a panel of experts qualitatively evaluate the researcher’s papers and their actual impact on the field. This method should have greater validity than merely using the H-index, but it trades reliability for validity because it introduces a new source of inter-rater reliability problems.

4.0.1.2.1 Reliability

Traditional content analysis focuses a lot on reliability, especially inter-rater reliability. Klaus Krippendorff, the author of one of the “bibles”, even has an inter-rater reliability measurement bearing his name (Krippendorff’s \(\alpha\)).
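As a small illustration, Krippendorff’s \(\alpha\) can be computed in R with the kripp.alpha() function from the irr package. The codings below are hypothetical: two coders assigning one of three tone categories to five headlines.

```r
library(irr)

# Hypothetical codings: two coders rate the tone of five headlines
# (1 = negative, 2 = neutral, 3 = positive); one rater per row
ratings <- rbind(coder_a = c(1, 2, 3, 3, 2),
                 coder_b = c(1, 2, 3, 2, 2))

# Krippendorff's alpha for nominal-level codings
kripp.alpha(ratings, method = "nominal")
```

The closer the resulting \(\alpha\) is to 1, the better the agreement between the two coders; disagreements such as the one on the fourth headline pull it down.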

This emphasis on inter-rater reliability is due to the dependence on human reading in traditional manual content analysis. Two humans can have two very different understandings of the same constructs, especially latent ones such as sentiment, liberalism, populism, or, probably the most popular one, frames. Usually, researchers are advised to conduct coder training to standardize the coding among coders, an effort to improve inter-rater reliability. To use a cliché: we train our coders to work like a computer.

A computer is a deterministic machine. If the algorithm does not involve any randomization, the same input always gives the same output. Therefore, inter-rater reliability is basically a non-issue for a computer’s coding. There may still be some reliability issues, but they manifest differently. We will discuss these issues in the next chapter.
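The following toy sketch (with a made-up word list and headline) illustrates this determinism: a simple dictionary-based coder, given the same input, always returns the same output, so agreement of the coder with itself is perfect by construction.

```r
# A toy deterministic "coder": count matches against a made-up positive-word list
positive_words <- c("good", "great", "excellent")

code_positivity <- function(text) {
  tokens <- tolower(unlist(strsplit(text, "[^[:alpha:]]+")))
  sum(tokens %in% positive_words)
}

headline <- "Great results from an excellent study"

# Coding the same text twice always yields the same value
identical(code_positivity(headline), code_positivity(headline))
#> [1] TRUE
```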

Remember that reliability and validity work antagonistically? If a computer’s coding can achieve very high reliability, what would its validity be?

4.0.1.2.2 Validity

4.1 What gets automated?

What the current literature says about ACA.

4.2 Best practices

  • Validation

  • Contrasting Confirmatory / exploratory

  • Methodological transparency

References

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media.


  5. The reason is that the term was incorrectly coined: whoever coined it did not carefully consider the original definition of meta-analysis.

  6. This flowchart is also available online.