Literate programming

Introduction to literate programming.

info
coding
literate programming
Authors: Isabelle Chang

Published: 2/20/23

Overview

Literate programming refers to the integration of code and prose in a reproducible document. This practice is not yet mainstream in linguistics, although it holds several advantages over traditional reporting methods. Traditionally, statistical analyses, plots, tables, citations, and captions are created separately and then manually inserted into a manuscript. One potential issue with this approach is an increased probability of reporting errors. For example, a recent study found that…(Roettger analysis of Labphon). A literate programming approach to manuscript creation would plausibly reduce the number of these errors and would make the correct information easier to trace. Additionally, updates to the data would be (almost) automatically integrated into a given manuscript when the necessary scripts are run again.

The present tutorial provides an example of literate programming specifically for linguists by using an open dataset in linguistics and reporting a mock analysis. While the emphasis of this tutorial is on creating a simple working example in R Markdown, it is important to note that literate programming can also be applied within R to APA-style manuscripts (see the papaja package) and slideshows (see xaringan), as well as in other tools entirely (Quarto .qmd documents, Python, Jupyter notebooks).

Working example

Here, we talk through an example of literate programming using open linguistics data. In particular, we use the durationsGe dataset from the languageR package. For our example, we will report differences in the duration of the Dutch prefix “ge” by speaker sex.

First, we load our libraries. Both tidyverse and languageR are available on CRAN.
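
The chunk that loads the packages is not reproduced here. A minimal sketch, assuming both packages have already been installed, might look like this:

```r
# Assumes both packages were installed beforehand, e.g. with
# install.packages(c("tidyverse", "languageR")).
library(tidyverse)  # data wrangling and plotting helpers
library(languageR)  # provides the durationsGe dataset

# Quick look at the columns used in this tutorial.
head(durationsGe[, c("Speaker", "Sex", "DurationOfPrefix")])
```

In an R Markdown document, this code would normally sit in a setup chunk near the top of the file so that every later chunk and inline expression can see the data.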

Reporting descriptive statistics

In general, all inline reporting in R Markdown occurs between backticks, i.e., ` `. Specifically, you have to wrap the R code in `r ` to integrate it into your document. For instance, if we want to report the overall mean for the column DurationOfPrefix, we can simply put R code such as mean(durationsGe$DurationOfPrefix) between two backticks, like this:


The mean duration is `r mean(durationsGe$DurationOfPrefix)`. 

This would be rendered as:

The mean duration is 0.1252515.

There are a lot of decimal places here, though! We probably don’t want that, so if we haven’t rounded the data previously, we can do so inline by using the round function:


The mean duration was `r round(mean(durationsGe$DurationOfPrefix), digits = 2)`. 

Now the code is rendered in prose as:

The mean duration was 0.13.

As you can see, inline code can get rather long in a hurry. Another option is to use a code chunk to calculate summary statistics and assign them to objects. Then you can simply refer to those objects in inline code. For instance, we likely also want to report how many participants are in our dataset. Let’s calculate this and report it in prose.

mean_duration  <- round(mean(durationsGe$DurationOfPrefix), digits = 2)
n_participants <- length(unique(durationsGe$Speaker))

There were `r n_participants` participants. 
The mean duration was `r mean_duration`. 

There were 132 participants. The mean duration was 0.13.
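
Because the example is about differences by speaker sex, the same object-based approach can be extended to grouped summaries. The following is a sketch rather than part of the original analysis; the object names by_sex, mean_female, and mean_male are hypothetical and chosen only for illustration.

```r
library(dplyr)      # loaded as part of the tidyverse
library(languageR)  # provides durationsGe

# One row per level of Sex, with the rounded mean duration and the
# number of distinct speakers in that group.
by_sex <- durationsGe %>%
  group_by(Sex) %>%
  summarise(
    mean_duration = round(mean(DurationOfPrefix), digits = 2),
    n_speakers    = n_distinct(Speaker)
  )

# Pull single values out of the table so they can be used inline.
mean_female <- by_sex %>% filter(Sex == "female") %>% pull(mean_duration)
mean_male   <- by_sex %>% filter(Sex == "male") %>% pull(mean_duration)
```

These objects could then be referenced in prose in the same way as before, for example with `r mean_female` and `r mean_male`.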

Reporting results of statistical models

We can also report the output of statistical models and tests. Typically, the results of these tests can be stored in an object in R and then extracted. We will provide an example with a t-test in R. First, we run a t-test to see whether duration varies as a function of speaker sex:

t_test_object <- t.test(DurationOfPrefix ~ Sex, data = durationsGe)
t_test_object

    Welch Two Sample t-test

data:  DurationOfPrefix by Sex
t = -0.1949, df = 413.26, p-value = 0.8456
alternative hypothesis: true difference in means between group female and group male is not equal to 0
95 percent confidence interval:
 -0.009955279  0.008159221
sample estimates:
mean in group female   mean in group male 
           0.1249141            0.1258121 

For a t-test, APA guidelines call for reporting the degrees of freedom, the t-value, and the p-value. All of these are stored in the object we just created, so we can automate the reporting process.

Degrees of Freedom


`r round(t_test_object$parameter, digits = 2)`. 

413.26

The t-value


`r round(t_test_object$statistic, digits = 2)`. 

-0.19

The p-value


`r round(t_test_object$p.value, digits = 2)`. 

0.85

All together


t(`r round(t_test_object$parameter, digits = 2)`) =
`r round(t_test_object$statistic, digits = 2)`, p = 
`r round(t_test_object$p.value, digits = 2)`.   

t(413.26) = -0.19, p = 0.85.
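
If the same kind of test is reported more than once, the three inline expressions can be wrapped in a small function. The sketch below is one possible way to do this and is not part of the original tutorial; the function name report_t_test and its digits argument are hypothetical. It relies only on the parameter, statistic, and p.value elements already used above.

```r
# Hypothetical helper: builds an APA-style string from the object
# returned by t.test(). formatC() keeps trailing zeros (e.g. "0.10"),
# which round() would drop.
report_t_test <- function(test, digits = 2) {
  fmt <- function(x) formatC(unname(x), format = "f", digits = digits)
  paste0(
    "t(", fmt(test$parameter), ") = ", fmt(test$statistic),
    ", p = ", fmt(test$p.value)
  )
}

# Intended use with the object created earlier in this tutorial:
# report_t_test(t_test_object)
```

In the document, a single inline expression such as `r report_t_test(t_test_object)` would then replace the three separate ones. Note that very small p-values would print as 0.00 with this sketch, so a fuller version would need a separate branch for reporting them as, for example, p < .001.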

Citation

BibTeX citation:
@online{parrish2023,
  author = {Parrish, Kyle and Chang, Isabelle},
  title = {Literate Programming},
  date = {2023-02-20},
  url = {https://fosil-project.github.io/literate-programming/index.html},
  langid = {en}
}
For attribution, please cite this work as:
Parrish, Kyle, and Isabelle Chang. 2023. “Literate Programming.” FOSIL. February 20, 2023. https://fosil-project.github.io/literate-programming/index.html.