Improving the reproducibility of MATLAB computing with R, RStudio, and reach

Why run MATLAB from R?

My engineering students use MATLAB for a lot of their computational work, so I have quite a number of labs, tutorials, homework sets, etc. that rely on the MATLAB environment.

With the advent of the new reach package by Christoph Schmidt, I can use R to manage MATLAB commands, functions, scripts, and outputs—and make this work more reproducible.

Getting started

Go to https://github.com/schmidtchristoph/reach for instructions to download reach.

Create a new directory, e.g., test_reach, for the files in this tutorial and assign an RStudio project to this folder.

A simple test case

To get started, I’ll be saving all files to the main project directory.

Create a new R script (File → New File → R Script) called try_reach.R, saved to the project directory.

Add a text file (File → New File → Text File) called test_mfile.m. (We can write and edit m-files in RStudio.)

The project directory has 3 files:

test_reach
    test_mfile.m
    test_reach.Rproj
    try_reach.R

Your computer has to have an active MATLAB license, but you do not have to launch MATLAB.

Make a graph

In RStudio, in the m-file, type these lines of code to create an entertaining graph.

 % test_mfile.m                 
 theta = linspace(0, 2*pi, 500);
 x = cos(7 * theta);            
 y = sin(4 * theta);            
 plot(x, y)                     
 set(gca, 'visible', 'off')

In the try_reach.R file, we only need two lines of R code to run the m-file.

# try_reach.R
library(reach)
runMatlabScript("test_mfile.m")

When you run this script, a MATLAB command window opens, the graph opens and closes, and the command window closes.

Some notes on the runMatlabScript() function:

File names for the m-file appear to be limited to 16 characters.
Keep the .m suffix; otherwise an underscore in the file name can be parsed incorrectly.
The function expects the m-file to be in the current R working directory. I show how to work around this assumption later.
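
The notes above can be turned into a quick check to run before calling runMatlabScript(). This is only a sketch: check_mfile_name() is a hypothetical helper, not part of reach, and it assumes the 16-character limit counts the whole file name, including the .m suffix.

```r
# Hypothetical helper (not part of reach): check an m-file name against
# the notes above before passing it to runMatlabScript().
# Assumption: the 16-character limit applies to the full file name,
# including the .m suffix.
check_mfile_name <- function(mfile, max_chars = 16) {
  problems <- character(0)
  if (!grepl("\\.m$", mfile)) {
    problems <- c(problems, "file name should keep the .m suffix")
  }
  if (nchar(basename(mfile)) > max_chars) {
    problems <- c(problems,
                  paste("file name is longer than", max_chars, "characters"))
  }
  if (!file.exists(mfile)) {
    problems <- c(problems,
                  "file not found in the current R working directory")
  }
  problems
}
```

An empty result from check_mfile_name("test_mfile.m") means none of the three problems was found.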

The MATLAB windows open and close because the runMatlabScript() function generates a sequence of commands like those shown below.

## matlab -nosplash -nodesktop -r 
## "cd C:/path/to/your/Rproject/directory; 
## test_mfile; 
## quit;"

The text inside the quotes is MATLAB syntax to change the MATLAB working directory to the Rproject working directory (cd), run the m-file (test_mfile), and close MATLAB (quit). In reach, MATLAB commands are separated by semicolons and a collection of commands is a single string in double quotes.
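
To make the structure of that call concrete, here is a minimal sketch that assembles a similar string in R without running it. build_matlab_call() is a hypothetical name for illustration; it is not the source code of reach, which may build its call differently.

```r
# Hypothetical sketch (not reach's actual source): assemble a MATLAB
# command-line call like the one shown above, without running it.
build_matlab_call <- function(mfile, wd = getwd()) {
  # MATLAB runs a script by its name without the .m suffix
  script_name <- sub("\\.m$", "", basename(mfile))
  # MATLAB statements are separated by semicolons inside one quoted string
  matlab_cmds <- paste0("cd ", wd, "; ", script_name, "; quit;")
  paste0('matlab -nosplash -nodesktop -r "', matlab_cmds, '"')
}

# Print the assembled call for inspection
cat(build_matlab_call("test_mfile.m",
                      wd = "C:/path/to/your/Rproject/directory"), "\n")
```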

Save the graph to file

To save the graph to file, add a saveas() function at the end of the m-file. In the form shown below, the figure is saved to the Rproject working directory.

 % test_mfile.m                
 saveas(gcf, 'test_figure.png')

The project directory will now have:

test_reach
    test_figure.png
    test_mfile.m
    test_reach.Rproj
    try_reach.R

To preview the graph in R, add these lines to the R file. When you run the R file, the figure should appear in the RStudio Plots pane.

# try_reach.R
# view the figure created by the m-file
library(png)
library(grid)
image <- readPNG('test_figure.png')
grid.raster(image)

Of course, if you are writing a report in .Rnw or .Rmd formats, you can use the usual image import functions,

# importing an image in an .Rnw file
\includegraphics{test_figure.png}

# importing an image in an .Rmd file
![alt text](test_figure.png)

No direct exchange between R and MATLAB — an obstacle to reproducibility

I used the figure file name, test_figure.png, in both the .R and the .m scripts. This is a problem for reproducibility. If I ever change the file name in one script, I must remember to change it in both. For larger projects, relying on memory inevitably introduces errors. For reproducibility, the file name (or any variable for that matter) should be assigned one time only, in one script only.

Unfortunately, to the best of my knowledge, no R package facilitates direct information transfer between R and MATLAB. I have two thoughts on how to manage this problem, one for small projects like this tutorial and one for larger projects.

For small projects

For small projects like this tutorial, I add a chunk of R code at the top of the script to declare variable names that I will use in common in both .R and .m scripts. For example, at the top of try_reach.R, I add:

# try_reach.R
# declare common variables used by .R and .m scripts
fig_filename <- 'test_figure.png'
# end of declarations

and change the readPNG() argument,

image <- readPNG(fig_filename)

Write the same declaration to the top of the test_mfile.m.

 % test_mfile.m                                      
 % declare common variables used by .R and .m scripts
 fig_filename = 'test_figure.png'                    
 % end of declarations

and change the saveas() argument,

 % test_mfile.m           
 saveas(gcf, fig_filename)

For larger projects

The possibility for error grows with the number of declarations and the number of files. For a larger project, I would try creating a new file declarations.txt from which any .R or .m script could read declarations and to which any .R or .m script could write and append new declarations.

I haven’t tried this yet, but the function I have in mind would read the declarations file (or create it if it doesn’t exist), check that the new declaration does not already exist, append the new declaration to the file, and close the file. R scripts would require a version of this function written in R; m-files would require a version written in MATLAB.
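
A rough sketch of what the R version might look like follows. It is untested against a real project and rests on assumptions of mine: the helper names add_declaration() and read_declarations() are hypothetical, and each declaration is stored on its own line as name = 'value', a form that happens to be a valid assignment in both R and MATLAB.

```r
# Hypothetical helper: append a declaration to a shared text file
# unless that name has already been declared.
# Assumption: one declaration per line, written as  name = 'value'
add_declaration <- function(name, value, file = "declarations.txt") {
  # create the file if it does not exist
  if (!file.exists(file)) file.create(file)
  lines <- readLines(file, warn = FALSE)
  # names already declared are the text before the first "="
  declared <- trimws(sub("=.*$", "", lines))
  if (name %in% declared) {
    warning("'", name, "' is already declared in ", file, call. = FALSE)
    return(invisible(FALSE))
  }
  cat(name, " = '", value, "'\n", sep = "", file = file, append = TRUE)
  invisible(TRUE)
}

# Hypothetical helper: read all declarations back as a named character vector
read_declarations <- function(file = "declarations.txt") {
  lines <- readLines(file, warn = FALSE)
  lines <- lines[nzchar(trimws(lines))]
  values <- trimws(sub("^[^=]*=", "", lines))
  values <- gsub("^'|'$", "", values)
  names(values) <- trimws(sub("=.*$", "", lines))
  values
}
```

With these in place, add_declaration("fig_filename", "test_figure.png") would write the declaration once, and a second call with the same name would only raise a warning. A MATLAB counterpart would still have to be written separately.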

Alternatively, one could save R variables in a declarations.Rdata file and MATLAB variables in a declarations.mat file and use reach to translate between the two. The reach package has a convert2Rdata() R function that converts a .mat file to a .Rdata file and a rList2Cell() MATLAB function that converts an R list to a MATLAB cell array. Again, I haven’t tried this approach yet; it would take some development and testing.

File management

Only the smallest of projects are suitable for having all the project files in the main Rproject directory. My typical project directory tree looks like this.

project_name
    common
    data
    design
    nonreproducible
    reports
    visuals
    project_name.Rproj

The common directory is for document elements re-used from project to project, e.g., my own R functions that I use regularly, business logo, LaTeX preambles, bibliography files, reference style documents for rendering R Markdown to MSWord, etc.

The data directory is for data spreadsheets received from collaborators, original data sets from any source, and R scripts to gather, manipulate, and save tidy data, usually in CSV format. My scripts are self-contained so I can run them independently of any other work in the project.

Some data scientists recommend a separate directory for raw data to keep it in pristine condition. Others also recommend a separate directory for scripts that tidy the data. Everyone agrees, however, that you should pick a scheme, any scheme, and use it.

The design directory is for scripts that read the prepared CSV data files, create graphs and tables, and write them to the visuals directory. These R scripts are also self-contained so I can execute them independently while I’m designing and revising a graph. I prefer to call this directory “design” because my primary work is in creating graphs; others prefer “analysis.”

The reports directory is for Rnw or Rmd markup scripts that produce a reproducible report. For those of us not using a make file, this is the master script that invokes all the scripts required to render a report in the desired format, e.g., PDF, HTML, or MSWord.

The visuals directory is the destination for graphics generated by R scripts in the design directory. I also use the visuals directory to save images not generated reproducibly, for example, screen shots or downloaded images.

The nonreproducible directory is for materials I receive from colleagues who do not work reproducibly, that is, who regularly do analysis in Excel, reporting in Word, and presenting in PowerPoint. If any of their work affects my reproducible work, I make the necessary updates and revisions to my scripts, re-run the main report, and send it to my collaborators.

Using reach with relative file paths

I’ll approach this problem as if it were one of my conventional projects with an Rmd report in the reports folder that orchestrates small code chunks in sub-directories for gathering data and creating a graph.

For this example, create 5 folders in the project directory: common, data, design, reports, and visuals.

test_reach
    common
    data
    design
    reports
    visuals
    test_reach.Rproj
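
In keeping with the goal of doing everything from R, the folders can also be created with a few lines of R instead of mouse-clicks. This is a sketch under the assumption that it is run from the project root; create_project_dirs() is a hypothetical helper name.

```r
# Sketch: create the project sub-directories from R.
# Assumption: `root` is the project root (test_reach).
create_project_dirs <- function(root = ".",
                                dirs = c("common", "data", "design",
                                         "reports", "visuals")) {
  for (d in file.path(root, dirs)) {
    # showWarnings = FALSE so folders that already exist are skipped quietly
    dir.create(d, showWarnings = FALSE, recursive = TRUE)
  }
  # report which folders now exist
  invisible(dir.exists(file.path(root, dirs)))
}
```

Calling create_project_dirs() with the default arguments from the project root creates the five folders and leaves any existing ones alone.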

Start an Rmd report and set MATLAB paths

Let’s open an Rmd file (File → New File → R Markdown).

When the untitled file opens, use (File → Save As…) to save the file with the name report.Rmd to the reports directory.
Delete the default text below the YAML header.
Change the title (in the header) if you like. Mine is titled “Testing the reach package”.

The first thing to do is ensure that relative path names work in knitr the same as they do with the R project. Add this code chunk to the Rmd file and compile.

# report.Rmd
# initialize knitr
library(knitr) 
opts_knit$set(root.dir = "../")

Next we want to ensure that MATLAB can find our m-files when saved anywhere in the project directory tree. In MATLAB syntax, this would be accomplished using the cd, genpath(), addpath(), and savepath() functions. The following R function pastes these commands together in a string and uses reach::runMatlabCommand() to run them.

Create a new R script with these lines. Save this script with file name add_to_matlab_path.R to the common directory.

# add_to_matlab_path.R
# add and save sub-directory paths to the MATLAB search path
add_to_matlab_path <- function (...) {
    library(reach)
    add_subfolders_to_path <- paste0("pathstr = [cd];"
                   , "addpath(genpath(pathstr), '-end');"
                   , "savepath;"
                   )
    reach::runMatlabCommand(add_subfolders_to_path)
}

To the Rmd report, add a code chunk to read and run the function script.

# report.Rmd
# read the R script 
source('common/add_to_matlab_path.R')

# set the MATLAB paths 
add_to_matlab_path()

Compile the report. You should see MATLAB open and close, and a report should be generated.

Having run the function once, the MATLAB paths are set and do not have to be reset unless you add new sub-folders to the R project directory. To save run time when compiling the report, add eval = FALSE to the second chunk.

Of course, these paths could be added with mouse-clicks in the MATLAB GUI, but my goal is to run MATLAB entirely from R to enhance the reproducibility of my work flow.

Data files

To illustrate aspects of data management, I start with an R script that subsets the iris data set and saves it to a CSV file. In an .m script, I read the CSV and save the data to the MATLAB native .mat format.

Start a new R script, called gather_iris.R, and save it to the data directory. We’ll keep only numerical values from the data frame to simplify the MATLAB csvread() function.

# gather_iris.R
data(iris)

# create a small csv file
write.csv(iris[ , c("Petal.Length", "Petal.Width")]
                    , file = "data/iris.csv"
                    , row.names = FALSE
                    )

In the Rmd file, source gather_iris.R to write a subset of the iris data to a CSV file.

# report.Rmd
# run the .R script to create a CSV data file
source('data/gather_iris.R')

Write a new m-file gather_iris.m and save it to the data directory. This file reads the CSV file and saves the data as a .mat file. We could have used the CSV file directly, but I want to create a .mat file so I can illustrate how .mat files are used with reach.

 % data/gather_iris.m                                
 % row = 1 to omit the CSV header row                
 iris = csvread('iris.csv', 1, 0);                   

 % create 2 variables to put in the .mat file        
 petal_length = iris(:, 1); % all rows, first column 
 petal_width  = iris(:, 2); % all rows, second column

 % write two variables to file, include the path     
 save('data/iris.mat', 'petal_length', 'petal_width')

From the Rmd script, run the m-file to create the .mat data file.

# report.Rmd
# run the .m script to create a .mat data file
runMatlabScript("gather_iris.m")

Check your file structure.

test_reach
    common
        add_to_matlab_path.R
    data
        gather_iris.m
        gather_iris.R
        iris.csv
        iris.mat
    design
    reports
        report.Rmd
    visuals 
    test_reach.Rproj
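
The same check can be done from R rather than by eye. The sketch below assumes it is run from the project root; missing_files() is a hypothetical helper name.

```r
# Sketch: report which expected project files are missing.
# Assumption: run from the project root.
expected_files <- c("common/add_to_matlab_path.R",
                    "data/gather_iris.m",
                    "data/gather_iris.R",
                    "data/iris.csv",
                    "data/iris.mat",
                    "reports/report.Rmd")

missing_files <- function(expected, root = ".") {
  # list every file below root, with paths relative to root
  found <- list.files(root, recursive = TRUE)
  setdiff(expected, found)
}

# An empty result means every expected file is in place
missing_files(expected_files)
```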

Graph files

Open a new text file, save it as graph_iris.m in the design directory.

To improve reproducibility, I want to declare common variables used by .R and .m scripts. In this example, the only common variable will be the file name of the graph image we’re about to create.

At the top of the graph_iris.m file, add this declaration. The first variable is the figure file name; the second variable includes the path to the file name.

 % design/graph_iris.m                               
 % declare common variables used by .R and .m scripts
 fig01_file      = 'iris_petal.png';                 
 fig01_file_path = ['visuals/', fig01_file];         
 % end of declarations

Similarly, at the top of the .Rmd report, add a code chunk:

# report.Rmd
# declare common variables used by .R and .m scripts
fig01_file      <- 'iris_petal.png'
fig01_file_path <- paste0('visuals/', fig01_file)
# end of declarations

The rest of the .m script reads the .mat data file, creates a scatter plot, and saves the figure as a PNG image in the visuals directory.

 % design/graph_iris.m
 % read the .mat data file
 load('iris.mat')

 % create a graph   
 plot(petal_length, petal_width, 'bo',...
     'markersize', 3, 'markerfacecolor', 'b')
 xlabel('Petal length (cm)')
 ylabel('Petal width (cm)')
 axis([0 8 0 3])

 % write the image to file, include the path
 saveas(gcf, fig01_file_path)

In the Rmd file, add a code chunk to run the m-file.

# report.Rmd
# read the .mat file and create and save a graph
runMatlabScript('graph_iris.m')

Lastly, import the graph to the report. The simplest approach is to use the R Markdown syntax, but with one path wrinkle. R Markdown image syntax resolves paths relative to the directory in which the Rmd file resides, in this case, reports. Therefore the path has to include the relative path “up” (../) then “down” to the visuals directory.

Write a code chunk to add the relative path prefix (../) to the image filename.

fig01_file_path_Rmd <- paste0('../', fig01_file_path)

And write the image import markup:

![alt text](`r fig01_file_path_Rmd`)
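
Because the code chunks run from the project root while the image markup is resolved from the reports directory, it can help to wrap the prefixing step in a small function that also warns when the image is missing. This is a sketch only; rmd_image_path() is a hypothetical helper, and it assumes the chunk's working directory is the project root (as set earlier with root.dir).

```r
# Hypothetical helper: turn a path relative to the project root into a
# path relative to the folder that holds the Rmd file.
# Assumption: the code chunk runs with the project root as working directory.
rmd_image_path <- function(path_from_root, levels_up = 1) {
  if (!file.exists(path_from_root)) {
    warning("image not found: ", path_from_root, call. = FALSE)
  }
  # prepend one ".." for each level between the Rmd folder and the root
  do.call(file.path, as.list(c(rep("..", levels_up), path_from_root)))
}
```

For this project, rmd_image_path("visuals/iris_petal.png") yields the same string as the paste0() call above, with a warning if MATLAB has not yet written the figure.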

The directory structure should look like this.

test_reach
    common
        add_to_matlab_path.R
    data
        gather_iris.m
        gather_iris.R
        iris.csv
        iris.mat
    design
        graph_iris.m
    reports
        report.Rmd
    visuals
        iris_petal.png
    test_reach.Rproj

Acknowledgments

Many thanks to Christoph Schmidt for the reach package, to Henrik Bengtsson for the R.matlab package, and to the folks at Revolutions for bringing the packages to our attention.
