Compiled by: Ding Yi, Huang Nian, Ding Xue
Proofread by: Xi Xiongfen, Yao Jialing
Program Validation: Guo Shuyu
◆ ◆ ◆
Introduction
Why call R from Python or Python from R, and why “and” instead of “or”?
Among the top ten search results for “R Python” articles on the internet, only two discuss the advantages of using R and Python together, rather than viewing these two languages in opposition. This is understandable: both languages have very distinct advantages and disadvantages from the outset. Historically, the separation arose from educational backgrounds: statisticians tended to use R, while programmers opted for Python. However, as the number of data scientists has increased, this distinction has begun to blur:
A data scientist is a person who is better at statistics than any software engineer and better at software engineering than any statistician. (Josh Wills, on Twitter)
Due to the unique libraries offered by both languages, the demand for data scientists who can leverage the relative strengths of both languages is continuously growing.
◆ ◆ ◆
Comparison of Python and R
Python has advantages over R in the following areas:
-
Web scraping and data extraction: While R’s rvest has simplified web scraping, Python’s BeautifulSoup and Scrapy are more mature and offer more features.
-
Database connections: Although R has many options for connecting to databases, Python’s SQLAlchemy provides all database connection functionalities with just one package and is widely used in production environments.
Conversely, R has advantages over Python in the following areas:
-
Statistical analysis options: Although Python’s SciPy, Pandas, and statsmodels provide a substantial set of statistical analysis tools, R is specifically designed around statistical analysis applications, thus offering more relevant tools.
-
Interactive graphics or dashboards: Bokeh, Plotly, and Intuitics have recently extended Python’s graphics capabilities to web browsers, but getting a dashboard up and running with R’s Shiny is often faster and requires less code.
Moreover, as data science teams now possess a broader skill set, the choice of programming language for an application may simply reflect the team’s prior knowledge and experience. For some applications, especially prototyping and development, using familiar tools can speed up the process.
Plain Text “Air Gap” Strategy
An “air gap” normally refers to exchanging information across a complete physical disconnection from the network. In this article, it refers to using plain text files to share data between the two languages. (Translator’s note)
To use plain text files as an air gap between the two languages, follow these steps.
-
Refactor your R and Python scripts so that they can be run from the command line and accept command line arguments.
-
Output shared data to a common file format.
-
Execute one language from within the other, passing parameters as required.
Advantages:
-
The simplest method, and therefore usually the quickest to implement
-
Easy to view intermediate output results
-
Common file formats such as CSV, JSON, and YAML have existing parsers
Disadvantages:
-
A common schema or file format needs to be agreed upon in advance
-
If the process becomes lengthy, managing intermediate output results and paths can be difficult
-
If the data volume increases, local disk read/write will become a bottleneck
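The three steps of the strategy can be sketched roughly as follows. This is only an illustration under stated assumptions: the child script, the file names, and the function name run_pipeline are hypothetical, and the child is a small Python script launched with sys.executable so that the sketch is self-contained. In a real setup, the command list would start with Rscript and the path to an R script instead.

```python
import csv
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical child script. It stands in for an R script run with Rscript:
# it accepts an input CSV path and an output JSON path as command line arguments.
CHILD_SOURCE = """
import csv, json, sys
in_path, out_path = sys.argv[1], sys.argv[2]
with open(in_path, newline="") as f:
    values = [float(row["value"]) for row in csv.DictReader(f)]
with open(out_path, "w") as f:
    json.dump({"n": len(values), "max": max(values)}, f)
"""


def run_pipeline(values):
    """Share data with a child process through plain text files."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        script = tmp / "child.py"
        in_path = tmp / "input.csv"
        out_path = tmp / "output.json"
        # Step 1 (stand-in): a script that runs from the command line.
        script.write_text(CHILD_SOURCE)

        # Step 2: write the shared data in a common file format (CSV).
        with open(in_path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["value"])
            writer.writerows([v] for v in values)

        # Step 3: execute the other script, passing the file paths as arguments.
        subprocess.run(
            [sys.executable, str(script), str(in_path), str(out_path)],
            check=True,
        )

        # Read the intermediate result back into the parent process.
        with open(out_path) as f:
            return json.load(f)


result = run_pipeline([11, 3, 9, 42])
print(result)
```

The intermediate files live in a temporary directory here; a longer pipeline would need an agreed location and naming scheme, which is exactly the management overhead listed among the disadvantages.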
Command Line Scripts
Running R and Python scripts through the command line in Windows or Linux terminal environments is similar. The command to run is broken down into the following parts:
<command_to_run> <path_to_script> <any_additional_arguments>
Where
-
<command_to_run> is the executable command (Rscript for R code, python for Python code)
-
<path_to_script> is the full or relative file path of the script to be executed. Note that if there are spaces in the path name, the entire file path must be enclosed in double quotes.
-
<any_additional_arguments> is a space-separated list of arguments passed to the script itself. Note that these are always passed in as strings.
For example, to open a terminal environment and run an R script, the command is as follows:
Rscript path/to/myscript.R arg1 arg2 arg3
Note the following issues:
-
For the Rscript and python commands to be found, their executables must already be on your system path; otherwise, you need to provide the full path to the executable.
-
Path names containing spaces can cause issues, especially in Windows systems, so they must be enclosed in double quotes to be considered a single file path.
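As a small illustration of why the quotes matter, the sketch below uses Python’s shlex module to split a command line into its parts. The command line itself is hypothetical, and shlex follows POSIX shell quoting rules, so this approximates rather than reproduces what the Windows command prompt does.

```python
import shlex

# Hypothetical command line; the quoted script path contains spaces.
quoted = 'Rscript "path/to my script/myscript.R" arg1 arg2 arg3'
unquoted = 'Rscript path/to my script/myscript.R arg1 arg2 arg3'

# With quotes, the path stays together as a single element.
command_to_run, path_to_script, *additional_arguments = shlex.split(quoted)
print(command_to_run)
print(path_to_script)
print(additional_arguments)

# Without quotes, the path is broken into three separate pieces.
print(shlex.split(unquoted))
```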
Accessing Command Line Arguments in R
In the example above, arg1, arg2, and arg3 are the arguments passed to the R script being executed; they can be accessed using the commandArgs function.
# myscript.R
# Get command line arguments
myArgs <- commandArgs(trailingOnly = TRUE)
# myArgs is a character vector of all arguments
print(myArgs)
print(class(myArgs))
By setting trailingOnly to TRUE, the myArgs vector contains only the arguments that you added on the command line. With the default, FALSE, the myArgs vector also includes other arguments, such as the path of the script that was just executed.
Accessing Command Line Arguments in Python
Execute the Python script via the command line as follows:
python path/to/myscript.py arg1 arg2 arg3
Access the arg1, arg2, and arg3 arguments by importing the sys module in the Python script. The sys module contains system-specific parameters and functions; here we are only interested in the argv attribute, which is a list of all arguments passed to the currently executing script. The first element in the list is the script itself, as it was named on the command line (whether this is a full path depends on how the script was invoked and on the operating system).
# myscript.py
import sys
# Get command line arguments
my_args = sys.argv
# my_args is a list where the first element is the executed script
print(type(my_args))
print(my_args)
If you only want to keep the parameters passed to the script, you can use list slicing to select all parameters except the first element.
# Using slicing, select all elements except the first
my_args = sys.argv[1:]
As in the R example above, all arguments are passed in as strings, so it is necessary to convert them to the desired data type.
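One possible way to do that conversion in Python is sketched below. The helper name convert_argument and the sample values are made up for illustration, and the list raw_args simulates what sys.argv[1:] would contain.

```python
def convert_argument(text):
    """Convert a command line string to int or float where possible."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text  # leave anything non-numeric as a string


# Simulated value of sys.argv[1:]; a real script would use sys.argv[1:] directly.
raw_args = ["11", "3.5", "label"]
typed_args = [convert_argument(a) for a in raw_args]
print(typed_args)
```

For scripts with more than a handful of arguments, the standard library’s argparse module can declare the expected type of each argument instead.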
Writing Output Results to Files
There are several options for sharing data between R and Python through intermediate files. Generally, for plain text files, CSVs are a good tabular data format, while JSON or YAML are the best data formats for handling variable-length fields or many nested data structures (or metadata).
These are common data serialization formats, and corresponding syntax parsers exist in both R and Python.
In R, the following packages are recommended:
-
For CSV files, use readr
-
For JSON files, use jsonlite
-
For YAML files, use yaml
In Python, the following are recommended:
-
For CSV files, use csv
-
For JSON files, use json
-
For YAML files, use PyYAML
The csv and json modules are part of Python’s standard library, so they are available without installation, while PyYAML has to be installed separately. All of the R packages listed need to be installed.
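A brief sketch of the Python side using only the standard library modules follows. The records and field names are invented for the example, and an in-memory buffer stands in for a file on disk so that the snippet runs on its own.

```python
import csv
import io
import json

# Hypothetical records to share with an R script.
records = [
    {"id": 1, "score": 0.5},
    {"id": 2, "score": 0.75},
]

# Tabular data: CSV. The StringIO buffer stands in for an open file.
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=["id", "score"])
writer.writeheader()
writer.writerows(records)
csv_text = csv_buffer.getvalue()

# Nested data or metadata: JSON.
json_text = json.dumps({"source": "example", "records": records})

# Reading back: CSV values come back as strings, JSON keeps the numbers.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))
json_data = json.loads(json_text)
print(csv_rows[0]["score"], type(csv_rows[0]["score"]))
print(json_data["records"][0]["score"], type(json_data["records"][0]["score"]))
```

The difference in the types that come back is one reason a schema needs to be agreed on in advance when CSV is the exchange format.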
◆ ◆ ◆
Conclusion
Passing data between R and Python can be achieved in a single pipeline by:
-
Passing parameters via the command line
-
Transferring data using common structured text files
However, in some cases, having to store text files locally as intermediate files is not only cumbersome but also hurts performance. Next, we will discuss how R and Python can call each other directly and pass output in memory.
Command Line Execution and Subprocess Execution
To better understand what happens during subprocess execution, it is worth looking more closely at what happens when a Python or R process is run from the command line. When a command such as python path/to/myscript.py arg1 arg2 arg3 is executed, a new Python process is started to execute the script.
During execution, any data written to the standard output and standard error streams is displayed in the console. This is most commonly done with Python’s built-in print() function or R’s cat() and print() functions, which write a given string to the standard output stream. Once the script has finished, the Python process exits.
Running command line scripts in this way is useful, but it becomes cumbersome and error-prone when several independent scripts have to be run one after another. However, a Python or R process can itself execute another process directly, in much the same way as the command above. The benefit is that a Python parent process can start an R subprocess to run a specific script for part of the analysis. Once the R script finishes, the output of the R subprocess is not sent to the console but returned to the parent process. This removes the need to run each command line manually.
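The parent and subprocess relationship described above can be sketched as follows. To keep the snippet runnable anywhere, the child is a one-line Python program started with sys.executable; this is a stand-in chosen for the example, and in practice the command list could begin with Rscript and a script path.

```python
import subprocess
import sys

# Stand-in child command: a one-line Python program that writes to both streams.
child_cmd = [
    sys.executable,
    "-c",
    "import sys; print('to stdout'); print('to stderr', file=sys.stderr)",
]

# capture_output=True returns both streams to the parent instead of the console;
# text=True decodes them to str.
completed = subprocess.run(child_cmd, capture_output=True, text=True)

print("exit status:", completed.returncode)
print("captured stdout:", completed.stdout.strip())
print("captured stderr:", completed.stderr.strip())
```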
Examples
To illustrate how the execution of one process is triggered by another, we will use two simple examples: one where Python calls R and another where R calls Python. We have intentionally kept the analysis in each case trivial, so that the focus stays on the mechanics of the process.
R Script Example
Our simple R script example will take a series of numbers from the command line and return the maximum value.
# max.R
# Get command line arguments
myArgs <- commandArgs(trailingOnly = TRUE)
# Convert to numeric type
nums = as.numeric(myArgs)
# cat will write the result to the standard output stream
cat(max(nums))
Executing R Script in Python
We need to use the subprocess module, which is part of the standard library, to call from Python. We will use the check_output function to call the R script, execute the command, and store the result of the standard output.
To call R to execute the max.R script in Python, we first need to establish the command to run. In Python, this is represented as a list of strings, with the corresponding elements as follows:
['<command_to_run>', '<path_to_script>', 'arg1', 'arg2', 'arg3', 'arg4']
The following code is an example of running R from Python:
# run_max.py
import subprocess
# Define command and parameters
command = 'Rscript'
path2script = 'path/to your script/max.R'
# args variable is a list
args = ['11', '3', '9', '42']
# Establish subprocess command
cmd = [command, path2script] + args
# check_output will execute the command and store the result
x = subprocess.check_output(cmd, universal_newlines=True)
print('The maximum of the numbers is:', x)
The parameter universal_newlines=True tells Python to interpret the returned output as a text string and handle line breaks for Windows and Linux. If omitted, the output will be returned as a byte string, and any further string operations must call x.decode() to decode it into text.
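The difference between the two return types can be seen in the short sketch below. The child command is a stand-in (a one-line Python program run via sys.executable) that prints a single number, much as max.R does with cat().

```python
import subprocess
import sys

# Stand-in child process that prints a single number.
cmd = [sys.executable, "-c", "print(42)"]

raw = subprocess.check_output(cmd)                            # bytes
text = subprocess.check_output(cmd, universal_newlines=True)  # str

print(type(raw), type(text))

# Without universal_newlines, decode the bytes before using them as text.
decoded = raw.decode().strip()

# Either way the output is text, so convert it before doing arithmetic.
maximum = float(text)
print(decoded, maximum + 1)
```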
Python Script Example
In our simple Python script, we will split the given string (the first parameter) into multiple substrings based on the provided string pattern (the second parameter). The result will then be output to the console, one substring per line.
# splitstr.py
import sys
# Get incoming parameters
string = sys.argv[1]
pattern = sys.argv[2]
# Execute split
ans = string.split(pattern)
# Join the resulting list of elements into a newline-separated string
# and print it
print('\n'.join(ans))
Calling Python from R
When executing a subprocess in R, it is recommended to use R’s system2 function to execute the command and capture its output. This is because the built-in system function is harder to use and does not behave consistently across platforms.
Establishing the command to execute is similar to the Python example above; however, system2 expects the command to be supplied separately from its arguments. Additionally, the first of these arguments must always be the path of the script being executed.
The last difficulty may arise from handling spaces in the path name of the script. The simplest way to resolve this issue is to enclose the full path name in double quotes and then wrap this string in single quotes, so that R retains the double quotes around the argument itself.
The following code provides an example of executing a Python script in R.
# run_splitstr.R
command = "python"
# Note the single and double quotes in the string (this is necessary if there are spaces in the path name)
path2script = '"path/to your script/splitstr.py"'
# Set args as a vector
string = "3523462---12413415---4577678---7967956---5456439"
pattern = "---"
args = c(string, pattern)
# Add the script path as the first arg parameter
allArgs = c(path2script, args)
output = system2(command, args=allArgs, stdout=TRUE)
print(paste("The Substrings are:\n", output))
To capture standard output as a character vector (one element per line), stdout=TRUE must be explicitly specified in system2; otherwise, only the exit status is returned. When stdout=TRUE, the exit status is stored in an attribute named “status”.
Conclusion
Through subprocess calls, Python and R can be integrated into a single application. This allows a parent process to call another process as a subprocess and capture any output to standard output.