Comparing `read_csv()` with `spark_read_csv()`

Reading a csv file into R using readr’s `read_csv()` function is simple. The syntax & parameters of readr are fairly easy to remember, once you’ve done it a few times.

read_csv(file, 
    col_names = TRUE, 
    col_types = NULL,
    locale = default_locale(),
    na = c("", "NA"), 
    quoted_na = TRUE,
    quote = "\"", 
    comment = "", 
    trim_ws = TRUE, 
    skip = 0, n_max = Inf,
    guess_max = min(1000, n_max), 
    progress = show_progress()
)
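
To make this concrete, here’s a minimal sketch of a typical call. The file path below is hypothetical, just for illustration; in practice most of the defaults above can be left alone.

library(readr)

# Read a hypothetical csv file; empty strings & "NA" become missing values
flights <- read_csv("data/flights.csv",
    col_names = TRUE,   # first row contains the column names
    na = c("", "NA"),
    skip = 0            # no leading lines to discard
)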

I’ve only just started working with big data sets, & began wondering whether what I know about the readr syntax carries over to sparklyr’s `spark_read_csv()` function.

The two aren’t exactly the same, but if you know one, you can quite easily pick up the other. There’s one additional required parameter, `sc`, the Spark connection (see the sketch after the signature below).

spark_read_csv(
    sc, 
    name,
    path, 
    header = TRUE, # FALSE forces a "V_" prefix
    columns = NULL,
    infer_schema = TRUE, # to infer column data type
    delimiter = ",", 
    quote = "\"", 
    escape = "\\",
    charset = "UTF-8", 
    null_value = NULL,
    options = list(),
    repartition = 0, # number of partitions to distribute the generated table.
    memory = TRUE, 
    overwrite = TRUE, ...
)
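
To see the parallel, here’s a minimal sketch of the sparklyr equivalent, assuming a local Spark connection & the same hypothetical file path & table name as above.

library(sparklyr)

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Read the same hypothetical csv file into Spark
flights_tbl <- spark_read_csv(sc,
    name = "flights",           # table name to register in Spark
    path = "data/flights.csv",
    header = TRUE,              # plays the role of read_csv's col_names
    infer_schema = TRUE,        # let Spark guess the column types
    memory = TRUE               # cache the table in memory
)

# Disconnect when done
spark_disconnect(sc)

The result is a tbl_spark, which you can then query with familiar dplyr verbs, so the syntax you already know really does carry over once the data is loaded.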
