Reading Data Into R Mac Vs Pc

Importing Data into R

A tutorial about data analysis using R

Dr Jon Yearsley (School of Biology and Environmental Science, UCD)

  • Objectives
  • Organise yourself!
  • Data Workflow
  • Format your data (tidy data)
  • Data frames
  • Importing spreadsheet data
  • Summary of the topics covered
  • Further Reading

How to Read this Tutorial

This tutorial is a mixture of R code chunks and explanations of the code. The R code chunks will appear in boxes.

Below is an example of a chunk of R code:

    # This is a chunk of R code. All text after a # symbol is a comment

    # Set working directory using setwd() function
    setwd('Enter the path to my working directory')

    # Clear all variables in R's memory
    rm(list=ls())   # Standard code to clear R's memory

Sometimes the output from running this R code will be displayed after the chunk of code. R output will be preceded by ##.

Here is a chunk of code followed by the R output

    2 + 4   # Use R to add two numbers

    ## [1] 6

Objectives

The objectives of this tutorial are:

  1. Demonstrate good practice in data organisation
  2. Introduce plain text file formats for data
  3. Explain data import into R

Organise yourself!

Before you start importing data into R you should take time to organise your workspace on your computer:

  • Create a folder on your computer to contain all your work for this particular project (e.g. a folder called DataModule)
  • Inside this project folder create another folder called data. This will hold all the raw data files. These raw data files should not be changed.
  • Inside this project folder create a text file called MyFirstScript.R. You can use RStudio for this (use the File->New File->R Script menu option) or any basic text editor (e.g. Notepad, TextEdit, gedit, emacs). This file will be your R script that will contain all the commands for R. The .r or .R suffix is the standard suffix for an R script.
  • If you are starting a big project consider creating separate folders for: R scripts, figures, output from the R script
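The folder layout described above can also be created from within R. The following is a minimal sketch using base R's dir.create() and file.create(). It builds the project inside a temporary directory (an assumption made so the example is safe to run anywhere), and the sub-folder names scripts, figures and output are illustrative only.

```r
# Sketch: create the project folders from R (base R only).
# tempdir() is used so the example does not touch your real files;
# replace it with the location where you keep your work.
project_dir <- file.path(tempdir(), "DataModule")

# Project folder and the folder that holds the raw data files
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Optional folders for a bigger project (names are illustrative)
for (folder in c("scripts", "figures", "output")) {
  dir.create(file.path(project_dir, folder), showWarnings = FALSE)
}

# An empty R script, ready to be edited
script_file <- file.path(project_dir, "MyFirstScript.R")
if (!file.exists(script_file)) file.create(script_file)

# List everything that was created
list.files(project_dir)
```

Creating the folders by hand in your file manager works just as well; the point is that the layout exists before any data are imported.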

Your first R script

Now you have created the file MyFirstScript.R you should put some header text at the start of the file to explain what the R script will do. This was described in tutorial 1.

Video Tutorial: Creating a new R script with RStudio (1 min)

The text should have a short explanation of the R script followed by your name and the date you wrote the R script. Each line should start with a # so that the text is not interpreted by R (this text is for humans, so they understand what the file is intended to do). Here is an example:

    # ********** Start of header **************
    # Title: <The title of your R script>
    #
    # Add a short description of the R script here.
    #
    # Author: <your name>  (email address)
    # Date: <today's date>
    #
    # *********** End of header ****************

    # Two common commands at the start of an R script are:
    rm(list=ls())         # Clear R's memory
    setwd('~/DataModule') # Set the working directory

    # Replace '~/DataModule' with the name of your own directory

    # ******************************************
    # Write your commands below.
    # Remember to use comments to explain your commands

Writing clear R scripts

An R script isn't only telling the computer how to perform calculations on your data. It is also explaining your working to other human beings.

"Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do." – Donald E. Knuth

To make your R scripts usable by humans they must be clearly commented (using the # symbol to start a comment) and clearly organised.

As you write an R script consider these questions:

  • Does your R script look well organised (e.g. is it well spaced, are lines indented logically)?
  • Could someone else read the R script and understand the basic idea?
  • Could someone else modify your R script relatively easily?
  • In a couple of months' time could you quickly read and edit your own R script?

Professional data analysts take clarity very seriously. Here are some links to R coding style guides:

  1. Google's style guide, https://google.github.io/styleguide/Rguide.xml
  2. Hadley Wickham's style guide, http://adv-r.had.co.nz/Style.html
  3. http://www.stat.ubc.ca/~jenny/STAT545A/block19_codeFormattingOrganization.html
  4. http://nicercode.github.io/blog/2013-04-05-why-nice-code/

Data Workflow

Below is a schematic of the workflow for handling data.

Figure: The workflow to follow when handling data.

In this tutorial we will consider formatting data, in the next tutorial we'll discuss importing data, and then we'll start to consider exploring the data using graphics and numerical summaries.

Format your data (tidy data)

The workflow starts long before you analyse your data. It starts even before you have your data in some computer software.

Organising your data should follow tidy data guidelines (see below) and be planned before you collect your data. The format of the data should be finalised before importing the data into R. It is often easiest to tidy your data using a spreadsheet program before you import the data into R.

Well organised data from the start will make your life a lot easier and your data import as painless as possible.

Six guidelines for tidy data

When tidying your data you should ensure that:

  1. each variable has its own column
  2. each row is an observation
  3. the top of each column contains the name of the variable
  4. there are no blank columns or blank rows between data
  5. all data in a column has the same type (e.g. it is all numerical data, or it is all text data)
  6. data are consistent (e.g. if a binary variable can take values 'Yes' or 'No' then only these two values are allowed, with no alternatives such as 'Y' and 'N')
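Guidelines 5 and 6 can also be checked once data are in R. The sketch below uses a small, made-up data frame (the name survey and its values are hypothetical, not from any data set in this tutorial) to show one way of spotting and repairing inconsistent entries with base R. The raw data file itself is left unchanged; only the copy held in R is recoded.

```r
# Hypothetical data with an inconsistent binary column (made-up values)
survey <- data.frame(
  id       = 1:5,
  infected = c("Yes", "No", "Y", "No", "N"),
  stringsAsFactors = FALSE
)

# Guideline 6: tabulate the distinct values to spot inconsistencies
table(survey$infected)

# Recode the alternatives so that only 'Yes' and 'No' remain
survey$infected[survey$infected == "Y"] <- "Yes"
survey$infected[survey$infected == "N"] <- "No"
table(survey$infected)

# Guideline 5: check the type of each column
sapply(survey, class)
```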

PDF Summary: This PDF document reiterates the concept of tidy data

The link to the PDF is: http://www.ucd.ie/ecomodel/pdf/TidyData.pdf

Poorly vs well formatted data

The data set shown in the figure below is an example of poorly formatted data. The data set contains information on the lead concentrations (ppm) from three species of fish (whitefish, sucker and trout). Two types of sample were collected: samples from fillets of fish and from whole fish. The data has three variables: lead concentration, species of fish and type of fish sample.

Figure: A poorly formatted data set. This file would be hard to import and analyse in this format.

How would you improve the format of the poorly formatted data shown in the figure? (Hint: use the six guidelines above)

The second figure shows some well formatted data that follows the tidy data guidelines: each column represents a single variable and each row an observation.

Figure: A well formatted data set. This file would be easy to import and analyse in this format. One column contains the data for one variable. These data are the worldwide occurrences of Covid-19, downloaded from the European Centre for Disease Prevention and Control, https://www.ecdc.europa.eu/en
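Returning to the lead concentration example, a common problem is a layout with one column per species rather than one column per variable. The sketch below assumes such a layout (the actual arrangement in the figure may differ, and the numbers are made up) and shows one possible way to reshape it into tidy form using base R's stack() function.

```r
# Hypothetical 'wide' layout: one column per species (values are made up)
wide <- data.frame(
  sample_type = c("fillet", "whole"),
  whitefish   = c(0.5, 0.7),
  sucker      = c(1.1, 0.9),
  trout       = c(0.3, 0.4)
)

# Stack the species columns into a single column of values
species_cols <- c("whitefish", "sucker", "trout")
long <- stack(wide[, species_cols])
names(long) <- c("lead_ppm", "species")

# Each species contributes one row per sample type, in the original order
long$sample_type <- rep(wide$sample_type, times = length(species_cols))

# One row per observation, one column per variable
long
```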

Data frames

A data frame is R's name for spreadsheet data (e.g. data organised in a grid, like Excel). R stores the vast majority of data as a data frame and uses data frames when analysing data.

A data frame forces the data to be well organised.

  • Each column is a variable. The name of this variable becomes the name of the column.
  • Each row corresponds to an observation. This means that values in the same row are data collected about the same object. Rows can also have names.
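To make the row and column structure concrete, here is a small sketch that builds a data frame by hand. The data frame plants and its values are invented purely for illustration.

```r
# A small data frame built by hand (made-up values)
plants <- data.frame(
  species  = c("oak", "ash", "birch"),
  height_m = c(12.5, 9.8, 7.2)
)

names(plants)   # the column (variable) names
nrow(plants)    # the number of rows (observations)

# Rows can be given names too
rownames(plants) <- c("tree1", "tree2", "tree3")

plants["tree2", ]   # all the data collected about one object
plants$height_m     # all the values of one variable
```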

Below is an example of a data frame (called airquality) that contains data on the air quality in New York from May - September 1973 (this is a data set that is built in to R).

    # The airquality data is a built-in dataset

    # First 10 rows of the airquality data frame
    head(airquality, n=10)

    ##    Ozone Solar.R Wind Temp Month Day
    ## 1     41     190  7.4   67     5   1
    ## 2     36     118  8.0   72     5   2
    ## 3     12     149 12.6   74     5   3
    ## 4     18     313 11.5   62     5   4
    ## 5     NA      NA 14.3   56     5   5
    ## 6     28      NA 14.9   66     5   6
    ## 7     23     299  8.6   65     5   7
    ## 8     19      99 13.8   59     5   8
    ## 9      8      19 20.1   61     5   9
    ## 10    NA     194  8.6   69     5  10

You can type ?airquality to display the help file for this data set. The data frame has 153 rows (observations) and 6 columns (variables measured). The 6 columns contain data on: ozone concentrations (parts per billion), solar radiation, wind speed, air temperature, month and day of observation. You can see that each column has a name corresponding to the data for that column.

The structure of the data frame can be viewed using the str() function

    # Display the structure of the airquality data frame
    str(airquality)

    ## 'data.frame':    153 obs. of  6 variables:
    ##  $ Ozone  : int  41 36 12 18 NA 28 23 19 8 NA ...
    ##  $ Solar.R: int  190 118 149 313 NA NA 299 99 19 194 ...
    ##  $ Wind   : num  7.4 8 12.6 11.5 14.3 14.9 8.6 13.8 20.1 8.6 ...
    ##  $ Temp   : int  67 72 74 62 56 66 65 59 61 69 ...
    ##  $ Month  : int  5 5 5 5 5 5 5 5 5 5 ...
    ##  $ Day    : int  1 2 3 4 5 6 7 8 9 10 ...

The str() function shows that this is a data frame with 153 observations (rows) and six variables (columns). It also shows the data types of the variables: Wind is a numerical variable (i.e. continuous) and the other variables are all integers (i.e. whole numbers).
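The same information can be extracted piece by piece, which is handy when you want to use it later in a script. This short sketch uses only base R functions on the built-in airquality data; the variable names air_dim, air_types and air_missing are our own.

```r
# Number of rows and columns
air_dim <- dim(airquality)
air_dim

# Data type of each column
air_types <- sapply(airquality, class)
air_types

# Number of missing values (NA) in each column
air_missing <- colSums(is.na(airquality))
air_missing
```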

Tidy data in R is described in more detail on this web page: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html

Tibbles

A recent development (circa 2016) is an improved data frame called a tibble. We will not discuss these new data frame objects here, but you can read about them at https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html.

Don't Panic! Tibbles are very similar to data frames.

The important point to know is that if you use RStudio's GUI to import data then your data will be stored in a tibble, not a data frame.
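If you do end up with a tibble and would rather work with an ordinary data frame, the conversion is a single command. The sketch below assumes the tibble package is installed (the code is skipped if it is not) and uses the built-in airquality data as a stand-in for imported data.

```r
# Only run if the tibble package is available
if (requireNamespace("tibble", quietly = TRUE)) {
  # Turn a data frame into a tibble (stand-in for data imported via RStudio)
  air_tbl <- tibble::as_tibble(airquality)
  print(class(air_tbl))          # includes "tbl_df" as well as "data.frame"
  print(is.data.frame(air_tbl))  # a tibble is still a kind of data frame

  # Convert the tibble back to an ordinary data frame
  air_df <- as.data.frame(air_tbl)
  print(class(air_df))
}
```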

Importing spreadsheet data

To start working with data in R you need to import your data into R. You are aiming to have a data frame that contains your data.

The simplest way to import data into R is from a text file (https://en.wikipedia.org/wiki/Text_file). Text files (sometimes called flat files) can be read by any computer operating system and by many different statistical programs. Saving data as a simple text file makes your data highly transportable.

Importing data from software specific formats (e.g. Excel's .XLSX format, Minitab's .MTW format, SPSS's .SAV format or SAS's .SAS format) is possible (e.g. using RStudio's Import Dataset GUI). If you want your data to be easily shared with other people then use a text file to store your data.

We advise you to:

  • save your data as a text file (software such as Excel often has an option to save data as plain text)
  • organise data with columns corresponding to different variables before exporting to the text file
  • use a visible text character to delimit each column (usually a comma or semi-colon). Using an invisible character (e.g. a space or a TAB) is not recommended because these characters all look the same at first glance.
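The advice above can be tried out entirely within R. As a rough sketch, the code below writes the built-in airquality data to a comma delimited text file in a temporary directory (the file name airquality_example.csv is made up), looks at the first few raw lines, and reads the file back in.

```r
# Write a data frame to a comma delimited text file
csv_file <- file.path(tempdir(), "airquality_example.csv")
write.csv(airquality, file = csv_file, row.names = FALSE)

# Look at the first three lines of the raw text file
readLines(csv_file, n = 3)

# Read the file back in and compare with the original data
air_check <- read.table(csv_file, header = TRUE, sep = ",")
all.equal(airquality, air_check)
```

Because the delimiter is a visible comma, the raw lines printed by readLines() show at a glance where one column ends and the next begins.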

General advice on importing data into R can be found at https://cran.r-project.org/doc/manuals/r-release/R-data.html

Converting data to a CSV text file

A comma separated values file (CSV file) is the most common format for a text file that contains data.

Here are a few video tutorials on converting data into a CSV text file so that it is suitable for import into R.

Video Tutorial: Converting data from EXCEL to a CSV format (3 mins)

Video Tutorial: Converting data from Googlesheets to a CSV format (1 min)

Viewing text files

Before importing a text file into any software package it is a huge help if you can look at it in a text editor. Text files can contain characters that are normally invisible (e.g. spaces, tabs and end of line markers). If a text editor is going to be of use it must be able to display all the characters in a file.

Three text editors that can do this are:

notepad++ is a free program for Windows operating systems

BBedit is a free program for Mac OSX operating systems

emacs is a GNU open-source program primarily for Linux operating systems.

On Linux systems the cat -A command from the terminal is also useful.
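R itself can also give a quick look at the raw contents of a text file before you import it. The sketch below creates a small TAB delimited file in a temporary directory (the file name and contents are invented) and inspects it with the base R functions readLines() and count.fields().

```r
# Create a small TAB delimited example file (made-up contents)
txt_file <- file.path(tempdir(), "peek_example.txt")
writeLines(c("site\tcount", "A\t12", "B\t7"), txt_file)

# Read the first lines as raw text, without interpreting the columns
raw_lines <- readLines(txt_file, n = 2)
print(raw_lines)   # print() displays each TAB character as \t

# Count the fields on every line for a given delimiter;
# unequal counts are a warning sign of a formatting problem
count.fields(txt_file, sep = "\t")
```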

Here are two video tutorials on this topic

Video Tutorial: Viewing data in a text file before importing into R (4 mins)

Video Tutorial: An overview of the common data text file formats (3 mins)

Data import examples

The data we'll be importing are described at http://www.ucd.ie/ecomodel/Resources/datasets_WebVersion.html

The files are:

  • WOLF.CSV: This file is a text file of comma separated values.
  • HEIGHT.CSV: This file is a text file of comma separated values.
  • INSECT.TXT: This file is a text file of TAB delimited values.
  • BEEKEEPER.TXT: This file is a text file with blank space delimiting the values.
  • MALIN_HEAD.TXT: This file is a text file with TAB delimited values.

All these data files are simple text files that differ in the character used to distinguish columns of data.

Comma delimited files (CSV files)

CSV stands for comma separated values (note that sometimes semi-colons are used in place of commas because some countries use the comma in place of the decimal point).

The read.table() function is a flexible function for importing text data

Video Tutorial: Importing a CSV file into R using read.table() (5 mins)

    # Import WOLF.CSV file using read.table function
    wolf = read.table('WOLF.CSV', header=TRUE, sep=',')

The wolf variable contains the imported data. It is called a data frame.

The ideal arrangement of a data frame is for each row to be an observation of some object and each column a variable that measures some property of the object. For example, each row of wolf is an observation of one individual wolf and each column of wolf gives information about where the wolf was observed and the data collected from its hair sample.

The HEIGHT.CSV file also contains comma separated values. Here is the read.table() command to read in this file

    # Import HEIGHT.CSV file using read.table function
    human = read.table('HEIGHT.CSV', header=TRUE, sep=',')

Note: The function read.csv() is a special case of the read.table() function.
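To illustrate the note above, the following sketch reads the same small file with both functions and checks that the results match. The file is created in a temporary directory and its name and contents are made up; for a simple file like this one the two calls should give the same data frame.

```r
# Create a small CSV file (made-up contents)
csv_file <- file.path(tempdir(), "readcsv_example.csv")
writeLines(c("id,length_cm", "1,10.5", "2,11.2"), csv_file)

# read.table() needs to be told about the header and the delimiter
via_table <- read.table(csv_file, header = TRUE, sep = ",")

# read.csv() has header=TRUE and sep=',' as its defaults
via_csv <- read.csv(csv_file)

# The two data frames are the same
identical(via_table, via_csv)
```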

Use the R help pages to learn more about these functions

    ?read.table   # Display help page on read.table function
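As mentioned at the start of this section, some files use a semi-colon as the delimiter and a comma as the decimal point. A possible way to handle this with read.table() is sketched below, using its sep and dec arguments. The file is made up and written to a temporary directory.

```r
# Create a semi-colon delimited file with decimal commas (made-up contents)
semi_file <- file.path(tempdir(), "semicolon_example.csv")
writeLines(c("site;mass_kg", "A;1,5", "B;2,25"), semi_file)

# sep=';' sets the column delimiter, dec=',' sets the decimal mark
masses <- read.table(semi_file, header = TRUE, sep = ";", dec = ",")

# mass_kg is imported as a numerical variable
str(masses)
```

Base R also provides read.csv2(), which uses these two settings as its defaults.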

TAB delimited files (TXT files)

The INSECT.TXT data set is a text file where variables are delimited by a TAB. In addition the first three lines contain a data description that we do not want to import.

The read.table() function can be used to import this file. The argument skip=3 is used to ignore the first three lines. The argument sep='\t' specifies a TAB as the variable delimiter

    # Import INSECT.TXT file using read.table function (TAB delimited)
    # skipping the first three lines (skip=3)
    insect = read.table('INSECT.TXT', header=T, skip=3, sep='\t')
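If you do not have INSECT.TXT to hand, the effect of skip and sep='\t' can be seen with a self-contained sketch. The file below is invented for illustration and only mimics the layout described above: three description lines followed by a TAB delimited table.

```r
# Create a TAB delimited file with three description lines (made-up contents)
tab_file <- file.path(tempdir(), "skip_example.txt")
writeLines(c(
  "Description line 1",
  "Description line 2",
  "Description line 3",
  "site\tcount",
  "A\t12",
  "B\t7"
), tab_file)

# Skip the three description lines, then read the header and the data
example_data <- read.table(tab_file, header = TRUE, skip = 3, sep = "\t")
example_data
```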

The MALIN_HEAD.TXT file also contains TAB delimited data. Here is the read.table() command to read in this file

    # Import MALIN_HEAD.TXT file using read.table function (TAB delimited)
    rainfall = read.table('MALIN_HEAD.TXT', header=T, sep='\t')

Blank space delimited files

The BEEKEEPER.TXT data set uses white space to delimit the variables. The first 6 lines of the file contain a description of the data

Using read.table() with the argument sep='' will interpret any white space as a variable delimiter.

    # Import BEEKEEPER.TXT file using read.table function (white space delimited)
    # skipping the first six lines (skip=6)
    bees = read.table('BEEKEEPER.TXT', header=T, skip=6, sep='')
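The sketch below shows what "any white space" means in practice. The example file is made up and deliberately mixes single spaces, several spaces and TABs between values; with sep='' they are all treated as one delimiter.

```r
# Create a file with mixed white space between the values (made-up contents)
ws_file <- file.path(tempdir(), "whitespace_example.txt")
writeLines(c("hive   honey_kg", "H1 20.5", "H2\t\t18"), ws_file)

# sep='' treats one or more spaces or TABs as a single delimiter
hives <- read.table(ws_file, header = TRUE, sep = "")
hives
```

This flexibility is also why white space delimited files are fragile: a value that itself contains a space would be split into two columns.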

Summary of import commands

Type of text file         R Command
Comma delimited (.CSV)    read.table(<filename>, header=T, sep=',')
TAB delimited (.TXT)      read.table(<filename>, header=T, sep='\t')
Blank space (.TXT)        read.table(<filename>, header=T, sep='')

    # Comma separated values
    wolf = read.table('WOLF.CSV', header=TRUE, sep=',')
    human = read.table('HEIGHT.CSV', header=TRUE, sep=',')

    # TAB delimited values
    insect = read.table('INSECT.TXT', header=T, skip=3, sep='\t')
    rainfall = read.table('MALIN_HEAD.TXT', header=T, sep='\t')

    # White space delimited values
    bees = read.table('BEEKEEPER.TXT', header=T, skip=6, sep='')

Importing data using RStudio

RStudio has its own data import functionality. To use this you will need to install the R package readr. For more information about this see RStudio's guide: https://support.rstudio.com/hc/en-us/articles/218611977-Importing-Data-with-RStudio

Video Tutorial: Importing a CSV file into R using RStudio's GUI (3 mins 13 secs)

Importing data using RStudio will save the data as a modified data frame, called a tibble (tibbles are briefly discussed above).

Importing using fread()

fread() is a powerful data import function that is similar to read.table() but faster. It is part of the data.table package, which you will need to install.

You should only have to give fread() the name of the file you want to import, and fread() will try to work out the appropriate way to import the data. Try some examples and compare them with the examples above

    # ******************************************
    # Other packages for importing data --------
    # The data.table package

    library(data.table)   # Load the data.table package

    # Import a CSV file
    wolf2 = fread('WOLF.CSV')
    human2 = fread('HEIGHT.CSV')

    # Import TAB delimited file
    insect2 = fread('INSECT.TXT')
    rainfall2 = fread('MALIN_HEAD.TXT')

    # Import white space delimited file
    bees2 = fread('BEEKEEPER.TXT')

The fread() command is simpler to use because it tries to guess the format of the data in the file.
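One thing to be aware of is that fread() returns a data.table object by default rather than a plain data frame. The sketch below assumes the data.table package is installed (the code is skipped if it is not) and uses a small made-up file; it shows the default result and the data.table=FALSE argument, which asks fread() for an ordinary data frame instead.

```r
# Only run if the data.table package is available
if (requireNamespace("data.table", quietly = TRUE)) {
  # Create a small CSV file (made-up contents)
  fread_file <- file.path(tempdir(), "fread_example.csv")
  writeLines(c("id,length_cm", "1,10.5", "2,11.2"), fread_file)

  # By default fread() returns a data.table (a modified data frame)
  as_dt <- data.table::fread(fread_file)
  print(class(as_dt))

  # data.table=FALSE returns an ordinary data frame
  as_df <- data.table::fread(fread_file, data.table = FALSE)
  print(class(as_df))
}
```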

Summary of the topics covered

  • Organizing your files on your computer
  • Best practice for formatting data
  • Reading in spreadsheet data
  • Data frames

Further Reading

All these books can be found in UCD's library

  • Andrew P. Beckerman and Owen L. Petchey, 2012 Getting Started with R: An introduction for biologists (Oxford University Press, Oxford) [Chapter 2, 3]
  • Mark Gardner, 2012 Statistics for Ecologists Using R and Excel (Pelagic, Exeter)
  • Michael J. Crawley, 2015 Statistics : an introduction using R (John Wiley & Sons, Chichester) [Chapter 2]
  • Tenko Raykov and George A Marcoulides, 2013 Basic statistics: an introduction with R (Rowman and Littlefield, Plymouth)

Source: https://www.ucd.ie/ecomodel/Resources/Sheet2a_data_import_WebVersion.html
