Data Transformation: dplyr package


R의 신 Hadley Wickham의 책 R for Data Science 5장에 나오는 내용을 정리 한다.

필요 package

install.packages("nycflights13")
library("nycflights13")
library(tidyverse)

tidyverse 로딩시 아래와 같이 conflict메시지가 나올 수도 있다.

-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 2.2.1     √ purrr   0.2.4
√ tibble  1.4.1     √ dplyr   0.7.4
√ tidyr   0.7.2     √ stringr 1.2.0
√ readr   1.1.1     √ forcats 0.3.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

flights데이터를 출력하면 아래와 같이 출력 된다.

> flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
 1  2013     1     1      517            515      2.00      830            819     11.0  UA        1545 N14228  EWR    IAH      227  
 2  2013     1     1      533            529      4.00      850            830     20.0  UA        1714 N24211  LGA    IAH      227  
 3  2013     1     1      542            540      2.00      923            850     33.0  AA        1141 N619AA  JFK    MIA      160  
 4  2013     1     1      544            545     -1.00     1004           1022    -18.0  B6         725 N804JB  JFK    BQN      183  
 5  2013     1     1      554            600     -6.00      812            837    -25.0  DL         461 N668DN  LGA    ATL      116  
 6  2013     1     1      554            558     -4.00      740            728     12.0  UA        1696 N39463  EWR    ORD      150  
 7  2013     1     1      555            600     -5.00      913            854     19.0  B6         507 N516JB  EWR    FLL      158  
 8  2013     1     1      557            600     -3.00      709            723    -14.0  EV        5708 N829AS  LGA    IAD       53.0
 9  2013     1     1      557            600     -3.00      838            846    - 8.00 B6          79 N593JB  JFK    MCO      140  
10  2013     1     1      558            600     -2.00      753            745      8.00 AA         301 N3ALAA  LGA    ORD      138  
# ... with 336,766 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
> class(flights)
[1] "tbl_df"     "tbl"        "data.frame"

tibble이라는 data.frame을 약간 변형 시킨것을 사용한다.
tibble에 대한 상세한 내용의 링크이다.

dplyr basics

중요 package 5개를 설명한다.

  • Pick obervations by their values (filter())
  • Reorder the rows (arrange())
  • Pick variables by their names(select())
  • Create new variables with functions of existing variables (mutate())
  • Collpase many values down to a single summary (summarise())

추가로 group_by()를 통해서 scope을 조절 할 수 있다.

filter()

nov_dec <- filter(flights, month %in% c(11, 12))

# A tibble: 55,403 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
 1  2013    11     1        5           2359      6.00      352            345      7.00 B6         745 N568JB  JFK    PSE      205  
 2  2013    11     1       35           2250    105         123           2356     87.0  B6        1816 N353JB  JFK    SYR       36.0
 3  2013    11     1      455            500   -  5.00      641            651    -10.0  US        1895 N192UW  EWR    CLT       88.0
 4  2013    11     1      539            545   -  6.00      856            827     29.0  UA        1714 N38727  LGA    IAH      229  
 5  2013    11     1      542            545   -  3.00      831            855    -24.0  AA        2243 N5CLAA  JFK    MIA      147  
 6  2013    11     1      549            600   - 11.0       912            923    -11.0  UA         303 N595UA  JFK    SFO      359  
 7  2013    11     1      550            600   - 10.0       705            659      6.00 US        2167 N748UW  LGA    DCA       57.0
 8  2013    11     1      554            600   -  6.00      659            701    - 2.00 US        2134 N742PS  LGA    BOS       40.0
 9  2013    11     1      554            600   -  6.00      826            827    - 1.00 DL         563 N912DE  LGA    ATL      126  
10  2013    11     1      554            600   -  6.00      749            751    - 2.00 DL         731 N315NB  LGA    DTW       93.0
# ... with 55,393 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

arrange()

정렬 용도

arrange(flights, desc(arr_delay))
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013     1     9      641            900      1301     1242
#> 2  2013     6    15     1432           1935      1137     1607
#> 3  2013     1    10     1121           1635      1126     1239
#> 4  2013     9    20     1139           1845      1014     1457
#> 5  2013     7    22      845           1600      1005     1044
#> 6  2013     4    10     1100           1900       960     1342
#> # ... with 3.368e+05 more rows, and 12 more variables:
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

select()

선택 용도

select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))
  • starts_with(“abc”): matches names that begin with “abc”.
  • ends_with(“xyz”): matches names that end with “xyz”.
  • contains(“ijk”): matches names that contain “ijk”.
  • matches(“(.)\1”): selects variables that match a regular expression. This one matches any variables that contain - repeated characters. You’ll learn more about regular expressions in strings.
  • num_range(“x”, 1:3) matches x1, x2 and x3.

mutate()

summarise()


'Data Science > R Basic' 카테고리의 다른 글

Jupyter Notebook으로 R 실행  (0) 2018.04.15
Data Transformation: dplyr package  (0) 2018.04.14
두 개 데이터 프레임 병합, 빼기  (0) 2018.03.04
R Factor 검색  (0) 2017.11.13
RStudio v1.1 Released  (0) 2017.10.10
rJava load 에러  (0) 2017.04.16

+ Recent posts