Data Transformation: dplyr package

JAYNUX 2018. 4. 14. 20:36

2018. 4. 14. 20:36

Data Transformation: dplyr package

R의 신 Hadley Wickham의 책 R for Data Science의 5장에 나오는 내용을 정리 한다.

Hadley Wickham본인이 직접 만든 패키지이므로 유용하다.
기타 그가 만든 패키지는 아래와 같다.

ggplot2
plyr
reshape2

plyr은 모든 함수가 R로 작성되어서 처리 속도가 느리다는 단점이 있다.
dplyr은 C++로 작성되어 불필요한 함수를 불러오지 않기 때문에 매우 빠른 처리 속도를 자랑한다.
결국 분할(split),적용(apply) 병합(combine)을 plyr로 하던것을 dplyr로 변경하는 패러다임 전환이다.

필자도 plyr의 {}{}ply를 많이 사용 했었다. 하지만 파이프연산과 조합한dplyr이 좀 더 가독성이 높고
속도도 빠르다고하니 이쪽으로 옴겨 가야 하는 것 같다.

dplyr의 지원 함수군 외의 특징

data.table 지원
각종 DB지원: MySQL, PostgreSQL, SQLite, BigQuery
데이터 큐브

함수명	내용	유사함수
filter()	지정한 조건식에 맞는 데이터 추출	subset()
select()	열의 추출	data[, c(“Year”, “Month”)]
mutate()	열 추가	transform()
arrange()	정렬	order(), sort()
summarise()	집계	aggregate()

필요 package

install.packages("nycflights13")
library("nycflights13")
library(tidyverse)

tidyverse 로딩시 아래와 같이 conflict메시지가 나올 수도 있다.

-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
√ ggplot2 2.2.1     √ purrr   0.2.4
√ tibble  1.4.1     √ dplyr   0.7.4
√ tidyr   0.7.2     √ stringr 1.2.0
√ readr   1.1.1     √ forcats 0.3.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()

flights데이터를 출력하면 아래와 같이 출력 된다.

> flights
# A tibble: 336,776 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
 1  2013     1     1      517            515      2.00      830            819     11.0  UA        1545 N14228  EWR    IAH      227  
 2  2013     1     1      533            529      4.00      850            830     20.0  UA        1714 N24211  LGA    IAH      227  
 3  2013     1     1      542            540      2.00      923            850     33.0  AA        1141 N619AA  JFK    MIA      160  
 4  2013     1     1      544            545     -1.00     1004           1022    -18.0  B6         725 N804JB  JFK    BQN      183  
 5  2013     1     1      554            600     -6.00      812            837    -25.0  DL         461 N668DN  LGA    ATL      116  
 6  2013     1     1      554            558     -4.00      740            728     12.0  UA        1696 N39463  EWR    ORD      150  
 7  2013     1     1      555            600     -5.00      913            854     19.0  B6         507 N516JB  EWR    FLL      158  
 8  2013     1     1      557            600     -3.00      709            723    -14.0  EV        5708 N829AS  LGA    IAD       53.0
 9  2013     1     1      557            600     -3.00      838            846    - 8.00 B6          79 N593JB  JFK    MCO      140  
10  2013     1     1      558            600     -2.00      753            745      8.00 AA         301 N3ALAA  LGA    ORD      138  
# ... with 336,766 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
> class(flights)
[1] "tbl_df"     "tbl"        "data.frame"

336,776의 뉴욕에서 2013에 출발한 비행 정보를 담고 있다.
US Bureau of Transportation Statistics

tibble이라는 data.frame을 약간 변형 시킨것을 사용한다.
tibble에 대한 상세한 내용의 링크이다.

dplyr basics

중요 package 5개를 설명한다.

Pick obervations by their values (filter())
Reorder the rows (arrange())
Pick variables by their names(select())
Create new variables with functions of existing variables (mutate())
Collpase many values down to a single summary (summarise())

추가로 group_by()를 통해서 scope을 조절 할 수 있다.

filter()

관찰한 데이터의 subset을 쉽게 구할 수 있도록 해준다.
첫 번째 인자는 data frame이고, 아래 예제에서 그 다음 인자들은 조건을 의미한다.
아래의 예제는 January 1st를 필터링한 예제이다.

nov_dec <- filter(flights, month %in% c(11, 12))

# A tibble: 55,403 x 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest  air_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>    <int> <chr>   <chr>  <chr>    <dbl>
 1  2013    11     1        5           2359      6.00      352            345      7.00 B6         745 N568JB  JFK    PSE      205  
 2  2013    11     1       35           2250    105         123           2356     87.0  B6        1816 N353JB  JFK    SYR       36.0
 3  2013    11     1      455            500   -  5.00      641            651    -10.0  US        1895 N192UW  EWR    CLT       88.0
 4  2013    11     1      539            545   -  6.00      856            827     29.0  UA        1714 N38727  LGA    IAH      229  
 5  2013    11     1      542            545   -  3.00      831            855    -24.0  AA        2243 N5CLAA  JFK    MIA      147  
 6  2013    11     1      549            600   - 11.0       912            923    -11.0  UA         303 N595UA  JFK    SFO      359  
 7  2013    11     1      550            600   - 10.0       705            659      6.00 US        2167 N748UW  LGA    DCA       57.0
 8  2013    11     1      554            600   -  6.00      659            701    - 2.00 US        2134 N742PS  LGA    BOS       40.0
 9  2013    11     1      554            600   -  6.00      826            827    - 1.00 DL         563 N912DE  LGA    ATL      126  
10  2013    11     1      554            600   -  6.00      749            751    - 2.00 DL         731 N315NB  LGA    DTW       93.0
# ... with 55,393 more rows, and 4 more variables: distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

arrange()

정렬 용도

arrange(flights, desc(arr_delay))
#> # A tibble: 336,776 × 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>
#> 1  2013     1     9      641            900      1301     1242
#> 2  2013     6    15     1432           1935      1137     1607
#> 3  2013     1    10     1121           1635      1126     1239
#> 4  2013     9    20     1139           1845      1014     1457
#> 5  2013     7    22      845           1600      1005     1044
#> 6  2013     4    10     1100           1900       960     1342
#> # ... with 3.368e+05 more rows, and 12 more variables:
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

select()

선택 용도

select(flights, year, month, day)
select(flights, year:day)
select(flights, -(year:day))

starts_with(“abc”): matches names that begin with “abc”.
ends_with(“xyz”): matches names that end with “xyz”.
contains(“ijk”): matches names that contain “ijk”.
matches(“(.)\1”): selects variables that match a regular expression. This one matches any variables that contain - repeated characters. You’ll learn more about regular expressions in strings.
num_range(“x”, 1:3) matches x1, x2 and x3.

mutate()

현재 있는 Data.Frame의 맨 뒤에다가 새로운 컬럼을 추가하는 것이다.

flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)
mutate(flights_sml,
  gain = arr_delay - dep_delay,
  speed = distance / air_time * 60
)

# A tibble: 336,776 x 9
    year month   day dep_delay arr_delay distance air_time   gain speed
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>  <dbl> <dbl>
 1  2013     1     1      2.00     11.0      1400    227     9.00   370
 2  2013     1     1      4.00     20.0      1416    227    16.0    374
 3  2013     1     1      2.00     33.0      1089    160    31.0    408
 4  2013     1     1     -1.00    -18.0      1576    183   -17.0    517
 5  2013     1     1     -6.00    -25.0       762    116   -19.0    394
 6  2013     1     1     -4.00     12.0       719    150    16.0    288
 7  2013     1     1     -5.00     19.0      1065    158    24.0    404
 8  2013     1     1     -3.00    -14.0       229     53.0 -11.0    259
 9  2013     1     1     -3.00    - 8.00      944    140   - 5.00   405
10  2013     1     1     -2.00      8.00      733    138    10.0    319
# ... with 336,766 more rows

plyr:transform과 다른점은 연이어서 생성한 컬럼을 바로 사용해도 된다는 것이다.

mutate(flights_sml,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

# A tibble: 336,776 x 10
    year month   day dep_delay arr_delay distance air_time
   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl>
 1  2013     1     1      2.00     11.0      1400    227  
 2  2013     1     1      4.00     20.0      1416    227  
 3  2013     1     1      2.00     33.0      1089    160  
 4  2013     1     1     -1.00    -18.0      1576    183  
 5  2013     1     1     -6.00    -25.0       762    116  
 6  2013     1     1     -4.00     12.0       719    150  
 7  2013     1     1     -5.00     19.0      1065    158  
 8  2013     1     1     -3.00    -14.0       229     53.0
 9  2013     1     1     -3.00    - 8.00      944    140  
10  2013     1     1     -2.00      8.00      733    138  
# ... with 336,766 more rows, and 3 more variables: gain <dbl>,
#   hours <dbl>, gain_per_hour <dbl>

새로 생성한 variable columns만 유지하고 싶으면 transmute를 사용한다.

transmute(flights,
  gain = arr_delay - dep_delay,
  hours = air_time / 60,
  gain_per_hour = gain / hours
)

# A tibble: 336,776 x 3
     gain hours gain_per_hour
    <dbl> <dbl>         <dbl>
 1   9.00 3.78           2.38
 2  16.0  3.78           4.23
 3  31.0  2.67          11.6 
 4 -17.0  3.05         - 5.57
 5 -19.0  1.93         - 9.83
 6  16.0  2.50           6.40
 7  24.0  2.63           9.11
 8 -11.0  0.883        -12.5 
 9 - 5.00 2.33         - 2.14
10  10.0  2.30           4.35
# ... with 336,766 more rows

summarise()

그룹핑해서 하나의 single row로 합치는 역할을 한다.
그냥 summarise만 단독해서 사용하면 엄청나게 유용하진 않지만, group_by와 함께 사용하면 그 위력이 증대된다.

그냥 summarise하면 단 하나의 row로 처리됨

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))

# A tibble: 1 x 1
  delay
  <dbl>
1  12.6

group_by를 이용해서 그룹핑 가능함

by_day <- group_by(flights, year, month, day)
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
# pipe
group_by(flights, year, month, day) %>% summarise(delay = mean(dep_delay, na.rm = TRUE))

# A tibble: 365 x 4
# Groups:   year, month [?]
    year month   day delay
   <int> <int> <int> <dbl>
 1  2013     1     1 11.5 
 2  2013     1     2 13.9 
 3  2013     1     3 11.0 
 4  2013     1     4  8.95
 5  2013     1     5  5.73
 6  2013     1     6  7.15
 7  2013     1     7  5.42
 8  2013     1     8  2.55
 9  2013     1     9  2.28
10  2013     1    10  2.84
# ... with 355 more rows

파이프의 사용

아래 세개의 작업을 수행

flight를 destination에 따라서 그룹핑함
flight의 수와 거리, 평균 딜레이를 요약
Honolulu airport와 noisy point를 제거 한다.

파이를 사용하면 쓸 때 없는 중간 변수들을 정의하지 않아도 되기 때문에 편리하고 직관적인 이해를 달성한다.

파이프 사용

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "HNL")

파이프 미사용

# Combining multiple operations with the pipe
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
   count = n(),
   dist = mean(distance, na.rm = TRUE),
   delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "HNL")

참고자료

R 데이터 처리의 새로운 강자, dplyr 패키지
dplyr로 데이터 집계와 분석

저작자표시 (새창열림)

'AI > R Basic' 카테고리의 다른 글

R Studio Server Install in Ubuntu Server (3)	2018.07.04
Jupyter Notebook으로 R 실행 (3)	2018.04.15
두 개 데이터 프레임 병합, 빼기 (0)	2018.03.04
R Factor 검색 (0)	2017.11.13
RStudio v1.1 Released (0)	2017.10.10

GOOD to GREAT