Rのこと。

記事は引っ越し作業中。2023年中までに引っ越しを完了させてブログは削除予定

lubridate::mdy()の挙動に関するメモ

はじめに

lubridate::mdy()の挙動で知らないことがあったのでメモ。

lubridate::mdy()

lubridate::mdy()は簡単にいうと、年日月という形式のデータを日付に変換してくれる関数。なんだけど、渡す文字列によっては想定どおりに変換されないので注意。"December 4 2020"は2020年12月4日に変換されてほしいが日の要素の頭に空白または0がないとうまく変換されない。

packageVersion("lubridate")
[1] ‘1.7.9’

mdy("December 4 2020")
[1] "2020-04-20"

mdy("December  4 2020")
[1] "2020-12-04"

mdy("December 04 2020")
[1] "2020-12-04"

例えば、December 4, 2020 at 04:44PMという形式のデータがあると、年月日の部分だけ取り出して、lubridate::mdy()を使うと痛い目にあう。

df <- tibble(date_hm = c("December 4, 2020 at 04:44PM",
                         "December  4, 2020 at 04:44PM",
                         "December 04, 2020 at 04:44PM",
                         "April 4, 2020 at 04:44PM",
                         "April  4, 2020 at 04:44PM",
                         "April 04, 2020 at 04:44PM"),
             date = c("December 4, 2020",
                      "December  4, 2020",
                      "December 04, 2020",
                      "April 4, 2020",
                      "April  4, 2020",
                      "April 04, 2020")
)
df %>% 
  dplyr::mutate(test_mdy_hm = lubridate::mdy_hm(date_hm),
                test_mdy    = lubridate::mdy(date)) %>% 
  dplyr::select(date_hm, test_mdy_hm, date, test_mdy)

# A tibble: 6 x 4
  date_hm                      test_mdy_hm         date              test_mdy  
  <chr>                        <dttm>              <chr>             <date>    
1 December 4, 2020 at 04:44PM  2020-12-04 16:44:00 December 4, 2020  2020-04-20 # No
2 December  4, 2020 at 04:44PM 2020-12-04 16:44:00 December  4, 2020 2020-12-04
3 December 04, 2020 at 04:44PM 2020-12-04 16:44:00 December 04, 2020 2020-12-04
4 April 4, 2020 at 04:44PM     2020-04-04 16:44:00 April 4, 2020     2020-04-20 # No
5 April  4, 2020 at 04:44PM    2020-04-04 16:44:00 April  4, 2020    2020-04-04
6 April 04, 2020 at 04:44PM    2020-04-04 16:44:00 April 04, 2020    2020-04-04

ということで、暫定的に下記のように面倒だけどコネコネして変換する。ほかにもっといい方法はあると思う。

library(googlesheets4)
library(tidyverse)
library(lubridate)

sheets_auth()

d <- read_sheet(
  ss = "1ZKHN9PeuApHGzetxptpYz0C3K0uvfT5OVX3gWKw-poY",
  range = "シート1!A1:C154"
) %>% 
  dplyr::select(-unit)

d %>% 
  dplyr::mutate(
    date0 = stringr::str_replace(date, pattern = '[:space:]at.*', ''),
    date1 = stringr::str_replace(date0, pattern = ',', '')
    ) %>% 
  tidyr::separate(date1, c("mm", "dd", "yyyy")) %>% 
  dplyr::mutate(date_mod = ymd(paste0(yyyy,"/",mm,"/",dd))) %>% 
  dplyr::select(date, date_mod, weight)

# A tibble: 153 x 3
   date                     date_mod   weight
   <chr>                    <date>      <dbl>
 1 July 5, 2020 at 10:44PM  2020-07-05   88.4
 2 July 6, 2020 at 18:53PM  2020-07-06   86.8
 3 July 7, 2020 at 18:40PM  2020-07-07   86.2
 4 July 8, 2020 at 18:37PM  2020-07-08   86.0
 5 July 9, 2020 at 20:43PM  2020-07-09   85.2
 6 July 10, 2020 at 20:21PM 2020-07-10   84.9
 7 July 11, 2020 at 20:20PM 2020-07-11   84.7
 8 July 12, 2020 at 19:1PM  2020-07-12   84  
 9 July 13, 2020 at 20:23PM 2020-07-13   83.4
10 July 14, 2020 at 18:57PM 2020-07-14   83.1
# … with 143 more rows