JSONとNDJSONをRで扱う
はじめに
ここではJSONとNDJSONをRで扱う方法をまとめておく。
JSONとは
そもそもJSONとはなにか。JSON(JavaScript Object Notation)は、軽量のデータ交換フォーマットで、人間にも機械にも読み書きが容易な形式のデータのことらしい。
基本的には下記のようなKey-Value形式でデータを保持する。business_id
はvcNAWiLM4dR7D2nwwJ7nCA
で、hours
には、Tuesday
からThursday
まであって、おのおのOPENからCLOSEまでの時間がネストされて保持される。
[ { "business_id":"vcNAWiLM4dR7D2nwwJ7nCA", "full_address":"4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", "hours":{ "Tuesday":{ "close":"17:00", "open":"08:00" }, "Friday":{ "close":"17:00", "open":"08:00" }, "Monday":{ "close":"17:00", "open":"08:00" }, "Wednesday":{ "close":"17:00", "open":"08:00" }, "Thursday":{ "close":"17:00", "open":"08:00" } }, "open":true, "categories":[ "Doctors", "Health & Medical" ], "city":"Phoenix", "review_count":9, "name":"Eric Goldberg, MD", "neighborhoods":[ ], "longitude":-111.98375799999999, "state":"AZ", "stars":3.5, "latitude":33.499313000000001, "attributes":{ "By Appointment Only":true }, "type":"business" } ]
JSON形式を読み込むには、jsonlite::fromJSON()
かjsonlite::read_json()
を使うことになる。デフォルトの設定であれば、jsonlite::fromJSON()
はデータフレームで、
library(jsonlite) library(tidyverse) jsonlite::fromJSON("json_fromat.json") business_id full_address hours.Tuesday.close hours.Tuesday.open hours.Friday.close 1 vcNAWiLM4dR7D2nwwJ7nCA 4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018 17:00 08:00 17:00 2 UsFtqoBl7naz8AVUBZMjQQ 202 McClure St\nDravosburg, PA 15034 <NA> <NA> <NA> hours.Friday.open hours.Monday.close hours.Monday.open hours.Wednesday.close hours.Wednesday.open hours.Thursday.close 1 08:00 17:00 08:00 17:00 08:00 17:00 2 <NA> <NA> <NA> <NA> <NA> <NA> hours.Thursday.open open categories city review_count name neighborhoods longitude state stars 1 08:00 TRUE Doctors, Health & Medical Phoenix 9 Eric Goldberg, MD NULL -111.98376 AZ 3.5 2 <NA> TRUE Nightlife Dravosburg 4 Clancy's Pub NULL -79.88693 PA 3.5 latitude attributes.By Appointment Only attributes.Happy Hour attributes.Accepts Credit Cards attributes.Good For Groups 1 33.49931 TRUE NA NA NA 2 40.35052 NA TRUE TRUE TRUE attributes.Outdoor Seating attributes.Price Range type 1 NA NA business 2 FALSE 1 business
jsonlite::read_json()
はリストで読み込む。simplifyVector = TRUE
とすればデータフレームで返る。
jsonlite::read_json("json_fromat.json") [[1]] [[1]]$business_id [1] "vcNAWiLM4dR7D2nwwJ7nCA" [[1]]$full_address [1] "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018" [[1]]$hours [[1]]$hours$Tuesday [[1]]$hours$Tuesday$close [1] "17:00" 【略】 [[2]]$attributes [[2]]$attributes$`Happy Hour` [1] TRUE [[2]]$attributes$`Accepts Credit Cards` [1] TRUE [[2]]$attributes$`Good For Groups` [1] TRUE [[2]]$attributes$`Outdoor Seating` [1] FALSE [[2]]$attributes$`Price Range` [1] 1 [[2]]$type [1] "business"
NDJSON
NDJSON(Newline Delimited JSON)は、下記のような形式でJSONデータを保持している。改行区切りJSONとも呼ばれる。このような形式にする理由として、NDJSONは、1度に1レコードを処理できる構造化データを保存したり、ストリーミングするための便利な形式なんだとか。
{"business_id": "vcNAWiLM4dR7D2nwwJ7nCA", "full_address": "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018", "hours": {"Tuesday": {"close": "17:00", "open": "08:00"}, "Friday": {"close": "17:00", "open": "08:00"}, "Monday": {"close": "17:00", "open": "08:00"}, "Wednesday": {"close": "17:00", "open": "08:00"}, "Thursday": {"close": "17:00", "open": "08:00"}}, "open": true, "categories": ["Doctors", "Health & Medical"], "city": "Phoenix", "review_count": 9, "name": "Eric Goldberg, MD", "neighborhoods": [], "longitude": -111.98375799999999, "state": "AZ", "stars": 3.5, "latitude": 33.499313000000001, "attributes": {"By Appointment Only": true}, "type": "business"}
なので、1行目を表示させると、こんな違いがある。NDJSONは、1行で1つのレコードになる。必ずしもこうではないJSONファイルはやまほどある…泣
readLines("/Users/aki/Desktop/json_fromat.json" , n = 1, warn = FALSE) [1] "[" readLines("/Users/aki/Desktop/ndjson_fromat.json", n = 1, warn = FALSE) [1] "{\"business_id\": \"vcNAWiLM4dR7D2nwwJ7nCA\", \"full_address\": \"4840 E Indian School Rd\\nSte 101\\nPhoenix, AZ 85018\", \"hours\": {\"Tuesday\": {\"close\": \"17:00\", \"open\": \"08:00\"}, \"Friday\": {\"close\": \"17:00\", \"open\": \"08:00\"}, \"Monday\": {\"close\": \"17:00\", \"open\": \"08:00\"}, \"Wednesday\": {\"close\": \"17:00\", \"open\": \"08:00\"}, \"Thursday\": {\"close\": \"17:00\", \"open\": \"08:00\"}}, \"open\": true, \"categories\": [\"Doctors\", \"Health & Medical\"], \"city\": \"Phoenix\", \"review_count\": 9, \"name\": \"Eric Goldberg, MD\", \"neighborhoods\": [], \"longitude\": -111.98375799999999, \"state\": \"AZ\", \"stars\": 3.5, \"latitude\": 33.499313000000001, \"attributes\": {\"By Appointment Only\": true}, \"type\": \"business\"}"
なので、ここらへんを利用しつつNDJSONかどうかを判定するヘルパー関数を雑に作るとこんな感じ。
is_ndjson <- function(path) { if(str_detect(string = path, pattern = ".json$") == FALSE){ stop("must be json format") } valid <- readLines(con = path, n = 1, warn = FALSE) res <- jsonlite::validate(valid) if (res == TRUE) { res } else { res[[1]] } } is_ndjson(path = "/Users/aki/Desktop/iris.csv") is_ndjson(path = "/Users/aki/Desktop/iris.csv") でエラー: must be json format is_ndjson(path = "/Users/aki/Desktop/json_fromat.json") [1] FALSE is_ndjson(path = "/Users/aki/Desktop/ndjson_fromat.json") [1] TRUE
NDJSONの読み込みは、stream_in()
を使用する。file()
でコネクションを作る必要があります。
jsonlite::stream_in(file("/Users/aki/Desktop/ndjson_fromat.json")) opening file input connection. Imported 2 records. Simplifying... closing file input connection. business_id full_address hours.Tuesday.close hours.Tuesday.open hours.Friday.close 1 vcNAWiLM4dR7D2nwwJ7nCA 4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018 17:00 08:00 17:00 2 UsFtqoBl7naz8AVUBZMjQQ 202 McClure St\nDravosburg, PA 15034 <NA> <NA> <NA> hours.Friday.open hours.Monday.close hours.Monday.open hours.Wednesday.close hours.Wednesday.open hours.Thursday.close 1 08:00 17:00 08:00 17:00 08:00 17:00 2 <NA> <NA> <NA> <NA> <NA> <NA> hours.Thursday.open open categories city review_count name neighborhoods longitude state stars 1 08:00 TRUE Doctors, Health & Medical Phoenix 9 Eric Goldberg, MD NULL -111.98376 AZ 3.5 2 <NA> TRUE Nightlife Dravosburg 4 Clancy's Pub NULL -79.88693 PA 3.5 latitude attributes.By Appointment Only attributes.Happy Hour attributes.Accepts Credit Cards attributes.Good For Groups 1 33.49931 TRUE NA NA NA 2 40.35052 NA TRUE TRUE TRUE attributes.Outdoor Seating attributes.Price Range type 1 NA NA business 2 FALSE 1 business
たとえば、'Yelp Academic Dataset Business'データは、Yelpに登録されているお店の情報がNDJSON形式で保存されている。ここからダウンロードできる。
is_ndjson(path = "yelp_academic_dataset_business.json") [1] TRUE df <- jsonlite::stream_in(file("/Users/aki/Desktop/yelp_academic_dataset_business.json")) opening file input connection. Imported 61184 records. Simplifying... closing file input connection. 警告メッセージ: parse_string(txt, bigint_as_char) で: 使われていないコネクション 4 (/Users/aki/Desktop/ndjson_fromat.json) を閉じます df %>% as_tibble() # A tibble: 61,184 x 15 business_id full_address hours$Tuesday$c… $$open $Friday$close $$open $Monday$close $$open $Wednesday$close $$open $Thursday$close $$open $Sunday$close $$open $Saturday$close $$open open categories city <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <lgl> <list> <chr> 1 vcNAWiLM4d… "4840 E Ind… 17:00 08:00 17:00 08:00 17:00 08:00 17:00 08:00 17:00 08:00 NA NA NA NA TRUE <chr [2]> Phoe… 2 UsFtqoBl7n… "202 McClur… NA NA NA NA NA NA NA NA NA NA NA NA NA NA TRUE <chr [1]> Drav… 3 cE27W9VPgO… "1530 Hamil… NA NA NA NA NA NA NA NA NA NA NA NA NA NA FALSE <chr [3]> Beth… 4 HZdLhv6COC… "301 S Hill… 21:00 10:00 21:00 10:00 21:00 10:00 21:00 10:00 21:00 10:00 18:00 11:00 21:00 10:00 TRUE <chr [6]> Pitt… 5 mVHrayjG3u… "414 Hawkin… 19:00 10:00 20:00 10:00 NA NA 19:00 10:00 19:00 10:00 NA NA 16:00 10:00 TRUE <chr [5]> Brad… 6 KayYbHCt-R… "141 Hawtho… NA NA NA NA NA NA NA NA NA NA NA NA NA NA TRUE <chr [4]> Carn… 7 b12U9TFESS… "718 Hope H… NA NA NA NA NA NA NA NA NA NA NA NA NA NA TRUE <chr [2]> Carn… 8 Sktj1eHQFu… "920 Forsyt… NA NA NA NA NA NA NA NA NA NA NA NA NA NA TRUE <chr [2]> Carn… 9 3ZVKmuK2l7… "8 Logan St… NA NA NA NA NA NA NA NA NA NA NA NA NA NA TRUE <chr [2]> Carn… 10 wJr6kSA5dc… "2100 Washi… 02:00 08:00 02:00 08:00 02:00 08:00 02:00 08:00 02:00 08:00 02:00 08:00 02:00 08:00 TRUE <chr [4]> Carn… # … with 61,174 more rows, and 86 more variables: review_count <int>, name <chr>, neighborhoods <list>, longitude <dbl>, state <chr>, stars <dbl>, latitude <dbl>, attributes$`By Appointment Only` <lgl>, $`Happy # Hour` <lgl>, $`Accepts Credit Cards` <list>, $`Good For Groups` <lgl>, $`Outdoor Seating` <lgl>, $`Price Range` <int>, $`Good for Kids` <lgl>, $Alcohol <chr>, $`Noise Level` <chr>, $`Has TV` <lgl>, # $Attire <chr>, $Ambience$romantic <lgl>, $$intimate <lgl>, $$classy <lgl>, $$hipster <lgl>, $$divey <lgl>, $$touristy <lgl>, $$trendy <lgl>, $$upscale <lgl>, $$casual <lgl>, $`Good For Dancing` <lgl>, # $Delivery <lgl>, $`Coat Check` <lgl>, $Smoking <chr>, $`Take-out` <lgl>, $`Takes Reservations` <lgl>, $`Waiter Service` <lgl>, $`Wi-Fi` <chr>, $Caters <lgl>, $`Good For`$dessert <lgl>, $$latenight <lgl>, # $$lunch <lgl>, $$dinner <lgl>, $$breakfast <lgl>, $$brunch <lgl>, $Parking$garage <lgl>, $$street <lgl>, $$validated <lgl>, $$lot <lgl>, $$valet <lgl>, $Music$dj <lgl>, $$background_music <lgl>, # $$karaoke <lgl>, $$live <lgl>, $$video <lgl>, $$jukebox <lgl>, $$playlist <lgl>, $`Drive-Thru` <lgl>, $`Wheelchair Accessible` <lgl>, $BYOB <lgl>, $Corkage <lgl>, $`BYOB/Corkage` <chr>, $`Order at # Counter` <lgl>, $`Good For Kids` <lgl>, $`Dogs Allowed` <lgl>, $`Open 24 Hours` <lgl>, $`Hair Types Specialized In`$coloring <lgl>, $$africanamerican <lgl>, $$curly <lgl>, $$perms <lgl>, $$kids <lgl>, # $$extensions <lgl>, $$asian <lgl>, $$straightperms <lgl>, $`Accepts Insurance` <lgl>, $`Ages Allowed` <chr>, $`Payment Types`$amex <lgl>, $$cash_only <lgl>, $$mastercard <lgl>, $$visa <lgl>, $$discover <lgl>, # $`Dietary Restrictions`$`dairy-free` <lgl>, $$`gluten-free` <lgl>, $$vegan <lgl>, $$kosher <lgl>, $$halal <lgl>, $$`soy-free` <lgl>, $$vegetarian <lgl>, type <chr>
階層構造の整理
JSONを読み込むとき、おおよそはネストしていることが多い。例えば、先程読み込んだデータは一見、データフレームに見える。
df %>% dplyr::select(1:3) %>% dplyr::slice(1:5) business_id full_address hours.Tuesday.close hours.Tuesday.open hours.Friday.close hours.Friday.open 1 vcNAWiLM4dR7D2nwwJ7nCA 4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018 17:00 08:00 17:00 08:00 2 UsFtqoBl7naz8AVUBZMjQQ 202 McClure St\nDravosburg, PA 15034 <NA> <NA> <NA> <NA> 3 cE27W9VPgO88Qxe4ol6y_g 1530 Hamilton Rd\nBethel Park, PA 15234 <NA> <NA> <NA> <NA> 4 HZdLhv6COCleJMo7nPl-RA 301 S Hills Vlg\nPittsburgh, PA 15241 21:00 10:00 21:00 10:00 5 mVHrayjG3uZ_RLHkLj-AMg 414 Hawkins Ave\nBraddock, PA 15104 19:00 10:00 20:00 10:00 hours.Monday.close hours.Monday.open hours.Wednesday.close hours.Wednesday.open hours.Thursday.close hours.Thursday.open hours.Sunday.close hours.Sunday.open 1 17:00 08:00 17:00 08:00 17:00 08:00 <NA> <NA> 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 4 21:00 10:00 21:00 10:00 21:00 10:00 18:00 11:00 5 <NA> <NA> 19:00 10:00 19:00 10:00 <NA> <NA> hours.Saturday.close hours.Saturday.open 1 <NA> <NA> 2 <NA> <NA> 3 <NA> <NA> 4 21:00 10:00 5 16:00 10:00
これに対して、そのまま変数名を引っ張ろうとするとエラーになる。
df %>% dplyr::select(hours.Tuesday.close) .f(.x[[i]], ...) でエラー: オブジェクト 'hours.Tuesday.close' がありません
理由は、str()
で構造を見てみると、hours >> Tuesday >> open/close
となっておりJSONの階層構造がそのまま反映されているため。
df %>% dplyr::select(1:3) %>% dplyr::slice(1:5) %>% str() 'data.frame': 5 obs. of 3 variables: $ business_id : chr "vcNAWiLM4dR7D2nwwJ7nCA" "UsFtqoBl7naz8AVUBZMjQQ" "cE27W9VPgO88Qxe4ol6y_g" "HZdLhv6COCleJMo7nPl-RA" ... $ full_address: chr "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018" "202 McClure St\nDravosburg, PA 15034" "1530 Hamilton Rd\nBethel Park, PA 15234" "301 S Hills Vlg\nPittsburgh, PA 15241" ... $ hours :'data.frame': 5 obs. of 7 variables: ..$ Tuesday :'data.frame': 5 obs. of 2 variables: .. ..$ close: chr "17:00" NA NA "21:00" ... .. ..$ open : chr "08:00" NA NA "10:00" ... ..$ Friday :'data.frame': 5 obs. of 2 variables: .. ..$ close: chr "17:00" NA NA "21:00" ... .. ..$ open : chr "08:00" NA NA "10:00" ... ..$ Monday :'data.frame': 5 obs. of 2 variables: .. ..$ close: chr "17:00" NA NA "21:00" ... .. ..$ open : chr "08:00" NA NA "10:00" ... ..$ Wednesday:'data.frame': 5 obs. of 2 variables: .. ..$ close: chr "17:00" NA NA "21:00" ... .. ..$ open : chr "08:00" NA NA "10:00" ... ..$ Thursday :'data.frame': 5 obs. of 2 variables: .. ..$ close: chr "17:00" NA NA "21:00" ... .. ..$ open : chr "08:00" NA NA "10:00" ... ..$ Sunday :'data.frame': 5 obs. of 2 variables: .. ..$ close: chr NA NA NA "18:00" ... .. ..$ open : chr NA NA NA "11:00" ... ..$ Saturday :'data.frame': 5 obs. of 2 variables: .. ..$ close: chr NA NA NA "21:00" ... .. ..$ open : chr NA NA NA "10:00" ...
値を取り出すために、色々な面倒がある。
df$hours$Tuesday$close[1:5] [1] "17:00" NA NA "21:00" "19:00"
なので、jsonlite::flatten()
で階層構造をフラットにする。見た目は変わらないが、階層構造はなくなっている。すごく便利な関数。
df %>% dplyr::select(1:3) %>% dplyr::slice(1:5) %>% jsonlite::flatten() %>% str() 'data.frame': 5 obs. of 16 variables: $ business_id : chr "vcNAWiLM4dR7D2nwwJ7nCA" "UsFtqoBl7naz8AVUBZMjQQ" "cE27W9VPgO88Qxe4ol6y_g" "HZdLhv6COCleJMo7nPl-RA" ... $ full_address : chr "4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018" "202 McClure St\nDravosburg, PA 15034" "1530 Hamilton Rd\nBethel Park, PA 15234" "301 S Hills Vlg\nPittsburgh, PA 15241" ... $ hours.Tuesday.close : chr "17:00" NA NA "21:00" ... $ hours.Tuesday.open : chr "08:00" NA NA "10:00" ... $ hours.Friday.close : chr "17:00" NA NA "21:00" ... $ hours.Friday.open : chr "08:00" NA NA "10:00" ... $ hours.Monday.close : chr "17:00" NA NA "21:00" ... $ hours.Monday.open : chr "08:00" NA NA "10:00" ... $ hours.Wednesday.close: chr "17:00" NA NA "21:00" ... $ hours.Wednesday.open : chr "08:00" NA NA "10:00" ... $ hours.Thursday.close : chr "17:00" NA NA "21:00" ... $ hours.Thursday.open : chr "08:00" NA NA "10:00" ... $ hours.Sunday.close : chr NA NA NA "18:00" ... $ hours.Sunday.open : chr NA NA NA "11:00" ... $ hours.Saturday.close : chr NA NA NA "21:00" ... $ hours.Saturday.open : chr NA NA NA "10:00" ... df %>% dplyr::select(1:3) %>% dplyr::slice(1:5) %>% jsonlite::flatten() business_id full_address hours.Tuesday.close hours.Tuesday.open hours.Friday.close hours.Friday.open 1 vcNAWiLM4dR7D2nwwJ7nCA 4840 E Indian School Rd\nSte 101\nPhoenix, AZ 85018 17:00 08:00 17:00 08:00 2 UsFtqoBl7naz8AVUBZMjQQ 202 McClure St\nDravosburg, PA 15034 <NA> <NA> <NA> <NA> 3 cE27W9VPgO88Qxe4ol6y_g 1530 Hamilton Rd\nBethel Park, PA 15234 <NA> <NA> <NA> <NA> 4 HZdLhv6COCleJMo7nPl-RA 301 S Hills Vlg\nPittsburgh, PA 15241 21:00 10:00 21:00 10:00 5 mVHrayjG3uZ_RLHkLj-AMg 414 Hawkins Ave\nBraddock, PA 15104 19:00 10:00 20:00 10:00 hours.Monday.close hours.Monday.open hours.Wednesday.close hours.Wednesday.open hours.Thursday.close hours.Thursday.open hours.Sunday.close hours.Sunday.open 1 17:00 08:00 17:00 08:00 17:00 08:00 <NA> <NA> 2 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> 4 21:00 10:00 21:00 10:00 21:00 10:00 18:00 11:00 5 <NA> <NA> 19:00 10:00 19:00 10:00 <NA> <NA> hours.Saturday.close hours.Saturday.open 1 <NA> <NA> 2 <NA> <NA> 3 <NA> <NA> 4 21:00 10:00 5 16:00 10:00
階層構造をフラットにすれば、そのまま変数名を引っ張れる。
df %>% jsonlite::flatten() %>% dplyr::select(hours.Tuesday.close) %>% dplyr::slice(1:5) hours.Tuesday.close 1 17:00 2 <NA> 3 <NA> 4 21:00 5 19:00
データフレームとティブル
JOSNを扱う上でもう1つ面倒なのでが、ネストしている値の見え方。データフレームとティブルでは色々違うので、基本はティブルを推奨。まずはデータフレームで見て見ると、categories
はカンマ区切りになっているようみ見える。
head(df["categories"], 5) categories 1 Doctors, Health & Medical 2 Nightlife 3 Active Life, Mini Golf, Golf 4 Shopping, Home Services, Internet Service Providers, Mobile Phones, Professional Services, Electronics 5 Bars, American (New), Nightlife, Lounges, Restaurants
なので、これを分割してみると、なんかc()
で結合されているっぽい。
head(df["categories"], 5) %>% tidyr::separate(categories, into = c("col1", "col2", "col3", "col4", "col5", "col6"), sep = ",") col1 col2 col3 col4 col5 col6 1 c("Doctors" "Health & Medical") <NA> <NA> <NA> <NA> 2 Nightlife <NA> <NA> <NA> <NA> <NA> 3 c("Active Life" "Mini Golf" "Golf") <NA> <NA> <NA> 4 c("Shopping" "Home Services" "Internet Service Providers" "Mobile Phones" "Professional Services" "Electronics") 5 c("Bars" "American (New)" "Nightlife" "Lounges" "Restaurants") <NA> 警告メッセージ: Expected 6 pieces. Missing pieces filled with `NA` in 4 rows [1, 2, 3, 5].
その理由は、これはもともとリストなので、データフレームが,
でつないで、表示しているだけ…なんと。as_tibble()
すると、リストのまま表示してくれるので、色々とトラブルは回避できそう。
df_flat <- df %>% jsonlite::flatten() %>% as_tibble() df_flat %>% select(categories) %>% head(5) # A tibble: 5 x 1 categories <list> 1 <chr [2]> 2 <chr [1]> 3 <chr [3]> 4 <chr [6]> 5 <chr [5]>
データフレームのリストであれば、mutate(hoge = map())
で色々できるので、あとはよしなにどうぞ。拙文ではあるが{purrr}で覗くStarWarsの世界とか参考になるかもしれません。
df_flat %>% head(5) %>% dplyr::select(business_id, categories) %>% dplyr::mutate(length = purrr::map_int(.x = categories, .f = function(x){length(x)}), flg_Bars = purrr::map_lgl(.x = categories, .f = function(x){ any(str_detect(string = x, pattern = "Bars")) }) )%>% arrange(business_id) # A tibble: 5 x 4 business_id categories length flg_Bars <chr> <list> <int> <lgl> 1 cE27W9VPgO88Qxe4ol6y_g <chr [3]> 3 FALSE 2 HZdLhv6COCleJMo7nPl-RA <chr [6]> 6 FALSE 3 mVHrayjG3uZ_RLHkLj-AMg <chr [5]> 5 TRUE 4 UsFtqoBl7naz8AVUBZMjQQ <chr [1]> 1 FALSE 5 vcNAWiLM4dR7D2nwwJ7nCA <chr [2]> 2 FALSE
business_id
がmVHrayjG3uZ_RLHkLj-AMg
にはBars
が含まれているようなので、バラして確認する…separate()
のinto
のところが嫌な感じ。なので、本番で仮にいるならコンマの最大数とかで渡す工夫するか。確かにあるっぽいので、問題ない。
df_flat %>% head(5) %>% select(business_id, categories) %>% unnest(cols = c(categories)) %>% group_by(business_id) %>% summarise(categories_list = paste0(categories, collapse = ",")) %>% tidyr::separate(categories_list, into = paste0("cate", 1:6), sep = ",") %>% arrange(business_id) A tibble: 5 x 7 business_id cate1 cate2 cate3 cate4 cate5 cate6 <chr> <chr> <chr> <chr> <chr> <chr> <chr> 1 cE27W9VPgO88Qxe4ol6y_g Active Life Mini Golf Golf NA NA NA 2 HZdLhv6COCleJMo7nPl-RA Shopping Home Services Internet Service Providers Mobile Phones Professional Services Electronics 3 mVHrayjG3uZ_RLHkLj-AMg Bars American (New) Nightlife Lounges Restaurants NA 4 UsFtqoBl7naz8AVUBZMjQQ Nightlife NA NA NA NA NA 5 vcNAWiLM4dR7D2nwwJ7nCA Doctors Health & Medical NA NA NA NA 警告メッセージ: