Saturday, 8 December, 2018 UTC


Summary

“Nothing is certain except for death and taxes.”
This isn’t a dataset made with a bed of roses or manicured green grass. A bit more serious. Let’s see if we can quickly learn anything here. The dataset is the following.
“name” : “NCHS – Leading Causes of Death: United States”,
“attribution” : “National Center for Health Statistics”,
The public dataset is available at https://data.cdc.gov/api/views/bi63-dtpu/rows.json?accessType=DOWNLOAD
Step 1: Download the file into a local file (e.g. health.json). Upload this file to one of the nodes in the Couchbase cluster.
Step 2: import the data into a bucket called cause.  After you create the bucket, create the primary index. You’ll need this for querying.
/opt/couchbase/bin/cbimport json -c couchbase://127.0.0.1 -u Administrator -p password -b cause -d file://health.json -g cause:0 -f sample
CREATE PRIMARY INDEX ON cause;
Step 3. Inspect the structure of the data.
All of the data is provided in a SINGLE JSON document.  Because of this, INFER doesn’t help.  You’ll have to inspect and understand the structure manually. This data in typical government dataset with a lot of data in simple arrays with the meaning of each entity given at in the metadata.
Simple array:
<strong>select data from cause ;</strong>
 
 
 
 
 
 
 
 
This simply contains an array of data without the schema.  For the public datasets, the schema is in the meta field.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Let’s transform the structure into simple JSON key-value pairs so we can handle these bit more effectively.  You can learn more about how this magic happened in this article.
WITH cs AS (
  SELECT
    meta.`view`.columns [*].fieldName f,
    data
  FROM
    cause
)
SELECT
  o
FROM
  cs UNNEST cs.data AS d1 
LET o = OBJECT p :d1 [ARRAY_POSITION(cs.f, p)] FOR p IN cs.f END;
 
 
 
 
 
 
 
 
Task1: Find out the cause for most deaths in a state, by year.
The common table expression (CTE) in the WITH clause (csdata) transforms the complex json data into flat JSON.   You can do this dynamically or do this once and INSERT back into a bucket, as I’ve discussed in the article on New York baby names. In this article, I use CTEs.
WITH csdata as (
  WITH cs AS (
    SELECT
      meta.`view`.columns [*].fieldName f,
      data
    FROM
      cause
  )
  SELECT
    o
  FROM
    cs UNNEST cs.data AS d1 LET o = OBJECT p :d1 [ARRAY_POSITION(cs.f, p)] FOR p IN cs.f END
)
SELECT
  c.o.state,
  c.o.year,
  c.o.cause_name,
  COUNT(c.o.cause_name),
  SUM(TONUMBER(c.o.deaths)) totdeaths
FROM
  csdata as c
WHERE
  c.o.state &lt;&gt; "United States"
  and c.o.cause_name &lt;&gt; "All causes"
GROUP BY
  c.o.state,
  c.o.year,
  c.o.cause_name
ORDER BY
  totdeaths DESC,
  c.o.state,
  c.o.year
 
 
 
 
 
 
 
 
In this case, all the deaths in California come on top, mainly due to its population.
Task 2. Find out leading causes of death in each state for the year 2016.
Query 2: Use the resultset from the previous query and then use the FIRST_VALUE() window function to determine the top cause.  Partitioning by state (in the OVER BY clause) will give you the partitions by state and ORDER BY dx.totdeaths within the OVER BY clause will give you the top cause in every state.
WITH csdata as (
  WITH cs AS (
    SELECT
      meta.`view`.columns [*].fieldName f,
      data
    FROM
      cause
  )
  SELECT
    o
  FROM
    cs UNNEST cs.data AS d1 LET o = OBJECT p :d1 [ARRAY_POSITION(cs.f, p)] FOR p IN cs.f END
),
d2 as(
SELECT
  c.o.state,
  c.o.year,
  c.o.cause_name,
  SUM(TONUMBER(c.o.deaths)) totdeaths
FROM
  csdata as c
WHERE
  c.o.state &lt;&gt; "United States"
  and c.o.cause_name &lt;&gt; "All causes"
  and c.o.year = "2016"
GROUP BY
  c.o.state,
  c.o.year,
  c.o.cause_name),
d3 as (
SELECT dx.state, dx.cause_name, dx.totdeaths,
  FIRST_VALUE(dx.cause_name) OVER(PARTITION BY dx.state ORDER BY dx.totdeaths DESC) topreason,
  FIRST_VALUE(dx.totdeaths) OVER(PARTITION BY dx.state ORDER BY dx.totdeaths DESC) topcount
FROM d2 dx)
SELECT d3
FROM d3
WHERE d3.topcount = d3.totdeaths
order by d3.state
 
 
 
 
 
 
 
 
Task 3.  Find out how the top reason has changed by the year, from 1999 to 2016 by state.
Query 3:  Simply generate the report for all the years (199-2016) and then determine the top reason and finally get the highest reason by grouping by state, year and getting MAX(topcount) for the topreason cause.
WITH csdata as (
  WITH cs AS (
    SELECT
      meta.`view`.columns [*].fieldName f,
      data
    FROM
      cause
  )
  SELECT
    o
  FROM
    cs UNNEST cs.data AS d1 LET o = OBJECT p :d1 [ARRAY_POSITION(cs.f, p)] FOR p IN cs.f END
),
d2 as(
SELECT
  c.o.state,
  c.o.year,
  c.o.cause_name,
  SUM(TONUMBER(c.o.deaths)) totdeaths
FROM
  csdata as c
WHERE
  c.o.state &lt;&gt; "United States" 
  and c.o.cause_name &lt;&gt; "All causes"
GROUP BY
  c.o.state,
  c.o.year,
  c.o.cause_name),
d3 as (
SELECT dx.state, dx.year,
  FIRST_VALUE(dx.cause_name) OVER(PARTITION BY dx.state, dx.year ORDER BY dx.totdeaths DESC ) topreason,
  FIRST_VALUE(dx.totdeaths) OVER(PARTITION BY dx.state, dx.year ORDER BY dx.totdeaths DESC) topcount
FROM d2 dx)
SELECT d3.state , d3.year , d3.topreason, max(d3.topcount) topcount
FROM d3
GROUP BY d3.state, d3.year, d3.topreason
order by d3.state, d3.year
Here’s the partial result.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Visualizing this gives us the following histogram.
 
 
The post JSON to Insights: Analyzing US healthcare Data. appeared first on The Couchbase Blog.