Introduction

Business is the cornerstone of prosperity in society. Companies create the resources that permit social development and welfare. In the background of number of companies is growing bigger and changing rapidly, we can not know structure of their distribution among industries if we do not conduct effective analysis.

In order to get to understand business pattern in the society and support questions such as which states are with the largest scale of business, what is payroll status of different industries like, what are the competitive industries of New York state, this study is designed to help.

It is intriguing and meaningful to explore deep into massive data to get holistic perception of the society’s business status and comprehensive understanding of some particular industries. In this study, figures and interactive graphics will be produced to get knowledge about industries status among different states using indexes like employment, annual payroll and number of firms. The relationship between employment and annual payroll will be discussed. Industries development status in New York state will be focused on. And actual data will be used to identify spatial autocorrelation of information industry.


Materials and methods

Packages

Here are all the packages needed for the project:

library(readxl)
library(ggplot2)
library(highcharter)
library(dplyr)
library(viridisLite)
library(forecast)
library(treemap)
library(flexdashboard)
library(leaflet)
library(sp)
library(geojson)
library(geojsonio)
library(rgdal)
library(foreach)
library(rbokeh)
library(plotly)
library(raster)
library(spdep)
library(knitr)
library(sf)
library(widgetframe)
knitr::opts_chunk$set(cache = TRUE)  # cache the results for quick compiling


Data

Original data including American Annual Payroll and Employment by states (2015), American Annual Payroll and Employment among industries by states (2015), the US states polygon data (geojson). Modifications are made to make these data more suitable for the project.

1. Total number of firms among different states:

if (!file.exists("./data/total_2015.xlsx"))
{
download.file("https://www2.census.gov/programs-surveys/susb/tables/2015/us_state_totals_2015.xlsx",destfile = "./data/total_2015.xlsx",mode="wb")
}
dataset <- read_excel("./data/total_2015.xlsx",1,skip=16)
colnames(dataset) <- c("FIPS","Geo_Des","Emp_Size","NFirms","NEst","NEmp","Flag","Ann_Pay","Ann_Flag")
subset <- dataset[dataset$Emp_Size=="01:  Total",]
knitr::kable(head(subset), caption = "") 
FIPS Geo_Des Emp_Size NFirms NEst NEmp Flag Ann_Pay Ann_Flag
01 Alabama 01: Total 73409 98540 1634391 G 67370353 G
02 Alaska 01: Total 16952 20907 267999 G 15643303 G
04 Arizona 01: Total 105004 136352 2295186 G 102671393 G
05 Arkansas 01: Total 50451 65175 1003113 G 39451191 G
06 California 01: Total 740303 908120 14325377 G 856954246 G
08 Colorado 01: Total 133930 161737 2253795 G 117539555 G

2. Geojson Data with interested industries:

df <- read.csv(file="./data/ann_industry.csv",header=TRUE,skip=2)
colnames(df) <- c("GEOID","ID","GEO_NAME","NAICS","INDUSTRY","YEAR","ESTAB","EMP","PAYQ1","PAYANN")

url <- "http://leafletjs.com/examples/choropleth/us-states.js"
doc <- readLines(url)
doc2 <- gsub("var statesData = ", "", doc)
fn<-tempfile()
write(doc2,fn)
write(doc2, file = "data/us-states.geojson")
poly <- readOGR(dsn=fn)
## OGR data source with driver: GeoJSON 
## Source: "C:\Users\timwe\AppData\Local\Temp\Rtmpiw2HnI\file303c255d6afe", layer: "file303c255d6afe"
## with 52 features
## It has 3 fields
states <- geojsonio::geojson_read("data/us-states.geojson", what = "sp")
for(i in 1:nrow(poly))
{
poly[i,'Agriculture'] <- as.numeric(df[which(df$GEO_NAME==as.character(poly[i,]$name) & df$INDUSTRY=='Agriculture, forestry, fishing and hunting'),][1,10])
poly[i,'Utilities'] <- as.numeric(df[which(df$GEO_NAME==as.character(poly[i,]$name) & df$INDUSTRY=='Utilities'),][1,10])
poly[i,'Manufacturing'] <- as.numeric(df[which(df$GEO_NAME==as.character(poly[i,]$name) & df$INDUSTRY=='Manufacturing'),][1,10])
poly[i,'Information'] <- as.numeric(df[which(df$GEO_NAME==as.character(poly[i,]$name) & df$INDUSTRY=='Information'),][1,10])
poly[i,'HealthCare'] <- as.numeric(df[which(df$GEO_NAME==as.character(poly[i,]$name) & df$INDUSTRY=='Health care and social assistance'),][1,10])
poly[i,'Arts'] <- as.numeric(df[which(df$GEO_NAME==as.character(poly[i,]$name) & df$INDUSTRY=='Arts, entertainment, and recreation'),][1,10])
}

3. Status of industries among different states:

df2 <- read.csv(file="./data/ann_industryc.csv",header=TRUE)
colnames(df2) <- c("GEOID","ID","GEO_NAME","NAICS","INDUSTRY","YEAR","ESTAB","EMP","PAYQ1","PAYANN") 
knitr::kable(head(df2), caption = "") 
GEOID ID GEO_NAME NAICS INDUSTRY YEAR ESTAB EMP PAYQ1 PAYANN
0400000US04 4 Arizona 3364 Aerospace product and parts manufacturing 2015 79 19795 514076 1900315
0400000US04 4 Arizona 11 Agriculture, forestry, fishing and hunting 2015 203 1610 14158 56552
0400000US04 4 Arizona 481 Air transportation 2015 109 17052 331345 1254002
0400000US04 4 Arizona 336411 Aircraft manufacturing 2015 9 h D D
0400000US04 4 Arizona 451211 Book stores 2015 141 1811 6038 24646
0400000US04 4 Arizona 51 Information 2015 2121 49040 904992 3551580

Methods

Different kinds of interactive visualization, spatial autocorrelation are used as main methods in the project.


Results

Numbers of Firms by State

Scale of business among states can be somehow recognized based on the number of firms. Interactive map showed below provides preliminary knowledge of business scale.

thm <- 
  hc_theme(
    colors = c("#1a6ecc", "#434348", "#90ed7d"),
    chart = list(
      backgroundColor = "transparent",
      style = list(fontFamily = "Source Sans Pro")
    ),
    xAxis = list(
      gridLineWidth = 1
    )
  )
n <- 4
colstops <- data.frame(
  q = 0:n/n,
  c = substring(viridis(n + 1), 0, 7)) %>%
  list_parse2()

highchart() %>%
  hc_add_series_map(usgeojson, subset, name = "Number of Firms",
                    value = "NFirms", joinBy = c("woename", "Geo_Des"),
                    dataLabels = list(enabled = TRUE,
                                      format = '{point.properties.postalcode}')) %>%
  hc_colorAxis(stops = colstops) %>%
  hc_legend(valueDecimals = 0, valueSuffix = "%") %>%
  hc_mapNavigation(enabled = TRUE) %>%
  hc_add_theme(thm)

Payroll and Employment

Two interactive figures showed below are to provides some more detailed knowledge of business scale among states based indexes of annual payroll and employment. The first fugure uses area of polygon to represent the number of annual payroll and the unit is 1000 dollars. The second figure lists states with top 10 employment. It is easy to find that ranks of annual payroll and employment are similar.

Annual Payroll by State

tm <- treemap(subset, index = c("Emp_Size", "Geo_Des"),
              vSize = "Ann_Pay", vColor = "Ann_Pay",
              type = "value", palette = rev(viridis(6)),title="2017 Annual Payroll of all firms by state")

highchart() %>% 
  hc_add_series_treemap(tm, allowDrillToNode = TRUE,
                        layoutAlgorithm = "squarified") %>% 
  hc_add_theme(thm)

Total Employment by State (Top 10)

set.seed(2)

nprods <- 10
barData <- subset[order(subset$NEmp,decreasing=TRUE),][1:10,]
barData %>% 
  .$Geo_Des %>% 
  rep(times = barData$NEmp) %>% 
  factor(levels = unique(.)) %>% 
  hchart(showInLegend = FALSE, name = "NEmp", pointWidth = 10) %>% 
  hc_add_theme(thm) %>% 
  hc_chart(type = "bar")

Relationship between Payroll and Employment

From the analysis above, phenomenon that ranks of annual payroll and employment are similar is found. Hypothesis that annual payroll and employment are positive correlated can be come up with. In order to judge the correctness of the hypothesis, annual payroll and emplyment data of 18 states is used to figure out rules between annual payroll and employment. Axix-X shows the number of employment and axix-Y shows the number of annual payroll. Different shapes of points represent data in different state.

figure(width=800,height=800) %>% 
  ly_points(EMP,PAYANN,data=df2,color=GEO_NAME,glyph=GEO_NAME,hover=list(INDUSTRY,EMP,PAYANN)) %>% y_axis(visible=FALSE) %>% x_axis(visible=FALSE)

Result shows that annual payroll and employment in most states are not linear related. So that conclustion can be there is no direct connection between payroll and employment.

Payroll of interested industries

More detailed information about business is needed. Interactive map showed below displays selected industries’ payroll among different states. It is obvious to know that which industries are leading industries of different states.

  leaflet(poly) %>% addTiles() %>%
  addPolygons(label=paste(poly$name," (Agriculture=",poly$Agriculture,")"),
              group = "Agriculture",
              color = "#444444", 
              weight = 0.1, 
              smoothFactor = 0.5,
              opacity = 1.0, 
              fillOpacity = 0.5,
              fillColor = ~colorQuantile("YlOrRd", Agriculture)(Agriculture),
              highlightOptions = highlightOptions(color = "white", weight = 2,
                bringToFront = TRUE)) %>%
  addPolygons(label=paste(poly$name," (Utilities=",poly$Utilities,")"),
              group = "Utilities",
              color = "#444444", 
              weight = 0.1, 
              smoothFactor = 0.5,
              opacity = 1.0, 
              fillOpacity = 0.5,
              fillColor = ~colorQuantile("YlOrRd", Utilities)(Utilities),
              highlightOptions = highlightOptions(color = "white", weight = 2,
                bringToFront = TRUE)) %>%
  addPolygons(label=paste(poly$name," (Manufacturing=",poly$Manufacturing,")"),
              group = "Manufacturing",
              color = "#444444", 
              weight = 0.1, 
              smoothFactor = 0.5,
              opacity = 1.0, 
              fillOpacity = 0.5,
              fillColor = ~colorQuantile("YlOrRd", Manufacturing)(Manufacturing),
              highlightOptions = highlightOptions(color = "white", weight = 2,
                bringToFront = TRUE)) %>%
  addPolygons(label=paste(poly$name," (Information=",poly$Information,")"),
              group = "Information",
              color = "#444444", 
              weight = 0.1, 
              smoothFactor = 0.5,
              opacity = 1.0, 
              fillOpacity = 0.5,
              fillColor = ~colorQuantile("YlOrRd", Information)(Information),
              highlightOptions = highlightOptions(color = "white", weight = 2,
                bringToFront = TRUE)) %>%
   addPolygons(label=paste(poly$name," (HealthCare=",poly$HealthCare,")"),
              group = "HealthCare",
              color = "#444444", 
              weight = 0.1, 
              smoothFactor = 0.5,
              opacity = 1.0, 
              fillOpacity = 0.5,
              fillColor = ~colorQuantile("YlOrRd", HealthCare)(HealthCare),
              highlightOptions = highlightOptions(color = "white", weight = 2,
                bringToFront = TRUE)) %>%
   addPolygons(label=paste(poly$name," (Arts=",poly$Arts,")"),
              group = "Arts",
              color = "#444444", 
              weight = 0.1, 
              smoothFactor = 0.5,
              opacity = 1.0, 
              fillOpacity = 0.5,
              fillColor = ~colorQuantile("YlOrRd", Arts)(Arts),
              highlightOptions = highlightOptions(color = "white", weight = 2,
                bringToFront = TRUE)) %>%
  addLayersControl(
    baseGroups = c("Agriculture", "Utilities","Manufacturing","Information","HealthCare","Arts"),
    options = layersControlOptions(collapsed = FALSE)
  )%>%
addMiniMap()

Industries in New York State

Interesting phenomenan are found when some selected industries data of New York state is displayed. The first histogram displays annual payroll among different industries and the second displays employment among different industries. Some more information can be explored When comparing these two histograms. For some more analysis, please see the 4th point in the part Conclusions.

plot_ly(df2[which(df2$GEO_NAME=='New York'),],type="bar",x=~INDUSTRY,y=~PAYANN,color=~INDUSTRY) %>% 
 layout(xaxis = list(showline = F, 
                      showticklabels = F, 
                      fixedrange = T, 
                      title = ""),
         yaxis = list(visible=FALSE))
plot_ly(df2[which(df2$GEO_NAME=='New York'),],type="bar",x=~INDUSTRY,y=~EMP,color=~INDUSTRY) %>% 
 layout(xaxis = list(showline = F, 
                      showticklabels = F, 
                      fixedrange = T, 
                      title = ""),
         yaxis = list(visible=FALSE))

Spatial Autocorrelation of Information Industry

Information Industry is a special industry which is not limited too much by geographical factors when compared with traditional industries. But it could be more convincing to prove it with real data. Here only data of states in the continent US is anaylzed since only those states are geographical adjacent.

Main goal of the calculation below is to obtain the Moran’s I value and p-value which can be analyzed later to figure our whether information industry is spatial autocorrelated.

Adjacency is used as criterion to determine which polygons are “near” and to quantify that.

Function poly2nb is used to create the object of “neighbour”.

poly2 <- poly[-which(poly$name == 'Puerto Rico'|poly$name =='Alaska'|poly$name =='Hawaii'),]
head(data.frame(poly2))
##   id        name density Agriculture Utilities Manufacturing Information
## 0 01     Alabama   94.65       18813      9097          4410       12510
## 2 04     Arizona   57.05       33876      3275         42194       24880
## 3 05    Arkansas   56.43       12750     35891         37379        5479
## 4 06  California  241.70          92     38488         39397       41722
## 5 08    Colorado   49.33       34460     42376         39105       38797
## 6 09 Connecticut  739.10        4366     42990          2047       25185
##   HealthCare  Arts
## 0       1956 23950
## 2       9504  8204
## 3      38140 12722
## 4        987  9754
## 5       6834 10977
## 6       5590 39265
w <- poly2nb(poly2,row.names=poly2$id)
class(w)
## [1] "nb"

Plot the links between the polygons, the links actually show the data stored in variable w.

plot(poly2,col='gray',border='blue',lwd=2)
xy <- coordinates(poly2)
plot(w,xy,col='red',lwd=2,add=TRUE)

w can be transformed into a spatial weights matrix. A spatial weights matrix reflects the intensity of the geographic relationship between observations

wm <- nb2mat(w,style='B',zero.policy=TRUE)

Now Moran’s index of spatial autocorrelation can be computed in terms of the formular listed below.

n<- length(poly2)
y <- poly2$Information
ybar <- mean(y)

dy <- y-ybar
g <- expand.grid(dy,dy)
yiyj <- g[,1]*g[,2]

pm <- matrix(yiyj,ncol=n)
pmw <- pm*wm

spmw <- sum(pmw)
spmw
## [1] -4525772619
smw <- sum(wm)
sw <- spmw/smw

vr <- n/sum(dy^2)

MI <- vr*sw
MI
## [1] -0.1331731
EI <- -1/(n-1)
EI
## [1] -0.02083333
ww <- nb2listw(w,style='B',zero.policy = TRUE)
print(ww,zero.policy=TRUE)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 49 
## Number of nonzero links: 218 
## Percentage nonzero weights: 9.07955 
## Average number of links: 4.44898 
## 
## Weights style: B 
## Weights constants summary:
##    n   nn  S0  S1   S2
## B 49 2401 218 436 4392

As the rusult of Moran I is obtained (which equals to -0.133173133), function moran.test is used to verify.

moran.test(poly2$Information,ww,randomisation = FALSE,zero.policy = TRUE)
## 
##  Moran I test under normality
## 
## data:  poly2$Information  
## weights: ww  
## 
## Moran I statistic standard deviate = -1.2477, p-value = 0.8939
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.133173133      -0.020833333       0.008107271

Verification shows the rusult is correct and p value is obtained simultaneously. Analysis of the result will be conduct in the conclusion part.

Conclusions

Analysze the results above, it is obviously that:

1. Number of firms:
+ California, Texas, New York, Florida are on the top 4 on the rank of number of firms

2. Payroll and Employment:
+ California, Texas, New York, Florida are also the top 4 of the rank of both payroll and employment

3. Relationship between Payroll and Employment: No direct relationship

4. Industries in New York State:
+ Most industries are on the same level developed
+ Some industries like accommodation and television broadcasting are with little employment but large payroll
+ In contrast, some industries like agriculture, aircraft manufacturing and book stores are with relatively large emplyment but smaller payroll.

In terms of Moran’s theory about spatial autocorrelation, Moran’s I > 0 represents positive spatial autocorrelated and Moran’s I < 0 represents negative spatial autocorrelated. And when Moran’s I is close to zero, it represents value distributes spatial randomly.

As to value p, the treshold value is normally set as 0.05, if p is greater than threshold, the larger the value, the more evidence that the so-called null hypothesis can be approved, which indicates the value is randomly distributed.

In our case, MI= -0.1331731 is near to 0 and p=0.8939 is much larger than 0.05, so that we cannot reject the null hypothesis. It is quite possible that the spatial distribution of the scale of information industry is the result of random spatial processes.

References

[1] GE, Ying, et al. “Application of spatial autocorrelation for the spatial patterns of urbanization and localization economy [J].” Human Geography 3 (2005): 21-25.

[2] Juhn, Chinhui, and Simon Potter. “Explaining the recent divergence in payroll and household employment growth.” (1999).

[3] Shen, Chenhua, Chaoling Li, and Yali Si. “Spatio-temporal autocorrelation measures for nonstationary series: A new temporally detrended spatio-temporal Moran’s index.” Physics Letters A 380.1 (2016): 106-116.