HPMS Basemap Part 1: Data Analysis to Determine Shape / Distribution

Intro

I'm currently trying to put an HPMS basemap together and have several pieces of code that scrape ESRI feature servers. As I started map rendering, I realized the data was sometimes unreliable across different states, so I decided I needed to look at the HPMS submissions more holistically, mainly at the underlying attribute data.

Source Data

The source data is located here. I utilize the ARNOLD_{STATE}2019 layers for GIS data and the HPMS_FULL{STATE}_2019 layers for attribute data. In this exercise we are using the HPMS FULL layers.

Downloading into CSV > GIS Form

The first thing I did was write something that exports the FULL state layers into a CSV file for easier tabular manipulation. After that I scraped the data into 50+ individual files totaling about 2.87 GB. Some states, for whatever reason, don't have data for a particular year/layer configuration. I'm not sure why, but I tried to choose the year with the most coverage.
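For reference, here's a minimal sketch of that kind of export, assuming a generic ArcGIS FeatureServer endpoint; the URL and output path are placeholders, and the real HPMS servers differ per state and year:

import csv
import requests

def export_layer_to_csv(layer_url, out_path, page_size=1000):
    # page through the layer's query endpoint, attributes only
    offset = 0
    writer = None
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        while True:
            params = {
                "where": "1=1",
                "outFields": "*",
                "returnGeometry": "false",
                "f": "json",
                "resultOffset": offset,
                "resultRecordCount": page_size,
            }
            features = requests.get(f"{layer_url}/query", params=params).json().get("features", [])
            if not features:
                break
            if writer is None:
                writer = csv.DictWriter(f, fieldnames=list(features[0]["attributes"].keys()))
                writer.writeheader()
            writer.writerows(feat["attributes"] for feat in features)
            offset += len(features)

# placeholder endpoint; the actual server URL and layer id vary per state
export_layer_to_csv("https://example.com/arcgis/rest/services/HPMS/FeatureServer/0", "csvs/AL.csv")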

Goal

The goal is to get a sense of how reliably the "route_signing" field is populated in the HPMS FULL layers, specifically so I can style and render map tiles based on it. Rendering is a time-consuming process, so having a good idea of what the data looks like before trial-and-error rendering is crucial. The route_signing data will then map to a field on the basemap data that dictates styling, or whether a feature should appear at a given tile.
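To make that mapping concrete, here's a hypothetical sketch of translating route_signing codes into style classes; the class names and fallback are invented for illustration, not the actual tile schema:

# hypothetical route_signing -> style class mapping (codes from the HPMS spec)
style_class = {
    2: "road-interstate",  # Interstate
    3: "road-us",          # U.S.
    4: "road-state",       # State
    6: "road-county",      # County
}

def style_for(route_signing):
    # unmapped or unsigned routes fall back to a generic style
    return style_class.get(route_signing, "road-generic")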

Practicality With Hardware

This code reads in all the CSV files and combines them into one big table. Generally this isn't practical for large datasets, but locally it fit into memory after being read in. I have a 32 GB MacBook Pro; on a machine with less RAM this probably wouldn't be feasible, or would be painfully slow going to swap. The script ran in under a minute, which isn't bad considering the combined table is about 30 GB in memory.
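On a machine with less RAM, one workaround would be to load only the columns this analysis needs and pin their dtypes; a minimal sketch, assuming those four fields are all that's required:

import os
import pandas as pd

cols = ["state_code", "route_signing", "begin_point", "end_point"]
dtypes = {"state_code": "float32", "route_signing": "float32",
          "begin_point": "float64", "end_point": "float64"}

frames = []
for name in os.listdir("csvs"):
    path = os.path.join("csvs", name)
    if os.path.getsize(path) == 0:
        continue
    # usecols/dtype drop the unused columns up front, shrinking the in-memory footprint
    frames.append(pd.read_csv(path, encoding="latin1", usecols=cols, dtype=dtypes))
data = pd.concat(frames, ignore_index=True)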

Data Munging / Manipulation / Visualization

I generally prefer this approach over SQL statements because I'm very familiar with pandas and can manipulate the data interactively through a REPL like IPython as I go. Pandas lets you manipulate tabular data in an expressive way to shape it into the format you like/need.

I then used seaborn to render the heatmap shown above from the result.

The Code

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import os

# setting the state code label conversion
sc_label = {1: 'AL', 2: 'AK', 4: 'AZ', 5: 'AR', 6: 'CA', 8: 'CO', 9: 'CT', 10: 'DE', 12: 'FL', 13: 'GA', 15: 'HI', 16: 'ID', 17: 'IL', 18: 'IN', 19: 'IA', 20: 'KS', 21: 'KY', 22: 'LA', 23: 'ME', 24: 'MD', 25: 'MA', 26: 'MI', 27: 'MN', 28: 'MS', 29: 'MO', 30: 'MT', 31: 'NE', 32: 'NV', 33: 'NH', 34: 'NJ', 35: 'NM', 36: 'NY', 37: 'NC', 38: 'ND', 39: 'OH', 40: 'OK', 41: 'OR', 42: 'PA', 44: 'RI', 45: 'SC', 46: 'SD', 47: 'TN', 48: 'TX', 49: 'UT', 50: 'VT', 51: 'VA', 53: 'WA', 54: 'WV', 55: 'WI', 56: 'WY', 60: 'AS', 66: 'GU', 69: 'MP', 72: 'PR', 78: 'VI'}

# setting the route_signing code -> label conversion (codes from the HPMS field manual)
s_label = {1: 'Not Signed', 2: 'Interstate', 3: 'U.S.', 4: 'State', 5: 'Off-Interstate Business Marker', 6: 'County', 7: 'Township', 8: 'Municipal', 9: 'Parkway Marker or Forest Route Marker', 10: 'None of the Above'}

# getting all the non-empty csv files in the csvs directory
csvs = [os.path.join("csvs", file) for file in os.listdir("csvs") if os.path.getsize(os.path.join("csvs", file)) > 0]

# reading in each state's csv and combining into one huge data frame
data = pd.concat([pd.read_csv(i, encoding="latin1") for i in csvs], ignore_index=True)

# filling missing milepoints, then computing segment length
data[['begin_point', 'end_point']] = data[['begin_point', 'end_point']].fillna(0.0)
data['length'] = data.end_point-data.begin_point

# creating a total-mileage-per-state dictionary (the denominator for percentages)
s_d = data.groupby("state_code")['length'].sum().to_dict()

# summing mileage per (state_code, route_signing) pair
ee = data.groupby(["state_code", "route_signing"])[["length"]].sum()

# defining how I want the sign systems to appear on the graph ('' catches unmapped codes)
s_systems = ['Interstate', 'U.S.', 'State', 'County', 'Township', 'Municipal', 'Off-Interstate Business Marker', 'Parkway Marker or Forest Route Marker', 'None of the Above', 'Not Signed', '']

# converting each group's mileage to a percent of its state's total
ee = ee.apply(lambda x: float(x.length) / float(s_d.get(x.name[0], 0.0)) * 100.0, axis=1)

# resetting index and labelling column distribution 
ee = ee.reset_index()
ee.rename(columns={0: "Distribution"}, inplace=True)

# mapping state code and route signing appropriately 
ee.state_code = ee.state_code.astype(int).map(lambda x:sc_label.get(x,''))
ee.route_signing = ee.route_signing.astype(int).map(lambda x:s_label.get(x,''))
usigns = ee.route_signing.unique()

# creating the pivot table (route_signing rows x state columns)
pv_tbl = ee.pivot(index="route_signing", columns="state_code", values="Distribution")

# creating the heat map; letting seaborn derive tick labels from the
# pivot table itself keeps the state labels aligned with the columns
ax = sns.heatmap(
    pv_tbl.loc[[i for i in s_systems if i in usigns], :],
    cmap="YlGnBu",
    annot=True, fmt=".1f", linewidths=0.5)

plt.title("HPMS 2019 - Percent of Mileage for Each route_signing Field")
plt.show()
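As a quick sanity check (not part of the original script), each state's column in the pivot table should sum to roughly 100; a shortfall means some mileage had a null route_signing and was dropped by the groupby:

# per-state percentage totals; values well under 100 flag missing route_signing
print(pv_tbl.sum(axis=0).round(1))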

Results

This exercise showed that route_signing can generally be relied on for styling HPMS data, and more broadly demonstrated how large HPMS datasets can be reshaped with pandas and Python to pull out the information we're looking for.
