Using the R programming language with HSLynk Open Source for statistical analysis

This article shows how to query data using R's dplyr package to analyze US Dept of Housing and Urban Development Homeless Management Information System (HUD HMIS) data. HMIS is just one human services application of the HSLynk Platform as a Service, but homelessness is a large and growing problem, so let's use it for our example.

Here's an example of using HSLynk with powerful data analytic tools that aren't available with most other HMIS systems.  This example analyzes to sample HUD Homeless Management Information Standard sample files located at https://hudhdx.info/VendorResources.aspx Thank you Genelle Denzin, for figuring this out!

Before attempting to connect, please be sure you have the following:

1. Install R. (If it is already installed, be sure it is updated.)
2. Install R Studio. (If it is already installed, be sure it is updated.)
3. Get credentials and database ID for the database you will be using from HSLynk Support.
4. Get credentials to your VPN connection from HSLynk Support.

How to Set Up your Connection

To connect to the HSLynk Big Data Warehouse with R, RStudio, etc., the best way is using Impala (as opposed to Hive).

You need to install the correct Cloudera driver for Impala. Cloudera supports Windows, Mac, and Linux. See the documentation. Use the credentials given to you in step 3 above.

RStudio doesn't currently offer Secure Sockets connections in its free build, so you need to connect over a VPN. We recommend OpenVPN. Use the credentials given to you in step 4 above.

Before you run the following code, be sure you have successfully connected with your credentials to the VPN.

Here's a sample program to connect to Impala from R via Implyr.

If you don't already have these packages installed, please run:

install.package(odbc)
install.package(implyr)
install.package(dplyr)
library(odbc)
library(implyr)
library(dplyr)
impala <- src_impala(
drv = odbc::odbc(),
driver = "[directory or reference to driver]",
# this will depend on your operating system and setup. if you need help, please ask on the Slack channel
host = "HOST",
port = 21050,
database = "[get this from HSLynk Support]",
uid = "[get this from HSLynk Support]",
pwd = "[get this from HSLynk Support]",
UseSasl=1,
AuthMech=3
)

List the tables inside the database

tables <- dbGetQuery(impala, "show tables")
View(tables)

Creating dataframes from the database

client <- data.frame(tbl(impala, "client"))
enrollment <- data.frame(tbl(impala, "enrollment"))
exit <- data.frame(tbl(impala, "exit"))
moveindate <- data.frame(tbl(impala, "moveindate"))
project <- data.frame(tbl(impala, "project"))

Sample code

# view all clients where Ethnicity is Latino (Latino = 1 in the specs)
client %>%
select(id, active, date_created, ethnicity, ethnicity_desc) %>%
filter(ethnicity == 1) %>%
View()

Once you're connected, check out all the dplyr commands you can use: https://dplyr.tidyverse.org/

taken from: https://github.com/servinglynk/hslynk-open-source-docs/wiki/Impala-connection-via-R-using-Implyr-and-Dplyr

dplyr commands cheat sheet: https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf