H-1B Visa Prediction Models

An earlier post covered the distribution of the FY 2019 H-1B visa data. This post uses the same dataset to build models that predict the annual wage and the prevailing wage level.

Business question: what annual wage can a foreign worker with specialized knowledge expect?

Predict annual wage

Columns in the dataset:

['Unnamed: 0',
'VISA_CLASS',
'CASE_STATUS',
'SOC_TITLE',
'SOC_CODE',
'NAICS_CODE',
'EMPLOYER_NAME',
'AGENT_ATTORNEY_LAW_FIRM',
'WORKSITE_ADDRESS_CLEAN',
'WORKSITE_CITY_CLEAN',
'WORKSITE_STATE',
'WORKSITE_POSTAL_CODE',
'WAGE_RATE_ANNUAL',
'PREVAILING_WAGE_RATE_ANNUAL',
'PW_WAGE_LEVEL',
'PW_OES_YEAR',
'GPS']

First model: linear regression with a single feature, PREVAILING_WAGE_RATE_ANNUAL

y = 28099.74951382118 + 0.8969421177290609 * x1, where x1 is PREVAILING_WAGE_RATE_ANNUAL and y is the predicted annual wage
R^2: 0.5468001738806643
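As a rough illustration of how a one-feature fit like this is obtained, here is a minimal sketch of ordinary least squares and R^2 in plain Python. The wage values are invented toy numbers, not rows from the H-1B dataset, and the function names are hypothetical. The post does not say which library produced the reported coefficients.

```python
# Toy data, invented for illustration only (not taken from the H-1B dataset).
prevailing_wage = [60000.0, 75000.0, 90000.0, 105000.0, 120000.0]
annual_wage = [82000.0, 95000.0, 110000.0, 121000.0, 137000.0]


def fit_simple_linear_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


intercept, slope = fit_simple_linear_regression(prevailing_wage, annual_wage)
r2 = r_squared(prevailing_wage, annual_wage, intercept, slope)
print(f"y = {intercept:.2f} + {slope:.4f} * x1, R^2 = {r2:.4f}")
```

Read the same way, the equation reported for the real data predicts an annual wage of roughly 28,099.75 + 0.8969 * 100,000, or about 117,794, for a prevailing wage of 100,000.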

Improvement: gradient boosting

With 1 feature: PREVAILING_WAGE_RATE_ANNUAL

The graph above shows that the annual wage generally increases as the prevailing wage increases.
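To make the idea behind gradient boosting concrete, here is a minimal sketch for squared-error loss on one feature: each round fits a one-split tree (a "stump") to the current residuals and adds a damped copy of it to the model. This is an illustration only. The data are invented toy numbers, all function names are hypothetical, and the post does not state which gradient boosting implementation or settings were actually used.

```python
# Toy data, invented for illustration only (not taken from the H-1B dataset).
prevailing_wage = [60000.0, 75000.0, 90000.0, 105000.0, 120000.0]
annual_wage = [82000.0, 95000.0, 110000.0, 121000.0, 137000.0]


def fit_stump(x, residuals):
    """Find the single threshold on x that minimizes squared error.

    Assumes x contains at least two distinct values.
    """
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def stump_predict(stump, xi):
    threshold, left_mean, right_mean = stump
    return left_mean if xi <= threshold else right_mean


def fit_gradient_boosting(x, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean of y, then repeatedly fit stumps to the residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, predictions)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        predictions = [
            pi + learning_rate * stump_predict(stump, xi)
            for xi, pi in zip(x, predictions)
        ]
    return base, stumps, learning_rate


def boosted_predict(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, xi) for s in stumps)


model = fit_gradient_boosting(prevailing_wage, annual_wage)
for wage in (60000.0, 90000.0, 120000.0):
    print(wage, round(boosted_predict(model, wage), 2))
```

Unlike the straight line of a linear regression, the boosted model is a sum of step functions, so it can follow bends in the relationship between prevailing wage and annual wage.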

With 2 features: PREVAILING_WAGE_RATE_ANNUAL, SOC_CODE

(SOC_CODE: the occupational code associated with the job being requested for certification, as classified by the Standard Occupational Classification (SOC) System.)

For example, code 17-2061 is Computer Hardware Engineers.
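SOC_CODE is a string such as 17-2061, so it has to be turned into numbers before a tree-based model can use it. The post does not say how this was done; the sketch below shows two simple options under that caveat. The helper names are hypothetical and the list of codes is a made-up sample. The first two digits of an SOC code identify its major occupational group.

```python
def soc_major_group(soc_code):
    """Return the two-digit major group of an SOC code such as '17-2061'."""
    return int(soc_code.split("-")[0])


def ordinal_encode(codes):
    """Map each distinct code to an integer, in sorted order of the codes."""
    mapping = {code: index for index, code in enumerate(sorted(set(codes)))}
    return [mapping[code] for code in codes], mapping


# Made-up sample of SOC codes, for illustration only.
sample_codes = ["17-2061", "15-1132", "17-2061", "13-2011"]

encoded, mapping = ordinal_encode(sample_codes)
print(encoded)
print([soc_major_group(code) for code in sample_codes])
```

Ordinal encoding keeps one column per feature, which suits tree models; the integer order carries no meaning beyond grouping codes that sort near each other.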

Explain individual predictions with Shapley value plots

Here is an example graph that helps us understand how the algorithm made its prediction.
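For intuition about what such a plot displays, the sketch below computes exact Shapley values for a tiny two-feature model by averaging each feature's marginal contribution over every ordering of the features. Features not yet "added" are held at a baseline value. Everything here is an assumption made for illustration: the toy model, its coefficients, the instance, and the baseline are invented, and plotting libraries typically use faster, model-specific algorithms instead of this brute-force enumeration.

```python
from itertools import permutations


def toy_model(features):
    """Hypothetical wage model with an interaction term (not the post's model)."""
    prevailing_wage, is_engineering = features
    return (
        28000.0
        + 0.9 * prevailing_wage
        + 5000.0 * is_engineering
        + 0.05 * prevailing_wage * is_engineering
    )


def shapley_values(predict, instance, baseline):
    """Exact Shapley values: average marginal contribution over all orderings."""
    n = len(instance)
    contributions = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for feature in order:
            # Switch this feature from its baseline value to the instance value.
            current[feature] = instance[feature]
            value = predict(current)
            contributions[feature] += value - previous
            previous = value
    return [c / len(orderings) for c in contributions]


instance = [100000.0, 1.0]  # invented example application
baseline = [80000.0, 0.0]   # invented reference point

values = shapley_values(toy_model, instance, baseline)
print("contribution of prevailing wage:", values[0])
print("contribution of occupation flag:", values[1])
print("prediction minus baseline:", toy_model(instance) - toy_model(baseline))
```

The contributions always sum to the difference between the prediction for the instance and the prediction for the baseline, which is why a Shapley plot can be read as a breakdown of one prediction.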

Predict prevailing wage level

Prevailing wage level (2019)

Baseline:

Baseline using the mode (Level II)
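A mode baseline always predicts the most frequent class, and its accuracy is the bar that the classifiers need to clear. A minimal sketch follows; the labels are invented toy data rather than the real PW_WAGE_LEVEL column, and the function names are hypothetical.

```python
from collections import Counter


def fit_mode_baseline(labels):
    """Return the most frequent label in the training data."""
    return Counter(labels).most_common(1)[0][0]


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Invented toy labels, for illustration only.
train_levels = ["Level II", "Level I", "Level II", "Level III", "Level II", "Level IV"]
test_levels = ["Level II", "Level II", "Level I", "Level III"]

mode_level = fit_mode_baseline(train_levels)
baseline_predictions = [mode_level] * len(test_levels)
baseline_accuracy = accuracy(test_levels, baseline_predictions)
print(mode_level, baseline_accuracy)
```

Any model that does not beat this number on held-out data has learned nothing beyond the class frequencies.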

Decision Tree:

Random Forest:

Feature Importance for Random Forest:

XGBoost:

Feature Importance for XGBoost: