H-1B Visa Prediction Models

Following the earlier FY 2019 H-1B Visa Data Distribution post, this post builds models that predict the annual wage and the prevailing wage level from the same dataset.

Business question: For a foreign worker with specialized knowledge, what would the annual wage look like?

Predict annual wage

The list of columns:

['Unnamed: 0',
'VISA_CLASS',
'CASE_STATUS',
'SOC_TITLE',
'SOC_CODE',
'NAICS_CODE',
'EMPLOYER_NAME',
'AGENT_ATTORNEY_LAW_FIRM',
'WORKSITE_ADDRESS_CLEAN',
'WORKSITE_CITY_CLEAN',
'WORKSITE_STATE',
'WORKSITE_POSTAL_CODE',
'WAGE_RATE_ANNUAL',
'PREVAILING_WAGE_RATE_ANNUAL',
'PW_WAGE_LEVEL',
'PW_OES_YEAR',
'GPS']

First model: Linear Regression model with feature PREVAILING_WAGE_RATE_ANNUAL

R^2: 0.5468001738806643
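The post does not show how this model was fit or whether R^2 was measured on held-out data. As a minimal sketch, assuming scikit-learn and a train/test split, a single-feature linear regression could look like the following. The data here is synthetic and invented purely for illustration, so the printed score will not match the figure above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the H-1B data (all values are made up).
rng = np.random.default_rng(0)
prevailing_wage = rng.uniform(50_000, 150_000, size=1_000)
annual_wage = 1.1 * prevailing_wage + rng.normal(0, 15_000, size=1_000)

X = prevailing_wage.reshape(-1, 1)  # scikit-learn expects a 2-D feature matrix
y = annual_wage

# Assumption: score on a 20% held-out split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

linear_model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, linear_model.predict(X_test))
print(f"R^2 on held-out synthetic data: {r2:.3f}")
```

With the real dataset, `X` would be the PREVAILING_WAGE_RATE_ANNUAL column and `y` the WAGE_RATE_ANNUAL column.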

Improvement: gradient boosting

With 1 feature: PREVAILING_WAGE_RATE_ANNUAL

From the graph above, we see that annual wage generally increases as the prevailing wage increases.
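The relationship between the single feature and the predicted wage can be inspected by evaluating a fitted model over a grid of prevailing wages. The sketch below assumes scikit-learn's `GradientBoostingRegressor` with default settings, since the post does not name the library or hyperparameters, and it again uses synthetic data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in for the H-1B data (all values are made up).
rng = np.random.default_rng(0)
prevailing_wage = rng.uniform(50_000, 150_000, size=1_000)
annual_wage = 1.1 * prevailing_wage + rng.normal(0, 15_000, size=1_000)

gbr = GradientBoostingRegressor(random_state=0)
gbr.fit(prevailing_wage.reshape(-1, 1), annual_wage)

# With only one feature, predictions over a grid trace out the
# learned dependence of annual wage on prevailing wage.
grid = np.linspace(60_000, 140_000, 9).reshape(-1, 1)
predicted = gbr.predict(grid)
for pw, wage in zip(grid.ravel(), predicted):
    print(f"prevailing wage {pw:>9,.0f} -> predicted annual wage {wage:>9,.0f}")
```

Plotting `predicted` against `grid` would give a curve of the same kind as the one described above.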

With 2 features: PREVAILING_WAGE_RATE_ANNUAL, SOC_CODE

(SOC_CODE: Occupational code associated with the job being requested for certification, as classified by the Standard Occupational Classification (SOC) System.)

For example, SOC code 17-2061 corresponds to Computer Hardware Engineers.
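Because SOC_CODE is a categorical string, it has to be turned into numbers before a gradient boosting model can use it. The post does not say which encoding was used; the sketch below assumes an ordinal encoding with scikit-learn's `OrdinalEncoder`. The wage premiums per occupation are hypothetical and exist only to generate synthetic data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import OrdinalEncoder

rng = np.random.default_rng(0)
n = 1_000

# Synthetic data: three example SOC codes with invented wage premiums.
soc_codes = rng.choice(["13-2011", "15-1132", "17-2061"], size=n)
premium = {"13-2011": 0.0, "15-1132": 20_000.0, "17-2061": 10_000.0}
prevailing_wage = rng.uniform(50_000, 150_000, size=n)
annual_wage = (
    1.1 * prevailing_wage
    + np.array([premium[code] for code in soc_codes])
    + rng.normal(0, 10_000, size=n)
)

# Assumption: map each SOC code to an integer-valued category index.
encoder = OrdinalEncoder()
soc_encoded = encoder.fit_transform(soc_codes.reshape(-1, 1))

X = np.column_stack([prevailing_wage, soc_encoded.ravel()])
model = GradientBoostingRegressor(random_state=0).fit(X, annual_wage)

feature_names = ["PREVAILING_WAGE_RATE_ANNUAL", "SOC_CODE"]
for name, score in zip(feature_names, model.feature_importances_):
    print(f"{name}: {score:.3f}")
```

One-hot encoding is a common alternative when there are few categories; ordinal encoding keeps the feature matrix small when there are many distinct codes.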

Explain individual predictions with Shapley value plots

NAICS_CODE: Industry code associated with the employer requesting permanent labor condition, as classified by the North American Industrial Classification System (NAICS).

Here is an example graph that helps us understand how the algorithm made its prediction.
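To make the idea behind such a plot concrete, the sketch below computes exact Shapley values for one prediction by averaging each feature's marginal contribution over all feature orderings. It is an illustration of the concept only: the post does not say which tool produced its plots, and `toy_model`, the baseline row, and the instance are all hypothetical.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one prediction.

    A feature that is 'absent' from a coalition takes its baseline value.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            # Switch this feature from its baseline value to the real one
            # and record how much the prediction moves.
            current[name] = instance[name]
            value = predict(current)
            totals[name] += value - previous
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(row):
    """Hypothetical wage model, invented for this example."""
    bonus = 10_000.0 if row["SOC_CODE"] == "17-2061" else 0.0
    return 1.1 * row["PREVAILING_WAGE_RATE_ANNUAL"] + bonus


baseline = {"PREVAILING_WAGE_RATE_ANNUAL": 80_000.0, "SOC_CODE": "13-2011"}
instance = {"PREVAILING_WAGE_RATE_ANNUAL": 100_000.0, "SOC_CODE": "17-2061"}

phi = shapley_values(toy_model, instance, baseline)
for name, contribution in phi.items():
    print(f"{name}: {contribution:+,.0f}")
```

The contributions add up to the difference between the prediction for the instance and the prediction for the baseline, which is what lets a plot show how each feature pushes an individual prediction up or down. Enumerating all orderings is only feasible for a handful of features; practical tools rely on approximations or model-specific algorithms.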

Predict prevailing wage level

Business question: Foreign workers with a Level I salary have a lower chance of being selected in the lottery. What is the prevailing wage level for each filing?

Baseline:
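The post does not state which baseline was used. A common choice for a classification task like this is to always predict the most frequent class, sketched below with a few hypothetical wage-level labels.

```python
from collections import Counter

# Hypothetical wage-level labels, invented for illustration.
train_levels = ["Level I", "Level II", "Level II", "Level III", "Level II", "Level I"]
test_levels = ["Level II", "Level I", "Level II", "Level IV"]

# The baseline always predicts the most frequent training label.
majority_level, _ = Counter(train_levels).most_common(1)[0]
baseline_predictions = [majority_level] * len(test_levels)

correct = sum(p == t for p, t in zip(baseline_predictions, test_levels))
baseline_accuracy = correct / len(test_levels)
print(f"majority class: {majority_level}, accuracy: {baseline_accuracy:.2f}")
```

Any trained classifier should beat this accuracy to be considered useful.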

Decision Tree:

Random Forest:

Feature Importance for Random Forest:
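As a minimal sketch of how such importances can be obtained, assuming scikit-learn's `RandomForestClassifier`, the example below fits a forest on synthetic data and reads its `feature_importances_` attribute. The labelling rule and the `NOISE` feature are invented for illustration and are not part of the dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600

# Synthetic data: the wage level (1 to 4) is derived from invented
# prevailing-wage bins; NOISE is an irrelevant feature for contrast.
prevailing_wage = rng.uniform(50_000, 150_000, size=n)
noise_feature = rng.normal(size=n)
wage_level = np.digitize(prevailing_wage, bins=[75_000, 100_000, 125_000]) + 1

X = np.column_stack([prevailing_wage, noise_feature])
feature_names = ["PREVAILING_WAGE_RATE_ANNUAL", "NOISE"]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, wage_level)

# Impurity-based importances, normalized to sum to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Impurity-based importances tend to favor features with many distinct values, so they are best read as a rough ranking rather than an exact measure of influence.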

XGBoost:

Feature Importance for XGBoost: