H-1B Visa Prediction Models
We previously posted about the FY 2019 H-1B visa data distribution. Now we build models to predict the annual wage and the prevailing wage level using the same dataset.
Business question: For a foreign worker with specialized knowledge, what would the annual wage look like?
Predict annual wage
The list of columns:
['Unnamed: 0',
'VISA_CLASS',
'CASE_STATUS',
'SOC_TITLE',
'SOC_CODE',
'NAICS_CODE',
'EMPLOYER_NAME',
'AGENT_ATTORNEY_LAW_FIRM',
'WORKSITE_ADDRESS_CLEAN',
'WORKSITE_CITY_CLEAN',
'WORKSITE_STATE',
'WORKSITE_POSTAL_CODE',
'WAGE_RATE_ANNUAL',
'PREVAILING_WAGE_RATE_ANNUAL',
'PW_WAGE_LEVEL',
'PW_OES_YEAR',
'GPS']
First model: Linear Regression model with feature PREVAILING_WAGE_RATE_ANNUAL
R^2: 0.5468001738806643
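To make the R^2 figure concrete, here is a minimal sketch of a one-feature linear fit and its R^2. It assumes synthetic stand-in data, since the dataset itself is not reproduced here; the arrays `prevailing` and `annual` and the numbers used to generate them are made up for illustration, and the R^2 it prints is not comparable to the value above. It also scores on the same data it was fit on, which may differ from how the value above was computed.

```python
import numpy as np

# Synthetic stand-in data (assumption): the real post uses the FY 2019 H-1B dataset.
rng = np.random.default_rng(0)
prevailing = rng.uniform(50_000, 150_000, size=500)        # plays PREVAILING_WAGE_RATE_ANNUAL
annual = 1.1 * prevailing + rng.normal(0, 20_000, size=500)  # plays WAGE_RATE_ANNUAL

# Fit annual = slope * prevailing + intercept by ordinary least squares.
slope, intercept = np.polyfit(prevailing, annual, deg=1)
predicted = slope * prevailing + intercept

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((annual - predicted) ** 2)
ss_tot = np.sum((annual - annual.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"slope={slope:.3f}, R^2={r2:.3f}")
```

An R^2 near 0.55 means the single feature explains a little over half of the variance in annual wage, which leaves plenty of room for a more flexible model.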
Improvement: using gradient boosting
With 1 feature: PREVAILING_WAGE_RATE_ANNUAL
From the graph above, we see that annual wage generally increases as the prevailing wage increases.
With 2 features: PREVAILING_WAGE_RATE_ANNUAL, SOC_CODE
(SOC_CODE: Occupational code associated with the job being requested for certification, as classified by the Standard Occupational Classification (SOC) System.)
For example, the occupation for code 17-2061 is Computer Hardware Engineers.
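A two-feature gradient boosting model might be set up roughly as follows. This is a sketch under assumptions, not the exact pipeline behind the results here: it uses scikit-learn's GradientBoostingRegressor with default settings, synthetic data, and an ordinal encoding of SOC_CODE. The codes other than 17-2061 and the per-occupation wage offsets in `bonus` are invented purely for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

# Synthetic stand-in data (assumption); the real features come from the H-1B dataset.
rng = np.random.default_rng(0)
n = 1000
soc_codes = rng.choice(["15-1132", "17-2061", "13-2011"], size=n)
prevailing = rng.uniform(50_000, 150_000, size=n)

# Made-up occupation effect on wage, purely for illustration.
bonus = {"15-1132": 15_000, "17-2061": 5_000, "13-2011": -5_000}
annual = prevailing + np.array([bonus[c] for c in soc_codes]) + rng.normal(0, 10_000, size=n)

# SOC_CODE is a string category; map each code to an integer id so trees can split on it.
# The ids carry no meaningful order; one-hot encoding is a common alternative.
encoder = OrdinalEncoder()
soc_encoded = encoder.fit_transform(soc_codes.reshape(-1, 1)).ravel()
X = np.column_stack([prevailing, soc_encoded])

X_train, X_test, y_train, y_test = train_test_split(X, annual, test_size=0.2, random_state=0)
model = GradientBoostingRegressor(random_state=0)
model.fit(X_train, y_train)

# score() returns R^2 on the held-out rows.
r2_test = model.score(X_test, y_test)
print(f"test R^2 = {r2_test:.3f}")
```

Scoring on a held-out split keeps the comparison with the linear baseline honest, since boosted trees can fit training data very closely.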
Explain individual predictions with Shapley value plots
NAICS_CODE: Industry code associated with the employer requesting permanent labor condition, as classified by the North American Industrial Classification System (NAICS).
Here is an example graph that helps us understand how the algorithm made its prediction.
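For readers who want to see what a Shapley value plot is summarizing, the sketch below computes exact Shapley values for a single prediction by enumerating every feature coalition. It is not the tool that produced the plots in this post; the `predict` function, its coefficients, and the `instance` and `baseline` values are all hypothetical, and features outside a coalition are simply held at a baseline value.

```python
from itertools import combinations
from math import factorial

def predict(prevailing_wage, soc_bonus, naics_bonus):
    """Hypothetical wage model, used only to have something to explain."""
    return 10_000 + 1.05 * prevailing_wage + soc_bonus + naics_bonus

def shapley_values(model, instance, baseline):
    """Exact Shapley values of `model` at `instance`, relative to `baseline`.

    Features outside a coalition are held at their baseline value.
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        args = {k: (instance[k] if k in coalition else baseline[k]) for k in names}
        return model(**args)

    result = {}
    for name in names:
        others = [k for k in names if k != name]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_feature = value(set(subset) | {name})
                without_feature = value(set(subset))
                total += weight * (with_feature - without_feature)
        result[name] = total
    return result

# Hypothetical filing and reference point (assumptions for illustration).
instance = {"prevailing_wage": 90_000, "soc_bonus": 5_000, "naics_bonus": -2_000}
baseline = {"prevailing_wage": 80_000, "soc_bonus": 0, "naics_bonus": 0}
phi = shapley_values(predict, instance, baseline)
print(phi)
```

The contributions add up to the difference between the prediction for this filing and the prediction at the baseline, which is the property that lets a plot show each feature pushing the prediction up or down. Exact enumeration grows exponentially with the number of features, so practical tools rely on approximations or model-specific shortcuts.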
Predict prevailing wage level
Business question: Foreign workers with a Level I salary have a lower chance of being drawn in the lottery. What is the prevailing wage level for each filing?
Baseline:
Decision Tree:
Random Forest:
Feature Importance for Random Forest:
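The comparison between a baseline, a decision tree, and a random forest, along with the forest's feature importances, could be reproduced along these lines. This is a sketch assuming scikit-learn and synthetic data: the rule that ties wage level to prevailing wage, the label noise, and the uninformative `noise_feature` are all invented, and the hyperparameters are not those used for the results here.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption).
rng = np.random.default_rng(0)
n = 2000
prevailing = rng.uniform(50_000, 150_000, size=n)
noise_feature = rng.normal(size=n)  # carries no information about the label

# Made-up rule: levels 1..4 follow prevailing-wage bands, with 10% of labels reshuffled.
level = np.digitize(prevailing, [75_000, 100_000, 125_000]) + 1
flip = rng.random(n) < 0.1
level[flip] = rng.integers(1, 5, size=flip.sum())

X = np.column_stack([prevailing, noise_feature])
X_train, X_test, y_train, y_test = train_test_split(X, level, test_size=0.25, random_state=0)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),  # always predicts the majority level
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = model.score(X_test, y_test)  # accuracy on held-out rows
    print(f"{name}: accuracy = {accuracy[name]:.3f}")

# Impurity-based importances; they sum to 1 across features.
importances = models["random forest"].feature_importances_
for feature, importance in zip(["PREVAILING_WAGE_RATE_ANNUAL", "noise"], importances):
    print(f"{feature}: {importance:.3f}")
```

The majority-class baseline gives the accuracy a model gets for free, so any tree-based model is only useful to the extent that it beats that number.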
XGBoost:
Feature Importance for XGBoost: