Problem Statement:
Predict the mortality risk of a patient (due to coronavirus) from the
patient's blood report, given a dataset of patient hospitalization records.
Ensure that appropriate features and data rows are chosen. Choose appropriate
hyper-parameters for the model and justify them based on the chosen
performance measure. A separate test set is available for use once the model
is trained, and the predictions should score well on it under that measure.
Justify your model based on the chosen performance measure, and list the most
important features identified by training the model. You may choose from a
variety of ML algorithms.
Data Available:
Real data is available for 375 + 100 patients. These patients took multiple
blood tests, and the results were recorded over time. Patients who survived
have an outcome of 0, and patients who died have an outcome of 1. The test
data set is also supplied.
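To make the record layout described above concrete, here is a minimal sketch of how such longitudinal data might be inspected with pandas. The column names (`PATIENT_ID`, `RE_DATE`, `marker_a`, `marker_b`, `outcome`) and all values are hypothetical placeholders, not taken from the supplied dataset; the real schema may differ.

```python
import pandas as pd

# Toy records: several time-stamped blood tests per patient, one 0/1 outcome.
# Column names and values are assumptions made for illustration only.
records = pd.DataFrame({
    "PATIENT_ID": [1, 1, 1, 2, 2, 3],
    "RE_DATE": pd.to_datetime([
        "2020-01-31 01:00", "2020-02-02 09:00", "2020-02-05 14:00",
        "2020-02-01 08:00", "2020-02-03 10:00", "2020-02-04 12:00",
    ]),
    "marker_a": [5.1, None, 7.3, 2.0, None, 3.3],
    "marker_b": [None, 110.0, 150.0, 95.0, 90.0, None],
    "outcome": [1, 1, 1, 0, 0, 0],
})

# Number of blood-test records per patient.
records_per_patient = records.groupby("PATIENT_ID").size()

# One outcome per patient (the outcome is constant within a patient here).
outcome_per_patient = records.groupby("PATIENT_ID")["outcome"].first()

# Fraction of missing values in each feature column.
missing_fraction = records[["marker_a", "marker_b"]].isna().mean()

print(records_per_patient.to_dict())
print(outcome_per_patient.value_counts().to_dict())
print(missing_fraction.round(2).to_dict())
```

Checking the class balance at the patient level rather than the row level matters here, because patients with many recorded tests would otherwise be counted several times.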
Task:
1) Build a model to predict the likelihood of mortality for each patient
based on the data given. Do this using each of the following methods:
a. Do not impute missing data. Replace every missing value with -1
i. Take the final data report of each patient as that patient's input data,
and fit the model. This implies that the training data has only 375 rows
ii. Augment the training data by adding relevant rows. The expectation is not
to use as many rows as the datasheet contains, but to apply some criterion
for grouping rows together
b. Fill the missing data using typical methods: mean, most correlated
value, etc.
i. Take the final data report of each patient as that patient's input data,
and fit the model. This implies that the training data has only 375 rows
ii. Augment the training data by adding relevant rows. The expectation is not
to use as many rows as the datasheet contains, but to apply some criterion
for grouping rows together
iii. Identify the most important features and use only those features to
build a model. How do that model's performance metrics compare with those of
the model that uses all 75 features?
2) Choose Accuracy as the performance measure
3) Identify the correlated features using multiple measures, and plot
their dependencies on each other and on the target variable
a. Understand and analyze the data
i. Identify from the data any dependencies among the features and their
impact on the target variable
ii. Show multiple visualizations of the feature dependencies
4) Can we identify the most important features from the trained model?
5) Create ML model(s) for the outcome.
a. Can you try using an ensemble of models?
b. What are the correct hyper-parameters (for that algorithm)?
The expectation is that 5 models will be created (1a.i, 1a.ii, 1b.i, 1b.ii,
and 1b.iii), each evaluated with accuracy as the performance measure. Please
also report the appropriate loss.
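As a starting point for variants 1a.i and 1b.i, the sketch below keeps each patient's final record, fills missing values either with -1 or with the training-column mean, fits a classifier, and reports accuracy together with log loss and a feature ranking. Everything specific here is an assumption made for illustration: the column names, the synthetic data from the hypothetical `make_toy` helper, the choice of a random forest, and its untuned hyper-parameters. The assignment itself leaves the algorithm and hyper-parameters open.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, log_loss

rng = np.random.default_rng(0)

def make_toy(n_patients, start_id):
    """Synthetic stand-in for the real data (hypothetical columns)."""
    rows = []
    for pid in range(start_id, start_id + n_patients):
        outcome = int(rng.random() < 0.4)
        for visit in range(int(rng.integers(1, 4))):
            rows.append({
                "PATIENT_ID": pid,
                "RE_DATE": pd.Timestamp("2020-02-01") + pd.Timedelta(days=visit),
                "marker_a": rng.normal(3.0 * outcome, 1.0),  # informative
                "marker_b": rng.normal(0.0, 1.0),            # pure noise
                "outcome": outcome,
            })
    df = pd.DataFrame(rows)
    # Blank out roughly 20% of each feature to mimic missing test results.
    for col in ("marker_a", "marker_b"):
        df.loc[rng.random(len(df)) < 0.2, col] = np.nan
    return df

def final_record_per_patient(df, id_col, time_col):
    """Keep only each patient's most recent row."""
    return df.sort_values(time_col).groupby(id_col).tail(1)

def fit_and_score(train, test, feature_cols, target_col, fill):
    """Fit a random forest; return (model, accuracy, log loss) on the test set."""
    X_train, X_test = train[feature_cols], test[feature_cols]
    if fill == "constant":
        # Variant 1a: substitute -1 for every missing value.
        X_train, X_test = X_train.fillna(-1), X_test.fillna(-1)
    else:
        # Variant 1b: fill with column means computed on the training set only.
        means = X_train.mean()
        X_train, X_test = X_train.fillna(means), X_test.fillna(means)
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_train, train[target_col])
    acc = accuracy_score(test[target_col], model.predict(X_test))
    loss = log_loss(test[target_col], model.predict_proba(X_test),
                    labels=model.classes_)
    return model, acc, loss

features = ["marker_a", "marker_b"]
train = final_record_per_patient(make_toy(60, 0), "PATIENT_ID", "RE_DATE")
test = final_record_per_patient(make_toy(20, 1000), "PATIENT_ID", "RE_DATE")

for fill in ("constant", "mean"):
    model, acc, loss = fit_and_score(train, test, features, "outcome", fill)
    ranked = sorted(zip(features, model.feature_importances_),
                    key=lambda pair: -pair[1])
    print(fill, round(acc, 3), round(loss, 3), [name for name, _ in ranked])
```

Computing the fill means on the training rows only, and reusing them for the test rows, keeps test-set information out of the fitted model. The augmented variants (1a.ii, 1b.ii) would replace `final_record_per_patient` with whatever row-grouping criterion is chosen.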