Assignment: Working with data in Python

Task: Construct a data frame

Using Python in Jupyter Notebook, construct a pandas data frame df that contains the following observations:

Task: Add markdown

Add at least one markdown block to explain what you have done.

Task: Combine and filter

Create another data frame df2 that contains the following data of heart rates at rest:

Note that individuals 6, 7, and 11 have no heart rate data.

Combine the two data frames so that there is one row per person. The new table should include the variables from both tables.

Remove those patients from the data frame who were born before year 1940.

Task: Compute basic statistics

Expand your program to compute the basic statistics (count, mean, quartiles, etc.) summary for all variables.

Task: Compute and visualize correlations

Compute the pairwise correlation matrix, and visualize the correlations.

Tips: Seaborn heatmap. See "Demo: Iris" material.

Pay attention to variable pairs with high positive or negative correlation. Some machine learning methods perform worse if the variables are correlated.

Task: Work with an existing data set

Create another JuPyter Notebook.

Load the Chronic kidney disease data set.

Chronic Kidney Disease (CKD) data set
Material	Link	Reference
Data set	csv	Dr.P.Soundarapandian.M.D.,D.M (Senior Consultant Nephrologist), Apollo Hospitals, Managiri, Madurai Main Road, Karaikudi, Tamilnadu, India. Downloadable via Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Explanation	txt

Tips: See pandas documentation for read_csv, p. 222.

Your goal is to find out the minimum, maximum, and mean values and pairwise correlation coefficients of all numerical variables for the affected individuals (individuals with CKD) in the data set.

Tips: See the accompanying text document for an interpretation of the variables.

Check the data types, and check how the missing data is encoded.

Filter the data set to include only the affected patients.

For the remaining subset, print the basic statistics.

Then, calculate the pairwise correlation coefficients (correlation matrix) between each pair of numerical variables.

Back to main page