Important Information:
Read all the instructions carefully before you begin!
You will need to save the (.ipynb) file as a searchable PDF (NOT as a picture), and submit it as the primary resource. Pictures or snapshots of your work will NOT be accepted.
The generated CSV file and .ipynb file must be submitted in a zip-folder as the secondary source.
You may use Jupyter Notebook or Colab as per your convenience.
Non-compliance with the above instructions will result in a 0 grade on the relevant portions of the assignment. Your instructor will grade your assignment based on what you submitted. Failure to submit the assignment or submitting an assignment intended for another class will result in a 0 grade, and resubmission will not be allowed. Make sure that you submit your original work. Suspected cases of plagiarism will be treated as potential academic misconduct and will be reported to the College Academic Integrity Committee for a formal investigation. As part of this procedure, your instructor may require you to meet with them for an oral exam on the assignment.
Important First Steps:
You can use either Anaconda or Colab to work on the Jupyter notebook that you will submit as your final project on Forum:
Start by downloading this Jupyter Notebook to your local machine.
Open a tab in your browser and type https://colab.research.google.com/.
This will open a small window. Choose the last option on the upper menu, “Upload”, then select the Jupyter notebook you saved in the first step.
You can start working on your assignment by answering the questions in the corresponding cells.
Sample code is provided for Tasks 3 to 6. It is only a starting point: you will need to make minor revisions to the code to complete the tasks.
If you have any questions, please reach out to your instructors and the CIS tutors.
Background
Imagine that you have graduated from CIS and now work as a consultant.
You are hired by a health and fitness company.
They have collected detailed data from 507 physically active participants. This data includes information about the participant’s body measurements as well as personal attributes such as age, weight, height, and gender.
The company wants you to analyze this data in ways that can help them design personalized fitness evaluations and training regimens for their users.
Note: The entire dataset (and descriptions of each of the variables) can be found [here](https://vincentarelbundock.github.io/Rdatasets/doc/openintro/bdims.html).
In Assignment 1 you will take a random sample of 100 participants from the 507 individuals who were studied, and analyze the data for these 100 individuals.
Task 1.
As mentioned above, you will select a random sample of 100 individuals from the company’s data set.
You will then conduct analyses on this random sample.
Look at the code below. To select a random sample from the data, you should replace Name with your own name in the code.
After you have done so run the code. The code will generate a CSV file with a random sample of 100 participants. It will also be labeled with your name.
REMEMBER: You need to add this CSV file to a zip file along with your .ipynb file when submitting your assignment.
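The sampling code supplied with the notebook is not reproduced here. As a rough illustration of the idea in Task 1, the sketch below draws 100 rows without replacement and uses a name string as the random seed, so the same name always yields the same sample. The function `sample_rows`, the column names, and the placeholder rows are all assumptions made for illustration; they are not the real dataset or the course's code.

```python
import csv
import io
import random


def sample_rows(rows, n, name):
    """Draw n rows without replacement, seeded by a name string."""
    rng = random.Random(name)  # the same name reproduces the same sample
    return rng.sample(rows, n)


# Placeholder rows standing in for the 507 participants (NOT the real data).
population = [{"id": i, "wgt": 50 + (i % 40)} for i in range(507)]
sample = sample_rows(population, 100, "Name")

# Write the sample as CSV text. In a notebook this would normally be
# written to a file named after the student instead of an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "wgt"])
writer.writeheader()
writer.writerows(sample)
csv_text = buffer.getvalue()
```

Seeding by name is one plausible way to give each student a different but reproducible sample; the provided notebook may do this differently.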
Task 2.
Now that you have your data set you are ready to start analyzing it!
The first step is to explore your dataset.
Look at the variables that make up the data set.
Once you’ve done so, imagine you are writing a report for the fitness company that hired you.
Start with a brief introduction to the research question you are exploring, then describe the dataset you are analyzing (e.g., What is the sample? What are the variables?).
Assume that your audience is the company’s leadership. They will be with what you are reporting.
Task 3.
Run the code to randomly select 4 variables from your dataset.
It will then print the names of the four variables that were randomly selected.
REMEMBER: Check the full name of each of your variables; you can find it here.
Your task is to do the following:
You should create a histogram and generate descriptive statistics for each of the four variables that were randomly selected above. You can use the code below to help you do so.
For each variable you need to describe the following: shape, center, spread, and the presence of any outliers.
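As one possible way to back up a description of center, spread, and outliers with numbers, the sketch below computes the mean, median, standard deviation, and interquartile range, and flags outliers with the common 1.5 × IQR rule. The helper `describe` and the height values are hypothetical, and the 1.5 × IQR rule is an assumption here, not something the assignment prescribes.

```python
import statistics


def describe(values):
    """Summarize center, spread, and outliers (1.5 * IQR rule)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),  # sample standard deviation
        "iqr": iqr,
        "outliers": [v for v in values if v < low or v > high],
    }


# Made-up heights in cm, with one unusually large value.
heights = [160, 162, 165, 167, 168, 170, 171, 173, 175, 210]
summary = describe(heights)
```

In this made-up sample the mean (172.1) sits above the median (169), which is consistent with a right-skewed shape caused by the single large value.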
Task 4.
Now that you have described and plotted data, let’s explore if the data differ for male and female participants.
Generate grouped box plots for each of the 4 variables in Task 3.
Your boxplot should compare the distributions for males and females in your dataset.
Afterwards, you should describe what you observe in each case.
Make sure you mention the five-number summaries for both genders.
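The five-number summary (minimum, first quartile, median, third quartile, maximum) is what a box plot draws. A minimal sketch of computing it separately for each gender follows; the `"M"`/`"F"` labels, the helper name, and the weights are illustrative assumptions, and quartile conventions differ slightly between tools, so values may not match a plotting library exactly.

```python
import statistics
from collections import defaultdict


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Made-up (gender, weight in kg) pairs, not taken from the real dataset.
records = [
    ("M", 70), ("M", 74), ("M", 78), ("M", 82), ("M", 90),
    ("F", 52), ("F", 56), ("F", 60), ("F", 63), ("F", 70),
]

# Group the weights by gender, then summarize each group.
groups = defaultdict(list)
for gender, weight in records:
    groups[gender].append(weight)

summaries = {g: five_number_summary(v) for g, v in groups.items()}
```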
Task 5
Part A
Select TWO variables from Task 3. Treat each of these as an independent variable.
Now create a scatterplot for each variable.
In each case, the plot should visualize the relationship between the variable and weight (dependent variable).
Describe each scatterplot in terms of the form, strength, and direction of the relationship between the variables.
Part B
Examine if the relationship explored in each scatterplot varies by gender.
Hint: You will need to create scatterplots separately for each gender to answer this question.
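A scatterplot shows form; Pearson's correlation coefficient r can support a statement about the direction and strength of a linear relationship. The sketch below computes r for all participants and then for each gender separately. The function `pearson_r` and the height and weight values are hypothetical, and r is only meaningful when the scatterplot looks roughly linear.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up (gender, height in cm, weight in kg) rows for illustration only.
rows = [
    ("M", 170, 68), ("M", 175, 74), ("M", 180, 79), ("M", 185, 88),
    ("F", 155, 50), ("F", 160, 57), ("F", 165, 59), ("F", 170, 66),
]

r_all = pearson_r([h for _, h, _ in rows], [w for _, _, w in rows])
r_by_gender = {
    g: pearson_r(
        [h for s, h, _ in rows if s == g],
        [w for s, _, w in rows if s == g],
    )
    for g in ("M", "F")
}
```

A sign near +1 or -1 suggests a strong linear relationship in that direction; comparing `r_by_gender` values is one way to start on Part B alongside the separate scatterplots.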
Task 6.
PART A
Finally, for each of the variables you focused on in Task 5:
Fit a simple linear regression model that predicts a participant’s Weight based on the variable you selected.
Make sure you generate, interpret, and use the residual plot, the standard error, and the R^2 to assess the fit of each linear model.
If the model is a good fit, interpret the slope and the y-intercept.
PART B
If you found that the relationship between weight and the variable you selected differed for males and females in Task 5 (Part B) then:
Run the regression model for each gender separately and interpret your findings accordingly.
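To make the quantities named in Task 6 concrete, here is a minimal least-squares sketch that returns the slope, intercept, residuals, R^2, and the residual standard error. The function `fit_line` and the data are hypothetical, and the notebook's sample code may rely on a statistics library instead of hand-written formulas.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual = observed value minus predicted value.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sse = sum(e ** 2 for e in residuals)     # unexplained variation
    sst = sum((y - mean_y) ** 2 for y in ys)  # total variation
    return {
        "slope": slope,
        "intercept": intercept,
        "residuals": residuals,
        "r_squared": 1 - sse / sst,
        "std_error": math.sqrt(sse / (n - 2)),  # residual standard error
    }


# Made-up heights (cm) and weights (kg), not taken from the real dataset.
heights = [160, 165, 170, 175, 180]
weights = [55, 62, 66, 71, 80]
model = fit_line(heights, weights)
```

Plotting `model["residuals"]` against `heights` gives the residual plot; a patternless band around zero supports a linear fit. For Part B, the same function can be called on the male and female rows separately.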
Assignment Information
Length:
N/A
Weight:
18%
Learning Outcomes Added
CompProgramDesign: Generate working programs in a computer language that can solve computational problems; find and fix bugs that appear in them.
Variables: Identify and classify the relevant variables of a system, problem, or model.
DescriptiveStats: Calculate and interpret descriptive statistics appropriately.
Correlation: Apply and interpret measures of correlation; distinguish correlation and causation.
Visualizations: Interpret, analyze, and create data visualizations.
#CompProgramDesign
Generate working programs in a computer language that can solve computational problems; find and fix bugs that appear in them.
Rubric
0
Did not submit the assigned item, or submitted something that does not meet the minimum standard.
1
Does not generate working programs in a computer language that can solve computational problems or find and fix bugs that appear in them when prompted, or do so mostly or entirely inaccurately.
2
Generates working programs in a computer language that can solve computational problems only somewhat accurately, or finds and fixes bugs that appear in them in a way that fails to address the relevant problems or goals.
3
Accurately or effectively generates working programs in a computer language that can solve computational problems or finds and fixes bugs that appear in them in a way that addresses the relevant problems or goals.
4
Accurately or effectively generates working programs in a computer language that can solve computational problems and finds and fixes bugs that appear in them in a way that addresses the relevant problems or goals and demonstrates a deep grasp of the skill or concept by analyzing, explaining, or justifying the application in a way appropriate to the given context.
#Variables
Identify and classify the relevant variables of a system, problem, or model.
An important initial step in analyzing a problem, model, or system, is to evaluate the relevant variables: the features that can change. To do so, one must identify and classify the relevant variables, examine the relationships between them, and consider how they can be measured and manipulated. In the context of a study or experiment, this typically involves identifying which features play the role of independent variables, the hypothesized predictors or causes, and which play the role of the dependent variable, the hypothesized effect that results from changes in the independent variable(s). Careful thought must also be given to extraneous variables that could also affect the dependent variable but are not intentionally under study and consider how to control them (keep them constant). Further, classifying the type of variables (for example, qualitative or quantitative) is essential as it will affect the type of statistical analyses that are appropriate. Representing a problem formally may also involve identifying the quantities that are constant (not varying), usually referred to as parameters. Parameters help to define the nature of the problem and can be treated as constraints on the solution.
Example
You are working for the customer service department of a restaurant chain, and are interested in determining the degree to which server friendliness (the independent variable) affects tip percentage (the dependent variable). You make the hypothesized causal relationship between variables explicit: you expect that friendly attendants will lead to satisfied customers, and that satisfied customers will provide a larger tip. This is supported by your previous research findings that people in a positive mood tend to be less stingy. You’re aware of several other extraneous variables that could affect tip percentages such as food quality, customer income, type of meal ordered, and time of day. This leads you to consider methods to diminish the effects of these variables on your results, attempting to control for them as much as possible. You also classify what type of variables these are, noting that server friendliness is usually ranked on an ordinal scale, which would change the type of statistical analyses performed on the collected data, leading you to suggest alternative ways to measure friendliness on the customer survey. Identifying and classifying these variables is your first step in formally representing the problem of interest which can now be further developed.
Rubric
0
Did not submit the assigned item, or submitted something that does not meet the minimum standard.
1
Does not formally represent a problem when prompted; identifies or classifies variables of a system, model, or problem mostly or entirely inaccurately; defines independent or dependent variables of a system, model, or problem mostly or entirely inaccurately; does not demonstrate or recall the distinction between variables and parameters.
2
Identifies and classifies variables of a system, model, or problem only somewhat accurately; defines independent or dependent variables of a system, model, or problem only somewhat accurately; (when applicable) does not accurately distinguish between variables and parameters.
3
Identifies and classifies variables of a system, model, or problem mostly or entirely accurately; defines independent and dependent variables of a system, model, or problem mostly or entirely accurately; (when applicable) distinguishes between the variables and the parameters of a system mostly or entirely accurately.
4
Accurately identifies, classifies, and provides a detailed description of the variables and parameters of a system, model, or problem; accurately defines and provides a detailed description of the relationship between the independent and dependent variables of a system, model, or problem; (when applicable) accurately and effectively identifies all relevant variables and parameters of a complex, sophisticated system, providing clear explanation on the distinction between them.
#DescriptiveStats
Calculate and interpret descriptive statistics appropriately.
Descriptive statistics describe properties of a set of data. Two important categories include measures of location (such as mean, median, mode) and measures of spread (such as standard deviation and range). Instead of considering each data point separately, such statistics provide an overview of key properties of the entire set of data. The mean, median and mode can be very different if the distribution of scores is skewed or there are big outliers. It is important to use the relevant descriptive statistics for a particular purpose, to interpret them appropriately, and to recognize their uses and limitations.
Example
You want to choose a career, and you start by conducting market research on salaries of different jobs. Looking into how much physicians make on average, you see the mean is the highest among the careers you consider. But when checking the median salary, you discover that it is worse than what a ship’s first mate makes! Plotting a histogram of the physician salaries from a database, you find that the mean is positively skewed by heart surgeons and a few other high-paying specialists. Additionally, you see that the standard deviation is large, indicating that there is substantial variation from the mean salary. You note that, while the mean is higher for physicians, you would have to be one of the few high earning physicians to earn a lot. Because of the higher median and low variance of the first mate’s salary, you infer that this could actually be a safer choice: mid-level shipmates make more than mid-level physicians. In other words, in the top 50% of earners, the lowest-paid shipmates make more than the lowest-paid physicians. However, your ambition and love of medicine drive you to go for the ultimately higher earning potential of physicians, with the caveat that you will have to specialize in your discipline to achieve your desired salary (as your research shows that being a mid-level physician is not worth it, comparatively).
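The gap between mean and median in a skewed distribution can be seen with a handful of numbers. The salary figures below are invented purely for illustration and do not come from any real salary data.

```python
import statistics

# Made-up salaries in thousands: six typical earners and one very high earner.
salaries = [90, 95, 100, 100, 105, 110, 600]

mean_salary = statistics.mean(salaries)      # pulled upward by the 600
median_salary = statistics.median(salaries)  # middle value, robust to the 600
mode_salary = statistics.mode(salaries)      # most frequent value
```

Here the single high earner lifts the mean to roughly 171 while the median and mode both stay at 100, which is why the median is often the better description of a "typical" value in positively skewed data.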
Rubric
0
Did not submit the assigned item, or submitted something that does not meet the minimum standard.
1
Does not demonstrate knowledge of descriptive statistics when prompted; calculates or interprets a descriptive statistic mostly or entirely inaccurately; (when applicable) creates or interprets a histogram for the data mostly or entirely inaccurately.
2
Chooses an inappropriate descriptive statistic; calculates or interprets a descriptive statistic only somewhat accurately; (when applicable) creates or interprets a histogram for the data only somewhat accurately.
3
Chooses an appropriate descriptive statistic; calculates or interprets a descriptive statistic mostly or entirely accurately; (when applicable) creates or interprets a histogram for the data mostly or entirely accurately.
4
Chooses an appropriate descriptive statistic and justifies the choice; accurately calculates a descriptive statistic with clear detailed steps; provides a well reasoned and justified interpretation of the statistic; (when applicable) correctly creates a histogram for the data and provides detailed interpretation; (when applicable) applies multiple descriptive statistics to create a robust analysis of the data.
#Correlation
Apply and interpret measures of correlation; distinguish correlation and causation.
A correlation, most often quantified by Pearson’s correlation coefficient (r), indicates the degree of interdependence of the values of two variables. This observation can in some cases be used to estimate how likely two events or characteristics are to occur together, but care must be taken to apply and interpret correlation appropriately. Correlation does not indicate that one event or characteristic causes the other: Both can be caused by a third event or characteristic or can be merely coincidental.
Example
You are interning as a business analyst for a large multilateral development bank. You are working on an annual report of projects under your department, and your manager asked you to examine the relationship between projects’ duration and their media coverage, quantified based on a linear formula. You calculate the correlation coefficient from the data and find an r-value of 0.48, which suggests a positive and moderately strong correlation between the variables of interest. This seems to indicate that media coverage increases linearly with project duration. However, features in the scatterplot make you hesitant about the relationship. Firstly, your boss was mainly interested in the relationship between these variables for projects with a very long duration, for which you have next to no data points. You know that while the relationship may appear somewhat linear over the given range of available data, it may be non-linear for long projects, and you should be cautious to extrapolate. Secondly, you notice substantial deviations from homoscedasticity (equal variance) which indicates there might be a more complicated relationship between the variables. For that reason, the r-value should not be taken as reliable evidence for a linear relationship, let alone a causal connection. It’s possible, for example, that both have a common cause, such as the amount of funding for the project or the size of the client.
Rubric
0
Did not submit the assigned item, or submitted something that does not meet the minimum standard.
1
Does not demonstrate knowledge of correlation when prompted; computes or interprets the correlation coefficient mostly or entirely inaccurately; confuses correlation and causation.
2
Computes or interprets the correlation coefficient only somewhat accurately; distinguishes between correlation and causation only somewhat accurately.
3
Computes or interprets the correlation coefficient mostly or entirely accurately; recognizes the difference between correlation and causation; (when applicable) identifies extraneous variables that could be the underlying cause of the correlation.
4
Accurately computes or interprets the correlation coefficient and provides a clear and detailed explanation within the given context; recognizes and effectively explains the difference between correlation and causation within the given context; (when applicable) identifies nontrivial extraneous variables that could be the underlying cause of the correlation and effectively explains the potential links.
#Visualizations
Interpret, analyze, and create data visualizations.
Typically, one can understand data best by looking at it from multiple different perspectives, which can be facilitated by data visualization techniques, because the human brain did not evolve to rapidly scan and make sense of columns of data. Histograms, cumulative histograms, difference histograms, bar graphs, line graphs, scatter plots, and many other types of graphics can provide insights into how to ask and answer questions. Different data visualization techniques have different strengths and weaknesses, and one must consider properties of the data and the questions of interest when deciding how best to use these tools.
Example
Your marketing firm is working with a company that is trying to increase product sales from existing customers. You have access to a range of data and create a few graphs to highlight potential areas of opportunity. You wonder if there are seasonal trends, so you construct a line graph of monthly sales, which shows that sales substantially dip during winter months. A line graph is appropriate because of the temporal relationship the time series data have. Next, you wonder if particular demographic groups drive sales, so you create a bar graph to compare those groups. A bar graph is appropriate because demographic group membership is a categorical variable. Last, you wonder if income levels may relate to product sales, so you construct a scatter plot with income levels on the x axis and product sales per customer on the y axis. A scatter plot is appropriate because the two variables are continuous variables and you want to evaluate the relationship between them. For each graph, you ensure that you have proper axis labels with units, and a brief but informative caption. The insights the data visualizations provide help guide the marketing strategy.
Rubric
0
Did not submit the assigned item, or submitted something that does not meet the minimum standard.
1
Does not critique or apply tools for generating data visualizations when prompted, or does so entirely or mostly ineffectively; does not analyze or interpret data visualizations when prompted, or does so mostly or entirely inaccurately.
2
Critiques a data visualization only somewhat accurately; generates a somewhat effective data visualization; analyzes or interprets a data visualization only somewhat accurately.
3
Effectively generates a data visualization; effectively analyzes or interprets a data visualization; (when applicable) effectively critiques a data visualization.
4
Effectively generates a detailed data visualization appropriate for the data; effectively analyzes or interprets a data visualization and provides appropriate justification and details; (when applicable) effectively critiques a data visualization and clearly explains why certain methods are better for distinct data sets.