Written Assignment 3 (Group Assignment)
Due: upload a single .doc, .docx, or .pdf file to Canvas by 11:45 PM on Friday, November
20th (note new due date).
There are three data sets with four quantitative variables for this assignment. Each group member is
to pick one of the datasets and one quantitative variable for this assignment. Each group member
must use a different variable. The options are:
1. Federal Election Commission Independent Expenditures for the 2012 and 2016 (to date)
Presidential primaries. This is a rather large data set. I have created a new data set that only
includes expenditures ≥ $10 and < $5,000. The data set is PresCandIndExpenditures.csv and the
variable of interest is “Expenditure Amount.” There are other variables in the file, but you only
will be using “Expenditure Amount” for this assignment. The data were downloaded from
http://www.fec.gov/data/DataCatalog.do.
2. Heights and weights of Major League Baseball players. This data set includes two quantitative
variables. You pick either height or weight. The data set is MLBHeightWeightData.xlsx. Two
students from a group can use this data set, provided you use different variables. These data
were downloaded from
http://wiki.stat.ucla.edu/socr/index.php/SOCR_Data_MLB_HeightsWeights.
3. Time to a major breakdown data for used cars sold by “shady” used car dealerships. These data
were simulated. This data set only includes a single variable and is named
UsedCarBreakDowns.csv.
Assignment
We are treating the data for each variable as a population. Each group member will be
responsible for completing steps 1 – 4 for their variable. Step 5 is to be completed as a single
group response.
1. Describe your population. Your descriptions should be limited to your single variable and
should include visual, numerical, and verbal descriptions. Since we are treating the data as the
population, you need to make sure you use the correct formula for finding a population standard
deviation. Many software packages default to a sample SD. In StatCrunch you can get a
population SD by requesting “Unadj. std. dev.” from the selection of Summary statistics.
2. Use the mean and SD of your variable to create a normal model. Compare the normal model to
the distribution of your variable and explain why you think the model is or is not useful.
3. Draw 100 simple random samples (sampling with replacement) from your population for each
of n = 10, n = 25, and n = 50. For each sample calculate the mean and create a histogram of the
sampling distribution of the mean. This will result in three histograms, one for each sample size.
Also calculate the mean and SD for each sampling distribution. Then calculate the theoretical
mean and SD for each of the three sampling distributions based on the mean and SD of your
population. You might end up with a table that looks something like Table 1 shown below (be
sure to adjust your caption as appropriate). Comment on how the means and SDs from the three
sampling distributions you simulated compare to the theoretical means and SDs. Also, comment
on any differences you observe between samples of size n = 10, n = 25 and n = 50.
Table 1. Means and SDs for sampling distributions of the mean.
Sample Size | Mean from Samples | SD from Samples | Theoretical Mean | Theoretical SD
10          |                   |                 |                  |
25          |                   |                 |                  |
50          |                   |                 |                  |
4. Repeat Step 2 for your sampling distribution based on samples of size n = 50 only.
5. As a group, come together with your individual analyses and write a summary that synthesizes
the results. Specifically you want to comment on the types of distributions you observed for the
individual population variables, focusing on shapes and spreads, and how the distributions
compare with the sampling distributions for the mean created in Step 3. Are there any
similarities in the sampling distributions of the means for the various variables? We are looking
for qualitative comparisons for this step. Your sampling distributions were based on 100 means.
Do you think you would have gotten the same results if the sampling distributions were based
on 1,000 means or 10,000 means? How about 50 means?
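For readers who want to check their StatCrunch results another way, Steps 1 and 3 can be sketched in Python. This is only an illustration under stated assumptions: the function names `sampling_distribution` and `summary_row` are made up for this sketch, and the population is simulated stand-in data, not one of the assignment data sets. The sketch relies on two standard facts: the theoretical mean of the sampling distribution equals the population mean, and its theoretical SD equals the population SD divided by the square root of n.

```python
import math
import random
import statistics


def sampling_distribution(population, n, num_samples=100, rng=None):
    """Means of num_samples simple random samples of size n (with replacement)."""
    if rng is None:
        rng = random.Random()
    return [statistics.fmean(rng.choices(population, k=n))
            for _ in range(num_samples)]


def summary_row(population, n, num_samples=100, rng=None):
    """One row of a table like Table 1 for sample size n."""
    mu = statistics.fmean(population)
    sigma = statistics.pstdev(population)  # population SD: divides by N, not N - 1
    means = sampling_distribution(population, n, num_samples, rng)
    return {
        "n": n,
        "mean_from_samples": statistics.fmean(means),
        # Assumption: the simulated means are summarized with the sample SD.
        "sd_from_samples": statistics.stdev(means),
        "theoretical_mean": mu,
        "theoretical_sd": sigma / math.sqrt(n),
    }


# Hypothetical right-skewed population, used ONLY to make the sketch runnable.
rng = random.Random(2015)
population = [rng.expovariate(1 / 30) for _ in range(5000)]

for n in (10, 25, 50):
    row = summary_row(population, n, rng=rng)
    print(n,
          round(row["mean_from_samples"], 2), round(row["sd_from_samples"], 2),
          round(row["theoretical_mean"], 2), round(row["theoretical_sd"], 2))
```

Changing `num_samples` to 50, 1,000, or 10,000 is one way to explore the question at the end of Step 5.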
You are to submit one report per group. Make sure your group number as well as individual group
member names are included on the report. Also indicate which variable each group member was
responsible for. Your report should be organized as follows:
• Page 1: Group summary—limited to one page.
• Pages 2-7 for groups of three, or pages 2-9 for groups of four: the population
summaries, limited to a maximum of 2 pages per variable. See the directions at the end of
this handout regarding overlaying normal distributions onto histograms.
StatCrunch has all the capabilities you need to do this assignment. To select simple random samples
from your population, go to Data > Sample.
• In “Select columns:” click on your variable name.
• In “Sample size:” enter 10, 25, or 50 (note you will have to do this three times).
• In “Number of samples:” enter 100.
• In “Sampling options:” check “Sample with replacement”
• In “Store samples:” select the middle radio button “Stacked with a sample id.”
• Leave everything else and click “Compute!”
o You may get a warning window that says “Whoa!! Lots of unique numeric
values for Sample. Want to turn on binning for this procedure?” If this happens, click
“Cancel.”
o This will add two columns of data to your data set. One column will be the values of
your variable that were selected for the samples. The second column will be a sample
identifier; e.g., sample 1, sample 2, …, sample 100.
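The stacked layout these sampling steps produce (one column of sampled values, one column of sample identifiers) can be mimicked in Python as a rough sketch. The function name `stacked_samples` and the example values are hypothetical; this is not how StatCrunch works internally, only a way to picture the output.

```python
import random


def stacked_samples(values, sample_size, num_samples=100, seed=None):
    """Draw num_samples samples with replacement and stack them as
    (sample_id, value) rows, with sample ids running from 1 to num_samples."""
    rng = random.Random(seed)
    rows = []
    for sample_id in range(1, num_samples + 1):
        for value in rng.choices(values, k=sample_size):
            rows.append((sample_id, value))
    return rows


# Hypothetical column of values standing in for your variable.
column = [12.5, 40.0, 18.25, 99.0, 250.0, 75.5]

rows = stacked_samples(column, sample_size=10, num_samples=100, seed=42)
print(len(rows))   # 1000 rows: 100 samples of 10 values each
print(rows[:3])    # first few (sample_id, value) pairs
```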
You can now get the means for each of your samples by going to Stat > Summary Stats.
• In “Select column(s):” choose the column for the sample values
• In “Grouping by:” choose the column that has the sample numbers (1, … , 100)
• In “Statistics:” choose “Mean”
• In “Output:” check “Store in data table”
• Click “Compute!” This will create a new column with the means from each of your
samples. These are the values that will be used to describe the sampling distribution for the
given sample size. You can then use Stat > Summary Stats again with this new column to
get the mean and SD for the sampling distribution based on the 100 samples.
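The grouping step above amounts to averaging the values within each sample id and then summarizing those averages. A minimal Python sketch, assuming the stacked data are available as (sample_id, value) pairs; the helper name `means_by_sample` and the tiny data set are hypothetical.

```python
import statistics
from collections import defaultdict


def means_by_sample(rows):
    """Group (sample_id, value) rows by sample id and return each sample's mean."""
    groups = defaultdict(list)
    for sample_id, value in rows:
        groups[sample_id].append(value)
    return {sid: statistics.fmean(vals) for sid, vals in sorted(groups.items())}


# Hypothetical stacked data: three samples of size two.
rows = [(1, 10.0), (1, 20.0), (2, 30.0), (2, 50.0), (3, 5.0), (3, 15.0)]

sample_means = means_by_sample(rows)
print(sample_means)  # {1: 15.0, 2: 40.0, 3: 10.0}

# Mean and SD of the sample means describe the simulated sampling distribution.
values = list(sample_means.values())
print(statistics.fmean(values))
print(statistics.stdev(values))
```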
I recommend that you create the data for all three sampling distributions first. Then you can get the
summary stats (means and SDs) at once. You can also create the histograms at once and use the
“For multiple graphs:” option to get all the histograms in a single figure with same-scaled axes. To
save on space and the number of figures you have to create, overlay normal distributions onto
the histograms. This way you can combine your figures for Steps 1 and 2, and Steps 3 and 4.
You can get overlays of normal distributions by selecting Graph > Histogram.
• In “Display options:” > “Overlay distrib:” choose Normal and enter an appropriate Mean
and Std. Dev.
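The comparison behind a normal overlay can also be done numerically: for each histogram bin, compare the observed count with the count a normal model with the same mean and SD would predict. The sketch below is an assumption-laden illustration, not a replacement for the StatCrunch figure. The function name `histogram_with_normal`, the choice of ten equal-width bins, and the simulated data are all hypothetical, and the data are assumed to contain at least two distinct values.

```python
import random
import statistics
from statistics import NormalDist


def histogram_with_normal(values, num_bins=10):
    """Return (left, right, observed, expected) for equal-width bins, where
    expected is the count predicted by a normal model with the data's
    mean and population SD."""
    model = NormalDist(statistics.fmean(values), statistics.pstdev(values))
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    rows = []
    for i in range(num_bins):
        left = lo + i * width
        last = (i == num_bins - 1)
        right = hi if last else lo + (i + 1) * width
        # The last bin includes its right edge so the maximum is counted.
        observed = sum(1 for v in values
                       if left <= v and (v <= right if last else v < right))
        expected = len(values) * (model.cdf(right) - model.cdf(left))
        rows.append((left, right, observed, expected))
    return rows


# Hypothetical right-skewed data, used ONLY to make the sketch runnable.
rng = random.Random(7)
data = [rng.expovariate(1 / 30) for _ in range(1000)]

for left, right, observed, expected in histogram_with_normal(data):
    print(f"{left:7.1f} to {right:7.1f}: observed {observed:4d}, normal model {expected:7.1f}")
```

Large gaps between the observed and expected counts, especially in the tails, suggest the normal model is a poor description of the distribution.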