What is a scatter plot?

What is a line of fit?

How do we fit a line to a plot?

Try fitting a curve to some scatter plots yourself. We have some scatter plots printed out for you up front. (They're also linked on the Datasets page.) Now go ahead and sketch a line of fit onto each plot. Try to have about as many points above the curve as below it, and keep each individual point as close to your curve as you can manage. Not all of the plots will have a straight line as the best-fitting curve.
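If you want to check a hand-drawn straight line against the two rules of thumb above, a short sketch like the following may help. The function name `check_line` and the data points are made up for illustration; they are not part of the course programs or datasets.

```python
def check_line(xs, ys, slope, intercept):
    """Count points above and below the line y = slope*x + intercept,
    and add up the vertical distances from the points to the line."""
    above = 0
    below = 0
    total_distance = 0.0
    for x, y in zip(xs, ys):
        line_y = slope * x + intercept
        if y > line_y:
            above += 1
        elif y < line_y:
            below += 1
        total_distance += abs(y - line_y)
    return above, below, total_distance

# Made-up points and a candidate line y = 2x (illustration only).
xs = [1, 2, 3, 4, 5]
ys = [2.5, 3.5, 6.5, 7.5, 10.5]
above, below, total_distance = check_line(xs, ys, 2.0, 0.0)
print("above:", above, "below:", below, "total distance:", total_distance)
```

A good candidate line has roughly equal counts above and below and a small total distance. Try a few different slopes and intercepts and watch how the numbers change.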

Make a guess as to what sort of functions might match the curves you have drawn! (You can use the plotting function from the previous module to check your answers.)

Suppose you want to open up an animal shelter and take care of cats. You've surveyed a bunch of other animal shelters to find out how many cats they currently have, and how much money they spend on cat food per week. It might be nice to be able to see if there's any relationship between these two things. Let's make a scatter plot!

- Download the program **lineFit** into your programs directory. Save it as **lineFit.py**.
- Download **this dataset** into your programs directory. Save it as **cats**.
- Open your terminal, and load **lineFit.py** in gedit to take a look at it:

  `cd programs/`

  `gedit lineFit.py&`
- We want to plot the data in cats, so change the filename being passed to the **loadData()** function from 'data' to 'cats'. Save the changes.
- Go back to your terminal and start IPython:

  `ipython`
- Now run your program! Type:

  `run lineFit.py`

Take a look at the scatter plot this produces. Can you tell what sort of function the line of fit might be?
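For a rough idea of what a program like this has to do, here is a minimal sketch of loading pairs of numbers from a file and drawing them as a scatter plot. This is not the code in lineFit.py: the names `load_pairs` and `draw_scatter` are hypothetical, and the file layout (two numbers per line, separated by spaces or a comma) is an assumption about how such a dataset might be stored.

```python
def load_pairs(filename):
    """Read (x, y) pairs from a text file.

    Assumed format: two numbers per line, separated by spaces or a comma.
    Lines that do not fit this pattern are skipped.
    """
    xs, ys = [], []
    with open(filename) as f:
        for line in f:
            parts = line.replace(",", " ").split()
            try:
                x, y = float(parts[0]), float(parts[1])
            except (IndexError, ValueError):
                continue  # blank or malformed line
            xs.append(x)
            ys.append(y)
    return xs, ys

def draw_scatter(xs, ys):
    """Show the points as unconnected dots."""
    # Imported here so that load_pairs can be used without matplotlib installed.
    import matplotlib.pyplot as plt
    plt.scatter(xs, ys)
    plt.xlabel("number of cats")
    plt.ylabel("cat food cost per week")
    plt.show()
```

If your file matches the assumed format, calling `draw_scatter(*load_pairs('cats'))` from IPython should open a plot window.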

Now let's fit a line to the plot:

- Go back to **lineFit.py** in gedit.
- Comment out the line where we called the **drawPoints()** function by putting a hash symbol at the beginning of the line:

  **#**drawPoints(xvals, yvals, connect=False)
- Take a look at the last line of the program:

  `#drawFitLine(xvals, yvals, 1)`

  This line will plot a scatter plot using the points stored in xvals and yvals, and then it will fit a line to the data. The third parameter (currently a '1') tells the program what degree of function to try to fit to the plot. 1 means a linear function, 2 means quadratic, etc.
- Right now the last line is a comment, so it won't be run. Uncomment the last line by erasing the hash symbol and save the change.
- Close your old plot, then run the program again.

In addition to drawing a line, the program tells you the function for the line it came up with. Can you use the function it gives to predict about how much it might cost per week for 11 cats?
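To make the prediction step concrete, here is a minimal sketch of a straight-line fit followed by a prediction, in plain Python. It uses the least-squares method, which is one common way to choose a line of fit; lineFit.py may work differently. The helper names `fit_line` and `predict` are hypothetical, and the numbers are made up (they lie exactly on the line cost = 3 * cats + 1), so they are not the cats dataset.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Evaluate the fitted line at x."""
    return slope * x + intercept

# Made-up survey results (illustration only).
cats = [2, 4, 6, 8, 10]
cost = [7.0, 13.0, 19.0, 25.0, 31.0]

slope, intercept = fit_line(cats, cost)
print("cost = %.2f * cats + %.2f" % (slope, intercept))
print("predicted weekly cost for 11 cats:", predict(slope, intercept, 11))
```

With your own results, plug the slope and intercept that the program reports into the same formula: multiply the slope by 11 and add the intercept.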

Choose one of the other scatter plot datasets from **here**
(any of the datasets under the heading **Functions**) and save it to your programs directory.
Try running lineFit.py using it. If the line doesn't seem to fit very well, try changing the
degree and running the program again.
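One way to compare degrees with numbers instead of by eye is to add up the squared distances between the points and the fitted curve. The sketch below does this with NumPy's `polyfit` and `polyval`. The helper `fit_error` is a hypothetical name, and the points are made up (they follow y = x squared); they are not one of the course datasets.

```python
import numpy as np

# Made-up points that follow a curve, not a straight line (illustration only).
xs = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
ys = xs ** 2

def fit_error(xs, ys, degree):
    """Fit a polynomial of the given degree and return the sum of
    squared differences between the data and the fitted curve."""
    coeffs = np.polyfit(xs, ys, degree)
    fitted = np.polyval(coeffs, xs)
    return float(np.sum((ys - fitted) ** 2))

for degree in (1, 2, 3):
    print("degree", degree, "error", fit_error(xs, ys, degree))
```

Raising the degree never makes this error larger on the points being fitted, so a small error alone does not prove that a degree is the right choice. Look at the plot as well and prefer the lowest degree that follows the overall shape of the data.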

Be sure to save one of the plots for your website, and answer these questions.

- Download **this dataset** into your programs directory. Save it as **outliers**.
- Change the filename being passed to the **loadData()** function in lineFit.py to 'outliers'. Save the changes.
- Run the program and take a look at the plot.

How well does the line of fit seem to fit the scatterplot?

Outliers are points which are very far away from most of the other points in a scatterplot. Look at the plot again; can you find some outliers? How do you think the outliers might impact the line of fit?

- Download **this dataset** into your programs directory. Save it as **no_outliers**.
- Change the filename being passed to the **loadData()** function in lineFit.py from 'outliers' to 'no_outliers' (it must match the name you saved the file under). Save the changes.
- Run the program and take a look at the plot.

Compare this plot to the one with outliers. This dataset is the same as the previous dataset, with the outliers taken out. Can you tell how it changed the line of fit?
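The effect can also be seen on a tiny made-up example. The sketch below fits a straight line twice with NumPy's `polyfit`: once to all of the points and once with a single outlier left out. The numbers are invented for illustration and are not the outliers dataset.

```python
import numpy as np

# Made-up data: five points on the line y = 2x, plus one point far above it.
xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
ys = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 40.0])  # the last point is the outlier

# Degree 1 means a straight line; polyfit returns the slope first, then the intercept.
slope_all, intercept_all = np.polyfit(xs, ys, 1)
slope_clean, intercept_clean = np.polyfit(xs[:-1], ys[:-1], 1)

print("with outlier:    slope %.2f, intercept %.2f" % (slope_all, intercept_all))
print("without outlier: slope %.2f, intercept %.2f" % (slope_clean, intercept_clean))
```

A single distant point is enough to pull the fitted line well away from the other five points.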

Save one of the plots for your website. In addition to the questions here, please answer these questions on your website:

- How do outliers affect a line of fit?
- Which line of fit do you think is better: the one with the outliers, or the one with the outliers removed?