diff --git a/assignment/assignment_programming.ipynb b/assignment/assignment_programming.ipynb index 299e0bf..2bdadc4 100644 --- a/assignment/assignment_programming.ipynb +++ b/assignment/assignment_programming.ipynb @@ -1,185 +1,880 @@ { - "cells": [ - { - "cell_type": "markdown", - "id": "f7c24d5d-9973-45a1-83be-8bca8b03e576", - "metadata": {}, - "source": [ - "# Assignment: Programming Review\n", - "## Do Q1 and one other question." - ] + "cells": [ + { + "cell_type": "markdown", + "id": "f7c24d5d-9973-45a1-83be-8bca8b03e576", + "metadata": { + "id": "f7c24d5d-9973-45a1-83be-8bca8b03e576" + }, + "source": [ + "# Assignment: Programming Review\n", + "## Do Q1 and one other question." + ] + }, + { + "cell_type": "markdown", + "id": "4a3fb7b5-0345-447d-840a-59f667fe9c0c", + "metadata": { + "id": "4a3fb7b5-0345-447d-840a-59f667fe9c0c" + }, + "source": [ + "**Q1.** First, think about your priorities in life. What kind of salary do you want to make after graduation? Do you mind getting more schooling? What kind of work-life balance are you looking for? Where do you want to work, geographically? You don't have to write this down here, just think about it. \n", + "\n", + "1. Go to the Occupational Outlook Handbook at [https://www.bls.gov/ooh/](https://www.bls.gov/ooh/). Look up \"Data Scientist.\" Read about the job and start collecting data about it from the job profile (e.g. salary, education required, work setting)\n", + "2. Find 7-10 other jobs that appeal to you, and collect the same data as you did for Data Scientist. Put it all in a spreadsheet.\n", + "3. Do any of your findings surprise you?\n", + "4. Rank the jobs you picked from best to worst, and briefly explain why you did so.\n", + "5. Please submit your spreadsheet with the assignment --- you can \"de-identify\" it and remove anything that you find personally identifying or you don't want to share, of course. 
We'll play with these data later.\n" + ] + }, + { + "cell_type": "markdown", + "id": "7e9d65ad-3740-43d3-a944-b3653fbeb80c", + "metadata": { + "id": "7e9d65ad-3740-43d3-a944-b3653fbeb80c" + }, + "source": [ + "Depends on student opinions." + ] + }, + { + "cell_type": "markdown", + "source": [ + "3: I was surprised that the Software Development jobs had a high future growth rate. With the rise of AI, I would expect a lot of Software Development jobs to start getting replaced. So, I am surprised the growth is so high. I was also surprised that teaching has negative growth. I feel like teaching is a steady job. It is not going to be replaced. It is also a needed job. I was also surprised there are so few aerospace majors. I assumed because it is at most engineering schools, the major's job would be more common.\n", + "\n", + "4:\n", + "1. Data Scientist\n", + "2. Software Developers, Quality Assurance Analysts, and Testers\n", + "3. Aerospace Engineer\n", + "4. Mechanical Engineer\n", + "5. Athletes and Sports Competitors\n", + "6. Airline and Commercial Pilots\n", + "7. Plumbers, Pipefitters, and Steamfitters\n", + "8. Automotive Service Technicians and Mechanics\n", + "9. Kindergarten and Elementary School Teachers\n", + "\n", + "I chose this order because I wanted to prioritize projected job growth, median salary, and work setting. Data Scientist and Software Developers are at the top because of their super high growth rate and salary. Aerospace Engineers have a high median salary. Mechanical Engineers and Athletes have a decent salary and a high growth rate. Pilots are only 6th because of the work setting despite the high salary. Plumbers and Mechanics are low due to their low salary. Kindergarten and Elementary School Teachers are last because of their low median salary and negative growth." 
+ ], + "metadata": { + "id": "F24Nz1VWxyoh" + }, + "id": "F24Nz1VWxyoh" + }, + { + "cell_type": "markdown", + "id": "7cfa75b2-aaef-4368-a043-93437887879c", + "metadata": { + "id": "7cfa75b2-aaef-4368-a043-93437887879c" + }, + "source": [ + "**Q2.** Being able to make basic plots to visualize sets of points is really helpful, particularly with data analysis. The pyplot plots are built up slowly by defining elements of the plot, and then using `plt.show()` to create the final plot. This question gives you some practice doing that **iterative** building process.\n", + "\n", + "1. Import the `numpy` module as `np` and the `matplotlib.pyplot` package as `plt`.\n", + "2. Use `np.linspace` to create a grid of 50 points ranging from 0 to 1.\n", + "3. Create a numpy array $y$ containing the values for the natural logarithm function using the `np.log(x)` function. Create a numpy array $z$ containing the values for the exponential function using the `np.exp(x)` function.\n", + "4. Use the `plt.scatter(x,y)` method for the $y$ and $z$ vectors to create two scatter plots of the points you've created.\n", + "5. Use the `plt.show()` method to create the plot.\n", + "\n", + "That plot has some problems.\n", + "\n", + "6. Before the `plt.show()` call, add labels to the $x$ and $y$ axes using `plt.xlabel(label)` and `plt.ylabel(label)`. Add a title to the graph using `plt.title(title)`, like \"Natural Log and Exponential Functions\".\n", + "7. That looks a lot better, but we need a legend. When you create the scatter plots, add the argument `label='Natural Log'` or `label='Exponential'` to the `.scatter` method call. Before the `plt.show()` call, add a `plt.legend(loc = 'lower right')` method call, which creates a legend in the lower right.\n", + "\n", + "Now do it again, with slightly less direction:\n", + "\n", + "8. Create a grid of 100 equally spaced points ranging from -6.5 to 6.5.\n", + "9. 
Use the sine and cosine functions in Numpy to compute the values of those functions for each point on your grid. (You'll have to find out what those functions are.)\n", + "10. Plot the values of the two functions for the values on the grid on the same plot.\n", + "11. The scatter plot looks really noisy. Instead of `plt.scatter(x,y)` to make a scatter plot, use `plt.plot(x,y)` to make a line graph.\n", + "12. Make the plot again, with labels for the axes, a title, and a legend in the lower left instead of the lower right.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "9c05ee86-06dc-4ae9-896d-0cb70c4d6c3d", + "metadata": { + "id": "9c05ee86-06dc-4ae9-896d-0cb70c4d6c3d", + "outputId": "0d9769db-dc95-41a5-cdea-a671c8e6d95f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 637 + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + ":8: RuntimeWarning: divide by zero encountered in log\n", + " y = np.log(x)\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "
matplotlib.pyplot.show
def show(*args, **kwargs) -> None
/usr/local/lib/python3.11/dist-packages/matplotlib/pyplot.pyDisplay all open figures.\n",
+              "\n",
+              "Parameters\n",
+              "----------\n",
+              "block : bool, optional\n",
+              "    Whether to wait for all figures to be closed before returning.\n",
+              "\n",
+              "    If `True` block and run the GUI main loop until all figure windows\n",
+              "    are closed.\n",
+              "\n",
+              "    If `False` ensure that all figure windows are displayed and return\n",
+              "    immediately.  In this case, you are responsible for ensuring\n",
+              "    that the event loop is running to have responsive figures.\n",
+              "\n",
+              "    Defaults to True in non-interactive mode and to False in interactive\n",
+              "    mode (see `.pyplot.isinteractive`).\n",
+              "\n",
+              "See Also\n",
+              "--------\n",
+              "ion : Enable interactive mode, which shows / updates the figure after\n",
+              "      every plotting command, so that calling ``show()`` is not necessary.\n",
+              "ioff : Disable interactive mode.\n",
+              "savefig : Save the figure to an image file instead of showing it on screen.\n",
+              "\n",
+              "Notes\n",
+              "-----\n",
+              "**Saving figures to file and showing a window at the same time**\n",
+              "\n",
+              "If you want an image file as well as a user interface window, use\n",
+              "`.pyplot.savefig` before `.pyplot.show`. At the end of (a blocking)\n",
+              "``show()`` the figure is closed and thus unregistered from pyplot. Calling\n",
+              "`.pyplot.savefig` afterwards would save a new and thus empty figure. This\n",
+              "limitation of command order does not apply if the show is non-blocking or\n",
+              "if you keep a reference to the figure and use `.Figure.savefig`.\n",
+              "\n",
+              "**Auto-show in jupyter notebooks**\n",
+              "\n",
+              "The jupyter backends (activated via ``%matplotlib inline``,\n",
+              "``%matplotlib notebook``, or ``%matplotlib widget``), call ``show()`` at\n",
+              "the end of every cell by default. Thus, you usually don't have to call it\n",
+              "explicitly there.
\n", + " \n", + "
" + ] + }, + "metadata": {}, + "execution_count": 2 + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# First Part\n", + "\n", + "x = np.linspace(0,1)\n", + "\n", + "y = np.log(x)\n", + "z = np.exp(x)\n", + "plt.scatter(x,y)\n", + "plt.scatter(x,z)\n", + "\n", + "\n", + "plt.show\n", + "\n" + ] + }, + { + "cell_type": "code", + "source": [ + "# First Part with labels\n", + "\n", + "x = np.linspace(0,1)\n", + "\n", + "y = np.log(x)\n", + "z = np.exp(x)\n", + "plt.scatter(x,y, label='Natural Log')\n", + "plt.scatter(x,z, label='Exponential')\n", + "\n", + "plt.xlabel(\"X axis\")\n", + "\n", + "plt.ylabel(\"Y axis\")\n", + "\n", + "plt.title(\"Natural Log and Exponential Functions\")\n", + "\n", + "plt.legend(loc = 'lower right')\n", + "\n", + "plt.show\n", + "\n" + ], + "metadata": { + "id": "l79ujsxA-fCT", + "outputId": "9bd22e5c-1edc-45e7-a51c-0d5c5cc09238", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 676 + } + }, + "id": "l79ujsxA-fCT", + "execution_count": 4, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + ":5: RuntimeWarning: divide by zero encountered in log\n", + " y = np.log(x)\n" + ] + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "
matplotlib.pyplot.show
def show(*args, **kwargs) -> None
/usr/local/lib/python3.11/dist-packages/matplotlib/pyplot.pyDisplay all open figures.\n",
+              "explicitly there.
\n", + " \n", + "
" + ] + }, + "metadata": {}, + "execution_count": 4 + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "# Second Part\n", + "x = np.linspace(-6.5, 6.5, num = 100)\n", + "s = np.sin(x)\n", + "c = np.cos(x)\n", + "\n", + "plt.scatter(x, s, label='Sin')\n", + "plt.scatter(x, c, label='Cos')\n", + "\n", + "\n", + "\n", + "\n", + "plt.show\n" + ], + "metadata": { + "id": "-DI37RAATmE3", + "outputId": "72939a22-b221-4c9a-cac9-21d26f34267f", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 599 + } + }, + "id": "-DI37RAATmE3", + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "
matplotlib.pyplot.show
def show(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/matplotlib/pyplot.pyDisplay all open figures.\n",
+              "explicitly there.
\n", + " \n", + "
" + ] + }, + "metadata": {}, + "execution_count": 24 + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "# Second Part with Line\n", + "x = np.linspace(-6.5, 6.5, num = 100)\n", + "s = np.sin(x)\n", + "c = np.cos(x)\n", + "\n", + "plt.plot(x, s, label='Sin')\n", + "plt.plot(x, c, label='Cos')\n", + "\n", + "\n", + "\n", + "\n", + "plt.show\n" + ], + "metadata": { + "id": "Sxpa8N1y_CJz", + "outputId": "804ea4e8-8293-4b2e-9100-6d1964c9ce2a", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 599 + } + }, + "id": "Sxpa8N1y_CJz", + "execution_count": 5, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "
matplotlib.pyplot.show
def show(*args, **kwargs) -> None
/usr/local/lib/python3.11/dist-packages/matplotlib/pyplot.pyDisplay all open figures.\n",
+              "explicitly there.
\n", + " \n", + "
" + ] + }, + "metadata": {}, + "execution_count": 5 + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ] + }, + { + "cell_type": "code", + "source": [ + "# Second Part: Line Graph\n", + "x = np.linspace(-6.5, 6.5, num = 100)\n", + "s = np.sin(x)\n", + "c = np.cos(x)\n", + "\n", + "plt.plot(x, s, label='Sin')\n", + "plt.plot(x, c, label='Cos')\n", + "\n", + "\n", + "plt.xlabel(\"X axis\")\n", + "plt.ylabel(\"Y axis\")\n", + "plt.title(\"Sin and Cos Functions\")\n", + "\n", + "\n", + "plt.legend(loc = 'lower left')\n", + "\n", + "plt.show\n" + ], + "metadata": { + "id": "5aXo8SZiT2LQ", + "outputId": "6c0e62ba-937a-4cf9-a120-7ce318d73bfb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 641 + } + }, + "id": "5aXo8SZiT2LQ", + "execution_count": null, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "" + ], + "text/html": [ + "
\n", + "
matplotlib.pyplot.show
def show(*args, **kwargs)
/usr/local/lib/python3.10/dist-packages/matplotlib/pyplot.pyDisplay all open figures.\n",
+              "explicitly there.
\n", + " \n", + "
" + ] + }, + "metadata": {}, + "execution_count": 23 + }, + { + "output_type": "display_data", + "data": { + "text/plain": [ + "
" + ], + "image/png": "\n" + }, + "metadata": {} + } + ] + }, + { + "cell_type": "markdown", + "id": "ab57312f-fd41-4763-b38c-3b7c1b062b1c", + "metadata": { + "id": "ab57312f-fd41-4763-b38c-3b7c1b062b1c" + }, + "source": [ + "**Q3.** This is a basic review of some statistics along with practice writing functions. Like we talked about, Python is a general purpose programming language and ships without basic data handling or statistical packages coded in. The beginning of the code chunk below generates random values for you to test your work on; since the random values are generates as NumPy arrays, you'll need to use the Numpy methods `np.sum(x)` to sum the vector $x$ and `np.sqrt(x)` to take the square roots of the values in $x$, as well as the Python function `len(x)` to get the length of $x$.\n", + "\n", + "Try to reuse the functions you've already defined as you work through the following questions, rather than rewriting code you've already written.\n", + "\n", + "1. Write a function that computes the **sample average** or **mean** of a vector $x$,\n", + "$$\n", + "\\bar{x} = \\dfrac{x_1 + x_2 + ... + x_N}{N} = \\dfrac{\\sum_{i=1}^N x_i}{N}.\n", + "$$\n", + "Write a function in the code chunk below to compute this quantity, and then use it to compute the mean of $x$.\n", + "2. Write a function that computes the **sample standard deviation** of a vector $x$,\n", + "$$\n", + "s_x = \\sqrt{\\dfrac{(x_1 - \\bar{x})^2 + ... (x_N - \\bar{x})^2 }{N-1}} = \\sqrt{\\dfrac{ \\sum_{i=1}^N (x_i - \\bar{x})^2 }{N-1}}.\n", + "$$\n", + "The intuition of this quantity is that it computes roughly the average distance from each point $x_i$ to the sample mean $\\bar{x}$. If it is small, it means all the points are clustered tightly around the mean, and if it is large, it means the points are typically far away from the average. Write a function in the code chunk below to compute this quantity, and then use it to compute the sample standard deviation of $x$.\n", + "3. 
Write a function that calls the previous two to **standardize** the values of the vector as a **$z$-score**:\n", + "$$\n", + "z = \\dfrac{x-\\bar{x}}{s_x}.\n", + "$$\n", + "The intuition of this quantity is that it is recentering all the values of $x$ so the average is zero and then scaling them by the standard deviation. If the data are normally distributed and $N$ is large, the $z$-score will approximately follow a standard normal distribution. Write a function in the code chunk below to compute this quantity, and then use it to compute the z-scores for $x$.\n", + "4. The **sample covariance** of two vectors $x=(x_1,...,x_N)$ and $y=(y_1,...,y_N)$ is defined as\n", + "$$\n", + "cov(x,y) = \\dfrac{(x_1 - \\bar{x})(y_1-\\bar{y}) + (x_2 - \\bar{x})(y_2-\\bar{y}) + ... + (x_N - \\bar{x})(y_{N}-\\bar{y})}{N-1}\n", + "$$\n", + "$$\n", + "= \\dfrac{\\sum_{i=1}^N (x_i - \\bar{x})(y_i - \\bar{y})}{N-1}.\n", + "$$\n", + "The intuition of this quantity is that it looks at the pairs $(x_i, y_i)$ and compares them to the means $(\\bar{x},\\bar{y})$ to determine whether $x$ and $y$ tend to co-vary in the same direction relative to their means: If the values of $x$ and $y$ are typically both above or below the mean values of $x$ and $y$, then $x$ and $y$ will have a positive covariance, but if $x$ is typically above the mean of $x$ when $y$ is typically below the mean of $y$ or vice versa, then they will have a negative covariance. Write a function in the code chunk below to compute this quantity, and then use it to compute the covariance of the generated $x$ and $y$.\n", + "5. The **sample correlation coefficient** of two vectors $x$ and $y$ is defined as\n", + "$$\n", + "r_{xy} = \\dfrac{cov(x,y)}{s_x s_y}\n", + "$$\n", + "Use your functions to create a new function that computes this quantity. 
The intuition of this quantity is that it is like the covariance, but normalized so that its values lie between -1 and 1: perfect negative linear association between the variables at -1, no association at 0, and perfect positive linear association between the variables at 1." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "f42ab788-af0c-47a4-a103-46c9a37bfad3", + "metadata": { + "id": "f42ab788-af0c-47a4-a103-46c9a37bfad3" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import math as math\n", + "np.random.seed(100) # Set the seed for the random number generator\n", + "rho, sigma_x, sigma_y = -.4, 3, 2 # Variance-Covariance Parameters\n", + "vcv = np.array([[sigma_x**2, rho*sigma_x*sigma_y],\n", + " [rho*sigma_x*sigma_y,sigma_y**2]]) # VCV Matrix\n", + "mu = np.array([-1,2]) # Population averages\n", + "sample = np.random.multivariate_normal(mu,vcv,200) # Multivariate normal draws\n", + "x = sample[:,0]\n", + "y = sample[:,1]\n", + "\n", + "#############################################################################\n" + ] + }, + { + "cell_type": "markdown", + "id": "5924279e-4b74-4941-9819-f028ce3db974", + "metadata": { + "id": "5924279e-4b74-4941-9819-f028ce3db974" + }, + "source": [ + "**Q4.** Optimization is at the core of data science and statistics: Picking the best fit or estimate according to some desirable criteria (e.g. Maximum Likelihood Estimation). In this question, you're going to write a function that finds the highest possible value for some function, and returns the value of the function and the maximizing value.\n", + "\n", + "1. Write a function that creates a grid from some value $a$ to a second value $b$ with $N$ steps, computes the value of a function $f$ on that grid, and then finds the maximizers of the function. Have it return a dictionary with the maximum value and the maximizers themselves.\n", + "2. 
Use your function to maximize $f(x) = 100 - 2x^2 + 3x$ for values of $a=-1$, $b=1$, and for values of $N=3$, $N=10$, $N=100$, $N=1000$ and $N=5000$.\n", + "3. The true maximizer of this function is $.75$ and the true maximum value is $101.125$. Is that what you got in the previous step? Why not? How does the quality of the maximization depend on $N$? How does your computed answer change as $N$ gets larger?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c3b32977-7687-4770-8f12-2ab0593f2258", + "metadata": { + "id": "c3b32977-7687-4770-8f12-2ab0593f2258" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "import matplotlib.pyplot as plt\n" + ] + }, + { + "cell_type": "markdown", + "id": "3c6e5e34-00c0-4101-a1af-34cf61d69ffc", + "metadata": { + "id": "3c6e5e34-00c0-4101-a1af-34cf61d69ffc" + }, + "source": [ + "3." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.4" + }, + "colab": { + "provenance": [], + "toc_visible": true + } }, - { - "cell_type": "markdown", - "id": "4a3fb7b5-0345-447d-840a-59f667fe9c0c", - "metadata": {}, - "source": [ - "**Q1.** First, think about your priorities in life. What kind of salary do you want to make after graduation? Do you mind getting more schooling? What kind of work-life balance are you looking for? Where do you want to work, geographically? You don't have to write this down here, just think about it.\n", - "\n", - "1. Go to the Occupational Outlook Handbook at [https://www.bls.gov/ooh/](https://www.bls.gov/ooh/). Look up \"Data Scientist.\" Read about the job and start collecting data about it from the job profile (e.g. 
salary, education required, work setting)\n", - "2. Find 7-10 other jobs that appeal to you, and collect the same data as you did for Data Scientist. Put it all in a spreadsheet.\n", - "3. Do any of your findings surprise you?\n", - "4. Rank the jobs you picked from best to worst, and briefly explain why you did so.\n", - "5. Please submit your spreadsheet with the assignment --- you can \"de-identify\" it and remove anything that you find personally identifying or you don't want to share, of course. We'll play with these data later.\n" - ] - }, - { - "cell_type": "markdown", - "id": "7e9d65ad-3740-43d3-a944-b3653fbeb80c", - "metadata": {}, - "source": [ - "Depends on student opinions." - ] - }, - { - "cell_type": "markdown", - "id": "7cfa75b2-aaef-4368-a043-93437887879c", - "metadata": {}, - "source": [ - "**Q2.** Being able to make basic plots to visualize sets of points is really helpful, particularly with data analysis. The pyplot plots are built up slowly by defining elements of the plot, and then using `plt.show()` to create the final plot. This question gives you some practice doing that **iterative** building process.\n", - "\n", - "1. Import the `numpy` module as `np` and the `matplotlib.pyplot` package as `plt`.\n", - "2. Use `np.linspace` to create a grid of 50 points ranging from 0 to 1.\n", - "3. Create a numpy array $y$ containing the values for the natural logarithm function using the `np.log(x)` function. Create a numpy array $z$ containing the values for the exponential function using the `np.exp(x)` function.\n", - "4. Use the `plt.scatter(x,y)` method for the $y$ and $z$ vectors to create two scatter plots of the points you've created.\n", - "5. Use the `plt.show()` method to create the plot.\n", - "\n", - "That plot has some problems.\n", - "\n", - "6. Before the `plt.show()` call, add labels to the $x$ and $y$ axes using `plt.xlabel(label)` and `plt.ylabel(label)`. 
Add a title to the graph using `plt.title(title)`, like \"Natural Log and Exponential Functions\".\n", - "7. That looks a lot better, but we need a legend. When you screate the scatter plots, add the argument `label='Natural Log'` or `label='Exponential'` to the `.scatter` method call. Before the `plt.show()` call, add a `plt.legend(loc = 'lower right')` method call, which creates a legend in the lower right.\n", - "\n", - "Now do it again, with slightly less direction:\n", - "\n", - "8. Create a grid of 100 equally spaced points ranging from -6.5 to 6.5.\n", - "9. Use the sine and cosine functions in Numpy to compute the values of those functions for each point on your grid. (You'll have to find out what those functions are.)\n", - "10. Plot the values of the two functions for the values on the grid on the same plot.\n", - "11. The scatter plot looks really noisy. Instead of `plt.scatter(x,y)` to make a scatter plot, use `plt.plot(x,y)` to make a line graph.\n", - "12. Make the plot again, with labels for the axes, a title, and a legend in the lower left instead of the lower right.\n" - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "id": "9c05ee86-06dc-4ae9-896d-0cb70c4d6c3d", - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import matplotlib.pyplot as plt\n" - ] - }, - { - "cell_type": "markdown", - "id": "ab57312f-fd41-4763-b38c-3b7c1b062b1c", - "metadata": {}, - "source": [ - "**Q3.** This is a basic review of some statistics along with practice writing functions. Like we talked about, Python is a general purpose programming language and ships without basic data handling or statistical packages coded in. 
The beginning of the code chunk below generates random values for you to test your work on; since the random values are generates as NumPy arrays, you'll need to use the Numpy methods `np.sum(x)` to sum the vector $x$ and `np.sqrt(x)` to take the square roots of the values in $x$, as well as the Python function `len(x)` to get the length of $x$.\n", - "\n", - "Try to reuse the functions you've already defined as you work through the following questions, rather than rewriting code you've already written.\n", - "\n", - "1. Write a function that computes the **sample average** or **mean** of a vector $x$,\n", - "$$\n", - "\\bar{x} = \\dfrac{x_1 + x_2 + ... + x_N}{N} = \\dfrac{\\sum_{i=1}^N x_i}{N}.\n", - "$$\n", - "Write a function in the code chunk below to compute this quantity, and then use it to compute the mean of $x$.\n", - "2. Write a function that computes the **sample standard deviation** of a vector $x$,\n", - "$$\n", - "s_x = \\sqrt{\\dfrac{(x_1 - \\bar{x})^2 + ... (x_N - \\bar{x})^2 }{N-1}} = \\sqrt{\\dfrac{ \\sum_{i=1}^N (x_i - \\bar{x})^2 }{N-1}}.\n", - "$$\n", - "The intuition of this quantity is that it computes roughly the average distance from each point $x_i$ to the sample mean $\\bar{x}$. If it is small, it means all the points are clustered tightly around the mean, and if it is large, it means the points are typically far away from the average. Write a function in the code chunk below to compute this quantity, and then use it to compute the sample standard deviation of $x$.\n", - "3. Write a function that calls the previous two to **standardize** the values of the vector as a **$z$-score**:\n", - "$$\n", - "z = \\dfrac{x-\\bar{x}}{s}.\n", - "$$\n", - "The intuition of this quantity is that it is recentering all the values of $x$ so the average is zero and then scaling them by the standard deviation. If the data are normally distributed and $N$ is large, the $z$ score will approximately follow a standard normal distribution. 
Write a function in the code chunk below to compute this quantity, and then use it to compute the z-scores for $x$.\n", - "4. The **sample covariance** of two vectors $x=(x_1,...,x_N)$ and $y=(y_1,...,y_N)$ is defined as\n", - "$$\n", - "cov(x,y) = \\dfrac{(x_1 - \\bar{x})(y_1-\\bar{y}) + (x_2 - \\bar{x})(y_2-\\bar{y}) + ... + (x_N - \\bar{x})(y_{N}-\\bar{y})}{N-1}\n", - "$$\n", - "$$\n", - "= \\dfrac{\\sum_{i=1}^N (x_i - \\bar{x})(y_i - \\bar{y})}{N-1}.\n", - "$$\n", - "The intuition of this quantity is that it looks at the pairs $(x_i, y_i)$ and compares them to the means $(\\bar{x},\\bar{y})$ to determine whether $x$ and $y$ tend to co-vary in the same direction relative to their means: If the values of $x$ and $y$ are typically both above or below the mean values of $x$ and $y$, then $x$ and $y$ will have a positive covariance, but if $x$ is typically above the mean of $x$ when $y$ is typically below the mean of $y$ or vice versa, then they will have a negative covariance. Write a function in the code chunk below to compute this quantity, and then use it to compute the covariance of the generated $x$ and $y$.\n", - "6. The **sample correlation coefficient** of two vectors $x$ and $y$ is defined as\n", - "$$\n", - "r_{xy} = \\dfrac{cov(x,y)}{s_x s_y}\n", - "$$\n", - "Use your functions to create a new function that compute this quantity. The intuition of this quantity is that it is like the covariance, but normalized so that its values like between -1 and 1: perfect negative linear association between the variables at -1, no association at 0, and perfect positive linear association between the variables at 1. 
" - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "id": "f42ab788-af0c-47a4-a103-46c9a37bfad3", - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import math as math\n", - "np.random.seed(100) # Set the seed for the random number generator\n", - "rho, sigma_x, sigma_y = -.4, 3, 2 # Variance-Covariance Parameters\n", - "vcv = np.array([[sigma_x**2, rho*sigma_x*sigma_y],\n", - " [rho*sigma_x*sigma_y,sigma_y**2]]) # VCV Matrix\n", - "mu = np.array([-1,2]) # Population averages\n", - "sample = np.random.multivariate_normal(mu,vcv,200) # Multivariate normal draws\n", - "x = sample[:,0]\n", - "y = sample[:,1]\n", - "\n", - "#############################################################################\n" - ] - }, - { - "cell_type": "markdown", - "id": "5924279e-4b74-4941-9819-f028ce3db974", - "metadata": {}, - "source": [ - "**Q4.** Optimization is at the core of data science and statistics: Picking the best fit or estimate according to some desirable criteria (e.g. Maximum Likelihood Estimation). In this question, you're going to write a function that finds the highest possible value for some function, and returns the value of the function and the maximizing value.\n", - "\n", - "1. Write a function that creates a grid from some value $a$ to a second value $b$ with $N$ steps, computes the value of a function $f$ on that grid, and then finds the maximizers of the function. Have it return a dictionary with the maximum value and the maximizers themselves.\n", - "2. Use your function to maximize $f(x) = 100 - 2x^2 + 3x$ for values of $a=-1$, $b=1$, and for values of $N=3$, $N=10$, $N=100$, $N=1000$ and $N=5000$.\n", - "3. The true maximizer of this function is $.75$ and the true maximum value is $101.125$. Is that what you got in the previous step? Why not? How does the quality of the maximization depend on $N$? How does your computed answer change as $N$ gets larger?" 
- ] - }, - { - "cell_type": "code", - "execution_count": 3, - "id": "c3b32977-7687-4770-8f12-2ab0593f2258", - "metadata": {}, - "outputs": [], - "source": [ - "import numpy as np\n", - "import matplotlib.pyplot as plt\n" - ] - }, - { - "cell_type": "markdown", - "id": "3c6e5e34-00c0-4101-a1af-34cf61d69ffc", - "metadata": {}, - "source": [ - "3. " - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.4" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/assignment/jobs_data.csv b/assignment/jobs_data.csv new file mode 100644 index 0000000..c22ca0e --- /dev/null +++ b/assignment/jobs_data.csv @@ -0,0 +1,11 @@ +Job Name,2023 Median Pay (USD,Typical Entry-Level Education,Work Setting ,Projected Growth Percent (2023 - 2033) ,"Number of Jobs, 2023",Work Experience in a Related Occupation,On the Job Training +Data Scientist,"108,020",Bachelor's degree,Office,36,"202,900",None,"None " +"Software Developers, Quality Assurance Analysts, and Testers","130,160",Bachelor's degree,Office,17,"1,897,100",None,"None " +"Plumbers, Pipefitters, and Steamfitters","61,550",High school diploma or equivalent,Traveling,6,"473,400","None ","Apprenticeship " +Automotive Service Technicians and Mechanics,"47,770",Postsecondary nondegree award,Repair Shop,3,"794,600",None,Short-term on-the-job training +Airline and Commercial Pilots,"171,210",Bachelor's degree and Flight School,Vehicle,5,"152,800",Comercial or Military Pilot,"Moderate-term on-the-job training " +Athletes and Sports Competitors,"70,280","No formal educational credential ",Athletic Facilities,11,"25,100",None,"Long-term 
on-the-job training " +"Mechanical Engineers ","99,510",Bachelor's degree,Office,11,"291,900",None,"None " +Aerospace Engineer,"130,720","Bachelor's degree ",Office,6,"68,900",None,"None " +"Kindergarten and Elementary School Teachers +","63,670","Bachelor's degree ",School,-1,"1,563,700",None,"None " \ No newline at end of file
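
Note on Q4: the grid maximization described in the notebook can be sketched as follows. This is a minimal sketch, not the notebook's own solution (its code cell contains only the imports). The names `grid_maximize`, `max_value`, and `maximizers` are hypothetical, and the grid is assumed to be `N` evenly spaced points from `a` to `b`, endpoints included.

```python
import numpy as np


def grid_maximize(f, a, b, N):
    """Evaluate f on N evenly spaced points in [a, b] and return the grid maximum.

    Hypothetical helper: the dictionary keys are illustrative choices,
    not names taken from the assignment.
    """
    grid = np.linspace(a, b, N)             # N points, endpoints included
    values = f(grid)                        # f must accept a NumPy array
    max_value = np.max(values)              # largest value found on the grid
    maximizers = grid[values == max_value]  # every grid point attaining it
    return {"max_value": float(max_value), "maximizers": maximizers}


def f(x):
    # Objective from Q4: f(x) = 100 - 2x^2 + 3x
    return 100 - 2 * x**2 + 3 * x


for N in (3, 10, 100, 1000, 5000):
    result = grid_maximize(f, -1, 1, N)
    print(N, result["max_value"], result["maximizers"])
```

With `a=-1` and `b=1`, the grid contains $0.75$ only when $N-1$ is a multiple of 8, which is not the case for any of the listed values of $N$. The computed maximum therefore falls slightly short of $101.125$, and the worst-case gap shrinks as the grid spacing $2/(N-1)$ gets smaller.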