{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction to Python programming in bioinformatics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### A Brief Introduction" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python is a high-level programming language developed by the Dutch programmer Guido van Rossum and first released in 1991 (https://en.wikipedia.org/wiki/Python_(programming_language)). It is an interpreted language, which means that, unlike languages such as C++ or Java, the code does not need to be compiled in a separate step before you run it. See here for more information: https://en.wikipedia.org/wiki/Interpreted_language\n", "\n", "It differs from many other programming languages in that whitespace is significant: indentation, rather than braces or keywords, marks which lines belong to a block. For example, the following two code snippets do the same thing and differ only in how far the loop body is indented:\n", "\n", "```Python\n", "for i in range(1,10):\n", "    print(i)\n", "\n", "for i in range(1,10):\n", "  print(i)\n", "```\n", "\n", "The first snippet indents the loop body with 4 spaces, whereas the second uses only 2. Python accepts either as long as the indentation within a block is consistent, but Python developers recommend **4** spaces per indentation level to keep code readable. See Python coding guidelines here: https://www.python.org/dev/peps/pep-0008/\n", "\n", "It is easy to write any Python code but difficult to write *beautiful* code that is easy to read and easy to decipher/debug." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Today, you will be using both the `jupyter-lab` and terminal environments to write simple Python scripts that can help you along with your projects. I will not go exhaustively over what you can do with the Python programming language; this is just a brief introduction, and I will show you the important things you should know about using Python for genomics work." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Basics" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python is already available in your terminal environment because you have installed Miniconda, which comes with its own version of Python. There is a significant difference between Python versions 2 and 3: a lot of the code syntax changed between the two. For example, in the Python 2 series, you can print something to the screen like this:\n", "\n", "```Python\n", "print \"Hello, World!\"\n", "```\n", "\n", "But in the Python 3 series, you need to type it like this:\n", "\n", "```Python\n", "print(\"Hello, World!\")\n", "```\n", "\n", "This is just one example. Since we are sticking to Python 3, we will use the syntax shown in the second example. In `jupyter-lab`, you can use code cells to start writing Python code right away. It works as a mini environment in which you can test things. For example, you can use it like a calculator." ] }, { "cell_type": "code", "execution_count": 152, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Hello!\n" ] } ], "source": [ "print(\"Hello!\")" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4223.5" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "(134+8353-(2*20)) / 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Before we dive further into coding, a few important datatypes you need to know and remember in Python are listed below:\n", "\n", "1. **Booleans** are either True or False.\n", "1. **Numbers** can be integers (1 and 2), floats (1.1 and 1.2), fractions (1/2 and 2/3), or even complex numbers.\n", "1. **Strings** are sequences of Unicode characters, e.g. an HTML document.\n", "1. **Bytes** and **byte arrays**, e.g. a JPEG image file.\n", "1. 
**Lists** are ordered sequences of values.\n", "1. **Tuples** are ordered, immutable sequences of values.\n", "1. **Sets** are unordered collections of unique values.\n", "1. **Dictionaries** are collections of key-value pairs (they keep insertion order since Python 3.7).\n", "\n", "For more details on the datatypes, refer to the tutorial here: https://diveintopython3.problemsolving.io/native-datatypes.html\n", "\n", "The most common datatypes you will encounter in a bioinformatics/genomics context are **numbers**, **strings**, **lists**, **tuples**, **sets**, and **dictionaries**." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Strings" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Strings can be anything that will be treated as text. For example, a sentence like this:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is a string'" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\"This is a string\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When you type a string by itself into a `jupyter` code cell, it simply returns what you typed. However, you can put these strings into a variable. For example," ] }, { "cell_type": "code", "execution_count": 185, "metadata": {}, "outputs": [], "source": [ "astring = \"This is a string\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, you have a variable named `astring`. When you type this variable name in `jupyter`, you see this:" ] }, { "cell_type": "code", "execution_count": 154, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This is a string'" ] }, "execution_count": 154, "metadata": {}, "output_type": "execute_result" } ], "source": [ "astring" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It just prints the string to the screen. You can also turn other values into a string by using the `str()` function. 
For example, if I have a number that I want to treat as text for some reason, I can do something like this:" ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [], "source": [ "s = str(12345)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, I have stored the text \"12345\" in a variable named `s`. If you want to know what type of variable it is, you can type:" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "str" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "type(s)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It returns `str`, which means the variable `s` is a string. Once you have a string, you can do a lot of things with it. For example, you can split the string into individual words." ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['This', 'is', 'a', 'string']" ] }, "execution_count": 160, "metadata": {}, "output_type": "execute_result" } ], "source": [ "astring.split(\" \")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the example above, I have split the string in the variable `astring` into individual words, using a space (\" \") as the delimiter. You can also split without specifying a delimiter, in which case the default is to split on any run of whitespace." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['This', 'is', 'a', 'string']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "astring.split()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, after you split the string into individual words, you will notice that each word is enclosed in single quotes and all of them are inside square brackets. What does this mean? 
This means the result is a **list** of words. In Python, values written inside square brackets form a list. Let's try to play around with this list. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Lists" ] }, { "cell_type": "code", "execution_count": 187, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'This'" ] }, "execution_count": 187, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alist = astring.split(\" \")\n", "alist[0]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What's happening here? I have stored the list produced by the string-splitting function in a variable named `alist`. Then, when you type `alist[0]`, it returns the first item of the list (indexing starts at 0), which is the word `'This'`. If you want to print the last item in the list, you can type:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'string'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alist[-1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The index \"-1\" means you are accessing the last item in a list. If you want to print the items in a list one by one, you can type:" ] }, { "cell_type": "code", "execution_count": 188, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "This\n", "is\n", "a\n", "string\n" ] } ], "source": [ "for i in alist:\n", "    print(i)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, I am running a `for` loop to print the contents of a list one after another. In this example, `i` is a variable that temporarily holds each item of the list in turn. The `print` function then prints that variable to the screen on each pass through the loop. 
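
The indexing and looping ideas above can be combined in one small sketch. The variable names (`words`, `numbered`, and so on) are made up for illustration, and slicing and `enumerate()` go slightly beyond what is shown above, so treat this as an optional extra rather than required material:

```python
# Build a list of words the same way as above
words = 'This is a string'.split()

first = words[0]       # indexing starts at 0
last = words[-1]       # negative indices count from the end
middle = words[1:3]    # a slice returns a new list holding items 1 and 2

# enumerate() hands back (position, item) pairs while looping
numbered = []
for position, word in enumerate(words):
    numbered.append((position, word))
    print(position, word)
```

Slices never modify the original list, so `words` still holds all four items after this runs.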
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Sets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another useful datatype you can use is a `set`. Sets are collections of unique values or items. For example, if you have a list of items as shown below, you can turn it into a set by calling the `set()` function." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "another_list = ['A', 'B', 'B', 'C', 'D']\n", "aset = set(another_list)" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'A', 'B', 'C', 'D'}" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "aset" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, the list has two instances of `'B'`, and when you turn the list into a set, only one of them is retained. Sets are written in curly brackets `{ }`. You can think of many scenarios in which sets are useful; for example, getting unique names from a pool of names." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dictionaries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Dictionaries are collections of **key**-**value** pairs. As the name suggests, a dictionary behaves the way a real dictionary does: you look up a word (the key) to get its meaning (the value). Below is an example of a dictionary:" ] }, { "cell_type": "code", "execution_count": 173, "metadata": {}, "outputs": [], "source": [ "adict = {\n", "    'A': 'x',\n", "    'B': 'y',\n", "    'C': 'z'\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, I have stored 3 keys named A, B, and C and their values x, y, and z. 
So when you want to look up the value stored under a key, you can type:" ] }, { "cell_type": "code", "execution_count": 176, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'x'" ] }, "execution_count": 176, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adict['A']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "It returns `'x'`, the value stored under the key `A`. You can only look up keys that are present in the dictionary. If you ask for a key that doesn't exist, Python raises a `KeyError`." ] }, { "cell_type": "code", "execution_count": 177, "metadata": {}, "outputs": [ { "ename": "KeyError", "evalue": "'D'", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mKeyError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0madict\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'D'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mKeyError\u001b[0m: 'D'" ] } ], "source": [ "adict['D']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, it gives you an error because Python could not find the key `D` in this dictionary.\n", "\n", "To print all the keys in a dictionary, you can type:" ] }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_keys(['A', 'B', 'C'])" ] }, "execution_count": 180, "metadata": {}, "output_type": "execute_result" } ], "source": [ "adict.keys()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And all the values in a dictionary can be printed like this:" ] }, { "cell_type": "code", "execution_count": 181, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "dict_values(['x', 'y', 'z'])" ] }, "execution_count": 181, "metadata": 
{}, "output_type": "execute_result" } ], "source": [ "adict.values()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The usefulness of dictionaries is probably not apparent at this point, but as your programming skills grow, you will start to realize how useful they are." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Tuples" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Tuples are similar to lists, but they are immutable, which means their contents cannot be changed after creation, and they are written in parentheses `( )`. Other datatypes such as lists and dictionaries are mutable, which means you can modify their contents after they are created. Usually, I would store things in a tuple that I shouldn't modify, for example, GPS coordinates. An example is shown below." ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "atuple = (1, 2, 3, 5)" ] }, { "cell_type": "code", "execution_count": 182, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(1, 2, 3, 5)" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "atuple" ] }, { "cell_type": "code", "execution_count": 183, "metadata": {}, "outputs": [ { "ename": "TypeError", "evalue": "'tuple' object does not support item assignment", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypeError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0matuple\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;36m8\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;31mTypeError\u001b[0m: 'tuple' object does not support item assignment" ] } ], "source": [ "atuple[0] = 8" ] }, { 
"cell_type": "markdown", "metadata": {}, "source": [ "As you can see, a tuple will not allow you to change its contents. A list, however, will allow you to do that. For example," ] }, { "cell_type": "code", "execution_count": 189, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['This', 'is', 'a', 'string']" ] }, "execution_count": 189, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alist" ] }, { "cell_type": "code", "execution_count": 190, "metadata": {}, "outputs": [], "source": [ "alist[0] = 'Here'" ] }, { "cell_type": "code", "execution_count": 191, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Here', 'is', 'a', 'string']" ] }, "execution_count": 191, "metadata": {}, "output_type": "execute_result" } ], "source": [ "alist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see here, I have replaced the first item in the `alist` variable with the string `Here`. These few examples are meant to get you started with learning Python, but you should explore outside of class to learn more in depth." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List comprehensions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another very useful feature of Python is the list comprehension. If you want to pull everything from one list that matches a criterion you provide and store it in another list, you can type:" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "blist = [i for i in alist if i == 'Here']" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Here']" ] }, "execution_count": 37, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, I have pulled out the items that match the string `'Here'` into another list named `blist`. 
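
A list comprehension is a compact way of writing a loop that builds a list. As a minimal sketch (the names `words`, `with_loop`, and `with_comprehension` are made up for illustration), the two approaches below should give the same result:

```python
words = ['Here', 'is', 'a', 'string']

# Explicit loop: start with an empty list and append the matching items
with_loop = []
for word in words:
    if word == 'Here':
        with_loop.append(word)

# The same filter written as a list comprehension
with_comprehension = [word for word in words if word == 'Here']

print(with_loop)
print(with_comprehension)
```

The comprehension reads left to right as: keep `word`, for each `word` in `words`, if the condition holds.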
If you want to pull out words that contain the letter `s`, you can type:" ] }, { "cell_type": "code", "execution_count": 192, "metadata": {}, "outputs": [], "source": [ "blist = [i for i in alist if 's' in i]" ] }, { "cell_type": "code", "execution_count": 193, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['is', 'string']" ] }, "execution_count": 193, "metadata": {}, "output_type": "execute_result" } ], "source": [ "blist" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, the usefulness of such features may not be apparent to you at this time, but eventually you will reach a point when features like these are indispensable to your coding workflow." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Useful Python libraries" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Base Python contains useful features that you can start using right away, but many other features come from additional Python packages/libraries. Some of the important ones are listed below:\n", "\n", "- pandas\n", "- matplotlib\n", "- numpy\n", "- scipy\n", "- seaborn\n", "- biopython\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pandas" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas is a Python package oriented towards data analysis that provides a lot of useful features/functions that would otherwise take much longer to code yourself (see here: https://pandas.pydata.org/). For example, if you want to put a table into a dataframe, you can simply type something like this:\n", "\n", "```Python\n", "import pandas as pd\n", "df = pd.read_csv(\"table.txt\", sep=\"\\t\")\n", "```\n", "\n", "In this example, you are importing a tab-delimited table (as indicated by the `\"\\t\"`) into a dataframe named `df`. But before you can use Pandas and its features, you first need to import the library into your Python environment by typing `import pandas as pd`. 
This means you are importing the whole Pandas library and providing it a shortcut named `pd`. Let's try and use Pandas here today. Download this file into a directory (maybe in a subfolder in your `exercises` folder): https://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/prokaryote_type_strain_report.txt\n", "\n", "This file contains \"type\" strains of bacteria and archaea. Now, we will use Pandas to see what you can do with the table." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'1.21.4'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "np.__version__" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "df = pd.read_csv(\"prokaryote_type_strain_report.txt\", sep=\"\\t\")" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>#scientific name</th><th>type materials and coidentical strains</th><th>has sequences from type material?</th><th>number of assemblies per taxon</th><th>number of assemblies from type materials per taxon</th><th>number of assemblies from type materials per species</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>'Burkholderia humi' Srinivasan et al. 2013</td><td>JCM 18069,KEMC 7302-068,Rs7</td><td>yes</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>1</th><td>'Lysobacter humi' Akter et al. 2016</td><td>CCTCC AB 2015292,KACC 18284,THG-PC4,culture-co...</td><td>yes</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>2</th><td>'Massilia aquatica' Lu et al. 2020</td><td>FT127W,GDMCC 1.1690,KACC 21482</td><td>yes</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>3</th><td>'Megasphaera vaginalis' Bordigoni et al. 2020</td><td>CSUR P4857,Marseille-P4857</td><td>yes</td><td>0</td><td>0</td><td>0</td></tr>\n",
"<tr><th>4</th><td>'Micromonospora endophytica' Thanaboripat et a...</td><td>BCC 67267,BCC&lt;THA&gt; 67267,DCWR9-8-2,NBRC ...</td><td>yes</td><td>0</td><td>0</td><td>0</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " #scientific name \\\n", "0 'Burkholderia humi' Srinivasan et al. 2013 \n", "1 'Lysobacter humi' Akter et al. 2016 \n", "2 'Massilia aquatica' Lu et al. 2020 \n", "3 'Megasphaera vaginalis' Bordigoni et al. 2020 \n", "4 'Micromonospora endophytica' Thanaboripat et a... \n", "\n", " type materials and coidentical strains \\\n", "0 JCM 18069,KEMC 7302-068,Rs7 \n", "1 CCTCC AB 2015292,KACC 18284,THG-PC4,culture-co... \n", "2 FT127W,GDMCC 1.1690,KACC 21482 \n", "3 CSUR P4857,Marseille-P4857 \n", "4 BCC 67267,BCC<THA> 67267,DCWR9-8-2,NBRC ... \n", "\n", " has sequences from type material? number of assemblies per taxon \\\n", "0 yes 0 \n", "1 yes 0 \n", "2 yes 0 \n", "3 yes 0 \n", "4 yes 0 \n", "\n", " number of assemblies from type materials per taxon \\\n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "\n", " number of assemblies from type materials per species \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>#scientific name</th><th>type materials and coidentical strains</th><th>has sequences from type material?</th><th>number of assemblies per taxon</th><th>number of assemblies from type materials per taxon</th><th>number of assemblies from type materials per species</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>8195</th><td>Gloeobacter kilaueensis</td><td>ATCC BAA-2537,BCCM ULC0316,CCAP 1431/1,JS1,cul...</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
"<tr><th>8196</th><td>Gloeobacter morelensis</td><td>MG652769</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
"<tr><th>8197</th><td>Gloeobacter violaceus</td><td>PCC 7421</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " #scientific name \\\n", "8195 Gloeobacter kilaueensis \n", "8196 Gloeobacter morelensis \n", "8197 Gloeobacter violaceus \n", "\n", " type materials and coidentical strains \\\n", "8195 ATCC BAA-2537,BCCM ULC0316,CCAP 1431/1,JS1,cul... \n", "8196 MG652769 \n", "8197 PCC 7421 \n", "\n", " has sequences from type material? number of assemblies per taxon \\\n", "8195 yes 1 \n", "8196 yes 1 \n", "8197 yes 1 \n", "\n", " number of assemblies from type materials per taxon \\\n", "8195 1 \n", "8196 1 \n", "8197 1 \n", "\n", " number of assemblies from type materials per species \n", "8195 1 \n", "8196 1 \n", "8197 1 " ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['#scientific name'].str.contains('Gloeobacter')]" ] }, { "cell_type": "code", "execution_count": 196, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>#scientific name</th><th>type materials and coidentical strains</th><th>has sequences from type material?</th><th>number of assemblies per taxon</th><th>number of assemblies from type materials per taxon</th><th>number of assemblies from type materials per species</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>19403</th><td>[Ruminococcus] faecis</td><td>Eg2,JCM 15917,JCM:15917,KCTC 5757,KCTC:5757</td><td>yes</td><td>10</td><td>1</td><td>1</td></tr>\n",
"<tr><th>19404</th><td>[Ruminococcus] gnavus</td><td>ATCC 29149,ATCC:29149,VPI C7-9</td><td>yes</td><td>80</td><td>3</td><td>3</td></tr>\n",
"<tr><th>19405</th><td>[Ruminococcus] lactaris</td><td>ATCC 29176,ATCC:29176</td><td>yes</td><td>7</td><td>1</td><td>1</td></tr>\n",
"<tr><th>19406</th><td>[Ruminococcus] torques</td><td>ATCC 27756,ATCC:27756</td><td>yes</td><td>18</td><td>1</td><td>1</td></tr>\n",
"<tr><th>19407</th><td>[Scytonema hofmanni] UTEX B 1581</td><td>na</td><td>no</td><td>1</td><td>0</td><td>0</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " #scientific name \\\n", "19403 [Ruminococcus] faecis \n", "19404 [Ruminococcus] gnavus \n", "19405 [Ruminococcus] lactaris \n", "19406 [Ruminococcus] torques \n", "19407 [Scytonema hofmanni] UTEX B 1581 \n", "\n", " type materials and coidentical strains \\\n", "19403 Eg2,JCM 15917,JCM:15917,KCTC 5757,KCTC:5757 \n", "19404 ATCC 29149,ATCC:29149,VPI C7-9 \n", "19405 ATCC 29176,ATCC:29176 \n", "19406 ATCC 27756,ATCC:27756 \n", "19407 na \n", "\n", " has sequences from type material? number of assemblies per taxon \\\n", "19403 yes 10 \n", "19404 yes 80 \n", "19405 yes 7 \n", "19406 yes 18 \n", "19407 no 1 \n", "\n", " number of assemblies from type materials per taxon \\\n", "19403 1 \n", "19404 3 \n", "19405 1 \n", "19406 1 \n", "19407 0 \n", "\n", " number of assemblies from type materials per species \n", "19403 1 \n", "19404 3 \n", "19405 1 \n", "19406 1 \n", "19407 0 " ] }, "execution_count": 196, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.tail()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, I have just imported the table as a tab-delimited file, telling Pandas to put it into a dataframe named `df`. Once it's put into a dataframe, it becomes easier to extract information from the table. For example, if you want to see all headers of the columns, type:" ] }, { "cell_type": "code", "execution_count": 197, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['#scientific name', 'type materials and coidentical strains',\n", " 'has sequences from type material?', 'number of assemblies per taxon',\n", " 'number of assemblies from type materials per taxon',\n", " 'number of assemblies from type materials per species'],\n", " dtype='object')" ] }, "execution_count": 197, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will list all the headers of all the columns in this dataframe. 
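
As a small, self-contained sketch of working with column headers, the example below builds a tiny dataframe by hand instead of reading the NCBI file. The variable names (`toy`, `headers`, `names`, `first_name`) are made up for illustration; the two column names and the two rows are copied by hand from the outputs shown in this notebook:

```python
import pandas as pd

# Two rows copied by hand from the outputs above; a stand-in for the full table
toy = pd.DataFrame({
    '#scientific name': ['Escherichia coli', 'Gloeobacter violaceus'],
    'number of assemblies per taxon': [97292, 1],
})

headers = list(toy.columns)         # column headers as a plain Python list
names = toy['#scientific name']     # select a single column by its header
first_name = names.iloc[0]          # first value in that column

print(headers)
print(first_name)
```

Selecting a column with square brackets returns a pandas Series, which is what string methods such as `.str.contains()` operate on.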
If you want to pull out all rows whose scientific name contains *Escherichia*, you can type:" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table>\n", "<thead>\n",
"<tr><th></th><th>#scientific name</th><th>type materials and coidentical strains</th><th>has sequences from type material?</th><th>number of assemblies per taxon</th><th>number of assemblies from type materials per taxon</th><th>number of assemblies from type materials per species</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>5808</th><td>Escherichia alba</td><td>na</td><td>no</td><td>1</td><td>0</td><td>0</td></tr>\n",
"<tr><th>5809</th><td>Escherichia albertii</td><td>Albert 19982,BCCM/LMG:20976,CCUG 46494,CCUG:46...</td><td>yes</td><td>224</td><td>1</td><td>1</td></tr>\n",
"<tr><th>5810</th><td>Escherichia coli</td><td>ATCC 11775,ATCC:11775,BCCM/LMG:2092,CCUG 24,CC...</td><td>yes</td><td>97292</td><td>5</td><td>5</td></tr>\n",
"<tr><th>5811</th><td>Escherichia fergusonii</td><td>ATCC 35469,ATCC:35469,BCCM/LMG:7866,CDC 0568-7...</td><td>yes</td><td>86</td><td>2</td><td>2</td></tr>\n",
"<tr><th>5812</th><td>Escherichia marmotae</td><td>CGMCC 1.12862,CGMCC:1.12862,DSM 28771,DSM:2877...</td><td>yes</td><td>35</td><td>2</td><td>2</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " #scientific name \\\n", "5808 Escherichia alba \n", "5809 Escherichia albertii \n", "5810 Escherichia coli \n", "5811 Escherichia fergusonii \n", "5812 Escherichia marmotae \n", "\n", " type materials and coidentical strains \\\n", "5808 na \n", "5809 Albert 19982,BCCM/LMG:20976,CCUG 46494,CCUG:46... \n", "5810 ATCC 11775,ATCC:11775,BCCM/LMG:2092,CCUG 24,CC... \n", "5811 ATCC 35469,ATCC:35469,BCCM/LMG:7866,CDC 0568-7... \n", "5812 CGMCC 1.12862,CGMCC:1.12862,DSM 28771,DSM:2877... \n", "\n", " has sequences from type material? number of assemblies per taxon \\\n", "5808 no 1 \n", "5809 yes 224 \n", "5810 yes 97292 \n", "5811 yes 86 \n", "5812 yes 35 \n", "\n", " number of assemblies from type materials per taxon \\\n", "5808 0 \n", "5809 1 \n", "5810 5 \n", "5811 2 \n", "5812 2 \n", "\n", " number of assemblies from type materials per species \n", "5808 0 \n", "5809 1 \n", "5810 5 \n", "5811 2 \n", "5812 2 " ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['#scientific name'].str.contains('Escherichia')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, `jupyter-lab` will only print a few rows (perhaps up to 10 or 20). But when you write this code in a Python script, it will print everything to screen (if you run it in a terminal). Let's count how many rows satisfy this criterion." ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df[df['#scientific name'].str.contains('Escherichia')])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looks like there are only 5 with the genus *Escherichia*. How about *Salmonella*?" 
] }, { "cell_type": "code", "execution_count": 200, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "10" ] }, "execution_count": 200, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(df[df['#scientific name'].str.contains('Salmonella')])" ] }, { "cell_type": "code", "execution_count": 203, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\"><th></th><th>#scientific name</th><th>type materials and coidentical strains</th><th>has sequences from type material?</th><th>number of assemblies per taxon</th><th>number of assemblies from type materials per taxon</th><th>number of assemblies from type materials per species</th></tr>\n", " </thead>\n", " <tbody>\n",
" <tr><th>7752</th><td>Idiomarina abyssalis</td><td>ATCC BAA-312,ATCC:BAA-312,KMM 227,KMM:227,cult...</td><td>yes</td><td>13</td><td>2</td><td>2</td></tr>\n",
" <tr><th>7753</th><td>Idiomarina aestuarii</td><td>JCM 16344,JCM:16344,KCTC 22740,KCTC:22740,KYW314</td><td>yes</td><td>7</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7754</th><td>Idiomarina andamanensis</td><td>BCCM/LMG:29773,JCM 31645,JCM:31645,LMG 29773,L...</td><td>yes</td><td>1</td><td>0</td><td>0</td></tr>\n",
" <tr><th>7755</th><td>Idiomarina aquatica</td><td>BCCM/LMG:27613,CCM 8471,CCM:8471,CECT 8360,CEC...</td><td>yes</td><td>2</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7756</th><td>Idiomarina aquimaris</td><td>BCCM/LMG:25374,BCRC 80083,BCRC:80083,LMG 25374...</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7757</th><td>Idiomarina atlantica</td><td>G5_TVMV8_7,KCTC 42141,KCTC:42141,MCCC 1A10513,...</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7758</th><td>Idiomarina baltica</td><td>BCCM/LMG:21691,DSM 15154,DSM:15154,LMG 21691,L...</td><td>yes</td><td>3</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7759</th><td>Idiomarina donghaiensis</td><td>908033,CGMCC 1.7284,CGMCC:1.7284,JCM 15533,JCM...</td><td>yes</td><td>2</td><td>2</td><td>2</td></tr>\n",
" <tr><th>7760</th><td>Idiomarina fontislapidosi</td><td>BCCM/LMG:22169,CECT 5859,CECT:5859,F23,LMG 221...</td><td>yes</td><td>2</td><td>2</td><td>2</td></tr>\n",
" <tr><th>7761</th><td>Idiomarina halophila</td><td>BH195,KACC 17610,KACC:17610,NCAIM B 02544,NCAI...</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7762</th><td>Idiomarina homiensis</td><td>DSM 17923,DSM:17923,KACC 11514,KACC:11514,PO-M2</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7763</th><td>Idiomarina indica</td><td>CGMCC 1.10824,CGMCC:1.10824,JCM 18138,JCM:1813...</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7764</th><td>Idiomarina insulisalsae</td><td>BCCM/LMG:23123,CIP 108836,CIP:108836,CVS-6,LMG...</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7765</th><td>Idiomarina loihiensis</td><td>ATCC BAA-735,ATCC:BAA-735,DSM 15497,DSM:15497,...</td><td>yes</td><td>6</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7766</th><td>Idiomarina mangrovi</td><td>KCTC 62455,KCTC:62455,MCCC 1K03495,MCCC:1K0349...</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7767</th><td>Idiomarina marina</td><td>BCRC 17749,BCRC:17749,JCM 15083,JCM:15083,PIM1</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7768</th><td>Idiomarina maritima</td><td>908087,CGMCC 1.7285,CGMCC:1.7285,JCM 15534,JCM...</td><td>yes</td><td>2</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7769</th><td>Idiomarina piscisalsi</td><td>NBRC 108617,NBRC:108617,TISTR 2054,TISTR:2054,...</td><td>yes</td><td>3</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7770</th><td>Idiomarina planktonica</td><td>CGMCC 1.12458,CGMCC:1.12458,JCM 19263,JCM:1926...</td><td>yes</td><td>2</td><td>2</td><td>2</td></tr>\n",
" <tr><th>7771</th><td>Idiomarina ramblicola</td><td>BCCM/LMG:22170,CECT 5858,CECT:5858,LMG 22170,L...</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7772</th><td>Idiomarina salinarum</td><td>CCUG 54359,CCUG:54359,ISL-52,KCTC 12971,KCTC:1...</td><td>yes</td><td>2</td><td>2</td><td>2</td></tr>\n",
" <tr><th>7773</th><td>Idiomarina sediminum</td><td>BCCM/LMG:24046,CICC 10319,CICC:10319,DSM 21906...</td><td>yes</td><td>2</td><td>2</td><td>2</td></tr>\n",
" <tr><th>7774</th><td>Idiomarina seosinensis</td><td>CL-SP19,JCM 12526,JCM:12526,KCTC 12296,KCTC:12296</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7775</th><td>Idiomarina tainanensis</td><td>BCRC 17750,BCRC:17750,JCM 15084,JCM:15084,PIN1</td><td>yes</td><td>3</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7776</th><td>Idiomarina taiwanensis</td><td>BCRC 17465,BCRC:17465,JCM 13360,JCM:13360,PIT1</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7777</th><td>Idiomarina tyrosinivorans</td><td>BCRC 80745,BCRC:80745,CC-PW-9,JCM 19757,JCM:19757</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7778</th><td>Idiomarina woesei</td><td>BCCM/LMG:27903,DSM 27808,DSM:27808,JCM 19499,J...</td><td>yes</td><td>2</td><td>2</td><td>2</td></tr>\n",
" <tr><th>7779</th><td>Idiomarina xiamenensis</td><td>10-D-4,BCCM/LMG:25227,CCTCC AB 209061,CCTCC:AB...</td><td>yes</td><td>1</td><td>1</td><td>1</td></tr>\n",
" <tr><th>7780</th><td>Idiomarina zobellii</td><td>ATCC BAA-313,ATCC:BAA-313,KMM 231,KMM:231,cult...</td><td>yes</td><td>2</td><td>2</td><td>2</td></tr>\n",
" </tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " #scientific name \\\n", "7752 Idiomarina abyssalis \n", "7753 Idiomarina aestuarii \n", "7754 Idiomarina andamanensis \n", "7755 Idiomarina aquatica \n", "7756 Idiomarina aquimaris \n", "7757 Idiomarina atlantica \n", "7758 Idiomarina baltica \n", "7759 Idiomarina donghaiensis \n", "7760 Idiomarina fontislapidosi \n", "7761 Idiomarina halophila \n", "7762 Idiomarina homiensis \n", "7763 Idiomarina indica \n", "7764 Idiomarina insulisalsae \n", "7765 Idiomarina loihiensis \n", "7766 Idiomarina mangrovi \n", "7767 Idiomarina marina \n", "7768 Idiomarina maritima \n", "7769 Idiomarina piscisalsi \n", "7770 Idiomarina planktonica \n", "7771 Idiomarina ramblicola \n", "7772 Idiomarina salinarum \n", "7773 Idiomarina sediminum \n", "7774 Idiomarina seosinensis \n", "7775 Idiomarina tainanensis \n", "7776 Idiomarina taiwanensis \n", "7777 Idiomarina tyrosinivorans \n", "7778 Idiomarina woesei \n", "7779 Idiomarina xiamenensis \n", "7780 Idiomarina zobellii \n", "\n", " type materials and coidentical strains \\\n", "7752 ATCC BAA-312,ATCC:BAA-312,KMM 227,KMM:227,cult... \n", "7753 JCM 16344,JCM:16344,KCTC 22740,KCTC:22740,KYW314 \n", "7754 BCCM/LMG:29773,JCM 31645,JCM:31645,LMG 29773,L... \n", "7755 BCCM/LMG:27613,CCM 8471,CCM:8471,CECT 8360,CEC... \n", "7756 BCCM/LMG:25374,BCRC 80083,BCRC:80083,LMG 25374... \n", "7757 G5_TVMV8_7,KCTC 42141,KCTC:42141,MCCC 1A10513,... \n", "7758 BCCM/LMG:21691,DSM 15154,DSM:15154,LMG 21691,L... \n", "7759 908033,CGMCC 1.7284,CGMCC:1.7284,JCM 15533,JCM... \n", "7760 BCCM/LMG:22169,CECT 5859,CECT:5859,F23,LMG 221... \n", "7761 BH195,KACC 17610,KACC:17610,NCAIM B 02544,NCAI... \n", "7762 DSM 17923,DSM:17923,KACC 11514,KACC:11514,PO-M2 \n", "7763 CGMCC 1.10824,CGMCC:1.10824,JCM 18138,JCM:1813... \n", "7764 BCCM/LMG:23123,CIP 108836,CIP:108836,CVS-6,LMG... \n", "7765 ATCC BAA-735,ATCC:BAA-735,DSM 15497,DSM:15497,... \n", "7766 KCTC 62455,KCTC:62455,MCCC 1K03495,MCCC:1K0349... 
\n", "7767 BCRC 17749,BCRC:17749,JCM 15083,JCM:15083,PIM1 \n", "7768 908087,CGMCC 1.7285,CGMCC:1.7285,JCM 15534,JCM... \n", "7769 NBRC 108617,NBRC:108617,TISTR 2054,TISTR:2054,... \n", "7770 CGMCC 1.12458,CGMCC:1.12458,JCM 19263,JCM:1926... \n", "7771 BCCM/LMG:22170,CECT 5858,CECT:5858,LMG 22170,L... \n", "7772 CCUG 54359,CCUG:54359,ISL-52,KCTC 12971,KCTC:1... \n", "7773 BCCM/LMG:24046,CICC 10319,CICC:10319,DSM 21906... \n", "7774 CL-SP19,JCM 12526,JCM:12526,KCTC 12296,KCTC:12296 \n", "7775 BCRC 17750,BCRC:17750,JCM 15084,JCM:15084,PIN1 \n", "7776 BCRC 17465,BCRC:17465,JCM 13360,JCM:13360,PIT1 \n", "7777 BCRC 80745,BCRC:80745,CC-PW-9,JCM 19757,JCM:19757 \n", "7778 BCCM/LMG:27903,DSM 27808,DSM:27808,JCM 19499,J... \n", "7779 10-D-4,BCCM/LMG:25227,CCTCC AB 209061,CCTCC:AB... \n", "7780 ATCC BAA-313,ATCC:BAA-313,KMM 231,KMM:231,cult... \n", "\n", " has sequences from type material? number of assemblies per taxon \\\n", "7752 yes 13 \n", "7753 yes 7 \n", "7754 yes 1 \n", "7755 yes 2 \n", "7756 yes 1 \n", "7757 yes 1 \n", "7758 yes 3 \n", "7759 yes 2 \n", "7760 yes 2 \n", "7761 yes 1 \n", "7762 yes 1 \n", "7763 yes 1 \n", "7764 yes 1 \n", "7765 yes 6 \n", "7766 yes 1 \n", "7767 yes 1 \n", "7768 yes 2 \n", "7769 yes 3 \n", "7770 yes 2 \n", "7771 yes 1 \n", "7772 yes 2 \n", "7773 yes 2 \n", "7774 yes 1 \n", "7775 yes 3 \n", "7776 yes 1 \n", "7777 yes 1 \n", "7778 yes 2 \n", "7779 yes 1 \n", "7780 yes 2 \n", "\n", " number of assemblies from type materials per taxon \\\n", "7752 2 \n", "7753 1 \n", "7754 0 \n", "7755 1 \n", "7756 1 \n", "7757 1 \n", "7758 1 \n", "7759 2 \n", "7760 2 \n", "7761 1 \n", "7762 1 \n", "7763 1 \n", "7764 1 \n", "7765 1 \n", "7766 1 \n", "7767 1 \n", "7768 1 \n", "7769 1 \n", "7770 2 \n", "7771 1 \n", "7772 2 \n", "7773 2 \n", "7774 1 \n", "7775 1 \n", "7776 1 \n", "7777 1 \n", "7778 2 \n", "7779 1 \n", "7780 2 \n", "\n", " number of assemblies from type materials per species \n", "7752 2 \n", "7753 1 \n", "7754 0 \n", "7755 1 \n", "7756 1 \n", 
"7757 1 \n", "7758 1 \n", "7759 2 \n", "7760 2 \n", "7761 1 \n", "7762 1 \n", "7763 1 \n", "7764 1 \n", "7765 1 \n", "7766 1 \n", "7767 1 \n", "7768 1 \n", "7769 1 \n", "7770 2 \n", "7771 1 \n", "7772 2 \n", "7773 2 \n", "7774 1 \n", "7775 1 \n", "7776 1 \n", "7777 1 \n", "7778 2 \n", "7779 1 \n", "7780 2 " ] }, "execution_count": 203, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df['#scientific name'].str.contains('Idiomarina')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are 10 entries for *Salmonella* in the table. Let's create a boxplot of the values in one column for all the *Salmonella* entries. First, we put the rows of the dataframe that match the condition into another dataframe named `x`." ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [], "source": [ "x = df[df['#scientific name'].str.contains('Salmonella')]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we plot one of the columns using the `boxplot` method provided by Pandas. 
" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD4CAYAAAD1jb0+AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAAUF0lEQVR4nO3df5RcZ13H8fe3SaE0bakYWLEFFgExWKHACEKxbCggAoeiohALFImuvwhQFQQDiscTLYICgr9CU4JQg5byuxxoLTsGsC1NSmjTLkhPG6BSaTloy1a0JP36x322TDezO7PzI9uHvl/n7Jk7d+69z/feeeYzd+69sxOZiSSpPoetdAGSpMEY4JJUKQNckiplgEtSpQxwSarU6kPZ2Nq1a3NycvJQNin15dZbb2XNmjUrXYbU1e7du7+ZmfddOP6QBvjk5CS7du06lE1KfWm320xNTa10GVJXEfGVbuM9hCJJlTLAJalSBrgkVcoAl6RKGeCSVKmeAR4RZ0fEjRGxt8tjvxcRGRFrx1OeNF47duzghBNO4JRTTuGEE05gx44dK12S1Ld+LiPcDrwD+IfOkRHxAOBpwFdHX5Y0fjt27GDz5s1s27aNAwcOsGrVKjZu3AjAhg0bVrg6qbeee+CZuRP4VpeH3gK8GvD/0apKW7ZsYdu2baxfv57Vq1ezfv16tm3bxpYtW1a6NKkvA32RJyKeA/xHZn4hInpNOw1MA0xMTNButwdpUhq52dlZDhw4QLvdZm5ujna7zYEDB5idnbWfqgrLDvCIOBLYDDy9n+kzcyuwFaDVaqXfdtNdxbp161i1ahVTU1N3fBNzZmaGdevW+a1MVWGQq1AeAjwY+EJE7AOOBy6PiB8aZWHSuG3evJmNGzcyMzPD/v37mZmZYePGjWzevHmlS5P6suw98My8Erjf/P0S4q3M/OYI65LGbv5E5aZNm5idnWXdunVs2bLFE5iqRj+XEe4ALgYeHhHXR8TG8ZclHRobNmxg7969XHTRRezdu9fwVlV67oFn5pI9OjMnR1aNJKlvfhNTkiplgEtSpQxwSaqUAS5JlTLAJalSBrgkVcoAl6RKGeCSVCkDXJIqZYBLUqUMcEmqlAEuSZUywCWpUga4JFXKAJekShngklQpA1ySKmWAS1KlDHBJqlQ/P2p8dkTcGBF7O8a9KSK+GBFXRMQHI+LYsVYpSTpIP3vg24FnLBh3IXBCZj4S+HfgtSOuS5LUQ88Az8ydwLcWjLsgM/eXu5cAx4+hNknSElaPYBkvBf5psQcjYhqYBpiYmKDdbo+gSWm05ubm7JuqzlABHhGbgf3AOYtNk5lbga0ArVYrp6amhmlSGot2u419U7UZOMAj4nTg2cApmZmjK0mS1I+BAjwingH8PvDkzPyf0ZYkSepHP5cR7gAuBh4eEddHxEbgHcDRwIURsSci/m7MdUqSFui5B56ZG7qM3jaGWiRJy+A3MSWpUga4JFXKAJekShngklQpA1ySKmWAS1KlDHBJqpQBLkmVMsAlqVIGuCRVygCXpEoZ4JJUKQNckiplgEtSpQxwSaqUAS5JlTLAJalSBrgkVcoAl6RKGeCSVKl+fpX+7Ii4MSL2doy7T0RcGBFfLrc/MN4yJUkL9bMHvh14xoJxrwEuysyHAReV+5KkQ6hngGfmTuBbC0afCry7DL8beO5oy5Ik9bJ6wPkmMvMGgMy8ISLut9iEETENTANMTEzQbrcHbFIan7m5OfumqjNogPctM7cCWwFarVZOTU2Nu0lp2drtNvZN1W
bQq1C+ERH3Byi3N46uJElSPwYN8I8Ap5fh04EPj6YcSVK/+rmMcAdwMfDwiLg+IjYCZwJPi4gvA08r9yVJh1DPY+CZuWGRh04ZcS2SpGXwm5iSVCkDXJIqZYBLUqUMcEmqlAEuSZUywCWpUga4JFXKAJekShngklQpA1ySKmWAS1KlDHBJqpQBLkmVMsAlqVIGuCRVygCXpEoZ4JJUKQNckiplgEtSpYYK8Ig4IyKuioi9EbEjIo4YVWGSpKUNHOARcRzwcqCVmScAq4AXjKowSdLShj2Eshq4V0SsBo4Evj58SZKkfqwedMbM/I+IeDPwVeA7wAWZecHC6SJiGpgGmJiYoN1uD9qkNDZzc3P2TVUnMnOwGSN+ADgPeD7w38C5wPsz872LzdNqtXLXrl0DtSeNU7vdZmpqaqXLkLqKiN2Z2Vo4fphDKE8FrsvMmzLzu8AHgCcOsTxJ0jIME+BfBX4qIo6MiABOAWZHU5YkqZeBAzwzLwXeD1wOXFmWtXVEdUmSehj4JCZAZv4R8EcjqkWStAx+E1OSKmWAS1KlDHBJqpQBLkmVMsAlqVIGuCRVygCXpEoZ4JJUKQNckiplgEtSpQxwSaqUAS5JlTLAJalSBrgkVcoAl6RKGeCSVCkDXJIqZYBLUqUMcEmq1FABHhHHRsT7I+KLETEbEU8YVWGSpKUN9aPGwNuAT2Tm8yLiHsCRI6hJktSHgQM8Io4BTgZeApCZtwG3jaYsSVIvw+yB/whwE/CuiHgUsBt4RWbe2jlRREwD0wATExO02+0hmpTGY25uzr6p6kRmDjZjRAu4BDgpMy+NiLcBt2Tm6xebp9Vq5a5duwarVBqjdrvN1NTUSpchdRURuzOztXD8MCcxrweuz8xLy/33A48ZYnmSpGUYOMAz8z+Br0XEw8uoU4CrR1KVJKmnYa9C2QScU65AuRb4leFLkiT1Y6gAz8w9wEHHZSRJ4+c3MSWpUga4JFXKAJekShngklQpA1ySKmWAS1KlDHBJqpQBLkmVMsAlqVIGuCRVygCXpEoZ4JJUKQNckiplgEtSpQxwSaqUAS5JlTLAJalSBrgkVcoAl6RKDR3gEbEqIj4fER8bRUGSpP6MYg/8FcDsCJYjSVqGoQI8Io4HngWcNZpyJEn9Wj3k/G8FXg0cvdgEETENTANMTEzQbreHbFIavbm5OfumqjNwgEfEs4EbM3N3REwtNl1mbgW2ArRarZyaWnRSacW0223sm6rNMIdQTgKeExH7gPcBT4mI946kKklSTwMHeGa+NjOPz8xJ4AXApzLzhSOrTJK0JK8Dl6RKDXsSE4DMbAPtUSxLktQf98AlqVIGuCRVygCXpEoZ4JJUKQNckiplgEtSpQxwSaqUAS5JlTLAJalSBrgkVcoAl6RKGeCSVCkDXJIqZYBLUqUMcEmqlAEuSZUywCWpUga4JFXKAJekShngklSpgQM8Ih4QETMRMRsRV0XEK0ZZmCRpacP8Kv1+4Hcz8/KIOBrYHREXZubVI6pNkrSEgffAM/OGzLy8DH8bmAWOG1VhkqSlDbMHfoeImAQeDVza5bFpYBpgYmKCdrs9iiZ1N7LpK5sOTUPvHn8Tb3/Q28ffiO42IjOHW0DEUcC/Alsy8wNLTdtqtXLXrl1Dtae7n8nXnM++M5811jba7TZTU1NjbeNQrIe+P0XE7sxsLRw/1FUoEXE4cB5wTq/wliSN1jBXoQSwDZjNzL8cXUmSpH4Mswd+EvAi4CkRsaf8PXNEdUmSehj4JGZmfgaIEdYiSVoGv4kpSZUywCWpUga4JFXKAJekShngklQpA1ySKmWAS1KlDHBJqpQBLkmVMsAlqVIGuCRVygCXpEoZ4JJUKQNckiplgEtSpQxwSaqUAS5JlTLAJalSA/+kmnQoTb7m/PE38onxtnHvex0+1uXr7meoAI+IZwBvA1YBZ2XmmSOpSuqw78xnjb2Nydecf0jakUZp4EMoEbEK+GvgZ4FHABsi4hGjKk
yStLRhjoE/DrgmM6/NzNuA9wGnjqYsSVIvwxxCOQ74Wsf964HHL5woIqaBaYCJiQna7fYQTUr9Wb9+/bLniTcuv52ZmZnlzySNyDABHl3G5UEjMrcCWwFarVZOTU0N0aTUn8yDuuKS2u029k3VZphDKNcDD+i4fzzw9eHKkST1a5gAvwx4WEQ8OCLuAbwA+MhoypIk9TLwIZTM3B8RLwM+SXMZ4dmZedXIKpMkLWmo68Az8+PAx0dUiyRpGfwqvSRVygCXpEoZ4JJUKQNckioVy/3Cw1CNRdwEfOWQNSj1by3wzZUuQlrEgzLzvgtHHtIAl+6qImJXZrZWug5pOTyEIkmVMsAlqVIGuNTYutIFSMvlMXBJqpR74JJUKQNckiplgI9JRLQjYuyXpUXEyyNiNiLOGXdbyxURc4uM3x4RzyvDZw37W6oRcc+I+JeI2BMRzx9mWT3amYyIXx7X8kdhmBoj4t/6mKbrczpuNWz7lWCA3wVFxHL+S+RvAc/MzNPGVc84ZeavZubVQy7m0cDhmXliZv5T5wPlx7dHZRK4q4fIJMuscX4bZeYTx1HQILq8Bia562/7Qy8z77Z/NJ1iFngncBVwAXCv8lgbaJXhtcC+MvwS4EPAR4HrgJcBvwN8HrgEuE/H/G8F/g3YCzyujF8DnE3zgxifB07tWO65Zbmf6lLr75Tl7AVeWcb9HXAbcCVwRpd1+zRwefl7Yhl/f2AnsKcs66dp/p/79nL/jmUBDwE+Aewuy/qxMn478LfADHAt8OSyTrPA9o4a5oC/KO1fBNy3Y/7nddnOTwcuLtOfCxxVxp8JXA1cAbx5wXreD7gGuLms00OAfcAfAp+h+aGRDWW99gJvXFDfG8v6/QvND3W3yzo9p8tzcElHO2eUbXJix+OfBR4JvAF4D/Ap4MvAr3VM86ry3F8B/PEi/bJnXUs8vwtrXAW8qaPNXy/TTZXn7x+Bq+fbLbdHlefr8rLdTu2sbbF+1GU99pX1+Fz5e2gZf1/gvFLTZcBJZfwbaK4GugD4xx7bfrH1/7myzaLU+O/ADwEPKut0Rbl9YEdf/Cua1+m1lH5Zy9+KF7CiK990gv3zL0Lgn4EXluE2iwf4NcDRpSPeDPxGeewtfC9c28A7y/DJwN4y/KcdbRxbOtiastzrKW8AC+p8bHkhrSkvrquAR3e8SNZ2medI4Igy/DBgVxn+XWBzGV5V1uOxwIUd8x5bbi8CHlaGH095Yymd/n3lRXIqcAvwEzSf6HZ3bM8ETivDfwi8o2P+OwV42cY7gTVl/O+Xee4DfInvXTF1bJd1nQI+1nF/H/DqMvzDwFfLc7WaJlSf21Hfz5bhD9IEx+HAo4A9fbRzOvDWMvyjHdv4DcAXgHuV9fpaqePpNAEVZVt9DDi5Szs961ri+V1Y4zTwujJ8T2AX8OAy3a3AgzumnQ/n1cAxHX3/mo7tPz/NQf2oy3rs65jmxfN10bxpPKkMPxCY7dhuuyk7UT22fdf1L/ffS7Nj9TFgQxn3UeD0MvxS4EMdffHc8nw8ArhmpXNpOX9D/aDD94nrMnNPGd5NE+q9zGTmt4FvR8TNNJ0DmpB9ZMd0OwAyc2dEHBMRx9K8iJ8TEb9XpjmCphNDE6Lf6tLek4APZuatABHxAZo9588vUePhwDsi4kTgAE3AQLPHc3ZEHE7TifdExLXAj0TE24HzgQsi4ijgicC5EXf8fvU9O5b/0czMiLgS+EZmXllqu4pmG+4BbgfmD2m8F/jAEvX+FM0L6LOlvXvQ7I3fAvwvcFZEnE/zouzHfLs/CbQz86ZS3zk0b6gfovn08oky3ZXA/2Xmd8s6TfbRxrnA6yPiVTShsL3jsQ9n5neA70TEDM1e9JNonv/55+0omvDZuWC5/dS12PO70NOBR86fcwDuXdq8DfhcZl7XZZ4A/jQiTqZ5Do8DJoD/7JjmoH60SPs7Om7fUoafCjyio18dExFHl+GPlO3Wy1Lrv4nmU8ElmTnf/h
OAny/D7wH+vGP6D2Xm7cDVETHRR9t3GQY4/F/H8AGavSZo9sznzxEcscQ8t3fcv507b9OFF9knzYvjFzLzS50PRMTjafaIuolFxi/lDOAbNHtth9GE4PybycnAs4D3RMSbMvMfIuJRwM8Avw38EvBK4L8z88RFlt+5zgu3x2L9aqkvHQTNG9iGgx6IeBxwCs3hkJcBT1liOfPmt+VS2+67WXbD6FiPzLy9n/MQmfk/EXEhzaeQX6L5JHHHwwsnL7X8WWb+fY9F91NX1+e3iwA2ZeYn7zQyYorF+9tpNJ9YHlveOPax4DWwWD/qsqzsMnwY8ISFQV0CfbGaFlpq/Y+j2W4TEXFYCeel6ursv4O81laMJzEXt4/m0ALA85aYbinPB4iIJwE3Z+bNNL8huilKb42IR/exnJ3AcyPiyIhYQ3Oc79M95rk3cEPpvC+i+ZhLRDwIuDEz3wlsAx4TEWuBwzLzPOD1wGMy8xbguoj4xTJflJBfjsP43rb7ZZpj0ou5BDgpIh5a2jsyIn60fBK4dzY/3/dK4MRl1nAp8OSIWFtO1m0A/nWZy5j3bZpDTp3OojmGetmCT0+nRsQREfGDNB//L6N57l9a1omIOC4i7jdgLV2f3y41fhL4zbKnTNmma/pY9o0lvNfTHD++k279aJFlPb/j9uIyfAHNG/H8sk7sUQ8cvF6L9e/VwLto+tsszbkjaI5xv6AMn8bSfbEa7oEv7s3AP0fEi2iOmw7iv8qlWcfQfMQG+BOak5tXlBDfBzx7qYVk5uURsZ3mRBDAWZm51OETgL8BzisBPMP39mymgFdFxHdpTpa9mGaP5V0RMf+G/tpyexrwtxHxOpqPrO+jObbbr1uBH4+I3TTnCha9xC8zb4qIlwA7ImL+UM3raF64H46II2j2js5YRvtk5g0R8VqabRDAxzPzw8tZRocrgP0R8QWak7VvyczdEXELTWh0+hzN4agHAn+SmV8Hvh4R64CLy/v3HPBC4MYBalns+b1TjcDbaA67XF76203Ac3ss+xzgoxGxi+ZQ2Be7TDPFwf2om3tGxKU0b+bzn65eDvx1RFxBk0E7gd/oUdPC9Vps/f8A+HRmfjoi9gCXlUNvL6c55PMqmm3wKz3aq4JfpZeGEBE/THMi9sfmP6pHxBtoTva9eQVLW3Hl0EsrM/0/62PiIRRpQBHxYppDNJsXOc4qjZV74JJUKffAJalSBrgkVcoAl6RKGeCSVCkDXJIq9f/LgWIqAsK0dwAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "x.boxplot(column='number of assemblies from type materials per taxon')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These are just a few examples on how to use a boxplot. See here for more examples: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.boxplot.html" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.24.2'" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Matplotlib" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python has a very useful package known as \"matplotlib\" (see here: https://matplotlib.org/). It provides a variety of plotting functions and plot types. See some examples here to learn what it can do: https://matplotlib.org/gallery/index.html\n", "\n", "These are plots that you wouldn't be able to plot using Microsoft Excel. `R` has similar plot functionalities and maybe easier to use than Python libraries. Here, we will explore a few plotting functions. 
First, you need to import the library before you can use its features.\n" ] }, { "cell_type": "code", "execution_count": 213, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 213, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXAAAAD4CAYAAAD1jb0+AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAANFUlEQVR4nO3dX4hc9RnG8eepo7SxlSrZStXSURDbUuimDEFrEWkKKyjqRQ0WLFYWcrO0trQE7cUG9kqCiF4sQnC1QsVSoqCIOIpVbBECk2TBaAqCrjE2baaU/qEU7OLbizOS7OpuNjNn5px35/u5OZvf/jkvs5NvZs/s5OeIEAAgn89UPQAAoD8EHACSIuAAkBQBB4CkCDgAJNUY5cm2bt0azWZzlKcEgPQOHjz4t4iYWL0+0oA3m011Op1RnhIA0rP93qetcwkFAJIi4ACQFAEHgKQIOAAkRcABIKkzBtz2o7ZP2j5y2tpFtl+y/XbveOFwx6yBdlvauVPavr04tttVTwRgzG3kEfivJd2wau0eSS9HxJWSXu79efNqt6X5eWn3bun114vj/DwRB1CpMwY8Il6T9PdVy7dIerz39uOSbi13rJpZWJBmZ6VWS2o0iuPsbLEOABXp9xr4xRFxQpJ6xy+t9YG2d9nu2O50u90+T1expSVpcnLl2uRksQ4AFRn6k5gRsS8iWhHRmpj4xCtBc2g2pcXFlWuLi8U6AFSk34D/1faXJal3PFneSDU0PS3NzUmdjrS8XBzn5op1AKhIv/8XyrOS7pR0X+/4TGkT1dHUVHHcu7e4bNJsSjMzp9YBoAJnDLjtJyVdL2mr7eOS9qgI9+9sT0s6Jum2YQ5ZC1NTBBtArZwx4BHxwzXetaPkWQAAZ4FXYgJAUgQcAJIi4ACQFAEHgKQIOAAkRcABICkCDgBJEXAASIqAA0BSBBwAkiLgAJAUAQeApAg4ACRFwAEgKQIOAEkRcABIioADQFIEHACSIuAAkBQBB4CkCDgAJEXAASApAg4ASRFwAEiKgANAUgQcAJIi4ACQFAEHgKQIOAAkRcABICkCDgBJEXAASIqAA0BSBBwAkiLgAJDUQAG3/XPbb9o+YvtJ258tazAASbXb0s6d0vbtxbHdrnqiTavvgNu+VNJPJbUi4puSzpF0e1mDAUio3Zbm56Xdu6XXXy+O8/NEfEgGvYTSkPQ52w1JWyT9efCRAKS1sCDNzkqtltRoFMfZ2WIdpes74BHxgaT7JR2TdELSPyPixdUfZ3uX7Y7tTrfb7X9SAPW3tCRNTq5cm5ws1lG6QS6hXCjpFkmXS7pE0vm271j9cRGxLyJaEdGamJjof1IA9ddsSouLK9cWF4t1lG6QSyjfl/RuRHQj4n+Snpb0nXLGApDS9LQ0Nyd1OtLycnGcmyvWUbrGAJ97TNLVtrdI+q+kHZI6pUwFIKepqeK4d29x2aTZlGZmTq2jVH0HPCIO2N4v6ZCkZUmHJe0razAASU1NEewRGeQRuCJij6Q9Jc0CADgLvBITAJIi4ACQFAEHgKQIOAAkRcABICkCDgBJEXAASIqAA0BSBBwAkiLgAJAUAQeApAg4ACRFwAEgKQIOAEkRcABIioADQFIEHACSIuAAkBQBB4CkCDgAJEXAASApAg4ASRFwAEiKgANAUgQcAJIi4ACQFAEHgKQIOAAkRcABICkCDgBJEXAASIqAA0BSBBwA
kiLgAJAUAQeApAYKuO0v2t5v+0+2j9q+pqzBAADrawz4+Q9JeiEifmD7PElbSpgJALABfQfc9gWSrpP0Y0mKiA8lfVjOWACAMxnkEsoVkrqSHrN92PYjts9f/UG2d9nu2O50u90BTgcAON0gAW9I+rakhyNim6T/SLpn9QdFxL6IaEVEa2JiYoDTAQBON0jAj0s6HhEHen/eryLoAIAR6DvgEfEXSe/bvqq3tEPSW6VMBQA4o0F/C+Unkp7o/QbKO5LuGnwkAMBGDBTwiFiU1CpnFADA2eCVmACQFAEHgKQIOAAkRcABICkCDgBJEXAASIqAA0BSBBwAkiLgAJAUAQeApAg4ACRFwAEgKQIOAEnVP+DttrRzp7R9e3Fst6ueCBLfF6AGBv3/wIer3Zbm56XZWWlyUlpclObmivdNTVU52Xjj+wLUQr0fgS8sFJFotaRGozjOzhbrqA7fF6AW6h3wpaXiEd7pJieLdVSH7wtQC/UOeLNZ/Hh+usXFYh3V4fsC1EK9Az49XVxb7XSk5eXiODdXrKM6fF+AWqj3k5gfPyG2d2/x43mzKc3M8ERZ1fi+ALXgiBjZyVqtVnQ6nZGdDwA2A9sHI+ITG8jX+xIKAGBNBBwAkiLgAJAUAQeApAg4ACRFwAEgKQIOAEkRcABIioADQFIEHACSIuAAkBQBB4CkCDgAJEXAASCpgQNu+xzbh20/V8ZAwIa129LOndL27cWx3a56ImCkytjQ4W5JRyVdUMLXAjam3Zbm54vNlCcniy3d5uaK97GxBMbEQI/AbV8m6UZJj5QzDrBBCwtFvFstqdEojrOzxTowJga9hPKgpN2SPlrrA2zvst2x3el2uwOeDuhZWioeeZ9ucrJYB8ZE3wG3fZOkkxFxcL2Pi4h9EdGKiNbExES/pwNWajaLyyanW1ws1oExMcgj8Gsl3Wx7SdJvJX3P9m9KmQo4k+np4pp3pyMtLxfHubliHRgTfT+JGRH3SrpXkmxfL+mXEXFHOWMBZ/DxE5V79xaXTZpNaWaGJzAxVsr4LRSgGlNTBBtjrZSAR8Srkl4t42sBADaGV2ICQFIEHACSIuAAkBQBB4CkCDgAJEXAASApAg4ASRFwAEiKgANAUgQcAJIi4ACQFAEHgKQIOAAkRcCzYSd2rIX7xtjh/wPPhJ3YsRbuG2OJR+CZsBM71sJ9YywR8EzYiR1r4b4xlgh4JuzEjrVw3xhLBDwTdmLHWrhvjCWexMyEndixFu4bY8kRMbKTtVqt6HQ6IzsfAGwGtg9GRGv1OpdQACApAg4ASRFwAEiKgANAUgQcAJIi4ACQFAEHgKQIOAAkRcABICkCDgBJEXAASIqAA0BSBBwAkiLgAJBU3wG3/RXbr9g+avtN23eXORiQBrvBYz1DvH8MsqHDsqRfRMQh21+QdND2SxHxVkmzAfXHbvBYz5DvH30/Ao+IExFxqPf2vyUdlXTpwBMBmbAbPNYz5PtHKdfAbTclbZN04FPet8t2x3an2+2WcTqgPtgNHusZ8v1j4IDb/rykpyT9LCL+tfr9EbEvIloR0ZqYmBj0dEC9sBs81jPk+8dAAbd9rop4PxERT5cyEZAJu8FjPUO+f/T9JKZtS1qQdDQiHihlGiAbdoPHeoZ8/+h7V3rb35X0B0lvSPqot/yriHh+rc9hV3oAOHtr7Urf9yPwiPijJA80FQCgb7wSEwCSIuAAkBQBB4CkCDgAJEXAASApAg4ASRFwAEiKgANAUgQcAJIi4ACQFAEHgKQIOAAkRcABICkCDmBzGuJu8HUxyK70AFBPQ94Nvi54BA5g8xnybvB1QcABbD5D3g2+Lgg4gM1nyLvB1wUBB7D5DHk3+LrgSUwAm8+Qd4OvCwIOYHOamtp0wV6NSygAkBQBB4CkCDgAJEXAASApAg4ASTkiRncyuyvpvZGdcDi2Svpb1UPUCLfHKdwWK3F7rDTI7fHViJhYvTjSgG8GtjsR0ap6jrrg9jiF22Ilbo+V
hnF7cAkFAJIi4ACQFAE/e/uqHqBmuD1O4bZYidtjpdJvD66BA0BSPAIHgKQIOAAkRcA3yPZXbL9i+6jtN23fXfVMVbN9ju3Dtp+repaq2f6i7f22/9S7j1xT9UxVsf3z3t+RI7aftP3ZqmcaJduP2j5p+8hpaxfZfsn2273jhWWci4Bv3LKkX0TE1yVdLWnG9jcqnqlqd0s6WvUQNfGQpBci4muSvqUxvV1sXyrpp5JaEfFNSedIur3aqUbu15JuWLV2j6SXI+JKSS/3/jwwAr5BEXEiIg713v63ir+gl1Y7VXVsXybpRkmPVD1L1WxfIOk6SQuSFBEfRsQ/Kh2qWg1Jn7PdkLRF0p8rnmekIuI1SX9ftXyLpMd7bz8u6dYyzkXA+2C7KWmbpAMVj1KlByXtlvRRxXPUwRWSupIe611SesT2+VUPVYWI+EDS/ZKOSToh6Z8R8WK1U9XCxRFxQioeDEr6UhlflICfJdufl/SUpJ9FxL+qnqcKtm+SdDIiDlY9S000JH1b0sMRsU3Sf1TSj8jZ9K7t3iLpckmXSDrf9h3VTrV5EfCzYPtcFfF+IiKernqeCl0r6WbbS5J+K+l7tn9T7UiVOi7peER8/BPZfhVBH0ffl/RuRHQj4n+Snpb0nYpnqoO/2v6yJPWOJ8v4ogR8g2xbxTXOoxHxQNXzVCki7o2IyyKiqeIJqt9HxNg+yoqIv0h63/ZVvaUdkt6qcKQqHZN0te0tvb8zOzSmT+iu8qykO3tv3ynpmTK+KJsab9y1kn4k6Q3bi721X0XE89WNhBr5iaQnbJ8n6R1Jd1U8TyUi4oDt/ZIOqfjNrcMas5fU235S0vWStto+LmmPpPsk/c72tIp/5G4r5Vy8lB4AcuISCgAkRcABICkCDgBJEXAASIqAA0BSBBwAkiLgAJDU/wGh67zrUC2obQAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]\n", "y = [5, 10, 3, 5, 4, 2, 3, 8, 1, 2]\n", "z = [10, 20, 30, 50, 10, 50, 20, 10, 10, 5]\n", "\n", "\n", "#plt.scatter(x, y, s=z)\n", "plt.scatter(x, y, c=\"w\", alpha=0.8, edgecolors=\"r\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, I have created a scatter plot with 3 variables: `x`, `y`, and `z`. `x` contains coordinates of the x-axis, `y` coordinates of the y-axis, and `z` the sizes of the scatter dots. If you don't specify the `s` in the scatter function it will produce dots with the same size. To learn more about this scatter function, you can type like this:" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Help on function scatter in module matplotlib.pyplot:\n", "\n", "scatter(x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None, verts=, edgecolors=None, *, plotnonfinite=False, data=None, **kwargs)\n", " A scatter plot of *y* vs. *x* with varying marker size and/or color.\n", " \n", " Parameters\n", " ----------\n", " x, y : float or array-like, shape (n, )\n", " The data positions.\n", " \n", " s : float or array-like, shape (n, ), optional\n", " The marker size in points**2.\n", " Default is ``rcParams['lines.markersize'] ** 2``.\n", " \n", " c : array-like or list of colors or color, optional\n", " The marker colors. 
Possible values:\n", " \n", " - A scalar or sequence of n numbers to be mapped to colors using\n", " *cmap* and *norm*.\n", " - A 2-D array in which the rows are RGB or RGBA.\n", " - A sequence of colors of length n.\n", " - A single color format string.\n", " \n", " Note that *c* should not be a single numeric RGB or RGBA sequence\n", " because that is indistinguishable from an array of values to be\n", " colormapped. If you want to specify the same RGB or RGBA value for\n", " all points, use a 2-D array with a single row. Otherwise, value-\n", " matching will have precedence in case of a size matching with *x*\n", " and *y*.\n", " \n", " If you wish to specify a single color for all points\n", " prefer the *color* keyword argument.\n", " \n", " Defaults to `None`. In that case the marker color is determined\n", " by the value of *color*, *facecolor* or *facecolors*. In case\n", " those are not specified or `None`, the marker color is determined\n", " by the next color of the ``Axes``' current \"shape and fill\" color\n", " cycle. This cycle defaults to :rc:`axes.prop_cycle`.\n", " \n", " marker : `~.markers.MarkerStyle`, default: :rc:`scatter.marker`\n", " The marker style. *marker* can be either an instance of the class\n", " or the text shorthand for a particular marker.\n", " See :mod:`matplotlib.markers` for more information about marker\n", " styles.\n", " \n", " cmap : str or `~matplotlib.colors.Colormap`, default: :rc:`image.cmap`\n", " A `.Colormap` instance or registered colormap name. 
*cmap* is only\n", " used if *c* is an array of floats.\n", " \n", " norm : `~matplotlib.colors.Normalize`, default: None\n", " If *c* is an array of floats, *norm* is used to scale the color\n", " data, *c*, in the range 0 to 1, in order to map into the colormap\n", " *cmap*.\n", " If *None*, use the default `.colors.Normalize`.\n", " \n", " vmin, vmax : float, default: None\n", " *vmin* and *vmax* are used in conjunction with the default norm to\n", " map the color array *c* to the colormap *cmap*. If None, the\n", " respective min and max of the color array is used.\n", " It is deprecated to use *vmin*/*vmax* when *norm* is given.\n", " \n", " alpha : float, default: None\n", " The alpha blending value, between 0 (transparent) and 1 (opaque).\n", " \n", " linewidths : float or array-like, default: :rc:`lines.linewidth`\n", " The linewidth of the marker edges. Note: The default *edgecolors*\n", " is 'face'. You may want to change this as well.\n", " \n", " edgecolors : {'face', 'none', *None*} or color or sequence of color, default: :rc:`scatter.edgecolors`\n", " The edge color of the marker. 
Possible values:\n", " \n", " - 'face': The edge color will always be the same as the face color.\n", " - 'none': No patch boundary will be drawn.\n", " - A color or sequence of colors.\n", " \n", " For non-filled markers, the *edgecolors* kwarg is ignored and\n", " forced to 'face' internally.\n", " \n", " plotnonfinite : bool, default: False\n", " Set to plot points with nonfinite *c*, in conjunction with\n", " `~matplotlib.colors.Colormap.set_bad`.\n", " \n", " Returns\n", " -------\n", " `~matplotlib.collections.PathCollection`\n", " \n", " Other Parameters\n", " ----------------\n", " **kwargs : `~matplotlib.collections.Collection` properties\n", " \n", " See Also\n", " --------\n", " plot : To plot scatter plots when markers are identical in size and\n", " color.\n", " \n", " Notes\n", " -----\n", " * The `.plot` function will be faster for scatterplots where markers\n", " don't vary in size or color.\n", " \n", " * Any or all of *x*, *y*, *s*, and *c* may be masked arrays, in which\n", " case all masks will be combined and only unmasked points will be\n", " plotted.\n", " \n", " * Fundamentally, scatter works with 1-D arrays; *x*, *y*, *s*, and *c*\n", " may be input as N-D arrays, but within scatter they will be\n", " flattened. The exception is *c*, which will be flattened only if its\n", " size matches the size of *x* and *y*.\n", " \n", " .. note::\n", " In addition to the above described arguments, this function can take\n", " a *data* keyword argument. 
If such a *data* argument is given,\n", " the following arguments can also be string ``s``, which is\n", " interpreted as ``data[s]`` (unless this raises an exception):\n", " *x*, *y*, *s*, *linewidths*, *edgecolors*, *c*, *facecolor*, *facecolors*, *color*.\n", " \n", " Objects passed as **data** must support item access (``data[s]``) and\n", " membership test (``s in data``).\n", "\n" ] } ], "source": [ "help(plt.scatter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This prints the help document associated with the `scatter` function from Matplotlib. You can also go to the Matplotlib page to see examples on how to customize your plots." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Seaborn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another Python package I really like using is known as \"Seaborn\" and it is designed to visualize statistical data. See here: https://seaborn.pydata.org/\n", "\n", "Here, we will try to plot the x, y, and z data we just plotted earlier. First, you need to put the data into a dataframe before you can use Seaborn functions." 
] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAZEAAAFgCAYAAAB670TrAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjMuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8vihELAAAACXBIWXMAAAsTAAALEwEAmpwYAAAaPklEQVR4nO3de5RdZZnn8e+TqkoqKVIkIQGBqIAyCKJcPDjexmZE2wg0qD3aTg8O2q4WVguio03Tdk/jaruV5dheVjsLBVSY0cFGGhsUBeWmslqRIyByEUUQCAmkQghJKpfK5Zk/zg6GMiGVt1Jnnzr1/axV69R+98l7npNU6nf2ft9378hMJEkqMa3uAiRJk5chIkkqZohIkooZIpKkYoaIJKlYb90FjMWiRYvymmuuqbsMSSoVdRcwUSbFkcjy5cvrLkGStB2TIkQkSZ3JEJEkFTNEJEnFDBFJUjFDRJJUzBCRJBUzRCRJxQwRSVIxQ0SSVGzCQiQivhwRyyLirm3a5kXE9yPi19Xj3Il6/Tps2byFJx9fx9L7V7FiyVrWr9lYd0mSNKEm8kjkYmDRqLZzgOsz82Dg+mq7ayz9zWrOee3VfPCYqzjziG/yg68/wLrVBomk7jVhIZKZPwRWjGo+Gbik+v4S4M0T9frttnrFBi446yc8tWw9AJs3JZec02Tt6pGaK5OkidPuMZF9MnMpQPW4946eGBHvjYhmRDSHhobaVmCpTRu3sPjep57RtmVzMvykISKpe3XswHpmXpCZjcxsLFiwoO5ydqp/oJcj/3D/Z7TNHOxj9l4zaqpIkiZeu0Pk8YjYF6B6XNbm158wM/fo45SPHc1/PPl59E6fxvNePIf/eeXrDRFJXa3dN6W6CjgVOK96vLLNrz+h5u4zk9P/+ZVsWLuJadOCwQX9dZckSRNqwkIkIi4FjgXmR8Ri4Fxa4XFZRLwHeBh420S9fl1mzu5j5uy+usuQpLaYsBDJzP+6g13HTdRrSpLaq2MH1iVJnc8QkSQVM0QkScUMEUlSMUNEklTMEJEkFTNEJEnFDBFJUjFDRJJUzBCRJBUzRCRJxQwRSVIxQ0SSVMwQkSQVM0QkScUMEUlSMUNEklTMEJEkFTNEJEnFDBFJUjFDRJJUzBCRJBUzRCRJxQwRSVIxQ0SSVMwQkSQVM0QkScUMEUlSMUNEklTMEJEkFTNEJEnFDBFJUjFDRJJUzBCRJBUzRCRJxQwRSVIxQ0SSVMwQkSQVM0QkScUMEUlSMUNEklTMEJEkFTNEJEnFDBFJUjFDRJJUzBCRJBUzRCRJxQwRSVIxQ0SSVMwQkSQVM0QkScUMEUlSMUNEklSslhCJiA9GxN0RcVdEXBoR/XXUIUkan7aHSETsD7wfaGTm4UAP8I521yFp123auJlHf/UUl/797dz8jQdZ/cT6uktSzXprfN2ZEbERmAUsqakOSbtg1fINfOR132XD8CYA/uRvj+CPzjyM3uk9NVemurT9SCQzHwU+BTwMLAWeyszvjX5eRLw3IpoR0RwaGmp3mZK244nFw08HCMAvbnyMDWs311iR6lbH6ay5wMnAgcB+wEBEnDL6eZl5QWY2MrOxYMGCdpcpaTvmP3eAWYN9T283TljIjIG6TmioE9Txr/964MHMHAKIiCuAVwFfraEWSbtgcH4/n/jB8fzsO4vZ9+BBDn7ZfHr7nOQ5ldURIg8Dr4iIWcA64DigWUMdknZRT+809jlgNsf/xaF1l6IOUceYyC3A5cBtwC+qGi5odx2SpPGr5WRmZp4LnFvHa0uSdh9PZkqSihkikqRihogkqZghIkkqZohIkooZIpKkYoaIJKmYISJJKmaISJKKGSKSpGKGiCSpmCEiSSpmiEiSihkikqRihogkqZghIkkqZ
ohIkooZIpKkYoaIJKmYISJJKmaISJKKGSKSpGKGiCSpmCEiSSpmiEiSihkikqRihogkqZghIkkqZohIkooZIpKkYoaIJKmYISJJKmaISJKKGSKSpGKGiCSpmCEiSSpmiEiSihkikqRihogkqZghIkkqZohIkooZIpKkYoaIJKmYISJJKmaISJKKGSKSpGKGiCSpmCEiSSpmiEjSFBERp0fEHdXXgxFx43j7NEQkaYrIzC9k5pHAMcBi4NPj7dMQkaSp53PADZn5rfF21LsbipEkTRIR8S7g+cAZu6M/Q0SSpoiIeBnwYeA/ZeaW3dFnLaezImJORFweEb+MiHsj4pV11CFJU8wZwDzgxmpw/aLxdljXkcjngGsy879ExHRgVk11SNKUkZnv3t19tj1EImIQeC3wLoDMHAFG2l2HJGn86jiddRAwBHwlIm6PiIsiYqCGOiRJ41RHiPQCRwPnZ+ZRwDBwzugnRcR7I6IZEc2hoaF21yhJGoM6QmQxsDgzb6m2L6cVKs+QmRdkZiMzGwsWLGhrgZKksWl7iGTmY8AjEXFI1XQccE+765AkjV9ds7POBL5Wzcx6ANjtMwYkSROvlhDJzDuARh2vLUlTWUQsorXMoge4KDPPG09/rliXpA7WaDR6gfnA8mazuWk8fUVED/C/gTfQGp++NSKuysziIQUvwChJHarRaLyK1pKIB4Ghans8Xg7cn5kPVGv0vg6cPJ4ODRFJ6kDVEcjVwBygv3q8utFo9Iyj2/2BR7bZXly1FTNEJKkzzacVHtvqB8az5iG205bj6M8QkaQOtRxYP6ptPa3TW6UWA8/dZnshsGQc/RkiktSJqkH0E4CVtMJjJXBCs9ncPI5ubwUOjogDqyUW7wCuGk+dhogkdahms/nvtE5rHQjMr7aLZeYmWpeDvxa4F7gsM+8eT5+ROa7TYW3RaDSy2WzWXYYkldreWERX8EhEklRspyESEWdExNx2FCNJmlzGciTyHFqrGi+LiEUR0bWHZZKkXbPTEMnMvwUOBr5E626Ev46Ij0fECya4NklShxvTmEi2Rt8fq742AXOByyPikxNYmySpw+30AowR8X7gVFoLXy4C/jIzN0bENODXwNkTW6IkqVON5UhkPvDWzHxjZn4jMzcCZOYW4MQJrU6StFtFxJcjYllE3LVN20cj4tGIuKP6On6s/e30SCQz/+5Z9t071heSJO2aRqPxXFpne14O/BT4ZLPZfOTZ/9ROXQx8Hvg/o9o/k5mf2tXOXCciSR2oCpCfA6fRCpHTgJ9X7cUy84fAivFX2GKISFJnOhvYA+irtvuq7Ykahz4jIu6sTneNeW2gISJJnenl/C5Atuqr2ne384EXAEcCS4F/GusfNEQkqTP9FNg4qm1j1b5bZebjmbm5mjB1IbsQVIaIJHWmTwJr+F2QbKy2d/v6vIjYd5vNtwB37ei5o+10dpa0M2ue3MCG4U2MrN/MjIFeBuZMZ8ZMf7Sk8Wg2m480Go0j2M2zsyLiUuBYYH5ELAbOBY6NiCNp3eXwt7QG8cfWn5eC13g8sWSYL575E+68YSkA02f28KbTX8SJ7zuU2XuNvrOnNGV17TUHu/Lj4spl67i/uZzB+f3s98JB9pg3o+6SutJTQ+v4+Ftv4NH7nnq6bWTdZq78zN309EzjzR86nOn9PTVWKGmidd2YyMpl6/jYSdfxqf/2A/7ujdfyvS/dx4a1o8emtDs8/uCaZwTItq4+/17WPLmhzRVJareuC5H1qzc+4xfbDy59gHVrNtVYUff6zW3Ld7hvw/AmRtb59y51u64LkRkDvcwc/N3U6gNeOo++GZ5SmQh7P3/2DvfFtKCvvyvPlkraRtf9Lx+c38/Hrn0j//bpu5i77yxOPONQBvacXndZXemgo+Yxa7CPtat+/3Thy960kJkDXffjJWmUrp2dtWnjFqb1BNOmde2kiNpt3rSFh+56kn98y/UMrxx5uv3AI+Zx9tf/M3OfM7PG6qSO0
rW/iLr2o2JvX9edqes4Pb3TeP7hc/nUj0/kobue5IlH13LQkfOYt98Aey5weq80Xo1Go4/WLTdeANwPXN1sNotnCkVEP/BDYAat3/+XZ+a5ETEP+BfgAFrrRN6emU+Oqc9uPRKRpA6yy0cijUbjJcB1wEygH1gPrAOOazabY15R/owiIgIYyMw1EdEH3AycBbwVWJGZ50XEOcDczPyrsfTpx3VJ6jDVEch1wAJgNq0LL86utq+v9u+ybFlTbfZVXwmcDFxStV8CvHmsfRoiktR5TqB1BDL6CCaq9hNKO46Inoi4A1gGfD8zbwH2ycylANXj3mPtzxCRpM7zQlqnsLZnBq0xkiLV1XqPBBYCL4+Iw0v7AkNEkjrR/bTGQLZnA/Cb8b5AZq4EbgIWAY9vvZJv9bhsrP0YIpLUea6mNYg+euZTVu1Xl3QaEQsiYk71/Uzg9cAvgauAU6unnQpcOdY+DRFJ6jDVNN7jgCFgNTBSPQ7Rmp1VOs13X+DGiLgTuJXWmMi3gfOAN0TEr4E3VNtj4hRfSZp4RYsNq1lYJ9AaA/kN41wnMhG6drGhJE12VWD8W911PBtPZ0mSihkikqRihogkqZghIkkq5sC6JHWwRqMxDRgAhpvN5pa66xnNIxFJ6kCNRmN+o9H4PLAKeAJY1Wg0Pt9oNPYab9/V9bNuj4hvV9sfjYhHI+KO6uv4sfZliEhSh2k0GvOB24A/p3UU0lc9/jlwe7V/PM4C7h3V9pnMPLL6+s5YOzJEJKnzfBTYBxh9b+/pVftHSzuOiIW0FjBeVNrHtgwRSeog1RjIu/j9ANlqOvCu6nklPgucDYweXzkjIu6MiC9HxNyxdmaISFJnGWDHAbLVdGDWrnYcEScCyzLzZ6N2nU/r0ipHAkuBfxprn87OkqTOMkzrgovPdvfCEWBtQd+vBk6qBs77gcGI+GpmnrL1CRFxIfDtsXbokYgkdZBqGu/FtIJie0aAi0um+2bmX2fmwsw8AHgHcENmnrL1XiKVtwBjvoe7RyKS1HnOBU7i9wfXR4DHq/270ycj4kha9yv5LXDaWP+gRyKS1GGazeYTwNHAhbROb22sHi8Ejqr2j0tm3pSZJ1bfvzMzX5KZL83Mk7beb30sarufSET0AE3g0a1vZEe8n4ikSa7ofiLw9GytWcDaTlyxXufprK2LXQZrrEGT3PDKEVYsXcuq5etZeMie7Ln3zLpLknarKjjW1F3HjtQSItssdvlH4H/UUYMmv5F1m7j58gf5yl/eCsDzXjyHj1xxHHMMEqlt6hoT+SzbX+wijdm6NRu59sL7nt5++O6VrF3VUXcOlbpe20PkWRa7jH7eeyOiGRHNoaGhNlWnyaRveg8LD9nz6e3pM3voH3DCodRObR9Yj4hPAO8ENlEtdgGu2Haxy2gOrGtHnlq2jm+cdycrlq7lj89+Cc87bC59M3rqLksabTwD6wcC+wFLms3mg7uvpN2jttlZABFxLPBhZ2dpPDaNbGbTxi30DzzbAl+pVrscIo1GowF8ETiU1vqQ6bQmI53WHMcvxIiYQ+vii4fTWhfyZ8B9wL8AB9BaJ/L2zHxyLP25TkSTXu/0HgNEXaUKkJtorRWZCexZPR4N3FTtL/U54JrMfBFwBK1gOge4PjMPBq6vtsek1hDZdrGLJOlpX6R1IcbtGQC+UNJpRAwCrwW+BJCZI5m5EjgZuKR62iXAm8fap0ciktRBqjGQQ3fytMOq5+2qg4Ah4CvVnQ0viogBYJ+tq9Srx73H2qEhIkmdZT92fPHFrUaq5+2qXlqnxM7PzKNoXUplzKeutscQkaTOsoSx3U9kSUHfi4HFmXlLtX05rVB5fOuVfKvHZWPt0BCRpA5STeMdff/z0e4pme6bmY8Bj0TEIVXTccA9wFXAqVXbqcCVY+3TlVmS1HlOozU7a3uD68PA6ePo+0zgaxExHXgAeDetA4rLIuI9wMPA28baWa3rRMbKdSKSJrnSdSJfAA7jd+tE7
gFOH886kd3NIxFJ6kBVUDQ6fcW6ISJJHawKjo4Lj60cWJckFTNEJEnFDBFJUjFDRJJUzBCRJBUzRCRJxQwRSVIxQ0SSVMwQkSQVM0QkScUMEUlSMUNEklTMEJEkFTNEJEnFvBR8lxh+aoS1T40w9PAw8583wKzBPvaYM6PusjRBNm/ewponNhA9weBe/XWXoynMEOkCq1ds4MrP3M23P3/P022LTjuEPz77Jcye5y+YbrN6xQZ+/M3f8t0v3MeswT5O+djRHHjkPPpn9dVdmqYgT2d1gaGH1zwjQACu+eJ9PPbAmpoq0kS650eP8eUP38rS+1fxm9ue4GMnXceqZRvqLktTlCHSBW69+pHttv/kyofaXIkm2rrVI9zwf+9/RtuWzcmdNy2tqSJNdYZIF3jOQbO3277vC7bfrsmrb0YP+x08+HvtO/oZkCaaIdIFXvq6/Zizz8xntA3On8HLFi2sqSJNlN7pPZx4xmHstXDW022HH/scnnvonPqK0pQWmVl3DTvVaDSy2WzWXUbHykxWLFnLt/75Hn51yxAvPGY+J73/xey1/ywiou7yNAFWLlvHE4+uZcbMHgYX9DtDq/N17X9EQ6SLjKzfzIbhjcwY6GN6f0/d5Uj6na4NEaf4dpHp/T2Gh6S2ckxEklTMEJEkFTNEJEnFDBFJUjFDRJJUzBCRJBUzRCRJxQwRSVIxQ0SSVMwQkSQVM0QkScUMEUlSMUNEklTMEJEkFTNEJEnFDBFJUjFDRJJUzBCRJBUzRCRJxQwRSVIxQ0SSVKy37gKkyWr4qRE2rN0EwPT+HvaYO6PmiqT2M0SkXbRl8xaWPbSGi/+qyc+vX0ImHPaaffiz/3UMz3nBbHr7euouUWqbtp/OiojnRsSNEXFvRNwdEWe1uwZpPJ58bB1/87pruOO6VoAA3HPz4/zNcd9lxaPr6i1OE2bNkxt48vF1rFu9se5SOkodYyKbgA9l5qHAK4D3RcRhNdQh7bJNI5v53kW/Yvipkd/bt2HtZq749C+ePsWl7rFq+Xq+9KGf8uFXfItrL7qP4ZUb6i6pY7Q9RDJzaWbeVn2/GrgX2L/ddUgl1q7ayM+vX7LD/Xfd9JifVLvQ8sXD/PibDzG8coSv//0drFvjB4Wtap2dFREHAEcBt2xn33sjohkRzaGhobbXJm1PT28wMGf6DvcP7DmdmBZtrEjtMHuvfnp6W/+uey7op6fPia1b1TawHhF7AP8KfCAzV43en5kXABcANBqNbHN50nYNzJnB8X9xKHf/6PHt7n/T6YcwON9ZWt1mcK8ZfPym47nvJ8s48vX7s+eC/rpL6hi1hEhE9NEKkK9l5hV11CCV+g/HzOc1bz+Amy/77TPaj/rD/Tn6jQuJ8Eik28yY1cvzXzyX5794bt2ldJzIbO+H/Gj9D7sEWJGZHxjLn2k0GtlsNie0LmlXrF6xgeWLh7npq/ezeVPyB396EPscOJvBvfyEqu3q2k8WdYTIa4AfAb8AtlTNH8nM7+zozxgikia5rg2Rtp/Oysyb6eK/UEmaSpxiIEkqZohIkooZIpKkYoaIJKmYISJJKmaISJKKGSKSpGKGiCSpmCEiSSpmiEiSihkikqRihogkqZghIkkqZohIkooZIpKkYoaIJKmYISJJKmaISJKKGSKSpGKGiCSpmCEiSSpmiEiSivXWXYAkjdXKx9exfs1GZgz0sefe/UybFnWXNOUZIpImhRVL13LuomsZeniY2XvN4B++v4h9Dpxdd1lTnqezJE0Kd1y3hKGHhwFY/cQGrr3wvporEhgikiaJPRf0P2N73v6zaqpE2zJEJE0KBx8znz868zD2OXAP/uBPD+K1f3JQ3SUJiMysu4adajQa2Ww26y5DUs3WD29k/fAmpvf3MGtwet3l7IqunQHgwLqkSaN/oI/+gb66y9A2PJ0lSSpmiEiSihkikqRihogkqZghIkkqZohIkooZIpKkYoaIJKmYISJJKmaISJKKTYprZ0XEE
PBQ3XXsovnA8rqLaLOp+J7B9z2VlL7n5Zm5aHcX0wkmRYhMRhHRzMxG3XW001R8z+D7rruOdpqK73lnPJ0lSSpmiEiSihkiE+eCuguowVR8z+D7nkqm4nt+Vo6JSJKKeSQiSSpmiEiSihkiu1FEPDciboyIeyPi7og4q+6a2ikieiLi9oj4dt21tEtEzImIyyPil9W/+yvrrmmiRcQHq5/vuyLi0ojor7umiRARX46IZRFx1zZt8yLi+xHx6+pxbp01dgJDZPfaBHwoMw8FXgG8LyIOq7mmdjoLuLfuItrsc8A1mfki4Ai6/P1HxP7A+4FGZh4O9ADvqLeqCXMxMHqB4DnA9Zl5MHB9tT2lGSK7UWYuzczbqu9X0/qFsn+9VbVHRCwETgAuqruWdomIQeC1wJcAMnMkM1fWWlR79AIzI6IXmAUsqbmeCZGZPwRWjGo+Gbik+v4S4M3trKkTGSITJCIOAI4Cbqm5lHb5LHA2sKXmOtrpIGAI+Ep1Gu+iiBiou6iJlJmPAp8CHgaWAk9l5vfqraqt9snMpdD60AjsXXM9tTNEJkBE7AH8K/CBzFxVdz0TLSJOBJZl5s/qrqXNeoGjgfMz8yhgmC4/vVGNAZwMHAjsBwxExCn1VqU6GSK7WUT00QqQr2XmFXXX0yavBk6KiN8CXwdeFxFfrbektlgMLM7MrUebl9MKlW72euDBzBzKzI3AFcCraq6pnR6PiH0BqsdlNddTO0NkN4qIoHV+/N7M/HTd9bRLZv51Zi7MzANoDbLekJld/+k0Mx8DHomIQ6qm44B7aiypHR4GXhERs6qf9+Po8skEo1wFnFp9fypwZY21dITeugvoMq8G3gn8IiLuqNo+kpnfqa8kTbAzga9FxHTgAeDdNdczoTLzloi4HLiN1mzE2+nSS4FExKXAscD8iFgMnAucB1wWEe+hFahvq6/CzuBlTyRJxTydJUkqZohIkooZIpKkYoaIJKmYISJJKmaISJKKGSKSpGKGiKasiDgmIu6MiP6IGKjukXF43XVJk4mLDTWlRcQ/AP3ATFrXwfpEzSVJk4ohoimtulzJrcB64FWZubnmkqRJxdNZmurmAXsAs2kdkUjaBR6JaEqLiKtoXb7+QGDfzDyj5pKkScWr+GrKioj/DmzKzP8XET3Av0fE6zLzhrprkyYLj0QkScUcE5EkFTNEJEnFDBFJUjFDRJJUzBCRJBUzRCRJxQwRSVKx/w97lOI2YrUNAgAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import seaborn as sns\n", "import pandas as pd\n", "\n", "df1 = pd.DataFrame(\n", " {'x': x,\n", " 'y': y,\n", " 'z': z}\n", ")\n", "\n", "g = sns.relplot(x=\"x\", y=\"y\", size=\"z\", data=df1, color=\"#5710a3\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the code snippet above, I import both `seaborn` and `pandas` to use features provided by these two packages. First, I used Pandas' `DataFrame` constructor to put the data into a dataframe. For this to work, I first had to put `x`, `y`, and `z` into a dictionary (see the curly brackets). Then I used Seaborn's `relplot` function to create a scatter plot. I provided a custom color code \"#5710a3\", which is a hexadecimal code for colors. See here for other options: https://htmlcolorcodes.com/\n", "\n", "For additional examples of what Seaborn can do, look here: https://seaborn.pydata.org/examples/index.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Biopython" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The last thing we will go over today is a package known as Biopython. It provides a wide range of functions and tools for analyzing biological data. See here (https://biopython.org/) and here (http://biopython.org/DIST/docs/tutorial/Tutorial.html).\n", "\n", "This is my go-to package for analyzing biological sequence data. You can browse through the tutorial to see what the package can offer. For example, we can download a genome from NCBI to do some sequence analysis. Download this file (as an example) into the folder where you have this Jupyter lab instance running. \n", "\n", "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/006/665/GCA_000006665.1_ASM666v1/GCA_000006665.1_ASM666v1_genomic.gbff.gz\n", "\n", "Then unzip the file using `gunzip`. 
Now we can use this unzipped `gbff` file, which is in GenBank format. First, let's look at some basic information about this organism's genome." ] }, { "cell_type": "code", "execution_count": 214, "metadata": {}, "outputs": [], "source": [ "from Bio import SeqIO\n", "\n", "records = list(SeqIO.parse(\"GCA_000006665.1_ASM666v1_genomic.gbff\", \"genbank\"))" ] }, { "cell_type": "code", "execution_count": 215, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 2 records\n" ] } ], "source": [ "print(\"Found %i records\" % len(records))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we are \"parsing\" the GenBank file and putting the sequence objects into a list named `records`. It found 2 sequence records. Let's see what they contain." ] }, { "cell_type": "code", "execution_count": 216, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[SeqRecord(seq=Seq('AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAG...TTC', IUPACAmbiguousDNA()), id='AE005174.2', name='AE005174', description='Escherichia coli O157:H7 str. EDL933 genome', dbxrefs=['BioProject:PRJNA259', 'BioSample:SAMN02604092']),\n", " SeqRecord(seq=Seq('ATGGCAGAGCAAAAACGACCGGTACTGACACTGAAGCGGAAAACGGAAGGAGAG...AGC', IUPACAmbiguousDNA()), id='AF074613.1', name='AF074613', description='Escherichia coli O157:H7 str. EDL933 plasmid pO157, complete sequence', dbxrefs=['BioProject:PRJNA259', 'BioSample:SAMN02604092'])]" ] }, "execution_count": 216, "metadata": {}, "output_type": "execute_result" } ], "source": [ "records" ] }, { "cell_type": "code", "execution_count": 217, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Escherichia coli O157:H7 str. EDL933 genome\n", "Escherichia coli O157:H7 str. EDL933 plasmid pO157, complete sequence\n" ] } ], "source": [ "for i in records:\n", " print(i.description)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So this GenBank file contains two sequences from *E. 
coli* O157:H7, a pathogenic form of the bacterium that causes food poisonings every year. Let's take a closer look at the plasmid pO157 to see what genes it contains." ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Z_L7001 fertility inhibition protein (conjugal transfer repressor)\n", "Z_L7002 unknown\n", "Z_L7003 hypothetical protein 15.6 kDa protein in finO 3' region precursor\n", "Z_L7004 putative hemolysin expression modulating protein\n", "Z_L7005 hypothetical protein\n", "Z_L7006 CopB protein (RepA2 protein)\n", "Z_L7007 replication initiation protein\n", "Z_L7008 replication initiation protein\n", "Z_L7009 replication protein\n", "Z_L7010 unknown\n", "Z_L7011 hypothetical protein\n", "Z_L7012 unknown\n", "Z_L7013 transposase (partial)\n", "Z_L7014 hypothetical protein\n", "Z_L7015 transposase\n", "Z_L7016 unknown\n", "Z_L7017 EHEC-catalase/peroxidase\n", "Z_L7018 hypothetical protein\n", "Z_L7019 transposase\n", "Z_L7020 putative exoprotein-precursor\n", "Z_L7021 unknown\n", "Z_L7022 transposase\n", "Z_L7023 hypothetical protein\n", "Z_L7024 regulatory protein\n", "Z_L7025 unknown\n", "Z_L7026 hypothetical protein\n", "Z_L7027 hypothetical protein\n", "Z_L7028 hypothetical protein\n", "Z_L7029 putative acyltransferase\n", "Z_L7030 unknown\n", "Z_L7031 hypothetical protein\n", "Z_L7032 type II secretion protein\n", "Z_L7033 type II secretion protein\n", "Z_L7034 type II secretion protein\n", "Z_L7035 type II secretion protein\n", "Z_L7036 type II secretion protein\n", "Z_L7037 type II secretion protein\n", "Z_L7038 type II secretion protein\n", "Z_L7039 type II secretion protein\n", "Z_L7040 type II secretion protein\n", "Z_L7041 type II secretion protein\n", "Z_L7042 type II secretion protein\n", "Z_L7043 type II secretion protein\n", "Z_L7044 type II secretion protein\n", "Z_L7045 hypothetical protein\n", "Z_L7046 ORFB of IS911\n", "Z_L7047 hemolysin transport 
protein\n", "Z_L7048 hemolysin toxin protein\n", "Z_L7049 hemolysin transport protein\n", "Z_L7050 hemolysin transport protein\n", "Z_L7051 hypothetical protein\n", "Z_L7052 hypothetical protein\n", "Z_L7053 hypothetical serine-threonine protein kinase\n", "Z_L7054 replication protein\n", "Z_L7055 replication protein\n", "Z_L7056 replication protein\n", "Z_L7057 replication protein\n", "Z_L7058 hypothetical protein\n", "Z_L7059 hypothetical protein\n", "Z_L7060 hypothetical protein\n", "Z_L7061 hypothetical protein\n", "Z_L7062 plasmid maintenance protein\n", "Z_L7063 plasmid maintenance protein\n", "Z_L7064 resolvase (protein d)\n", "Z_L7065 transposase\n", "Z_L7066 hypothetical protein\n", "Z_L7067 Rep protein E1\n", "Z_L7068 plasmid partitioning protein\n", "Z_L7069 plasmid partitioning protein\n", "Z_L7070 hypothetical protein\n", "Z_L7071 unknown\n", "Z_L7072 unknown\n", "Z_L7073 unknown\n", "Z_L7074 hypothetical protein\n", "Z_L7075 unknown\n", "Z_L7076 unknown\n", "Z_L7077 unknown\n", "Z_L7078 unknown\n", "Z_L7079 unknown\n", "Z_L7080 unknown\n", "Z_L7081 unknown\n", "Z_L7082 unknown\n", "Z_L7083 unknown\n", "Z_L7084 single strand binding protein\n", "Z_L7085 hypothetical protein\n", "Z_L7086 hypothetical protein\n", "Z_L7087 putative regulator of SOS induction\n", "Z_L7088 hypothetical protein\n", "Z_L7089 hypothetical protein\n", "Z_L7090 stable plasmid inheritance protein\n", "Z_L7091 putative nickase\n", "Z_L7092 hypothetical protein\n", "Z_L7093 putative transposase\n", "Z_L7094 transposase\n", "Z_L7095 putative cytotoxin\n", "Z_L7096 putative transposase\n", "Z_L7097 hypothetical protein\n", "Z_L7098 DNA helicase\n", "Z_L7099 acetylase for F pilin\n", "Z_L7100 hypothetical protein 31.7 kDa protein in traX-finO intergenic region\n" ] } ], "source": [ "plasmid = records[1]\n", "for f in plasmid.features:\n", "    if f.type == \"CDS\":\n", "        print(f.qualifiers['locus_tag'][0], f.qualifiers['product'][0])" ] }, { "cell_type": "markdown", "metadata": {}, 
"source": [ "Here, I have listed all the coding sequences within this plasmid and printed their annotations. As you can see, it contains several virulence factors such as \"hemolysin toxin protein\", \"type II secretion protein\", and \"putative cytotoxin\". There are also many hypothetical and unknown proteins. This is quite typical of microbial genomes: roughly 30% of the genes in a microbial genome tend to have no known function. \n", "\n", "So this plasmid is one of the reasons this *E. coli* strain causes food poisoning in humans and animals. In non-pathogenic *E. coli* strains, you will typically not find this plasmid. If you want to list the coordinates of these genes, you can type something like this:" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Z_L7001 [0:561](+)\n", "Z_L7002 [697:949](+)\n", "Z_L7003 [1150:1612](+)\n", "Z_L7004 [1657:1867](+)\n", "Z_L7005 [1904:2243](+)\n", "Z_L7006 [2482:2737](+)\n", "Z_L7007 [2972:3047](+)\n", "Z_L7008 [3039:3897](+)\n", "Z_L7009 [4258:4453](+)\n", "Z_L7010 [4809:5094](+)\n", "Z_L7011 [5093:5369](+)\n", "Z_L7012 [5439:5727](-)\n", "Z_L7013 [5754:5976](-)\n", "Z_L7014 [6028:6763](-)\n", "Z_L7015 [6546:7056](-)\n", "Z_L7016 [7040:7280](+)\n", "Z_L7017 [7372:9583](+)\n", "Z_L7018 [9626:10016](+)\n", "Z_L7019 [10173:11061](+)\n", "Z_L7020 [11241:15144](+)\n", "Z_L7021 [15306:15414](-)\n", "Z_L7022 [15608:15881](-)\n", "Z_L7023 [16147:16297](-)\n", "Z_L7024 [16586:16721](+)\n", "Z_L7025 [16881:17055](+)\n", "Z_L7026 [17322:18144](+)\n", "Z_L7027 [18146:19250](+)\n", "Z_L7028 [19339:21061](+)\n", "Z_L7029 [21101:22133](+)\n", "Z_L7030 [22420:22594](+)\n", "Z_L7031 [23015:25712](+)\n", "Z_L7032 [25798:26674](+)\n", "Z_L7033 [26713:28642](+)\n", "Z_L7034 [28641:30147](+)\n", "Z_L7035 [30148:31372](+)\n", "Z_L7036 [31402:31837](+)\n", "Z_L7037 [31833:32388](+)\n", "Z_L7038 
[32384:32750](+)\n", "Z_L7039 [32746:33346](+)\n", "Z_L7040 [33342:34320](+)\n", "Z_L7041 [34253:35531](+)\n", "Z_L7042 [35517:36030](+)\n", "Z_L7043 [36087:36921](+)\n", "Z_L7044 [37012:37414](+)\n", "Z_L7045 [37556:37898](+)\n", "Z_L7046 [37942:38764](+)\n", "Z_L7047 [39304:39820](+)\n", "Z_L7048 [39821:42818](+)\n", "Z_L7049 [42867:44988](+)\n", "Z_L7050 [44991:46431](+)\n", "Z_L7051 [46497:46692](+)\n", "Z_L7052 [46721:47006](+)\n", "Z_L7053 [47174:47405](+)\n", "Z_L7054 [47909:48266](+)\n", "Z_L7055 [47954:48272](-)\n", "Z_L7056 [48550:49504](-)\n", "Z_L7057 [49265:49595](+)\n", "Z_L7058 [49935:50172](-)\n", "Z_L7059 [50132:50753](-)\n", "Z_L7060 [50749:51433](-)\n", "Z_L7061 [51488:51683](+)\n", "Z_L7062 [51891:52110](+)\n", "Z_L7063 [52111:52417](+)\n", "Z_L7064 [52417:53224](+)\n", "Z_L7065 [53310:54201](-)\n", "Z_L7066 [54197:54524](-)\n", "Z_L7067 [54668:54824](+)\n", "Z_L7068 [55411:56578](+)\n", "Z_L7069 [56577:57549](+)\n", "Z_L7070 [57933:58206](+)\n", "Z_L7071 [58289:58586](-)\n", "Z_L7072 [58848:60573](+)\n", "Z_L7073 [60649:61552](+)\n", "Z_L7074 [61937:62621](+)\n", "Z_L7075 [62736:63291](+)\n", "Z_L7076 [63336:64113](+)\n", "Z_L7077 [64530:64956](+)\n", "Z_L7078 [65002:65425](+)\n", "Z_L7079 [65582:66113](-)\n", "Z_L7080 [66626:66887](-)\n", "Z_L7081 [66890:68252](+)\n", "Z_L7082 [68298:68862](+)\n", "Z_L7083 [68947:69403](-)\n", "Z_L7084 [69665:70157](+)\n", "Z_L7085 [70186:70447](+)\n", "Z_L7086 [70452:72471](+)\n", "Z_L7087 [72522:72960](+)\n", "Z_L7088 [72956:73718](+)\n", "Z_L7089 [73891:74104](+)\n", "Z_L7090 [73949:74108](+)\n", "Z_L7091 [74289:76263](+)\n", "Z_L7092 [76330:76762](+)\n", "Z_L7093 [77321:77681](-)\n", "Z_L7094 [77862:78273](+)\n", "Z_L7095 [78541:88051](+)\n", "Z_L7096 [88142:88415](+)\n", "Z_L7097 [88341:88845](-)\n", "Z_L7098 [88991:90290](+)\n", "Z_L7099 [90309:91056](+)\n", "Z_L7100 [91114:91975](+)\n" ] } ], "source": [ "for f in plasmid.features:\n", "    if f.type == \"CDS\":\n", "        print(f.qualifiers['locus_tag'][0], 
f.location)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This prints the coordinates of each coding sequence in this plasmid, along with its coding strand (either plus or minus). You can even draw a genome diagram using Biopython. Here is an example." ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [], "source": [ "from reportlab.lib import colors\n", "from reportlab.lib.units import cm\n", "from Bio.Graphics import GenomeDiagram\n", "from Bio import SeqIO\n", "\n", "gd_diagram = GenomeDiagram.Diagram(\"E. coli plasmid\")\n", "gd_track_for_features = gd_diagram.new_track(1, name=\"Annotated Features\")\n", "gd_feature_set = gd_track_for_features.new_set()\n", "\n", "for feature in plasmid.features:\n", "    if feature.type != \"gene\":\n", "        # Exclude this feature\n", "        continue\n", "    if len(gd_feature_set) % 2 == 0:\n", "        color = colors.blue\n", "    else:\n", "        color = colors.lightblue\n", "    gd_feature_set.add_feature(feature, color=color, label=True)\n", "\n", "gd_diagram.draw(\n", "    format=\"linear\",\n", "    orientation=\"landscape\",\n", "    pagesize=\"A4\",\n", "    fragments=4,\n", "    start=0,\n", "    end=len(plasmid),\n", ")\n", "\n", "gd_diagram.write(\"plasmid_linear.pdf\", \"PDF\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This will produce a file named \"plasmid_linear.pdf\" that displays the genes found in this plasmid in order. You can also create a circular genome diagram like this:" ] }, { "cell_type": "code", "execution_count": 141, "metadata": {}, "outputs": [], "source": [ "gd_diagram.draw(\n", "    format=\"circular\",\n", "    circular=True,\n", "    pagesize=(20 * cm, 20 * cm),\n", "    start=0,\n", "    end=len(plasmid),\n", "    circle_core=0.7,\n", ")\n", "gd_diagram.write(\"plasmid_circular.pdf\", \"PDF\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These commands write the PDF files to the same folder as your Jupyter lab instance. 
Here's an example of what the circular diagram looks like:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "![Circular Plasmid](plasmid_circular.png)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, I am only showing you very briefly what Biopython is capable of doing. Try some of the examples shown in the tutorials yourself to see what else you can do with it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 1 - Write a simple Python script" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "All the Python examples I have shown so far were run in the `jupyter-lab` environment. What if you want to create a Python script and reuse it for tasks that you need to perform frequently? You can create a standalone Python script for that. A very simple standalone Python script looks like this:\n", "\n", "```Python\n", "#!/usr/bin/env python\n", "\n", "print(\"Hello, World!\")\n", "\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Save it in a plain text file with the extension `.py`. For example, name it `hello.py`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To run it, type: \n", "\n", "```bash\n", "python hello.py\n", "```\n", "\n", "It simply prints `Hello, World!` to the screen when you run it in your terminal. Try it and see if you can create a simple script like this one." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 2 - Write a standalone Python script to parse a GenBank file" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "I've given you an example below. Save it as something like `parse_gb.py`.\n", "\n", "```Python\n", "#!/usr/bin/env python\n", "\n", "import sys\n", "from Bio import SeqIO\n", "\n", "# The first command-line argument is the path to the GenBank file\n", "records = list(SeqIO.parse(sys.argv[1], \"genbank\"))\n", "\n", "# Print the description of each record in the file\n", "for r in records:\n", " print(r.description)\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To run this script with the GenBank file we used as an example, type:\n", "\n", "```bash\n", "python parse_gb.py GCA_000006665.1_ASM666v1_genomic.gbff\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This script simply prints the description of each record found in this GenBank file. Play around with the different features and functions we've gone over to see if you can make a script that you can reuse over and over again." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What to do when you are lost?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A lot of the time, you can just Google for solutions to a particular problem you are having. But I would highly recommend a website known as \"Stack Overflow\". See here: https://stackoverflow.com/\n", "\n", "There, you can search for solutions to all kinds of coding-related questions or problems. There is a good chance that someone has already posted the same question you have and that someone else has already provided solution(s) to the problem." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.6" } }, "nbformat": 4, "nbformat_minor": 4 }