Introduction to Python - Data Science, Quantitative Finance (2.0) 9798486557187


English Pages [82]


Table of contents:
Preface to the Second Edition
1 Introduction
1.1 Basic of programming language
1.2 What is Python?
1.3 Python Development Environment
1.4 Basic data types in Python
1.5 Looping and Condition
1.6 Python functions
1.7 Further examples
1.7.1 First example: Euler number e
1.7.2 Second example: Euler identity
2 Python Packages
2.1 Mathematics - NumPy
2.1.1 Introduction to Numpy
2.1.2 Other NumPy Examples:
2.2 Data Analysis / Science - Pandas
2.2.1 An example of stock market analysis
2.2.2 Further exercises on the example
2.2.3 Reports generation
2.3 Visual Plot – Matplotlib
2.4 Conclusion
3 Advanced Python Examples
3.1 Mathematical Modelling
3.2 Visual graphic examples
3.3 Model Parameter Analysis
3.4 Object Oriented Programming (OOP)
About the Author


Preface to the Second Edition

Learning Python is not the purpose of this book; the underlying problem solving is. Python dominates among scripting languages thanks to collaborative social engineering: it is open source, which provides free and unlimited support from the developers' world, and it attracts more and more data scientists and engineers who support its growth with industry-specific functionality. This small book begins with an introduction to programming and, with the help of some examples, shows how these fit into the finance professional's day-to-day work. Since most of the examples are tailored to real working life in an investment bank, this book suits beginner as well as intermediate finance professionals. In the third chapter, the learning curve becomes steeper, and the quantitative modelling and parameter analysis examples are introduced. If you are a finance graduate or somebody interested in working in the quantitative finance world, this book also presents an opportunity to boost your skills in advance of applying for investment banking roles. Having said that, this is not an interview preparation book. It is an illustration or an introduction that aims to prepare you for a big leap towards quantitative finance.

Updates from Edition 1.0:
- More data science and machine learning related packages
- Mathematics examples: natural logarithm and Euler identity

Introduction to Python – Data Science, Quantitative Finance (2.0) By Lilan Li

Copyright © 2021 Lilan Li. All rights reserved. Second Edition The information in this book is distributed on an “As is” basis, without warranty. While every precaution has been taken in the preparation of this book, neither the author nor the publisher shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the instructions contained in this book. If any code samples or other technology this work contains or describes is subject to open source licences or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licences and / or rights. ISBN: 9798486557187 First Edition: April 2020

This book is right for you if you:
- Have some exposure to programming in general, through education and/or professional experience, or
- Prefer working examples to reference books, or
- Have an interest in data analysis and would like to discover Python’s capacity, or
- Are curious about financial mathematics modelling

What you need:
- Python 3.x.x https://www.python.org/downloads/
- Anaconda https://www.anaconda.com/ (you are free to use your own Python editor if you are used to one)
- Jupyter Notebook https://jupyter.org/ (if you use your own editor and it doesn’t have Jupyter Notebook integrated, you need to install Jupyter separately)
- Financial data for the Dow Jones Index https://finance.yahoo.com/quote/%5EDJI?p=%5EDJI

Special thanks to: Damien Loison, Christina Stolz, Giles Chapman, Dr. Orest Dubay, Andrew French, Rene Eriksen, Dr. Günter Umlauf and Philippe Aldeguer for their feedback

1 Introduction

What is Python, and how much do you need to know about Python or programming in order to apply for a quant job in finance or a data scientist position? This small book begins with an introduction to programming and some basics about Python. In the second chapter, the most commonly used Python packages (NumPy, Pandas and Matplotlib) are introduced with relevant examples and real case studies. In the third chapter, the learning curve becomes steeper, and some simplified quantitative finance models and data analysis case studies are explored in further detail. The purpose here is to explain the basics of quantitative modelling through practical examples. This book is neither a Python reference nor a shortcut to a job interview in the quantitative finance arena. It is a tip-of-the-iceberg view of some of the practical uses of Python in the financial market and mathematical modelling, from beginner to intermediate level.

1.1 Basic of programming language

When I showed my mum my first programming homework in high school, it was a command-line maths game whose purpose was to guess a number: a real number, not a complex number. It involved some randomly generated seeds with some predefined calculation, and another randomly generated real number became the target answer. I could also set the difficulty level to make it easier for players. I remember well that my mum was looking at the black and white DOS-like screen, hoping a cheerful song would be played. Yes, once I got it right, that sound was played to reward the winner. That was a story from the last century, and I cannot remember the name of the language that I was using. It was on the first computer that I had, and it was probably recycled (I hope so) decades ago.

Stories apart, let's look at the definition of programming: programming is the process of taking an algorithm and encoding it into a notation, a programming language, so that it can be executed by a computer. The important first step is the need to find a solution to a problem or requirement. Programming is often the way we create a representation of our solutions.

Let’s start with a small example: Armstrong numbers. An Armstrong number of three digits is an integer such that the sum of the cubes of its digits is equal to the number itself. For example, 371 is an Armstrong number because 3^3 + 7^3 + 1^3 = 371. How shall we program the algorithm or solution for this? Now we have a requirement and a need to solve it. Second, we must break down the problem (abstract it) before attempting a solution. The smallest Armstrong number is 1, since 1^3 = 1. It is easy for a single-digit integer: we just need to get the input and cube it. If the result is equal to the input integer, it is an Armstrong number; otherwise, it is not.

How about two-digit integers, or three, four or even five digits? And what if we would like to have all the Armstrong numbers below a given upper limit printed out? Now we need an algorithm or a more structured plan. First of all, we read the input integer digit by digit (to do so, we must convert the integer into a string), then we calculate i^3 for each digit i and sum up the results. This involves a for loop:

    l = 0
    s = str(n)
    for i in s:
        l = int(i) * int(i) * int(i) + l

We have converted the multi-digit integer into a string s by writing s = str(n), where n is the input integer. The for loop reads each character of the string and calculates i^3. Here there is a data type conversion from string to integer, written int(i), so that an arithmetic operation can be performed. Since the purpose is to sum up i^3 over all digits, we need to add the previous result in each step of the for loop; otherwise, with l = int(i)*int(i)*int(i) only, we would get just the cube of the last digit. For n = 123, the above code gives 1^3 + 2^3 + 3^3 = 36.

Now we need to wrap it into a function so that it can be reused later from the main function. Let's say the next step is to find all the Armstrong numbers between 1 and an upper limit. You are probably already picturing a kind of for loop that checks each integer within this range: if it fits the Armstrong number criterion, we keep it and print it out; if not, we continue. Consequently, a sort of Armstrong number test function is needed to check whether the criterion is met or not. First, we wrap the algorithm that we have just described into a function arm(n) (Python functions will be introduced in chapter 1.6), handling the case where the input n is a single-digit integer:

    def arm(n):
        l = 0
        if len(str(n)) [...] 2

    >>> for i in range(3):
    ...     print(i)
    ...
    0
    1
    2

While statement:

    >>> j = 4
    >>> while j > 0:
    ...     print(j)
    ...     j = j - 1
    ...
    4
    3
    2
    1

For or while? We use a for loop if we know the maximum number of times that we will need to execute the body. For example, if we are traversing a list of elements, we know that the maximum number of loop iterations we can need is “all the elements in the list”. By contrast, if we are required to repeat some computation until some condition is met, and we cannot calculate in advance when this will happen, we will need a while loop. We call the first case (for loop) definite iteration — we have some definite bounds for what is needed. The latter case (while loop) is called indefinite iteration — we’re not sure how many iterations we’ll need — we cannot even establish an upper bound!
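The distinction can be sketched side by side. This is only an illustration: the variable names and the numbers are made up for the example and do not come from the book.

```python
# Definite iteration: the number of passes is bounded by the list length.
prices = [101.5, 99.0, 102.25]
total = 0.0
for p in prices:
    total += p

# Indefinite iteration: repeat until a condition is met.
# Here: count how many times 1000 must be halved to drop below 1.
value = 1000.0
halvings = 0
while value >= 1:
    value /= 2
    halvings += 1

print(total, halvings)
```

In the second loop the stopping point is decided by the data rather than by a counter fixed in advance, which is the situation the while loop is designed for.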

Condition statement

In order to write useful programs, we almost always need the ability to check conditions and change the behaviour of the program accordingly. A condition gives us this ability. The simplest form is the if statement:

    if boolean expression:
        statements

The boolean expression after the if keyword is called the condition. If it is true, then all the indented statements get executed. What happens should the condition be false: shall we do something else, or test another condition? The other condition is headed by elif and the default statements are headed by else:

    >>> if type(i) == list:
    ...     print("It is list type")
    ... elif type(i) == int:
    ...     print("It is integer type")
    ... else:
    ...     print("It is other type")

1.6 Python functions

A function is a block of code that only runs when it is called. We can pass parameters (also called arguments) when it is called from another function or directly from the Python prompt. The function can perform tasks like printing out, modifying values and doing algebraic operations, and can optionally return data as a result. We could type the code in the Python console:

    def first_function(a):
        print("the input is", a)
        result = a + a*2
        return result

def indicates that it is a function, which can be called elsewhere in the code by providing the right input. It is an example of code reusability compared to the previous command line coding style.

When we execute or call this function, we just need to pass the data/parameter into the function:

    >>> first_function(2)
    the input is 2
    6

There is also the possibility of setting up a default argument. A default argument is an argument that assumes a default value if no value is provided for it in the function call. For the above example, we could increase its sophistication:

    def first_function(a, type='int'):
        # note: the parameter name type shadows Python's built-in type()
        print("the input is", a, "and the type is", type)
        if type == 'int':
            result = a + a*2
        else:
            result = a[0] + a[0]*2
        return result

a[] is a way to access the elements in a list. a[0] is the first element of the list, a[1] is the second element, and so on. If one tries to access an element position outside the list length, Python raises the error “list index out of range”. If we call the function without specifying “type = ”, then it uses the default value set in the function’s definition (in this case, type = 'int'). Be aware that many of Python’s predefined functions have default parameters; when using them, we should be aware of their default values.

    >>> first_function(2)
    the input is 2 and the type is int
    6

Here 6 is the result of a + a*2 = 2 + 2*2 = 6.

    >>> first_function([2,3,4])
    the input is [2, 3, 4] and the type is int
    [2, 3, 4, 2, 3, 4, 2, 3, 4]

In this case, the type is not specified in the function call, hence it takes the default value 'int', and the calculation a + a*2 with a = [2,3,4] becomes [2,3,4] + [2,3,4]*2 = [2, 3, 4, 2, 3, 4, 2, 3, 4]. But if we overwrite the default value of the parameter, the result is different:

    >>> first_function([2,3,4], 'list')
    the input is [2, 3, 4] and the type is list
    6

Here 6 is the result of a[0] + a[0]*2 where a[0] = 2, the first element of the input list [2,3,4].

For the Armstrong number case, the full code is shown below. You should be able to run this code in an IDE like PyCharm or Spyder.

    def arm(num):
        """Calculate the sum of the cubes of the digits of the input number."""
        calnumber = 0
        if len(str(num)) < 2:  # single digit case
            return num*num*num
        else:
            string = str(num)
            for i in string:
                calnumber = int(i)*int(i)*int(i) + calnumber
            return calnumber

    def arm_test(num):
        """Test whether the input number is an Armstrong number."""
        if num == arm(num):
            print(num, "is an Armstrong number! ")
        else:
            print(num, "isn't an Armstrong number")

    def arm_list(limit):
        """Print every Armstrong number from 1 up to (not including) limit."""
        for i in range(1, limit):
            if i == arm(i):
                print(i, "is an Armstrong number!")
            else:
                continue

    while True:
        # keep asking until the input is a valid integer
        while True:
            try:
                NUMBER = int(input("Enter the number:"))
                break
            except ValueError:
                print("Oops, that was not a valid number, try again....")
        print("You have just entered: %d " % NUMBER)
        arm_test(NUMBER)
        LIMIT = int(input("Enter the upper limit:"))
        arm_list(LIMIT)
        VAR = input("Would you like to play again? (Y/N): ")
        if VAR == "N" or VAR == "n":
            break

Depending on your interaction with the programme, the output is something like:

    Enter the number:12
    You have just entered: 12
    12 isn't an Armstrong number
    Enter the upper limit:1000
    1 is an Armstrong number!
    153 is an Armstrong number!
    370 is an Armstrong number!
    371 is an Armstrong number!
    407 is an Armstrong number!
    Would you like to play again? (Y/N): n

If you answer the last question, "Would you like to play again?", with Y, the whole programme runs again because the outer loop keeps going. Be aware that if you choose a very high upper limit in the second input, it takes a while for the Python code to compute and find all the Armstrong numbers.
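For reuse outside an interactive session, the same check can be sketched without input() or print() inside the functions. The names cube_digit_sum and armstrong_numbers are hypothetical helpers introduced here, not part of the book's code, and the sketch keeps the book's cube-based definition.

```python
def cube_digit_sum(num):
    """Sum of the cubes of the decimal digits of num."""
    return sum(int(d) ** 3 for d in str(num))


def armstrong_numbers(limit):
    """All n in [1, limit) that equal the sum of the cubes of their digits."""
    return [n for n in range(1, limit) if n == cube_digit_sum(n)]


found = armstrong_numbers(1000)
print(found)
```

Returning a list instead of printing makes the result easy to test or to pass on to other code.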

1.7 Further examples

Having introduced some Python basics using the Armstrong number example, we now demonstrate more examples that use Python functions and looping, as these are fundamental building blocks of programming.

1.7.1 First example: Euler number e

Leonhard Euler was a Swiss mathematician of the 18th century. He made many contributions to different branches of mathematics. Among his many discoveries and developments, we will have a look at one of the most famous mathematical constants, commonly known as the Euler number e, which is roughly equal to 2.71828 and is the base of the natural logarithm. We try to calculate or estimate the mathematical constant e, which is linked to [...] and trigonometry. In this example, we take our inspiration from compound interest in finance. The formula for estimating e based on the compound interest concept is illustrated in Figure 1‑1:

Figure 1‑1 mathematical constant e - derived from compounding interest rate

A more general exponential function e^x, or exp(x), can be written:

Figure 1‑2 Exponential function exp(x) derived from Figure 1‑1

From the fact that the derivative of e^x equals e^x itself and from the famous Taylor series definition, e^x can be calculated as the sum of an infinite series; setting x = 1 gives e:

Figure 1‑3 Taylor series expansion of e^x when x = 1: e^x = e
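The figures themselves are not reproduced in this text. Assuming they show the standard textbook forms that the captions describe, the three formulas would read:

```latex
\begin{align*}
e     &= \lim_{n \to \infty} \left(1 + \frac{1}{n}\right)^{n} \\
e^{x} &= \lim_{n \to \infty} \left(1 + \frac{x}{n}\right)^{n} \\
e     &= \sum_{n=0}^{\infty} \frac{1}{n!}
       = 1 + \frac{1}{1!} + \frac{1}{2!} + \frac{1}{3!} + \cdots
\end{align*}
```

The first line is the compound interest limit, the second generalizes it to any rate x, and the third is the Taylor series of e^x evaluated at x = 1.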

The factorial n! is computed in a Python function using recursion. Recursion is a method of solving a problem where the solution depends on solutions to smaller instances of the same problem. Such problems can generally also be solved by iteration (looping). Recursion typically uses functions that call themselves from within their own code. The Python function recur_factorial(n) calculates n!:

    # factorial of a number using recursion
    def recur_factorial(n):
        if n <= 1:  # 0! and 1! are both 1
            return 1
        else:
            return n*recur_factorial(n-1)
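With a factorial available, the Taylor sum for e can be sketched in a few lines. This is a minimal illustration rather than the book's own method: the names factorial and estimate_e and the choice of 15 terms are assumptions made for the example.

```python
import math


def factorial(n):
    """Recursive factorial; 0! and 1! are both 1."""
    return 1 if n <= 1 else n * factorial(n - 1)


def estimate_e(terms):
    """Partial Taylor sum 1/0! + 1/1! + ... + 1/(terms-1)!."""
    return sum(1 / factorial(k) for k in range(terms))


approx = estimate_e(15)
print(approx, math.e)
```

Because the terms shrink factorially, a small number of terms already agrees with math.e to many decimal places.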

Now, let’s have a look at how to calculate or estimate the mathematical constant e. From the equation, a loop is needed, and the number of terms n determines the number of iterations. The first method is a very visual but not dynamic method using an if condition and looping.

    j = 1
    e2 = 0
    e = 1 + 1/1  # initialize e
    while j [...]

    >>> import numpy as np

Then we create a multi-dimensional array (matrix) holding the values 0 to 14 with dimension 3x5:

    >>> a = np.arange(15).reshape(3, 5)

To see the created object a, simply type it in the console and press the Enter key:

    >>> a
    array([[ 0,  1,  2,  3,  4],
           [ 5,  6,  7,  8,  9],
           [10, 11, 12, 13, 14]])

If we would like to check the dimensions and the size of the array, we use the shape attribute, which returns a tuple of integers indicating the size of the array in each dimension:

    >>> a.shape
    (3, 5)

You could also use size to get the number of elements inside the array. In the above example, it is simply 3 x 5 = 15. But size doesn’t tell us the dimensions of the array:

    >>> a.size
    15

The attribute ndim returns the number of axes (dimensions) of the array, but it doesn’t show the number of elements along each axis. For example, the array a has two dimensions:

    >>> a.ndim
    2

We could also create a three-dimensional array and apply the same functions. If you compare the following example with the one above, you can see that reshape does the job of manipulating the matrix shape.

    >>> b = np.arange(16).reshape(2, 4, 2)
    >>> b
    array([[[ 0,  1],
            [ 2,  3],
            [ 4,  5],
            [ 6,  7]],

           [[ 8,  9],
            [10, 11],
            [12, 13],
            [14, 15]]])

The array b has three dimensions:

    >>> b.ndim
    3

The three dimensions have 2, 4 and 2 elements respectively:

    >>> b.shape
    (2, 4, 2)
    >>> b.size
    16

In order to access the value at a specific location of the 3D array, we have to specify all 3 dimensions:

    >>> b[0][0][1]
    1
    >>> b[0][1][1]
    3

Or we can take just a subset of the array:

    >>> b[1][2]
    array([12, 13])
    >>> b[1]
    array([[ 8,  9],
           [10, 11],
           [12, 13],
           [14, 15]])
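NumPy also accepts all the indices inside one pair of brackets, which selects the same elements as the chained form used above. A small sketch (the variable names chained, direct and row are chosen for illustration):

```python
import numpy as np

b = np.arange(16).reshape(2, 4, 2)

chained = b[0][1][1]   # index one axis at a time
direct = b[0, 1, 1]    # index all three axes in one step
row = b[1, 2]          # fixing two axes leaves a 1-D array

print(chained, direct, row)
```

The single-bracket form avoids creating intermediate arrays and is the style most NumPy code uses.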

There are multiple ways to create an array:

    >>> a = np.array([2,3,4])
    >>> a
    array([2, 3, 4])

array transforms sequences of sequences into two-dimensional arrays, sequences of sequences of sequences into three-dimensional arrays, and so on:

    >>> b = np.array([(1.5,2,3), (4,5,6)])
    >>> b
    array([[ 1.5,  2. ,  3. ],
           [ 4. ,  5. ,  6. ]])
    >>> b.ndim
    2

The type of the array can also be explicitly specified at creation time:

    >>> c = np.array( [ [1,2], [3,4] ], dtype = float)
    >>> c
    array([[ 1.,  2.],
           [ 3.,  4.]])
    >>> c.dtype
    dtype('float64')

2.1.2 Other NumPy Examples:

Often, the elements of an array are originally unknown, but its size is known. Hence, NumPy offers several functions to create arrays with initial placeholder content. These minimize the necessity of growing arrays, an expensive operation. The function zeros creates an array full of zeros, the function ones creates an array full of ones, and the function empty creates an array whose initial content is arbitrary and depends on the state of the memory. By default, the dtype of the created array is float64.

An all-zero array:

    >>> np.zeros((3,4))
    array([[ 0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.]])

An all-one array:

    >>> a = np.ones( (2,3,4), dtype=np.int16)  # dtype can also be specified
    >>> a
    array([[[ 1, 1, 1, 1],
            [ 1, 1, 1, 1],
            [ 1, 1, 1, 1]],

           [[ 1, 1, 1, 1],
            [ 1, 1, 1, 1],
            [ 1, 1, 1, 1]]], dtype=int16)

We could also easily change the values of the created ndarray:

    >>> a[:] = 3.41
    >>> a
    array([[[3, 3, 3, 3],
            [3, 3, 3, 3],
            [3, 3, 3, 3]],

           [[3, 3, 3, 3],
            [3, 3, 3, 3],
            [3, 3, 3, 3]]], dtype=int16)

You may be wondering why the values are 3 instead of 3.41 as defined in the command. This is because we created the ndarray with its data type specified as dtype = int16. The following code solves the issue by converting the data type of the ndarray using astype:

    >>> a = a.astype('float64')
    >>> a[:] = 3.41
    >>> a
    array([[[ 3.41, 3.41, 3.41, 3.41],
            [ 3.41, 3.41, 3.41, 3.41],
            [ 3.41, 3.41, 3.41, 3.41]],

           [[ 3.41, 3.41, 3.41, 3.41],
            [ 3.41, 3.41, 3.41, 3.41],
            [ 3.41, 3.41, 3.41, 3.41]]])
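The same effect can be reproduced on a smaller array. The sketch below uses illustrative names (ints, floats); it also shows that astype returns a new array and leaves the original untouched.

```python
import numpy as np

ints = np.ones((2, 3), dtype=np.int16)
ints[:] = 3.41                    # stored as 3: an int16 array cannot hold fractions

floats = ints.astype("float64")   # astype returns a converted copy
floats[:] = 3.41                  # now the fractional part is kept

print(ints.dtype, floats.dtype)
```

Checking dtype before assigning is a cheap way to avoid silently losing the decimals.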

To create sequences of numbers, NumPy provides a function analogous to range() that returns an array instead of a list:

    >>> np.arange(10, 30, 5)
    array([10, 15, 20, 25])
    >>> np.arange(0, 2, 0.3)  # it accepts float arguments
    array([ 0. ,  0.3,  0.6,  0.9,  1.2,  1.5,  1.8])

When arange is used with floating-point arguments, it is generally not possible to predict the number of elements obtained, due to the finite floating-point precision. For this reason, it is usually better to use the function linspace, which receives as an argument the number of elements that we want:

    >>> np.linspace(0, 2, 6)  # 6 numbers from 0 to 2
    array([ 0. ,  0.4,  0.8,  1.2,  1.6,  2. ])
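A short sketch of how linspace fixes the element count. The variable names are illustrative; endpoint is an existing linspace parameter that controls whether the stop value is included.

```python
import numpy as np

grid = np.linspace(0, 2, 6)                        # 6 points, both ends included
step = grid[1] - grid[0]

open_grid = np.linspace(0, 2, 5, endpoint=False)   # 5 points, stop value excluded

print(grid, step, open_grid)
```

With linspace the length of the result is known before the call, whatever the rounding behaviour of the step.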

Basic operations: arithmetic operators on arrays apply elementwise. A new array is created and filled with the result.

    >>> a = np.array([20,30,40,50])
    >>> b = np.arange(4)
    >>> b
    array([0, 1, 2, 3])

After the creation of a and b, we try a subtraction:

    >>> c = a - b
    >>> c
    array([20, 29, 38, 47])

Exponentiation: square each element of the b array:

    >>> b**2
    array([0, 1, 4, 9])

Other kinds of mathematical operations are:

    >>> 10*np.sin(a)
    array([ 9.12945251, -9.88031624,  7.4511316 , -2.62374854])

We could also apply a logical test to each element of an array:

    >>> a < 35
    array([ True,  True, False, False])

Unlike in many matrix languages, the product operator * operates elementwise in NumPy:

    >>> A = np.array([[1,1], [0,1]])
    >>> B = np.array( [[2,0],[3,4]])
    >>> A * B  # this is elementwise product
    array([[2, 0],
           [0, 4]])

The matrix product can be performed using the dot function:

    >>> A.dot(B)  # matrix product
    array([[5, 4],
           [3, 4]])
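In Python 3.5 and later, the @ operator is an alternative spelling of the matrix product for NumPy arrays. A minimal sketch using the same A and B:

```python
import numpy as np

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])

elementwise = A * B        # multiplies matching entries
matrix_product = A @ B     # rows of A times columns of B

print(elementwise)
print(matrix_product)
```

Using @ keeps long matrix expressions closer to the way they are written on paper.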

Some classic operations, such as summing up all the elements in the array:

    >>> a = np.array([[ 0.2, 0.3, 0.4],[0.5, 0.6, 0.7]])
    >>> a
    array([[ 0.2,  0.3,  0.4],
           [ 0.5,  0.6,  0.7]])
    >>> a.sum()
    2.7000000000000002

Other classic functions are:

    >>> a.max()
    0.69999999999999996
    >>> a.mean()
    0.45000000000000001
    >>> a.min()
    0.20000000000000001

The product is a relatively complex operation compared to sum or max. What it does is 0.2*0.3*0.4*0.5*0.6*0.7:

    >>> a.prod()
    0.0050399999999999993

By default, these operations apply to the array as though it were a list of numbers, regardless of its shape. However, by specifying the axis parameter you can apply an operation along the specified axis of an array:

    >>> b = np.arange(12).reshape(3,4)
    >>> b
    array([[ 0,  1,  2,  3],
           [ 4,  5,  6,  7],
           [ 8,  9, 10, 11]])

Now we have a 2D ndarray, and we try some operations on each column and row:

    >>> b.sum(axis=0)  # sum of each column
    array([12, 15, 18, 21])
    >>> b.min(axis=1)  # min of each row
    array([0, 4, 8])
    >>> b.cumsum(axis=1)  # cumulative sum along each row
    array([[ 0,  1,  3,  6],
           [ 4,  9, 15, 22],
           [ 8, 17, 27, 38]])
    >>> b.prod(axis = 0)  # product of the elements in each column
    array([  0,  45, 120, 231])

Since the ndarray is 2D, setting axis = 2 raises an error:

    >>> b.prod(axis = 2)
    AxisError: axis 2 is out of bounds for array of dimension 2
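The axis behaviour can be checked programmatically. In this sketch the variable names are illustrative; it relies on the fact that NumPy's AxisError is a subclass of ValueError, so a plain except ValueError catches it.

```python
import numpy as np

b = np.arange(12).reshape(3, 4)

col_sums = b.sum(axis=0)       # collapses the rows: one value per column
row_sums = b.sum(axis=1)       # collapses the columns: one value per row
col_products = b.prod(axis=0)

try:
    b.prod(axis=2)             # a 2-D array only has axes 0 and 1
    axis_error_raised = False
except ValueError:
    axis_error_raised = True

print(col_sums, row_sums, col_products, axis_error_raised)
```

A useful way to remember the convention: the axis you name is the one that disappears from the result's shape.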

NumPy also provides familiar mathematical functions such as sin, cos, exp and sqrt. In NumPy, these are called 'universal functions' (ufunc).

    >>> a = 2.4
    >>> np.sqrt(a)
    1.5491933384829668
    >>> np.exp(a)
    11.023176380641601

Within NumPy, these functions operate elementwise on an array, producing an array as output:

    >>> B = np.arange(3)
    >>> B
    array([0, 1, 2])
    >>> np.exp(B)
    array([ 1.        ,  2.71828183,  7.3890561 ])
    >>> np.sqrt(B)
    array([ 0.        ,  1.        ,  1.41421356])
    >>> C = np.array([2., -1., 4.])
    >>> np.add(B, C)
    array([ 2.,  0.,  6.])

The same happens for multi-dimensional array objects. You will see more NumPy usage in various examples introduced later in this book. One of the main uses of NumPy here is the ndarray class in the Monte Carlo simulation approach.

2.2 Data Analysis / Science - Pandas

Pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. It has nothing to do with the animal panda: the name is derived from the term panel data, an econometrics term for data sets that include observations over multiple periods for the same individuals. The library has the following features:
- DataFrame object for data manipulation with integrated indexing.
- Tools for reading and writing data between in-memory data structures and different file formats.
- Data alignment and integrated handling of missing data.
- Reshaping and pivoting of data sets.
- Label-based slicing and subsetting of large data sets.
- Data structure column insertion and deletion.
- Group by engine allowing split-apply-combine operations on data sets.
- Data set merging and joining.
- Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
- Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging.

2.2.1 An example of stock market analysis

The DataFrame is quite powerful with all the functions it contains. The following example is an application of the Pandas package to a time series of financial data. An input data file needs to be stored locally, for example in C:\Temp, and the algorithm reads the input data and manipulates it for different statistical purposes. Dow Jones Industrial Average historical data can be downloaded from various internet sources, for example Yahoo Finance. We save the CSV file as C:\Temp\DJI.csv. In this example, we look at the historical data ranging from 20/11/2007 to 20/11/2018, and the data looks like:

The first part of the code creates a Pandas DataFrame and reads this csv file into it.

    import pandas as pd
    import numpy as np

    location = r'C:/Temp/DJI.csv'
    df = pd.read_csv(location)  # now we have created a pandas DataFrame
    print(df)

The df is printed as:

[2771 rows x 7 columns]

As we can see from the data format, a pandas DataFrame has an index at the beginning, which will be quite useful for data analysis. Secondly, we will calculate the maximum value of the Adj Close column and get the entire row which has the maximum Adj Close value:

    df['Adj Close'].max()

    26828.390630000002

idxmax() locates the index position of the maximum value, and loc[] takes the index and returns the entire row:

    df.loc[df['Adj Close'].idxmax()]

We could also break the above code into steps, which is easier to understand and gives the same results:

    max_idx = df['Adj Close'].idxmax()
    print(max_idx)

    2736

If we would like to know a bit more about the particular day when the maximum value of Adj Close occurs, we could simply type:

    df.loc[max_idx]

    Date         03/10/2018
    Open            26833.5
    High            26951.8
    Low             26789.1
    Close           26828.4
    Adj Close       26828.4
    Volume        280130000
    Name: 2736, dtype: object

We can tell that the highest Adj Close value occurs on 03/10/2018, together with its Open, High and Low values and trading volume. Now, let’s get the 10 rows with the highest volume using the sort function:

    sortValue = 'Volume'
    sortedDF = df.sort_values(sortValue, ascending=False)  # largest volumes first
    print(sortedDF[0:10])

We notice that although the top 10 trading volume dates are now listed in descending order of volume, the original data frame index (the number at the very beginning) doesn’t change.
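The same steps can be tried without the CSV file on a tiny stand-in table. All figures below are invented for illustration (only the 03/10/2018 row echoes the values quoted above), and nlargest is used as a shorthand for sorting in descending order and taking the first rows.

```python
import pandas as pd

# Hypothetical four-row sample standing in for DJI.csv
df = pd.DataFrame({
    "Date": ["01/10/2018", "02/10/2018", "03/10/2018", "04/10/2018"],
    "Adj Close": [26651.2, 26773.9, 26828.4, 26627.5],
    "Volume": [268000000, 251000000, 280130000, 310000000],
})

max_idx = df["Adj Close"].idxmax()   # index label of the highest Adj Close
best_day = df.loc[max_idx]           # the whole row for that label
top2 = df.nlargest(2, "Volume")      # two highest volumes, original index kept

print(best_day)
print(top2)
```

As in the full data set, the rows in top2 keep their original index labels, so they can still be matched back to the source table.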

2.2.2 Further exercises on the example

Exercise 1

Let’s try a slightly more advanced exercise based on the example given. It is to calculate the daily return using this formula:
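The formula image is not reproduced in this text. Judging from the code used later in this exercise, it is presumably the simple daily return, where P_t denotes the Adj Close value on day t (the symbols r_t and P_t are notation chosen here):

```latex
r_t = \frac{P_t - P_{t-1}}{P_{t-1}}
```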

It involves looping over the values in the column Adj Close. Pandas is built on NumPy arrays. The key to speed with NumPy arrays is to perform your operations on the whole array at once, never row-by-row or item-by-item. Hence we first convert the pandas Series into a NumPy ndarray to avoid iterating over the items:

    adjClose = df['Adj Close']
    adjCloseArray = adjClose.values

The output adjCloseArray looks like: array([13010.13965, 12799.04004, 12980.87988, ..., 25017.43945, 24465.6406

You have probably noticed already that the format of the code has changed as the examples have become more complicated. Indeed, I have switched from command line-based programming, which is typically done in a console, to an IDE environment, PyCharm for example. Now we calculate the return using the formula given above:

    ret = (adjCloseArray[1:] - adjCloseArray[:-1]) / adjCloseArray[:-1]

To show what adjCloseArray[1:] and adjCloseArray[:-1] actually stand for, I have created Figure 2‑1, which hopefully makes them easy to understand visually.

Figure 2‑1 visual representation of return calculation without For Loop

The daily return array (ret) of the Dow Jones Industrial Average is:

    array([-0.01622578,  0.0142073 , -0.01829148, ...,  0.00490133, -0.01557383])
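The two slices are easier to see on a very small array. The prices below are invented for illustration; np.diff is shown only as a cross-check of the slice arithmetic.

```python
import numpy as np

prices = np.array([100.0, 110.0, 99.0, 99.0])

later = prices[1:]      # every element except the first:  110, 99, 99
earlier = prices[:-1]   # every element except the last:   100, 110, 99

returns = (later - earlier) / earlier     # one return per consecutive pair
check = np.diff(prices) / prices[:-1]     # same result via np.diff

print(returns)
```

Both slices have length n-1, so the subtraction pairs each price with the one from the previous day without any explicit loop.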

Alternatively, we can simply iterate through all the items (the loop slows down the algorithm). First of all, we need to define the variable ret as an empty NumPy ndarray of the right size. Note that this is not the same thing as defining an all-zero NumPy ndarray: np.empty leaves the contents uninitialized. It should be used with caution, because if we define a bigger empty array than needed, we end up with extra arbitrary values towards the end of the array, which in turn confuse or cause errors in our calculation results.

    ret = np.empty(adjClose.size-1)  # define the right size for the empty array
    for i in range(adjClose.size-1):
        ret[i] = (adjCloseArray[i+1] - adjCloseArray[i]) / adjCloseArray[i]
    print(ret)

    array([-0.01622578,  0.0142073 , -0.01829148, ...,  0.00490133, -0.01557383])

The output is the same as with the first approach. The first method, which operates on whole arrays at once, is much faster and more elegant, although probably less intuitive for people who are not familiar with array slicing.

Exercise 2

Another exercise we can easily practise is to compute the mean and standard deviation of the daily return:

    mean = ret.mean()
    stdv = ret.std()
    print('the mean of the daily return is', mean)
    print('the standard deviation is', stdv)

    the mean of the daily return is 0.0005977280989583951
    the standard deviation is 0.007498175157038745
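One detail worth knowing when comparing results across tools: NumPy's std() divides by n by default (ddof=0), whereas the std() of a pandas Series divides by n-1 (ddof=1). A small sketch with invented returns:

```python
import numpy as np

ret = np.array([0.01, -0.02, 0.015, 0.005])   # illustrative daily returns

mean = ret.mean()
population_std = ret.std()        # NumPy default: ddof=0, divide by n
sample_std = ret.std(ddof=1)      # sample estimate: divide by n-1

print(mean, population_std, sample_std)
```

For a series of a few thousand daily returns the two values are very close, but on short samples the difference is noticeable.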

Exercise 3

After calculating the return, we need to store the calculated daily returns by writing them back into a new csv file. The first way is the traditional loop approach (slow!), which inserts the items one by one:

    adjClose = df['Adj Close']
    adjCloseArray = adjClose.values
    ret = (adjCloseArray[1:] - adjCloseArray[:-1]) / adjCloseArray[:-1]
    i = 0
    while i < adjClose.size-1:
        df.loc[i, 'return'] = ret[i]
        i = i + 1

loc[] works at the row level; to access a column, simply use df['xxx'], where xxx is the column name in the data frame. The code above is an example of combining row and column access:

    print(df)

[2771 rows x 8 columns]

The second approach is to insert a converted Pandas Series directly into the whole DataFrame without using a loop, which is faster and more intuitive. We convert the return array into a Pandas Series first, then use the insert() function with the relevant parameters:

    # assumes df does not yet contain a 'return' column
    s_ret = pd.Series(ret, dtype='float')
    df.insert(loc=0, column='return', value=s_ret)

The outcome looks as follows:
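Since the printed table is not reproduced here, the effect of insert() can be sketched on a three-row stand-in frame. The prices are invented; the point is that a Series shorter than the frame is aligned on the index, so the last row receives NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame standing in for the Dow Jones data
df = pd.DataFrame({"Adj Close": [100.0, 110.0, 99.0]})

prices = df["Adj Close"].values
ret = (prices[1:] - prices[:-1]) / prices[:-1]   # two returns for three prices

# loc=0 puts the new column first; rows are matched by index label
df.insert(loc=0, column="return", value=pd.Series(ret))

print(df)
```

Calling insert() a second time with the same column name would raise an error, so the column should be added only once.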

We could also change the location of this newly created column by changing loc = 0 to some other values as long as 0