Data Pipelines¶

Covers how to collect, clean, and structure the data for your model in an automated way, using Python.

Resources¶

Creating a Data Loading Module for Capital IQ Debt Details¶

Notes¶

For the DCF model we have explored using the market value of individual debt instruments to estimate the market value of debt for the whole company
For Project 3, we looked at using Capital IQ as the source of those debt details
Here I examine how to automate the loading and cleaning of the debt details from Capital IQ, so they are ready to be worked with in the model
I also look at taking the resulting code and making a Python module, so it can be reused in future models
To clean up the data, I use a combination of Pandas methods, string methods, and regular expressions (regex)

Resources¶

Transcript¶

00:03: hey everyone this is nick dear burtis
00:05: teaching you financial modeling
00:07: today we're going to be looking at how
00:09: to load in capital iq
00:11: details into python to work with them
00:14: and also how to
00:16: structure our code into a python module
00:19: such that this code can be
00:20: reused anytime that we want to load in
00:22: debt details from capital iq
00:24: regardless of the model this is part of
00:27: our lecture segment
00:28: on data pipelines so
00:32: the data that we're looking at loading
00:33: in here
00:35: comes in the form of this excel
00:37: spreadsheet
00:38: and within that spreadsheet we have
00:41: basically some
00:42: header area which doesn't really matter
00:45: to us
00:45: we don't care about that what we want to
00:47: get out of this
00:49: is this table which in which is in here
00:53: and there's a few things that we're
00:54: going to have to do in order to clean up
00:56: this data so we can
00:58: actually work with it in our model
01:00: before we get to that
01:01: just want to note that there is
01:03: typically a second table
01:05: in here that represents older data so we
01:08: don't want to use
01:08: any of the data from that table only the
01:11: top table
01:13: and then some of the issues that we're
01:15: going to have to deal with is the fact
01:17: that
01:19: these coupon rates actually have the
01:22: percentage sign
01:23: in the string it's not
01:26: just formatted as a percent it actually
01:28: has the percent sign in there
01:31: uh so when we load it into python we
01:33: will also see that percent sign in there
01:38: and the um floating rates
01:42: we have live or plus some rate
01:46: so we're going to want to be able to
01:47: take that and
01:49: plug in a libor rate and get what the
01:51: resulting rate is
01:54: and then we also have these coupon rate
01:56: ranges
01:58: where for the purposes of our model
02:00: we'll just take the midpoint of that
02:02: range
02:03: and use that as the coupon for the bond
02:08: then one other thing is that in order to
02:13: do value calculations with these bonds
02:15: we're going to want to get
02:17: uh that into a number of years or
02:20: maturity
02:21: and so we're going to be converting
02:22: these dates into a number of years
02:24: from today um and then
02:27: some of the maturities only give a year
02:30: and so we're going to assume those are
02:32: at the end of the year
02:33: and handle those that way
02:37: so that's an overview of all the issues
02:40: that we face
02:41: in loading this in so now let's look at
02:44: some python code
02:46: to make that happen and um
02:50: throughout doing this then we're going
02:52: to try to make sure that it's standard
02:55: and reusable such that even though this
02:58: is for walmart we can then do the same
03:00: thing
03:01: or we can then use the same code to load
03:03: any other company's
03:05: details and have them all cleaned up and
03:06: ready to use for our model
03:11: so coming over to the jupiter notebook
03:15: we've talked about two approaches in the
03:17: course which will be useful
03:19: to load this information in we can load
03:21: it in using pandas
03:23: or we can use excel wings
03:28: i will generally go to pandas for any
03:31: usage that
03:32: doesn't require back and forth between
03:34: python
03:35: and excel wings for the most part unless
03:39: i'm just trying to grab a single value
03:41: out of excel then sometimes i'll still
03:43: use excel wings
03:45: but pandas generally is a lot more
03:46: compatible
03:48: you can run the code on linux or
03:50: somewhere else that excel is not
03:52: installed
03:54: and still get the same result so i
03:56: generally will prefer pandas and that's
03:57: what i'll use here in this video
04:02: so import pandas and then i'm going to
04:07: load uh
04:10: the file into a data frame
04:14: the walmart that details i have it here
04:17: in the same folder
04:19: um and then we'll see what we get
04:22: so we can see right away that this does
04:25: not look good right we got
04:27: a bunch of junk in here this is not at
04:29: all the table that we wanted to get
04:31: we've got all these on names we got a
04:33: bunch of missing values in here
04:35: um basically what we're seeing here
04:38: corresponds to
04:40: this top part where indeed there are no
04:43: column headers and we've just got some
04:44: junk up here we don't care about
04:46: the actual table here starts in row
04:50: 13. so
04:53: one thing i can do to
04:56: cut off this entire top area is there a
05:00: lot of
05:01: options for the read excel command as
05:04: you can see
05:05: um one such of those options is skip
05:07: rows
05:08: so we can tell it we want to skip
05:11: 12 rows while loading it in and just
05:13: start from the 13th row which is where
05:15: we know the data starts
05:18: so now we can see that we got the actual
05:22: names of the columns in that table that
05:24: looks good and it looks like we got the
05:26: values of that table as well
05:28: um now what we can also see is we have
05:33: you know these last two rows are junk so
05:35: let's say um
05:37: you know 127 rows counting the zero
05:41: and then let's hop over and look at how
05:44: many rows
05:46: are in this top table
05:49: and we have 62 rows so clearly we're
05:52: picking up
05:53: this bottom table as well which we don't
05:57: want
05:58: [Music]
06:00: so there's a number of ways that we can
06:02: choose to
06:04: handle that
06:05: [Music]
06:07: i one thing that we can do
06:11: i see
06:15: that well let's look at how the data
06:17: frame looks
06:18: in that middle area so
06:22: of course the with the data frame
06:23: display the middle gets cut out
06:25: so we can use iloke to
06:28: look at some range of rows so we know
06:31: that was
06:32: 63 rows so let's just look at row 60 to
06:36: 70.
06:38: um so now we can clearly see here
06:41: the table ending and then we can see
06:46: a number of um blank rows
06:50: um and then we have the next table
06:53: starting
06:56: um so there's a number of ways
06:59: we can deal with this to be able to say
07:03: let's cut off everything after
07:07: the table the first table
07:10: so ultimately it's really easy to drop
07:13: rows which have all missing values
07:15: so i'm not too worried about these ones
07:17: once i drop the missing values these
07:19: will be gone
07:20: um it's once we're getting to here
07:27: and we want to try and pick something
07:29: that's going to be consistent
07:31: as we load in data for different
07:33: companies or different time periods
07:36: so you know this is the first kind of
07:38: distinctive marker that we see
07:41: for the second table but we can see this
07:43: is very specific to the date
07:45: so perhaps not a good choice to try and
07:47: target
07:49: now we do have this description here in
07:51: the next row
07:53: signifying that we're starting the next
07:54: table
07:56: so that seems like an appropriate thing
07:58: to target
07:59: um because we can assume that that is
08:02: going to be the same
08:03: for other time periods and other
08:05: companies as well
08:07: um so we can try to target
08:10: that row
08:13: so that row is going to be
08:19: what is going to be where the
08:24: description
08:26: is equal to description
08:30: right um so by that
08:34: we're now able to target this second
08:36: header row
08:38: um and that way we know we've gotten
08:40: into the next table
08:43: but we don't want to cut off just that
08:45: we want to start cutting off
08:46: from the row before um
08:51: so if we
08:54: take the index from that we get
08:57: row 67
09:00: [Music]
09:02: we pull the first value out of that we
09:04: get actually the number 67
09:07: and then if we subtract 1 now we get row
09:10: 66
09:12: which is the first row that we don't
09:14: want to keep
09:17: so there then we can
09:19: [Music]
09:21: say that that's our
09:24: fiscal year row index
09:28: because now we're talking about row 66
09:30: here
09:33: um and now we can
09:35: [Music]
09:37: do we can get all of the data frame up
09:40: until
09:41: that fiscal year row index
09:44: so now we'll have those blanks on the
09:46: end but we do not have that fistful
09:48: year row and we do not have the second
09:50: table
09:51: at all
09:55: so that's a good step we've removed that
09:57: second table
09:59: but now we also want to remove these
10:01: blank rows thankfully pandas has
10:04: a convenient method for this drop n a
10:07: um but we want to make sure that we're
10:09: still keeping
10:10: these even if there's no floating rate
10:12: it can still be a valid
10:14: row so we only want to um drop
10:18: when all are missing so we put how
10:20: equals all
10:23: and now we can see that those other rows
10:26: get dropped as well and yet we are
10:27: keeping the
10:28: rows which just have some missing values
10:33: so we can do df equals
10:37: that in order to save it back into the
10:39: result
10:42: um and that way we've got
10:46: now just the first table
10:53: so
10:56: just kind of looking through um one
11:00: other issue that we see in here
11:02: is we have these dashes um
11:05: and pandas this nan represents a missing
11:08: value
11:08: a dash represents that's actually the
11:11: string dash
11:12: and so if we try and do any math with
11:15: these columns which have the dash in
11:17: them
11:18: it is not going to work for example
11:24: so this is the principal do usd column
11:32: so if we try to sum that we're going to
11:34: get an error
11:36: and it's actually going to say you can't
11:38: add a float in a string and that string
11:40: that it's talking about
11:41: is that dash which is in there
11:46: so in order to fix that we can use
11:48: replace
11:49: on the data frame so
11:52: with this same look at the data
11:55: if we do a replace of dash
11:59: with uh we do have to go import numpy as
12:03: mp
12:06: and have to uh
12:09: run that
12:13: um then
12:17: uh the missing value that pandas uses is
12:21: numpy dot nan
12:24: so we're replacing the dash which we're
12:26: seeing here in row
12:27: 16 with a nand let's see how that looks
12:31: now row 16 we have a man in each of
12:34: those spots which previously had dashes
12:38: so let's save that back into the data
12:41: frame
12:46: so df equals df.replace
12:50: um
12:53: let me just load it and then clean it
12:56: again
12:58: so now um that looks good we have the
13:00: right number of rows
13:02: and now let's try to do this sum again
13:05: now the sum actually
13:06: works because we replace the dash with
13:08: the missing
13:10: um so i think
13:14: this represents a nice concrete step
13:18: in the process um so let's go ahead and
13:21: wrap that up
13:22: into a function so
13:26: um load details
13:30: we can give that a file path
13:34: um and it's going to
13:37: use read excel and do all these other
13:40: things
13:40: which we just did
13:43: and ultimately return the data frame
13:46: instead of
13:47: that specific file we're going to use
13:49: the file path
13:51: um so now we can do df equals load
13:54: that details on that file path
13:57: and we should have the same thing which
13:59: we do
14:02: great so now we have to start tackling
14:06: some of those other cleanup pieces that
14:08: i mentioned
14:10: so
14:15: let's next tackle um
14:20: the um
14:24: coupons so
14:27: for the coupons we have
14:30: a couple things to deal with let's look
14:34: back at the
14:35: sheet over here so we have
14:39: the fact that we have this percent sign
14:40: on everything and then we also have
14:42: these
14:43: ranges to deal with
14:48: so there's a lot of different ways to
14:50: tackle
14:51: this problem um
14:56: you know you could just uh
14:59: use like string methods like we can
15:03: um we have a string
15:06: 4.0 percent we can do a strip
15:10: and strip off the percent sign and then
15:12: you're left without it
15:14: and then you can take float of
15:18: all that and we get back to the number
15:21: and then divide by 100
15:24: and we get that actual um percentage
15:29: so we can use this kind of approach but
15:30: this is not going to work
15:32: for the range uh where we have the range
15:36: like 4.5 percent to 5.5 percent
15:41: um so we could use similar
15:44: approaches along with split split
15:48: will let us take this string and split
15:51: it into a list
15:53: on a delimitator so here we're putting
15:55: the space dash space as a delimiter and
15:58: now we have these two
16:00: uh coupon rates which we can do that
16:01: kind of approach with
16:03: and then um take the midpoint
16:06: of the range um
16:11: so that works perfectly fine but then
16:13: you have to
16:14: come up with a way to detect which of
16:17: these two styles of strings you got
16:19: um you could do a split and then check
16:22: the length
16:23: uh because if you just split that you're
16:26: gonna
16:26: get a list of length one if you split
16:28: that you have a list of length two so
16:30: you could
16:30: compare off that to make the distinction
16:34: um so that's one possible approach
16:37: um the approach that i'm going to look
16:40: at
16:40: in this video is actually using regular
16:42: expressions
16:44: so regular expressions are kind of a
16:46: mini language within python
16:48: and they also exist in many other
16:51: programming languages
16:52: and actually have nearly the same syntax
16:54: across
16:55: different programming languages and
16:58: they're a way of
17:00: parsing strings they sleela you match
17:04: any kind of pattern and strings and
17:06: extract any parts out of them
17:08: that you want the downside being that
17:11: they do tend to be a little bit
17:12: complicated to write
17:14: and read because they're their own
17:17: little mini language um
17:20: but they're extremely powerful so i do
17:23: recommend
17:23: um taking some time to try and learn
17:25: them
17:28: and i think i will be releasing at some
17:30: point in the future a video
17:32: dedicated to a regular expression so i'm
17:34: not going to spend a whole lot of time
17:35: on them here
17:38: but we want to write a regular
17:39: expression to be able to match
17:41: this pattern of a percent
17:44: and a dash another percent and then
17:47: extract the two numbers
17:48: out of that um
17:52: so in order to write regular expressions
17:54: i like to use
17:55: a little tool to help me in that
17:58: tool so reg x is is the abbreviation for
18:02: regular expressions
18:04: and regexpal is the tool
18:07: that i like to use to help out
18:10: so this tool is nice because you put
18:14: your whatever string you're trying to
18:17: match
18:17: down here and then you start typing your
18:20: regular expression
18:21: up here and
18:25: it will show what you're matching so
18:27: again this is not
18:28: a video to teach you regular expressions
18:30: so the syntax is
18:32: um you know not going to make any sense
18:34: to you but
18:35: you'll see kind of how this works um
18:38: so backslash d will match numbers
18:42: and so now you can see it's highlighting
18:43: the numbers here
18:45: and if i put a dot
18:48: then you can see it's matching the dots
18:50: as well number
18:51: and then dot
18:55: well actually uh dot means any single
18:58: character
18:59: if i escape it with a backslash then it
19:01: means the literal dot
19:03: um and when you highlight um
19:06: over this and regexpal it also explains
19:09: to you what you have here
19:13: and so we then
19:16: [Music]
19:20: can make a character group out of this
19:24: and then say that it repeats one or more
19:26: times
19:27: and now we're matching uh you know
19:31: any number here regardless of how many
19:33: decimal places are there on it
19:36: and then we're going to have a percent
19:38: sign and then space dash space
19:42: and then uh you know basically the same
19:44: thing
19:45: over here yeah entire same thing over
19:48: here
19:50: um
19:52: so what am i missing here
19:59: there we go and then percent and now
20:01: it's matching clearly
20:03: so this is our regular expression
20:08: and then one more thing that we want to
20:10: do is put groups in here
20:12: and groups are going to allow us to
20:13: extract certain parts
20:15: of that string so ultimately you want to
20:17: extract
20:18: this here and this here so let's put
20:21: those into groups with parentheses
20:24: um so that whole thing is grouped not
20:27: the percent sign
20:28: and over here uh same thing not the
20:31: percent sign so now we have our regular
20:35: expression and we can bring that
20:36: back over into python
20:41: so this would be
20:44: the range pattern
20:48: um and i'm going to import re for
20:51: regular expressions
20:53: and it's going to be re.compile is what
20:55: we're going to do here
20:57: again i'm really glossing over the
20:58: regular expression stuff i plan to do a
21:00: whole video on that later
21:04: but you're just seeing quickly how it
21:05: can be used here
21:08: so then we have uh our string
21:12: so let's just find that as a and then
21:16: uh or let's say that's coupon string
21:19: give it a better name
21:21: um so then we can do
21:24: uh range pattern dot match
21:28: on the coupon string and
21:32: when we get a match we're gonna see
21:33: something like that
21:35: um when we don't
21:40: uh then you're going to get none
21:44: as your result um
21:48: so what you can do
21:52: is then say that's the match
21:55: um and then if there's a match um
21:59: then we can extract the result from it
22:02: so then
22:03: um the bottom percent uh
22:06: is going to be match.group uh one group
22:09: zero is the entire thing group one is
22:12: the first spot you put parentheses
22:14: um and then the top percent is going to
22:17: be match.group
22:19: two um and let's just make sure that we
22:23: have those
22:24: uh correct yep we got those
22:28: and we can then take the float
22:32: of each yep so now we have the numbers
22:38: and then the midpoint is going to be
22:40: equal to
22:42: the top percent plus the bottom uh
22:45: percent divided by two
22:49: so the midpoint we should get uh five
22:51: for that which we do
22:55: so now we can make this into a function
22:59: um so we can call this standardized
23:02: fixed
23:03: coupons fix the coupon string as an
23:06: argument
23:10: okay
23:14: and so we're going to return the
23:15: midpoint in that case
23:17: but what about in the case where we
23:18: don't have a match so we basically have
23:21: two alternatives here
23:22: and the other alternative is this where
23:25: um we already looked at
23:30: um for that we just do this kind of
23:32: approach so
23:34: um if it's not a match then we're going
23:37: to
23:38: turn this on the coupon string
23:44: let's just save the divided by 100 for
23:46: later
23:49: so now we have a function where we can
23:53: give it say 2.5 it gets the 2.5 out of
23:57: that or we can give it
23:59: a range and it's going to get the
24:01: midpoint
24:02: of that range
24:06: so now we can
24:08: [Music]
24:09: apply this to the correct column
24:13: so that's the coupon base right column
24:20: so we're going to apply standardizing
24:23: the fixed coupons and we get an
24:32: error um
24:34: [Music]
24:35: expected string or bytes like object
24:39: um so
24:40: [Music]
24:43: i'm just going to quickly look at what
24:45: that error was
24:47: using the built-in debugger uh and we
24:50: can see it's because we're getting a
24:52: nand here there is a missing value so we
24:55: do have to handle that
24:56: in here as well if pd dot
25:00: is null coupon string
25:03: then we want to return the coupon string
25:08: okay now this is working appropriately
25:13: okay um so
25:17: we can
25:18: [Music]
25:20: um then have this
25:24: um as part of our
25:28: cleanup the coupon base rate
25:33: we're going to apply this
25:38: okay so um
25:47: now we want to handle next
25:51: um let's look at the floating rate
25:55: coupons next so for the floating rate
25:59: we have a similar thing here
26:05: where for the floating rate
26:09: most the time here we just have this
26:12: um n a and
26:16: that has been converted to nam over here
26:18: so
26:20: it already has missing values
26:22: appropriately
26:26: but what we're going to need to handle
26:27: is these libor plus
26:31: so we can use a similar regular
26:34: expression
26:34: approach so i'll bring that over to
26:38: regex pal
26:40: um let's now put
26:44: that as our test string and so what
26:46: we're looking for
26:47: is the string libor plus
26:50: but plus is a special character in regex
26:53: so you gotta put a backslash
26:55: um and then we're going to have numbers
26:59: as well as uh decimals and
27:02: uh one or more of those and
27:06: then a percent sign so that's our
27:08: regular expression
27:10: for that um
27:14: so that's going to be the libor pattern
27:23: okay and i'll move this up here as well
27:26: since it's also done
27:29: um and now we
27:33: can use a similar approach so libor
27:35: let's
27:37: [Music]
27:40: give that example coupon string
27:43: live or plus
27:47: and try a similar approach so
27:50: we have the libor pattern that match
27:55: the coupon string we can see that's a
27:57: match
27:58: um so that's the match and then if match
28:03: um and we got to add our group in here
28:06: so here the group is is going to be
28:08: around the number again
28:12: um so if match
28:15: then the percentage
28:19: is going to be the float of match.group
28:22: one
28:25: and then let's just say that libor
28:30: was one percent
28:33: um then we're going to
28:38: return
28:42: the library rate
28:46: times 100 because this is in whole
28:48: percentage points
28:50: plus the percentage and
28:53: not quite returning yet we're not
28:55: writing our function
28:56: um so this would be the
29:00: you know the overall rate and then we
29:03: can look at that rate
29:07: libor just libor
29:11: and we get 1.75 so that appears to be
29:13: working correctly
29:16: so let's uh wrap that into a function
29:20: so standardize
29:24: loading coupons
29:27: takes the coupon spring as well
29:30: as libor um
29:34: and it does this and
29:38: you know just in case there are just
29:41: normal rates in there
29:42: we can throw this in there
29:45: and we'll also throw our catch for
29:48: missing values
29:49: as well
29:55: okay so now
29:58: we should be able to take our floating
30:00: rate column
30:05: and apply standardized floating coupons
30:14: and pass the libor rate
30:19: and it seems it did not work
30:22: appropriately um for some reason
30:27: our string was getting through without a
30:28: match oh because we never returned
30:31: okay so then we want to return the rate
30:33: and
30:34: now it seems to be working uh let's just
30:37: drop an a in there so we can see the two
30:40: yeah they're both live or plus
30:42: uh 0.75 and so that's correct if we put
30:46: a
30:46: two percent rate now we're getting 2.75
30:49: percent so
30:50: everything looks correct
30:54: um so we can now add this into our
30:57: cleaning approach
30:58: so yeah floating rate
31:02: he pulls that
31:06: um so now this is starting to get
31:10: fairly cleaned up we have the coupon and
31:12: the floating rates
31:13: both handled well
31:18: now we want to um
31:21: combine the two um
31:24: so we can create a coupon column
31:29: so that's going to be equal to
31:33: the coupon base rate column
31:36: and then we can say well wherever that's
31:38: missing we can fill it in
31:40: with the floating rate column
31:47: so now we have the coupon right here um
31:51: and it should be taking the values um
32:04: yeah so i got the rose or floating rates
32:06: not missing and we can see
32:07: it is indeed getting filled in for the
32:09: coupon um
32:11: so everything is working appropriately
32:15: um and then in order to do calculations
32:18: with it we want to get it to decimal
32:20: so df coupon equals df coupon
32:25: divided by 100.
32:30: and now that looks good
32:34: so now the coupons are cleaned up and we
32:37: just have the
32:38: maturities to deal with now
32:41: um let me just get all this group
32:44: together
32:47: and uh bring this up with my other
32:50: functions that are now
32:51: working well
32:56: okay um
32:57: [Music]
33:05: so now we've got to deal with the
33:09: maturities
33:11: um and so with the maturities uh we
33:14: talked about
33:15: how um well
33:19: one we got to get it to date time
33:22: instead of just a string is that how it
33:25: is now let me look at the data types
33:28: maturities object so yes it's coming as
33:30: a string
33:32: um and not a date time
33:35: um and then we've got to deal with this
33:38: year
33:38: issue as well um
33:42: so let's see
33:47: um what is the column name
33:52: so maturity
33:58: um
34:04: so
34:07: trying to look for
34:11: where one of these years
34:14: so let's grab 20 that's where we have a
34:18: year
34:18: so what is the type of that
34:22: that's actually coming as an integer
34:26: and then one that's a date
34:29: is actually that is actually coming as a
34:32: date time not as a string
34:36: so
34:40: we want to let these basically
34:44: pass through as they are they're already
34:47: in date formats that's what we want
34:50: um but if we're getting an integer that
34:53: means we have just the year
34:54: and then we've got to convert it um
34:58: so we can use pandas's two date time to
35:01: do that conversion
35:02: for us um so if we just say you know
35:05: december 31st
35:07: of may 20 it will be able to take that
35:10: and turn it into
35:12: a timestamp which will work for the
35:14: daytime purposes
35:17: and so if we you know just have our year
35:23: then we can just format the string to
35:25: drop in the year
35:27: and it will do that conversion for us
35:32: so we can do that along with some type
35:35: checking
35:36: in order to standardize the maturities
35:40: so um you know if
35:44: is instance that's a way we can check
35:46: the type
35:47: of um and here let's make this the more
35:51: general maturity because we're not going
35:53: to
35:54: um know coming in whether it's
35:58: a year or a date time so if it's an
36:01: integer
36:03: then um that's going to be our result
36:08: um and
36:11: otherwise basically it's passing through
36:13: let's just go ahead and make the
36:14: function for that
36:16: standardized maturity dates
36:22: um so if it's an editor then return the
36:26: date time of the end of the year
36:28: there needs to be maturity there
36:31: um if it's not
36:38: uh
36:40: you know i guess it could conceivably
36:42: conceivably get loaded
36:43: as a string so if it's not a string or
36:45: an integer then we're just going to
36:46: return the maturity so if it's a date
36:48: time or if it's missing then just return
36:50: it
36:51: um and then basically we get here if
36:55: it's just a regular string
36:56: and then we can just return it
37:00: um so now
37:06: we give it a year that will give us the
37:08: end of that year
37:09: um if we give it a date time
37:13: then well let's just throw it at the
37:15: data frame and see
37:17: um so df maturity
37:22: dot apply standardized maturity dates
37:26: and we can see this indeed has worked
37:28: for us
37:29: so we didn't get any errors uh we
37:32: remember row 20
37:33: had that year we can see now we're
37:35: getting the end of the year for that
37:38: um and the missing values are passing
37:40: through just fine
37:43: so that becomes the next step here in
37:45: our cleanup process
37:48: we're going to overwrite the maturity
37:49: column with that
37:55: [Music]
38:00: and that has allowed it to become a date
38:01: time type as well now that everything is
38:03: date
38:06: um so then we just have to
38:11: um convert that i'll just you know put
38:14: this over here because we're done with
38:16: it
38:18: then we have to convert the
38:22: um year
38:25: or the the maturity date into a number
38:29: of years from today
38:31: so i'm going to import date time and i
38:33: already covered this
38:34: um in a prior video
38:38: um on on handling the market value of
38:41: debt
38:43: in python so i'm just going to copy the
38:46: function that we had
38:47: from that this uh year until from date
38:51: so it gets today's date and then looks
38:53: the difference between
38:55: the date we're passing to it and today
38:57: and ultimately gives you a number of
38:58: years
39:01: so then we can just apply that
39:05: [Music]
39:07: so we take maturity now that it's
39:10: been all cleaned up and we apply the
39:14: year until from date um
39:18: then we get the number of years so we
39:20: can add that in here
39:22: as df maturity
39:25: years equals that
39:31: and now we should have that coming into
39:33: the data frame as well
39:36: so one last step is to determine
39:41: whether we have enough uh information to
39:44: be able to
39:45: value the bond um
39:49: so there's a few things that will make
39:51: it invalid
39:53: um so
39:57: um if
40:01: um
40:05: uh if the coupon
40:10: is missing um which
40:15: here we can see the rows that have a
40:16: missing coupon then we're not able to do
40:18: anything
40:19: um if the
40:22: maturity years is missing
40:28: that also is invalid we have no way to
40:31: know
40:32: the value if we don't know how long it's
40:33: outstanding
40:35: um and then uh
40:38: if the maturity years is less than
40:41: zero that means the bond is already um
40:44: expired and so those are going to be
40:48: invalid as well so we can
40:52: um use loc in order to
40:54: [Music]
40:55: catch all these three things at once um
41:00: so if we just
41:05: wrap this in parentheses and then put
41:07: this pipe
41:09: operator for or and then we can put
41:11: these other conditions in there as well
41:19: wrap it in parentheses pipe
41:24: and then if the coupon is null
41:31: so now that should catch all three of
41:34: those cases
41:37: um so then we can
41:41: use that to create a column
41:44: which represents whether the row is
41:47: valid
41:48: um so yeah we can see we've got the
41:50: negative maturities we can see we've got
41:52: the missing coupons
41:54: and we can see we've got the missing
41:55: maturities all caught here in one filter
42:01: so what we can do is we can say um
42:05: create a validity column and just set it
42:07: to true
42:08: but wherever any of those things uh
42:11: conditions are meant for it to be
42:12: invalid then we set it to false
42:18: so then we have this validity column
42:22: and you can
42:25: [Music]
42:28: oh i accidentally had it set
42:31: the entire row of false we want to then
42:33: put a comma
42:35: um and then i've just said only the
42:38: valid column
42:38: let me get a new data frame and then do
42:40: this again and
42:42: yes now it was false the rest of the row
42:45: still stays there so we can do other
42:46: things with it like just take it out as
42:48: principle instead of doing an evaluation
42:52: but now we can differentiate between the
42:54: two so let's just create a function out
42:56: of this
42:58: so we can call this a sine validity
43:01: of bonds
43:04: and this is going to just work in place
43:07: so we don't even have to return anything
43:10: um i'm going to put this up there
43:16: um and now we can just call this on the
43:19: data frame add in
43:22: um and now we have that in there
43:25: as well so this is all working
43:28: everything is all cleaned up now it's
43:30: ready for you to go do evaluation from
43:32: here
43:34: but let's just wrap all these steps into
43:37: one further function
43:39: so we can call this load and
43:43: clean uh statement or
43:46: that details and that's going to take
43:49: the file path it's also going to take
43:51: the libor rate
43:53: and it's going to do all these things
43:58: and this is going to be the file path
44:02: and it's already using a library in
44:03: there and then we're going to return the
44:06: data frame
44:07: at the end so now we can just do load
44:10: and clean.details
44:13: on the file path and give it a library
44:15: rate
44:18: and we'll have the data frame all
44:20: cleaned up and ready to go
44:24: so now we have our code working
44:28: to be able to do all this cleanup
44:32: but it's kind of you know a decent
44:35: number of lines of code here
44:37: it's definitely not ideal to to copy
44:39: paste this into every project and if you
44:41: did
44:43: and you needed to make some updates to
44:44: it
44:46: that could cause some issues because
44:49: then you've got to go
44:50: back and update it and every place that
44:52: you've copy pasted it so really you just
44:53: want to have it once
44:54: and be able to use it everywhere so
44:57: that's the perfect application
44:59: of a python module um so
45:04: we can create a module
45:08: um in order to
45:11: use this very easily in any of our
45:14: models
45:15: but real quick before we do that i do
45:18: want to try
45:18: just loading in different dead details
45:21: to make sure it works with pfizer's
45:23: details as well to see if anything is
45:26: different there
45:28: so let's give that a try
45:32: um and there did not work so what is
45:35: going on here
45:36: you can see you got a key error for the
45:38: floating rate um
45:40: so that suggests maybe that the um
45:43: floating rate column is not there
45:49: but let's see let me open these up
45:55: so we can take a look
45:59: and we can see in the pfizer data that
46:01: the floating rate column is there so why
46:03: are we getting an
46:04: error so let's go back over to python
46:08: and we can use the built-in debugger
46:12: in order to check this out so let's get
46:14: back up to our code
46:16: uh so let's look at the columns of the
46:19: data frame
46:20: loading rate is right there isn't that
46:23: weird
46:24: um so this is definitely
46:27: a gotcha that you may face and loading
46:29: in
46:29: [Music]
46:30: um data from uh
46:34: different sources now let me um
46:37: look at that directly oh it looks a
46:39: little bit different now that i pull it
46:41: out
46:42: um it's floating backslash xaoxa zero
46:46: rate what does that mean so uh
46:50: it's uh you know a little outside of the
46:53: score outside of the scope of the course
46:55: to talk in detail about what's going on
46:57: here but basically
46:58: this is another way that white space can
47:01: be represented
47:02: in a file and so
47:05: what looks to us like a space is being
47:07: represented by this backslash
47:09: xa0 so what we can do here while we're
47:13: loading in the columns
47:14: is just replace any of these xa zeros
47:18: with a space and then we shouldn't have
47:21: to worry about it
47:23: so i'm just going to try adding that in
47:24: here uh df.columns
47:28: equals um column.replace
47:32: uh this oh uh backslash xa0
47:36: with a space or call in the update
47:40: columns
47:42: and hopefully that will allow it to work
47:45: awesome
47:45: this time it loaded it and now
47:48: everything seems to be working properly
47:51: here as well
47:54: and then let's just check we can go back
47:55: to walmart and everything is still
47:58: working there as well
47:59: awesome so this is ready to go um
48:03: so now let's let's take this over and
48:05: put it into a python module
48:07: so that we can reuse this in multiple
48:10: different
48:10: models so
48:14: we can go and create a text file
48:17: and then i'm going to rename this file
48:20: uh you want to pick a name without
48:22: spaces
48:23: so i'm going to call this cap iq debt
48:25: and you want the file extension to be
48:27: py that's what makes it a python module
48:32: and then we can just uh take
48:36: all of this that's why it was good to
48:39: gather it all in one cell
48:42: and put it over here um
48:45: and now save this and so now this is
48:48: just a text file
48:50: or the dot py extension which is in the
48:52: same folder
48:56: and now let me totally
49:00: let me just totally start a new jupiter
49:02: notebook
49:03: um so over here we've never
49:07: defined anything um
49:10: and so we don't have any of those that
49:12: loading functions ready
49:14: um but let's try import app iq dat
49:20: so we do that and we can see it was
49:22: successful
49:23: and now when we do cap iq dot we can see
49:27: that we have
49:28: all these functions that we defined
49:31: in what was previously the jupiter cell
49:33: and we brought over to the python module
49:36: so now we can do load and clean debt
49:37: details um
49:39: so df equals and we're going to do that
49:43: on the pfizer
49:45: um that details
49:50: and say our libor rate let's just say
49:52: it's two percent now
49:54: um that's the live one
49:58: and there we go now we have our data
50:00: frame cleaned up ready to go
50:02: and we don't have all that code in this
50:05: jupyter notebook
50:06: so now um we're able to take all this
50:09: detail
50:10: of all this loading cleaning process for
50:13: the data pipeline
50:15: and just offload that so that the reader
50:17: of our model does not have to care
50:19: about how all this works it happens in
50:22: the background
50:24: through this cap iq depth module and now
50:27: this
50:27: module can be reused in as many uh
50:30: different um notebooks as you want
50:34: and if you go and make some changes like
50:37: we had to do
50:38: to be able to handle the father details
50:41: then that's going to go and apply for
50:43: all your different models
50:46: so maybe you decide okay well i'm going
50:48: to add some functionality which prints
50:50: out some summary statistics and
50:53: totals up the principle to make sure
50:55: that you have the right amount of
50:56: principle
50:57: well you just go and change it in this
50:59: module and then
51:00: all your different models which use this
51:02: module are going to get that
51:04: functionality
51:05: so there's a lot of power in
51:08: restructuring
51:09: your code to use modules
51:13: so that's an overview of how we can
51:16: clean up
51:16: capital iq debt details
51:20: 100 percent automated in python so now
51:23: it's ready to use to value
51:25: all the bonds and calculate the total
51:27: market value of debt
51:29: um and we also covered how to
51:32: take that code and offload it into a
51:35: module such that it can be easily reused
51:37: in multiple different projects so
51:41: thanks for listening and see you next
51:48: time