Almost every portfolio manager uses some form of quantitative analysis. Most still rely on Excel spreadsheets, but this popular tool constrains the creativity of analysis and struggles to cope with large data sets. Statistical programming in R and Python both facilitates and widens the scope of analysis. In particular, it allows using high-frequency data, alternative data sets, textual information and machine learning. And it greatly enhances the display and presentation of analytical findings.

The post ties in with the subject of information efficiency. View summary page here.
This is an original post, not a research summary.

Why statistical programming is for all

When we think of the fusion between statistics, programming and markets, the first image that pops into our head is that of a quant, tapping away on a keyboard as a machine fires off trades automatically. However, what about the rest of the investing community, who take a more traditional approach to markets and make discretionary trading decisions? It is a bit of a misnomer to think that a “quant” approach is relevant purely for systematic trading. After all, analysis is at the heart of investment decisions.

Often the first port of call for analysis for many in the financial community is Excel. It provides a visual way to interact with data. Yet this strength is also its weakness. As datasets become larger, using Excel alone becomes more difficult. You end up with multiple spreadsheets linked together that need to be opened and closed separately (otherwise you run out of memory). There are only so many rows and columns that we can fill up at any one time before Excel grinds to a juddering halt and we frantically start pressing F9. Excel also tends to force your creativity in analysing data down specific paths. It’s like painting by numbers: the painting is already there, ready to be coloured in, but you don’t have much choice about changing the underlying picture.

Using statistical programming can help alleviate many of the problems associated with Excel. We could use R or Python, the two most popular freely available programming languages suited to data analysis. We can take in larger datasets (with the same amount of memory), do some analysis, and then output a summary. We can either choose to display this via some of the great open source visualisation tools, like plotly and ggplot, or output it to smaller, more manageable Excel spreadsheets for further analysis. Or we can build Excel add-ins, so that Excel is a front end for our computations (there’s an interesting Python library, xlwings, which I’ve used before and which is quite nice for this). Best of all, R and Python are open source, so there is no yearly licence cost.
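As a minimal sketch of that workflow: load a large dataset with pandas, compute a compact summary, and hand only the summary back to Excel. The data here is simulated and the file names are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Simulate a "large" dataset in place of reading a big file,
# e.g. df = pd.read_csv("big_price_history.csv")
rng = np.random.default_rng(42)
dates = pd.date_range("2015-01-01", periods=2000, freq="B")
df = pd.DataFrame(rng.normal(0, 0.01, size=(2000, 3)),
                  index=dates, columns=["EURUSD", "GBPUSD", "USDJPY"])

# Summarise the daily returns into a small table Excel can handle easily
summary = df.agg(["mean", "std", "min", "max"]).T

# Write just the summary, not the raw data, to a spreadsheet
# summary.to_excel("returns_summary.xlsx")
print(summary.round(4))
```

The idea is simply that the heavy lifting happens in pandas, and Excel only ever sees a few rows.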

How statistical programming broadens the horizon

What type of analysis is easier to do using statistical programming, than in Excel?

• A key application is analysis of high frequency data. High frequency data can be very dense: even with 1-minute data for a single ticker, we could end up with over a million points over a few years. With higher frequency data we can, for example, get richer estimates of realised volatility, compared to simply using daily closing data, which is what we’d likely have to do in Excel. We also need high frequency data in order to do TCA (transaction cost analysis), which is becoming more important from a regulatory point of view; that’s another thing we can’t really do in Excel but can do effectively with a bit of coding.
• Furthermore, it isn’t just price data that can be “big”, many alternative datasets can also be very big, such as those with mobile phone location data, satellite photos etc. If we insist on using Excel, we are effectively ignoring many datasets out there, which could be used to better understand the market.
• We can also use more exciting techniques such as machine learning to analyse our data, given the plethora of open source statistical libraries available for this in R and Python, although we need to be careful not to introduce too much data-mining bias if we end up using such techniques.
• If we have unstructured datasets with text, we can also apply natural language processing to convert them into time series data, which is easier to analyse. This is something you can’t really do in Excel, which is primarily designed to analyse numerical data. (Traders read news all day to make trading decisions, which suggests there is a lot of information content in text.)
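To make the realised-volatility point above concrete, here is a hedged illustration using simulated 1-minute prices (no real market data): we get an annualisable volatility estimate for every single day from intraday returns, whereas close-to-close estimation needs a long window of daily observations.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# One month of continuous 1-minute returns for a single (synthetic) ticker
n = 60 * 24 * 30
idx = pd.date_range("2021-01-01", periods=n, freq="min")
ret_1m = pd.Series(rng.normal(0, 0.0002, n), index=idx)
price = 100 * (1 + ret_1m).cumprod()

# Realised volatility: one estimate per day from the intraday returns
rv_daily = ret_1m.pow(2).resample("1D").sum().pow(0.5)

# Close-to-close volatility: a single estimate from the whole month of closes
daily_close = price.resample("1D").last()
close_vol = daily_close.pct_change().std()

print(rv_daily.head())
print(close_vol)
```

Both estimators should land in the same ballpark here, but realised volatility gives a fresh reading every day, which is the kind of analysis that quickly becomes impractical in a spreadsheet.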

Freeing your analysis from Excel also makes it easier to display in many ways, for example through webpages. We can also automate the updating of our own signals and indices more easily when we code, rather than using a mishmash of Excel macros to try to update them. I really wonder how much time is wasted in banks and funds on updating time series on spreadsheets (which can be automated)… time which could be spent on more productive tasks. I’ve certainly spent far too many hours opening, closing and updating spreadsheets in the past. Nowadays, I have set up automated data processing jobs on a Linux box in the cloud to calculate indices and publish them on my website (no more dragging down of cells!).
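A minimal sketch of what such an update job might look like, which a scheduler like cron could run daily; the data source and output path here are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

def update_index(prices: pd.Series, base: float = 100.0) -> pd.Series:
    """Rebase a price series into a simple index starting at `base`."""
    return base * prices / prices.iloc[0]

# Stand-in for fetching the latest prices from a data vendor
rng = np.random.default_rng(1)
dates = pd.date_range("2022-01-03", periods=250, freq="B")
prices = pd.Series(100 * (1 + rng.normal(0, 0.01, 250)).cumprod(),
                   index=dates)

index = update_index(prices)
# index.to_csv("my_index.csv")  # in a real job, publish or save the result
print(index.head())
```

Once a script like this runs unattended on a schedule, the manual open-update-save spreadsheet cycle disappears entirely.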

So even if you’re not a systematic trader, it’s worth venturing into the world of statistical programming. It will make it easier to crunch the numbers and help you make decisions more quickly. If you fancy a go at some financial statistical programming, I’ve built a few open source Python libraries too, which can be downloaded from GitHub:

• finmarketpy for market data analysis and backtesting of trading strategies (number 2 Python trading library on GitHub by number of stars, over 1300!)