News and comments are major drivers of asset prices, maybe more so than conventional price and economic data. Yet it is impossible for any financial professional to read and analyse the vast and growing flow of written information. This is becoming the domain of natural language processing, a technology that supports the quantitative evaluation of human language. It delivers textual information in a structured form that makes it usable for financial market analysis. A range of useful tools is now available for extracting and analysing financial news and comments. Examples of applications include machine-readable Bloomberg news, news analytics and market sentiment metrics by Refinitiv, and the Federal Reserve communication score by Cuemacro.
The summary below is based on a webinar/presentation and post by Cuemacro founder Saeed Amen and a forthcoming book on alternative data by the same author, with some quotes from Garbade, Michael (2018), “A Simple Introduction to Natural Language Processing”.
Emphasis and cursive text have been added.
What is natural language processing
“Natural Language Processing (NLP) is the technology used to aid computers to understand the human’s natural language… Most NLP techniques rely on machine learning to derive meaning from human languages… Natural Language Processing is the driving force behind…word Processors that check grammatical accuracy of texts…interactive voice response…[and] personal assistant applications such as Siri, Cortana, and Alexa…NLP entails applying algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand…When the text has been provided, the computer will utilize algorithms to extract meaning associated with every sentence and collect the essential data from them.” [Garbade]
“Syntactic analysis and semantic analysis are the main techniques used to complete Natural Language Processing tasks…
- Syntax refers to the arrangement of words in a sentence such that they make grammatical sense. In NLP, syntactic analysis is used to assess how the natural language aligns with the grammatical rules. Computer algorithms are used to apply grammatical rules to a group of words and derive meaning from them…
- Semantics refers to the meaning that is conveyed by a text. Semantic analysis is one of the difficult aspects of Natural Language Processing that has not been fully resolved yet. It involves applying computer algorithms to understand the meaning and interpretation of words and how sentences are structured.” [Garbade]
“Natural language processing converts various texts (unstructured) into an easier-to-use format (structured). In a structured form one can more easily use texts in the investment process.”
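As a rough illustration of the step from unstructured to structured text, the sketch below turns a raw sentence pair into a small record with counts, top terms and ticker mentions. The record fields, the stop-word list and the `$TICKER` cashtag convention are all assumptions made for this example, not part of any tool mentioned in the source.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class StructuredText:
    """A toy structured view of one raw text (field names are illustrative)."""
    n_sentences: int
    n_tokens: int
    top_terms: list
    tickers: list


# Tiny illustrative stop-word list; real libraries ship much longer ones.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "as", "after", "its"}


def structure_text(raw: str, top_n: int = 3) -> StructuredText:
    # Split into sentences on terminal punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", raw.strip()) if s]
    # Lower-case word tokens (letters and apostrophes only).
    tokens = re.findall(r"[a-z']+", raw.lower())
    # Count content words, ignoring stop words.
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    # Assumption: tickers appear as $USDJPY-style cashtags in the text.
    tickers = sorted(set(re.findall(r"\$([A-Z]{2,6})\b", raw)))
    return StructuredText(
        n_sentences=len(sentences),
        n_tokens=len(tokens),
        top_terms=[word for word, _ in counts.most_common(top_n)],
        tickers=tickers,
    )


raw = "The Fed raised rates. The dollar rallied after the Fed statement, lifting $USDJPY."
record = structure_text(raw)
print(record)
```

Once text is in a record like this, it can be filtered, joined with price data, or fed into a model like any other tabular input.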
The relevance of natural language processing for financial markets
“One big driver for market sentiment is news. We might have company-specific news impacting a stock. Alternatively, it could be macro news, such as a Fed statement impacting markets more broadly. There could be news related to politics, which has notably been a significant driver in recent years, as evidenced by the considerable market volatility around the US Presidential Election and Brexit. We can think of many examples of specific news events that impacted markets. If we ignore the news, we could be ignoring a critical factor that can be driving markets.”
“The sheer quantity of news that is generated every day is bigger than ever. It is impossible for one person to read all the headlines from news articles written daily. One solution is to get a computer to read the news articles in their entirety and aggregate that into a sentiment signal… [A] large amount of data available is in text form, on the web, in social media, newswire, and in the form of internal data, such as emails or memos.”
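One simple way to picture "reading" headlines and aggregating them into a sentiment signal is a word-list approach. The sketch below is only that: the word lists and headlines are made up for illustration, and it ignores negation, context and entity relevance, all of which production systems would need to handle.

```python
import re

# Hypothetical mini-lexicon; real systems use large curated or learned ones.
POSITIVE = {"rally", "rallies", "gain", "gains", "beat", "beats", "strong", "upgrade"}
NEGATIVE = {"fall", "falls", "loss", "losses", "miss", "misses", "weak", "downgrade"}


def headline_score(headline: str) -> float:
    """Return (positives - negatives) / matched words, in [-1, 1]; 0.0 if no match."""
    words = re.findall(r"[a-z]+", headline.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    matched = pos + neg
    return (pos - neg) / matched if matched else 0.0


def sentiment_signal(headlines) -> float:
    """Average the headline scores into one aggregate signal."""
    scores = [headline_score(h) for h in headlines]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up headlines, purely for illustration.
headlines = [
    "Euro rallies as strong data beats forecasts",
    "Retailer shares fall on weak sales",
    "Central bank holds rates steady",
    "Chipmaker gains after upgrade",
]
signal = sentiment_signal(headlines)
print(signal)
```

Normalising by the number of matched words keeps long and short headlines comparable, while headlines without any lexicon word contribute a neutral zero.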
Useful tools for natural language processing of text
“Useful tools for extracting text include: BeautifulSoup, which extracts text from webpages, stripping unnecessary tags, selenium, a web browser emulator, scrapy, a web scraping crawler, Twython, a Python wrapper for Twitter’s API to read tweets, search-tweets-python, a Python wrapper for enterprise Twitter, tabula-py, a Python wrapper for Tabula (Java) to extract tables from PDF, PDFMiner.six, which extracts text from PDF, and newspaper, which extracts newspaper articles from the web.”
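To make the idea of stripping tags from a webpage concrete, here is a minimal sketch that uses only Python's standard-library `html.parser`. It is a stand-in for what dedicated scraping libraries do far more robustly. The class name, helper function and sample page are invented for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


# Made-up page, purely for illustration.
page = """
<html><head><title>Markets</title><style>p {color: red;}</style></head>
<body><h1>Fed holds rates</h1>
<p>The committee left the target range <b>unchanged</b> today.</p>
<script>trackPageView();</script></body></html>
"""
article_text = html_to_text(page)
print(article_text)
```

Real pages add navigation menus, adverts and malformed markup, which is why purpose-built extraction tools are usually preferable to a hand-rolled parser.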
“Natural language processing tools for analysis include NLTK, the best known NLP library for Python, spaCy, which supports many NLP tasks like extracting entities from text, textblob, an easy-to-use wrapper for NLTK, gensim, which does topic modelling, Stanford OpenNLP, a natural language library, and BERT, TensorFlow code and pre-trained models.”
Examples of natural language processing for financial markets
“Bloomberg produce several datasets of machine-readable news, which can either be consumed via an API or as an end-of-day flat file. The news has been structured into a format to make it easier to consume, with a record for each article. Alongside the headline of an article and the body text, there is also additional metadata. There are various timestamps, as well as topic and ticker tags, which are consistent with those on the Bloomberg Terminal. The additional metadata makes it easier to filter news into topics that are most relevant to traders.”
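The value of per-article metadata can be illustrated with a small filtering sketch. The record layout below is hypothetical and is not Bloomberg's actual schema. The field names, tag values and sample articles are assumptions chosen to mirror the description above (timestamp, headline, body, topic tags, ticker tags).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NewsRecord:
    # Hypothetical fields; a real vendor schema will differ.
    timestamp: datetime
    headline: str
    body: str
    topics: frozenset
    tickers: frozenset


def filter_news(records, tickers=None, topics=None):
    """Keep records sharing at least one tag with each requested filter."""
    wanted_tickers = set(tickers or ())
    wanted_topics = set(topics or ())
    selected = []
    for rec in records:
        if wanted_tickers and not (rec.tickers & wanted_tickers):
            continue
        if wanted_topics and not (rec.topics & wanted_topics):
            continue
        selected.append(rec)
    # Return in chronological order.
    return sorted(selected, key=lambda r: r.timestamp)


# Made-up records, purely for illustration.
records = [
    NewsRecord(datetime(2019, 3, 20, 18, 0, tzinfo=timezone.utc),
               "Fed leaves rates unchanged", "...",
               frozenset({"CENTRAL_BANKS"}), frozenset({"USD"})),
    NewsRecord(datetime(2019, 3, 20, 9, 30, tzinfo=timezone.utc),
               "Euro area PMI slips", "...",
               frozenset({"ECONOMIC_DATA"}), frozenset({"EUR"})),
    NewsRecord(datetime(2019, 3, 21, 7, 0, tzinfo=timezone.utc),
               "Dollar softens after Fed", "...",
               frozenset({"CENTRAL_BANKS", "FX"}), frozenset({"USD", "EUR"})),
]

eur_news = filter_news(records, tickers={"EUR"})
eur_central_bank_news = filter_news(records, tickers={"EUR"}, topics={"CENTRAL_BANKS"})
print([r.headline for r in eur_news])
```

Filtering on tags before any text analysis keeps the later sentiment step focused on articles that are actually relevant to the asset in question.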
“I examined those news articles specifically tagged with currency tickers. I applied natural language processing to their text to get a numerical sentiment score. These sentiment scores were then aggregated into a daily index for each currency. These were then used as building blocks to create sentiment indices for currency pairs.”
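The aggregation described in this quote can be sketched as follows, under stated assumptions. The source does not say how article scores were averaged or how currency indices were combined into pairs, so the daily mean and the "base minus quote" rule below are illustrative choices. The scores are made up.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def daily_currency_index(scored_articles):
    """Average article scores per currency and day.

    scored_articles: iterable of (day, currency, score) tuples.
    Returns {currency: {day: mean score}}.
    """
    buckets = defaultdict(list)
    for day, currency, score in scored_articles:
        buckets[(currency, day)].append(score)
    index = defaultdict(dict)
    for (currency, day), scores in buckets.items():
        index[currency][day] = mean(scores)
    return dict(index)


def pair_index(index, base, quote):
    """Assumed rule: pair sentiment = base index minus quote index,
    computed only on days where both currencies have a value."""
    common_days = sorted(set(index.get(base, {})) & set(index.get(quote, {})))
    return {d: index[base][d] - index[quote][d] for d in common_days}


# Made-up article scores, purely for illustration.
scored = [
    (date(2019, 3, 20), "EUR", 0.5),
    (date(2019, 3, 20), "EUR", -0.25),
    (date(2019, 3, 20), "USD", -0.5),
    (date(2019, 3, 21), "EUR", 0.25),
    (date(2019, 3, 21), "USD", 0.75),
    (date(2019, 3, 22), "USD", 0.5),
]
index = daily_currency_index(scored)
eurusd = pair_index(index, "EUR", "USD")
print(eurusd)
```

Days with news for only one leg of the pair are dropped here. Carrying the last available value forward would be an alternative design, at the cost of mixing stale and fresh sentiment.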
“Refinitiv have a number of data products based on text data that have already been structured. The products include NewsAnalytics, machine-readable news from Reuters newswire, and MarketPsych indices, which use social media and news data to create data indices to measure market sentiment.”
“Non-farm payrolls estimates can be improved with Twitter data as an additional factor.”
“Natural language processing can be applied to track Federal Reserve communication and to predict 10-year U.S. Treasury yields. The Cuemacro Fed communication score is based on speeches and statements that have been gathered from publicly available web sources.”