Unstructured text is a vast resource of data for economists, but its complexity requires statistical methods that can build a bridge between textual and numeric economic data. This project will develop a general-purpose approach for connecting text-based information to traditional economic measurement systems. This will expand the availability of 'real-time' information on economic activity by leveraging unstructured online and administrative data.
This project received funding from the Turing-HSBC-ONS Economic Data Science Awards 2018.
Explaining the science
Unstructured text can be useful for many administrative, legal, bureaucratic, and cultural processes. Textual data's high dimensionality (meaning data with a large number of features, attributes or characteristics) requires the use of statistical methods that can help to build a bridge between textual and numeric economic data.
This project will develop a general-purpose approach for connecting text-based information to traditional economic measurement systems. For example, this approach will allow information such as online job vacancies, patent data, individual business activity descriptions and legal/regulatory information to be linked to official occupational and industry categories.
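To make the idea of linking free text to official categories concrete, the following is a minimal sketch, not the project's actual method, which is not specified here. It assumes a simple TF-IDF bag-of-words representation and cosine similarity as a stand-in technique. The category labels and descriptions in `CATEGORIES`, and the function names `tokenize`, `tfidf_vectors`, `cosine` and `classify`, are hypothetical and are not drawn from any official classification.

```python
import math
import re
from collections import Counter

# Hypothetical category descriptions for illustration only;
# these are NOT taken from any official occupational classification.
CATEGORIES = {
    "software developer": "design write and test computer software and applications",
    "registered nurse": "provide patient care in hospitals and administer medication",
    "accountant": "prepare financial statements audit accounts and file tax returns",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict of term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency, so every term keeps a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    return [
        {t: tf * idf[t] for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(text, categories=CATEGORIES):
    """Return the best-matching category label and the score for every label."""
    labels = list(categories)
    vectors = tfidf_vectors([categories[label] for label in labels] + [text])
    query = vectors[-1]  # the last vector belongs to the input text
    scores = {label: cosine(query, vec) for label, vec in zip(labels, vectors)}
    return max(scores, key=scores.get), scores


label, scores = classify(
    "We are hiring an engineer to write and test software for web applications"
)
print(label)
```

Each distinct word becomes its own dimension in this representation, which is why text data of this kind is high-dimensional. A realistic system would need far richer category descriptions and a more robust model than this word-overlap sketch.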
The tool produced aims to expand the availability of 'real-time' information on economic activity by leveraging unstructured online and administrative data. A comprehensive 'proof of concept' has already been demonstrated using the U.S. Dictionary of Occupational Titles (DOT).
Other text 'targets' for the classifier being produced include international trade treaties and text descriptions of individual business activity.
An important application of this research is the 'tagging' of complex official documents, such as legal statutes, with official occupational and industry information. This tagging will allow researchers to quantify the economic implications of legal and regulatory policies at a new level of depth.
The tool being produced will enhance the measurement capabilities of agencies such as the Office for National Statistics and provide material for real-time, short-run measurement of the economy, which would constitute important inputs for monetary and fiscal policy decisions.
The methodology has promise across a range of potential applications, including parliamentary debates, proposed bills, enacted statutes, judicial rulings, trade agreements, job ads, resumes, and newspaper articles.