AI, Machine Learning for Hybrid Structured-Unstructured Data

I recently completed a course at Stanford University: SCI 52 – Artificial Intelligence: An Introduction to Neural Networks & Deep Learning with RONJON NAG, Ph.D. & SOHILA ZADRAN, Ph.D. It was the most illuminating quarter of learning I’ve experienced in, well, a while. The culmination of the course was a written project, and part of the assignment was to write for a general audience such as you would find on LinkedIn.

If you’re looking for a super high level intro to Artificial Intelligence, read on, friends! 🙂
[And if you’re wondering about my grade for the course…A+. Go me!]

An Election Prediction Machine: AI, Machine Learning for Hybrid Structured-Unstructured Data

In this article my goal is to introduce you to an Artificial Intelligence (AI) and Machine Learning (ML) project that will power an election prediction machine.

AI & ML – what are they?

At its essence Artificial Intelligence (AI) is simply a rules-based system for processing inputs to provide usable outputs. AI is usually embodied in some type of complex machine whose work resembles human intelligence. Examples of AI include Siri, Alexa, Google Now, fingerprint recognition, collision avoidance systems, and more. AI is great for performing specific tasks very well, at least as well as humans. For Siri or Alexa that means speech recognition. Speech is a great example of how AI can be effective because it consists of very specific, predefined rules that can be (relatively) easily analyzed.

Machine Learning (ML) is the mechanism by which AI becomes “intelligent”. ML is the means by which Siri or Alexa perform speech recognition: algorithms parse inputs (i.e. data), learn from them, and then make a determination about something in the world.

Currently, when people talk about AI and ML, they usually mean a neural network that executes the rules-based system, processing inputs into outputs that can be acted upon. Neural networks (of which there are many kinds, such as Recurrent Neural Networks and Convolutional Neural Networks) attempt to mimic the neural structure of the human brain as their method for “thinking” or being “intelligent”.

Data Collection and Labeling

Any Artificial Intelligence is only as good as the data that is input into the Machine Learning algorithms and neural networks that are the heart of the AI. In building my machine (or any other AI) there are three components of data handling to take into consideration: data collection, data storage, and data labeling. For predicting elections we will be collecting data from users in the form of written, long-form essays.

This data will be submitted online by students, community members, and others in much the same way that Turnitin.com accepts submitted papers. One major difference is that there will be a series of tags that authors will be required to apply to their essays as part of the data labeling process.

As an AI and ML enthusiast, the way you should think about this data set is as a hybrid of structured and unstructured data. The dominant proportion of data will consist of unstructured data, i.e. text written by humans. Unstructured simply means that the data is not organized in a pre-defined manner. The structured data (which is highly organized and easily searchable) will come in the form of user-supplied tags that will be pre-determined via radio buttons or drop-down dialogue menus as part of the online data submission/collection process.
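To make the hybrid idea concrete, here is a minimal sketch of what one submission might look like once it is stored. Everything in it is an assumption made for illustration: the `EssaySubmission` class, the tag names, and the allowed values in `ALLOWED_TAGS` are hypothetical and are not part of any existing system.

```python
from dataclasses import dataclass, field


@dataclass
class EssaySubmission:
    """One submission: free text plus the tags its author selected."""
    text: str                                  # unstructured: the essay itself
    tags: dict = field(default_factory=dict)   # structured: fixed-choice fields


# Hypothetical tag vocabulary, standing in for the radio buttons and
# drop-down menus on the submission form.
ALLOWED_TAGS = {
    "topic": {"environment", "healthcare", "economy"},
    "stance": {"for", "against", "neutral"},
    "age_group": {"18-26", "27-45", "46+"},
}


def validate(submission):
    """Return a list of problems; an empty list means the tags are well-formed."""
    problems = []
    for name, allowed in ALLOWED_TAGS.items():
        value = submission.tags.get(name)
        if value is None:
            problems.append(f"missing tag: {name}")
        elif value not in allowed:
            problems.append(f"invalid value for {name}: {value}")
    return problems


essay = EssaySubmission(
    text="We need stronger protections for rivers and forests because ...",
    tags={"topic": "environment", "stance": "for", "age_group": "18-26"},
)
print(validate(essay))  # []
```

The essay text can be anything at all, while each tag can only take one of a handful of known values. That contrast is the whole difference between the unstructured and structured halves of the data set.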

One important reason for the tagging/labeling process is to be able to train the ML model. Training is the process by which you teach the neural network how to recognize the outcomes you are looking for. You use a subset of your data to train the model. You use another subset to test and validate your model’s accuracy. Because the correct labels in that second subset are already known, you can accurately score the effectiveness of your model.
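The train-then-test idea can be sketched in a few lines of Python. This is only an illustration under assumed details: the 80/20 split, the helper names `train_test_split` and `accuracy`, and the placeholder essays are all hypothetical choices, not something prescribed by the project.

```python
import random


def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, labels):
    """Fraction of predictions that match the known labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Placeholder (essay, tag) pairs standing in for real tagged submissions.
essays = [(f"essay {i}", "for" if i % 2 == 0 else "against") for i in range(10)]
train, test = train_test_split(essays)
print(len(train), len(test))  # 8 2
```

The model only ever sees `train` while it is learning. Its predictions on `test` are then compared against the tags the authors supplied, which is what `accuracy` measures.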

By supplying the election prediction machine with a data set of essays that are tagged by topic, demographics, sentiment, etc., the model will begin to “learn” what an argument for or against strong environmental protections (to use one example) looks like. So, once the model is deployed in the wilds of untagged data, it will be able to make statistically likely assumptions with increasing accuracy about a wide range of political topics.
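As a rough illustration of “learning” from tagged essays, the sketch below uses a much simpler technique than the neural networks discussed in this article: a naive Bayes word-count classifier. It is swapped in only because it fits in a few lines. The four example essays and the function names `train` and `predict` are made up for the example.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count how often each word appears under each tag."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def predict(model, text):
    """Pick the tag with the highest naive Bayes log-score (add-one smoothing)."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up tagged essays (heavily shortened) on environmental protections.
tagged = [
    ("protect our rivers and forests", "for"),
    ("clean air matters for our children", "for"),
    ("regulations hurt small business jobs", "against"),
    ("too many regulations cost jobs", "against"),
]
model = train(tagged)
print(predict(model, "regulations cost jobs"))  # against
print(predict(model, "protect our forests"))    # for
```

With only four examples this is a toy, but the principle carries over: the more tagged essays the model sees, the better its guesses about untagged ones.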

How to predict the future: the hypothesis

There are two things I have elided thus far, and they relate to (1) the training of the model and (2) the ability to predict.

First, after the papers have been submitted and after an election event (a primary, a caucus, a general election, etc.) transpires, we will revisit the authors and ask them to tag their submissions with updated data. Essentially they will be asked: how did you / your household vote in the most recent election(s)? We will tag the data set with those labels for training purposes.

Second, the initial phases of this proposal will be scaffolded with a sort of Mechanical Turk approach while the database is still small. There is a not insignificant assumption being made about our primary demographic, namely that they will provide insight into the voting behavior of their immediate households. We know that the 18-26 year old demographic is not a high percentage voting bloc. Their parents, however, are. The assumption is that in revealing their own positions, these authors will also be giving a relatively accurate reflection of the positions of their heads of household.

Model Definition (papers on medical health records)

Assuming all these data collection, labeling, and storage elements are executed flawlessly, what about an actual model? The model is the type of Machine Learning algorithm we will run to make this project “intelligent”. Above I alluded to the fact that the vast majority of the data will be unstructured data. As a core component of the overall machine, this, in itself, is no small matter. But not to worry, because the AI community is a collaborative one! Many of the models being developed and improved upon by some of the largest organizations (e.g. Google) are, for the most part, accessible to folks like me and you.

But to illustrate that the election prediction machine is feasible, there are at least a couple of models that I have come across thus far that we may be able to imitate in early iterations (Ravi, 2017; Jagannatha, 2016a; and Jagannatha, 2016b). The examples come from Machine Learning models deployed against medical health records and clinical texts. You can dive deeper into them via the references below, but what these researchers have found is that Recurrent Neural Networks (a sophisticated type of neural network) and Long Short-Term Memory Neural Networks (i.e. LSTMs, a sophisticated variant of the Recurrent Neural Network) are very promising models for analyzing and making predictions from unstructured natural language data sets.
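For the curious, here is a minimal sketch of what makes a Recurrent Neural Network “recurrent”: it reads a sentence one word at a time and carries a small memory (the hidden state) forward from word to word. This is not the model from the cited papers. The weights are random and untrained, the three-word vocabulary and the hidden size of 4 are arbitrary assumptions, and an LSTM adds further machinery that is left out here.

```python
import math
import random


def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Run a plain (Elman) RNN over a sequence and return the final hidden state.

    inputs: list of input vectors, one per word
    w_xh:   hidden_size x input_size weights (current word -> hidden)
    w_hh:   hidden_size x hidden_size weights (previous hidden -> hidden)
    b_h:    bias vector of length hidden_size
    """
    hidden = [0.0] * len(b_h)
    for x in inputs:
        new_hidden = []
        for i in range(len(b_h)):
            total = b_h[i]
            total += sum(w * xj for w, xj in zip(w_xh[i], x))
            total += sum(w * hj for w, hj in zip(w_hh[i], hidden))
            new_hidden.append(math.tanh(total))
        hidden = new_hidden  # the memory carried to the next word
    return hidden


def one_hot(index, size):
    """Represent a word as a vector that is 1.0 at its index and 0.0 elsewhere."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


# Hypothetical tiny vocabulary and random, untrained weights.
vocab = {"protect": 0, "our": 1, "forests": 2}
hidden_size = 4
rng = random.Random(0)
w_xh = [[rng.uniform(-0.5, 0.5) for _ in vocab] for _ in range(hidden_size)]
w_hh = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
        for _ in range(hidden_size)]
b_h = [0.0] * hidden_size

sentence = ["protect", "our", "forests"]
state = rnn_forward([one_hot(vocab[w], len(vocab)) for w in sentence],
                    w_xh, w_hh, b_h)
print(len(state))  # 4
```

In a real system the final `state` would feed a last layer that outputs a prediction, and training would adjust the weights until those predictions match the tags. The sketch shows only the reading step.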

To summarize, I am proposing building an Artificial Intelligence election prediction machine. The Machine Learning algorithms will consist, first, of a Recurrent Neural Network model that will process the structured and unstructured data provided by a large swath of student and adult authors. My hope is that I’ve been able to explain a bit about the technology that will enable my election prediction machine, and to illuminate various AI and ML processes in a way that demystifies how exactly this will be technologically achievable.

 
