TSM - Fraud detection with Titan

The recent boost in online gambling means that there will always be people who try to bypass or completely avoid regular business behaviour and use this to their advantage. I'm talking mainly about impersonation, taking unfair advantage of promotions, syndicates, or just trying to find a leak in the system's business flow.

Working at Betfair taught me that when you develop broad-spectrum applications, you need to put in that extra 10% (or more) of care to secure your application. However, even this often proves not to be enough, and you need to take further steps to mitigate fraud.

One such example is reconciling new accounts and detecting duplicate accounts. This step is where the magic begins. At Betfair we use different mechanisms to achieve this; I will talk about one tool in particular that we recently developed. It's called Spider, and its job is to detect accounts that are linked based on the data used at registration time. It uses a set of match rules that combine factors such as fuzziness (Levenshtein distance), string operations (equality, inclusion, starts/ends with), and other combinations.
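To make the idea of a match rule more concrete, here is a minimal sketch in plain Java. It is not Spider's actual code: the class name MatchRules, the field names and the thresholds are all assumptions made for illustration. It shows a Levenshtein distance and a few string operations wrapped as predicates that can be combined into a rule.

```java
import java.util.Locale;
import java.util.Map;
import java.util.function.BiPredicate;

/** Hypothetical match-rule helpers; names and fields are illustrative only. */
final class MatchRules {
    private MatchRules() {}

    /** Levenshtein edit distance, computed with two rolling rows. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Fuzzy rule: both values are present and differ by at most maxDistance edits. */
    static BiPredicate<Map<String, String>, Map<String, String>> fuzzy(String field, int maxDistance) {
        return (x, y) -> {
            String left = value(x, field);
            String right = value(y, field);
            return !left.isEmpty() && !right.isEmpty() && levenshtein(left, right) <= maxDistance;
        };
    }

    /** Equality rule: both values are present and identical after normalisation. */
    static BiPredicate<Map<String, String>, Map<String, String>> equal(String field) {
        return (x, y) -> {
            String left = value(x, field);
            return !left.isEmpty() && left.equals(value(y, field));
        };
    }

    /** Prefix rule: one value starts with the other. */
    static BiPredicate<Map<String, String>, Map<String, String>> startsWith(String field) {
        return (x, y) -> {
            String left = value(x, field);
            String right = value(y, field);
            return !left.isEmpty() && !right.isEmpty()
                    && (left.startsWith(right) || right.startsWith(left));
        };
    }

    /** Normalises a registration attribute: trimmed, lower-cased, empty if missing. */
    private static String value(Map<String, String> account, String field) {
        return account.getOrDefault(field, "").trim().toLowerCase(Locale.ROOT);
    }
}
```

A combined rule could then be written as MatchRules.fuzzy("name", 2).and(MatchRules.equal("postcode")), which would link "John Smith" and "Jon Smith" if they registered with the same postcode.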

Choosing Titan

Technically, the problem we are trying to solve is to create a graph of all accounts ever created (represented by nodes) and try to draw edges between linked accounts. This boils down to a graph representation and a mechanism to quickly search through all the nodes. The candidates were Neo4J, OrientDB, Dex and Titan.

All have their strengths and weaknesses, but in the end we chose Titan. It's a new graph database; the project was started in 2012 by Aurelius, and it was designed with scalability and performance in mind. It's Java-based (yay), and some features of Titan include:

Easy integration - it's basically a Maven artifact that you include in your project and boom! there you have it (of course the tweaking comes later, when you need specific behaviour)
Free (Apache license)
TinkerPop stack (since Titan is based on the Blueprints API, you can easily plug in the TinkerPop stack to help with things such as fast traversal, the Gremlin shell for graph queries, and Rexster for graph visualization)
Good performance when used with a Cassandra backend (scalability and decoupling)
High availability with no single point of failure and the option to scale horizontally (vs Neo4J for example)
ElasticSearch used to index vertices and edges (and since ES is clustered, we can always scale horizontally by adding more VMs)

To complete the picture we also have to mention the weaknesses of Titan. Some of them are:

New technology (first version in 2012)
Relies on heavy clustering (both Cassandra and ES work in their own cluster, thus giving you a different approach to your application infrastructure)
Not a lot of users, limited support from Aurelius (a semi-active community with answers from the creators/developers, but even when they answer promptly, they need time to make all the changes)

Okay, we might have cheated a bit, since we had already worked with Elasticsearch and we also wanted to try Cassandra, but overall Titan is a strong choice for graph representations and it suits our needs very well.

We found out mid-project that PayPal had just revealed in a press release that they too were using Titan for fraud detection, so unfortunately we couldn't brag about being the first to use Titan for this specific scenario. However, in the gaming industry, we can proudly say that we are the first company that scans its customers against a multitude of rules in order to prevent and detect fraud.

The only issue we faced in the middle of the implementation was realizing that we needed a specific version of Elasticsearch, which forced us to branch the Titan codebase in order to accommodate the version bump.

Spider Implementation

We decided on using Titan together with Cassandra (for persistence) and Elasticsearch (for fast indexing/searching). We modeled the problem by representing accounts as vertices, with the registration data stored as vertex attributes. Because we had a requirement for fuzzy matching, we needed to store those attributes in clear text. This did not make us friends with the security department, as they had strict rules on storing Personally Identifiable Data in NoSQL, but it was a compromise we had to make.
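The vertex-and-attribute model can be sketched without any graph database at all. The snippet below is a plain in-memory stand-in, not Titan's API: the class AccountGraph, its methods and the attribute names are hypothetical, and the pairwise comparison replaces the index search that Elasticsearch would do in the real setup.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.BiPredicate;

/** In-memory stand-in for the account graph; this is NOT Titan's API. */
final class AccountGraph {
    // vertex id -> registration attributes, kept in clear text so rules can compare them
    private final Map<String, Map<String, String>> vertices = new LinkedHashMap<>();
    // vertex id -> ids of linked accounts (undirected edges, stored in both directions)
    private final Map<String, Set<String>> edges = new LinkedHashMap<>();

    /** Adds an account as a vertex carrying its registration data. */
    void addAccount(String id, Map<String, String> attributes) {
        vertices.put(id, attributes);
        edges.putIfAbsent(id, new LinkedHashSet<>());
    }

    /** Compares every pair of accounts and draws an edge when the rule matches. */
    void linkMatching(BiPredicate<Map<String, String>, Map<String, String>> rule) {
        String[] ids = vertices.keySet().toArray(new String[0]);
        for (int i = 0; i < ids.length; i++) {
            for (int j = i + 1; j < ids.length; j++) {
                if (rule.test(vertices.get(ids[i]), vertices.get(ids[j]))) {
                    edges.get(ids[i]).add(ids[j]);
                    edges.get(ids[j]).add(ids[i]);
                }
            }
        }
    }

    /** Accounts directly linked to the given one (empty if the id is unknown). */
    Set<String> linkedTo(String id) {
        return edges.getOrDefault(id, Set.of());
    }

    /** Illustrative rule: both accounts registered with the same value for a field. */
    static BiPredicate<Map<String, String>, Map<String, String>> sameValue(String field) {
        return (a, b) -> a.get(field) != null && a.get(field).equals(b.get(field));
    }
}
```

If account 1 shares an email with account 2, and account 2 shares a phone number with account 3, applying both rules leaves account 2 linked to the other two, which is how chains of linked accounts appear even when the accounts at the ends have nothing in common.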

We used multiple threads to initially populate the graph with data fetched from the main Oracle DB, and then relied on Titan to search Elasticsearch in parallel, based on the matching rules defined by the fraud team, and create the corresponding edges.

[Figure: a simple example of the graph representation; notice the nice chains that form when multiple accounts are matched.]

One of the performance improvements worth mentioning is a lazy search option that relies on runtime edge creation when searching for new links. Had we decided to build the graph completely at ramp-up time, we would have been looking at weeks of data indexing, which was unacceptable from a usability point of view.
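The lazy approach can be illustrated with a small sketch. It assumes a hypothetical matchSearch function standing in for the rule-based index query, and it keeps edges in a plain map rather than in Titan; the point is only that the expensive search runs the first time a vertex is visited, and never for vertices nobody asks about.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Sketch of runtime edge creation; names are illustrative, not Spider's code. */
final class LazyLinks {
    // Hypothetical stand-in for the rule-based search against the index
    private final Function<String, Set<String>> matchSearch;
    // Edges that have already been materialised, keyed by account id
    private final Map<String, Set<String>> edges = new HashMap<>();
    private int searchesRun = 0;

    LazyLinks(Function<String, Set<String>> matchSearch) {
        this.matchSearch = matchSearch;
    }

    /** Returns the linked accounts, running the search only on the first visit. */
    Set<String> linkedTo(String accountId) {
        Set<String> known = edges.get(accountId);
        if (known == null) {
            searchesRun++;
            known = matchSearch.apply(accountId);
            edges.put(accountId, known); // edge creation happens here, at query time
        }
        return known;
    }

    /** Number of searches actually executed, to show how much work was avoided. */
    int searchesRun() {
        return searchesRun;
    }
}
```

The trade-off is that the first query touching a vertex pays the search cost, while repeated queries reuse the stored edges.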

Searching for links basically means starting from a source node and getting all the adjacent nodes (spidering out) until a predefined depth or a maximum number of results is reached, then storing the results in Cassandra and emailing the requestor a zip file containing all the CSV files with the results.
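The "spider out" step is essentially a bounded breadth-first traversal. Here is a minimal sketch over a plain adjacency map; the class name, the parameter names and the choice of breadth-first order are assumptions, and the storage and emailing steps are left out.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Bounded breadth-first "spider out"; a sketch, not the production traversal. */
final class Spider {
    private Spider() {}

    /**
     * Collects accounts reachable from source, level by level, stopping when
     * maxDepth levels have been expanded or maxResults accounts were found.
     */
    static List<String> spiderOut(Map<String, List<String>> adjacency, String source,
                                  int maxDepth, int maxResults) {
        List<String> results = new ArrayList<>();
        if (maxResults <= 0) {
            return results;
        }
        Set<String> visited = new HashSet<>();
        visited.add(source);
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(source);

        for (int depth = 0; depth < maxDepth && !frontier.isEmpty(); depth++) {
            Deque<String> next = new ArrayDeque<>();
            for (String current : frontier) {
                for (String neighbour : adjacency.getOrDefault(current, List.of())) {
                    if (visited.add(neighbour)) { // true only the first time we see it
                        results.add(neighbour);
                        if (results.size() >= maxResults) {
                            return results;
                        }
                        next.add(neighbour);
                    }
                }
            }
            frontier = next;
        }
        return results;
    }
}
```

On a chain A-B-C-D, spidering out from A with a depth of 2 returns B and C; raising the depth but capping the results at 2 returns the same two accounts.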

Conclusion

Overall, Titan is a tool I'd recommend if your business scenario calls for a Java-based graph representation of your model data and you need fast traversal and progressive scalability. It's a fun framework to work with, and the four-month project proved that it's a feasible solution. So my conclusion is: Betfair+Titan=love!

Florin Măguran - Senior Java Developer