16 September 2020

Graph Analytics

Introduction

The objective of this article is to explain the concept of Graph Analytics and to describe a proof-of-concept, built with the TigerGraph tool, that increases the visibility of an inventory network of surgery kits and optimizes their usage. While describing the proof-of-concept, we outline how to model the data in a graph schema and present the queries developed to answer business needs.

What is it?

Graph Analytics is a term commonly used to describe the analytic tools that determine the strength and direction of relationships between objects in a graph. It comprises a set of analytical techniques for exploring relationships between entities of interest such as organizations, people, and transactions. Graphs contain nodes, edges, and properties, which we use to model the data based on relationships. For connected data, this kind of model has advantages over traditional relational databases, because relationships are modeled directly, which is more natural and intuitive. The building blocks of a graph model are:

  • Vertices/Vertexes – Entities or abstract concepts;
  • Edges – Relationships between vertices;
  • Properties – Part of the internal structure of vertices or edges.
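To make these building blocks concrete, here is a minimal Python sketch of a property graph. It is only an illustration: the class names, the `stored_in` edge type, and the sample kit and warehouse are hypothetical and are not taken from the POC or from TigerGraph.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    """An entity or abstract concept, e.g. a kit or a warehouse."""
    vertex_id: str
    vertex_type: str
    properties: dict = field(default_factory=dict)


@dataclass
class Edge:
    """A relationship between two vertices, with its own properties."""
    source: str
    target: str
    edge_type: str
    properties: dict = field(default_factory=dict)


# Hypothetical data: one kit stored in one warehouse.
vertices = {
    "K1": Vertex("K1", "Kit", {"family": "Orthopedics"}),
    "W1": Vertex("W1", "Warehouse", {"city": "Lisbon"}),
}
edges = [Edge("K1", "W1", "stored_in", {"since": "2020-01-15"})]


def neighbours(vertex_id):
    """Return the ids of vertices reachable over one outgoing edge."""
    return [e.target for e in edges if e.source == vertex_id]


print(neighbours("K1"))  # ['W1']
```

Note that both the vertices and the edge carry their own properties, which is what distinguishes a property graph from a plain adjacency list.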

Why and Why now?

Graph analytics is a market in constant growth. According to a recent graph analytics market report, the market size was ~$600 million in 2019 and is expected to reach $2.5 billion by 2024 (https://www.marketsandmarkets.com/Market-Reports/graph-analytics-market-10738263.html). This growth is driven by the need to answer complex questions across complex data, which is not always practical or even possible at scale using traditional SQL queries on conventional relational databases.

Graph analytics allows companies to:

  • Accelerate data preparation and data science processes;
  • Use the power of relationships and in-depth analysis to provide insights, such as finding structures and revealing patterns in connected data, because graphs are built to operate on relationships;
  • Build a database in a better, faster, and cheaper way.

But can’t I do the same on a traditional relational database?

A relational database is what most people are familiar with. It has a defined structure that can be queried through SQL. Each data entry is stored as a row in a table, and foreign-key constraints relate the tables to each other. Querying relationships across tables often involves slow multi-level joins.

With a graph database, we can answer any question as long as the data exists and there is a path between the entities involved. A graph is designed to traverse indirect relationships, and we can add more connections without jeopardizing performance.
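The idea of following a path between indirectly related entities can be sketched with a breadth-first search in Python. This is a simplified illustration, not how TigerGraph executes queries internally; the adjacency map and all vertex names are hypothetical.

```python
from collections import deque

# Hypothetical undirected graph stored as an adjacency map.
graph = {
    "Warehouse_A": ["Kit_1", "Kit_2"],
    "Kit_1": ["Warehouse_A", "Hospital_X"],
    "Kit_2": ["Warehouse_A"],
    "Hospital_X": ["Kit_1", "Surgeon_Y"],
    "Surgeon_Y": ["Hospital_X"],
}


def shortest_path(start, goal):
    """Breadth-first search; returns the shortest path as a list, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


print(shortest_path("Warehouse_A", "Surgeon_Y"))
# ['Warehouse_A', 'Kit_1', 'Hospital_X', 'Surgeon_Y']
```

In a relational model, the same three-hop question would typically require one join per hop, whereas here each hop is a direct lookup of a vertex's neighbours.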

Besides the performance mentioned above, a graph database has further advantages over a relational database. It is possible to perform real-time updates on big data while supporting queries at the same time. It also provides a flexible online schema that can evolve quickly over time: you can continuously add and drop vertex types, edge types, or their attributes to extend or shrink the data model. For example, a relational database is typically built to answer predefined, direct questions, whereas a graph model can quickly evolve to answer new business needs.

Use cases

Some examples where a graph database delivers higher business value:

Fraud Detection

Graphs have become a powerful tool in the finance industry as a means of detecting fraud. Pattern identification is the first line of defence: when something appears anomalous, it is a cause for concern. For example, a person who stays within the San Francisco Bay Area most of the time suddenly makes a late-night purchase in New York. This raises a flag as potentially fraudulent activity.

Social Media

Graphs are used to analyze relationships in social media. For example, Facebook, one of the most widely used social media platforms, uses a graph network to analyze the friendships, pages, and groups that different people have in common. The Facebook social graph is considered one of the most significant social network datasets in the world. Other examples include LinkedIn and Twitter.

Product Recommendation

Personalized product recommendations can increase conversions, improve sales rates, and provide a better experience for users. Graphs are used to understand the behaviour and preferences of customers. They can keep up with the performance demands while receiving new data in real time to improve the recommendations. Examples include large e-commerce retailers like Amazon, Wish, and Zillow.

 

Graph Analytics POC

BI4ALL has a partnership with TigerGraph, one of the fastest and most scalable graph database analytics platforms. The proof-of-concept was developed using TigerGraph Cloud.

This POC was applied to a business whose main activity is loaning surgical kits to health entities for use in surgeries. A kit is a block of predefined components that will be used in surgeries. Kits are either custom-made or chosen from a catalogue. Among the challenges are the visibility of all the kits spread across different customers and the fact that the kits are constantly moving between health entities. This constant use and movement makes allocating kits difficult. Once components of a kit have been used in a surgery, for example when a hospital has used a kit in an orthopedic surgery, those components cannot be supplied to another hospital or surgery, so the kit no longer has its complete stock. For this reason, a designated person assesses the components used in each kit and orders the missing components from the warehouse to complete the kit, which can then be used in another hospital or surgery.
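The replenishment step described above amounts to comparing what a kit should contain with what it currently contains. A minimal Python sketch, assuming hypothetical component names and quantities (none of them come from the POC data):

```python
from collections import Counter

# Hypothetical kit definition: component name -> required quantity.
kit_definition = Counter({"screw_4mm": 10, "plate_small": 2, "drill_bit": 1})

# Hypothetical contents of the kit after it came back from a surgery.
current_contents = Counter({"screw_4mm": 6, "plate_small": 1, "drill_bit": 1})


def components_to_order(definition, contents):
    """Return the components (and quantities) missing from the kit."""
    # Counter subtraction keeps only positive differences.
    return definition - contents


missing = components_to_order(kit_definition, current_contents)
print(dict(missing))   # {'screw_4mm': 4, 'plate_small': 1}
print(not missing)     # False: the kit is not complete yet
```

An empty result means the kit is complete and can be loaned again.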

The purpose of the proof of concept was to build a system to improve and optimize the management of inventories and the distribution network of surgical kits. In summary, this POC optimizes the logistical management of the kits so that they are available, with the necessary components, to perform a surgery in a given hospital. Using the graph approach, we want to:

  • Have a complete map, with the location of the surgery kits across the different medical facilities.
  • Optimize the usage of different surgery kits.

The objective is to improve customer satisfaction, optimize kit usage, and reduce costs.

For this, we followed the methodology below:

  • Graph schema definition;
  • Data mapping;
  • Data load;
  • GSQL query development;
  • Results analysis.

 

Design Schema

In the schema design step, we add the vertices and edges that represent the entities and their connections. Below is an example of the schema of a medical company that loans surgery kits. The main problem was to understand the location of each kit.

A kit is composed of several components that can cost thousands of dollars to produce. Given these costs and the fact that some components have an expiration date, among other less relevant factors, it is essential to have better visibility of the location and status of each kit in order to maximize resource usage and reduce potential waste and costs. On top of that, there is a potential benefit for customer service and satisfaction.

The picture below shows the schema defined for this problem:

Figure 1 – Data Model

 

On the vertices and edges, we define attributes that can then be used in the queries that answer the business questions.

Figure 2 – Edges and Vertex properties

 

The edges can be of three different types:

  • Undirected – The relationship between any two vertices goes both ways;
  • Directed – The relationship between any two vertices goes only one way;
  • Directed Edge + Reverse Edge – The relationship between two vertices goes both ways through two separate edges: vertex A can go to vertex B via AB, and vertex B can go to vertex A via BA.
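The difference between the three edge types can be illustrated with a small adjacency-list sketch in Python. This is a conceptual model only, not the TigerGraph schema API; the `Graph` class, its method names, and the edge type names (`near`, `ships_to`, `contains`, `part_of`) are hypothetical.

```python
from collections import defaultdict


class Graph:
    """Adjacency-list graph supporting the three edge kinds listed above."""

    def __init__(self):
        # adjacency[edge_type][source] -> set of targets
        self.adjacency = defaultdict(lambda: defaultdict(set))

    def add_undirected(self, edge_type, a, b):
        # One edge type, traversable both ways.
        self.adjacency[edge_type][a].add(b)
        self.adjacency[edge_type][b].add(a)

    def add_directed(self, edge_type, source, target):
        # Traversable only from source to target.
        self.adjacency[edge_type][source].add(target)

    def add_directed_with_reverse(self, edge_type, reverse_type, source, target):
        # Two named edge types: source -> target and target -> source.
        self.adjacency[edge_type][source].add(target)
        self.adjacency[reverse_type][target].add(source)

    def targets(self, edge_type, source):
        """Return the sorted vertices reachable from source over edge_type."""
        return sorted(self.adjacency[edge_type][source])


g = Graph()
g.add_undirected("near", "W1", "W2")
g.add_directed("ships_to", "W1", "H1")
g.add_directed_with_reverse("contains", "part_of", "K1", "C1")

print(g.targets("near", "W2"))       # ['W1']
print(g.targets("ships_to", "H1"))   # []
print(g.targets("part_of", "C1"))    # ['K1']
```

The reverse edge is what lets a query start from a component and find its kit, while the plain directed edge cannot be followed backwards.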

Below is an example:

 

 

Map Data to Graph

Once the graph design is finalized, we need to load the data into the defined schema. For this POC, we use CSV files. We have several files that combine information about the location of the warehouses, the kit details (whether they are booked and which components they contain), and the medical facilities.

 

Figure 6 – Data Mapping

 

In the image above, we have connected each dataset to the respective entity. The next step is mapping the data to the graph: we map the columns of the CSV files to the attributes of the vertices or edges. Below is an example of this.

Figure 7 – Data Mapping – map datasets to the vertices and edges attributes
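In TigerGraph this mapping is done visually, but the underlying idea can be sketched in a few lines of Python. The CSV content, column names, and the `stored_in` edge type below are hypothetical, since the actual POC files are not shown in this article.

```python
import csv
import io

# Hypothetical CSV content standing in for one of the POC files.
kits_csv = io.StringIO(
    "kit_id,kit_family,is_booked,warehouse_id\n"
    "K1,Orthopedics,true,W1\n"
    "K2,Cardiology,false,W1\n"
)

vertices = {}   # vertex id -> attributes
edges = []      # (source id, edge type, target id)

for row in csv.DictReader(kits_csv):
    # Map CSV columns to the attributes of a Kit vertex.
    vertices[row["kit_id"]] = {
        "type": "Kit",
        "family": row["kit_family"],
        "is_booked": row["is_booked"] == "true",
    }
    # Map the warehouse column to a Warehouse vertex and a connecting edge.
    vertices.setdefault(row["warehouse_id"], {"type": "Warehouse"})
    edges.append((row["kit_id"], "stored_in", row["warehouse_id"]))

print(len(vertices), len(edges))  # 3 2
```

Each row produces one vertex per entity it mentions and one edge per relationship, which is the same decision we make when mapping a dataset column to a vertex or edge attribute.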

 

Load Data

Once the mapping is complete, loading the data is relatively easy: just hit the play button on the Load Data tab. Once it is done, we can move to the next step, data visualization.

 

Figure 8 – Data Loading

 

Explore Graph

With the schema design created and the data loaded, we can proceed with the graph data exploration. We can search for a specific vertex using the vertex id, or we can find a random vertex of a particular type. For our example, we are going to pick three vertices of the type Warehouse.

Figure 9 – Data Exploration

 

We also have the option of expanding vertices by clicking the triangle-looking symbol. This option allows us to expand from a vertex to other vertices beyond its immediate connections. In the example below, we expand the three Warehouse vertices mentioned above.

Figure 10 – Data Exploration – Vertices details

 

Writing Queries

On the Write Queries page, we can design and run custom queries with GSQL, TigerGraph’s graph query language, which has similarities with traditional SQL. As a relevant note, there is currently an initiative underway to define a new graph query language, GQL (GQL is different from GSQL but should be more similar to it than other existing languages). This initiative is led by a group of vendors that includes Neo4j and TigerGraph.

Below are some of the queries defined:

Clients with the highest kit volume

This query returns the top k customers with the highest number of kits, for example when we want to know our top 3 clients.

Figure 11 – GSQL Top K customers with the highest number of kits

 

The result of the query:

Figure 12 – Top 3 clients with the highest number of kits
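For readers who are not familiar with GSQL, the logic of the top k query can be approximated in Python as follows. This is not the GSQL query used in the POC, and the customer and kit identifiers are hypothetical.

```python
from collections import Counter

# Hypothetical (customer, kit) pairs: one pair per kit held by a customer.
kit_locations = [
    ("Hospital_A", "K1"), ("Hospital_A", "K2"), ("Hospital_A", "K3"),
    ("Hospital_B", "K4"), ("Hospital_B", "K5"),
    ("Hospital_C", "K6"),
    ("Hospital_D", "K7"), ("Hospital_D", "K8"),
    ("Hospital_D", "K9"), ("Hospital_D", "K10"),
]


def top_k_customers(pairs, k):
    """Count kits per customer and return the k customers with the most kits."""
    counts = Counter(customer for customer, _kit in pairs)
    return counts.most_common(k)


print(top_k_customers(kit_locations, 3))
# [('Hospital_D', 4), ('Hospital_A', 3), ('Hospital_B', 2)]
```

In the graph, the same count is obtained by traversing the edges between customers and kits instead of scanning a list of pairs.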

 

Kit Components Expiration Dates

To maximize resources and reduce waste, it is essential to know the location of the kits that have components about to expire. To accomplish that, we created the query A_ExpiringComponents, which finds the location of the surgery kits that have components expiring in May.

Figure 13 – GSQL – Kits with components that are going to expire

 

The result of the query:

Figure 14 – GSQL – Kits with components that are going to expire
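The filter behind this kind of query can be sketched in Python as shown below. It is only an approximation of what A_ExpiringComponents does: the rows, the locations, and the year 2020 are assumptions made for the example.

```python
from datetime import date

# Hypothetical rows: (kit id, location, component, expiration date).
components = [
    ("K1", "Hospital_A", "screw_4mm", date(2020, 5, 12)),
    ("K1", "Hospital_A", "plate_small", date(2021, 1, 1)),
    ("K2", "Warehouse_1", "drill_bit", date(2020, 5, 30)),
    ("K3", "Hospital_B", "screw_4mm", date(2020, 8, 3)),
]


def kits_expiring_in(rows, year, month):
    """Return {kit id: location} for kits with a component expiring that month."""
    result = {}
    for kit_id, location, _component, expires in rows:
        if expires.year == year and expires.month == month:
            result[kit_id] = location
    return result


print(kits_expiring_in(components, 2020, 5))
# {'K1': 'Hospital_A', 'K2': 'Warehouse_1'}
```

A kit is reported as soon as one of its components matches, even if its other components expire later.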

 

We also made some other relevant analyses:

  • A_KitsLocationByKitFamily – This query gives a map of the location of the kits of a particular kit family;
  • A_Kits_At_Medical_Facilities – This query provides all the kits that are in a specific medical facility;
  • A_SurgeryKitsLocationNeededByDate – This query returns the location of all kits of a given kit family whose return date is before the date on which the kit is needed in the warehouse.
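As an illustration of the last query in the list, the date condition can be sketched in Python. This is a simplified stand-in for A_SurgeryKitsLocationNeededByDate, with hypothetical loans, kit families, and dates.

```python
from datetime import date

# Hypothetical loans: (kit id, kit family, current location, return date).
loans = [
    ("K1", "Orthopedics", "Hospital_A", date(2020, 9, 20)),
    ("K2", "Orthopedics", "Hospital_B", date(2020, 10, 5)),
    ("K3", "Cardiology", "Hospital_A", date(2020, 9, 18)),
]


def kits_back_before(rows, family, needed_on):
    """Return (kit id, location) for kits of a family returned before needed_on."""
    return [
        (kit_id, location)
        for kit_id, kit_family, location, return_date in rows
        if kit_family == family and return_date < needed_on
    ]


print(kits_back_before(loans, "Orthopedics", date(2020, 9, 25)))
# [('K1', 'Hospital_A')]
```

Kit K2 is excluded because it returns after the date on which the kit is needed, and K3 because it belongs to another family.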

 

Conclusion

Graphs offer significant benefits compared to a traditional relational database, especially for mapping and analyzing highly connected data. Graph databases are a flexible, scalable, and powerful tool that enables us to quickly analyze complex relationships and behaviour among connected data, because they are specially optimized to find structures and reveal patterns in related data. In graph databases, we can adapt and evolve our data model much more easily than a traditional relational schema, as adding new entities or connections is a lot more flexible. This kind of technology can help organizations make more informed business decisions, gain competitive advantages and, more importantly, create business value.

   José Oliveira BI4ALL
    Marta Barreto         
    Big Data Team

 
