
Treat me as 'user' 1. The aim of the query is to get the posts 'posted' by the people I follow and, for each of those posts, check:

  1. Whether it has been liked by me
  2. Whether it has been liked by anyone else that I follow and, if so, choose one of those users at random to return

Sample data:

g.addV('user').property('id',1).as('1').
  addV('user').property('id',2).as('2').
  addV('user').property('id',3).as('3').
  addV('user').property('id',4).as('4').
  addV('post').property('postId','post1').as('p1').
  addV('post').property('postId','post2').as('p2').
  addE('follow').from('1').to('2').
  addE('follow').from('1').to('3').
  addE('follow').from('1').to('4').
  addE('posted').from('2').to('p1').
  addE('posted').from('2').to('p2').
  addE('liked').from('1').to('p2').
  addE('liked').from('3').to('p2').
  addE('liked').from('4').to('p2').iterate()

Query (as answered here: Graph/Gremlin for social media use case):

g.V().has('id',1).as('me').
  out('follow').
  aggregate('followers').
  out('posted').
  group().
    by('postId').
    by(project('likedBySelf','likedByFollowing').
         by(__.in('liked').where(eq('me')).count()).
         by(__.in('liked').where(within('followers')).order().by(shuffle).values('id').fold()))

Output:

[post1:[likedBySelf:0,likedByFollowing:[]],post2:[likedBySelf:1,likedByFollowing:[4,3]]]

This query shuffles the values but returns all of the 'id's; I want to select only the first 'id'. Using .next() instead of .fold() throws java.util.NoSuchElementException. Is it possible to choose one at random without having to evaluate all the traversals first and then shuffle them?

Desired Output:

[post1:[likedBySelf:0,likedByFollowing:[]],post2:[likedBySelf:1,likedByFollowing:[3]]]

Or

[post1:[likedBySelf:0,likedByFollowing:[]],post2:[likedBySelf:1,likedByFollowing:[4]]]

 Answers

34

You're pretty close to your answer:

gremlin> g = TinkerGraph.open().traversal()
==>graphtraversalsource[tinkergraph[vertices:0 edges:0], standard]
gremlin> g.addV('user').property('id',1).as('1').
......1>   addV('user').property('id',2).as('2').
......2>   addV('user').property('id',3).as('3').
......3>   addV('user').property('id',4).as('4').
......4>   addV('post').property('postId','post1').as('p1').
......5>   addV('post').property('postId','post2').as('p2').
......6>   addE('follow').from('1').to('2').
......7>   addE('follow').from('1').to('3').
......8>   addE('follow').from('1').to('4').
......9>   addE('posted').from('2').to('p1').
.....10>   addE('posted').from('2').to('p2').
.....11>   addE('liked').from('1').to('p2').
.....12>   addE('liked').from('3').to('p2').
.....13>   addE('liked').from('4').to('p2').iterate()
gremlin> g.V().has('id',1).as('me').
......1>   out('follow').
......2>   aggregate('followers').
......3>   out('posted').
......4>   group().
......5>     by('postId').
......6>     by(project('likedBySelf','likedByFollowing').
......7>          by(__.in('liked').where(eq('me')).count()).
......8>          by(__.in('liked').where(within('followers')).order().by('id',shuffle).values('id').limit(1).fold()))
==>[post2:[likedBySelf:1,likedByFollowing:[3]],post1:[likedBySelf:0,likedByFollowing:[]]]

I pretty much just added limit(1) so that only the first item after shuffle is selected. It took a few executions but I was able to see both of the outputs you were looking for with this method. As I suggested on your other question, you might also use sample(1):

gremlin> g.V().has('id',1).as('me').
......1>   out('follow').
......2>   aggregate('followers').
......3>   out('posted').
......4>   group().
......5>     by('postId').
......6>     by(project('likedBySelf','likedByFollowing').
......7>          by(__.in('liked').where(eq('me')).count()).
......8>          by(__.in('liked').where(within('followers')).sample(1).values('id').fold()))
==>[post2:[likedBySelf:1,likedByFollowing:[3]],post1:[likedBySelf:0,likedByFollowing:[]]]
gremlin> g.V().has('id',1).as('me').
......1>   out('follow').
......2>   aggregate('followers').
......3>   out('posted').
......4>   group().
......5>     by('postId').
......6>     by(project('likedBySelf','likedByFollowing').
......7>          by(__.in('liked').where(eq('me')).count()).
......8>          by(__.in('liked').where(within('followers')).sample(1).values('id').fold()))
==>[post2:[likedBySelf:1,likedByFollowing:[4]],post1:[likedBySelf:0,likedByFollowing:[]]]
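For readers who want to check the expected shape of the result outside a graph database, here is a minimal plain-Python sketch of the same selection logic. It is only a model: the dictionaries `follows`, `posted` and `liked_by` and the function `feed` are hypothetical names made up for this illustration, not part of Gremlin or TinkerPop, and the data simply mirrors the sample graph from the question.

```python
import random

# Toy in-memory copy of the sample graph (hypothetical structures, not a Gremlin API).
follows = {1: [2, 3, 4]}                       # user -> users they follow
posted = {2: ["post1", "post2"]}               # user -> posts they created
liked_by = {"post1": [], "post2": [1, 3, 4]}   # post -> users who liked it


def feed(me, rng=random):
    """For each post by someone `me` follows, report whether `me` liked it
    and at most one randomly chosen liker that `me` follows."""
    following = follows.get(me, [])
    result = {}
    for author in following:
        for post in posted.get(author, []):
            likers = liked_by.get(post, [])
            candidates = [u for u in likers if u in following]
            result[post] = {
                "likedBySelf": int(me in likers),
                # Zero or one element, comparable to sample(1)...fold() above.
                "likedByFollowing": rng.sample(candidates, min(1, len(candidates))),
            }
    return result


print(feed(1))
```

Running it repeatedly should show post2 paired with either 3 or 4, and post1 always with an empty list, which matches the two desired outputs in the question.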
Tuesday, June 1, 2021
 
Chvanikoff
answered 6 Months ago
72

Use reservoir sampling. It's a very simple algorithm that works for any N.

Here is one Python implementation, and here is another.
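As a rough illustration of the idea (this is not one of the implementations referred to above), the following is a minimal sketch of the classic "Algorithm R" form of reservoir sampling. The function name `reservoir_sample` and its parameters are chosen for this example only.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Pick up to k items uniformly at random from a stream of unknown length,
    reading the stream once and keeping only k items in memory."""
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


print(reservoir_sample(range(1000), 5))
```

If the stream has fewer than k items, the sketch simply returns all of them.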

Saturday, June 5, 2021
 
Sujith
answered 6 Months ago
84

Use random.sample to choose random non-repeating elements:

>>> import glob
>>> import random
>>> random.sample(glob.glob('*.jpg'), number_of_images_to_choose)

random.sample returns a list object.

Side note: there's no need for a list comprehension, unless you're planning to filter the result of glob.glob.
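One thing to watch for: random.sample raises ValueError when asked for more elements than the population contains. A small sketch of a guard is shown below; the helper name `pick_some` and the example file names are assumptions made for illustration.

```python
import random


def pick_some(paths, k, rng=random):
    """Return up to k distinct entries from paths, in random order."""
    paths = list(paths)
    # Clamp k so random.sample never sees k > len(paths).
    return rng.sample(paths, min(k, len(paths)))


files = ["a.jpg", "b.jpg", "c.jpg"]
print(pick_some(files, 2))   # two distinct names
print(pick_some(files, 10))  # all three names, shuffled
```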

Friday, August 6, 2021
 
jon333
answered 4 Months ago
16

Interesting question. I am on the same track.

First, your question about MLlib. I assume that you mean Apache Spark MLlib, the machine learning (ML) implementation on top of Apache Spark. So my conclusion is: you want to run ML algorithms for purposes such as clustering and classification using the data in your Titan/Cassandra-based graph database. Please note that you could also use graph processing algorithms, like the PageRank mentioned by spidy, to do things like clustering on top of your Titan/Cassandra graph database. In other words: you don't need ML to do clustering when your starting point is a graph database.

Apache Spark MLlib seems to be future-proof and widely supported; its most recent announcements concerned new ML algorithms. Apache Mahout, another Apache ML project, is more mature in terms of the number of supported ML algorithms. Mahout has also adopted Apache Spark as its data storage layer, which is why I mention it in this post. In addition to in-memory computing, Apache Spark offers the MLlib mentioned above for machine learning, Spark SQL (which is like Hive on Spark), GraphX (a graph processing system, as explained by spidy) and Spark Streaming for processing streaming data.

I consider Apache Spark itself a logical data layer, represented as RDDs (Resilient Distributed Datasets) on top of storage layers such as Cassandra, Hadoop/HCatalog and HBase. Apache Spark offers a connector to Cassandra. Note that RDDs are immutable: you cannot modify an RDD in place, only derive new ones by processing and analyzing the data in Spark. Regarding the RDD as a logical storage layer: you could compare an RDD to a view in the good old SQL times. RDDs give you a view on, for example, a table in Cassandra or HBase. Note also that Apache Spark offers an API for three development environments: Scala, Java and Python.

Apache Giraph is also a graph processing toolset, functionally equivalent to Apache Spark GraphX. Apache Giraph uses Hadoop as its data storage layer. You are using Titan/Cassandra, so you will probably face data migration tasks if you select Apache Giraph as your solution. Secondly, you started your post with a question regarding ML using MLlib, and Apache Giraph is not an ML solution.

Your conclusion regarding Giraph and Gremlin is not correct: they are not the same, although both work with graph data. Giraph is a solution for graph processing, as spidy explained. Using Giraph you can execute graph analysis algorithms such as PageRank (e.g. who has the most followers), whilst Gremlin is meant for traversing, i.e. querying the graph database using the complex relationships (edges) between entities (vertices) to obtain result sets of vertex and edge properties.

Wednesday, September 29, 2021
 
Anton Barinov
answered 2 Months ago
28

Gremlin Console has built-in support for this, and it is described in detail here. The basic connection command is:

gremlin> :remote connect tinkerpop.server conf/remote.yaml
==>Configured localhost/127.0.0.1:8182

at which point you can issue traversals against the remote graph:

gremlin> :> g.V().values('name')
==>marko
==>vadas
==>lop
==>josh
==>ripple
==>peter

If you'd like to drop the :> syntax you can put the REPL in "console" mode and that prefix will no longer be necessary:

gremlin> :remote console
==>All scripts will now be sent to Gremlin Server - [localhost/127.0.0.1:8182]-[5ff68eac-5af9-4140-b3b8-d9311f30c053] - type ':remote console' to return to local mode
Friday, October 15, 2021
 
Miguel Ping
answered 2 Months ago