Asked  7 Months ago    Answers:  5   Viewed   51 times

Suppose I have pandas DataFrame like this:

>>> df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4],'value':[1,2,3,1,2,3,4,1,1]})
>>> df
   id  value
0   1      1
1   1      2
2   1      3
3   2      1
4   2      2
5   2      3
6   2      4
7   3      1
8   4      1

I want to get a new DataFrame with top 2 records for each id, like this:

   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

I can do it with numbering records within group after group by:

>>> dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
>>> dfN
   id  level_1  index  value
0   1        0      0      1
1   1        1      1      2
2   1        2      2      3
3   2        0      3      1
4   2        1      4      2
5   2        2      5      3
6   2        3      6      4
7   3        0      7      1
8   4        0      8      1
>>> dfN[dfN['level_1'] <= 1][['id', 'value']]
   id  value
0   1      1
1   1      2
3   2      1
4   2      2
7   3      1
8   4      1

But is there more effective/elegant approach to do this? And also is there more elegant approach to number records within each group (like SQL window function row_number()).

 Answers

100

Did you try df.groupby('id').head(2)

Ouput generated:

>>> df.groupby('id').head(2)
       id  value
id             
1  0   1      1
   1   1      2 
2  3   2      1
   4   2      2
3  7   3      1
4  8   4      1

(Keep in mind that you might need to order/sort before, depending on your data)

EDIT: As mentioned by the questioner, use df.groupby('id').head(2).reset_index(drop=True) to remove the multindex and flatten the results.

>>> df.groupby('id').head(2).reset_index(drop=True)
    id  value
0   1      1
1   1      2
2   2      1
3   2      2
4   3      1
5   4      1
Tuesday, June 1, 2021
 
juanrpozo
answered 7 Months ago
50

Here is one way to do this, using UNION ALL (See SQL Fiddle with Demo). This works with two groups, if you have more than two groups, then you would need to specify the group number and add queries for each group:

(
  select *
  from mytable 
  where `group` = 1
  order by age desc
  LIMIT 2
)
UNION ALL
(
  select *
  from mytable 
  where `group` = 2
  order by age desc
  LIMIT 2
)

There are a variety of ways to do this, see this article to determine the best route for your situation:

http://www.xaprb.com/blog/2006/12/07/how-to-select-the-firstleastmax-row-per-group-in-sql/

Edit:

This might work for you too, it generates a row number for each record. Using an example from the link above this will return only those records with a row number of less than or equal to 2:

select person, `group`, age
from 
(
   select person, `group`, age,
      (@num:=if(@group = `group`, @num +1, if(@group := `group`, 1, 1))) row_number 
  from test t
  CROSS JOIN (select @num:=0, @group:=null) c
  order by `Group`, Age desc, person
) as x 
where x.row_number <= 2;

See Demo

Tuesday, June 1, 2021
 
VostanAzatyan
answered 7 Months ago
49

Assuming you want a single row per thread and not all rows for all posts.

DISTINCT ON is still the most convenient tool. But the leading ORDER BY items have to match the expressions of the DISTINCT ON clause. If you want to order the result some other way, you need to wrap it into a subquery and add another ORDER BY to the outer query:

SELECT *
FROM  (
   SELECT DISTINCT ON (t.id)
          t.id, u.username, p.updated_at, t.title
   FROM   forum_threads      t
   LEFT   JOIN forum_posts   p ON p.thread_id = t.id
   LEFT   JOIN users         u ON u.id = p.user_id
   WHERE  t.forum_id = 3
   ORDER  BY t.id, p.updated_at DESC
   ) sub
ORDER  BY updated_at DESC;

If you are looking for a query without subquery for some unknown reason, this should work, too:

SELECT DISTINCT
       t.id
     , first_value(u.username)   OVER w AS username
     , first_value(p.updated_at) OVER w AS updated_at
     , t.title
FROM   forum_threads      t
LEFT   JOIN forum_posts   p ON p.thread_id = t.id
LEFT   JOIN users         u ON u.id = p.user_id
WHERE  t.forum_id = 3
WINDOW w AS (PARTITION BY t.id ORDER BY p.updated_at DESC)
ORDER  BY updated_at DESC;

There is quite a bit going on here:

  1. The tables are joined and rows are selected according to JOIN and WHERE clauses.

  2. The two instances of the window function first_value() are run (on the same window definition) to retrieve username and updated_at from the latest post per thread. This results in as many identical rows as there are posts in the thread.

  3. The DISTINCT step is executed after the window functions and reduces each set to a single instance.

  4. ORDER BY is applied last and updated_at references the OUT column (SELECT list), not one of the two IN columns (FROM list) of the same name.

Yet another variant, a subquery with the window function row_number():

SELECT id, username, updated_at, title
FROM  (
   SELECT t.id
        , u.username
        , p.updated_at
        , t.title
        , row_number() OVER (PARTITION BY t.id
                             ORDER BY p.updated_at DESC) AS rn
   FROM   forum_threads      t
   LEFT   JOIN forum_posts   p ON p.thread_id = t.id
   LEFT   JOIN users         u ON u.id = p.user_id
   WHERE  t.forum_id = 3
   ) sub
WHERE  rn = 1
ORDER  BY updated_at DESC;

Similar case:

  • Return records distinct on one column but order by another column

You'll have to test which is faster. Depends on a couple of circumstances.

Saturday, August 7, 2021
 
bux
answered 4 Months ago
bux
12

Try

MATCH (o:Order)-[r:ADDED]->(a:Article)
WITH o, r, a
ORDER BY o.oid, r.t
WITH o, COLLECT(a)[..2] AS topArticlesByOrder UNWIND topArticlesByOrder AS a
RETURN a.aid AS articleId, COUNT(*) AS count

Results look like

articleId    count
   8           6
   2           2
   4           5
   7           2
   3           3
   6           5
   0           7

on this sample graph created with

FOREACH(opar IN RANGE(1,15) |
    MERGE (o:Order {oid:opar})
    FOREACH(apar IN RANGE(1,5) |
        MERGE (a:Article {aid:TOINT(RAND()*10)})
        CREATE o-[:ADDED {t:timestamp() - TOINT(RAND()*1000)}]->a
    )
)
Wednesday, September 29, 2021
 
Camille Laborde
answered 2 Months ago
23

This can be achieved by embedding your existing query inside a CTE using WITH, then using it in a WHERE ... IN subquery. The subquery selects LIMIT 2 product IDs from the CTE that match category_name, city, and gender, ordered by product count.

WITH order_groups AS (
  SELECT
    category_name,
    city, gender,
    product_id,
    COUNT(product_id) AS product_count
  FROM orders OO
  LEFT JOIN products ON product_id = products.id
  LEFT JOIN clients ON client_id = clients.id
  GROUP BY product_id, category_name, city, gender
  ORDER BY category_name, city, gender, COUNT(product_id) DESC
)
SELECT * FROM order_groups OG_outer
WHERE OG_outer.product_id IN (
  SELECT product_id
  FROM order_groups OG_inner
  WHERE 
    OG_outer.category_name = OG_inner.category_name AND
    OG_outer.city = OG_inner.city AND
    OG_outer.gender = OG_inner.gender
  ORDER BY OG_inner.product_count DESC LIMIT 2
)
ORDER BY category_name, city, gender, product_count DESC

This outputs the following rows as requested:

Furniture|London|Female|4|2
Furniture|London|Female|3|1
Furniture|London|Male|4|3
Furniture|London|Male|3|2
Furniture|NY|Female|5|2
Furniture|NY|Female|4|1
Furniture|NY|Male|3|3
Furniture|NY|Male|4|1
Kitchen|London|Female|9|2
Kitchen|London|Female|8|1
Kitchen|London|Male|9|3
Kitchen|London|Male|8|1
Kitchen|NY|Female|9|4
Kitchen|NY|Female|10|2
Kitchen|NY|Male|1|1
Kitchen|NY|Male|8|1
Sport|London|Female|7|2
Sport|London|Female|2|1
Sport|London|Male|7|2
Sport|London|Male|6|1
Sport|NY|Female|2|2
Sport|NY|Female|6|2
Sport|NY|Male|7|3
Sunday, November 7, 2021
 
dnoeth
answered 3 Weeks ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :  
Share