Asked  7 Months ago    Answers:  5   Viewed   35 times

There is a big database, 1,000,000,000 rows, called threads (these threads actually exist, I'm not making things harder just because of I enjoy it). Threads has only a few stuff in it, to make things faster: (int id, string hash, int replycount, int dateline (timestamp), int forumid, string title)


select * from thread where forumid = 100 and replycount > 1 order by dateline desc limit 10000, 100

Since that there are 1G of records it's quite a slow query. So I thought, let's split this 1G of records in as many tables as many forums(category) I have! That is almost perfect. Having many tables I have less record to search around and it's really faster. The query now becomes:

select * from thread_{forum_id} where replycount > 1 order by dateline desc limit 10000, 100

This is really faster with 99% of the forums (category) since that most of those have only a few of topics (100k-1M). However because there are some with about 10M of records, some query are still to slow (0.1/.2 seconds, to much for my app!, I'm already using indexes!).

I don't know how to improve this using MySQL. Is there a way?

For this project I will use 10 Servers (12GB ram, 4x7200rpm hard disk on software raid 10, quad core)

The idea was to simply split the databases among the servers, but with the problem explained above that is still not enought.

If I install cassandra on these 10 servers (by supposing I find the time to make it works as it is supposed to) should I be suppose to have a performance boost?

What should I do? Keep working with MySQL with distributed database on multiple machines or build a cassandra cluster?

I was asked to post what are the indexes, here they are:

mysql> show index in thread;

Select explain:

mysql> explain SELECT * FROM thread WHERE forumid = 655 AND visible = 1 AND open <> 10 ORDER BY dateline ASC LIMIT 268000, 250;
| id | select_type | table  | type | possible_keys | key     | key_len | ref         | rows   | Extra                       |
|  1 | SIMPLE      | thread | ref  | forumid       | forumid | 4       | const,const | 221575 | Using where; Using filesort | 



You should read the following and learn a little bit about the advantages of a well designed innodb table and how best to use clustered indexes - only available with innodb !

then design your system something along the lines of the following simplified example:

Example schema (simplified)

The important features are that the tables use the innodb engine and the primary key for the threads table is no longer a single auto_incrementing key but a composite clustered key based on a combination of forum_id and thread_id. e.g.

threads - primary key (forum_id, thread_id)

forum_id    thread_id
========    =========
1                   1
1                   2
1                   3
1                 ...
1             2058300  
2                   1
2                   2
2                   3
2                  ...
2              2352141

Each forum row includes a counter called next_thread_id (unsigned int) which is maintained by a trigger and increments every time a thread is added to a given forum. This also means we can store 4 billion threads per forum rather than 4 billion threads in total if using a single auto_increment primary key for thread_id.

forum_id    title   next_thread_id
========    =====   ==============
1          forum 1        2058300
2          forum 2        2352141
3          forum 3        2482805
4          forum 4        3740957
64        forum 64       3243097
65        forum 65      15000000 -- ooh a big one
66        forum 66       5038900
67        forum 67       4449764
247      forum 247            0 -- still loading data for half the forums !
248      forum 248            0
249      forum 249            0
250      forum 250            0

The disadvantage of using a composite key is that you can no longer just select a thread by a single key value as follows:

select * from threads where thread_id = y;

you have to do:

select * from threads where forum_id = x and thread_id = y;

However, your application code should be aware of which forum a user is browsing so it's not exactly difficult to implement - store the currently viewed forum_id in a session variable or hidden form field etc...

Here's the simplified schema:

drop table if exists forums;
create table forums
forum_id smallint unsigned not null auto_increment primary key,
title varchar(255) unique not null,
next_thread_id int unsigned not null default 0 -- count of threads in each forum

drop table if exists threads;
create table threads
forum_id smallint unsigned not null,
thread_id int unsigned not null default 0,
reply_count int unsigned not null default 0,
hash char(32) not null,
created_date datetime not null,
primary key (forum_id, thread_id, reply_count) -- composite clustered index

delimiter #

create trigger threads_before_ins_trig before insert on threads
for each row
declare v_id int unsigned default 0;

  select next_thread_id + 1 into v_id from forums where forum_id = new.forum_id;
  set new.thread_id = v_id;
  update forums set next_thread_id = v_id where forum_id = new.forum_id;

delimiter ;

You may have noticed I've included reply_count as part of the primary key which is a bit strange as (forum_id, thread_id) composite is unique in itself. This is just an index optimisation which saves some I/O when queries that use reply_count are executed. Please refer to the 2 links above for further info on this.

Example queries

I'm still loading data into my example tables and so far I have a loaded approx. 500 million rows (half as many as your system). When the load process is complete I should expect to have approx:

250 forums * 5 million threads = 1250 000 000 (1.2 billion rows)

I've deliberately made some of the forums contain more than 5 million threads for example, forum 65 has 15 million threads:

forum_id    title   next_thread_id
========    =====   ==============
65        forum 65      15000000 -- ooh a big one

Query runtimes

select sum(next_thread_id) from forums;

539,155,433 (500 million threads so far and still growing...)

under innodb summing the next_thread_ids to give a total thread count is much faster than the usual:

select count(*) from threads;

How many threads does forum 65 have:

select next_thread_id from forums where forum_id = 65

15,000,000 (15 million)

again this is faster than the usual:

select count(*) from threads where forum_id = 65

Ok now we know we have about 500 million threads so far and forum 65 has 15 million threads - let's see how the schema performs :)

select forum_id, thread_id from threads where forum_id = 65 and reply_count > 64 order by thread_id desc limit 32;

runtime = 0.022 secs

select forum_id, thread_id from threads where forum_id = 65 and reply_count > 1 order by thread_id desc limit 10000, 100;

runtime = 0.027 secs

Looks pretty performant to me - so that's a single table with 500+ million rows (and growing) with a query that covers 15 million rows in 0.02 seconds (while under load !)

Further optimisations

These would include:

  • partitioning by range

  • sharding

  • throwing money and hardware at it


hope you find this answer helpful :)

Tuesday, June 1, 2021
answered 7 Months ago

Many, many, many people use a client in front of Thrift. For PHP, the most common is PHPCassa

  • Which PHP client library to use with Cassandra?
  • Cassandra via PHP
Saturday, May 29, 2021
answered 7 Months ago

Update: The OP clarified his original question to indicate that the project in question is dead.

Given that a project is dead and that it has N forks, the usual strategy for picking the "best fork" depends on a number of factors:

  • Has anyone announced that they're taking over maintenance of the project? Search for news on the mailing lists, public forums, et cetera (not just on GitHub). If so, start following that fork.

  • Are there forks on the network graph that have commits coming from another fork (rather than, say, master)? If so, that indicates support is coalescing around this other fork as a possible replacement for the dead project.

  • If either of these don't work, do some more digging: Are there variants of the project that split off some time ago and around which support is now organizing? If so, maybe that other project will suit your needs.

Failing all that, sometimes projects just don't have the support to keep going. If so, perhaps it's time to scratch your own itch!

Original answer:

A few comments on this:

  • In general, you should use released versions of a project, rather than choosing a particular fork or branch. If you don't, your dependencies will change every time the project updates with a new commit, which could lead to difficult-to-find bugs in your own code.

  • If you decide you always want the latest version and you're okay with the risks of that, then use the project's master branch, because that's the official one. That'll be the default branch you're looking at when you visit the repository's main page.

  • People usually fork repositories to make and solicit feedback on some kind of modification to the original codebase.

  • So, unless you're collaborating on that modification, or unless you have a particular interest in getting early access to that specific modification, you don't need to worry about the other forks.

On rare occasions, a project's official mainline development branch will be called something other than master. In that case you should consult the README to see which branch you should look at.

Monday, October 11, 2021
answered 2 Months ago

I see 3 issues with your approach here which I will address below:

  • super column families,
  • thrift vs cql3,
  • json data as cell values.

Before you go ahead: the use super column families is discouraged. Read more here. Composite keys (as described below) are the way to go.

Also, you might need to read up on CQL3, since thrift is a legacy API since 1.2.

Instead of storing json data, you may make use of native collection data types like lists, and maps etc. If you still want to work with JSON, there is improved JSON support in in Cassandra since version 2.2.

In general, it is pretty straightforward to query per device and per timeperiod:

  • you row key would be the device id and the column key a timeuuid
  • To avoid hot spots, you could add "bucket" counters to the row key (create a composite row/partition key) to rotate the nodes
  • You can then query for time ranges if you know the row/device id.

Alternatively you could use your signal type as a row key (and timeuuid/timestamp as a column key) if you want to query data for multiple devices (but one event type) at once. Read more on timeseries data in cassandra in this blog entry.

Hope that helps!

Saturday, October 23, 2021
answered 1 Month ago

You can't truncate a table that is locked for writing. This is because "truncate" means "destroy the table, and recreate a new one with the same schema."

You can however, empty the table. Instead of TRUNCATE TABLE asin_one_time_only use DELETE FROM asin_one_time_only. Note that this will not reset the autoincrement numbering. If you want to reset it as well, use ALTER TABLE asin_one_time_only auto_increment=1

I suggest doing this:

LOCK TABLES asin_one_time_only READ;
SELECT asin FROM asin_one_time_only;
-- minimize the possibility of someone writing to the table in-between
-- an "UNLOCK TABLES" and a "LOCK TABLES" by just issuing a new LOCK TABLES
-- I am not 100% sure that MySQL will do this atomically, so there is a
-- possibility that you may delete a row that was not read.
-- If this is unacceptable, then use a "LOCK TABLES asin_one_time_only WRITE"
-- from the very beginning.
LOCK TABLES asin_one_time_only WRITE;
DELETE FROM asin_one_time_only;
ALTER TABLE asin_one_time_only auto_increment=1;
Saturday, November 6, 2021
answered 4 Weeks ago
Only authorized users can answer the question. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :