The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
Cassandra is a partitioned row store. Rows are organized into tables with a required primary key. Partitioning means that Cassandra can distribute your data across multiple machines in an application-transparent matter. Cassandra will automatically re-partition as machines are added/removed from the cluster. Row store means that like relational databases, Cassandra organizes data by rows and columns. The Cassandra Query Language (CQL) is a close relative of SQL.
Cassandra installation and service initialization on Fedora box:
$ sudo dnf install cassandra cassandra-server # install client/server
$ sudo systemctl start cassandra # initialize Cassandra server
To enable Cassandra to automatically start at boot time, run:
$ sudo systemctl enable cassandra
Basic command-line usage:
$ cqlsh # run client tool
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.1 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh> CREATE KEYSPACE k1 WITH REPLICATION = { 'class' : 'SimpleStrategy' , 'replication_factor' : 1 };
cqlsh> USE k1;
cqlsh:k1> CREATE TABLE users (user_name varchar, password varchar, gender varchar, PRIMARY KEY (user_name));
cqlsh:k1> INSERT INTO users (user_name, password, gender) VALUES ('John', 'test123', 'male');
cqlsh:k1> SELECT * from users;
user_name | gender | password
-----------+--------+----------
John | male | test123
(1 rows)
Server configuration is done by editing the file /etc/cassandra/cassandra.yaml
. For more
informations how to change configuration, see the the upstream documentation.
By default, authentication is disabled and to enable it you have to do the following steps:
PasswordAuthenticator
:authenticator: PasswordAuthenticator
By default, the authenticator option is set to AllowAllAuthenticator
.
Restart Cassandra.
Start cqlsh
using the default superuser name and password:
$ cqlsh -u cassandra -p cassandra
cqlsh> CREATE ROLE <new_super_user> WITH PASSWORD = '<some_secure_password>'
AND SUPERUSER = true
AND LOGIN = true;
$ cqlsh -u <new_super_user> -p <some_secure_password>
cqlsh> ALTER ROLE cassandra WITH PASSWORD='SomeNonsenseThatNoOneWillThinkOf'
AND SUPERUSER=false;
Edit the /etc/cassandra/cassandra.yaml
file, changing the following parameters:
listen_address: external_ip
rpc_address: external_ip
seed_provider/seeds: "<external_ip>"
Setting the name of the cluster. This needs to be consistent for all nodes that you intend to have in this cluster:
cluster_name: 'Test Cluster'
Set the directory where cassandra will write too, below is the default that will be used if unset. If possible set this to a disk used only for storing cassandra data:
data_file_directories:
- /var/lib/cassandra/data
Set which type of disk cassandra is using to store data on ssd or spinning:
disk_optimization_strategy: ssd|spinning
In order to run Cassandra in production environment, you should pay extra attention to setting up the service in order to minimize the risk of being exploited. Among other things, this means:
By default, Cassandra cannot be accessed from another computer. In order to allow access from another computer, we need to do the following things.
Allow to accept connections on port 9042:
$ sudo firewall-cmd --permanent --zone=public --add-port=9042/tcp
Follow the previous instructions to “Enabling remote access to server”
The package logs to /var/log/cassandra/system.log
by default.
The nodetool utility is a command-line interface for monitoring Cassandra and performing routine database operations.
$ sudo nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns (effective) Host ID Rack
UN 127.0.0.1 143.43 KiB 256 100.0% 94becf56-8a2a-4935-bdf0-79e24694d207 rack1
One of the main features of Apache Cassandra is its ability to run in a multi-node setup, providing the following benefits:
Fault tolerance: Data is automatically replicated to multiple nodes for fault-tolerance. Replication across multiple data centers is supported. Failed nodes can be replaced with no downtime.
Decentralization: There are no single points of failure. There are no network bottlenecks. Every node in the cluster is identical.
Scalability & Elasticity: Capability to run with dozens of thousands of nodes with petabytes of data. Read and write throughput both increase linearly as new machines are added, with no downtime or interruption to applications.
If the server was started before or it’s already running, all the existing data must be deleted,
the reason for this is because all nodes must have the same cluster_name
and it’s better to
choose a different one from the default Test cluster
name.
The following commands must be executed on each node:
$ sudo systemctl stop cassandra
$ sudo rm -rf /var/lib/cassandra/data/system/*
The Cassandra main configuration file /etc/cassandra/cassandra.yaml
needs to be edited to setup the
cluster. The parameters that need to be modified are:
cluster_name
: Name of your cluster.
num_tokens
: Number of virtual nodes within a Cassandra instance. This is used to partition the data
and spread the data throughout the cluster. The recommended value is 256.
seeds
: Comma-delimited list of the IP address of each node in the cluster.
listen_address
: The IP address or hostname that Cassandra binds to for connecting to other Cassandra
nodes. It defaults to localhost and needs to be changed to the IP address of the node.
rpc_address
: The listen address for client connections (CQL protocol).
endpoint_snitch
: Set to a class that implements the IEndpointSnitch. Cassandra uses snitches for
locating nodes and routing requests. The default one is SimpleSnitch
but we will change to
GossipingPropertyFileSnitch
which is more suitable for production environments:
SimpleSnitch
: Used for single-datacenter deployments or single-zone in public clouds. Does not
recognize datacenter or rack information. It treats strategy order as proximity, which can improve cache
locality when disabling read repair.
GossipingPropertyFileSnitch
: Recommended for production. The rack and datacenter for the local
node are defined in the cassandra-rackdc.properties file and propagated to other nodes via gossip.
auto_bootstrap
: This parameter is not present in the configuration file, so it has to be added and
set to false. It makes new (non-seed) nodes automatically migrate the right data to themselves.
Configuration files for a 2-node cluster:
Node 1:
cluster_name: 'My Cluster'
num_tokens: 256
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
- seeds: 10.0.0.1, 10.0.0.2
listen_address: 10.0.0.1
rpc_address: 10.0.0.1
endpoint_snitch: GossipingPropertyFileSnitch
auto_bootstrap: false
Node 2:
cluster_name: 'My Cluster'
num_tokens: 256
seed_provider:
- class_name: org.apache.cassandra.locator.SimpleSeedProvider
- seeds: 10.0.0.1, 10.0.0.2
listen_address: 10.0.0.2
rpc_address: 10.0.0.2
endpoint_snitch: GossipingPropertyFileSnitch
auto_bootstrap: false
The final step is to start each instance of the cluster, the seed instances must be started first, then the remaing nodes.
$ sudo systemctl start cassandra
The cluster status can be checked with nodetool utility:
$ sudo nodetool status
Datacenter: datacenter1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address Load Tokens Owns Host ID Rack
UN 10.0.0.2 147.48 KB 256 ? f50799ee-8589-4eb8-a0c8-241cd254e424 rack1
UN 10.0.0.1 139.04 KB 256 ? 54b16af1-ad0a-4288-b34e-cacab39caeec rack1
Note: Non-system keyspaces don't have the same replication settings, effective ownership information is meaningless
Cassandra Datastax documentation
The Cassandra Query Language (CQL)
Recommended Production Settings For Apache Cassandra
There is a Cassandra image available for OpenShift.
Authors: Augusto Mecking Caringi